* [PATCH 1/3] drm/xe: Add a function to zap page table by address range
@ 2025-01-28 22:21 Oak Zeng
2025-01-28 22:21 ` [PATCH 2/3] drm/xe: Clear scratch page before vm_bind Oak Zeng
` (7 more replies)
0 siblings, 8 replies; 32+ messages in thread
From: Oak Zeng @ 2025-01-28 22:21 UTC (permalink / raw)
To: intel-xe; +Cc: joonas.lahtinen, Thomas.Hellstrom
Add a function xe_pt_zap_range(). It is similar to xe_pt_zap_ptes()
but is used when we don't have a vma to work with, for example when
zapping a range that is mapped to the scratch page and thus has no vma.
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
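For context, the new helper is intended to be called per tile over an
arbitrary virtual address range, roughly as sketched below. This is only an
illustration: the helper name is made up, and the per-tile loop mirrors what
patch 2 of this series adds; it is not part of this patch.

static void clear_range_all_tiles(struct xe_device *xe, struct xe_vm *vm,
				  u64 start, u64 end)
{
	struct xe_tile *tile;
	u8 id;

	/* Zero any existing PTEs (e.g. scratch mappings) covering [start, end) */
	for_each_tile(tile, xe, id)
		xe_pt_zap_range(tile, vm, start, end);
}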
drivers/gpu/drm/xe/xe_pt.c | 28 ++++++++++++++++++++++++++++
drivers/gpu/drm/xe/xe_pt.h | 2 ++
2 files changed, 30 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 1ddcc7e79a93..2363260da6a6 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -792,6 +792,34 @@ static const struct xe_pt_walk_ops xe_pt_zap_ptes_ops = {
.pt_entry = xe_pt_zap_ptes_entry,
};
+/**
+ * xe_pt_zap_range() - Zap (zero) gpu ptes of a virtual address range
+ * @tile: The tile we're zapping for.
+ * @vm: The vm we're zapping for.
+ * @start: Start of the virtual address range, inclusive.
+ * @end: End of the virtual address range, exclusive.
+ *
+ * This is similar to xe_pt_zap_ptes() but is used when we don't have a
+ * vma to work with, for example when clearing the scratch page mapping
+ * during vm_bind.
+ *
+ */
+void xe_pt_zap_range(struct xe_tile *tile, struct xe_vm *vm, u64 start, u64 end)
+{
+ struct xe_pt_zap_ptes_walk xe_walk = {
+ .base = {
+ .ops = &xe_pt_zap_ptes_ops,
+ .shifts = xe_normal_pt_shifts,
+ .max_level = XE_PT_HIGHEST_LEVEL,
+ },
+ .tile = tile,
+ };
+ struct xe_pt *pt = vm->pt_root[tile->id];
+
+ (void)xe_pt_walk_shared(&pt->base, pt->level, start,
+ end, &xe_walk.base);
+}
+
/**
* xe_pt_zap_ptes() - Zap (zero) gpu ptes of an address range
* @tile: The tile we're zapping for.
diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
index 9ab386431cad..b166b324f455 100644
--- a/drivers/gpu/drm/xe/xe_pt.h
+++ b/drivers/gpu/drm/xe/xe_pt.h
@@ -43,4 +43,6 @@ void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops);
bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
+void xe_pt_zap_range(struct xe_tile *tile, struct xe_vm *vm, u64 start, u64 end);
+
#endif
--
2.26.3
^ permalink raw reply related [flat|nested] 32+ messages in thread* [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-01-28 22:21 [PATCH 1/3] drm/xe: Add a function to zap page table by address range Oak Zeng @ 2025-01-28 22:21 ` Oak Zeng 2025-01-28 23:05 ` Cavitt, Jonathan 2025-01-28 23:19 ` Matthew Brost 2025-01-28 22:21 ` [PATCH 3/3] drm/xe: Allow scratch page under fault mode for certain platform Oak Zeng ` (6 subsequent siblings) 7 siblings, 2 replies; 32+ messages in thread From: Oak Zeng @ 2025-01-28 22:21 UTC (permalink / raw) To: intel-xe; +Cc: joonas.lahtinen, Thomas.Hellstrom When a vm runs under fault mode, if scratch page is enabled, we need to clear the scratch page mapping before vm_bind for the vm_bind address range. Under fault mode, we depend on recoverable page fault to establish mapping in page table. If scratch page is not cleared, GPU access of address won't cause page fault because it always hits the existing scratch page mapping. When vm_bind with IMMEDIATE flag, there is no need of clearing as immediate bind can overwrite the scratch page mapping. So far only is xe2 and xe3 products are allowed to enable scratch page under fault mode. On other platform we don't allow scratch page under fault mode, so no need of such clearing. Signed-off-by: Oak Zeng <oak.zeng@intel.com> --- drivers/gpu/drm/xe/xe_vm.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c index 690330352d4c..196d347c6ac0 100644 --- a/drivers/gpu/drm/xe/xe_vm.c +++ b/drivers/gpu/drm/xe/xe_vm.c @@ -38,6 +38,7 @@ #include "xe_trace_bo.h" #include "xe_wa.h" #include "xe_hmm.h" +#include "i915_drv.h" static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm) { @@ -2917,6 +2918,34 @@ static int xe_vm_bind_ioctl_validate_bo(struct xe_device *xe, struct xe_bo *bo, return 0; } +static bool __xe_vm_needs_clear_scratch_pages(struct xe_device *xe, + struct xe_vm *vm, u32 bind_flags) +{ + if (!xe_vm_in_fault_mode(vm)) + return false; + + if (!xe_vm_has_scratch(vm)) + return false; + + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) + return false; + + if (!(IS_LUNARLAKE(xe) || IS_BATTLEMAGE(xe) || IS_PANTHERLAKE(xe))) + return false; + + return true; +} + +static void __xe_vm_clear_scratch_pages(struct xe_device *xe, struct xe_vm *vm, + u64 start, u64 end) +{ + struct xe_tile *tile; + u8 id; + + for_each_tile(tile, xe, id) + xe_pt_zap_range(tile, vm, start, end); +} + int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file) { struct xe_device *xe = to_xe_device(dev); @@ -3062,6 +3091,9 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file) u32 prefetch_region = bind_ops[i].prefetch_mem_region_instance; u16 pat_index = bind_ops[i].pat_index; + if (__xe_vm_needs_clear_scratch_pages(xe, vm, flags)) + __xe_vm_clear_scratch_pages(xe, vm, addr, addr + range); + ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset, addr, range, op, flags, prefetch_region, pat_index); -- 2.26.3 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* RE: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-01-28 22:21 ` [PATCH 2/3] drm/xe: Clear scratch page before vm_bind Oak Zeng @ 2025-01-28 23:05 ` Cavitt, Jonathan 2025-01-28 23:19 ` Matthew Brost 1 sibling, 0 replies; 32+ messages in thread From: Cavitt, Jonathan @ 2025-01-28 23:05 UTC (permalink / raw) To: Zeng, Oak, intel-xe@lists.freedesktop.org Cc: joonas.lahtinen@linux.intel.com, Thomas.Hellstrom@linux.intel.com, Cavitt, Jonathan -----Original Message----- From: Intel-xe <intel-xe-bounces@lists.freedesktop.org> On Behalf Of Oak Zeng Sent: Tuesday, January 28, 2025 2:22 PM To: intel-xe@lists.freedesktop.org Cc: joonas.lahtinen@linux.intel.com; Thomas.Hellstrom@linux.intel.com Subject: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind > > When a vm runs under fault mode, if scratch page is enabled, we need > to clear the scratch page mapping before vm_bind for the vm_bind > address range. Under fault mode, we depend on recoverable page fault > to establish mapping in page table. If scratch page is not cleared, > GPU access of address won't cause page fault because it always hits > the existing scratch page mapping. > > When vm_bind with IMMEDIATE flag, there is no need of clearing as > immediate bind can overwrite the scratch page mapping. > > So far only is xe2 and xe3 products are allowed to enable scratch page > under fault mode. On other platform we don't allow scratch page under > fault mode, so no need of such clearing. > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > --- > drivers/gpu/drm/xe/xe_vm.c | 32 ++++++++++++++++++++++++++++++++ > 1 file changed, 32 insertions(+) > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c > index 690330352d4c..196d347c6ac0 100644 > --- a/drivers/gpu/drm/xe/xe_vm.c > +++ b/drivers/gpu/drm/xe/xe_vm.c > @@ -38,6 +38,7 @@ > #include "xe_trace_bo.h" > #include "xe_wa.h" > #include "xe_hmm.h" > +#include "i915_drv.h" > > static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm) > { > @@ -2917,6 +2918,34 @@ static int xe_vm_bind_ioctl_validate_bo(struct xe_device *xe, struct xe_bo *bo, > return 0; > } > > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_device *xe, > + struct xe_vm *vm, u32 bind_flags) > +{ > + if (!xe_vm_in_fault_mode(vm)) > + return false; > + > + if (!xe_vm_has_scratch(vm)) > + return false; > + > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > + return false; > + > + if (!(IS_LUNARLAKE(xe) || IS_BATTLEMAGE(xe) || IS_PANTHERLAKE(xe))) We should probably break this platform check out into its own function or macro, especially since it's being mirrored in the next patch during xe_vm_create_ioctl. It'd also be easier to extend that way if future platforms also need this functionality. 
But other than that: Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> -Jonathan Cavitt > + return false; > + > + return true; > +} > + > +static void __xe_vm_clear_scratch_pages(struct xe_device *xe, struct xe_vm *vm, > + u64 start, u64 end) > +{ > + struct xe_tile *tile; > + u8 id; > + > + for_each_tile(tile, xe, id) > + xe_pt_zap_range(tile, vm, start, end); > +} > + > int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file) > { > struct xe_device *xe = to_xe_device(dev); > @@ -3062,6 +3091,9 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file) > u32 prefetch_region = bind_ops[i].prefetch_mem_region_instance; > u16 pat_index = bind_ops[i].pat_index; > > + if (__xe_vm_needs_clear_scratch_pages(xe, vm, flags)) > + __xe_vm_clear_scratch_pages(xe, vm, addr, addr + range); > + > ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset, > addr, range, op, flags, > prefetch_region, pat_index); > -- > 2.26.3 > > ^ permalink raw reply [flat|nested] 32+ messages in thread
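For illustration, the shared helper Jonathan suggests could look roughly like
the sketch below; the function name is hypothetical, while the platform macros
are the ones this series already uses.

/* Hypothetical helper centralizing the platform check used in patches 2 and 3 */
static bool xe_vm_scratch_fault_mode_supported(struct xe_device *xe)
{
	return IS_LUNARLAKE(xe) || IS_BATTLEMAGE(xe) || IS_PANTHERLAKE(xe);
}

Both __xe_vm_needs_clear_scratch_pages() and the xe_vm_create_ioctl() check in
patch 3 could then call this one helper, and future platforms would only need
to be added in one place.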
* Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-01-28 22:21 ` [PATCH 2/3] drm/xe: Clear scratch page before vm_bind Oak Zeng 2025-01-28 23:05 ` Cavitt, Jonathan @ 2025-01-28 23:19 ` Matthew Brost 2025-01-29 5:04 ` Matthew Brost 2025-01-29 21:01 ` Zeng, Oak 1 sibling, 2 replies; 32+ messages in thread From: Matthew Brost @ 2025-01-28 23:19 UTC (permalink / raw) To: Oak Zeng; +Cc: intel-xe, joonas.lahtinen, Thomas.Hellstrom On Tue, Jan 28, 2025 at 05:21:44PM -0500, Oak Zeng wrote: > When a vm runs under fault mode, if scratch page is enabled, we need > to clear the scratch page mapping before vm_bind for the vm_bind > address range. Under fault mode, we depend on recoverable page fault > to establish mapping in page table. If scratch page is not cleared, > GPU access of address won't cause page fault because it always hits > the existing scratch page mapping. > > When vm_bind with IMMEDIATE flag, there is no need of clearing as > immediate bind can overwrite the scratch page mapping. > > So far only is xe2 and xe3 products are allowed to enable scratch page > under fault mode. On other platform we don't allow scratch page under > fault mode, so no need of such clearing. > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > --- > drivers/gpu/drm/xe/xe_vm.c | 32 ++++++++++++++++++++++++++++++++ > 1 file changed, 32 insertions(+) > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c > index 690330352d4c..196d347c6ac0 100644 > --- a/drivers/gpu/drm/xe/xe_vm.c > +++ b/drivers/gpu/drm/xe/xe_vm.c > @@ -38,6 +38,7 @@ > #include "xe_trace_bo.h" > #include "xe_wa.h" > #include "xe_hmm.h" > +#include "i915_drv.h" > > static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm) > { > @@ -2917,6 +2918,34 @@ static int xe_vm_bind_ioctl_validate_bo(struct xe_device *xe, struct xe_bo *bo, > return 0; > } > > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_device *xe, > + struct xe_vm *vm, u32 bind_flags) > +{ > + if (!xe_vm_in_fault_mode(vm)) > + return false; > + > + if (!xe_vm_has_scratch(vm)) > + return false; > + > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > + return false; > + > + if (!(IS_LUNARLAKE(xe) || IS_BATTLEMAGE(xe) || IS_PANTHERLAKE(xe))) > + return false; > + > + return true; > +} > + > +static void __xe_vm_clear_scratch_pages(struct xe_device *xe, struct xe_vm *vm, > + u64 start, u64 end) > +{ > + struct xe_tile *tile; > + u8 id; > + > + for_each_tile(tile, xe, id) > + xe_pt_zap_range(tile, vm, start, end); > +} > + > int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file) > { > struct xe_device *xe = to_xe_device(dev); > @@ -3062,6 +3091,9 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file) > u32 prefetch_region = bind_ops[i].prefetch_mem_region_instance; > u16 pat_index = bind_ops[i].pat_index; > > + if (__xe_vm_needs_clear_scratch_pages(xe, vm, flags)) > + __xe_vm_clear_scratch_pages(xe, vm, addr, addr + range); A few things... - I believe this is only needed for bind user operations or internal MAP GPU VMA operations. - I believe a TLB invalidation will be required. - I don't think calling zap PTEs range works here, given how the scratch tables are set up (i.e., new PTEs need to be created pointing to an invalid state). - This series appears to be untested based on the points above. Therefore, instead of this series, I believe you will need to fully update the bind pipeline to process MAP GPU VMA operations here. So roughly... 
- Maybe include a bit in xe_vma_op_map that specifies "invalidate on bind," set in vm_bind_ioctl_ops_create, since this will need to be wired throughout the bind pipeline. - Don't validate backing memory in this case. - Ensure that xe_vma_ops_incr_pt_update_ops is called in this case for MAP operations, forcing entry into the xe_pt.c backend. - Update xe_pt_stage_bind_walk with a variable that indicates clearing the PTE. Instead of calling pte_encode_vma in xe_pt_stage_bind_entry, set this variable for PT bind operations derived from MAP operations that meet the "invalidate on bind" condition. - Ensure needs_invalidation is set in struct xe_vm_pgtable_update_ops if a MAP operation is included that meets the "invalidate on bind" condition. - Set the VMA tile_invalidated in addition to tile_present for MAP operations that meet the "invalidate on bind" condition. I might be missing some implementation details mentioned above, but this should provide you with some direction. Lastly, and perhaps most importantly, please test this using an IGT and include the results in the next post. Matt > + > ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset, > addr, range, op, flags, > prefetch_region, pat_index); > -- > 2.26.3 > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-01-28 23:19 ` Matthew Brost @ 2025-01-29 5:04 ` Matthew Brost 2025-01-29 8:21 ` Thomas Hellström 2025-01-29 21:01 ` Zeng, Oak 1 sibling, 1 reply; 32+ messages in thread From: Matthew Brost @ 2025-01-29 5:04 UTC (permalink / raw) To: Oak Zeng; +Cc: intel-xe, joonas.lahtinen, Thomas.Hellstrom On Tue, Jan 28, 2025 at 03:19:13PM -0800, Matthew Brost wrote: > On Tue, Jan 28, 2025 at 05:21:44PM -0500, Oak Zeng wrote: > > When a vm runs under fault mode, if scratch page is enabled, we need > > to clear the scratch page mapping before vm_bind for the vm_bind > > address range. Under fault mode, we depend on recoverable page fault > > to establish mapping in page table. If scratch page is not cleared, > > GPU access of address won't cause page fault because it always hits > > the existing scratch page mapping. > > > > When vm_bind with IMMEDIATE flag, there is no need of clearing as > > immediate bind can overwrite the scratch page mapping. > > > > So far only is xe2 and xe3 products are allowed to enable scratch page > > under fault mode. On other platform we don't allow scratch page under > > fault mode, so no need of such clearing. > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > --- > > drivers/gpu/drm/xe/xe_vm.c | 32 ++++++++++++++++++++++++++++++++ > > 1 file changed, 32 insertions(+) > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c > > index 690330352d4c..196d347c6ac0 100644 > > --- a/drivers/gpu/drm/xe/xe_vm.c > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > @@ -38,6 +38,7 @@ > > #include "xe_trace_bo.h" > > #include "xe_wa.h" > > #include "xe_hmm.h" > > +#include "i915_drv.h" > > > > static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm) > > { > > @@ -2917,6 +2918,34 @@ static int xe_vm_bind_ioctl_validate_bo(struct xe_device *xe, struct xe_bo *bo, > > return 0; > > } > > > > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_device *xe, > > + struct xe_vm *vm, u32 bind_flags) > > +{ > > + if (!xe_vm_in_fault_mode(vm)) > > + return false; > > + > > + if (!xe_vm_has_scratch(vm)) > > + return false; > > + > > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > > + return false; > > + > > + if (!(IS_LUNARLAKE(xe) || IS_BATTLEMAGE(xe) || IS_PANTHERLAKE(xe))) > > + return false; > > + > > + return true; > > +} > > + > > +static void __xe_vm_clear_scratch_pages(struct xe_device *xe, struct xe_vm *vm, > > + u64 start, u64 end) > > +{ > > + struct xe_tile *tile; > > + u8 id; > > + > > + for_each_tile(tile, xe, id) > > + xe_pt_zap_range(tile, vm, start, end); > > +} > > + > > int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file) > > { > > struct xe_device *xe = to_xe_device(dev); > > @@ -3062,6 +3091,9 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file) > > u32 prefetch_region = bind_ops[i].prefetch_mem_region_instance; > > u16 pat_index = bind_ops[i].pat_index; > > > > + if (__xe_vm_needs_clear_scratch_pages(xe, vm, flags)) > > + __xe_vm_clear_scratch_pages(xe, vm, addr, addr + range); > > A few things... > > - I believe this is only needed for bind user operations or internal MAP > GPU VMA operations. > - I believe a TLB invalidation will be required. > - I don't think calling zap PTEs range works here, given how the scratch > tables are set up (i.e., new PTEs need to be created pointing to an > invalid state). > - This series appears to be untested based on the points above. 
> > Therefore, instead of this series, I believe you will need to fully > update the bind pipeline to process MAP GPU VMA operations here. > > So roughly... > > - Maybe include a bit in xe_vma_op_map that specifies "invalidate on > bind," set in vm_bind_ioctl_ops_create, since this will need to be > wired throughout the bind pipeline. > - Don't validate backing memory in this case. > - Ensure that xe_vma_ops_incr_pt_update_ops is called in this case for > MAP operations, forcing entry into the xe_pt.c backend. > - Update xe_pt_stage_bind_walk with a variable that indicates clearing > the PTE. Instead of calling pte_encode_vma in xe_pt_stage_bind_entry, > set this variable for PT bind operations derived from MAP operations > that meet the "invalidate on bind" condition. Ugh, typos / nonsense here. Let me try again. "Update xe_pt_stage_bind_walk with a variable that indicates clearing the PTE, set this for MAP ops that meet the 'invalidate on bind' condition. When this variable is set, instead of calling pte_encode_vma in xe_pt_stage_bind_entry, call a function which encodes an invalid PTE." Matt > - Ensure needs_invalidation is set in struct xe_vm_pgtable_update_ops if > a MAP operation is included that meets the "invalidate on bind" > condition. > - Set the VMA tile_invalidated in addition to tile_present for MAP > operations that meet the "invalidate on bind" condition. > > I might be missing some implementation details mentioned above, but this > should provide you with some direction. > > Lastly, and perhaps most importantly, please test this using an IGT and > include the results in the next post. > > Matt > > > + > > ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset, > > addr, range, op, flags, > > prefetch_region, pat_index); > > -- > > 2.26.3 > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-01-29 5:04 ` Matthew Brost @ 2025-01-29 8:21 ` Thomas Hellström 0 siblings, 0 replies; 32+ messages in thread From: Thomas Hellström @ 2025-01-29 8:21 UTC (permalink / raw) To: Matthew Brost, Oak Zeng; +Cc: intel-xe, joonas.lahtinen On Tue, 2025-01-28 at 21:04 -0800, Matthew Brost wrote: > On Tue, Jan 28, 2025 at 03:19:13PM -0800, Matthew Brost wrote: > > On Tue, Jan 28, 2025 at 05:21:44PM -0500, Oak Zeng wrote: > > > When a vm runs under fault mode, if scratch page is enabled, we > > > need > > > to clear the scratch page mapping before vm_bind for the vm_bind > > > address range. Under fault mode, we depend on recoverable page > > > fault > > > to establish mapping in page table. If scratch page is not > > > cleared, > > > GPU access of address won't cause page fault because it always > > > hits > > > the existing scratch page mapping. > > > > > > When vm_bind with IMMEDIATE flag, there is no need of clearing as > > > immediate bind can overwrite the scratch page mapping. > > > > > > So far only is xe2 and xe3 products are allowed to enable scratch > > > page > > > under fault mode. On other platform we don't allow scratch page > > > under > > > fault mode, so no need of such clearing. > > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > > --- > > > drivers/gpu/drm/xe/xe_vm.c | 32 ++++++++++++++++++++++++++++++++ > > > 1 file changed, 32 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > > b/drivers/gpu/drm/xe/xe_vm.c > > > index 690330352d4c..196d347c6ac0 100644 > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > @@ -38,6 +38,7 @@ > > > #include "xe_trace_bo.h" > > > #include "xe_wa.h" > > > #include "xe_hmm.h" > > > +#include "i915_drv.h" > > > > > > static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm) > > > { > > > @@ -2917,6 +2918,34 @@ static int > > > xe_vm_bind_ioctl_validate_bo(struct xe_device *xe, struct xe_bo > > > *bo, > > > return 0; > > > } > > > > > > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_device > > > *xe, > > > + struct xe_vm *vm, > > > u32 bind_flags) > > > +{ > > > + if (!xe_vm_in_fault_mode(vm)) > > > + return false; > > > + > > > + if (!xe_vm_has_scratch(vm)) > > > + return false; > > > + > > > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > > > + return false; > > > + > > > + if (!(IS_LUNARLAKE(xe) || IS_BATTLEMAGE(xe) || > > > IS_PANTHERLAKE(xe))) > > > + return false; > > > + > > > + return true; > > > +} > > > + > > > +static void __xe_vm_clear_scratch_pages(struct xe_device *xe, > > > struct xe_vm *vm, > > > + u64 start, u64 end) > > > +{ > > > + struct xe_tile *tile; > > > + u8 id; > > > + > > > + for_each_tile(tile, xe, id) > > > + xe_pt_zap_range(tile, vm, start, end); > > > +} > > > + > > > int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct > > > drm_file *file) > > > { > > > struct xe_device *xe = to_xe_device(dev); > > > @@ -3062,6 +3091,9 @@ int xe_vm_bind_ioctl(struct drm_device > > > *dev, void *data, struct drm_file *file) > > > u32 prefetch_region = > > > bind_ops[i].prefetch_mem_region_instance; > > > u16 pat_index = bind_ops[i].pat_index; > > > > > > + if (__xe_vm_needs_clear_scratch_pages(xe, vm, > > > flags)) > > > + __xe_vm_clear_scratch_pages(xe, vm, > > > addr, addr + range); > > > > A few things... > > > > - I believe this is only needed for bind user operations or > > internal MAP > > GPU VMA operations. > > - I believe a TLB invalidation will be required. 
> > - I don't think calling zap PTEs range works here, given how the > > scratch > > tables are set up (i.e., new PTEs need to be created pointing to > > an > > invalid state). > > - This series appears to be untested based on the points above. > > > > Therefore, instead of this series, I believe you will need to fully > > update the bind pipeline to process MAP GPU VMA operations here. > > > > So roughly... > > > > - Maybe include a bit in xe_vma_op_map that specifies "invalidate > > on > > bind," set in vm_bind_ioctl_ops_create, since this will need to > > be > > wired throughout the bind pipeline. > > - Don't validate backing memory in this case. > > - Ensure that xe_vma_ops_incr_pt_update_ops is called in this case > > for > > MAP operations, forcing entry into the xe_pt.c backend. > > - Update xe_pt_stage_bind_walk with a variable that indicates > > clearing > > the PTE. Instead of calling pte_encode_vma in > > xe_pt_stage_bind_entry, > > set this variable for PT bind operations derived from MAP > > operations > > that meet the "invalidate on bind" condition. > > Ugh, typos / nonsense here. Let me try again. > > "Update xe_pt_stage_bind_walk with a variable that indicates clearing > the PTE, set this for MAP ops that meet the 'invalidate on bind' > condition. When this variable is set, instead of calling > pte_encode_vma > in xe_pt_stage_bind_entry, call a function which encodes an invalid > PTE." > > Matt I agree with Matt here. This needs to look just like a regular bind, with the exception that PTEs are cleared instead of pointing towards valid pages. /Thomas ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-01-28 23:19 ` Matthew Brost 2025-01-29 5:04 ` Matthew Brost @ 2025-01-29 21:01 ` Zeng, Oak 2025-01-30 1:36 ` Matthew Brost 1 sibling, 1 reply; 32+ messages in thread From: Zeng, Oak @ 2025-01-29 21:01 UTC (permalink / raw) To: Brost, Matthew Cc: intel-xe@lists.freedesktop.org, joonas.lahtinen@linux.intel.com, Thomas.Hellstrom@linux.intel.com > -----Original Message----- > From: Brost, Matthew <matthew.brost@intel.com> > Sent: January 28, 2025 6:19 PM > To: Zeng, Oak <oak.zeng@intel.com> > Cc: intel-xe@lists.freedesktop.org; joonas.lahtinen@linux.intel.com; > Thomas.Hellstrom@linux.intel.com > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind > > On Tue, Jan 28, 2025 at 05:21:44PM -0500, Oak Zeng wrote: > > When a vm runs under fault mode, if scratch page is enabled, we > need > > to clear the scratch page mapping before vm_bind for the vm_bind > > address range. Under fault mode, we depend on recoverable page > fault > > to establish mapping in page table. If scratch page is not cleared, > > GPU access of address won't cause page fault because it always hits > > the existing scratch page mapping. > > > > When vm_bind with IMMEDIATE flag, there is no need of clearing as > > immediate bind can overwrite the scratch page mapping. > > > > So far only is xe2 and xe3 products are allowed to enable scratch > page > > under fault mode. On other platform we don't allow scratch page > under > > fault mode, so no need of such clearing. > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > --- > > drivers/gpu/drm/xe/xe_vm.c | 32 > ++++++++++++++++++++++++++++++++ > > 1 file changed, 32 insertions(+) > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > b/drivers/gpu/drm/xe/xe_vm.c > > index 690330352d4c..196d347c6ac0 100644 > > --- a/drivers/gpu/drm/xe/xe_vm.c > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > @@ -38,6 +38,7 @@ > > #include "xe_trace_bo.h" > > #include "xe_wa.h" > > #include "xe_hmm.h" > > +#include "i915_drv.h" > > > > static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm) > > { > > @@ -2917,6 +2918,34 @@ static int > xe_vm_bind_ioctl_validate_bo(struct xe_device *xe, struct xe_bo > *bo, > > return 0; > > } > > > > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_device > *xe, > > + struct xe_vm *vm, u32 > bind_flags) > > +{ > > + if (!xe_vm_in_fault_mode(vm)) > > + return false; > > + > > + if (!xe_vm_has_scratch(vm)) > > + return false; > > + > > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > > + return false; > > + > > + if (!(IS_LUNARLAKE(xe) || IS_BATTLEMAGE(xe) || > IS_PANTHERLAKE(xe))) > > + return false; > > + > > + return true; > > +} > > + > > +static void __xe_vm_clear_scratch_pages(struct xe_device *xe, > struct xe_vm *vm, > > + u64 start, u64 end) > > +{ > > + struct xe_tile *tile; > > + u8 id; > > + > > + for_each_tile(tile, xe, id) > > + xe_pt_zap_range(tile, vm, start, end); > > +} > > + > > int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct > drm_file *file) > > { > > struct xe_device *xe = to_xe_device(dev); > > @@ -3062,6 +3091,9 @@ int xe_vm_bind_ioctl(struct drm_device > *dev, void *data, struct drm_file *file) > > u32 prefetch_region = > bind_ops[i].prefetch_mem_region_instance; > > u16 pat_index = bind_ops[i].pat_index; > > > > + if (__xe_vm_needs_clear_scratch_pages(xe, vm, > flags)) > > + __xe_vm_clear_scratch_pages(xe, vm, addr, > addr + range); > > A few things... 
> > - I believe this is only needed for bind user operations or internal MAP > GPU VMA operations. Did you mean this is only needed for user bind, but not for internal map? > - I believe a TLB invalidation will be required. My understanding is, NULL pte won't be cached in TLB, so TLB invalidation is not needed. > - I don't think calling zap PTEs range works here, given how the scratch > tables are set up (i.e., new PTEs need to be created pointing to an > invalid state). You are right. I didn't realize that using xe_pt_walk_shared to walk and zap PTEs has a limitation: For the virtual address range we want to zap, all the page tables has to be already exist. This interface doesn't create new page tables. Even though xe_pt_walk_shared takes a range parameter (addr, end), range parameter can't be arbitrary. Today only [xe_vma_start, xe_vma_end) is used to specify the xe_pt_walk_shared Walking range. Arbitrary range won't work as you pointed out. To me this is a small Interface design issue. If you agree, I can re-parameterize xe_pt_walk_shared to take VMA vs addr/end. This way people won't make the same mistake in the future. Anyway, I will follow the direction you give below to rework this series. Oak > - This series appears to be untested based on the points above. > > Therefore, instead of this series, I believe you will need to fully > update the bind pipeline to process MAP GPU VMA operations here. > > So roughly... > > - Maybe include a bit in xe_vma_op_map that specifies "invalidate on > bind," set in vm_bind_ioctl_ops_create, since this will need to be > wired throughout the bind pipeline. > - Don't validate backing memory in this case. > - Ensure that xe_vma_ops_incr_pt_update_ops is called in this case > for > MAP operations, forcing entry into the xe_pt.c backend. > - Update xe_pt_stage_bind_walk with a variable that indicates > clearing > the PTE. Instead of calling pte_encode_vma in > xe_pt_stage_bind_entry, > set this variable for PT bind operations derived from MAP operations > that meet the "invalidate on bind" condition. > - Ensure needs_invalidation is set in struct > xe_vm_pgtable_update_ops if > a MAP operation is included that meets the "invalidate on bind" > condition. > - Set the VMA tile_invalidated in addition to tile_present for MAP > operations that meet the "invalidate on bind" condition. > > I might be missing some implementation details mentioned above, > but this > should provide you with some direction. > > Lastly, and perhaps most importantly, please test this using an IGT > and > include the results in the next post. > > Matt > > > + > > ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], > obj_offset, > > addr, range, op, flags, > > prefetch_region, > pat_index); > > -- > > 2.26.3 > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-01-29 21:01 ` Zeng, Oak @ 2025-01-30 1:36 ` Matthew Brost 2025-01-30 8:36 ` Thomas Hellström 0 siblings, 1 reply; 32+ messages in thread From: Matthew Brost @ 2025-01-30 1:36 UTC (permalink / raw) To: Zeng, Oak Cc: intel-xe@lists.freedesktop.org, joonas.lahtinen@linux.intel.com, Thomas.Hellstrom@linux.intel.com On Wed, Jan 29, 2025 at 02:01:38PM -0700, Zeng, Oak wrote: > > > > -----Original Message----- > > From: Brost, Matthew <matthew.brost@intel.com> > > Sent: January 28, 2025 6:19 PM > > To: Zeng, Oak <oak.zeng@intel.com> > > Cc: intel-xe@lists.freedesktop.org; joonas.lahtinen@linux.intel.com; > > Thomas.Hellstrom@linux.intel.com > > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind > > > > On Tue, Jan 28, 2025 at 05:21:44PM -0500, Oak Zeng wrote: > > > When a vm runs under fault mode, if scratch page is enabled, we > > need > > > to clear the scratch page mapping before vm_bind for the vm_bind > > > address range. Under fault mode, we depend on recoverable page > > fault > > > to establish mapping in page table. If scratch page is not cleared, > > > GPU access of address won't cause page fault because it always hits > > > the existing scratch page mapping. > > > > > > When vm_bind with IMMEDIATE flag, there is no need of clearing as > > > immediate bind can overwrite the scratch page mapping. > > > > > > So far only is xe2 and xe3 products are allowed to enable scratch > > page > > > under fault mode. On other platform we don't allow scratch page > > under > > > fault mode, so no need of such clearing. > > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > > --- > > > drivers/gpu/drm/xe/xe_vm.c | 32 > > ++++++++++++++++++++++++++++++++ > > > 1 file changed, 32 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > b/drivers/gpu/drm/xe/xe_vm.c > > > index 690330352d4c..196d347c6ac0 100644 > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > @@ -38,6 +38,7 @@ > > > #include "xe_trace_bo.h" > > > #include "xe_wa.h" > > > #include "xe_hmm.h" > > > +#include "i915_drv.h" > > > > > > static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm) > > > { > > > @@ -2917,6 +2918,34 @@ static int > > xe_vm_bind_ioctl_validate_bo(struct xe_device *xe, struct xe_bo > > *bo, > > > return 0; > > > } > > > > > > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_device > > *xe, > > > + struct xe_vm *vm, u32 > > bind_flags) > > > +{ > > > + if (!xe_vm_in_fault_mode(vm)) > > > + return false; > > > + > > > + if (!xe_vm_has_scratch(vm)) > > > + return false; > > > + > > > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > > > + return false; > > > + > > > + if (!(IS_LUNARLAKE(xe) || IS_BATTLEMAGE(xe) || > > IS_PANTHERLAKE(xe))) > > > + return false; > > > + > > > + return true; > > > +} > > > + > > > +static void __xe_vm_clear_scratch_pages(struct xe_device *xe, > > struct xe_vm *vm, > > > + u64 start, u64 end) > > > +{ > > > + struct xe_tile *tile; > > > + u8 id; > > > + > > > + for_each_tile(tile, xe, id) > > > + xe_pt_zap_range(tile, vm, start, end); > > > +} > > > + > > > int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct > > drm_file *file) > > > { > > > struct xe_device *xe = to_xe_device(dev); > > > @@ -3062,6 +3091,9 @@ int xe_vm_bind_ioctl(struct drm_device > > *dev, void *data, struct drm_file *file) > > > u32 prefetch_region = > > bind_ops[i].prefetch_mem_region_instance; > > > u16 pat_index = bind_ops[i].pat_index; > > > > > > + if 
(__xe_vm_needs_clear_scratch_pages(xe, vm, > > flags)) > > > + __xe_vm_clear_scratch_pages(xe, vm, addr, > > addr + range); > > > > A few things... > > > > - I believe this is only needed for bind user operations or internal MAP > > GPU VMA operations. > > Did you mean this is only needed for user bind, but not for internal map? > Needed for user bind and also for internal MAP. With updating the bind pipeline you will only care MAP ops. > > - I believe a TLB invalidation will be required. > > My understanding is, NULL pte won't be cached in TLB, so TLB invalidation is not needed. > I would think NULL ptes are cached but not certain either way. > > - I don't think calling zap PTEs range works here, given how the scratch > > tables are set up (i.e., new PTEs need to be created pointing to an > > invalid state). > > You are right. > > I didn't realize that using xe_pt_walk_shared to walk and zap PTEs has a limitation: > For the virtual address range we want to zap, all the page tables has to be already > exist. This interface doesn't create new page tables. Even though xe_pt_walk_shared > takes a range parameter (addr, end), range parameter can't be arbitrary. > > Today only [xe_vma_start, xe_vma_end) is used to specify the xe_pt_walk_shared > Walking range. Arbitrary range won't work as you pointed out. To me this is a small > Interface design issue. If you agree, I can re-parameterize xe_pt_walk_shared to take > VMA vs addr/end. This way people won't make the same mistake in the future. > Ah, no. SVM will use ranges so I think (addr, end) are the right parameters for internal PT functions. > Anyway, I will follow the direction you give below to rework this series. > +1 Matt > Oak > > > - This series appears to be untested based on the points above. > > > > Therefore, instead of this series, I believe you will need to fully > > update the bind pipeline to process MAP GPU VMA operations here. > > > > So roughly... > > > > - Maybe include a bit in xe_vma_op_map that specifies "invalidate on > > bind," set in vm_bind_ioctl_ops_create, since this will need to be > > wired throughout the bind pipeline. > > - Don't validate backing memory in this case. > > - Ensure that xe_vma_ops_incr_pt_update_ops is called in this case > > for > > MAP operations, forcing entry into the xe_pt.c backend. > > - Update xe_pt_stage_bind_walk with a variable that indicates > > clearing > > the PTE. Instead of calling pte_encode_vma in > > xe_pt_stage_bind_entry, > > set this variable for PT bind operations derived from MAP operations > > that meet the "invalidate on bind" condition. > > - Ensure needs_invalidation is set in struct > > xe_vm_pgtable_update_ops if > > a MAP operation is included that meets the "invalidate on bind" > > condition. > > - Set the VMA tile_invalidated in addition to tile_present for MAP > > operations that meet the "invalidate on bind" condition. > > > > I might be missing some implementation details mentioned above, > > but this > > should provide you with some direction. > > > > Lastly, and perhaps most importantly, please test this using an IGT > > and > > include the results in the next post. > > > > Matt > > > > > + > > > ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], > > obj_offset, > > > addr, range, op, flags, > > > prefetch_region, > > pat_index); > > > -- > > > 2.26.3 > > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-01-30 1:36 ` Matthew Brost @ 2025-01-30 8:36 ` Thomas Hellström 2025-01-30 15:18 ` Zeng, Oak 0 siblings, 1 reply; 32+ messages in thread From: Thomas Hellström @ 2025-01-30 8:36 UTC (permalink / raw) To: Matthew Brost, Zeng, Oak Cc: intel-xe@lists.freedesktop.org, joonas.lahtinen@linux.intel.com On Wed, 2025-01-29 at 17:36 -0800, Matthew Brost wrote: > On Wed, Jan 29, 2025 at 02:01:38PM -0700, Zeng, Oak wrote: > > > > > > > -----Original Message----- > > > From: Brost, Matthew <matthew.brost@intel.com> > > > Sent: January 28, 2025 6:19 PM > > > To: Zeng, Oak <oak.zeng@intel.com> > > > Cc: intel-xe@lists.freedesktop.org; > > > joonas.lahtinen@linux.intel.com; > > > Thomas.Hellstrom@linux.intel.com > > > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before > > > vm_bind > > > > > > On Tue, Jan 28, 2025 at 05:21:44PM -0500, Oak Zeng wrote: > > > > When a vm runs under fault mode, if scratch page is enabled, we > > > need > > > > to clear the scratch page mapping before vm_bind for the > > > > vm_bind > > > > address range. Under fault mode, we depend on recoverable page > > > fault > > > > to establish mapping in page table. If scratch page is not > > > > cleared, > > > > GPU access of address won't cause page fault because it always > > > > hits > > > > the existing scratch page mapping. > > > > > > > > When vm_bind with IMMEDIATE flag, there is no need of clearing > > > > as > > > > immediate bind can overwrite the scratch page mapping. > > > > > > > > So far only is xe2 and xe3 products are allowed to enable > > > > scratch > > > page > > > > under fault mode. On other platform we don't allow scratch page > > > under > > > > fault mode, so no need of such clearing. > > > > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > > > --- > > > > drivers/gpu/drm/xe/xe_vm.c | 32 > > > ++++++++++++++++++++++++++++++++ > > > > 1 file changed, 32 insertions(+) > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > > b/drivers/gpu/drm/xe/xe_vm.c > > > > index 690330352d4c..196d347c6ac0 100644 > > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > > @@ -38,6 +38,7 @@ > > > > #include "xe_trace_bo.h" > > > > #include "xe_wa.h" > > > > #include "xe_hmm.h" > > > > +#include "i915_drv.h" > > > > > > > > static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm) > > > > { > > > > @@ -2917,6 +2918,34 @@ static int > > > xe_vm_bind_ioctl_validate_bo(struct xe_device *xe, struct xe_bo > > > *bo, > > > > return 0; > > > > } > > > > > > > > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_device > > > *xe, > > > > + struct xe_vm > > > > *vm, u32 > > > bind_flags) > > > > +{ > > > > + if (!xe_vm_in_fault_mode(vm)) > > > > + return false; > > > > + > > > > + if (!xe_vm_has_scratch(vm)) > > > > + return false; > > > > + > > > > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > > > > + return false; > > > > + > > > > + if (!(IS_LUNARLAKE(xe) || IS_BATTLEMAGE(xe) || > > > IS_PANTHERLAKE(xe))) > > > > + return false; > > > > + > > > > + return true; > > > > +} > > > > + > > > > +static void __xe_vm_clear_scratch_pages(struct xe_device *xe, > > > struct xe_vm *vm, > > > > + u64 start, u64 end) > > > > +{ > > > > + struct xe_tile *tile; > > > > + u8 id; > > > > + > > > > + for_each_tile(tile, xe, id) > > > > + xe_pt_zap_range(tile, vm, start, end); > > > > +} > > > > + > > > > int xe_vm_bind_ioctl(struct drm_device *dev, void *data, > > > > struct > > > drm_file *file) > > > > { > > > > struct 
xe_device *xe = to_xe_device(dev); > > > > @@ -3062,6 +3091,9 @@ int xe_vm_bind_ioctl(struct drm_device > > > *dev, void *data, struct drm_file *file) > > > > u32 prefetch_region = > > > bind_ops[i].prefetch_mem_region_instance; > > > > u16 pat_index = bind_ops[i].pat_index; > > > > > > > > + if (__xe_vm_needs_clear_scratch_pages(xe, vm, > > > flags)) > > > > + __xe_vm_clear_scratch_pages(xe, vm, > > > > addr, > > > addr + range); > > > > > > A few things... > > > > > > - I believe this is only needed for bind user operations or > > > internal MAP > > > GPU VMA operations. > > > > Did you mean this is only needed for user bind, but not for > > internal map? > > > > Needed for user bind and also for internal MAP. With updating the > bind > pipeline you will only care MAP ops. > > > > - I believe a TLB invalidation will be required. > > > > My understanding is, NULL pte won't be cached in TLB, so TLB > > invalidation is not needed. > > > > I would think NULL ptes are cached but not certain either way. Note that a NULL / scratch mapping for larger pages may point to a physical page. /Thomas > > > > - I don't think calling zap PTEs range works here, given how the > > > scratch > > > tables are set up (i.e., new PTEs need to be created pointing > > > to an > > > invalid state). > > > > You are right. > > > > I didn't realize that using xe_pt_walk_shared to walk and zap PTEs > > has a limitation: > > For the virtual address range we want to zap, all the page tables > > has to be already > > exist. This interface doesn't create new page tables. Even though > > xe_pt_walk_shared > > takes a range parameter (addr, end), range parameter can't be > > arbitrary. > > > > Today only [xe_vma_start, xe_vma_end) is used to specify the > > xe_pt_walk_shared > > Walking range. Arbitrary range won't work as you pointed out. To me > > this is a small > > Interface design issue. If you agree, I can re-parameterize > > xe_pt_walk_shared to take > > VMA vs addr/end. This way people won't make the same mistake in the > > future. > > > > Ah, no. SVM will use ranges so I think (addr, end) are the right > parameters for internal PT functions. > > > Anyway, I will follow the direction you give below to rework this > > series. > > > > +1 > > Matt > > > Oak > > > > > - This series appears to be untested based on the points above. > > > > > > Therefore, instead of this series, I believe you will need to > > > fully > > > update the bind pipeline to process MAP GPU VMA operations here. > > > > > > So roughly... > > > > > > - Maybe include a bit in xe_vma_op_map that specifies "invalidate > > > on > > > bind," set in vm_bind_ioctl_ops_create, since this will need to > > > be > > > wired throughout the bind pipeline. > > > - Don't validate backing memory in this case. > > > - Ensure that xe_vma_ops_incr_pt_update_ops is called in this > > > case > > > for > > > MAP operations, forcing entry into the xe_pt.c backend. > > > - Update xe_pt_stage_bind_walk with a variable that indicates > > > clearing > > > the PTE. Instead of calling pte_encode_vma in > > > xe_pt_stage_bind_entry, > > > set this variable for PT bind operations derived from MAP > > > operations > > > that meet the "invalidate on bind" condition. > > > - Ensure needs_invalidation is set in struct > > > xe_vm_pgtable_update_ops if > > > a MAP operation is included that meets the "invalidate on bind" > > > condition. > > > - Set the VMA tile_invalidated in addition to tile_present for > > > MAP > > > operations that meet the "invalidate on bind" condition. 
> > > > > > I might be missing some implementation details mentioned above, > > > but this > > > should provide you with some direction. > > > > > > Lastly, and perhaps most importantly, please test this using an > > > IGT > > > and > > > include the results in the next post. > > > > > > Matt > > > > > > > + > > > > ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], > > > obj_offset, > > > > addr, range, > > > > op, flags, > > > > > > > > prefetch_region, > > > pat_index); > > > > -- > > > > 2.26.3 > > > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-01-30 8:36 ` Thomas Hellström @ 2025-01-30 15:18 ` Zeng, Oak 0 siblings, 0 replies; 32+ messages in thread From: Zeng, Oak @ 2025-01-30 15:18 UTC (permalink / raw) To: Thomas Hellström, Brost, Matthew Cc: intel-xe@lists.freedesktop.org, joonas.lahtinen@linux.intel.com > -----Original Message----- > From: Thomas Hellström <thomas.hellstrom@linux.intel.com> > Sent: January 30, 2025 3:37 AM > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak > <oak.zeng@intel.com> > Cc: intel-xe@lists.freedesktop.org; joonas.lahtinen@linux.intel.com > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind > > On Wed, 2025-01-29 at 17:36 -0800, Matthew Brost wrote: > > On Wed, Jan 29, 2025 at 02:01:38PM -0700, Zeng, Oak wrote: > > > > > > > > > > -----Original Message----- > > > > From: Brost, Matthew <matthew.brost@intel.com> > > > > Sent: January 28, 2025 6:19 PM > > > > To: Zeng, Oak <oak.zeng@intel.com> > > > > Cc: intel-xe@lists.freedesktop.org; > > > > joonas.lahtinen@linux.intel.com; > > > > Thomas.Hellstrom@linux.intel.com > > > > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before > > > > vm_bind > > > > > > > > On Tue, Jan 28, 2025 at 05:21:44PM -0500, Oak Zeng wrote: > > > > > When a vm runs under fault mode, if scratch page is enabled, > we > > > > need > > > > > to clear the scratch page mapping before vm_bind for the > > > > > vm_bind > > > > > address range. Under fault mode, we depend on recoverable > page > > > > fault > > > > > to establish mapping in page table. If scratch page is not > > > > > cleared, > > > > > GPU access of address won't cause page fault because it > always > > > > > hits > > > > > the existing scratch page mapping. > > > > > > > > > > When vm_bind with IMMEDIATE flag, there is no need of > clearing > > > > > as > > > > > immediate bind can overwrite the scratch page mapping. > > > > > > > > > > So far only is xe2 and xe3 products are allowed to enable > > > > > scratch > > > > page > > > > > under fault mode. On other platform we don't allow scratch > page > > > > under > > > > > fault mode, so no need of such clearing. 
> > > > > > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > > > > --- > > > > > drivers/gpu/drm/xe/xe_vm.c | 32 > > > > ++++++++++++++++++++++++++++++++ > > > > > 1 file changed, 32 insertions(+) > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > > > b/drivers/gpu/drm/xe/xe_vm.c > > > > > index 690330352d4c..196d347c6ac0 100644 > > > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > > > @@ -38,6 +38,7 @@ > > > > > #include "xe_trace_bo.h" > > > > > #include "xe_wa.h" > > > > > #include "xe_hmm.h" > > > > > +#include "i915_drv.h" > > > > > > > > > > static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm) > > > > > { > > > > > @@ -2917,6 +2918,34 @@ static int > > > > xe_vm_bind_ioctl_validate_bo(struct xe_device *xe, struct > xe_bo > > > > *bo, > > > > > return 0; > > > > > } > > > > > > > > > > +static bool __xe_vm_needs_clear_scratch_pages(struct > xe_device > > > > *xe, > > > > > + struct xe_vm > > > > > *vm, u32 > > > > bind_flags) > > > > > +{ > > > > > + if (!xe_vm_in_fault_mode(vm)) > > > > > + return false; > > > > > + > > > > > + if (!xe_vm_has_scratch(vm)) > > > > > + return false; > > > > > + > > > > > + if (bind_flags & > DRM_XE_VM_BIND_FLAG_IMMEDIATE) > > > > > + return false; > > > > > + > > > > > + if (!(IS_LUNARLAKE(xe) || IS_BATTLEMAGE(xe) || > > > > IS_PANTHERLAKE(xe))) > > > > > + return false; > > > > > + > > > > > + return true; > > > > > +} > > > > > + > > > > > +static void __xe_vm_clear_scratch_pages(struct xe_device > *xe, > > > > struct xe_vm *vm, > > > > > + u64 start, u64 end) > > > > > +{ > > > > > + struct xe_tile *tile; > > > > > + u8 id; > > > > > + > > > > > + for_each_tile(tile, xe, id) > > > > > + xe_pt_zap_range(tile, vm, start, end); > > > > > +} > > > > > + > > > > > int xe_vm_bind_ioctl(struct drm_device *dev, void *data, > > > > > struct > > > > drm_file *file) > > > > > { > > > > > struct xe_device *xe = to_xe_device(dev); > > > > > @@ -3062,6 +3091,9 @@ int xe_vm_bind_ioctl(struct > drm_device > > > > *dev, void *data, struct drm_file *file) > > > > > u32 prefetch_region = > > > > bind_ops[i].prefetch_mem_region_instance; > > > > > u16 pat_index = bind_ops[i].pat_index; > > > > > > > > > > + if (__xe_vm_needs_clear_scratch_pages(xe, > vm, > > > > flags)) > > > > > + __xe_vm_clear_scratch_pages(xe, > vm, > > > > > addr, > > > > addr + range); > > > > > > > > A few things... > > > > > > > > - I believe this is only needed for bind user operations or > > > > internal MAP > > > > GPU VMA operations. > > > > > > Did you mean this is only needed for user bind, but not for > > > internal map? > > > > > > > Needed for user bind and also for internal MAP. With updating the > > bind > > pipeline you will only care MAP ops. > > > > > > - I believe a TLB invalidation will be required. > > > > > > My understanding is, NULL pte won't be cached in TLB, so TLB > > > invalidation is not needed. > > > > > > > I would think NULL ptes are cached but not certain either way. > > Note that a NULL / scratch mapping for larger pages may point to a > physical page. Thanks for pointing this out. Looking at page table definition, some high Level page table indeed doesn't support NULL entry. So yes TLB invalidation is required. Oak > > /Thomas > > > > > > > > - I don't think calling zap PTEs range works here, given how the > > > > scratch > > > > tables are set up (i.e., new PTEs need to be created pointing > > > > to an > > > > invalid state). > > > > > > You are right. 
> > > > > > I didn't realize that using xe_pt_walk_shared to walk and zap PTEs > > > has a limitation: > > > For the virtual address range we want to zap, all the page tables > > > has to be already > > > exist. This interface doesn't create new page tables. Even though > > > xe_pt_walk_shared > > > takes a range parameter (addr, end), range parameter can't be > > > arbitrary. > > > > > > Today only [xe_vma_start, xe_vma_end) is used to specify the > > > xe_pt_walk_shared > > > Walking range. Arbitrary range won't work as you pointed out. To > me > > > this is a small > > > Interface design issue. If you agree, I can re-parameterize > > > xe_pt_walk_shared to take > > > VMA vs addr/end. This way people won't make the same mistake > in the > > > future. > > > > > > > Ah, no. SVM will use ranges so I think (addr, end) are the right > > parameters for internal PT functions. > > > > > Anyway, I will follow the direction you give below to rework this > > > series. > > > > > > > +1 > > > > Matt > > > > > Oak > > > > > > > - This series appears to be untested based on the points above. > > > > > > > > Therefore, instead of this series, I believe you will need to > > > > fully > > > > update the bind pipeline to process MAP GPU VMA operations > here. > > > > > > > > So roughly... > > > > > > > > - Maybe include a bit in xe_vma_op_map that specifies > "invalidate > > > > on > > > > bind," set in vm_bind_ioctl_ops_create, since this will need to > > > > be > > > > wired throughout the bind pipeline. > > > > - Don't validate backing memory in this case. > > > > - Ensure that xe_vma_ops_incr_pt_update_ops is called in this > > > > case > > > > for > > > > MAP operations, forcing entry into the xe_pt.c backend. > > > > - Update xe_pt_stage_bind_walk with a variable that indicates > > > > clearing > > > > the PTE. Instead of calling pte_encode_vma in > > > > xe_pt_stage_bind_entry, > > > > set this variable for PT bind operations derived from MAP > > > > operations > > > > that meet the "invalidate on bind" condition. > > > > - Ensure needs_invalidation is set in struct > > > > xe_vm_pgtable_update_ops if > > > > a MAP operation is included that meets the "invalidate on bind" > > > > condition. > > > > - Set the VMA tile_invalidated in addition to tile_present for > > > > MAP > > > > operations that meet the "invalidate on bind" condition. > > > > > > > > I might be missing some implementation details mentioned > above, > > > > but this > > > > should provide you with some direction. > > > > > > > > Lastly, and perhaps most importantly, please test this using an > > > > IGT > > > > and > > > > include the results in the next post. > > > > > > > > Matt > > > > > > > > > + > > > > > ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], > > > > obj_offset, > > > > > addr, range, > > > > > op, flags, > > > > > > > > > > prefetch_region, > > > > pat_index); > > > > > -- > > > > > 2.26.3 > > > > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 3/3] drm/xe: Allow scratch page under fault mode for certain platform 2025-01-28 22:21 [PATCH 1/3] drm/xe: Add a function to zap page table by address range Oak Zeng 2025-01-28 22:21 ` [PATCH 2/3] drm/xe: Clear scratch page before vm_bind Oak Zeng @ 2025-01-28 22:21 ` Oak Zeng 2025-01-28 23:05 ` Cavitt, Jonathan 2025-01-29 8:52 ` Thomas Hellström 2025-01-28 23:04 ` [PATCH 1/3] drm/xe: Add a function to zap page table by address range Cavitt, Jonathan ` (5 subsequent siblings) 7 siblings, 2 replies; 32+ messages in thread From: Oak Zeng @ 2025-01-28 22:21 UTC (permalink / raw) To: intel-xe; +Cc: joonas.lahtinen, Thomas.Hellstrom Normally scratch page is not allowed when a vm is operate under page fault mode, i.e., in the existing codes, DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE and DRM_XE_VM_CREATE_FLAG_FAULT_MODE are mutual exclusive. The reason is fault mode relies on recoverable page to work, while scratch page can mute recoverable page fault. On xe2 and xe3, HW prefetching can cause page fault interrupt. Due to page fault interrupt overhead (i.e., need Guc and KMD involved to fix the page fault), HW prefetching can be slowed by many orders of magnitude. Fix this problem by allowing scratch page under fault mode for xe2 and xe3. With scratch page in place, HW prefetching could always hit scratch page instead of causing interrupt. Signed-off-by: Oak Zeng <oak.zeng@intel.com> --- drivers/gpu/drm/xe/xe_vm.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c index 196d347c6ac0..3346f88f284a 100644 --- a/drivers/gpu/drm/xe/xe_vm.c +++ b/drivers/gpu/drm/xe/xe_vm.c @@ -1732,6 +1732,11 @@ int xe_vm_create_ioctl(struct drm_device *dev, void *data, if (XE_IOCTL_DBG(xe, args->extensions)) return -EINVAL; + if (XE_IOCTL_DBG(xe, args->flags & DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE && + args->flags & DRM_XE_VM_CREATE_FLAG_FAULT_MODE && + !(IS_LUNARLAKE(xe) || IS_BATTLEMAGE(xe) || IS_PANTHERLAKE(xe)))) + return -EINVAL; + if (XE_WA(xe_root_mmio_gt(xe), 14016763929)) args->flags |= DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE; @@ -1745,10 +1750,6 @@ int xe_vm_create_ioctl(struct drm_device *dev, void *data, if (XE_IOCTL_DBG(xe, args->flags & ~ALL_DRM_XE_VM_CREATE_FLAGS)) return -EINVAL; - if (XE_IOCTL_DBG(xe, args->flags & DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE && - args->flags & DRM_XE_VM_CREATE_FLAG_FAULT_MODE)) - return -EINVAL; - if (XE_IOCTL_DBG(xe, !(args->flags & DRM_XE_VM_CREATE_FLAG_LR_MODE) && args->flags & DRM_XE_VM_CREATE_FLAG_FAULT_MODE)) return -EINVAL; -- 2.26.3 ^ permalink raw reply related [flat|nested] 32+ messages in thread
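For illustration, with patch 3 applied userspace could create a fault-mode VM
that keeps scratch pages enabled, roughly as below. This is only a sketch: the
uAPI struct and ioctl names are taken from the xe uAPI as I understand it, and
the header path, helper name and error handling are assumptions.

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/xe_drm.h>

/* Sketch: create a long-running, fault-mode VM with scratch pages enabled. */
static int create_fault_mode_vm_with_scratch(int fd, uint32_t *vm_id)
{
	struct drm_xe_vm_create create = {
		.flags = DRM_XE_VM_CREATE_FLAG_LR_MODE |
			 DRM_XE_VM_CREATE_FLAG_FAULT_MODE |
			 DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE,
	};

	/* Rejected with -EINVAL before this patch; allowed on xe2/xe3 now. */
	if (ioctl(fd, DRM_IOCTL_XE_VM_CREATE, &create))
		return -errno;

	*vm_id = create.vm_id;
	return 0;
}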
* RE: [PATCH 3/3] drm/xe: Allow scratch page under fault mode for certain platform 2025-01-28 22:21 ` [PATCH 3/3] drm/xe: Allow scratch page under fault mode for certain platform Oak Zeng @ 2025-01-28 23:05 ` Cavitt, Jonathan 2025-01-29 8:52 ` Thomas Hellström 1 sibling, 0 replies; 32+ messages in thread From: Cavitt, Jonathan @ 2025-01-28 23:05 UTC (permalink / raw) To: Zeng, Oak, intel-xe@lists.freedesktop.org Cc: joonas.lahtinen@linux.intel.com, Thomas.Hellstrom@linux.intel.com, Cavitt, Jonathan -----Original Message----- From: Intel-xe <intel-xe-bounces@lists.freedesktop.org> On Behalf Of Oak Zeng Sent: Tuesday, January 28, 2025 2:22 PM To: intel-xe@lists.freedesktop.org Cc: joonas.lahtinen@linux.intel.com; Thomas.Hellstrom@linux.intel.com Subject: [PATCH 3/3] drm/xe: Allow scratch page under fault mode for certain platform > > Normally scratch page is not allowed when a vm is operate under page > fault mode, i.e., in the existing codes, DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE > and DRM_XE_VM_CREATE_FLAG_FAULT_MODE are mutual exclusive. The reason > is fault mode relies on recoverable page to work, while scratch page > can mute recoverable page fault. > > On xe2 and xe3, HW prefetching can cause page fault interrupt. Due to > page fault interrupt overhead (i.e., need Guc and KMD involved to fix > the page fault), HW prefetching can be slowed by many orders of magnitude. > > Fix this problem by allowing scratch page under fault mode for xe2 and xe3. > With scratch page in place, HW prefetching could always hit scratch page > instead of causing interrupt. > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> LGTM, though see my comment for patch 2 regarding the added platform checks. Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> -Jonathan Cavitt > --- > drivers/gpu/drm/xe/xe_vm.c | 9 +++++---- > 1 file changed, 5 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c > index 196d347c6ac0..3346f88f284a 100644 > --- a/drivers/gpu/drm/xe/xe_vm.c > +++ b/drivers/gpu/drm/xe/xe_vm.c > @@ -1732,6 +1732,11 @@ int xe_vm_create_ioctl(struct drm_device *dev, void *data, > if (XE_IOCTL_DBG(xe, args->extensions)) > return -EINVAL; > > + if (XE_IOCTL_DBG(xe, args->flags & DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE && > + args->flags & DRM_XE_VM_CREATE_FLAG_FAULT_MODE && > + !(IS_LUNARLAKE(xe) || IS_BATTLEMAGE(xe) || IS_PANTHERLAKE(xe)))) > + return -EINVAL; > + > if (XE_WA(xe_root_mmio_gt(xe), 14016763929)) > args->flags |= DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE; > > @@ -1745,10 +1750,6 @@ int xe_vm_create_ioctl(struct drm_device *dev, void *data, > if (XE_IOCTL_DBG(xe, args->flags & ~ALL_DRM_XE_VM_CREATE_FLAGS)) > return -EINVAL; > > - if (XE_IOCTL_DBG(xe, args->flags & DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE && > - args->flags & DRM_XE_VM_CREATE_FLAG_FAULT_MODE)) > - return -EINVAL; > - > if (XE_IOCTL_DBG(xe, !(args->flags & DRM_XE_VM_CREATE_FLAG_LR_MODE) && > args->flags & DRM_XE_VM_CREATE_FLAG_FAULT_MODE)) > return -EINVAL; > -- > 2.26.3 > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 3/3] drm/xe: Allow scratch page under fault mode for certain platform 2025-01-28 22:21 ` [PATCH 3/3] drm/xe: Allow scratch page under fault mode for certain platform Oak Zeng 2025-01-28 23:05 ` Cavitt, Jonathan @ 2025-01-29 8:52 ` Thomas Hellström 2025-01-29 16:41 ` Matthew Brost 1 sibling, 1 reply; 32+ messages in thread From: Thomas Hellström @ 2025-01-29 8:52 UTC (permalink / raw) To: Oak Zeng, intel-xe; +Cc: joonas.lahtinen On Tue, 2025-01-28 at 17:21 -0500, Oak Zeng wrote: > Normally scratch page is not allowed when a vm is operate under page > fault mode, i.e., in the existing codes, > DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE > and DRM_XE_VM_CREATE_FLAG_FAULT_MODE are mutual exclusive. The reason > is fault mode relies on recoverable page to work, while scratch page > can mute recoverable page fault. > > On xe2 and xe3, HW prefetching can cause page fault interrupt. Due to > page fault interrupt overhead (i.e., need Guc and KMD involved to fix > the page fault), HW prefetching can be slowed by many orders of > magnitude. The problem on xe2 and xe3 as I understand it is not the overhead but the fact that OOB prefetching causes pagefaults and we can't resolve those. > > Fix this problem by allowing scratch page under fault mode for xe2 > and xe3. > With scratch page in place, HW prefetching could always hit scratch > page > instead of causing interrupt. > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > --- > drivers/gpu/drm/xe/xe_vm.c | 9 +++++---- > 1 file changed, 5 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c > index 196d347c6ac0..3346f88f284a 100644 > --- a/drivers/gpu/drm/xe/xe_vm.c > +++ b/drivers/gpu/drm/xe/xe_vm.c > @@ -1732,6 +1732,11 @@ int xe_vm_create_ioctl(struct drm_device *dev, > void *data, > if (XE_IOCTL_DBG(xe, args->extensions)) > return -EINVAL; > > + if (XE_IOCTL_DBG(xe, args->flags & > DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE && > + args->flags & > DRM_XE_VM_CREATE_FLAG_FAULT_MODE && > + !(IS_LUNARLAKE(xe) || IS_BATTLEMAGE(xe) || > IS_PANTHERLAKE(xe)))) > + return -EINVAL; > + Could we perhaps add a bit in the xe->info structure? /Thomas > if (XE_WA(xe_root_mmio_gt(xe), 14016763929)) > args->flags |= DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE; > > @@ -1745,10 +1750,6 @@ int xe_vm_create_ioctl(struct drm_device *dev, > void *data, > if (XE_IOCTL_DBG(xe, args->flags & > ~ALL_DRM_XE_VM_CREATE_FLAGS)) > return -EINVAL; > > - if (XE_IOCTL_DBG(xe, args->flags & > DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE && > - args->flags & > DRM_XE_VM_CREATE_FLAG_FAULT_MODE)) > - return -EINVAL; > - > if (XE_IOCTL_DBG(xe, !(args->flags & > DRM_XE_VM_CREATE_FLAG_LR_MODE) && > args->flags & > DRM_XE_VM_CREATE_FLAG_FAULT_MODE)) > return -EINVAL; ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 3/3] drm/xe: Allow scratch page under fault mode for certain platform 2025-01-29 8:52 ` Thomas Hellström @ 2025-01-29 16:41 ` Matthew Brost 0 siblings, 0 replies; 32+ messages in thread From: Matthew Brost @ 2025-01-29 16:41 UTC (permalink / raw) To: Thomas Hellström; +Cc: Oak Zeng, intel-xe, joonas.lahtinen On Wed, Jan 29, 2025 at 09:52:54AM +0100, Thomas Hellström wrote: > On Tue, 2025-01-28 at 17:21 -0500, Oak Zeng wrote: > > Normally scratch page is not allowed when a vm is operate under page > > fault mode, i.e., in the existing codes, > > DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE > > and DRM_XE_VM_CREATE_FLAG_FAULT_MODE are mutual exclusive. The reason > > is fault mode relies on recoverable page to work, while scratch page > > can mute recoverable page fault. > > > > On xe2 and xe3, HW prefetching can cause page fault interrupt. Due to > > page fault interrupt overhead (i.e., need Guc and KMD involved to fix > > the page fault), HW prefetching can be slowed by many orders of > > magnitude. > > The problem on xe2 and xe3 as I understand it is not the overhead but > the fact that OOB prefetching causes pagefaults and we can't resolve > those. > > > > > Fix this problem by allowing scratch page under fault mode for xe2 > > and xe3. > > With scratch page in place, HW prefetching could always hit scratch > > page > > instead of causing interrupt. > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > --- > > drivers/gpu/drm/xe/xe_vm.c | 9 +++++---- > > 1 file changed, 5 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c > > index 196d347c6ac0..3346f88f284a 100644 > > --- a/drivers/gpu/drm/xe/xe_vm.c > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > @@ -1732,6 +1732,11 @@ int xe_vm_create_ioctl(struct drm_device *dev, > > void *data, > > if (XE_IOCTL_DBG(xe, args->extensions)) > > return -EINVAL; > > > > + if (XE_IOCTL_DBG(xe, args->flags & > > DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE && > > + args->flags & > > DRM_XE_VM_CREATE_FLAG_FAULT_MODE && > > + !(IS_LUNARLAKE(xe) || IS_BATTLEMAGE(xe) || > > IS_PANTHERLAKE(xe)))) > > + return -EINVAL; > > + > > Could we perhaps add a bit in the xe->info structure? > +1. This seems like a good idea. Matt > /Thomas > > > > if (XE_WA(xe_root_mmio_gt(xe), 14016763929)) > > args->flags |= DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE; > > > > @@ -1745,10 +1750,6 @@ int xe_vm_create_ioctl(struct drm_device *dev, > > void *data, > > if (XE_IOCTL_DBG(xe, args->flags & > > ~ALL_DRM_XE_VM_CREATE_FLAGS)) > > return -EINVAL; > > > > - if (XE_IOCTL_DBG(xe, args->flags & > > DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE && > > - args->flags & > > DRM_XE_VM_CREATE_FLAG_FAULT_MODE)) > > - return -EINVAL; > > - > > if (XE_IOCTL_DBG(xe, !(args->flags & > > DRM_XE_VM_CREATE_FLAG_LR_MODE) && > > args->flags & > > DRM_XE_VM_CREATE_FLAG_FAULT_MODE)) > > return -EINVAL; > ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: [PATCH 1/3] drm/xe: Add a function to zap page table by address range 2025-01-28 22:21 [PATCH 1/3] drm/xe: Add a function to zap page table by address range Oak Zeng 2025-01-28 22:21 ` [PATCH 2/3] drm/xe: Clear scratch page before vm_bind Oak Zeng 2025-01-28 22:21 ` [PATCH 3/3] drm/xe: Allow scratch page under fault mode for certain platform Oak Zeng @ 2025-01-28 23:04 ` Cavitt, Jonathan 2025-01-28 23:12 ` ✓ CI.Patch_applied: success for series starting with [1/3] " Patchwork ` (4 subsequent siblings) 7 siblings, 0 replies; 32+ messages in thread From: Cavitt, Jonathan @ 2025-01-28 23:04 UTC (permalink / raw) To: Zeng, Oak, intel-xe@lists.freedesktop.org Cc: joonas.lahtinen@linux.intel.com, Thomas.Hellstrom@linux.intel.com, Cavitt, Jonathan -----Original Message----- From: Intel-xe <intel-xe-bounces@lists.freedesktop.org> On Behalf Of Oak Zeng Sent: Tuesday, January 28, 2025 2:22 PM To: intel-xe@lists.freedesktop.org Cc: joonas.lahtinen@linux.intel.com; Thomas.Hellstrom@linux.intel.com Subject: [PATCH 1/3] drm/xe: Add a function to zap page table by address range > > Add a function xe_pt_zap_range. This is similar to xe_pt_zap_ptes > but is used when we don't have a vma to work with, such as zap > a range mapped to scratch page where we don't have vma. > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > --- > drivers/gpu/drm/xe/xe_pt.c | 28 ++++++++++++++++++++++++++++ > drivers/gpu/drm/xe/xe_pt.h | 2 ++ > 2 files changed, 30 insertions(+) > > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c > index 1ddcc7e79a93..2363260da6a6 100644 > --- a/drivers/gpu/drm/xe/xe_pt.c > +++ b/drivers/gpu/drm/xe/xe_pt.c > @@ -792,6 +792,34 @@ static const struct xe_pt_walk_ops xe_pt_zap_ptes_ops = { > .pt_entry = xe_pt_zap_ptes_entry, > }; > > +/** > + * xe_pt_zap_range() - Zap (zero) gpu ptes of an virtual address range > + * @tile: The tile we're zapping for. > + * @vm: The vm we're zapping for. > + * @start: Start of the virtual address range, inclusive. > + * @end: End of the virtual address range, exclusive. > + * > + * This is similar to xe_pt_zap_ptes() but it's used when we don't have a > + * vma to work with. This is used for example when we're clearing the scratch > + * page mapping during vm_bind. > + * NIT: The asterisk here can be removed. > + */ > +void xe_pt_zap_range(struct xe_tile *tile, struct xe_vm *vm, u64 start, u64 end) It looks like the function xe_pt_zap_ptes returns if the walk needs to be invalidated. Will that also be needed here? Assuming not: Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> -Jonathan Cavitt > +{ > + struct xe_pt_zap_ptes_walk xe_walk = { > + .base = { > + .ops = &xe_pt_zap_ptes_ops, > + .shifts = xe_normal_pt_shifts, > + .max_level = XE_PT_HIGHEST_LEVEL, > + }, > + .tile = tile, > + }; > + struct xe_pt *pt = vm->pt_root[tile->id]; > + > + (void)xe_pt_walk_shared(&pt->base, pt->level, start, > + end, &xe_walk.base); > +} > + > /** > * xe_pt_zap_ptes() - Zap (zero) gpu ptes of an address range > * @tile: The tile we're zapping for. 
> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h > index 9ab386431cad..b166b324f455 100644 > --- a/drivers/gpu/drm/xe/xe_pt.h > +++ b/drivers/gpu/drm/xe/xe_pt.h > @@ -43,4 +43,6 @@ void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops); > > bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma); > > +void xe_pt_zap_range(struct xe_tile *tile, struct xe_vm *vm, u64 start, u64 end); > + > #endif > -- > 2.26.3 > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* ✓ CI.Patch_applied: success for series starting with [1/3] drm/xe: Add a function to zap page table by address range 2025-01-28 22:21 [PATCH 1/3] drm/xe: Add a function to zap page table by address range Oak Zeng ` (2 preceding siblings ...) 2025-01-28 23:04 ` [PATCH 1/3] drm/xe: Add a function to zap page table by address range Cavitt, Jonathan @ 2025-01-28 23:12 ` Patchwork 2025-01-28 23:12 ` ✓ CI.checkpatch: " Patchwork ` (3 subsequent siblings) 7 siblings, 0 replies; 32+ messages in thread From: Patchwork @ 2025-01-28 23:12 UTC (permalink / raw) To: Oak Zeng; +Cc: intel-xe == Series Details == Series: series starting with [1/3] drm/xe: Add a function to zap page table by address range URL : https://patchwork.freedesktop.org/series/144072/ State : success == Summary == === Applying kernel patches on branch 'drm-tip' with base: === Base commit: 8f3ba847175b drm-tip: 2025y-01m-28d-22h-47m-54s UTC integration manifest === git am output follows === Applying: drm/xe: Add a function to zap page table by address range Applying: drm/xe: Clear scratch page before vm_bind Applying: drm/xe: Allow scratch page under fault mode for certain platform ^ permalink raw reply [flat|nested] 32+ messages in thread
* ✓ CI.checkpatch: success for series starting with [1/3] drm/xe: Add a function to zap page table by address range 2025-01-28 22:21 [PATCH 1/3] drm/xe: Add a function to zap page table by address range Oak Zeng ` (3 preceding siblings ...) 2025-01-28 23:12 ` ✓ CI.Patch_applied: success for series starting with [1/3] " Patchwork @ 2025-01-28 23:12 ` Patchwork 2025-01-28 23:13 ` ✗ CI.KUnit: failure " Patchwork ` (2 subsequent siblings) 7 siblings, 0 replies; 32+ messages in thread From: Patchwork @ 2025-01-28 23:12 UTC (permalink / raw) To: Oak Zeng; +Cc: intel-xe == Series Details == Series: series starting with [1/3] drm/xe: Add a function to zap page table by address range URL : https://patchwork.freedesktop.org/series/144072/ State : success == Summary == + KERNEL=/kernel + git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt Cloning into 'mt'... warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/ + git -C mt rev-list -n1 origin/master 30ab6715fc09baee6cc14cb3c89ad8858688d474 + cd /kernel + git config --global --add safe.directory /kernel + git log -n1 commit cb0f1cdb1d31379f15c7b5e96e66a8c4d48691ee Author: Oak Zeng <oak.zeng@intel.com> Date: Tue Jan 28 17:21:45 2025 -0500 drm/xe: Allow scratch page under fault mode for certain platform Normally scratch page is not allowed when a vm is operate under page fault mode, i.e., in the existing codes, DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE and DRM_XE_VM_CREATE_FLAG_FAULT_MODE are mutual exclusive. The reason is fault mode relies on recoverable page to work, while scratch page can mute recoverable page fault. On xe2 and xe3, HW prefetching can cause page fault interrupt. Due to page fault interrupt overhead (i.e., need Guc and KMD involved to fix the page fault), HW prefetching can be slowed by many orders of magnitude. Fix this problem by allowing scratch page under fault mode for xe2 and xe3. With scratch page in place, HW prefetching could always hit scratch page instead of causing interrupt. Signed-off-by: Oak Zeng <oak.zeng@intel.com> + /mt/dim checkpatch 8f3ba847175bfd495cbfe5b51d3cdfd8d32d24db drm-intel fc3c14dbb3cb drm/xe: Add a function to zap page table by address range 108d849b4e71 drm/xe: Clear scratch page before vm_bind cb0f1cdb1d31 drm/xe: Allow scratch page under fault mode for certain platform ^ permalink raw reply [flat|nested] 32+ messages in thread
* ✗ CI.KUnit: failure for series starting with [1/3] drm/xe: Add a function to zap page table by address range 2025-01-28 22:21 [PATCH 1/3] drm/xe: Add a function to zap page table by address range Oak Zeng ` (4 preceding siblings ...) 2025-01-28 23:12 ` ✓ CI.checkpatch: " Patchwork @ 2025-01-28 23:13 ` Patchwork 2025-01-28 23:38 ` [PATCH 1/3] " Matthew Brost 2025-01-29 8:48 ` Thomas Hellström 7 siblings, 0 replies; 32+ messages in thread From: Patchwork @ 2025-01-28 23:13 UTC (permalink / raw) To: Oak Zeng; +Cc: intel-xe == Series Details == Series: series starting with [1/3] drm/xe: Add a function to zap page table by address range URL : https://patchwork.freedesktop.org/series/144072/ State : failure == Summary == + trap cleanup EXIT + /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig ERROR:root:../lib/iomap.c:156:5: warning: no previous prototype for ‘ioread64_lo_hi’ [-Wmissing-prototypes] 156 | u64 ioread64_lo_hi(const void __iomem *addr) | ^~~~~~~~~~~~~~ ../lib/iomap.c:163:5: warning: no previous prototype for ‘ioread64_hi_lo’ [-Wmissing-prototypes] 163 | u64 ioread64_hi_lo(const void __iomem *addr) | ^~~~~~~~~~~~~~ ../lib/iomap.c:170:5: warning: no previous prototype for ‘ioread64be_lo_hi’ [-Wmissing-prototypes] 170 | u64 ioread64be_lo_hi(const void __iomem *addr) | ^~~~~~~~~~~~~~~~ ../lib/iomap.c:178:5: warning: no previous prototype for ‘ioread64be_hi_lo’ [-Wmissing-prototypes] 178 | u64 ioread64be_hi_lo(const void __iomem *addr) | ^~~~~~~~~~~~~~~~ ../lib/iomap.c:264:6: warning: no previous prototype for ‘iowrite64_lo_hi’ [-Wmissing-prototypes] 264 | void iowrite64_lo_hi(u64 val, void __iomem *addr) | ^~~~~~~~~~~~~~~ ../lib/iomap.c:272:6: warning: no previous prototype for ‘iowrite64_hi_lo’ [-Wmissing-prototypes] 272 | void iowrite64_hi_lo(u64 val, void __iomem *addr) | ^~~~~~~~~~~~~~~ ../lib/iomap.c:280:6: warning: no previous prototype for ‘iowrite64be_lo_hi’ [-Wmissing-prototypes] 280 | void iowrite64be_lo_hi(u64 val, void __iomem *addr) | ^~~~~~~~~~~~~~~~~ ../lib/iomap.c:288:6: warning: no previous prototype for ‘iowrite64be_hi_lo’ [-Wmissing-prototypes] 288 | void iowrite64be_hi_lo(u64 val, void __iomem *addr) | ^~~~~~~~~~~~~~~~~ ../drivers/gpu/drm/xe/xe_vm.c:41:10: fatal error: i915_drv.h: No such file or directory 41 | #include "i915_drv.h" | ^~~~~~~~~~~~ compilation terminated. make[7]: *** [../scripts/Makefile.build:194: drivers/gpu/drm/xe/xe_vm.o] Error 1 make[7]: *** Waiting for unfinished jobs.... make[6]: *** [../scripts/Makefile.build:440: drivers/gpu/drm/xe] Error 2 make[5]: *** [../scripts/Makefile.build:440: drivers/gpu/drm] Error 2 make[4]: *** [../scripts/Makefile.build:440: drivers/gpu] Error 2 make[3]: *** [../scripts/Makefile.build:440: drivers] Error 2 make[2]: *** [/kernel/Makefile:1989: .] Error 2 make[1]: *** [/kernel/Makefile:251: __sub-make] Error 2 make: *** [Makefile:251: __sub-make] Error 2 [23:12:39] Configuring KUnit Kernel ... Generating .config ... Populating config with: $ make ARCH=um O=.kunit olddefconfig [23:12:43] Building KUnit Kernel ... Populating config with: $ make ARCH=um O=.kunit olddefconfig Building with: $ make all compile_commands.json ARCH=um O=.kunit --jobs=48 + cleanup ++ stat -c %u:%g /kernel + chown -R 1003:1003 /kernel ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/3] drm/xe: Add a function to zap page table by address range
  2025-01-28 22:21 [PATCH 1/3] drm/xe: Add a function to zap page table by address range Oak Zeng
                   ` (5 preceding siblings ...)
  2025-01-28 23:13 ` ✗ CI.KUnit: failure " Patchwork
@ 2025-01-28 23:38 ` Matthew Brost
  2025-01-29  8:48 ` Thomas Hellström
  7 siblings, 0 replies; 32+ messages in thread
From: Matthew Brost @ 2025-01-28 23:38 UTC (permalink / raw)
  To: Oak Zeng; +Cc: intel-xe, joonas.lahtinen, Thomas.Hellstrom

On Tue, Jan 28, 2025 at 05:21:43PM -0500, Oak Zeng wrote:
> Add a function xe_pt_zap_range. This is similar to xe_pt_zap_ptes
> but is used when we don't have a vma to work with, such as zap
> a range mapped to scratch page where we don't have vma.
>

See my reply to the following patch. This won't work and isn't needed.

Let me explain why this won't work.

When we set up scratch PTEs on a VM, we create a shell PTE structure down to
the largest page size (1 GB), with a 1 GB entry pointing to a scratch page.
With 48 bits of address space, this is 1 level to 1 GB pages, with 57 bits
this is 2 levels to 1 GB pages.

Now, let's say we perform a 4K bind from 0x0 to 0x1000 with immediate clear
in fault mode. How can we invalidate just the 4K range of memory? We can't.
We need to allocate new memory for page tables so that 0x0000–0x1000 points
to an invalidated PTE, while 0x1000–1 GB points to scratch.

This is why, in my reply to the patch, I indicated that the entire bind
pipeline needs to be updated to properly invalidate the PTEs.

I hope this explanation, along with my reply to the next patch, makes sense.

Matt

> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_pt.c | 28 ++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_pt.h | 2 ++
>  2 files changed, 30 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index 1ddcc7e79a93..2363260da6a6 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -792,6 +792,34 @@ static const struct xe_pt_walk_ops xe_pt_zap_ptes_ops = {
>  	.pt_entry = xe_pt_zap_ptes_entry,
>  };
>
> +/**
> + * xe_pt_zap_range() - Zap (zero) gpu ptes of an virtual address range
> + * @tile: The tile we're zapping for.
> + * @vm: The vm we're zapping for.
> + * @start: Start of the virtual address range, inclusive.
> + * @end: End of the virtual address range, exclusive.
> + *
> + * This is similar to xe_pt_zap_ptes() but it's used when we don't have a
> + * vma to work with. This is used for example when we're clearing the scratch
> + * page mapping during vm_bind.
> + *
> + */
> +void xe_pt_zap_range(struct xe_tile *tile, struct xe_vm *vm, u64 start, u64 end)
> +{
> +	struct xe_pt_zap_ptes_walk xe_walk = {
> +		.base = {
> +			.ops = &xe_pt_zap_ptes_ops,
> +			.shifts = xe_normal_pt_shifts,
> +			.max_level = XE_PT_HIGHEST_LEVEL,
> +		},
> +		.tile = tile,
> +	};
> +	struct xe_pt *pt = vm->pt_root[tile->id];
> +
> +	(void)xe_pt_walk_shared(&pt->base, pt->level, start,
> +				end, &xe_walk.base);
> +}
> +
>  /**
>   * xe_pt_zap_ptes() - Zap (zero) gpu ptes of an address range
>   * @tile: The tile we're zapping for.
> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> index 9ab386431cad..b166b324f455 100644
> --- a/drivers/gpu/drm/xe/xe_pt.h
> +++ b/drivers/gpu/drm/xe/xe_pt.h
> @@ -43,4 +43,6 @@ void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops);
>
>  bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>
> +void xe_pt_zap_range(struct xe_tile *tile, struct xe_vm *vm, u64 start, u64 end);
> +
>  #endif
> --
> 2.26.3
>

^ permalink raw reply	[flat|nested] 32+ messages in thread
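Matt's point is easier to see with a small worked example. The snippet below is purely illustrative and is not xe code; the shift values assume the usual 4-level page-table layout for 48-bit virtual addresses.

/* Illustrative only: why a 4 KiB clear inside a scratch mapping cannot
 * be a simple in-place PTE zap. */
#define PT_SHIFT(level)	(12 + 9 * (level))	/* 4 KiB, 2 MiB, 1 GiB, ... */
#define PT_SIZE(level)	(1ull << PT_SHIFT(level))

/* An in-place zap can only zero entries that already exist, so the
 * range must cover whole entries at the level that is populated. */
static int zap_in_place_possible(unsigned long long start,
				 unsigned long long end,
				 unsigned int populated_level)
{
	unsigned long long entry = PT_SIZE(populated_level);

	return !(start & (entry - 1)) && !(end & (entry - 1));
}

/* Scratch setup populates level 2 (1 GiB entries), so for the 4K bind
 * in the example above:
 *
 *	zap_in_place_possible(0x0, 0x1000, 2) == 0
 *
 * Clearing just [0x0, 0x1000) therefore needs a fresh level-1 table
 * (512 x 2 MiB entries) and level-0 table (512 x 4 KiB entries), with
 * everything outside the range re-pointed at scratch -- a bind-style
 * page-table update, which is what v2 of patch 2/3 below does instead
 * of a standalone zap helper. */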
* Re: [PATCH 1/3] drm/xe: Add a function to zap page table by address range 2025-01-28 22:21 [PATCH 1/3] drm/xe: Add a function to zap page table by address range Oak Zeng ` (6 preceding siblings ...) 2025-01-28 23:38 ` [PATCH 1/3] " Matthew Brost @ 2025-01-29 8:48 ` Thomas Hellström 7 siblings, 0 replies; 32+ messages in thread From: Thomas Hellström @ 2025-01-29 8:48 UTC (permalink / raw) To: Oak Zeng, intel-xe; +Cc: joonas.lahtinen On Tue, 2025-01-28 at 17:21 -0500, Oak Zeng wrote: > Add a function xe_pt_zap_range. This is similar to xe_pt_zap_ptes > but is used when we don't have a vma to work with, such as zap > a range mapped to scratch page where we don't have vma. > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> As I think also Matt mentions, The zap functionality works only when it exactly matches a pre-bound range (like in this case a vma) and there is something cleaning up the zapped ptes afterwards. Typically that's a rebind or an unbind. Thanks, Thomas > --- > drivers/gpu/drm/xe/xe_pt.c | 28 ++++++++++++++++++++++++++++ > drivers/gpu/drm/xe/xe_pt.h | 2 ++ > 2 files changed, 30 insertions(+) > > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c > index 1ddcc7e79a93..2363260da6a6 100644 > --- a/drivers/gpu/drm/xe/xe_pt.c > +++ b/drivers/gpu/drm/xe/xe_pt.c > @@ -792,6 +792,34 @@ static const struct xe_pt_walk_ops > xe_pt_zap_ptes_ops = { > .pt_entry = xe_pt_zap_ptes_entry, > }; > > +/** > + * xe_pt_zap_range() - Zap (zero) gpu ptes of an virtual address > range > + * @tile: The tile we're zapping for. > + * @vm: The vm we're zapping for. > + * @start: Start of the virtual address range, inclusive. > + * @end: End of the virtual address range, exclusive. > + * > + * This is similar to xe_pt_zap_ptes() but it's used when we don't > have a > + * vma to work with. This is used for example when we're clearing > the scratch > + * page mapping during vm_bind. > + * > + */ > +void xe_pt_zap_range(struct xe_tile *tile, struct xe_vm *vm, u64 > start, u64 end) > +{ > + struct xe_pt_zap_ptes_walk xe_walk = { > + .base = { > + .ops = &xe_pt_zap_ptes_ops, > + .shifts = xe_normal_pt_shifts, > + .max_level = XE_PT_HIGHEST_LEVEL, > + }, > + .tile = tile, > + }; > + struct xe_pt *pt = vm->pt_root[tile->id]; > + > + (void)xe_pt_walk_shared(&pt->base, pt->level, start, > + end, &xe_walk.base); > +} > + > /** > * xe_pt_zap_ptes() - Zap (zero) gpu ptes of an address range > * @tile: The tile we're zapping for. > diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h > index 9ab386431cad..b166b324f455 100644 > --- a/drivers/gpu/drm/xe/xe_pt.h > +++ b/drivers/gpu/drm/xe/xe_pt.h > @@ -43,4 +43,6 @@ void xe_pt_update_ops_abort(struct xe_tile *tile, > struct xe_vma_ops *vops); > > bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma); > > +void xe_pt_zap_range(struct xe_tile *tile, struct xe_vm *vm, u64 > start, u64 end); > + > #endif ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 1/3] drm/xe: Introduced needs_scratch bit in device descriptor
@ 2025-02-04 18:45 Oak Zeng
  2025-02-04 18:45 ` [PATCH 2/3] drm/xe: Clear scratch page before vm_bind Oak Zeng
  0 siblings, 1 reply; 32+ messages in thread
From: Oak Zeng @ 2025-02-04 18:45 UTC (permalink / raw)
  To: intel-xe; +Cc: Thomas.Hellstrom, matthew.brost, jonathan.cavitt

On some platforms, a scratch page is needed for out-of-bound prefetch
to work. Introduce a bit in the device descriptor to specify whether a
device needs a scratch page to work.

v2: introduce a needs_scratch bit in device info (Thomas, Jonathan)

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_device_types.h | 3 +++
 drivers/gpu/drm/xe/xe_pci.c          | 5 +++++
 2 files changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index c0e886bac183..1197ab31e528 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -45,6 +45,7 @@ struct xe_pxp;
 #define GRAPHICS_VERx100(xe) ((xe)->info.graphics_verx100)
 #define MEDIA_VERx100(xe) ((xe)->info.media_verx100)
 #define IS_DGFX(xe) ((xe)->info.is_dgfx)
+#define NEEDS_SCRATCH(xe) ((xe)->info.needs_scratch)

 #define XE_VRAM_FLAGS_NEED64K BIT(0)

@@ -318,6 +319,8 @@ struct xe_device {
 		u8 has_usm:1;
 		/** @info.is_dgfx: is discrete device */
 		u8 is_dgfx:1;
+		/** @info.needs_scratch: needs scratch page for oob prefetch to work */
+		u8 needs_scratch:1;
 		/**
 		 * @info.probe_display: Probe display hardware. If set to
 		 * false, the driver will behave as if there is no display
diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
index e8ef7d6b4db8..09736674a77e 100644
--- a/drivers/gpu/drm/xe/xe_pci.c
+++ b/drivers/gpu/drm/xe/xe_pci.c
@@ -67,6 +67,7 @@ struct xe_device_desc {
 	u8 has_llc:1;
 	u8 has_pxp:1;
 	u8 has_sriov:1;
+	u8 needs_scratch:1;
 	u8 skip_guc_pc:1;
 	u8 skip_mtcfg:1;
 	u8 skip_pcode:1;
@@ -353,6 +354,7 @@ static const struct xe_device_desc lnl_desc = {
 	.dma_mask_size = 46,
 	.has_display = true,
 	.has_pxp = true,
+	.needs_scratch = true,
 };

 static const struct xe_device_desc bmg_desc = {
@@ -361,6 +363,7 @@ static const struct xe_device_desc bmg_desc = {
 	.dma_mask_size = 46,
 	.has_display = true,
 	.has_heci_cscfi = 1,
+	.needs_scratch = true,
 };

 static const struct xe_device_desc ptl_desc = {
@@ -368,6 +371,7 @@ static const struct xe_device_desc ptl_desc = {
 	.dma_mask_size = 46,
 	.has_display = true,
 	.require_force_probe = true,
+	.needs_scratch = true,
 };

 #undef PLATFORM
@@ -643,6 +647,7 @@ static int xe_info_init_early(struct xe_device *xe,
 	xe->info.skip_guc_pc = desc->skip_guc_pc;
 	xe->info.skip_mtcfg = desc->skip_mtcfg;
 	xe->info.skip_pcode = desc->skip_pcode;
+	xe->info.needs_scratch = desc->needs_scratch;

 	xe->info.probe_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
 				 xe_modparam.probe_display &&
--
2.26.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread
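For context, the v1 check reached for IS_LUNARLAKE()/IS_BATTLEMAGE()/IS_PANTHERLAKE(), which come from i915_drv.h, the include that broke the KUnit build earlier in the thread. With the bit introduced above, the check can stay entirely inside xe. The v2 of patch 3/3 is not quoted in this part of the thread, so the fragment below is only a sketch of how the uapi check might consume the new bit, not the actual patch:

	/* In xe_vm_create_ioctl(), sketch only: allow scratch pages
	 * together with fault mode solely on devices that need them. */
	if (XE_IOCTL_DBG(xe, args->flags & DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE &&
			 args->flags & DRM_XE_VM_CREATE_FLAG_FAULT_MODE &&
			 !NEEDS_SCRATCH(xe)))
		return -EINVAL;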
* [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-02-04 18:45 [PATCH 1/3] drm/xe: Introduced needs_scratch bit in device descriptor Oak Zeng @ 2025-02-04 18:45 ` Oak Zeng 2025-02-05 14:35 ` Thomas Hellström 2025-02-06 10:34 ` Matthew Brost 0 siblings, 2 replies; 32+ messages in thread From: Oak Zeng @ 2025-02-04 18:45 UTC (permalink / raw) To: intel-xe; +Cc: Thomas.Hellstrom, matthew.brost, jonathan.cavitt When a vm runs under fault mode, if scratch page is enabled, we need to clear the scratch page mapping before vm_bind for the vm_bind address range. Under fault mode, we depend on recoverable page fault to establish mapping in page table. If scratch page is not cleared, GPU access of address won't cause page fault because it always hits the existing scratch page mapping. When vm_bind with IMMEDIATE flag, there is no need of clearing as immediate bind can overwrite the scratch page mapping. So far only is xe2 and xe3 products are allowed to enable scratch page under fault mode. On other platform we don't allow scratch page under fault mode, so no need of such clearing. v2: Rework vm_bind pipeline to clear scratch page mapping. This is similar to a map operation, with the exception that PTEs are cleared instead of pointing to valid physical pages. (Matt, Thomas) TLB invalidation is needed after clear scratch page mapping as larger scratch page mapping could be backed by physical page and cached in TLB. (Matt, Thomas) Signed-off-by: Oak Zeng <oak.zeng@intel.com> --- drivers/gpu/drm/xe/xe_pt.c | 66 ++++++++++++++++++++++---------- drivers/gpu/drm/xe/xe_pt_types.h | 2 + drivers/gpu/drm/xe/xe_vm.c | 29 ++++++++++++-- drivers/gpu/drm/xe/xe_vm_types.h | 2 + 4 files changed, 75 insertions(+), 24 deletions(-) diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c index 1ddcc7e79a93..3fd0ae2dbe7d 100644 --- a/drivers/gpu/drm/xe/xe_pt.c +++ b/drivers/gpu/drm/xe/xe_pt.c @@ -268,6 +268,8 @@ struct xe_pt_stage_bind_walk { * granularity. */ bool needs_64K; + /* @clear_pt: clear page table entries during the bind walk */ + bool clear_pt; /** * @vma: VMA being mapped */ @@ -497,21 +499,25 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset, XE_WARN_ON(xe_walk->va_curs_start != addr); - pte = vm->pt_ops->pte_encode_vma(is_null ? 0 : - xe_res_dma(curs) + xe_walk->dma_offset, - xe_walk->vma, pat_index, level); - pte |= xe_walk->default_pte; + if (xe_walk->clear_pt) { + pte = 0; + } else { + pte = vm->pt_ops->pte_encode_vma(is_null ? 0 : + xe_res_dma(curs) + xe_walk->dma_offset, + xe_walk->vma, pat_index, level); + pte |= xe_walk->default_pte; - /* - * Set the XE_PTE_PS64 hint if possible, otherwise if - * this device *requires* 64K PTE size for VRAM, fail. - */ - if (level == 0 && !xe_parent->is_compact) { - if (xe_pt_is_pte_ps64K(addr, next, xe_walk)) { - xe_walk->vma->gpuva.flags |= XE_VMA_PTE_64K; - pte |= XE_PTE_PS64; - } else if (XE_WARN_ON(xe_walk->needs_64K)) { - return -EINVAL; + /* + * Set the XE_PTE_PS64 hint if possible, otherwise if + * this device *requires* 64K PTE size for VRAM, fail. 
+ */ + if (level == 0 && !xe_parent->is_compact) { + if (xe_pt_is_pte_ps64K(addr, next, xe_walk)) { + xe_walk->vma->gpuva.flags |= XE_VMA_PTE_64K; + pte |= XE_PTE_PS64; + } else if (XE_WARN_ON(xe_walk->needs_64K)) { + return -EINVAL; + } } } @@ -519,7 +525,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset, if (unlikely(ret)) return ret; - if (!is_null) + if (!is_null && !xe_walk->clear_pt) xe_res_next(curs, next - addr); xe_walk->va_curs_start = next; xe_walk->vma->gpuva.flags |= (XE_VMA_PTE_4K << level); @@ -589,6 +595,7 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = { * @vma: The vma indicating the address range. * @entries: Storage for the update entries used for connecting the tree to * the main tree at commit time. + * @clear_pt: Clear the page table entries. * @num_entries: On output contains the number of @entries used. * * This function builds a disconnected page-table tree for a given address @@ -602,7 +609,8 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = { */ static int xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, - struct xe_vm_pgtable_update *entries, u32 *num_entries) + struct xe_vm_pgtable_update *entries, + bool clear_pt, u32 *num_entries) { struct xe_device *xe = tile_to_xe(tile); struct xe_bo *bo = xe_vma_bo(vma); @@ -622,10 +630,19 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, .vma = vma, .wupd.entries = entries, .needs_64K = (xe_vma_vm(vma)->flags & XE_VM_FLAG_64K) && is_devmem, + .clear_pt = clear_pt, }; struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id]; int ret; + if (clear_pt) { + ret = xe_pt_walk_range(&pt->base, pt->level, xe_vma_start(vma), + xe_vma_end(vma), &xe_walk.base); + + *num_entries = xe_walk.wupd.num_used_entries; + return ret; + } + /** * Default atomic expectations for different allocation scenarios are as follows: * @@ -981,12 +998,14 @@ static void xe_pt_free_bind(struct xe_vm_pgtable_update *entries, static int xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma, - struct xe_vm_pgtable_update *entries, u32 *num_entries) + struct xe_vm_pgtable_update *entries, + bool invalidate_on_bind, u32 *num_entries) { int err; *num_entries = 0; - err = xe_pt_stage_bind(tile, vma, entries, num_entries); + err = xe_pt_stage_bind(tile, vma, entries, invalidate_on_bind, + num_entries); if (!err) xe_tile_assert(tile, *num_entries); @@ -1661,6 +1680,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile, return err; err = xe_pt_prepare_bind(tile, vma, pt_op->entries, + pt_update_ops->invalidate_on_bind, &pt_op->num_entries); if (!err) { xe_tile_assert(tile, pt_op->num_entries <= @@ -1685,7 +1705,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile, * it needs to be done here. 
*/ if ((!pt_op->rebind && xe_vm_has_scratch(vm) && - xe_vm_in_preempt_fence_mode(vm))) + xe_vm_in_preempt_fence_mode(vm)) || pt_update_ops->invalidate_on_bind) pt_update_ops->needs_invalidation = true; else if (pt_op->rebind && !xe_vm_in_lr_mode(vm)) /* We bump also if batch_invalidate_tlb is true */ @@ -1759,9 +1779,13 @@ static int op_prepare(struct xe_vm *vm, switch (op->base.op) { case DRM_GPUVA_OP_MAP: - if (!op->map.immediate && xe_vm_in_fault_mode(vm)) + if (!op->map.immediate && xe_vm_in_fault_mode(vm) && + !op->map.invalidate_on_bind) break; + if (op->map.invalidate_on_bind) + pt_update_ops->invalidate_on_bind = true; + err = bind_op_prepare(vm, tile, pt_update_ops, op->map.vma); pt_update_ops->wait_vm_kernel = true; break; @@ -1871,6 +1895,8 @@ static void bind_op_commit(struct xe_vm *vm, struct xe_tile *tile, } vma->tile_present |= BIT(tile->id); vma->tile_staged &= ~BIT(tile->id); + if (pt_update_ops->invalidate_on_bind) + vma->tile_invalidated |= BIT(tile->id); if (xe_vma_is_userptr(vma)) { lockdep_assert_held_read(&vm->userptr.notifier_lock); to_userptr_vma(vma)->userptr.initial_bind = true; diff --git a/drivers/gpu/drm/xe/xe_pt_types.h b/drivers/gpu/drm/xe/xe_pt_types.h index 384cc04de719..3d0aa2a5102e 100644 --- a/drivers/gpu/drm/xe/xe_pt_types.h +++ b/drivers/gpu/drm/xe/xe_pt_types.h @@ -108,6 +108,8 @@ struct xe_vm_pgtable_update_ops { bool needs_userptr_lock; /** @needs_invalidation: Needs invalidation */ bool needs_invalidation; + /** @invalidate_on_bind: Invalidate the range before bind */ + bool invalidate_on_bind; /** * @wait_vm_bookkeep: PT operations need to wait until VM is idle * (bookkeep dma-resv slots are idle) and stage all future VM activity diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c index d664f2e418b2..813d893d9b63 100644 --- a/drivers/gpu/drm/xe/xe_vm.c +++ b/drivers/gpu/drm/xe/xe_vm.c @@ -1921,6 +1921,23 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op) } #endif +static bool __xe_vm_needs_clear_scratch_pages(struct xe_vm *vm, u32 bind_flags) +{ + if (!xe_vm_in_fault_mode(vm)) + return false; + + if (!NEEDS_SCRATCH(vm->xe)) + return false; + + if (!xe_vm_has_scratch(vm)) + return false; + + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) + return false; + + return true; +} + /* * Create operations list from IOCTL arguments, setup operations fields so parse * and commit steps are decoupled from IOCTL arguments. This step can fail. 
@@ -1991,6 +2008,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo, op->map.is_null = flags & DRM_XE_VM_BIND_FLAG_NULL; op->map.dumpable = flags & DRM_XE_VM_BIND_FLAG_DUMPABLE; op->map.pat_index = pat_index; + op->map.invalidate_on_bind = + __xe_vm_needs_clear_scratch_pages(vm, flags); } else if (__op->op == DRM_GPUVA_OP_PREFETCH) { op->prefetch.region = prefetch_region; } @@ -2188,7 +2207,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops, return PTR_ERR(vma); op->map.vma = vma; - if (op->map.immediate || !xe_vm_in_fault_mode(vm)) + if (op->map.immediate || !xe_vm_in_fault_mode(vm) || + op->map.invalidate_on_bind) xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask); break; @@ -2416,9 +2436,10 @@ static int op_lock_and_prep(struct drm_exec *exec, struct xe_vm *vm, switch (op->base.op) { case DRM_GPUVA_OP_MAP: - err = vma_lock_and_validate(exec, op->map.vma, - !xe_vm_in_fault_mode(vm) || - op->map.immediate); + if (!op->map.invalidate_on_bind) + err = vma_lock_and_validate(exec, op->map.vma, + !xe_vm_in_fault_mode(vm) || + op->map.immediate); break; case DRM_GPUVA_OP_REMAP: err = check_ufence(gpuva_to_vma(op->base.remap.unmap->va)); diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h index 52467b9b5348..dace04f4ea5e 100644 --- a/drivers/gpu/drm/xe/xe_vm_types.h +++ b/drivers/gpu/drm/xe/xe_vm_types.h @@ -297,6 +297,8 @@ struct xe_vma_op_map { bool is_null; /** @dumpable: whether BO is dumped on GPU hang */ bool dumpable; + /** @invalidate: invalidate the VMA before bind */ + bool invalidate_on_bind; /** @pat_index: The pat index to use for this operation. */ u16 pat_index; }; -- 2.26.3 ^ permalink raw reply related [flat|nested] 32+ messages in thread
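From the userspace side, the case this patch changes is a plain (non-IMMEDIATE) map on a faulting VM that has scratch enabled. The sketch below is written from memory of the xe uAPI, and vm_id, bo_handle and pat_index are assumed to come from earlier setup, so treat the field names as assumptions rather than a verified example.

	/* Sketch only: bind without DRM_XE_VM_BIND_FLAG_IMMEDIATE.  With
	 * this patch the covered range has its scratch PTEs cleared (and
	 * the TLB invalidated) at bind time, so the first GPU access
	 * takes a recoverable fault instead of silently hitting scratch. */
	struct drm_xe_vm_bind bind = {
		.vm_id = vm_id,			/* fault-mode VM with scratch */
		.num_binds = 1,
		.bind = {
			.obj = bo_handle,
			.obj_offset = 0,
			.range = 0x10000,
			.addr = 0x1a00000,
			.op = DRM_XE_VM_BIND_OP_MAP,
			.flags = 0,		/* no IMMEDIATE flag */
			.pat_index = pat_index,
		},
	};

	ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);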
* Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-02-04 18:45 ` [PATCH 2/3] drm/xe: Clear scratch page before vm_bind Oak Zeng @ 2025-02-05 14:35 ` Thomas Hellström 2025-02-06 1:52 ` Zeng, Oak 2025-02-06 10:34 ` Matthew Brost 1 sibling, 1 reply; 32+ messages in thread From: Thomas Hellström @ 2025-02-05 14:35 UTC (permalink / raw) To: Oak Zeng, intel-xe; +Cc: matthew.brost, jonathan.cavitt On Tue, 2025-02-04 at 13:45 -0500, Oak Zeng wrote: > When a vm runs under fault mode, if scratch page is enabled, we need > to clear the scratch page mapping before vm_bind for the vm_bind > address range. Under fault mode, we depend on recoverable page fault > to establish mapping in page table. If scratch page is not cleared, > GPU access of address won't cause page fault because it always hits > the existing scratch page mapping. I think we need to ephasize that we clear AT vm_bind time, not before, both in the commit description and the patch title. > > When vm_bind with IMMEDIATE flag, there is no need of clearing as > immediate bind can overwrite the scratch page mapping. > > So far only is xe2 and xe3 products are allowed to enable scratch > page > under fault mode. On other platform we don't allow scratch page under > fault mode, so no need of such clearing. > > v2: Rework vm_bind pipeline to clear scratch page mapping. This is > similar > to a map operation, with the exception that PTEs are cleared instead > of > pointing to valid physical pages. (Matt, Thomas) > > TLB invalidation is needed after clear scratch page mapping as larger > scratch page mapping could be backed by physical page and cached in > TLB. (Matt, Thomas) > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > --- > drivers/gpu/drm/xe/xe_pt.c | 66 ++++++++++++++++++++++-------- > -- > drivers/gpu/drm/xe/xe_pt_types.h | 2 + > drivers/gpu/drm/xe/xe_vm.c | 29 ++++++++++++-- > drivers/gpu/drm/xe/xe_vm_types.h | 2 + > 4 files changed, 75 insertions(+), 24 deletions(-) > > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c > index 1ddcc7e79a93..3fd0ae2dbe7d 100644 > --- a/drivers/gpu/drm/xe/xe_pt.c > +++ b/drivers/gpu/drm/xe/xe_pt.c > @@ -268,6 +268,8 @@ struct xe_pt_stage_bind_walk { > * granularity. > */ > bool needs_64K; > + /* @clear_pt: clear page table entries during the bind walk > */ > + bool clear_pt; > /** > * @vma: VMA being mapped > */ > @@ -497,21 +499,25 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, > pgoff_t offset, > > XE_WARN_ON(xe_walk->va_curs_start != addr); > > - pte = vm->pt_ops->pte_encode_vma(is_null ? 0 : > - xe_res_dma(curs) + > xe_walk->dma_offset, > - xe_walk->vma, > pat_index, level); > - pte |= xe_walk->default_pte; > + if (xe_walk->clear_pt) { > + pte = 0; > + } else { > + pte = vm->pt_ops->pte_encode_vma(is_null ? 0 > : > + xe_res_dma(curs) + xe_walk- > >dma_offset, > + xe_walk->vma, pat_index, > level); > + pte |= xe_walk->default_pte; > > - /* > - * Set the XE_PTE_PS64 hint if possible, otherwise > if > - * this device *requires* 64K PTE size for VRAM, > fail. > - */ > - if (level == 0 && !xe_parent->is_compact) { > - if (xe_pt_is_pte_ps64K(addr, next, xe_walk)) > { > - xe_walk->vma->gpuva.flags |= > XE_VMA_PTE_64K; > - pte |= XE_PTE_PS64; > - } else if (XE_WARN_ON(xe_walk->needs_64K)) { > - return -EINVAL; > + /* > + * Set the XE_PTE_PS64 hint if possible, > otherwise if > + * this device *requires* 64K PTE size for > VRAM, fail. 
> + */ > + if (level == 0 && !xe_parent->is_compact) { > + if (xe_pt_is_pte_ps64K(addr, next, > xe_walk)) { > + xe_walk->vma->gpuva.flags |= > XE_VMA_PTE_64K; > + pte |= XE_PTE_PS64; > + } else if (XE_WARN_ON(xe_walk- > >needs_64K)) { > + return -EINVAL; > + } > } > } > I think there might be some things missing here. The bind should look exactly as a is_null bind, with the exception that the pte should be 0, so might be able to benefit from is_null, for example xe_pg_hugepte_possible() short-circuit the dma-address check. With clear_pt we don't want to use the XE_PTE_PS64K, though, like you identified above. Will get back with a look at the invalidation. > @@ -519,7 +525,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, > pgoff_t offset, > if (unlikely(ret)) > return ret; > > - if (!is_null) > + if (!is_null && !xe_walk->clear_pt) > xe_res_next(curs, next - addr); > xe_walk->va_curs_start = next; > xe_walk->vma->gpuva.flags |= (XE_VMA_PTE_4K << > level); > @@ -589,6 +595,7 @@ static const struct xe_pt_walk_ops > xe_pt_stage_bind_ops = { > * @vma: The vma indicating the address range. > * @entries: Storage for the update entries used for connecting the > tree to > * the main tree at commit time. > + * @clear_pt: Clear the page table entries. > * @num_entries: On output contains the number of @entries used. > * > * This function builds a disconnected page-table tree for a given > address > @@ -602,7 +609,8 @@ static const struct xe_pt_walk_ops > xe_pt_stage_bind_ops = { > */ > static int > xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, > - struct xe_vm_pgtable_update *entries, u32 > *num_entries) > + struct xe_vm_pgtable_update *entries, > + bool clear_pt, u32 *num_entries) > { > struct xe_device *xe = tile_to_xe(tile); > struct xe_bo *bo = xe_vma_bo(vma); > @@ -622,10 +630,19 @@ xe_pt_stage_bind(struct xe_tile *tile, struct > xe_vma *vma, > .vma = vma, > .wupd.entries = entries, > .needs_64K = (xe_vma_vm(vma)->flags & > XE_VM_FLAG_64K) && is_devmem, > + .clear_pt = clear_pt, > }; > struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id]; > int ret; > > + if (clear_pt) { > + ret = xe_pt_walk_range(&pt->base, pt->level, > xe_vma_start(vma), > + xe_vma_end(vma), > &xe_walk.base); > + > + *num_entries = xe_walk.wupd.num_used_entries; > + return ret; > + } > + > /** > * Default atomic expectations for different allocation > scenarios are as follows: > * > @@ -981,12 +998,14 @@ static void xe_pt_free_bind(struct > xe_vm_pgtable_update *entries, > > static int > xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma, > - struct xe_vm_pgtable_update *entries, u32 > *num_entries) > + struct xe_vm_pgtable_update *entries, > + bool invalidate_on_bind, u32 *num_entries) > { > int err; > > *num_entries = 0; > - err = xe_pt_stage_bind(tile, vma, entries, num_entries); > + err = xe_pt_stage_bind(tile, vma, entries, > invalidate_on_bind, > + num_entries); > if (!err) > xe_tile_assert(tile, *num_entries); > > @@ -1661,6 +1680,7 @@ static int bind_op_prepare(struct xe_vm *vm, > struct xe_tile *tile, > return err; > > err = xe_pt_prepare_bind(tile, vma, pt_op->entries, > + pt_update_ops->invalidate_on_bind, > &pt_op->num_entries); > if (!err) { > xe_tile_assert(tile, pt_op->num_entries <= > @@ -1685,7 +1705,7 @@ static int bind_op_prepare(struct xe_vm *vm, > struct xe_tile *tile, > * it needs to be done here. 
> */ > if ((!pt_op->rebind && xe_vm_has_scratch(vm) && > - xe_vm_in_preempt_fence_mode(vm))) > + xe_vm_in_preempt_fence_mode(vm)) || > pt_update_ops->invalidate_on_bind) > pt_update_ops->needs_invalidation = true; > else if (pt_op->rebind && !xe_vm_in_lr_mode(vm)) > /* We bump also if batch_invalidate_tlb is > true */ > @@ -1759,9 +1779,13 @@ static int op_prepare(struct xe_vm *vm, > > switch (op->base.op) { > case DRM_GPUVA_OP_MAP: > - if (!op->map.immediate && xe_vm_in_fault_mode(vm)) > + if (!op->map.immediate && xe_vm_in_fault_mode(vm) && > + !op->map.invalidate_on_bind) > break; > > + if (op->map.invalidate_on_bind) > + pt_update_ops->invalidate_on_bind = true; > + > err = bind_op_prepare(vm, tile, pt_update_ops, op- > >map.vma); > pt_update_ops->wait_vm_kernel = true; > break; > @@ -1871,6 +1895,8 @@ static void bind_op_commit(struct xe_vm *vm, > struct xe_tile *tile, > } > vma->tile_present |= BIT(tile->id); > vma->tile_staged &= ~BIT(tile->id); > + if (pt_update_ops->invalidate_on_bind) > + vma->tile_invalidated |= BIT(tile->id); > if (xe_vma_is_userptr(vma)) { > lockdep_assert_held_read(&vm- > >userptr.notifier_lock); > to_userptr_vma(vma)->userptr.initial_bind = true; > diff --git a/drivers/gpu/drm/xe/xe_pt_types.h > b/drivers/gpu/drm/xe/xe_pt_types.h > index 384cc04de719..3d0aa2a5102e 100644 > --- a/drivers/gpu/drm/xe/xe_pt_types.h > +++ b/drivers/gpu/drm/xe/xe_pt_types.h > @@ -108,6 +108,8 @@ struct xe_vm_pgtable_update_ops { > bool needs_userptr_lock; > /** @needs_invalidation: Needs invalidation */ > bool needs_invalidation; > + /** @invalidate_on_bind: Invalidate the range before bind */ > + bool invalidate_on_bind; > /** > * @wait_vm_bookkeep: PT operations need to wait until VM is > idle > * (bookkeep dma-resv slots are idle) and stage all future > VM activity > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c > index d664f2e418b2..813d893d9b63 100644 > --- a/drivers/gpu/drm/xe/xe_vm.c > +++ b/drivers/gpu/drm/xe/xe_vm.c > @@ -1921,6 +1921,23 @@ static void print_op(struct xe_device *xe, > struct drm_gpuva_op *op) > } > #endif > > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_vm *vm, u32 > bind_flags) > +{ > + if (!xe_vm_in_fault_mode(vm)) > + return false; > + > + if (!NEEDS_SCRATCH(vm->xe)) > + return false; > + > + if (!xe_vm_has_scratch(vm)) > + return false; > + > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > + return false; > + > + return true; > +} > + > /* > * Create operations list from IOCTL arguments, setup operations > fields so parse > * and commit steps are decoupled from IOCTL arguments. This step > can fail. 
> @@ -1991,6 +2008,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, > struct xe_bo *bo, > op->map.is_null = flags & > DRM_XE_VM_BIND_FLAG_NULL; > op->map.dumpable = flags & > DRM_XE_VM_BIND_FLAG_DUMPABLE; > op->map.pat_index = pat_index; > + op->map.invalidate_on_bind = > + __xe_vm_needs_clear_scratch_pages(vm > , flags); > } else if (__op->op == DRM_GPUVA_OP_PREFETCH) { > op->prefetch.region = prefetch_region; > } > @@ -2188,7 +2207,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm > *vm, struct drm_gpuva_ops *ops, > return PTR_ERR(vma); > > op->map.vma = vma; > - if (op->map.immediate || > !xe_vm_in_fault_mode(vm)) > + if (op->map.immediate || > !xe_vm_in_fault_mode(vm) || > + op->map.invalidate_on_bind) > xe_vma_ops_incr_pt_update_ops(vops, > op- > >tile_mask); > break; > @@ -2416,9 +2436,10 @@ static int op_lock_and_prep(struct drm_exec > *exec, struct xe_vm *vm, > > switch (op->base.op) { > case DRM_GPUVA_OP_MAP: > - err = vma_lock_and_validate(exec, op->map.vma, > - !xe_vm_in_fault_mode(vm) > || > - op->map.immediate); > + if (!op->map.invalidate_on_bind) > + err = vma_lock_and_validate(exec, op- > >map.vma, > + > !xe_vm_in_fault_mode(vm) || > + op- > >map.immediate); > break; > case DRM_GPUVA_OP_REMAP: > err = check_ufence(gpuva_to_vma(op- > >base.remap.unmap->va)); > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h > b/drivers/gpu/drm/xe/xe_vm_types.h > index 52467b9b5348..dace04f4ea5e 100644 > --- a/drivers/gpu/drm/xe/xe_vm_types.h > +++ b/drivers/gpu/drm/xe/xe_vm_types.h > @@ -297,6 +297,8 @@ struct xe_vma_op_map { > bool is_null; > /** @dumpable: whether BO is dumped on GPU hang */ > bool dumpable; > + /** @invalidate: invalidate the VMA before bind */ > + bool invalidate_on_bind; > /** @pat_index: The pat index to use for this operation. */ > u16 pat_index; > }; ^ permalink raw reply [flat|nested] 32+ messages in thread
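Thomas's suggestion above, letting a clearing bind take the same short-cuts as an is_null bind in the huge-PTE check, could look roughly like the fragment below. This is a sketch with a placeholder helper name; only xe_vma_is_null(), IS_ALIGNED() and the new clear_pt field are taken from the thread, and it is not the follow-up patch itself.

/* Sketch only: simplified stand-in for the huge-PTE check in the bind
 * walk.  The early-out is the point: cleared PTEs, like null ones,
 * have no backing pages, so there is no DMA address whose alignment
 * or contiguity needs validating. */
static bool hugepte_possible(u64 addr, u64 next, unsigned int level,
			     struct xe_pt_stage_bind_walk *xe_walk)
{
	u64 size = 1ull << (12 + 9 * level);	/* 4 KiB / 2 MiB / 1 GiB */

	/* The VA range must cover exactly one whole entry at @level. */
	if (next - addr != size || !IS_ALIGNED(addr, size))
		return false;

	/* Null and cleared entries carry no backing memory. */
	if (xe_vma_is_null(xe_walk->vma) || xe_walk->clear_pt)
		return true;

	/* Placeholder for the real resource-cursor checks: the DMA
	 * address must be huge-PTE aligned and the segment contiguous. */
	return dma_range_aligned_and_contiguous(xe_walk, addr, next, size);
}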
* RE: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-02-05 14:35 ` Thomas Hellström @ 2025-02-06 1:52 ` Zeng, Oak 2025-02-06 12:51 ` Thomas Hellström 0 siblings, 1 reply; 32+ messages in thread From: Zeng, Oak @ 2025-02-06 1:52 UTC (permalink / raw) To: Thomas Hellström, intel-xe@lists.freedesktop.org Cc: Brost, Matthew, Cavitt, Jonathan > -----Original Message----- > From: Thomas Hellström <thomas.hellstrom@linux.intel.com> > Sent: February 5, 2025 9:36 AM > To: Zeng, Oak <oak.zeng@intel.com>; intel-xe@lists.freedesktop.org > Cc: Brost, Matthew <matthew.brost@intel.com>; Cavitt, Jonathan > <jonathan.cavitt@intel.com> > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind > > On Tue, 2025-02-04 at 13:45 -0500, Oak Zeng wrote: > > When a vm runs under fault mode, if scratch page is enabled, we > need > > to clear the scratch page mapping before vm_bind for the vm_bind > > address range. Under fault mode, we depend on recoverable page > fault > > to establish mapping in page table. If scratch page is not cleared, > > GPU access of address won't cause page fault because it always hits > > the existing scratch page mapping. > > I think we need to ephasize that we clear AT vm_bind time, not > before, > both in the commit description and the patch title. Will fix. > > > > > When vm_bind with IMMEDIATE flag, there is no need of clearing as > > immediate bind can overwrite the scratch page mapping. > > > > So far only is xe2 and xe3 products are allowed to enable scratch > > page > > under fault mode. On other platform we don't allow scratch page > under > > fault mode, so no need of such clearing. > > > > v2: Rework vm_bind pipeline to clear scratch page mapping. This is > > similar > > to a map operation, with the exception that PTEs are cleared > instead > > of > > pointing to valid physical pages. (Matt, Thomas) > > > > TLB invalidation is needed after clear scratch page mapping as larger > > scratch page mapping could be backed by physical page and cached > in > > TLB. (Matt, Thomas) > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > --- > > drivers/gpu/drm/xe/xe_pt.c | 66 ++++++++++++++++++++++-- > ------ > > -- > > drivers/gpu/drm/xe/xe_pt_types.h | 2 + > > drivers/gpu/drm/xe/xe_vm.c | 29 ++++++++++++-- > > drivers/gpu/drm/xe/xe_vm_types.h | 2 + > > 4 files changed, 75 insertions(+), 24 deletions(-) > > > > diff --git a/drivers/gpu/drm/xe/xe_pt.c > b/drivers/gpu/drm/xe/xe_pt.c > > index 1ddcc7e79a93..3fd0ae2dbe7d 100644 > > --- a/drivers/gpu/drm/xe/xe_pt.c > > +++ b/drivers/gpu/drm/xe/xe_pt.c > > @@ -268,6 +268,8 @@ struct xe_pt_stage_bind_walk { > > * granularity. > > */ > > bool needs_64K; > > + /* @clear_pt: clear page table entries during the bind walk > > */ > > + bool clear_pt; > > /** > > * @vma: VMA being mapped > > */ > > @@ -497,21 +499,25 @@ xe_pt_stage_bind_entry(struct xe_ptw > *parent, > > pgoff_t offset, > > > > XE_WARN_ON(xe_walk->va_curs_start != addr); > > > > - pte = vm->pt_ops->pte_encode_vma(is_null ? 0 : > > - xe_res_dma(curs) + > > xe_walk->dma_offset, > > - xe_walk->vma, > > pat_index, level); > > - pte |= xe_walk->default_pte; > > + if (xe_walk->clear_pt) { > > + pte = 0; > > + } else { > > + pte = vm->pt_ops->pte_encode_vma(is_null ? > 0 > > : > > + xe_res_dma(curs) + xe_walk- > > >dma_offset, > > + xe_walk->vma, pat_index, > > level); > > + pte |= xe_walk->default_pte; > > > > - /* > > - * Set the XE_PTE_PS64 hint if possible, otherwise > > if > > - * this device *requires* 64K PTE size for VRAM, > > fail. 
> > - */ > > - if (level == 0 && !xe_parent->is_compact) { > > - if (xe_pt_is_pte_ps64K(addr, next, xe_walk)) > > { > > - xe_walk->vma->gpuva.flags |= > > XE_VMA_PTE_64K; > > - pte |= XE_PTE_PS64; > > - } else if (XE_WARN_ON(xe_walk- > >needs_64K)) { > > - return -EINVAL; > > + /* > > + * Set the XE_PTE_PS64 hint if possible, > > otherwise if > > + * this device *requires* 64K PTE size for > > VRAM, fail. > > + */ > > + if (level == 0 && !xe_parent->is_compact) { > > + if (xe_pt_is_pte_ps64K(addr, next, > > xe_walk)) { > > + xe_walk->vma->gpuva.flags > |= > > XE_VMA_PTE_64K; > > + pte |= XE_PTE_PS64; > > + } else if (XE_WARN_ON(xe_walk- > > >needs_64K)) { > > + return -EINVAL; > > + } > > } > > } > > > > I think there might be some things missing here. > The bind should look exactly as a is_null bind, with the exception that > the pte should be 0, so might be able to benefit from is_null, for > example xe_pg_hugepte_possible() short-circuit the dma-address > check. You are right. Above code has a problem for huge pte cases: it wouldn't detect A huge pte for clear_pt case. I will fix it. I will not make is_null be true for clear_pt case, but will make similar logic to Let clear_pt benefit from is_null logic. > With clear_pt we don't want to use the XE_PTE_PS64K, though, like > you > identified above. > > Will get back with a look at the invalidation. I don't quite get here. Please do let me know if you find anything. Oak > > > > @@ -519,7 +525,7 @@ xe_pt_stage_bind_entry(struct xe_ptw > *parent, > > pgoff_t offset, > > if (unlikely(ret)) > > return ret; > > > > - if (!is_null) > > + if (!is_null && !xe_walk->clear_pt) > > xe_res_next(curs, next - addr); > > xe_walk->va_curs_start = next; > > xe_walk->vma->gpuva.flags |= (XE_VMA_PTE_4K << > > level); > > @@ -589,6 +595,7 @@ static const struct xe_pt_walk_ops > > xe_pt_stage_bind_ops = { > > * @vma: The vma indicating the address range. > > * @entries: Storage for the update entries used for connecting the > > tree to > > * the main tree at commit time. > > + * @clear_pt: Clear the page table entries. > > * @num_entries: On output contains the number of @entries > used. 
> > * > > * This function builds a disconnected page-table tree for a given > > address > > @@ -602,7 +609,8 @@ static const struct xe_pt_walk_ops > > xe_pt_stage_bind_ops = { > > */ > > static int > > xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, > > - struct xe_vm_pgtable_update *entries, u32 > > *num_entries) > > + struct xe_vm_pgtable_update *entries, > > + bool clear_pt, u32 *num_entries) > > { > > struct xe_device *xe = tile_to_xe(tile); > > struct xe_bo *bo = xe_vma_bo(vma); > > @@ -622,10 +630,19 @@ xe_pt_stage_bind(struct xe_tile *tile, > struct > > xe_vma *vma, > > .vma = vma, > > .wupd.entries = entries, > > .needs_64K = (xe_vma_vm(vma)->flags & > > XE_VM_FLAG_64K) && is_devmem, > > + .clear_pt = clear_pt, > > }; > > struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id]; > > int ret; > > > > + if (clear_pt) { > > + ret = xe_pt_walk_range(&pt->base, pt->level, > > xe_vma_start(vma), > > + xe_vma_end(vma), > > &xe_walk.base); > > + > > + *num_entries = xe_walk.wupd.num_used_entries; > > + return ret; > > + } > > + > > /** > > * Default atomic expectations for different allocation > > scenarios are as follows: > > * > > @@ -981,12 +998,14 @@ static void xe_pt_free_bind(struct > > xe_vm_pgtable_update *entries, > > > > static int > > xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma, > > - struct xe_vm_pgtable_update *entries, u32 > > *num_entries) > > + struct xe_vm_pgtable_update *entries, > > + bool invalidate_on_bind, u32 *num_entries) > > { > > int err; > > > > *num_entries = 0; > > - err = xe_pt_stage_bind(tile, vma, entries, num_entries); > > + err = xe_pt_stage_bind(tile, vma, entries, > > invalidate_on_bind, > > + num_entries); > > if (!err) > > xe_tile_assert(tile, *num_entries); > > > > @@ -1661,6 +1680,7 @@ static int bind_op_prepare(struct xe_vm > *vm, > > struct xe_tile *tile, > > return err; > > > > err = xe_pt_prepare_bind(tile, vma, pt_op->entries, > > + pt_update_ops->invalidate_on_bind, > > &pt_op->num_entries); > > if (!err) { > > xe_tile_assert(tile, pt_op->num_entries <= > > @@ -1685,7 +1705,7 @@ static int bind_op_prepare(struct xe_vm > *vm, > > struct xe_tile *tile, > > * it needs to be done here. 
> > */ > > if ((!pt_op->rebind && xe_vm_has_scratch(vm) && > > - xe_vm_in_preempt_fence_mode(vm))) > > + xe_vm_in_preempt_fence_mode(vm)) || > > pt_update_ops->invalidate_on_bind) > > pt_update_ops->needs_invalidation = true; > > else if (pt_op->rebind && !xe_vm_in_lr_mode(vm)) > > /* We bump also if batch_invalidate_tlb is > > true */ > > @@ -1759,9 +1779,13 @@ static int op_prepare(struct xe_vm *vm, > > > > switch (op->base.op) { > > case DRM_GPUVA_OP_MAP: > > - if (!op->map.immediate && > xe_vm_in_fault_mode(vm)) > > + if (!op->map.immediate && > xe_vm_in_fault_mode(vm) && > > + !op->map.invalidate_on_bind) > > break; > > > > + if (op->map.invalidate_on_bind) > > + pt_update_ops->invalidate_on_bind = true; > > + > > err = bind_op_prepare(vm, tile, pt_update_ops, op- > > >map.vma); > > pt_update_ops->wait_vm_kernel = true; > > break; > > @@ -1871,6 +1895,8 @@ static void bind_op_commit(struct xe_vm > *vm, > > struct xe_tile *tile, > > } > > vma->tile_present |= BIT(tile->id); > > vma->tile_staged &= ~BIT(tile->id); > > + if (pt_update_ops->invalidate_on_bind) > > + vma->tile_invalidated |= BIT(tile->id); > > if (xe_vma_is_userptr(vma)) { > > lockdep_assert_held_read(&vm- > > >userptr.notifier_lock); > > to_userptr_vma(vma)->userptr.initial_bind = true; > > diff --git a/drivers/gpu/drm/xe/xe_pt_types.h > > b/drivers/gpu/drm/xe/xe_pt_types.h > > index 384cc04de719..3d0aa2a5102e 100644 > > --- a/drivers/gpu/drm/xe/xe_pt_types.h > > +++ b/drivers/gpu/drm/xe/xe_pt_types.h > > @@ -108,6 +108,8 @@ struct xe_vm_pgtable_update_ops { > > bool needs_userptr_lock; > > /** @needs_invalidation: Needs invalidation */ > > bool needs_invalidation; > > + /** @invalidate_on_bind: Invalidate the range before bind */ > > + bool invalidate_on_bind; > > /** > > * @wait_vm_bookkeep: PT operations need to wait until VM > is > > idle > > * (bookkeep dma-resv slots are idle) and stage all future > > VM activity > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > b/drivers/gpu/drm/xe/xe_vm.c > > index d664f2e418b2..813d893d9b63 100644 > > --- a/drivers/gpu/drm/xe/xe_vm.c > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > @@ -1921,6 +1921,23 @@ static void print_op(struct xe_device *xe, > > struct drm_gpuva_op *op) > > } > > #endif > > > > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_vm > *vm, u32 > > bind_flags) > > +{ > > + if (!xe_vm_in_fault_mode(vm)) > > + return false; > > + > > + if (!NEEDS_SCRATCH(vm->xe)) > > + return false; > > + > > + if (!xe_vm_has_scratch(vm)) > > + return false; > > + > > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > > + return false; > > + > > + return true; > > +} > > + > > /* > > * Create operations list from IOCTL arguments, setup operations > > fields so parse > > * and commit steps are decoupled from IOCTL arguments. This > step > > can fail. 
> > @@ -1991,6 +2008,8 @@ vm_bind_ioctl_ops_create(struct xe_vm > *vm, > > struct xe_bo *bo, > > op->map.is_null = flags & > > DRM_XE_VM_BIND_FLAG_NULL; > > op->map.dumpable = flags & > > DRM_XE_VM_BIND_FLAG_DUMPABLE; > > op->map.pat_index = pat_index; > > + op->map.invalidate_on_bind = > > + > __xe_vm_needs_clear_scratch_pages(vm > > , flags); > > } else if (__op->op == DRM_GPUVA_OP_PREFETCH) { > > op->prefetch.region = prefetch_region; > > } > > @@ -2188,7 +2207,8 @@ static int vm_bind_ioctl_ops_parse(struct > xe_vm > > *vm, struct drm_gpuva_ops *ops, > > return PTR_ERR(vma); > > > > op->map.vma = vma; > > - if (op->map.immediate || > > !xe_vm_in_fault_mode(vm)) > > + if (op->map.immediate || > > !xe_vm_in_fault_mode(vm) || > > + op->map.invalidate_on_bind) > > > xe_vma_ops_incr_pt_update_ops(vops, > > op- > > >tile_mask); > > break; > > @@ -2416,9 +2436,10 @@ static int op_lock_and_prep(struct > drm_exec > > *exec, struct xe_vm *vm, > > > > switch (op->base.op) { > > case DRM_GPUVA_OP_MAP: > > - err = vma_lock_and_validate(exec, op->map.vma, > > - !xe_vm_in_fault_mode(vm) > > || > > - op->map.immediate); > > + if (!op->map.invalidate_on_bind) > > + err = vma_lock_and_validate(exec, op- > > >map.vma, > > + > > !xe_vm_in_fault_mode(vm) || > > + op- > > >map.immediate); > > break; > > case DRM_GPUVA_OP_REMAP: > > err = check_ufence(gpuva_to_vma(op- > > >base.remap.unmap->va)); > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h > > b/drivers/gpu/drm/xe/xe_vm_types.h > > index 52467b9b5348..dace04f4ea5e 100644 > > --- a/drivers/gpu/drm/xe/xe_vm_types.h > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h > > @@ -297,6 +297,8 @@ struct xe_vma_op_map { > > bool is_null; > > /** @dumpable: whether BO is dumped on GPU hang */ > > bool dumpable; > > + /** @invalidate: invalidate the VMA before bind */ > > + bool invalidate_on_bind; > > /** @pat_index: The pat index to use for this operation. */ > > u16 pat_index; > > }; ^ permalink raw reply [flat|nested] 32+ messages in thread
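To make the behaviour being fixed easier to follow alongside the v2 plumbing quoted above, here is the scenario in pseudocode. The call names below are placeholders, not the xe uAPI:

	vm = vm_create(FAULT_MODE | SCRATCH_PAGE);  /* fault-driven binds, scratch enabled */
	gpu_access(vm, addr);                       /* hits a scratch PTE: reads zeroes, no fault */
	vm_bind(vm, bo, addr, range);               /* no IMMEDIATE flag: the mapping is deferred and
	                                             * expected to be created on a recoverable fault */
	gpu_access(vm, addr);                       /* without a clearing bind this still hits the
	                                             * scratch PTE (possibly cached in the TLB), so no
	                                             * fault is raised and the BO is never mapped */

With op->map.invalidate_on_bind set, the vm_bind above is staged like a normal map operation but with zeroed PTEs plus a TLB invalidation, so the second access faults and the fault handler can establish the real mapping.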
* Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-02-06 1:52 ` Zeng, Oak @ 2025-02-06 12:51 ` Thomas Hellström 2025-02-06 18:56 ` Zeng, Oak 0 siblings, 1 reply; 32+ messages in thread From: Thomas Hellström @ 2025-02-06 12:51 UTC (permalink / raw) To: Zeng, Oak, intel-xe@lists.freedesktop.org Cc: Brost, Matthew, Cavitt, Jonathan On Thu, 2025-02-06 at 01:52 +0000, Zeng, Oak wrote: > > > > -----Original Message----- > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > Sent: February 5, 2025 9:36 AM > > To: Zeng, Oak <oak.zeng@intel.com>; intel-xe@lists.freedesktop.org > > Cc: Brost, Matthew <matthew.brost@intel.com>; Cavitt, Jonathan > > <jonathan.cavitt@intel.com> > > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind > > > > On Tue, 2025-02-04 at 13:45 -0500, Oak Zeng wrote: > > > When a vm runs under fault mode, if scratch page is enabled, we > > need > > > to clear the scratch page mapping before vm_bind for the vm_bind > > > address range. Under fault mode, we depend on recoverable page > > fault > > > to establish mapping in page table. If scratch page is not > > > cleared, > > > GPU access of address won't cause page fault because it always > > > hits > > > the existing scratch page mapping. > > > > I think we need to ephasize that we clear AT vm_bind time, not > > before, > > both in the commit description and the patch title. > > Will fix. > > > > > > > > > When vm_bind with IMMEDIATE flag, there is no need of clearing as > > > immediate bind can overwrite the scratch page mapping. > > > > > > So far only is xe2 and xe3 products are allowed to enable scratch > > > page > > > under fault mode. On other platform we don't allow scratch page > > under > > > fault mode, so no need of such clearing. > > > > > > v2: Rework vm_bind pipeline to clear scratch page mapping. This > > > is > > > similar > > > to a map operation, with the exception that PTEs are cleared > > instead > > > of > > > pointing to valid physical pages. (Matt, Thomas) > > > > > > TLB invalidation is needed after clear scratch page mapping as > > > larger > > > scratch page mapping could be backed by physical page and cached > > in > > > TLB. (Matt, Thomas) > > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > > --- > > > drivers/gpu/drm/xe/xe_pt.c | 66 ++++++++++++++++++++++-- > > ------ > > > -- > > > drivers/gpu/drm/xe/xe_pt_types.h | 2 + > > > drivers/gpu/drm/xe/xe_vm.c | 29 ++++++++++++-- > > > drivers/gpu/drm/xe/xe_vm_types.h | 2 + > > > 4 files changed, 75 insertions(+), 24 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/xe/xe_pt.c > > b/drivers/gpu/drm/xe/xe_pt.c > > > index 1ddcc7e79a93..3fd0ae2dbe7d 100644 > > > --- a/drivers/gpu/drm/xe/xe_pt.c > > > +++ b/drivers/gpu/drm/xe/xe_pt.c > > > @@ -268,6 +268,8 @@ struct xe_pt_stage_bind_walk { > > > * granularity. > > > */ > > > bool needs_64K; > > > + /* @clear_pt: clear page table entries during the bind > > > walk > > > */ > > > + bool clear_pt; > > > /** > > > * @vma: VMA being mapped > > > */ > > > @@ -497,21 +499,25 @@ xe_pt_stage_bind_entry(struct xe_ptw > > *parent, > > > pgoff_t offset, > > > > > > XE_WARN_ON(xe_walk->va_curs_start != addr); > > > > > > - pte = vm->pt_ops->pte_encode_vma(is_null ? 0 : > > > - > > > xe_res_dma(curs) + > > > xe_walk->dma_offset, > > > - xe_walk->vma, > > > pat_index, level); > > > - pte |= xe_walk->default_pte; > > > + if (xe_walk->clear_pt) { > > > + pte = 0; > > > + } else { > > > + pte = vm->pt_ops->pte_encode_vma(is_null > > > ? 
> > 0 > > > : > > > + xe_res_dma(curs) + > > > xe_walk- > > > > dma_offset, > > > + xe_walk->vma, pat_index, > > > level); > > > + pte |= xe_walk->default_pte; > > > > > > - /* > > > - * Set the XE_PTE_PS64 hint if possible, > > > otherwise > > > if > > > - * this device *requires* 64K PTE size for VRAM, > > > fail. > > > - */ > > > - if (level == 0 && !xe_parent->is_compact) { > > > - if (xe_pt_is_pte_ps64K(addr, next, > > > xe_walk)) > > > { > > > - xe_walk->vma->gpuva.flags |= > > > XE_VMA_PTE_64K; > > > - pte |= XE_PTE_PS64; > > > - } else if (XE_WARN_ON(xe_walk- > > > needs_64K)) { > > > - return -EINVAL; > > > + /* > > > + * Set the XE_PTE_PS64 hint if possible, > > > otherwise if > > > + * this device *requires* 64K PTE size > > > for > > > VRAM, fail. > > > + */ > > > + if (level == 0 && !xe_parent- > > > >is_compact) { > > > + if (xe_pt_is_pte_ps64K(addr, > > > next, > > > xe_walk)) { > > > + xe_walk->vma- > > > >gpuva.flags > > > = > > > XE_VMA_PTE_64K; > > > + pte |= XE_PTE_PS64; > > > + } else if (XE_WARN_ON(xe_walk- > > > > needs_64K)) { > > > + return -EINVAL; > > > + } > > > } > > > } > > > > > > > I think there might be some things missing here. > > The bind should look exactly as a is_null bind, with the exception > > that > > the pte should be 0, so might be able to benefit from is_null, for > > example xe_pg_hugepte_possible() short-circuit the dma-address > > check. > > You are right. Above code has a problem for huge pte cases: it > wouldn't detect > A huge pte for clear_pt case. I will fix it. > > I will not make is_null be true for clear_pt case, but will make > similar logic to > Let clear_pt benefit from is_null logic. > > > With clear_pt we don't want to use the XE_PTE_PS64K, though, like > > you > > identified above. > > > > Will get back with a look at the invalidation. > > I don't quite get here. Please do let me know if you find anything. Yes, I meant I hadn't had time to look at this part yet. For binding with scratch, it looks like *all* binds need TLB invalidation, not just clearing binds, So in the below code-path, can't we just replace the xe_vm_in_preempt_fence_mode() with xe_vm_in_lr_mode()? /* * If rebind, we have to invalidate TLB on !LR vms to invalidate * cached PTEs point to freed memory. On LR vms this is done * automatically when the context is re-enabled by the rebind worker, * or in fault mode it was invalidated on PTE zapping. * * If !rebind, and scratch enabled VMs, there is a chance the scratch * PTE is already cached in the TLB so it needs to be invalidated. * On !LR VMs this is done in the ring ops preceding a batch, but on * non-faulting LR, in particular on user-space batch buffer chaining, * it needs to be done here. */ if ((!pt_op->rebind && xe_vm_has_scratch(vm) && xe_vm_in_preempt_fence_mode(vm))) /Thomas > > Oak > > > > > > > @@ -519,7 +525,7 @@ xe_pt_stage_bind_entry(struct xe_ptw > > *parent, > > > pgoff_t offset, > > > if (unlikely(ret)) > > > return ret; > > > > > > - if (!is_null) > > > + if (!is_null && !xe_walk->clear_pt) > > > xe_res_next(curs, next - addr); > > > xe_walk->va_curs_start = next; > > > xe_walk->vma->gpuva.flags |= (XE_VMA_PTE_4K << > > > level); > > > @@ -589,6 +595,7 @@ static const struct xe_pt_walk_ops > > > xe_pt_stage_bind_ops = { > > > * @vma: The vma indicating the address range. > > > * @entries: Storage for the update entries used for connecting > > > the > > > tree to > > > * the main tree at commit time. > > > + * @clear_pt: Clear the page table entries. 
> > > * @num_entries: On output contains the number of @entries > > used. > > > * > > > * This function builds a disconnected page-table tree for a > > > given > > > address > > > @@ -602,7 +609,8 @@ static const struct xe_pt_walk_ops > > > xe_pt_stage_bind_ops = { > > > */ > > > static int > > > xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, > > > - struct xe_vm_pgtable_update *entries, u32 > > > *num_entries) > > > + struct xe_vm_pgtable_update *entries, > > > + bool clear_pt, u32 *num_entries) > > > { > > > struct xe_device *xe = tile_to_xe(tile); > > > struct xe_bo *bo = xe_vma_bo(vma); > > > @@ -622,10 +630,19 @@ xe_pt_stage_bind(struct xe_tile *tile, > > struct > > > xe_vma *vma, > > > .vma = vma, > > > .wupd.entries = entries, > > > .needs_64K = (xe_vma_vm(vma)->flags & > > > XE_VM_FLAG_64K) && is_devmem, > > > + .clear_pt = clear_pt, > > > }; > > > struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id]; > > > int ret; > > > > > > + if (clear_pt) { > > > + ret = xe_pt_walk_range(&pt->base, pt->level, > > > xe_vma_start(vma), > > > + xe_vma_end(vma), > > > &xe_walk.base); > > > + > > > + *num_entries = xe_walk.wupd.num_used_entries; > > > + return ret; > > > + } > > > + > > > /** > > > * Default atomic expectations for different allocation > > > scenarios are as follows: > > > * > > > @@ -981,12 +998,14 @@ static void xe_pt_free_bind(struct > > > xe_vm_pgtable_update *entries, > > > > > > static int > > > xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma, > > > - struct xe_vm_pgtable_update *entries, u32 > > > *num_entries) > > > + struct xe_vm_pgtable_update *entries, > > > + bool invalidate_on_bind, u32 *num_entries) > > > { > > > int err; > > > > > > *num_entries = 0; > > > - err = xe_pt_stage_bind(tile, vma, entries, num_entries); > > > + err = xe_pt_stage_bind(tile, vma, entries, > > > invalidate_on_bind, > > > + num_entries); > > > if (!err) > > > xe_tile_assert(tile, *num_entries); > > > > > > @@ -1661,6 +1680,7 @@ static int bind_op_prepare(struct xe_vm > > *vm, > > > struct xe_tile *tile, > > > return err; > > > > > > err = xe_pt_prepare_bind(tile, vma, pt_op->entries, > > > + pt_update_ops- > > > >invalidate_on_bind, > > > &pt_op->num_entries); > > > if (!err) { > > > xe_tile_assert(tile, pt_op->num_entries <= > > > @@ -1685,7 +1705,7 @@ static int bind_op_prepare(struct xe_vm > > *vm, > > > struct xe_tile *tile, > > > * it needs to be done here. 
> > > */ > > > if ((!pt_op->rebind && xe_vm_has_scratch(vm) && > > > - xe_vm_in_preempt_fence_mode(vm))) > > > + xe_vm_in_preempt_fence_mode(vm)) || > > > pt_update_ops->invalidate_on_bind) > > > pt_update_ops->needs_invalidation = > > > true; > > > else if (pt_op->rebind && !xe_vm_in_lr_mode(vm)) > > > /* We bump also if batch_invalidate_tlb > > > is > > > true */ > > > @@ -1759,9 +1779,13 @@ static int op_prepare(struct xe_vm *vm, > > > > > > switch (op->base.op) { > > > case DRM_GPUVA_OP_MAP: > > > - if (!op->map.immediate && > > xe_vm_in_fault_mode(vm)) > > > + if (!op->map.immediate && > > xe_vm_in_fault_mode(vm) && > > > + !op->map.invalidate_on_bind) > > > break; > > > > > > + if (op->map.invalidate_on_bind) > > > + pt_update_ops->invalidate_on_bind = > > > true; > > > + > > > err = bind_op_prepare(vm, tile, pt_update_ops, > > > op- > > > > map.vma); > > > pt_update_ops->wait_vm_kernel = true; > > > break; > > > @@ -1871,6 +1895,8 @@ static void bind_op_commit(struct xe_vm > > *vm, > > > struct xe_tile *tile, > > > } > > > vma->tile_present |= BIT(tile->id); > > > vma->tile_staged &= ~BIT(tile->id); > > > + if (pt_update_ops->invalidate_on_bind) > > > + vma->tile_invalidated |= BIT(tile->id); > > > if (xe_vma_is_userptr(vma)) { > > > lockdep_assert_held_read(&vm- > > > > userptr.notifier_lock); > > > to_userptr_vma(vma)->userptr.initial_bind = > > > true; > > > diff --git a/drivers/gpu/drm/xe/xe_pt_types.h > > > b/drivers/gpu/drm/xe/xe_pt_types.h > > > index 384cc04de719..3d0aa2a5102e 100644 > > > --- a/drivers/gpu/drm/xe/xe_pt_types.h > > > +++ b/drivers/gpu/drm/xe/xe_pt_types.h > > > @@ -108,6 +108,8 @@ struct xe_vm_pgtable_update_ops { > > > bool needs_userptr_lock; > > > /** @needs_invalidation: Needs invalidation */ > > > bool needs_invalidation; > > > + /** @invalidate_on_bind: Invalidate the range before > > > bind */ > > > + bool invalidate_on_bind; > > > /** > > > * @wait_vm_bookkeep: PT operations need to wait until > > > VM > > is > > > idle > > > * (bookkeep dma-resv slots are idle) and stage all > > > future > > > VM activity > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > b/drivers/gpu/drm/xe/xe_vm.c > > > index d664f2e418b2..813d893d9b63 100644 > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > @@ -1921,6 +1921,23 @@ static void print_op(struct xe_device *xe, > > > struct drm_gpuva_op *op) > > > } > > > #endif > > > > > > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_vm > > *vm, u32 > > > bind_flags) > > > +{ > > > + if (!xe_vm_in_fault_mode(vm)) > > > + return false; > > > + > > > + if (!NEEDS_SCRATCH(vm->xe)) > > > + return false; > > > + > > > + if (!xe_vm_has_scratch(vm)) > > > + return false; > > > + > > > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > > > + return false; > > > + > > > + return true; > > > +} > > > + > > > /* > > > * Create operations list from IOCTL arguments, setup operations > > > fields so parse > > > * and commit steps are decoupled from IOCTL arguments. This > > step > > > can fail. 
> > > @@ -1991,6 +2008,8 @@ vm_bind_ioctl_ops_create(struct xe_vm > > *vm, > > > struct xe_bo *bo, > > > op->map.is_null = flags & > > > DRM_XE_VM_BIND_FLAG_NULL; > > > op->map.dumpable = flags & > > > DRM_XE_VM_BIND_FLAG_DUMPABLE; > > > op->map.pat_index = pat_index; > > > + op->map.invalidate_on_bind = > > > + > > __xe_vm_needs_clear_scratch_pages(vm > > > , flags); > > > } else if (__op->op == DRM_GPUVA_OP_PREFETCH) { > > > op->prefetch.region = prefetch_region; > > > } > > > @@ -2188,7 +2207,8 @@ static int vm_bind_ioctl_ops_parse(struct > > xe_vm > > > *vm, struct drm_gpuva_ops *ops, > > > return PTR_ERR(vma); > > > > > > op->map.vma = vma; > > > - if (op->map.immediate || > > > !xe_vm_in_fault_mode(vm)) > > > + if (op->map.immediate || > > > !xe_vm_in_fault_mode(vm) || > > > + op->map.invalidate_on_bind) > > > > > xe_vma_ops_incr_pt_update_ops(vops, > > > > > > op- > > > > tile_mask); > > > break; > > > @@ -2416,9 +2436,10 @@ static int op_lock_and_prep(struct > > drm_exec > > > *exec, struct xe_vm *vm, > > > > > > switch (op->base.op) { > > > case DRM_GPUVA_OP_MAP: > > > - err = vma_lock_and_validate(exec, op->map.vma, > > > - > > > !xe_vm_in_fault_mode(vm) > > > > > > > > - op->map.immediate); > > > + if (!op->map.invalidate_on_bind) > > > + err = vma_lock_and_validate(exec, op- > > > > map.vma, > > > + > > > !xe_vm_in_fault_mode(vm) || > > > + op- > > > > map.immediate); > > > break; > > > case DRM_GPUVA_OP_REMAP: > > > err = check_ufence(gpuva_to_vma(op- > > > > base.remap.unmap->va)); > > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h > > > b/drivers/gpu/drm/xe/xe_vm_types.h > > > index 52467b9b5348..dace04f4ea5e 100644 > > > --- a/drivers/gpu/drm/xe/xe_vm_types.h > > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h > > > @@ -297,6 +297,8 @@ struct xe_vma_op_map { > > > bool is_null; > > > /** @dumpable: whether BO is dumped on GPU hang */ > > > bool dumpable; > > > + /** @invalidate: invalidate the VMA before bind */ > > > + bool invalidate_on_bind; > > > /** @pat_index: The pat index to use for this operation. > > > */ > > > u16 pat_index; > > > }; > ^ permalink raw reply [flat|nested] 32+ messages in thread
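Concretely, the change Thomas suggests above would amount to something like this in bind_op_prepare() (an illustrative sketch based on the hunk quoted in this thread, not a final patch; whether this is the right scope is what the follow-ups below discuss):

	if ((!pt_op->rebind && xe_vm_has_scratch(vm) &&
	     xe_vm_in_lr_mode(vm)) ||
	    pt_update_ops->invalidate_on_bind)
		pt_update_ops->needs_invalidation = true;

That is, any non-rebind bind into a scratch-enabled long-running VM, faulting or not, requests a TLB invalidation, in addition to the scratch-clearing binds themselves.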
* RE: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-02-06 12:51 ` Thomas Hellström @ 2025-02-06 18:56 ` Zeng, Oak 2025-02-06 20:11 ` Thomas Hellström 0 siblings, 1 reply; 32+ messages in thread From: Zeng, Oak @ 2025-02-06 18:56 UTC (permalink / raw) To: Thomas Hellström, intel-xe@lists.freedesktop.org Cc: Brost, Matthew, Cavitt, Jonathan Hi Thomas! > -----Original Message----- > From: Thomas Hellström <thomas.hellstrom@linux.intel.com> > Sent: February 6, 2025 7:52 AM > To: Zeng, Oak <oak.zeng@intel.com>; intel-xe@lists.freedesktop.org > Cc: Brost, Matthew <matthew.brost@intel.com>; Cavitt, Jonathan > <jonathan.cavitt@intel.com> > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind > > On Thu, 2025-02-06 at 01:52 +0000, Zeng, Oak wrote: > > > > > > > -----Original Message----- > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > > Sent: February 5, 2025 9:36 AM > > > To: Zeng, Oak <oak.zeng@intel.com>; intel- > xe@lists.freedesktop.org > > > Cc: Brost, Matthew <matthew.brost@intel.com>; Cavitt, Jonathan > > > <jonathan.cavitt@intel.com> > > > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before > vm_bind > > > > > > On Tue, 2025-02-04 at 13:45 -0500, Oak Zeng wrote: > > > > When a vm runs under fault mode, if scratch page is enabled, > we > > > need > > > > to clear the scratch page mapping before vm_bind for the > vm_bind > > > > address range. Under fault mode, we depend on recoverable > page > > > fault > > > > to establish mapping in page table. If scratch page is not > > > > cleared, > > > > GPU access of address won't cause page fault because it always > > > > hits > > > > the existing scratch page mapping. > > > > > > I think we need to ephasize that we clear AT vm_bind time, not > > > before, > > > both in the commit description and the patch title. > > > > Will fix. > > > > > > > > > > > > > When vm_bind with IMMEDIATE flag, there is no need of > clearing as > > > > immediate bind can overwrite the scratch page mapping. > > > > > > > > So far only is xe2 and xe3 products are allowed to enable scratch > > > > page > > > > under fault mode. On other platform we don't allow scratch > page > > > under > > > > fault mode, so no need of such clearing. > > > > > > > > v2: Rework vm_bind pipeline to clear scratch page mapping. This > > > > is > > > > similar > > > > to a map operation, with the exception that PTEs are cleared > > > instead > > > > of > > > > pointing to valid physical pages. (Matt, Thomas) > > > > > > > > TLB invalidation is needed after clear scratch page mapping as > > > > larger > > > > scratch page mapping could be backed by physical page and > cached > > > in > > > > TLB. (Matt, Thomas) > > > > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > > > --- > > > > drivers/gpu/drm/xe/xe_pt.c | 66 > ++++++++++++++++++++++-- > > > ------ > > > > -- > > > > drivers/gpu/drm/xe/xe_pt_types.h | 2 + > > > > drivers/gpu/drm/xe/xe_vm.c | 29 ++++++++++++-- > > > > drivers/gpu/drm/xe/xe_vm_types.h | 2 + > > > > 4 files changed, 75 insertions(+), 24 deletions(-) > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_pt.c > > > b/drivers/gpu/drm/xe/xe_pt.c > > > > index 1ddcc7e79a93..3fd0ae2dbe7d 100644 > > > > --- a/drivers/gpu/drm/xe/xe_pt.c > > > > +++ b/drivers/gpu/drm/xe/xe_pt.c > > > > @@ -268,6 +268,8 @@ struct xe_pt_stage_bind_walk { > > > > * granularity. 
> > > > */ > > > > bool needs_64K; > > > > + /* @clear_pt: clear page table entries during the bind > > > > walk > > > > */ > > > > + bool clear_pt; > > > > /** > > > > * @vma: VMA being mapped > > > > */ > > > > @@ -497,21 +499,25 @@ xe_pt_stage_bind_entry(struct > xe_ptw > > > *parent, > > > > pgoff_t offset, > > > > > > > > XE_WARN_ON(xe_walk->va_curs_start != addr); > > > > > > > > - pte = vm->pt_ops->pte_encode_vma(is_null ? 0 : > > > > - > > > > xe_res_dma(curs) + > > > > xe_walk->dma_offset, > > > > - xe_walk->vma, > > > > pat_index, level); > > > > - pte |= xe_walk->default_pte; > > > > + if (xe_walk->clear_pt) { > > > > + pte = 0; > > > > + } else { > > > > + pte = vm->pt_ops->pte_encode_vma(is_null > > > > ? > > > 0 > > > > : > > > > + xe_res_dma(curs) + > > > > xe_walk- > > > > > dma_offset, > > > > + xe_walk->vma, pat_index, > > > > level); > > > > + pte |= xe_walk->default_pte; > > > > > > > > - /* > > > > - * Set the XE_PTE_PS64 hint if possible, > > > > otherwise > > > > if > > > > - * this device *requires* 64K PTE size for VRAM, > > > > fail. > > > > - */ > > > > - if (level == 0 && !xe_parent->is_compact) { > > > > - if (xe_pt_is_pte_ps64K(addr, next, > > > > xe_walk)) > > > > { > > > > - xe_walk->vma->gpuva.flags |= > > > > XE_VMA_PTE_64K; > > > > - pte |= XE_PTE_PS64; > > > > - } else if (XE_WARN_ON(xe_walk- > > > > needs_64K)) { > > > > - return -EINVAL; > > > > + /* > > > > + * Set the XE_PTE_PS64 hint if possible, > > > > otherwise if > > > > + * this device *requires* 64K PTE size > > > > for > > > > VRAM, fail. > > > > + */ > > > > + if (level == 0 && !xe_parent- > > > > >is_compact) { > > > > + if (xe_pt_is_pte_ps64K(addr, > > > > next, > > > > xe_walk)) { > > > > + xe_walk->vma- > > > > >gpuva.flags > > > > = > > > > XE_VMA_PTE_64K; > > > > + pte |= XE_PTE_PS64; > > > > + } else if (XE_WARN_ON(xe_walk- > > > > > needs_64K)) { > > > > + return -EINVAL; > > > > + } > > > > } > > > > } > > > > > > > > > > I think there might be some things missing here. > > > The bind should look exactly as a is_null bind, with the exception > > > that > > > the pte should be 0, so might be able to benefit from is_null, for > > > example xe_pg_hugepte_possible() short-circuit the dma- > address > > > check. > > > > You are right. Above code has a problem for huge pte cases: it > > wouldn't detect > > A huge pte for clear_pt case. I will fix it. > > > > I will not make is_null be true for clear_pt case, but will make > > similar logic to > > Let clear_pt benefit from is_null logic. > > > > > With clear_pt we don't want to use the XE_PTE_PS64K, though, > like > > > you > > > identified above. > > > > > > Will get back with a look at the invalidation. > > > > I don't quite get here. Please do let me know if you find anything. > > Yes, I meant I hadn't had time to look at this part yet. > > For binding with scratch, it looks like *all* binds need TLB > invalidation, not just clearing binds, So in the below code-path, can't > we just replace the xe_vm_in_preempt_fence_mode() with > xe_vm_in_lr_mode()? I think you are correct. For long run vm, regardless it is in fault mode or not, we need to invalidate TLB scratch page caching during bind. Original codes only did TLB inv on non- Fault mode. On fault mode + LR, this runs into two sub-cases: 1. immediate bind 2. bind triggered by page fault: note this can actually runs into the !rebind case When the fault address is first time accessed by gpu. 
But the TLB cache for Scratch page mapping should have been invalidated by the "invalidate_on_bind" Condition. So this leaves case #1 problematic. It would be better for others to comment also, as in the comment below, it explicitly Called out "non-faulting LR". My understanding is, people didn't think of case #1 above. Since it is an existing issue, I will make it a separate patch for better tracking purpose. Oak > > > /* > * If rebind, we have to invalidate TLB on !LR vms to > invalidate > * cached PTEs point to freed memory. On LR vms this > is done > * automatically when the context is re-enabled by > the > rebind worker, > * or in fault mode it was invalidated on PTE zapping. > * > * If !rebind, and scratch enabled VMs, there is a > chance the scratch > * PTE is already cached in the TLB so it needs to be > invalidated. > * On !LR VMs this is done in the ring ops preceding a > batch, but on > * non-faulting LR, in particular on user-space batch > buffer chaining, > * it needs to be done here. > */ > if ((!pt_op->rebind && xe_vm_has_scratch(vm) && > xe_vm_in_preempt_fence_mode(vm))) > > /Thomas > > > > > > > > > > > > > > > > > > > > > > > > Oak > > > > > > > > > > @@ -519,7 +525,7 @@ xe_pt_stage_bind_entry(struct xe_ptw > > > *parent, > > > > pgoff_t offset, > > > > if (unlikely(ret)) > > > > return ret; > > > > > > > > - if (!is_null) > > > > + if (!is_null && !xe_walk->clear_pt) > > > > xe_res_next(curs, next - addr); > > > > xe_walk->va_curs_start = next; > > > > xe_walk->vma->gpuva.flags |= (XE_VMA_PTE_4K << > > > > level); > > > > @@ -589,6 +595,7 @@ static const struct xe_pt_walk_ops > > > > xe_pt_stage_bind_ops = { > > > > * @vma: The vma indicating the address range. > > > > * @entries: Storage for the update entries used for connecting > > > > the > > > > tree to > > > > * the main tree at commit time. > > > > + * @clear_pt: Clear the page table entries. > > > > * @num_entries: On output contains the number of @entries > > > used. 
> > > > * > > > > * This function builds a disconnected page-table tree for a > > > > given > > > > address > > > > @@ -602,7 +609,8 @@ static const struct xe_pt_walk_ops > > > > xe_pt_stage_bind_ops = { > > > > */ > > > > static int > > > > xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, > > > > - struct xe_vm_pgtable_update *entries, u32 > > > > *num_entries) > > > > + struct xe_vm_pgtable_update *entries, > > > > + bool clear_pt, u32 *num_entries) > > > > { > > > > struct xe_device *xe = tile_to_xe(tile); > > > > struct xe_bo *bo = xe_vma_bo(vma); > > > > @@ -622,10 +630,19 @@ xe_pt_stage_bind(struct xe_tile *tile, > > > struct > > > > xe_vma *vma, > > > > .vma = vma, > > > > .wupd.entries = entries, > > > > .needs_64K = (xe_vma_vm(vma)->flags & > > > > XE_VM_FLAG_64K) && is_devmem, > > > > + .clear_pt = clear_pt, > > > > }; > > > > struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id]; > > > > int ret; > > > > > > > > + if (clear_pt) { > > > > + ret = xe_pt_walk_range(&pt->base, pt->level, > > > > xe_vma_start(vma), > > > > + xe_vma_end(vma), > > > > &xe_walk.base); > > > > + > > > > + *num_entries = xe_walk.wupd.num_used_entries; > > > > + return ret; > > > > + } > > > > + > > > > /** > > > > * Default atomic expectations for different allocation > > > > scenarios are as follows: > > > > * > > > > @@ -981,12 +998,14 @@ static void xe_pt_free_bind(struct > > > > xe_vm_pgtable_update *entries, > > > > > > > > static int > > > > xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma, > > > > - struct xe_vm_pgtable_update *entries, u32 > > > > *num_entries) > > > > + struct xe_vm_pgtable_update *entries, > > > > + bool invalidate_on_bind, u32 *num_entries) > > > > { > > > > int err; > > > > > > > > *num_entries = 0; > > > > - err = xe_pt_stage_bind(tile, vma, entries, num_entries); > > > > + err = xe_pt_stage_bind(tile, vma, entries, > > > > invalidate_on_bind, > > > > + num_entries); > > > > if (!err) > > > > xe_tile_assert(tile, *num_entries); > > > > > > > > @@ -1661,6 +1680,7 @@ static int bind_op_prepare(struct > xe_vm > > > *vm, > > > > struct xe_tile *tile, > > > > return err; > > > > > > > > err = xe_pt_prepare_bind(tile, vma, pt_op->entries, > > > > + pt_update_ops- > > > > >invalidate_on_bind, > > > > &pt_op->num_entries); > > > > if (!err) { > > > > xe_tile_assert(tile, pt_op->num_entries <= > > > > @@ -1685,7 +1705,7 @@ static int bind_op_prepare(struct > xe_vm > > > *vm, > > > > struct xe_tile *tile, > > > > * it needs to be done here. 
> > > > */ > > > > if ((!pt_op->rebind && xe_vm_has_scratch(vm) && > > > > - xe_vm_in_preempt_fence_mode(vm))) > > > > + xe_vm_in_preempt_fence_mode(vm)) || > > > > pt_update_ops->invalidate_on_bind) > > > > pt_update_ops->needs_invalidation = > > > > true; > > > > else if (pt_op->rebind && !xe_vm_in_lr_mode(vm)) > > > > /* We bump also if batch_invalidate_tlb > > > > is > > > > true */ > > > > @@ -1759,9 +1779,13 @@ static int op_prepare(struct xe_vm > *vm, > > > > > > > > switch (op->base.op) { > > > > case DRM_GPUVA_OP_MAP: > > > > - if (!op->map.immediate && > > > xe_vm_in_fault_mode(vm)) > > > > + if (!op->map.immediate && > > > xe_vm_in_fault_mode(vm) && > > > > + !op->map.invalidate_on_bind) > > > > break; > > > > > > > > + if (op->map.invalidate_on_bind) > > > > + pt_update_ops->invalidate_on_bind = > > > > true; > > > > + > > > > err = bind_op_prepare(vm, tile, pt_update_ops, > > > > op- > > > > > map.vma); > > > > pt_update_ops->wait_vm_kernel = true; > > > > break; > > > > @@ -1871,6 +1895,8 @@ static void bind_op_commit(struct > xe_vm > > > *vm, > > > > struct xe_tile *tile, > > > > } > > > > vma->tile_present |= BIT(tile->id); > > > > vma->tile_staged &= ~BIT(tile->id); > > > > + if (pt_update_ops->invalidate_on_bind) > > > > + vma->tile_invalidated |= BIT(tile->id); > > > > if (xe_vma_is_userptr(vma)) { > > > > lockdep_assert_held_read(&vm- > > > > > userptr.notifier_lock); > > > > to_userptr_vma(vma)->userptr.initial_bind = > > > > true; > > > > diff --git a/drivers/gpu/drm/xe/xe_pt_types.h > > > > b/drivers/gpu/drm/xe/xe_pt_types.h > > > > index 384cc04de719..3d0aa2a5102e 100644 > > > > --- a/drivers/gpu/drm/xe/xe_pt_types.h > > > > +++ b/drivers/gpu/drm/xe/xe_pt_types.h > > > > @@ -108,6 +108,8 @@ struct xe_vm_pgtable_update_ops { > > > > bool needs_userptr_lock; > > > > /** @needs_invalidation: Needs invalidation */ > > > > bool needs_invalidation; > > > > + /** @invalidate_on_bind: Invalidate the range before > > > > bind */ > > > > + bool invalidate_on_bind; > > > > /** > > > > * @wait_vm_bookkeep: PT operations need to wait until > > > > VM > > > is > > > > idle > > > > * (bookkeep dma-resv slots are idle) and stage all > > > > future > > > > VM activity > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > > b/drivers/gpu/drm/xe/xe_vm.c > > > > index d664f2e418b2..813d893d9b63 100644 > > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > > @@ -1921,6 +1921,23 @@ static void print_op(struct xe_device > *xe, > > > > struct drm_gpuva_op *op) > > > > } > > > > #endif > > > > > > > > +static bool __xe_vm_needs_clear_scratch_pages(struct > xe_vm > > > *vm, u32 > > > > bind_flags) > > > > +{ > > > > + if (!xe_vm_in_fault_mode(vm)) > > > > + return false; > > > > + > > > > + if (!NEEDS_SCRATCH(vm->xe)) > > > > + return false; > > > > + > > > > + if (!xe_vm_has_scratch(vm)) > > > > + return false; > > > > + > > > > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > > > > + return false; > > > > + > > > > + return true; > > > > +} > > > > + > > > > /* > > > > * Create operations list from IOCTL arguments, setup > operations > > > > fields so parse > > > > * and commit steps are decoupled from IOCTL arguments. This > > > step > > > > can fail. 
> > > > @@ -1991,6 +2008,8 @@ vm_bind_ioctl_ops_create(struct > xe_vm > > > *vm, > > > > struct xe_bo *bo, > > > > op->map.is_null = flags & > > > > DRM_XE_VM_BIND_FLAG_NULL; > > > > op->map.dumpable = flags & > > > > DRM_XE_VM_BIND_FLAG_DUMPABLE; > > > > op->map.pat_index = pat_index; > > > > + op->map.invalidate_on_bind = > > > > + > > > __xe_vm_needs_clear_scratch_pages(vm > > > > , flags); > > > > } else if (__op->op == DRM_GPUVA_OP_PREFETCH) { > > > > op->prefetch.region = prefetch_region; > > > > } > > > > @@ -2188,7 +2207,8 @@ static int > vm_bind_ioctl_ops_parse(struct > > > xe_vm > > > > *vm, struct drm_gpuva_ops *ops, > > > > return PTR_ERR(vma); > > > > > > > > op->map.vma = vma; > > > > - if (op->map.immediate || > > > > !xe_vm_in_fault_mode(vm)) > > > > + if (op->map.immediate || > > > > !xe_vm_in_fault_mode(vm) || > > > > + op->map.invalidate_on_bind) > > > > > > > xe_vma_ops_incr_pt_update_ops(vops, > > > > > > > > op- > > > > > tile_mask); > > > > break; > > > > @@ -2416,9 +2436,10 @@ static int op_lock_and_prep(struct > > > drm_exec > > > > *exec, struct xe_vm *vm, > > > > > > > > switch (op->base.op) { > > > > case DRM_GPUVA_OP_MAP: > > > > - err = vma_lock_and_validate(exec, op->map.vma, > > > > - > > > > !xe_vm_in_fault_mode(vm) > > > > > > > > > > - op->map.immediate); > > > > + if (!op->map.invalidate_on_bind) > > > > + err = vma_lock_and_validate(exec, op- > > > > > map.vma, > > > > + > > > > !xe_vm_in_fault_mode(vm) || > > > > + op- > > > > > map.immediate); > > > > break; > > > > case DRM_GPUVA_OP_REMAP: > > > > err = check_ufence(gpuva_to_vma(op- > > > > > base.remap.unmap->va)); > > > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h > > > > b/drivers/gpu/drm/xe/xe_vm_types.h > > > > index 52467b9b5348..dace04f4ea5e 100644 > > > > --- a/drivers/gpu/drm/xe/xe_vm_types.h > > > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h > > > > @@ -297,6 +297,8 @@ struct xe_vma_op_map { > > > > bool is_null; > > > > /** @dumpable: whether BO is dumped on GPU hang */ > > > > bool dumpable; > > > > + /** @invalidate: invalidate the VMA before bind */ > > > > + bool invalidate_on_bind; > > > > /** @pat_index: The pat index to use for this operation. > > > > */ > > > > u16 pat_index; > > > > }; > > ^ permalink raw reply [flat|nested] 32+ messages in thread
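A rough summary of the cases walked through above, for a scratch-enabled VM (this is one reading of the thread, not a statement of the final design):

	Situation                                 Scratch TLB entry invalidated by
	----------------------------------------  ------------------------------------------------
	!LR VM, any bind                          ring ops preceding each batch
	non-faulting LR VM, !rebind bind          bind_op_prepare() setting needs_invalidation
	fault-mode LR VM, IMMEDIATE bind          nothing yet (the problematic case #1)
	fault-mode LR VM, fault-driven bind       the earlier scratch-clearing bind via
	                                          invalidate_on_bind (case #2)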
* Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-02-06 18:56 ` Zeng, Oak @ 2025-02-06 20:11 ` Thomas Hellström 2025-02-06 21:05 ` Zeng, Oak 0 siblings, 1 reply; 32+ messages in thread From: Thomas Hellström @ 2025-02-06 20:11 UTC (permalink / raw) To: Zeng, Oak, intel-xe@lists.freedesktop.org Cc: Brost, Matthew, Cavitt, Jonathan Hi, Oak, On Thu, 2025-02-06 at 18:56 +0000, Zeng, Oak wrote: > Hi Thomas! > > > -----Original Message----- > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > Sent: February 6, 2025 7:52 AM > > To: Zeng, Oak <oak.zeng@intel.com>; intel-xe@lists.freedesktop.org > > Cc: Brost, Matthew <matthew.brost@intel.com>; Cavitt, Jonathan > > <jonathan.cavitt@intel.com> > > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind > > > > On Thu, 2025-02-06 at 01:52 +0000, Zeng, Oak wrote: > > > > > > > > > > -----Original Message----- > > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > > > Sent: February 5, 2025 9:36 AM > > > > To: Zeng, Oak <oak.zeng@intel.com>; intel- > > xe@lists.freedesktop.org > > > > Cc: Brost, Matthew <matthew.brost@intel.com>; Cavitt, Jonathan > > > > <jonathan.cavitt@intel.com> > > > > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before > > vm_bind > > > > > > > > On Tue, 2025-02-04 at 13:45 -0500, Oak Zeng wrote: > > > > > When a vm runs under fault mode, if scratch page is enabled, > > we > > > > need > > > > > to clear the scratch page mapping before vm_bind for the > > vm_bind > > > > > address range. Under fault mode, we depend on recoverable > > page > > > > fault > > > > > to establish mapping in page table. If scratch page is not > > > > > cleared, > > > > > GPU access of address won't cause page fault because it > > > > > always > > > > > hits > > > > > the existing scratch page mapping. > > > > > > > > I think we need to ephasize that we clear AT vm_bind time, not > > > > before, > > > > both in the commit description and the patch title. > > > > > > Will fix. > > > > > > > > > > > > > > > > > When vm_bind with IMMEDIATE flag, there is no need of > > clearing as > > > > > immediate bind can overwrite the scratch page mapping. > > > > > > > > > > So far only is xe2 and xe3 products are allowed to enable > > > > > scratch > > > > > page > > > > > under fault mode. On other platform we don't allow scratch > > page > > > > under > > > > > fault mode, so no need of such clearing. > > > > > > > > > > v2: Rework vm_bind pipeline to clear scratch page mapping. > > > > > This > > > > > is > > > > > similar > > > > > to a map operation, with the exception that PTEs are cleared > > > > instead > > > > > of > > > > > pointing to valid physical pages. (Matt, Thomas) > > > > > > > > > > TLB invalidation is needed after clear scratch page mapping > > > > > as > > > > > larger > > > > > scratch page mapping could be backed by physical page and > > cached > > > > in > > > > > TLB. 
(Matt, Thomas) > > > > > > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > > > > --- > > > > > drivers/gpu/drm/xe/xe_pt.c | 66 > > ++++++++++++++++++++++-- > > > > ------ > > > > > -- > > > > > drivers/gpu/drm/xe/xe_pt_types.h | 2 + > > > > > drivers/gpu/drm/xe/xe_vm.c | 29 ++++++++++++-- > > > > > drivers/gpu/drm/xe/xe_vm_types.h | 2 + > > > > > 4 files changed, 75 insertions(+), 24 deletions(-) > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_pt.c > > > > b/drivers/gpu/drm/xe/xe_pt.c > > > > > index 1ddcc7e79a93..3fd0ae2dbe7d 100644 > > > > > --- a/drivers/gpu/drm/xe/xe_pt.c > > > > > +++ b/drivers/gpu/drm/xe/xe_pt.c > > > > > @@ -268,6 +268,8 @@ struct xe_pt_stage_bind_walk { > > > > > * granularity. > > > > > */ > > > > > bool needs_64K; > > > > > + /* @clear_pt: clear page table entries during the > > > > > bind > > > > > walk > > > > > */ > > > > > + bool clear_pt; > > > > > /** > > > > > * @vma: VMA being mapped > > > > > */ > > > > > @@ -497,21 +499,25 @@ xe_pt_stage_bind_entry(struct > > xe_ptw > > > > *parent, > > > > > pgoff_t offset, > > > > > > > > > > XE_WARN_ON(xe_walk->va_curs_start != addr); > > > > > > > > > > - pte = vm->pt_ops->pte_encode_vma(is_null ? 0 > > > > > : > > > > > - > > > > > xe_res_dma(curs) + > > > > > xe_walk->dma_offset, > > > > > - xe_walk- > > > > > >vma, > > > > > pat_index, level); > > > > > - pte |= xe_walk->default_pte; > > > > > + if (xe_walk->clear_pt) { > > > > > + pte = 0; > > > > > + } else { > > > > > + pte = vm->pt_ops- > > > > > >pte_encode_vma(is_null > > > > > ? > > > > 0 > > > > > : > > > > > + xe_res_dma(curs) + > > > > > xe_walk- > > > > > > dma_offset, > > > > > + xe_walk->vma, > > > > > pat_index, > > > > > level); > > > > > + pte |= xe_walk->default_pte; > > > > > > > > > > - /* > > > > > - * Set the XE_PTE_PS64 hint if possible, > > > > > otherwise > > > > > if > > > > > - * this device *requires* 64K PTE size for > > > > > VRAM, > > > > > fail. > > > > > - */ > > > > > - if (level == 0 && !xe_parent->is_compact) { > > > > > - if (xe_pt_is_pte_ps64K(addr, next, > > > > > xe_walk)) > > > > > { > > > > > - xe_walk->vma->gpuva.flags |= > > > > > XE_VMA_PTE_64K; > > > > > - pte |= XE_PTE_PS64; > > > > > - } else if (XE_WARN_ON(xe_walk- > > > > > needs_64K)) { > > > > > - return -EINVAL; > > > > > + /* > > > > > + * Set the XE_PTE_PS64 hint if > > > > > possible, > > > > > otherwise if > > > > > + * this device *requires* 64K PTE > > > > > size > > > > > for > > > > > VRAM, fail. > > > > > + */ > > > > > + if (level == 0 && !xe_parent- > > > > > > is_compact) { > > > > > + if (xe_pt_is_pte_ps64K(addr, > > > > > next, > > > > > xe_walk)) { > > > > > + xe_walk->vma- > > > > > > gpuva.flags > > > > > = > > > > > XE_VMA_PTE_64K; > > > > > + pte |= XE_PTE_PS64; > > > > > + } else if > > > > > (XE_WARN_ON(xe_walk- > > > > > > needs_64K)) { > > > > > + return -EINVAL; > > > > > + } > > > > > } > > > > > } > > > > > > > > > > > > > I think there might be some things missing here. > > > > The bind should look exactly as a is_null bind, with the > > > > exception > > > > that > > > > the pte should be 0, so might be able to benefit from is_null, > > > > for > > > > example xe_pg_hugepte_possible() short-circuit the dma- > > address > > > > check. > > > > > > You are right. Above code has a problem for huge pte cases: it > > > wouldn't detect > > > A huge pte for clear_pt case. I will fix it. 
> > > > > > I will not make is_null be true for clear_pt case, but will make > > > similar logic to > > > Let clear_pt benefit from is_null logic. > > > > > > > With clear_pt we don't want to use the XE_PTE_PS64K, though, > > like > > > > you > > > > identified above. > > > > > > > > Will get back with a look at the invalidation. > > > > > > I don't quite get here. Please do let me know if you find > > > anything. > > > > Yes, I meant I hadn't had time to look at this part yet. > > > > For binding with scratch, it looks like *all* binds need TLB > > invalidation, not just clearing binds, So in the below code-path, > > can't > > we just replace the xe_vm_in_preempt_fence_mode() with > > xe_vm_in_lr_mode()? > > I think you are correct. > > For long run vm, regardless it is in fault mode or not, we need to > invalidate > TLB scratch page caching during bind. Original codes only did TLB inv > on non- > Fault mode. On fault mode + LR, this runs into two sub-cases: > 1. immediate bind > 2. bind triggered by page fault: note this can actually runs into the > !rebind case > When the fault address is first time accessed by gpu. But the TLB > cache for > Scratch page mapping should have been invalidated by the > "invalidate_on_bind" > Condition. > > So this leaves case #1 problematic. > > It would be better for others to comment also, as in the comment > below, it explicitly > Called out "non-faulting LR". My understanding is, people didn't > think of case #1 above. > > Since it is an existing issue, I will make it a separate patch for > better tracking purpose. Hmm. What is the existing issue, I don't quite follow? /Thomas > > Oak > > > > > > > /* > > * If rebind, we have to invalidate TLB on !LR vms > > to > > invalidate > > * cached PTEs point to freed memory. On LR vms > > this > > is done > > * automatically when the context is re-enabled by > > the > > rebind worker, > > * or in fault mode it was invalidated on PTE > > zapping. > > * > > * If !rebind, and scratch enabled VMs, there is a > > chance the scratch > > * PTE is already cached in the TLB so it needs to > > be > > invalidated. > > * On !LR VMs this is done in the ring ops > > preceding a > > batch, but on > > * non-faulting LR, in particular on user-space > > batch > > buffer chaining, > > * it needs to be done here. > > */ > > if ((!pt_op->rebind && xe_vm_has_scratch(vm) && > > xe_vm_in_preempt_fence_mode(vm))) > > > > /Thomas > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Oak > > > > > > > > > > > > > @@ -519,7 +525,7 @@ xe_pt_stage_bind_entry(struct xe_ptw > > > > *parent, > > > > > pgoff_t offset, > > > > > if (unlikely(ret)) > > > > > return ret; > > > > > > > > > > - if (!is_null) > > > > > + if (!is_null && !xe_walk->clear_pt) > > > > > xe_res_next(curs, next - addr); > > > > > xe_walk->va_curs_start = next; > > > > > xe_walk->vma->gpuva.flags |= (XE_VMA_PTE_4K > > > > > << > > > > > level); > > > > > @@ -589,6 +595,7 @@ static const struct xe_pt_walk_ops > > > > > xe_pt_stage_bind_ops = { > > > > > * @vma: The vma indicating the address range. > > > > > * @entries: Storage for the update entries used for > > > > > connecting > > > > > the > > > > > tree to > > > > > * the main tree at commit time. > > > > > + * @clear_pt: Clear the page table entries. > > > > > * @num_entries: On output contains the number of @entries > > > > used. 
> > > > > * > > > > > * This function builds a disconnected page-table tree for a > > > > > given > > > > > address > > > > > @@ -602,7 +609,8 @@ static const struct xe_pt_walk_ops > > > > > xe_pt_stage_bind_ops = { > > > > > */ > > > > > static int > > > > > xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, > > > > > - struct xe_vm_pgtable_update *entries, u32 > > > > > *num_entries) > > > > > + struct xe_vm_pgtable_update *entries, > > > > > + bool clear_pt, u32 *num_entries) > > > > > { > > > > > struct xe_device *xe = tile_to_xe(tile); > > > > > struct xe_bo *bo = xe_vma_bo(vma); > > > > > @@ -622,10 +630,19 @@ xe_pt_stage_bind(struct xe_tile *tile, > > > > struct > > > > > xe_vma *vma, > > > > > .vma = vma, > > > > > .wupd.entries = entries, > > > > > .needs_64K = (xe_vma_vm(vma)->flags & > > > > > XE_VM_FLAG_64K) && is_devmem, > > > > > + .clear_pt = clear_pt, > > > > > }; > > > > > struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile- > > > > > >id]; > > > > > int ret; > > > > > > > > > > + if (clear_pt) { > > > > > + ret = xe_pt_walk_range(&pt->base, pt->level, > > > > > xe_vma_start(vma), > > > > > + xe_vma_end(vma), > > > > > &xe_walk.base); > > > > > + > > > > > + *num_entries = > > > > > xe_walk.wupd.num_used_entries; > > > > > + return ret; > > > > > + } > > > > > + > > > > > /** > > > > > * Default atomic expectations for different > > > > > allocation > > > > > scenarios are as follows: > > > > > * > > > > > @@ -981,12 +998,14 @@ static void xe_pt_free_bind(struct > > > > > xe_vm_pgtable_update *entries, > > > > > > > > > > static int > > > > > xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma, > > > > > - struct xe_vm_pgtable_update *entries, u32 > > > > > *num_entries) > > > > > + struct xe_vm_pgtable_update *entries, > > > > > + bool invalidate_on_bind, u32 > > > > > *num_entries) > > > > > { > > > > > int err; > > > > > > > > > > *num_entries = 0; > > > > > - err = xe_pt_stage_bind(tile, vma, entries, > > > > > num_entries); > > > > > + err = xe_pt_stage_bind(tile, vma, entries, > > > > > invalidate_on_bind, > > > > > + num_entries); > > > > > if (!err) > > > > > xe_tile_assert(tile, *num_entries); > > > > > > > > > > @@ -1661,6 +1680,7 @@ static int bind_op_prepare(struct > > xe_vm > > > > *vm, > > > > > struct xe_tile *tile, > > > > > return err; > > > > > > > > > > err = xe_pt_prepare_bind(tile, vma, pt_op->entries, > > > > > + pt_update_ops- > > > > > > invalidate_on_bind, > > > > > &pt_op->num_entries); > > > > > if (!err) { > > > > > xe_tile_assert(tile, pt_op->num_entries <= > > > > > @@ -1685,7 +1705,7 @@ static int bind_op_prepare(struct > > xe_vm > > > > *vm, > > > > > struct xe_tile *tile, > > > > > * it needs to be done here. 
> > > > > */ > > > > > if ((!pt_op->rebind && xe_vm_has_scratch(vm) > > > > > && > > > > > - xe_vm_in_preempt_fence_mode(vm))) > > > > > + xe_vm_in_preempt_fence_mode(vm)) || > > > > > pt_update_ops->invalidate_on_bind) > > > > > pt_update_ops->needs_invalidation = > > > > > true; > > > > > else if (pt_op->rebind && > > > > > !xe_vm_in_lr_mode(vm)) > > > > > /* We bump also if > > > > > batch_invalidate_tlb > > > > > is > > > > > true */ > > > > > @@ -1759,9 +1779,13 @@ static int op_prepare(struct xe_vm > > *vm, > > > > > > > > > > switch (op->base.op) { > > > > > case DRM_GPUVA_OP_MAP: > > > > > - if (!op->map.immediate && > > > > xe_vm_in_fault_mode(vm)) > > > > > + if (!op->map.immediate && > > > > xe_vm_in_fault_mode(vm) && > > > > > + !op->map.invalidate_on_bind) > > > > > break; > > > > > > > > > > + if (op->map.invalidate_on_bind) > > > > > + pt_update_ops->invalidate_on_bind = > > > > > true; > > > > > + > > > > > err = bind_op_prepare(vm, tile, > > > > > pt_update_ops, > > > > > op- > > > > > > map.vma); > > > > > pt_update_ops->wait_vm_kernel = true; > > > > > break; > > > > > @@ -1871,6 +1895,8 @@ static void bind_op_commit(struct > > xe_vm > > > > *vm, > > > > > struct xe_tile *tile, > > > > > } > > > > > vma->tile_present |= BIT(tile->id); > > > > > vma->tile_staged &= ~BIT(tile->id); > > > > > + if (pt_update_ops->invalidate_on_bind) > > > > > + vma->tile_invalidated |= BIT(tile->id); > > > > > if (xe_vma_is_userptr(vma)) { > > > > > lockdep_assert_held_read(&vm- > > > > > > userptr.notifier_lock); > > > > > to_userptr_vma(vma)->userptr.initial_bind = > > > > > true; > > > > > diff --git a/drivers/gpu/drm/xe/xe_pt_types.h > > > > > b/drivers/gpu/drm/xe/xe_pt_types.h > > > > > index 384cc04de719..3d0aa2a5102e 100644 > > > > > --- a/drivers/gpu/drm/xe/xe_pt_types.h > > > > > +++ b/drivers/gpu/drm/xe/xe_pt_types.h > > > > > @@ -108,6 +108,8 @@ struct xe_vm_pgtable_update_ops { > > > > > bool needs_userptr_lock; > > > > > /** @needs_invalidation: Needs invalidation */ > > > > > bool needs_invalidation; > > > > > + /** @invalidate_on_bind: Invalidate the range before > > > > > bind */ > > > > > + bool invalidate_on_bind; > > > > > /** > > > > > * @wait_vm_bookkeep: PT operations need to wait > > > > > until > > > > > VM > > > > is > > > > > idle > > > > > * (bookkeep dma-resv slots are idle) and stage all > > > > > future > > > > > VM activity > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > > > b/drivers/gpu/drm/xe/xe_vm.c > > > > > index d664f2e418b2..813d893d9b63 100644 > > > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > > > @@ -1921,6 +1921,23 @@ static void print_op(struct xe_device > > *xe, > > > > > struct drm_gpuva_op *op) > > > > > } > > > > > #endif > > > > > > > > > > +static bool __xe_vm_needs_clear_scratch_pages(struct > > xe_vm > > > > *vm, u32 > > > > > bind_flags) > > > > > +{ > > > > > + if (!xe_vm_in_fault_mode(vm)) > > > > > + return false; > > > > > + > > > > > + if (!NEEDS_SCRATCH(vm->xe)) > > > > > + return false; > > > > > + > > > > > + if (!xe_vm_has_scratch(vm)) > > > > > + return false; > > > > > + > > > > > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > > > > > + return false; > > > > > + > > > > > + return true; > > > > > +} > > > > > + > > > > > /* > > > > > * Create operations list from IOCTL arguments, setup > > operations > > > > > fields so parse > > > > > * and commit steps are decoupled from IOCTL arguments. This > > > > step > > > > > can fail. 
> > > > > @@ -1991,6 +2008,8 @@ vm_bind_ioctl_ops_create(struct > > xe_vm > > > > *vm, > > > > > struct xe_bo *bo, > > > > > op->map.is_null = flags & > > > > > DRM_XE_VM_BIND_FLAG_NULL; > > > > > op->map.dumpable = flags & > > > > > DRM_XE_VM_BIND_FLAG_DUMPABLE; > > > > > op->map.pat_index = pat_index; > > > > > + op->map.invalidate_on_bind = > > > > > + > > > > __xe_vm_needs_clear_scratch_pages(vm > > > > > , flags); > > > > > } else if (__op->op == > > > > > DRM_GPUVA_OP_PREFETCH) { > > > > > op->prefetch.region = > > > > > prefetch_region; > > > > > } > > > > > @@ -2188,7 +2207,8 @@ static int > > vm_bind_ioctl_ops_parse(struct > > > > xe_vm > > > > > *vm, struct drm_gpuva_ops *ops, > > > > > return PTR_ERR(vma); > > > > > > > > > > op->map.vma = vma; > > > > > - if (op->map.immediate || > > > > > !xe_vm_in_fault_mode(vm)) > > > > > + if (op->map.immediate || > > > > > !xe_vm_in_fault_mode(vm) || > > > > > + op->map.invalidate_on_bind) > > > > > > > > > xe_vma_ops_incr_pt_update_ops(vops, > > > > > > > > > > op- > > > > > > tile_mask); > > > > > break; > > > > > @@ -2416,9 +2436,10 @@ static int op_lock_and_prep(struct > > > > drm_exec > > > > > *exec, struct xe_vm *vm, > > > > > > > > > > switch (op->base.op) { > > > > > case DRM_GPUVA_OP_MAP: > > > > > - err = vma_lock_and_validate(exec, op- > > > > > >map.vma, > > > > > - > > > > > !xe_vm_in_fault_mode(vm) > > > > > > > > > > > > - op- > > > > > >map.immediate); > > > > > + if (!op->map.invalidate_on_bind) > > > > > + err = vma_lock_and_validate(exec, > > > > > op- > > > > > > map.vma, > > > > > + > > > > > !xe_vm_in_fault_mode(vm) || > > > > > + op- > > > > > > map.immediate); > > > > > break; > > > > > case DRM_GPUVA_OP_REMAP: > > > > > err = check_ufence(gpuva_to_vma(op- > > > > > > base.remap.unmap->va)); > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h > > > > > b/drivers/gpu/drm/xe/xe_vm_types.h > > > > > index 52467b9b5348..dace04f4ea5e 100644 > > > > > --- a/drivers/gpu/drm/xe/xe_vm_types.h > > > > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h > > > > > @@ -297,6 +297,8 @@ struct xe_vma_op_map { > > > > > bool is_null; > > > > > /** @dumpable: whether BO is dumped on GPU hang */ > > > > > bool dumpable; > > > > > + /** @invalidate: invalidate the VMA before bind */ > > > > > + bool invalidate_on_bind; > > > > > /** @pat_index: The pat index to use for this > > > > > operation. > > > > > */ > > > > > u16 pat_index; > > > > > }; > > > > ^ permalink raw reply [flat|nested] 32+ messages in thread
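On the huge-PTE point quoted in this message (a clearing bind should be able to take the same shortcut as a NULL bind, since there is no dma address whose contiguity needs checking), a minimal sketch of the idea follows. The helper below is hypothetical and simplified; it is not the driver's actual check.

	/* Hypothetical, simplified stand-in for the huge-PTE check discussed
	 * above: NULL binds and scratch-clearing binds have no backing memory,
	 * so only virtual-address alignment and span matter; a normal bind must
	 * additionally cover a suitably aligned, contiguous dma range.
	 */
	static bool sketch_hugepte_possible(u64 addr, u64 next, unsigned int level,
					    bool is_null, bool clear_pt,
					    bool dma_range_ok)
	{
		u64 pte_size = 0x1000ull << (9 * level);	/* 4K base pages, 9 bits per level */

		if (!IS_ALIGNED(addr, pte_size) || next - addr < pte_size)
			return false;

		if (is_null || clear_pt)
			return true;

		return dma_range_ok;
	}

The XE_PTE_PS64 / 64K-PTE hint is a separate question: as pointed out in the quoted exchange, a clearing bind should not set it.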
* RE: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-02-06 20:11 ` Thomas Hellström @ 2025-02-06 21:05 ` Zeng, Oak 0 siblings, 0 replies; 32+ messages in thread From: Zeng, Oak @ 2025-02-06 21:05 UTC (permalink / raw) To: Thomas Hellström, intel-xe@lists.freedesktop.org Cc: Brost, Matthew, Cavitt, Jonathan > -----Original Message----- > From: Thomas Hellström <thomas.hellstrom@linux.intel.com> > Sent: February 6, 2025 3:12 PM > To: Zeng, Oak <oak.zeng@intel.com>; intel-xe@lists.freedesktop.org > Cc: Brost, Matthew <matthew.brost@intel.com>; Cavitt, Jonathan > <jonathan.cavitt@intel.com> > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind > > Hi, Oak, > > On Thu, 2025-02-06 at 18:56 +0000, Zeng, Oak wrote: > > Hi Thomas! > > > > > -----Original Message----- > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > > Sent: February 6, 2025 7:52 AM > > > To: Zeng, Oak <oak.zeng@intel.com>; intel- > xe@lists.freedesktop.org > > > Cc: Brost, Matthew <matthew.brost@intel.com>; Cavitt, Jonathan > > > <jonathan.cavitt@intel.com> > > > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before > vm_bind > > > > > > On Thu, 2025-02-06 at 01:52 +0000, Zeng, Oak wrote: > > > > > > > > > > > > > -----Original Message----- > > > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > > > > Sent: February 5, 2025 9:36 AM > > > > > To: Zeng, Oak <oak.zeng@intel.com>; intel- > > > xe@lists.freedesktop.org > > > > > Cc: Brost, Matthew <matthew.brost@intel.com>; Cavitt, > Jonathan > > > > > <jonathan.cavitt@intel.com> > > > > > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before > > > vm_bind > > > > > > > > > > On Tue, 2025-02-04 at 13:45 -0500, Oak Zeng wrote: > > > > > > When a vm runs under fault mode, if scratch page is enabled, > > > we > > > > > need > > > > > > to clear the scratch page mapping before vm_bind for the > > > vm_bind > > > > > > address range. Under fault mode, we depend on > recoverable > > > page > > > > > fault > > > > > > to establish mapping in page table. If scratch page is not > > > > > > cleared, > > > > > > GPU access of address won't cause page fault because it > > > > > > always > > > > > > hits > > > > > > the existing scratch page mapping. > > > > > > > > > > I think we need to ephasize that we clear AT vm_bind time, > not > > > > > before, > > > > > both in the commit description and the patch title. > > > > > > > > Will fix. > > > > > > > > > > > > > > > > > > > > > When vm_bind with IMMEDIATE flag, there is no need of > > > clearing as > > > > > > immediate bind can overwrite the scratch page mapping. > > > > > > > > > > > > So far only is xe2 and xe3 products are allowed to enable > > > > > > scratch > > > > > > page > > > > > > under fault mode. On other platform we don't allow scratch > > > page > > > > > under > > > > > > fault mode, so no need of such clearing. > > > > > > > > > > > > v2: Rework vm_bind pipeline to clear scratch page mapping. > > > > > > This > > > > > > is > > > > > > similar > > > > > > to a map operation, with the exception that PTEs are cleared > > > > > instead > > > > > > of > > > > > > pointing to valid physical pages. (Matt, Thomas) > > > > > > > > > > > > TLB invalidation is needed after clear scratch page mapping > > > > > > as > > > > > > larger > > > > > > scratch page mapping could be backed by physical page and > > > cached > > > > > in > > > > > > TLB. 
(Matt, Thomas) > > > > > > > > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > > > > > --- > > > > > > drivers/gpu/drm/xe/xe_pt.c | 66 > > > ++++++++++++++++++++++-- > > > > > ------ > > > > > > -- > > > > > > drivers/gpu/drm/xe/xe_pt_types.h | 2 + > > > > > > drivers/gpu/drm/xe/xe_vm.c | 29 ++++++++++++-- > > > > > > drivers/gpu/drm/xe/xe_vm_types.h | 2 + > > > > > > 4 files changed, 75 insertions(+), 24 deletions(-) > > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_pt.c > > > > > b/drivers/gpu/drm/xe/xe_pt.c > > > > > > index 1ddcc7e79a93..3fd0ae2dbe7d 100644 > > > > > > --- a/drivers/gpu/drm/xe/xe_pt.c > > > > > > +++ b/drivers/gpu/drm/xe/xe_pt.c > > > > > > @@ -268,6 +268,8 @@ struct xe_pt_stage_bind_walk { > > > > > > * granularity. > > > > > > */ > > > > > > bool needs_64K; > > > > > > + /* @clear_pt: clear page table entries during the > > > > > > bind > > > > > > walk > > > > > > */ > > > > > > + bool clear_pt; > > > > > > /** > > > > > > * @vma: VMA being mapped > > > > > > */ > > > > > > @@ -497,21 +499,25 @@ xe_pt_stage_bind_entry(struct > > > xe_ptw > > > > > *parent, > > > > > > pgoff_t offset, > > > > > > > > > > > > XE_WARN_ON(xe_walk->va_curs_start != > addr); > > > > > > > > > > > > - pte = vm->pt_ops->pte_encode_vma(is_null ? > 0 > > > > > > : > > > > > > - > > > > > > xe_res_dma(curs) + > > > > > > xe_walk->dma_offset, > > > > > > - xe_walk- > > > > > > >vma, > > > > > > pat_index, level); > > > > > > - pte |= xe_walk->default_pte; > > > > > > + if (xe_walk->clear_pt) { > > > > > > + pte = 0; > > > > > > + } else { > > > > > > + pte = vm->pt_ops- > > > > > > >pte_encode_vma(is_null > > > > > > ? > > > > > 0 > > > > > > : > > > > > > + xe_res_dma(curs) + > > > > > > xe_walk- > > > > > > > dma_offset, > > > > > > + xe_walk->vma, > > > > > > pat_index, > > > > > > level); > > > > > > + pte |= xe_walk->default_pte; > > > > > > > > > > > > - /* > > > > > > - * Set the XE_PTE_PS64 hint if possible, > > > > > > otherwise > > > > > > if > > > > > > - * this device *requires* 64K PTE size for > > > > > > VRAM, > > > > > > fail. > > > > > > - */ > > > > > > - if (level == 0 && !xe_parent->is_compact) { > > > > > > - if (xe_pt_is_pte_ps64K(addr, next, > > > > > > xe_walk)) > > > > > > { > > > > > > - xe_walk->vma->gpuva.flags > |= > > > > > > XE_VMA_PTE_64K; > > > > > > - pte |= XE_PTE_PS64; > > > > > > - } else if (XE_WARN_ON(xe_walk- > > > > > > needs_64K)) { > > > > > > - return -EINVAL; > > > > > > + /* > > > > > > + * Set the XE_PTE_PS64 hint if > > > > > > possible, > > > > > > otherwise if > > > > > > + * this device *requires* 64K PTE > > > > > > size > > > > > > for > > > > > > VRAM, fail. > > > > > > + */ > > > > > > + if (level == 0 && !xe_parent- > > > > > > > is_compact) { > > > > > > + if (xe_pt_is_pte_ps64K(addr, > > > > > > next, > > > > > > xe_walk)) { > > > > > > + xe_walk->vma- > > > > > > > gpuva.flags > > > > > > = > > > > > > XE_VMA_PTE_64K; > > > > > > + pte |= XE_PTE_PS64; > > > > > > + } else if > > > > > > (XE_WARN_ON(xe_walk- > > > > > > > needs_64K)) { > > > > > > + return -EINVAL; > > > > > > + } > > > > > > } > > > > > > } > > > > > > > > > > > > > > > > I think there might be some things missing here. > > > > > The bind should look exactly as a is_null bind, with the > > > > > exception > > > > > that > > > > > the pte should be 0, so might be able to benefit from is_null, > > > > > for > > > > > example xe_pg_hugepte_possible() short-circuit the dma- > > > address > > > > > check. > > > > > > > > You are right. 
Above code has a problem for huge pte cases: it > > > > wouldn't detect > > > > A huge pte for clear_pt case. I will fix it. > > > > > > > > I will not make is_null be true for clear_pt case, but will make > > > > similar logic to > > > > Let clear_pt benefit from is_null logic. > > > > > > > > > With clear_pt we don't want to use the XE_PTE_PS64K, though, > > > like > > > > > you > > > > > identified above. > > > > > > > > > > Will get back with a look at the invalidation. > > > > > > > > I don't quite get here. Please do let me know if you find > > > > anything. > > > > > > Yes, I meant I hadn't had time to look at this part yet. > > > > > > For binding with scratch, it looks like *all* binds need TLB > > > invalidation, not just clearing binds, So in the below code-path, > > > can't > > > we just replace the xe_vm_in_preempt_fence_mode() with > > > xe_vm_in_lr_mode()? > > > > I think you are correct. > > > > For long run vm, regardless it is in fault mode or not, we need to > > invalidate > > TLB scratch page caching during bind. Original codes only did TLB inv > > on non- > > Fault mode. On fault mode + LR, this runs into two sub-cases: > > 1. immediate bind > > 2. bind triggered by page fault: note this can actually runs into the > > !rebind case > > When the fault address is first time accessed by gpu. But the TLB > > cache for > > Scratch page mapping should have been invalidated by the > > "invalidate_on_bind" > > Condition. > > > > So this leaves case #1 problematic. > > > > It would be better for others to comment also, as in the comment > > below, it explicitly > > Called out "non-faulting LR". My understanding is, people didn't > > think of case #1 above. > > > > Since it is an existing issue, I will make it a separate patch for > > better tracking purpose. > > Hmm. What is the existing issue, I don't quite follow? I just realized it is not an existing issue, as previously we didn't allow Scratch on fault mode. I will still fix in the current patch. Oak > > /Thomas > > > > > Oak > > > > > > > > > > > /* > > > * If rebind, we have to invalidate TLB on !LR vms > > > to > > > invalidate > > > * cached PTEs point to freed memory. On LR vms > > > this > > > is done > > > * automatically when the context is re-enabled by > > > the > > > rebind worker, > > > * or in fault mode it was invalidated on PTE > > > zapping. > > > * > > > * If !rebind, and scratch enabled VMs, there is a > > > chance the scratch > > > * PTE is already cached in the TLB so it needs to > > > be > > > invalidated. > > > * On !LR VMs this is done in the ring ops > > > preceding a > > > batch, but on > > > * non-faulting LR, in particular on user-space > > > batch > > > buffer chaining, > > > * it needs to be done here. 
> > > */ > > > if ((!pt_op->rebind && xe_vm_has_scratch(vm) && > > > xe_vm_in_preempt_fence_mode(vm))) > > > > > > /Thomas > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Oak > > > > > > > > > > > > > > > > @@ -519,7 +525,7 @@ xe_pt_stage_bind_entry(struct > xe_ptw > > > > > *parent, > > > > > > pgoff_t offset, > > > > > > if (unlikely(ret)) > > > > > > return ret; > > > > > > > > > > > > - if (!is_null) > > > > > > + if (!is_null && !xe_walk->clear_pt) > > > > > > xe_res_next(curs, next - addr); > > > > > > xe_walk->va_curs_start = next; > > > > > > xe_walk->vma->gpuva.flags |= > (XE_VMA_PTE_4K > > > > > > << > > > > > > level); > > > > > > @@ -589,6 +595,7 @@ static const struct xe_pt_walk_ops > > > > > > xe_pt_stage_bind_ops = { > > > > > > * @vma: The vma indicating the address range. > > > > > > * @entries: Storage for the update entries used for > > > > > > connecting > > > > > > the > > > > > > tree to > > > > > > * the main tree at commit time. > > > > > > + * @clear_pt: Clear the page table entries. > > > > > > * @num_entries: On output contains the number of > @entries > > > > > used. > > > > > > * > > > > > > * This function builds a disconnected page-table tree for a > > > > > > given > > > > > > address > > > > > > @@ -602,7 +609,8 @@ static const struct xe_pt_walk_ops > > > > > > xe_pt_stage_bind_ops = { > > > > > > */ > > > > > > static int > > > > > > xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, > > > > > > - struct xe_vm_pgtable_update *entries, u32 > > > > > > *num_entries) > > > > > > + struct xe_vm_pgtable_update *entries, > > > > > > + bool clear_pt, u32 *num_entries) > > > > > > { > > > > > > struct xe_device *xe = tile_to_xe(tile); > > > > > > struct xe_bo *bo = xe_vma_bo(vma); > > > > > > @@ -622,10 +630,19 @@ xe_pt_stage_bind(struct xe_tile > *tile, > > > > > struct > > > > > > xe_vma *vma, > > > > > > .vma = vma, > > > > > > .wupd.entries = entries, > > > > > > .needs_64K = (xe_vma_vm(vma)->flags & > > > > > > XE_VM_FLAG_64K) && is_devmem, > > > > > > + .clear_pt = clear_pt, > > > > > > }; > > > > > > struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile- > > > > > > >id]; > > > > > > int ret; > > > > > > > > > > > > + if (clear_pt) { > > > > > > + ret = xe_pt_walk_range(&pt->base, pt->level, > > > > > > xe_vma_start(vma), > > > > > > + xe_vma_end(vma), > > > > > > &xe_walk.base); > > > > > > + > > > > > > + *num_entries = > > > > > > xe_walk.wupd.num_used_entries; > > > > > > + return ret; > > > > > > + } > > > > > > + > > > > > > /** > > > > > > * Default atomic expectations for different > > > > > > allocation > > > > > > scenarios are as follows: > > > > > > * > > > > > > @@ -981,12 +998,14 @@ static void xe_pt_free_bind(struct > > > > > > xe_vm_pgtable_update *entries, > > > > > > > > > > > > static int > > > > > > xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma > *vma, > > > > > > - struct xe_vm_pgtable_update *entries, u32 > > > > > > *num_entries) > > > > > > + struct xe_vm_pgtable_update *entries, > > > > > > + bool invalidate_on_bind, u32 > > > > > > *num_entries) > > > > > > { > > > > > > int err; > > > > > > > > > > > > *num_entries = 0; > > > > > > - err = xe_pt_stage_bind(tile, vma, entries, > > > > > > num_entries); > > > > > > + err = xe_pt_stage_bind(tile, vma, entries, > > > > > > invalidate_on_bind, > > > > > > + num_entries); > > > > > > if (!err) > > > > > > xe_tile_assert(tile, *num_entries); > > > > > > > > > > > > @@ -1661,6 +1680,7 @@ 
static int bind_op_prepare(struct > > > xe_vm > > > > > *vm, > > > > > > struct xe_tile *tile, > > > > > > return err; > > > > > > > > > > > > err = xe_pt_prepare_bind(tile, vma, pt_op->entries, > > > > > > + pt_update_ops- > > > > > > > invalidate_on_bind, > > > > > > &pt_op->num_entries); > > > > > > if (!err) { > > > > > > xe_tile_assert(tile, pt_op->num_entries <= > > > > > > @@ -1685,7 +1705,7 @@ static int bind_op_prepare(struct > > > xe_vm > > > > > *vm, > > > > > > struct xe_tile *tile, > > > > > > * it needs to be done here. > > > > > > */ > > > > > > if ((!pt_op->rebind && > xe_vm_has_scratch(vm) > > > > > > && > > > > > > - xe_vm_in_preempt_fence_mode(vm))) > > > > > > + xe_vm_in_preempt_fence_mode(vm)) || > > > > > > pt_update_ops->invalidate_on_bind) > > > > > > pt_update_ops->needs_invalidation > = > > > > > > true; > > > > > > else if (pt_op->rebind && > > > > > > !xe_vm_in_lr_mode(vm)) > > > > > > /* We bump also if > > > > > > batch_invalidate_tlb > > > > > > is > > > > > > true */ > > > > > > @@ -1759,9 +1779,13 @@ static int op_prepare(struct > xe_vm > > > *vm, > > > > > > > > > > > > switch (op->base.op) { > > > > > > case DRM_GPUVA_OP_MAP: > > > > > > - if (!op->map.immediate && > > > > > xe_vm_in_fault_mode(vm)) > > > > > > + if (!op->map.immediate && > > > > > xe_vm_in_fault_mode(vm) && > > > > > > + !op->map.invalidate_on_bind) > > > > > > break; > > > > > > > > > > > > + if (op->map.invalidate_on_bind) > > > > > > + pt_update_ops->invalidate_on_bind > = > > > > > > true; > > > > > > + > > > > > > err = bind_op_prepare(vm, tile, > > > > > > pt_update_ops, > > > > > > op- > > > > > > > map.vma); > > > > > > pt_update_ops->wait_vm_kernel = true; > > > > > > break; > > > > > > @@ -1871,6 +1895,8 @@ static void bind_op_commit(struct > > > xe_vm > > > > > *vm, > > > > > > struct xe_tile *tile, > > > > > > } > > > > > > vma->tile_present |= BIT(tile->id); > > > > > > vma->tile_staged &= ~BIT(tile->id); > > > > > > + if (pt_update_ops->invalidate_on_bind) > > > > > > + vma->tile_invalidated |= BIT(tile->id); > > > > > > if (xe_vma_is_userptr(vma)) { > > > > > > lockdep_assert_held_read(&vm- > > > > > > > userptr.notifier_lock); > > > > > > to_userptr_vma(vma)->userptr.initial_bind = > > > > > > true; > > > > > > diff --git a/drivers/gpu/drm/xe/xe_pt_types.h > > > > > > b/drivers/gpu/drm/xe/xe_pt_types.h > > > > > > index 384cc04de719..3d0aa2a5102e 100644 > > > > > > --- a/drivers/gpu/drm/xe/xe_pt_types.h > > > > > > +++ b/drivers/gpu/drm/xe/xe_pt_types.h > > > > > > @@ -108,6 +108,8 @@ struct xe_vm_pgtable_update_ops { > > > > > > bool needs_userptr_lock; > > > > > > /** @needs_invalidation: Needs invalidation */ > > > > > > bool needs_invalidation; > > > > > > + /** @invalidate_on_bind: Invalidate the range before > > > > > > bind */ > > > > > > + bool invalidate_on_bind; > > > > > > /** > > > > > > * @wait_vm_bookkeep: PT operations need to wait > > > > > > until > > > > > > VM > > > > > is > > > > > > idle > > > > > > * (bookkeep dma-resv slots are idle) and stage all > > > > > > future > > > > > > VM activity > > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > > > > b/drivers/gpu/drm/xe/xe_vm.c > > > > > > index d664f2e418b2..813d893d9b63 100644 > > > > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > > > > @@ -1921,6 +1921,23 @@ static void print_op(struct > xe_device > > > *xe, > > > > > > struct drm_gpuva_op *op) > > > > > > } > > > > > > #endif > > > > > > > > > > > > +static bool __xe_vm_needs_clear_scratch_pages(struct > 
> > xe_vm > > > > > *vm, u32 > > > > > > bind_flags) > > > > > > +{ > > > > > > + if (!xe_vm_in_fault_mode(vm)) > > > > > > + return false; > > > > > > + > > > > > > + if (!NEEDS_SCRATCH(vm->xe)) > > > > > > + return false; > > > > > > + > > > > > > + if (!xe_vm_has_scratch(vm)) > > > > > > + return false; > > > > > > + > > > > > > + if (bind_flags & > DRM_XE_VM_BIND_FLAG_IMMEDIATE) > > > > > > + return false; > > > > > > + > > > > > > + return true; > > > > > > +} > > > > > > + > > > > > > /* > > > > > > * Create operations list from IOCTL arguments, setup > > > operations > > > > > > fields so parse > > > > > > * and commit steps are decoupled from IOCTL arguments. > This > > > > > step > > > > > > can fail. > > > > > > @@ -1991,6 +2008,8 @@ vm_bind_ioctl_ops_create(struct > > > xe_vm > > > > > *vm, > > > > > > struct xe_bo *bo, > > > > > > op->map.is_null = flags & > > > > > > DRM_XE_VM_BIND_FLAG_NULL; > > > > > > op->map.dumpable = flags & > > > > > > DRM_XE_VM_BIND_FLAG_DUMPABLE; > > > > > > op->map.pat_index = pat_index; > > > > > > + op->map.invalidate_on_bind = > > > > > > + > > > > > __xe_vm_needs_clear_scratch_pages(vm > > > > > > , flags); > > > > > > } else if (__op->op == > > > > > > DRM_GPUVA_OP_PREFETCH) { > > > > > > op->prefetch.region = > > > > > > prefetch_region; > > > > > > } > > > > > > @@ -2188,7 +2207,8 @@ static int > > > vm_bind_ioctl_ops_parse(struct > > > > > xe_vm > > > > > > *vm, struct drm_gpuva_ops *ops, > > > > > > return PTR_ERR(vma); > > > > > > > > > > > > op->map.vma = vma; > > > > > > - if (op->map.immediate || > > > > > > !xe_vm_in_fault_mode(vm)) > > > > > > + if (op->map.immediate || > > > > > > !xe_vm_in_fault_mode(vm) || > > > > > > + op->map.invalidate_on_bind) > > > > > > > > > > > xe_vma_ops_incr_pt_update_ops(vops, > > > > > > > > > > > > op- > > > > > > > tile_mask); > > > > > > break; > > > > > > @@ -2416,9 +2436,10 @@ static int > op_lock_and_prep(struct > > > > > drm_exec > > > > > > *exec, struct xe_vm *vm, > > > > > > > > > > > > switch (op->base.op) { > > > > > > case DRM_GPUVA_OP_MAP: > > > > > > - err = vma_lock_and_validate(exec, op- > > > > > > >map.vma, > > > > > > - > > > > > > !xe_vm_in_fault_mode(vm) > > > > > > > > > > > > > > - op- > > > > > > >map.immediate); > > > > > > + if (!op->map.invalidate_on_bind) > > > > > > + err = vma_lock_and_validate(exec, > > > > > > op- > > > > > > > map.vma, > > > > > > + > > > > > > !xe_vm_in_fault_mode(vm) || > > > > > > + op- > > > > > > > map.immediate); > > > > > > break; > > > > > > case DRM_GPUVA_OP_REMAP: > > > > > > err = check_ufence(gpuva_to_vma(op- > > > > > > > base.remap.unmap->va)); > > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h > > > > > > b/drivers/gpu/drm/xe/xe_vm_types.h > > > > > > index 52467b9b5348..dace04f4ea5e 100644 > > > > > > --- a/drivers/gpu/drm/xe/xe_vm_types.h > > > > > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h > > > > > > @@ -297,6 +297,8 @@ struct xe_vma_op_map { > > > > > > bool is_null; > > > > > > /** @dumpable: whether BO is dumped on GPU > hang */ > > > > > > bool dumpable; > > > > > > + /** @invalidate: invalidate the VMA before bind */ > > > > > > + bool invalidate_on_bind; > > > > > > /** @pat_index: The pat index to use for this > > > > > > operation. > > > > > > */ > > > > > > u16 pat_index; > > > > > > }; > > > > > > ^ permalink raw reply [flat|nested] 32+ messages in thread
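A minimal sketch of the huge-PTE fix agreed above, i.e. letting clear_pt benefit from the existing is_null logic without actually setting is_null. The helper name and placement below are assumptions, not posted code; the idea is that the huge-PTE possibility check skips the dma-address validation for a clearing bind the same way it already does for a NULL VMA, while the XE_PTE_PS64 hint stays excluded for clear_pt as noted in the exchange:

	/*
	 * Sketch only: a clearing bind has no backing store to validate,
	 * so treat it like a NULL binding when deciding whether a huge
	 * PTE is possible. Helper name is hypothetical.
	 */
	static bool xe_pt_bind_has_no_backing(const struct xe_pt_stage_bind_walk *xe_walk)
	{
		return xe_vma_is_null(xe_walk->vma) || xe_walk->clear_pt;
	}

xe_pt_hugepte_possible() would then test this predicate where it currently tests xe_vma_is_null() alone, so huge PTEs are still detected (and zeroed with a single entry) on the clear_pt walk.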
* Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-02-04 18:45 ` [PATCH 2/3] drm/xe: Clear scratch page before vm_bind Oak Zeng 2025-02-05 14:35 ` Thomas Hellström @ 2025-02-06 10:34 ` Matthew Brost 2025-02-06 10:43 ` Thomas Hellström 2025-02-06 15:16 ` Zeng, Oak 1 sibling, 2 replies; 32+ messages in thread From: Matthew Brost @ 2025-02-06 10:34 UTC (permalink / raw) To: Oak Zeng; +Cc: intel-xe, Thomas.Hellstrom, jonathan.cavitt On Tue, Feb 04, 2025 at 01:45:57PM -0500, Oak Zeng wrote: > When a vm runs under fault mode, if scratch page is enabled, we need > to clear the scratch page mapping before vm_bind for the vm_bind > address range. Under fault mode, we depend on recoverable page fault > to establish mapping in page table. If scratch page is not cleared, > GPU access of address won't cause page fault because it always hits > the existing scratch page mapping. > > When vm_bind with IMMEDIATE flag, there is no need of clearing as > immediate bind can overwrite the scratch page mapping. > > So far only is xe2 and xe3 products are allowed to enable scratch page > under fault mode. On other platform we don't allow scratch page under > fault mode, so no need of such clearing. > > v2: Rework vm_bind pipeline to clear scratch page mapping. This is similar > to a map operation, with the exception that PTEs are cleared instead of > pointing to valid physical pages. (Matt, Thomas) > > TLB invalidation is needed after clear scratch page mapping as larger > scratch page mapping could be backed by physical page and cached in > TLB. (Matt, Thomas) > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> Given the complexity of the VM bind path, I think we need an IGT posted with this series before merging as I suggested in v1. Without it, it will be fairly difficult to ensure correctness by reviews only. Matt > --- > drivers/gpu/drm/xe/xe_pt.c | 66 ++++++++++++++++++++++---------- > drivers/gpu/drm/xe/xe_pt_types.h | 2 + > drivers/gpu/drm/xe/xe_vm.c | 29 ++++++++++++-- > drivers/gpu/drm/xe/xe_vm_types.h | 2 + > 4 files changed, 75 insertions(+), 24 deletions(-) > > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c > index 1ddcc7e79a93..3fd0ae2dbe7d 100644 > --- a/drivers/gpu/drm/xe/xe_pt.c > +++ b/drivers/gpu/drm/xe/xe_pt.c > @@ -268,6 +268,8 @@ struct xe_pt_stage_bind_walk { > * granularity. > */ > bool needs_64K; > + /* @clear_pt: clear page table entries during the bind walk */ > + bool clear_pt; > /** > * @vma: VMA being mapped > */ > @@ -497,21 +499,25 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset, > > XE_WARN_ON(xe_walk->va_curs_start != addr); > > - pte = vm->pt_ops->pte_encode_vma(is_null ? 0 : > - xe_res_dma(curs) + xe_walk->dma_offset, > - xe_walk->vma, pat_index, level); > - pte |= xe_walk->default_pte; > + if (xe_walk->clear_pt) { > + pte = 0; > + } else { > + pte = vm->pt_ops->pte_encode_vma(is_null ? 0 : > + xe_res_dma(curs) + xe_walk->dma_offset, > + xe_walk->vma, pat_index, level); > + pte |= xe_walk->default_pte; > > - /* > - * Set the XE_PTE_PS64 hint if possible, otherwise if > - * this device *requires* 64K PTE size for VRAM, fail. > - */ > - if (level == 0 && !xe_parent->is_compact) { > - if (xe_pt_is_pte_ps64K(addr, next, xe_walk)) { > - xe_walk->vma->gpuva.flags |= XE_VMA_PTE_64K; > - pte |= XE_PTE_PS64; > - } else if (XE_WARN_ON(xe_walk->needs_64K)) { > - return -EINVAL; > + /* > + * Set the XE_PTE_PS64 hint if possible, otherwise if > + * this device *requires* 64K PTE size for VRAM, fail. 
> + */ > + if (level == 0 && !xe_parent->is_compact) { > + if (xe_pt_is_pte_ps64K(addr, next, xe_walk)) { > + xe_walk->vma->gpuva.flags |= XE_VMA_PTE_64K; > + pte |= XE_PTE_PS64; > + } else if (XE_WARN_ON(xe_walk->needs_64K)) { > + return -EINVAL; > + } > } > } > > @@ -519,7 +525,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset, > if (unlikely(ret)) > return ret; > > - if (!is_null) > + if (!is_null && !xe_walk->clear_pt) > xe_res_next(curs, next - addr); > xe_walk->va_curs_start = next; > xe_walk->vma->gpuva.flags |= (XE_VMA_PTE_4K << level); > @@ -589,6 +595,7 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = { > * @vma: The vma indicating the address range. > * @entries: Storage for the update entries used for connecting the tree to > * the main tree at commit time. > + * @clear_pt: Clear the page table entries. > * @num_entries: On output contains the number of @entries used. > * > * This function builds a disconnected page-table tree for a given address > @@ -602,7 +609,8 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = { > */ > static int > xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, > - struct xe_vm_pgtable_update *entries, u32 *num_entries) > + struct xe_vm_pgtable_update *entries, > + bool clear_pt, u32 *num_entries) > { > struct xe_device *xe = tile_to_xe(tile); > struct xe_bo *bo = xe_vma_bo(vma); > @@ -622,10 +630,19 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, > .vma = vma, > .wupd.entries = entries, > .needs_64K = (xe_vma_vm(vma)->flags & XE_VM_FLAG_64K) && is_devmem, > + .clear_pt = clear_pt, > }; > struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id]; > int ret; > > + if (clear_pt) { > + ret = xe_pt_walk_range(&pt->base, pt->level, xe_vma_start(vma), > + xe_vma_end(vma), &xe_walk.base); > + > + *num_entries = xe_walk.wupd.num_used_entries; > + return ret; > + } > + > /** > * Default atomic expectations for different allocation scenarios are as follows: > * > @@ -981,12 +998,14 @@ static void xe_pt_free_bind(struct xe_vm_pgtable_update *entries, > > static int > xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma, > - struct xe_vm_pgtable_update *entries, u32 *num_entries) > + struct xe_vm_pgtable_update *entries, > + bool invalidate_on_bind, u32 *num_entries) > { > int err; > > *num_entries = 0; > - err = xe_pt_stage_bind(tile, vma, entries, num_entries); > + err = xe_pt_stage_bind(tile, vma, entries, invalidate_on_bind, > + num_entries); > if (!err) > xe_tile_assert(tile, *num_entries); > > @@ -1661,6 +1680,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile, > return err; > > err = xe_pt_prepare_bind(tile, vma, pt_op->entries, > + pt_update_ops->invalidate_on_bind, > &pt_op->num_entries); > if (!err) { > xe_tile_assert(tile, pt_op->num_entries <= > @@ -1685,7 +1705,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile, > * it needs to be done here. 
> */ > if ((!pt_op->rebind && xe_vm_has_scratch(vm) && > - xe_vm_in_preempt_fence_mode(vm))) > + xe_vm_in_preempt_fence_mode(vm)) || pt_update_ops->invalidate_on_bind) > pt_update_ops->needs_invalidation = true; > else if (pt_op->rebind && !xe_vm_in_lr_mode(vm)) > /* We bump also if batch_invalidate_tlb is true */ > @@ -1759,9 +1779,13 @@ static int op_prepare(struct xe_vm *vm, > > switch (op->base.op) { > case DRM_GPUVA_OP_MAP: > - if (!op->map.immediate && xe_vm_in_fault_mode(vm)) > + if (!op->map.immediate && xe_vm_in_fault_mode(vm) && > + !op->map.invalidate_on_bind) > break; > > + if (op->map.invalidate_on_bind) > + pt_update_ops->invalidate_on_bind = true; > + > err = bind_op_prepare(vm, tile, pt_update_ops, op->map.vma); > pt_update_ops->wait_vm_kernel = true; > break; > @@ -1871,6 +1895,8 @@ static void bind_op_commit(struct xe_vm *vm, struct xe_tile *tile, > } > vma->tile_present |= BIT(tile->id); > vma->tile_staged &= ~BIT(tile->id); > + if (pt_update_ops->invalidate_on_bind) > + vma->tile_invalidated |= BIT(tile->id); > if (xe_vma_is_userptr(vma)) { > lockdep_assert_held_read(&vm->userptr.notifier_lock); > to_userptr_vma(vma)->userptr.initial_bind = true; > diff --git a/drivers/gpu/drm/xe/xe_pt_types.h b/drivers/gpu/drm/xe/xe_pt_types.h > index 384cc04de719..3d0aa2a5102e 100644 > --- a/drivers/gpu/drm/xe/xe_pt_types.h > +++ b/drivers/gpu/drm/xe/xe_pt_types.h > @@ -108,6 +108,8 @@ struct xe_vm_pgtable_update_ops { > bool needs_userptr_lock; > /** @needs_invalidation: Needs invalidation */ > bool needs_invalidation; > + /** @invalidate_on_bind: Invalidate the range before bind */ > + bool invalidate_on_bind; > /** > * @wait_vm_bookkeep: PT operations need to wait until VM is idle > * (bookkeep dma-resv slots are idle) and stage all future VM activity > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c > index d664f2e418b2..813d893d9b63 100644 > --- a/drivers/gpu/drm/xe/xe_vm.c > +++ b/drivers/gpu/drm/xe/xe_vm.c > @@ -1921,6 +1921,23 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op) > } > #endif > > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_vm *vm, u32 bind_flags) > +{ > + if (!xe_vm_in_fault_mode(vm)) > + return false; > + > + if (!NEEDS_SCRATCH(vm->xe)) > + return false; > + > + if (!xe_vm_has_scratch(vm)) > + return false; > + > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > + return false; > + > + return true; > +} > + > /* > * Create operations list from IOCTL arguments, setup operations fields so parse > * and commit steps are decoupled from IOCTL arguments. This step can fail. 
> @@ -1991,6 +2008,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo, > op->map.is_null = flags & DRM_XE_VM_BIND_FLAG_NULL; > op->map.dumpable = flags & DRM_XE_VM_BIND_FLAG_DUMPABLE; > op->map.pat_index = pat_index; > + op->map.invalidate_on_bind = > + __xe_vm_needs_clear_scratch_pages(vm, flags); > } else if (__op->op == DRM_GPUVA_OP_PREFETCH) { > op->prefetch.region = prefetch_region; > } > @@ -2188,7 +2207,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops, > return PTR_ERR(vma); > > op->map.vma = vma; > - if (op->map.immediate || !xe_vm_in_fault_mode(vm)) > + if (op->map.immediate || !xe_vm_in_fault_mode(vm) || > + op->map.invalidate_on_bind) > xe_vma_ops_incr_pt_update_ops(vops, > op->tile_mask); > break; > @@ -2416,9 +2436,10 @@ static int op_lock_and_prep(struct drm_exec *exec, struct xe_vm *vm, > > switch (op->base.op) { > case DRM_GPUVA_OP_MAP: > - err = vma_lock_and_validate(exec, op->map.vma, > - !xe_vm_in_fault_mode(vm) || > - op->map.immediate); > + if (!op->map.invalidate_on_bind) > + err = vma_lock_and_validate(exec, op->map.vma, > + !xe_vm_in_fault_mode(vm) || > + op->map.immediate); > break; > case DRM_GPUVA_OP_REMAP: > err = check_ufence(gpuva_to_vma(op->base.remap.unmap->va)); > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h > index 52467b9b5348..dace04f4ea5e 100644 > --- a/drivers/gpu/drm/xe/xe_vm_types.h > +++ b/drivers/gpu/drm/xe/xe_vm_types.h > @@ -297,6 +297,8 @@ struct xe_vma_op_map { > bool is_null; > /** @dumpable: whether BO is dumped on GPU hang */ > bool dumpable; > + /** @invalidate: invalidate the VMA before bind */ > + bool invalidate_on_bind; > /** @pat_index: The pat index to use for this operation. */ > u16 pat_index; > }; > -- > 2.26.3 > ^ permalink raw reply [flat|nested] 32+ messages in thread
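For reference, the direction Thomas and Oak converge on earlier in the thread -- widening the !rebind scratch-invalidation case from preempt-fence VMs to all long-running VMs -- would make the bind_op_prepare() condition quoted above look roughly like the following. This is a sketch of the agreed change, not the v2 hunk:

	/*
	 * Sketch: with scratch enabled, a stale scratch PTE may be
	 * cached in the TLB for any long-running VM, faulting or not,
	 * so key on xe_vm_in_lr_mode() rather than
	 * xe_vm_in_preempt_fence_mode() for the !rebind case.
	 */
	if ((!pt_op->rebind && xe_vm_has_scratch(vm) &&
	     xe_vm_in_lr_mode(vm)) ||
	    pt_update_ops->invalidate_on_bind)
		pt_update_ops->needs_invalidation = true;

Oak indicates above that this will be folded into the current patch rather than split out as a separate fix.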
* Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-02-06 10:34 ` Matthew Brost @ 2025-02-06 10:43 ` Thomas Hellström 2025-02-10 17:08 ` Matthew Brost 2025-02-06 15:16 ` Zeng, Oak 1 sibling, 1 reply; 32+ messages in thread From: Thomas Hellström @ 2025-02-06 10:43 UTC (permalink / raw) To: Matthew Brost, Oak Zeng; +Cc: intel-xe, jonathan.cavitt Hi, Matt, On Thu, 2025-02-06 at 02:34 -0800, Matthew Brost wrote: > On Tue, Feb 04, 2025 at 01:45:57PM -0500, Oak Zeng wrote: > > When a vm runs under fault mode, if scratch page is enabled, we > > need > > to clear the scratch page mapping before vm_bind for the vm_bind > > address range. Under fault mode, we depend on recoverable page > > fault > > to establish mapping in page table. If scratch page is not cleared, > > GPU access of address won't cause page fault because it always hits > > the existing scratch page mapping. > > > > When vm_bind with IMMEDIATE flag, there is no need of clearing as > > immediate bind can overwrite the scratch page mapping. > > > > So far only is xe2 and xe3 products are allowed to enable scratch > > page > > under fault mode. On other platform we don't allow scratch page > > under > > fault mode, so no need of such clearing. > > > > v2: Rework vm_bind pipeline to clear scratch page mapping. This is > > similar > > to a map operation, with the exception that PTEs are cleared > > instead of > > pointing to valid physical pages. (Matt, Thomas) > > > > TLB invalidation is needed after clear scratch page mapping as > > larger > > scratch page mapping could be backed by physical page and cached in > > TLB. (Matt, Thomas) > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > Given the complexity of the VM bind path, I think we need an IGT > posted > with this series before merging as I suggested in v1. Without it, it > will be fairly difficult to ensure correctness by reviews only. There is an igt posted that exercises xe_exec_fault_mode + scratch pages. Although you might have had something more elaborate in mind? /Thomas > > Matt > > > --- > > drivers/gpu/drm/xe/xe_pt.c | 66 ++++++++++++++++++++++------ > > ---- > > drivers/gpu/drm/xe/xe_pt_types.h | 2 + > > drivers/gpu/drm/xe/xe_vm.c | 29 ++++++++++++-- > > drivers/gpu/drm/xe/xe_vm_types.h | 2 + > > 4 files changed, 75 insertions(+), 24 deletions(-) > > > > diff --git a/drivers/gpu/drm/xe/xe_pt.c > > b/drivers/gpu/drm/xe/xe_pt.c > > index 1ddcc7e79a93..3fd0ae2dbe7d 100644 > > --- a/drivers/gpu/drm/xe/xe_pt.c > > +++ b/drivers/gpu/drm/xe/xe_pt.c > > @@ -268,6 +268,8 @@ struct xe_pt_stage_bind_walk { > > * granularity. > > */ > > bool needs_64K; > > + /* @clear_pt: clear page table entries during the bind > > walk */ > > + bool clear_pt; > > /** > > * @vma: VMA being mapped > > */ > > @@ -497,21 +499,25 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, > > pgoff_t offset, > > > > XE_WARN_ON(xe_walk->va_curs_start != addr); > > > > - pte = vm->pt_ops->pte_encode_vma(is_null ? 0 : > > - xe_res_dma(curs) > > + xe_walk->dma_offset, > > - xe_walk->vma, > > pat_index, level); > > - pte |= xe_walk->default_pte; > > + if (xe_walk->clear_pt) { > > + pte = 0; > > + } else { > > + pte = vm->pt_ops->pte_encode_vma(is_null ? > > 0 : > > + xe_res_dma(curs) + > > xe_walk->dma_offset, > > + xe_walk->vma, pat_index, > > level); > > + pte |= xe_walk->default_pte; > > > > - /* > > - * Set the XE_PTE_PS64 hint if possible, otherwise > > if > > - * this device *requires* 64K PTE size for VRAM, > > fail. 
> > - */ > > - if (level == 0 && !xe_parent->is_compact) { > > - if (xe_pt_is_pte_ps64K(addr, next, > > xe_walk)) { > > - xe_walk->vma->gpuva.flags |= > > XE_VMA_PTE_64K; > > - pte |= XE_PTE_PS64; > > - } else if (XE_WARN_ON(xe_walk->needs_64K)) > > { > > - return -EINVAL; > > + /* > > + * Set the XE_PTE_PS64 hint if possible, > > otherwise if > > + * this device *requires* 64K PTE size for > > VRAM, fail. > > + */ > > + if (level == 0 && !xe_parent->is_compact) > > { > > + if (xe_pt_is_pte_ps64K(addr, next, > > xe_walk)) { > > + xe_walk->vma->gpuva.flags > > |= XE_VMA_PTE_64K; > > + pte |= XE_PTE_PS64; > > + } else if (XE_WARN_ON(xe_walk- > > >needs_64K)) { > > + return -EINVAL; > > + } > > } > > } > > > > @@ -519,7 +525,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, > > pgoff_t offset, > > if (unlikely(ret)) > > return ret; > > > > - if (!is_null) > > + if (!is_null && !xe_walk->clear_pt) > > xe_res_next(curs, next - addr); > > xe_walk->va_curs_start = next; > > xe_walk->vma->gpuva.flags |= (XE_VMA_PTE_4K << > > level); > > @@ -589,6 +595,7 @@ static const struct xe_pt_walk_ops > > xe_pt_stage_bind_ops = { > > * @vma: The vma indicating the address range. > > * @entries: Storage for the update entries used for connecting > > the tree to > > * the main tree at commit time. > > + * @clear_pt: Clear the page table entries. > > * @num_entries: On output contains the number of @entries used. > > * > > * This function builds a disconnected page-table tree for a given > > address > > @@ -602,7 +609,8 @@ static const struct xe_pt_walk_ops > > xe_pt_stage_bind_ops = { > > */ > > static int > > xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, > > - struct xe_vm_pgtable_update *entries, u32 > > *num_entries) > > + struct xe_vm_pgtable_update *entries, > > + bool clear_pt, u32 *num_entries) > > { > > struct xe_device *xe = tile_to_xe(tile); > > struct xe_bo *bo = xe_vma_bo(vma); > > @@ -622,10 +630,19 @@ xe_pt_stage_bind(struct xe_tile *tile, struct > > xe_vma *vma, > > .vma = vma, > > .wupd.entries = entries, > > .needs_64K = (xe_vma_vm(vma)->flags & > > XE_VM_FLAG_64K) && is_devmem, > > + .clear_pt = clear_pt, > > }; > > struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id]; > > int ret; > > > > + if (clear_pt) { > > + ret = xe_pt_walk_range(&pt->base, pt->level, > > xe_vma_start(vma), > > + xe_vma_end(vma), > > &xe_walk.base); > > + > > + *num_entries = xe_walk.wupd.num_used_entries; > > + return ret; > > + } > > + > > /** > > * Default atomic expectations for different allocation > > scenarios are as follows: > > * > > @@ -981,12 +998,14 @@ static void xe_pt_free_bind(struct > > xe_vm_pgtable_update *entries, > > > > static int > > xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma, > > - struct xe_vm_pgtable_update *entries, u32 > > *num_entries) > > + struct xe_vm_pgtable_update *entries, > > + bool invalidate_on_bind, u32 *num_entries) > > { > > int err; > > > > *num_entries = 0; > > - err = xe_pt_stage_bind(tile, vma, entries, num_entries); > > + err = xe_pt_stage_bind(tile, vma, entries, > > invalidate_on_bind, > > + num_entries); > > if (!err) > > xe_tile_assert(tile, *num_entries); > > > > @@ -1661,6 +1680,7 @@ static int bind_op_prepare(struct xe_vm *vm, > > struct xe_tile *tile, > > return err; > > > > err = xe_pt_prepare_bind(tile, vma, pt_op->entries, > > + pt_update_ops- > > >invalidate_on_bind, > > &pt_op->num_entries); > > if (!err) { > > xe_tile_assert(tile, pt_op->num_entries <= > > @@ -1685,7 +1705,7 @@ static int bind_op_prepare(struct xe_vm *vm, > > 
struct xe_tile *tile, > > * it needs to be done here. > > */ > > if ((!pt_op->rebind && xe_vm_has_scratch(vm) && > > - xe_vm_in_preempt_fence_mode(vm))) > > + xe_vm_in_preempt_fence_mode(vm)) || > > pt_update_ops->invalidate_on_bind) > > pt_update_ops->needs_invalidation = true; > > else if (pt_op->rebind && !xe_vm_in_lr_mode(vm)) > > /* We bump also if batch_invalidate_tlb is > > true */ > > @@ -1759,9 +1779,13 @@ static int op_prepare(struct xe_vm *vm, > > > > switch (op->base.op) { > > case DRM_GPUVA_OP_MAP: > > - if (!op->map.immediate && xe_vm_in_fault_mode(vm)) > > + if (!op->map.immediate && xe_vm_in_fault_mode(vm) > > && > > + !op->map.invalidate_on_bind) > > break; > > > > + if (op->map.invalidate_on_bind) > > + pt_update_ops->invalidate_on_bind = true; > > + > > err = bind_op_prepare(vm, tile, pt_update_ops, op- > > >map.vma); > > pt_update_ops->wait_vm_kernel = true; > > break; > > @@ -1871,6 +1895,8 @@ static void bind_op_commit(struct xe_vm *vm, > > struct xe_tile *tile, > > } > > vma->tile_present |= BIT(tile->id); > > vma->tile_staged &= ~BIT(tile->id); > > + if (pt_update_ops->invalidate_on_bind) > > + vma->tile_invalidated |= BIT(tile->id); > > if (xe_vma_is_userptr(vma)) { > > lockdep_assert_held_read(&vm- > > >userptr.notifier_lock); > > to_userptr_vma(vma)->userptr.initial_bind = true; > > diff --git a/drivers/gpu/drm/xe/xe_pt_types.h > > b/drivers/gpu/drm/xe/xe_pt_types.h > > index 384cc04de719..3d0aa2a5102e 100644 > > --- a/drivers/gpu/drm/xe/xe_pt_types.h > > +++ b/drivers/gpu/drm/xe/xe_pt_types.h > > @@ -108,6 +108,8 @@ struct xe_vm_pgtable_update_ops { > > bool needs_userptr_lock; > > /** @needs_invalidation: Needs invalidation */ > > bool needs_invalidation; > > + /** @invalidate_on_bind: Invalidate the range before bind > > */ > > + bool invalidate_on_bind; > > /** > > * @wait_vm_bookkeep: PT operations need to wait until VM > > is idle > > * (bookkeep dma-resv slots are idle) and stage all future > > VM activity > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > b/drivers/gpu/drm/xe/xe_vm.c > > index d664f2e418b2..813d893d9b63 100644 > > --- a/drivers/gpu/drm/xe/xe_vm.c > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > @@ -1921,6 +1921,23 @@ static void print_op(struct xe_device *xe, > > struct drm_gpuva_op *op) > > } > > #endif > > > > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_vm *vm, > > u32 bind_flags) > > +{ > > + if (!xe_vm_in_fault_mode(vm)) > > + return false; > > + > > + if (!NEEDS_SCRATCH(vm->xe)) > > + return false; > > + > > + if (!xe_vm_has_scratch(vm)) > > + return false; > > + > > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > > + return false; > > + > > + return true; > > +} > > + > > /* > > * Create operations list from IOCTL arguments, setup operations > > fields so parse > > * and commit steps are decoupled from IOCTL arguments. This step > > can fail. 
> > @@ -1991,6 +2008,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, > > struct xe_bo *bo, > > op->map.is_null = flags & > > DRM_XE_VM_BIND_FLAG_NULL; > > op->map.dumpable = flags & > > DRM_XE_VM_BIND_FLAG_DUMPABLE; > > op->map.pat_index = pat_index; > > + op->map.invalidate_on_bind = > > + __xe_vm_needs_clear_scratch_pages( > > vm, flags); > > } else if (__op->op == DRM_GPUVA_OP_PREFETCH) { > > op->prefetch.region = prefetch_region; > > } > > @@ -2188,7 +2207,8 @@ static int vm_bind_ioctl_ops_parse(struct > > xe_vm *vm, struct drm_gpuva_ops *ops, > > return PTR_ERR(vma); > > > > op->map.vma = vma; > > - if (op->map.immediate || > > !xe_vm_in_fault_mode(vm)) > > + if (op->map.immediate || > > !xe_vm_in_fault_mode(vm) || > > + op->map.invalidate_on_bind) > > xe_vma_ops_incr_pt_update_ops(vops > > , > > op- > > >tile_mask); > > break; > > @@ -2416,9 +2436,10 @@ static int op_lock_and_prep(struct drm_exec > > *exec, struct xe_vm *vm, > > > > switch (op->base.op) { > > case DRM_GPUVA_OP_MAP: > > - err = vma_lock_and_validate(exec, op->map.vma, > > - > > !xe_vm_in_fault_mode(vm) || > > - op->map.immediate); > > + if (!op->map.invalidate_on_bind) > > + err = vma_lock_and_validate(exec, op- > > >map.vma, > > + > > !xe_vm_in_fault_mode(vm) || > > + op- > > >map.immediate); > > break; > > case DRM_GPUVA_OP_REMAP: > > err = check_ufence(gpuva_to_vma(op- > > >base.remap.unmap->va)); > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h > > b/drivers/gpu/drm/xe/xe_vm_types.h > > index 52467b9b5348..dace04f4ea5e 100644 > > --- a/drivers/gpu/drm/xe/xe_vm_types.h > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h > > @@ -297,6 +297,8 @@ struct xe_vma_op_map { > > bool is_null; > > /** @dumpable: whether BO is dumped on GPU hang */ > > bool dumpable; > > + /** @invalidate: invalidate the VMA before bind */ > > + bool invalidate_on_bind; > > /** @pat_index: The pat index to use for this operation. > > */ > > u16 pat_index; > > }; > > -- > > 2.26.3 > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-02-06 10:43 ` Thomas Hellström @ 2025-02-10 17:08 ` Matthew Brost 0 siblings, 0 replies; 32+ messages in thread From: Matthew Brost @ 2025-02-10 17:08 UTC (permalink / raw) To: Thomas Hellström; +Cc: Oak Zeng, intel-xe, jonathan.cavitt On Thu, Feb 06, 2025 at 11:43:10AM +0100, Thomas Hellström wrote: > Hi, Matt, > > On Thu, 2025-02-06 at 02:34 -0800, Matthew Brost wrote: > > On Tue, Feb 04, 2025 at 01:45:57PM -0500, Oak Zeng wrote: > > > When a vm runs under fault mode, if scratch page is enabled, we > > > need > > > to clear the scratch page mapping before vm_bind for the vm_bind > > > address range. Under fault mode, we depend on recoverable page > > > fault > > > to establish mapping in page table. If scratch page is not cleared, > > > GPU access of address won't cause page fault because it always hits > > > the existing scratch page mapping. > > > > > > When vm_bind with IMMEDIATE flag, there is no need of clearing as > > > immediate bind can overwrite the scratch page mapping. > > > > > > So far only is xe2 and xe3 products are allowed to enable scratch > > > page > > > under fault mode. On other platform we don't allow scratch page > > > under > > > fault mode, so no need of such clearing. > > > > > > v2: Rework vm_bind pipeline to clear scratch page mapping. This is > > > similar > > > to a map operation, with the exception that PTEs are cleared > > > instead of > > > pointing to valid physical pages. (Matt, Thomas) > > > > > > TLB invalidation is needed after clear scratch page mapping as > > > larger > > > scratch page mapping could be backed by physical page and cached in > > > TLB. (Matt, Thomas) > > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > > > Given the complexity of the VM bind path, I think we need an IGT > > posted > > with this series before merging as I suggested in v1. Without it, it > > will be fairly difficult to ensure correctness by reviews only. > > There is an igt posted that exercises xe_exec_fault_mode + scratch > pages. Although you might have had something more elaborate in mind? > Ah, missed this as generally look at cover letter to find associated IGTs. A quick look at IGT and thinking we need to do slightly better. Off the top of my head something like: - Run a batch which does a DW write to scratch address - Bind scratch address /w IMMEDIATE cleared - Run a batch which does a DW write to newly bound address, verfiy result Matt > /Thomas > > > > > > Matt > > > > > --- > > > drivers/gpu/drm/xe/xe_pt.c | 66 ++++++++++++++++++++++------ > > > ---- > > > drivers/gpu/drm/xe/xe_pt_types.h | 2 + > > > drivers/gpu/drm/xe/xe_vm.c | 29 ++++++++++++-- > > > drivers/gpu/drm/xe/xe_vm_types.h | 2 + > > > 4 files changed, 75 insertions(+), 24 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/xe/xe_pt.c > > > b/drivers/gpu/drm/xe/xe_pt.c > > > index 1ddcc7e79a93..3fd0ae2dbe7d 100644 > > > --- a/drivers/gpu/drm/xe/xe_pt.c > > > +++ b/drivers/gpu/drm/xe/xe_pt.c > > > @@ -268,6 +268,8 @@ struct xe_pt_stage_bind_walk { > > > * granularity. > > > */ > > > bool needs_64K; > > > + /* @clear_pt: clear page table entries during the bind > > > walk */ > > > + bool clear_pt; > > > /** > > > * @vma: VMA being mapped > > > */ > > > @@ -497,21 +499,25 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, > > > pgoff_t offset, > > > > > > XE_WARN_ON(xe_walk->va_curs_start != addr); > > > > > > - pte = vm->pt_ops->pte_encode_vma(is_null ? 
0 : > > > - xe_res_dma(curs) > > > + xe_walk->dma_offset, > > > - xe_walk->vma, > > > pat_index, level); > > > - pte |= xe_walk->default_pte; > > > + if (xe_walk->clear_pt) { > > > + pte = 0; > > > + } else { > > > + pte = vm->pt_ops->pte_encode_vma(is_null ? > > > 0 : > > > + xe_res_dma(curs) + > > > xe_walk->dma_offset, > > > + xe_walk->vma, pat_index, > > > level); > > > + pte |= xe_walk->default_pte; > > > > > > - /* > > > - * Set the XE_PTE_PS64 hint if possible, otherwise > > > if > > > - * this device *requires* 64K PTE size for VRAM, > > > fail. > > > - */ > > > - if (level == 0 && !xe_parent->is_compact) { > > > - if (xe_pt_is_pte_ps64K(addr, next, > > > xe_walk)) { > > > - xe_walk->vma->gpuva.flags |= > > > XE_VMA_PTE_64K; > > > - pte |= XE_PTE_PS64; > > > - } else if (XE_WARN_ON(xe_walk->needs_64K)) > > > { > > > - return -EINVAL; > > > + /* > > > + * Set the XE_PTE_PS64 hint if possible, > > > otherwise if > > > + * this device *requires* 64K PTE size for > > > VRAM, fail. > > > + */ > > > + if (level == 0 && !xe_parent->is_compact) > > > { > > > + if (xe_pt_is_pte_ps64K(addr, next, > > > xe_walk)) { > > > + xe_walk->vma->gpuva.flags > > > |= XE_VMA_PTE_64K; > > > + pte |= XE_PTE_PS64; > > > + } else if (XE_WARN_ON(xe_walk- > > > >needs_64K)) { > > > + return -EINVAL; > > > + } > > > } > > > } > > > > > > @@ -519,7 +525,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, > > > pgoff_t offset, > > > if (unlikely(ret)) > > > return ret; > > > > > > - if (!is_null) > > > + if (!is_null && !xe_walk->clear_pt) > > > xe_res_next(curs, next - addr); > > > xe_walk->va_curs_start = next; > > > xe_walk->vma->gpuva.flags |= (XE_VMA_PTE_4K << > > > level); > > > @@ -589,6 +595,7 @@ static const struct xe_pt_walk_ops > > > xe_pt_stage_bind_ops = { > > > * @vma: The vma indicating the address range. > > > * @entries: Storage for the update entries used for connecting > > > the tree to > > > * the main tree at commit time. > > > + * @clear_pt: Clear the page table entries. > > > * @num_entries: On output contains the number of @entries used. 
> > > * > > > * This function builds a disconnected page-table tree for a given > > > address > > > @@ -602,7 +609,8 @@ static const struct xe_pt_walk_ops > > > xe_pt_stage_bind_ops = { > > > */ > > > static int > > > xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, > > > - struct xe_vm_pgtable_update *entries, u32 > > > *num_entries) > > > + struct xe_vm_pgtable_update *entries, > > > + bool clear_pt, u32 *num_entries) > > > { > > > struct xe_device *xe = tile_to_xe(tile); > > > struct xe_bo *bo = xe_vma_bo(vma); > > > @@ -622,10 +630,19 @@ xe_pt_stage_bind(struct xe_tile *tile, struct > > > xe_vma *vma, > > > .vma = vma, > > > .wupd.entries = entries, > > > .needs_64K = (xe_vma_vm(vma)->flags & > > > XE_VM_FLAG_64K) && is_devmem, > > > + .clear_pt = clear_pt, > > > }; > > > struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id]; > > > int ret; > > > > > > + if (clear_pt) { > > > + ret = xe_pt_walk_range(&pt->base, pt->level, > > > xe_vma_start(vma), > > > + xe_vma_end(vma), > > > &xe_walk.base); > > > + > > > + *num_entries = xe_walk.wupd.num_used_entries; > > > + return ret; > > > + } > > > + > > > /** > > > * Default atomic expectations for different allocation > > > scenarios are as follows: > > > * > > > @@ -981,12 +998,14 @@ static void xe_pt_free_bind(struct > > > xe_vm_pgtable_update *entries, > > > > > > static int > > > xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma, > > > - struct xe_vm_pgtable_update *entries, u32 > > > *num_entries) > > > + struct xe_vm_pgtable_update *entries, > > > + bool invalidate_on_bind, u32 *num_entries) > > > { > > > int err; > > > > > > *num_entries = 0; > > > - err = xe_pt_stage_bind(tile, vma, entries, num_entries); > > > + err = xe_pt_stage_bind(tile, vma, entries, > > > invalidate_on_bind, > > > + num_entries); > > > if (!err) > > > xe_tile_assert(tile, *num_entries); > > > > > > @@ -1661,6 +1680,7 @@ static int bind_op_prepare(struct xe_vm *vm, > > > struct xe_tile *tile, > > > return err; > > > > > > err = xe_pt_prepare_bind(tile, vma, pt_op->entries, > > > + pt_update_ops- > > > >invalidate_on_bind, > > > &pt_op->num_entries); > > > if (!err) { > > > xe_tile_assert(tile, pt_op->num_entries <= > > > @@ -1685,7 +1705,7 @@ static int bind_op_prepare(struct xe_vm *vm, > > > struct xe_tile *tile, > > > * it needs to be done here. 
> > > */ > > > if ((!pt_op->rebind && xe_vm_has_scratch(vm) && > > > - xe_vm_in_preempt_fence_mode(vm))) > > > + xe_vm_in_preempt_fence_mode(vm)) || > > > pt_update_ops->invalidate_on_bind) > > > pt_update_ops->needs_invalidation = true; > > > else if (pt_op->rebind && !xe_vm_in_lr_mode(vm)) > > > /* We bump also if batch_invalidate_tlb is > > > true */ > > > @@ -1759,9 +1779,13 @@ static int op_prepare(struct xe_vm *vm, > > > > > > switch (op->base.op) { > > > case DRM_GPUVA_OP_MAP: > > > - if (!op->map.immediate && xe_vm_in_fault_mode(vm)) > > > + if (!op->map.immediate && xe_vm_in_fault_mode(vm) > > > && > > > + !op->map.invalidate_on_bind) > > > break; > > > > > > + if (op->map.invalidate_on_bind) > > > + pt_update_ops->invalidate_on_bind = true; > > > + > > > err = bind_op_prepare(vm, tile, pt_update_ops, op- > > > >map.vma); > > > pt_update_ops->wait_vm_kernel = true; > > > break; > > > @@ -1871,6 +1895,8 @@ static void bind_op_commit(struct xe_vm *vm, > > > struct xe_tile *tile, > > > } > > > vma->tile_present |= BIT(tile->id); > > > vma->tile_staged &= ~BIT(tile->id); > > > + if (pt_update_ops->invalidate_on_bind) > > > + vma->tile_invalidated |= BIT(tile->id); > > > if (xe_vma_is_userptr(vma)) { > > > lockdep_assert_held_read(&vm- > > > >userptr.notifier_lock); > > > to_userptr_vma(vma)->userptr.initial_bind = true; > > > diff --git a/drivers/gpu/drm/xe/xe_pt_types.h > > > b/drivers/gpu/drm/xe/xe_pt_types.h > > > index 384cc04de719..3d0aa2a5102e 100644 > > > --- a/drivers/gpu/drm/xe/xe_pt_types.h > > > +++ b/drivers/gpu/drm/xe/xe_pt_types.h > > > @@ -108,6 +108,8 @@ struct xe_vm_pgtable_update_ops { > > > bool needs_userptr_lock; > > > /** @needs_invalidation: Needs invalidation */ > > > bool needs_invalidation; > > > + /** @invalidate_on_bind: Invalidate the range before bind > > > */ > > > + bool invalidate_on_bind; > > > /** > > > * @wait_vm_bookkeep: PT operations need to wait until VM > > > is idle > > > * (bookkeep dma-resv slots are idle) and stage all future > > > VM activity > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > > b/drivers/gpu/drm/xe/xe_vm.c > > > index d664f2e418b2..813d893d9b63 100644 > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > @@ -1921,6 +1921,23 @@ static void print_op(struct xe_device *xe, > > > struct drm_gpuva_op *op) > > > } > > > #endif > > > > > > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_vm *vm, > > > u32 bind_flags) > > > +{ > > > + if (!xe_vm_in_fault_mode(vm)) > > > + return false; > > > + > > > + if (!NEEDS_SCRATCH(vm->xe)) > > > + return false; > > > + > > > + if (!xe_vm_has_scratch(vm)) > > > + return false; > > > + > > > + if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE) > > > + return false; > > > + > > > + return true; > > > +} > > > + > > > /* > > > * Create operations list from IOCTL arguments, setup operations > > > fields so parse > > > * and commit steps are decoupled from IOCTL arguments. This step > > > can fail. 
> > > @@ -1991,6 +2008,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, > > > struct xe_bo *bo, > > > op->map.is_null = flags & > > > DRM_XE_VM_BIND_FLAG_NULL; > > > op->map.dumpable = flags & > > > DRM_XE_VM_BIND_FLAG_DUMPABLE; > > > op->map.pat_index = pat_index; > > > + op->map.invalidate_on_bind = > > > + __xe_vm_needs_clear_scratch_pages( > > > vm, flags); > > > } else if (__op->op == DRM_GPUVA_OP_PREFETCH) { > > > op->prefetch.region = prefetch_region; > > > } > > > @@ -2188,7 +2207,8 @@ static int vm_bind_ioctl_ops_parse(struct > > > xe_vm *vm, struct drm_gpuva_ops *ops, > > > return PTR_ERR(vma); > > > > > > op->map.vma = vma; > > > - if (op->map.immediate || > > > !xe_vm_in_fault_mode(vm)) > > > + if (op->map.immediate || > > > !xe_vm_in_fault_mode(vm) || > > > + op->map.invalidate_on_bind) > > > xe_vma_ops_incr_pt_update_ops(vops > > > , > > > op- > > > >tile_mask); > > > break; > > > @@ -2416,9 +2436,10 @@ static int op_lock_and_prep(struct drm_exec > > > *exec, struct xe_vm *vm, > > > > > > switch (op->base.op) { > > > case DRM_GPUVA_OP_MAP: > > > - err = vma_lock_and_validate(exec, op->map.vma, > > > - > > > !xe_vm_in_fault_mode(vm) || > > > - op->map.immediate); > > > + if (!op->map.invalidate_on_bind) > > > + err = vma_lock_and_validate(exec, op- > > > >map.vma, > > > + > > > !xe_vm_in_fault_mode(vm) || > > > + op- > > > >map.immediate); > > > break; > > > case DRM_GPUVA_OP_REMAP: > > > err = check_ufence(gpuva_to_vma(op- > > > >base.remap.unmap->va)); > > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h > > > b/drivers/gpu/drm/xe/xe_vm_types.h > > > index 52467b9b5348..dace04f4ea5e 100644 > > > --- a/drivers/gpu/drm/xe/xe_vm_types.h > > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h > > > @@ -297,6 +297,8 @@ struct xe_vma_op_map { > > > bool is_null; > > > /** @dumpable: whether BO is dumped on GPU hang */ > > > bool dumpable; > > > + /** @invalidate: invalidate the VMA before bind */ > > > + bool invalidate_on_bind; > > > /** @pat_index: The pat index to use for this operation. > > > */ > > > u16 pat_index; > > > }; > > > -- > > > 2.26.3 > > > > ^ permalink raw reply [flat|nested] 32+ messages in thread
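Spelled out, the test flow Matt suggests above would be something like the sketch below. The helpers are placeholders, not real igt/xe library calls, and the VM-creation step assumes the fault-mode-plus-scratch combination that patch 3 of this series enables:

	/*
	 * Pseudocode sketch of the suggested IGT. create_faulting_scratch_vm(),
	 * submit_dw_write(), bind_bo_at() and read_back() are hypothetical
	 * helpers standing in for the usual igt/xe library plumbing.
	 */
	static void test_scratch_cleared_on_bind(int fd)
	{
		uint32_t vm = create_faulting_scratch_vm(fd);	/* fault mode + scratch */
		uint64_t addr = 0x1a0000;

		/* 1) DW write to the unbound address: lands in the scratch
		 *    mapping and gets the scratch PTE cached in the TLB.
		 */
		submit_dw_write(fd, vm, addr, 0xdeadbeef);

		/* 2) Bind a BO at the same address with IMMEDIATE cleared, so
		 *    the bind has to zap the scratch mapping and invalidate.
		 */
		bind_bo_at(fd, vm, addr, false /* !DRM_XE_VM_BIND_FLAG_IMMEDIATE */);

		/* 3) DW write to the now-bound address: must fault in the real
		 *    backing pages instead of hitting the stale scratch PTE.
		 */
		submit_dw_write(fd, vm, addr, 0xc0ffee);

		/* 4) Verify the value reached the bound BO. */
		igt_assert(read_back(fd, vm, addr) == 0xc0ffee);
	}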
* RE: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind 2025-02-06 10:34 ` Matthew Brost 2025-02-06 10:43 ` Thomas Hellström @ 2025-02-06 15:16 ` Zeng, Oak 2025-02-10 17:09 ` Matthew Brost 1 sibling, 1 reply; 32+ messages in thread From: Zeng, Oak @ 2025-02-06 15:16 UTC (permalink / raw) To: Brost, Matthew Cc: intel-xe@lists.freedesktop.org, Thomas.Hellstrom@linux.intel.com, Cavitt, Jonathan > -----Original Message----- > From: Brost, Matthew <matthew.brost@intel.com> > Sent: February 6, 2025 5:35 AM > To: Zeng, Oak <oak.zeng@intel.com> > Cc: intel-xe@lists.freedesktop.org; > Thomas.Hellstrom@linux.intel.com; Cavitt, Jonathan > <jonathan.cavitt@intel.com> > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind > > On Tue, Feb 04, 2025 at 01:45:57PM -0500, Oak Zeng wrote: > > When a vm runs under fault mode, if scratch page is enabled, we > need > > to clear the scratch page mapping before vm_bind for the vm_bind > > address range. Under fault mode, we depend on recoverable page > fault > > to establish mapping in page table. If scratch page is not cleared, > > GPU access of address won't cause page fault because it always hits > > the existing scratch page mapping. > > > > When vm_bind with IMMEDIATE flag, there is no need of clearing as > > immediate bind can overwrite the scratch page mapping. > > > > So far only is xe2 and xe3 products are allowed to enable scratch > page > > under fault mode. On other platform we don't allow scratch page > under > > fault mode, so no need of such clearing. > > > > v2: Rework vm_bind pipeline to clear scratch page mapping. This is > similar > > to a map operation, with the exception that PTEs are cleared > instead of > > pointing to valid physical pages. (Matt, Thomas) > > > > TLB invalidation is needed after clear scratch page mapping as larger > > scratch page mapping could be backed by physical page and cached > in > > TLB. (Matt, Thomas) > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com> > > Given the complexity of the VM bind path, I think we need an IGT > posted > with this series before merging as I suggested in v1. Without it, it > will be fairly difficult to ensure correctness by reviews only. I did have an igt posted in patch 3 of this series. The reason I posted on patch 3 is, without patch 3 this "feature" Is not enabled. Oak > > Matt > > > --- > > drivers/gpu/drm/xe/xe_pt.c | 66 ++++++++++++++++++++++-- > -------- > > drivers/gpu/drm/xe/xe_pt_types.h | 2 + > > drivers/gpu/drm/xe/xe_vm.c | 29 ++++++++++++-- > > drivers/gpu/drm/xe/xe_vm_types.h | 2 + > > 4 files changed, 75 insertions(+), 24 deletions(-) > > > > diff --git a/drivers/gpu/drm/xe/xe_pt.c > b/drivers/gpu/drm/xe/xe_pt.c > > index 1ddcc7e79a93..3fd0ae2dbe7d 100644 > > --- a/drivers/gpu/drm/xe/xe_pt.c > > +++ b/drivers/gpu/drm/xe/xe_pt.c > > @@ -268,6 +268,8 @@ struct xe_pt_stage_bind_walk { > > * granularity. > > */ > > bool needs_64K; > > + /* @clear_pt: clear page table entries during the bind walk */ > > + bool clear_pt; > > /** > > * @vma: VMA being mapped > > */ > > @@ -497,21 +499,25 @@ xe_pt_stage_bind_entry(struct xe_ptw > *parent, pgoff_t offset, > > > > XE_WARN_ON(xe_walk->va_curs_start != addr); > > > > - pte = vm->pt_ops->pte_encode_vma(is_null ? 0 : > > - xe_res_dma(curs) + > xe_walk->dma_offset, > > - xe_walk->vma, > pat_index, level); > > - pte |= xe_walk->default_pte; > > + if (xe_walk->clear_pt) { > > + pte = 0; > > + } else { > > + pte = vm->pt_ops->pte_encode_vma(is_null ? 
> 0 :
> > +				xe_res_dma(curs) + xe_walk->dma_offset,
> > +				xe_walk->vma, pat_index, level);
> > +			pte |= xe_walk->default_pte;
> >
> > -		/*
> > -		 * Set the XE_PTE_PS64 hint if possible, otherwise if
> > -		 * this device *requires* 64K PTE size for VRAM, fail.
> > -		 */
> > -		if (level == 0 && !xe_parent->is_compact) {
> > -			if (xe_pt_is_pte_ps64K(addr, next, xe_walk)) {
> > -				xe_walk->vma->gpuva.flags |= XE_VMA_PTE_64K;
> > -				pte |= XE_PTE_PS64;
> > -			} else if (XE_WARN_ON(xe_walk->needs_64K)) {
> > -				return -EINVAL;
> > +			/*
> > +			 * Set the XE_PTE_PS64 hint if possible, otherwise if
> > +			 * this device *requires* 64K PTE size for VRAM, fail.
> > +			 */
> > +			if (level == 0 && !xe_parent->is_compact) {
> > +				if (xe_pt_is_pte_ps64K(addr, next, xe_walk)) {
> > +					xe_walk->vma->gpuva.flags |= XE_VMA_PTE_64K;
> > +					pte |= XE_PTE_PS64;
> > +				} else if (XE_WARN_ON(xe_walk->needs_64K)) {
> > +					return -EINVAL;
> > +				}
> > 			}
> > 		}
> >
> > @@ -519,7 +525,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> > 		if (unlikely(ret))
> > 			return ret;
> >
> > -		if (!is_null)
> > +		if (!is_null && !xe_walk->clear_pt)
> > 			xe_res_next(curs, next - addr);
> > 		xe_walk->va_curs_start = next;
> > 		xe_walk->vma->gpuva.flags |= (XE_VMA_PTE_4K << level);
> > @@ -589,6 +595,7 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
> >   * @vma: The vma indicating the address range.
> >   * @entries: Storage for the update entries used for connecting the tree to
> >   * the main tree at commit time.
> > + * @clear_pt: Clear the page table entries.
> >   * @num_entries: On output contains the number of @entries used.
> >   *
> >   * This function builds a disconnected page-table tree for a given address
> > @@ -602,7 +609,8 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
> >   */
> >  static int
> >  xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> > -		 struct xe_vm_pgtable_update *entries, u32 *num_entries)
> > +		 struct xe_vm_pgtable_update *entries,
> > +		 bool clear_pt, u32 *num_entries)
> >  {
> >  	struct xe_device *xe = tile_to_xe(tile);
> >  	struct xe_bo *bo = xe_vma_bo(vma);
> > @@ -622,10 +630,19 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> >  		.vma = vma,
> >  		.wupd.entries = entries,
> >  		.needs_64K = (xe_vma_vm(vma)->flags & XE_VM_FLAG_64K) && is_devmem,
> > +		.clear_pt = clear_pt,
> >  	};
> >  	struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id];
> >  	int ret;
> >
> > +	if (clear_pt) {
> > +		ret = xe_pt_walk_range(&pt->base, pt->level, xe_vma_start(vma),
> > +				       xe_vma_end(vma), &xe_walk.base);
> > +
> > +		*num_entries = xe_walk.wupd.num_used_entries;
> > +		return ret;
> > +	}
> > +
> >  	/**
> >  	 * Default atomic expectations for different allocation scenarios are as follows:
> >  	 *
> > @@ -981,12 +998,14 @@ static void xe_pt_free_bind(struct xe_vm_pgtable_update *entries,
> >
> >  static int
> >  xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
> > -		   struct xe_vm_pgtable_update *entries, u32 *num_entries)
> > +		   struct xe_vm_pgtable_update *entries,
> > +		   bool invalidate_on_bind, u32 *num_entries)
> >  {
> >  	int err;
> >
> >  	*num_entries = 0;
> > -	err = xe_pt_stage_bind(tile, vma, entries, num_entries);
> > +	err = xe_pt_stage_bind(tile, vma, entries, invalidate_on_bind,
> > +			       num_entries);
> >  	if (!err)
> >  		xe_tile_assert(tile, *num_entries);
> >
> > @@ -1661,6 +1680,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
> >  		return err;
> >
> >  	err = xe_pt_prepare_bind(tile, vma, pt_op->entries,
> > +				 pt_update_ops->invalidate_on_bind,
> >  				 &pt_op->num_entries);
> >  	if (!err) {
> >  		xe_tile_assert(tile, pt_op->num_entries <=
> > @@ -1685,7 +1705,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
> >  	 * it needs to be done here.
> >  	 */
> >  	if ((!pt_op->rebind && xe_vm_has_scratch(vm) &&
> > -	     xe_vm_in_preempt_fence_mode(vm)))
> > +	     xe_vm_in_preempt_fence_mode(vm)) || pt_update_ops->invalidate_on_bind)
> >  		pt_update_ops->needs_invalidation = true;
> >  	else if (pt_op->rebind && !xe_vm_in_lr_mode(vm))
> >  		/* We bump also if batch_invalidate_tlb is true */
> > @@ -1759,9 +1779,13 @@ static int op_prepare(struct xe_vm *vm,
> >
> >  	switch (op->base.op) {
> >  	case DRM_GPUVA_OP_MAP:
> > -		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
> > +		if (!op->map.immediate && xe_vm_in_fault_mode(vm) &&
> > +		    !op->map.invalidate_on_bind)
> >  			break;
> >
> > +		if (op->map.invalidate_on_bind)
> > +			pt_update_ops->invalidate_on_bind = true;
> > +
> >  		err = bind_op_prepare(vm, tile, pt_update_ops, op->map.vma);
> >  		pt_update_ops->wait_vm_kernel = true;
> >  		break;
> > @@ -1871,6 +1895,8 @@ static void bind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
> >  	}
> >  	vma->tile_present |= BIT(tile->id);
> >  	vma->tile_staged &= ~BIT(tile->id);
> > +	if (pt_update_ops->invalidate_on_bind)
> > +		vma->tile_invalidated |= BIT(tile->id);
> >  	if (xe_vma_is_userptr(vma)) {
> >  		lockdep_assert_held_read(&vm->userptr.notifier_lock);
> >  		to_userptr_vma(vma)->userptr.initial_bind = true;
> > diff --git a/drivers/gpu/drm/xe/xe_pt_types.h b/drivers/gpu/drm/xe/xe_pt_types.h
> > index 384cc04de719..3d0aa2a5102e 100644
> > --- a/drivers/gpu/drm/xe/xe_pt_types.h
> > +++ b/drivers/gpu/drm/xe/xe_pt_types.h
> > @@ -108,6 +108,8 @@ struct xe_vm_pgtable_update_ops {
> >  	bool needs_userptr_lock;
> >  	/** @needs_invalidation: Needs invalidation */
> >  	bool needs_invalidation;
> > +	/** @invalidate_on_bind: Invalidate the range before bind */
> > +	bool invalidate_on_bind;
> >  	/**
> >  	 * @wait_vm_bookkeep: PT operations need to wait until VM is idle
> >  	 * (bookkeep dma-resv slots are idle) and stage all future VM activity
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index d664f2e418b2..813d893d9b63 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -1921,6 +1921,23 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
> >  }
> >  #endif
> >
> > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_vm *vm, u32 bind_flags)
> > +{
> > +	if (!xe_vm_in_fault_mode(vm))
> > +		return false;
> > +
> > +	if (!NEEDS_SCRATCH(vm->xe))
> > +		return false;
> > +
> > +	if (!xe_vm_has_scratch(vm))
> > +		return false;
> > +
> > +	if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> >  /*
> >   * Create operations list from IOCTL arguments, setup operations fields so parse
> >   * and commit steps are decoupled from IOCTL arguments. This step can fail.
> > @@ -1991,6 +2008,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> >  			op->map.is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
> >  			op->map.dumpable = flags & DRM_XE_VM_BIND_FLAG_DUMPABLE;
> >  			op->map.pat_index = pat_index;
> > +			op->map.invalidate_on_bind =
> > +				__xe_vm_needs_clear_scratch_pages(vm, flags);
> >  		} else if (__op->op == DRM_GPUVA_OP_PREFETCH) {
> >  			op->prefetch.region = prefetch_region;
> >  		}
> > @@ -2188,7 +2207,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
> >  				return PTR_ERR(vma);
> >
> >  			op->map.vma = vma;
> > -			if (op->map.immediate || !xe_vm_in_fault_mode(vm))
> > +			if (op->map.immediate || !xe_vm_in_fault_mode(vm) ||
> > +			    op->map.invalidate_on_bind)
> >  				xe_vma_ops_incr_pt_update_ops(vops,
> >  							      op->tile_mask);
> >  			break;
> > @@ -2416,9 +2436,10 @@ static int op_lock_and_prep(struct drm_exec *exec, struct xe_vm *vm,
> >
> >  	switch (op->base.op) {
> >  	case DRM_GPUVA_OP_MAP:
> > -		err = vma_lock_and_validate(exec, op->map.vma,
> > -					    !xe_vm_in_fault_mode(vm) ||
> > -					    op->map.immediate);
> > +		if (!op->map.invalidate_on_bind)
> > +			err = vma_lock_and_validate(exec, op->map.vma,
> > +						    !xe_vm_in_fault_mode(vm) ||
> > +						    op->map.immediate);
> >  		break;
> >  	case DRM_GPUVA_OP_REMAP:
> >  		err = check_ufence(gpuva_to_vma(op->base.remap.unmap->va));
> > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> > index 52467b9b5348..dace04f4ea5e 100644
> > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > @@ -297,6 +297,8 @@ struct xe_vma_op_map {
> >  	bool is_null;
> >  	/** @dumpable: whether BO is dumped on GPU hang */
> >  	bool dumpable;
> > +	/** @invalidate: invalidate the VMA before bind */
> > +	bool invalidate_on_bind;
> >  	/** @pat_index: The pat index to use for this operation. */
> >  	u16 pat_index;
> >  };
> > --
> > 2.26.3
> >

^ permalink raw reply	[flat|nested] 32+ messages in thread
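At the uapi level, the case this series targets looks roughly like the sketch below: a VM created with both scratch pages and recoverable faults, then a bind without the IMMEDIATE flag, which with the invalidate_on_bind plumbing above is expected to zap the scratch PTEs for the range and invalidate the TLB so the first GPU access takes a fault. This is an illustrative sketch only, not part of the series; the struct, field, and flag names follow my reading of the xe uapi header and may not be exact.

#include <stdint.h>
#include <xf86drm.h>
#include <drm/xe_drm.h>

/* Hypothetical helper: VM with scratch pages *and* recoverable faults. */
static uint32_t create_fault_mode_vm_with_scratch(int fd)
{
	struct drm_xe_vm_create create = {
		.flags = DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE |
			 DRM_XE_VM_CREATE_FLAG_LR_MODE |
			 DRM_XE_VM_CREATE_FLAG_FAULT_MODE,
	};

	drmIoctl(fd, DRM_IOCTL_XE_VM_CREATE, &create);
	return create.vm_id;
}

/*
 * Hypothetical helper: bind without DRM_XE_VM_BIND_FLAG_IMMEDIATE, so the
 * real mapping is only installed on first fault.  With this series the bind
 * is also expected to clear any scratch PTEs covering [addr, addr + range).
 */
static void bind_on_fault(int fd, uint32_t vm_id, uint32_t bo_handle,
			  uint64_t addr, uint64_t range)
{
	struct drm_xe_vm_bind bind = {
		.vm_id = vm_id,
		.num_binds = 1,
		.bind = {
			.obj = bo_handle,
			.addr = addr,
			.range = range,
			.op = DRM_XE_VM_BIND_OP_MAP,
			.flags = 0,	/* deliberately no IMMEDIATE */
		},
	};

	drmIoctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);
}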
* Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind
  2025-02-06 15:16           ` Zeng, Oak
@ 2025-02-10 17:09             ` Matthew Brost
  0 siblings, 0 replies; 32+ messages in thread
From: Matthew Brost @ 2025-02-10 17:09 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe@lists.freedesktop.org, Thomas.Hellstrom@linux.intel.com,
	Cavitt, Jonathan

On Thu, Feb 06, 2025 at 08:16:36AM -0700, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: February 6, 2025 5:35 AM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Thomas.Hellstrom@linux.intel.com;
> > Cavitt, Jonathan <jonathan.cavitt@intel.com>
> > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page before vm_bind
> >
> > On Tue, Feb 04, 2025 at 01:45:57PM -0500, Oak Zeng wrote:
> > > When a vm runs under fault mode, if scratch page is enabled, we need
> > > to clear the scratch page mapping before vm_bind for the vm_bind
> > > address range. Under fault mode, we depend on recoverable page fault
> > > to establish mapping in page table. If scratch page is not cleared,
> > > GPU access of address won't cause page fault because it always hits
> > > the existing scratch page mapping.
> > >
> > > When vm_bind with IMMEDIATE flag, there is no need of clearing as
> > > immediate bind can overwrite the scratch page mapping.
> > >
> > > So far only is xe2 and xe3 products are allowed to enable scratch page
> > > under fault mode. On other platform we don't allow scratch page under
> > > fault mode, so no need of such clearing.
> > >
> > > v2: Rework vm_bind pipeline to clear scratch page mapping. This is similar
> > > to a map operation, with the exception that PTEs are cleared instead of
> > > pointing to valid physical pages. (Matt, Thomas)
> > >
> > > TLB invalidation is needed after clear scratch page mapping as larger
> > > scratch page mapping could be backed by physical page and cached in
> > > TLB. (Matt, Thomas)
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> >
> > Given the complexity of the VM bind path, I think we need an IGT posted
> > with this series before merging as I suggested in v1. Without it, it
> > will be fairly difficult to ensure correctness by reviews only.
> 
> I did have an igt posted in patch 3 of this series.
> 
> The reason I posted on patch 3 is, without patch 3 this "feature"
> is not enabled.
> 

Just replied to Thomas, see my reply there - I did miss an IGT was
included but I think we need to do a bit better, more details in my
reply to Thomas.

Matt

> Oak
> 
> > 
> > Matt
> > 
> > > ---
> > >  drivers/gpu/drm/xe/xe_pt.c       | 66 ++++++++++++++++++++++----------
> > >  drivers/gpu/drm/xe/xe_pt_types.h |  2 +
> > >  drivers/gpu/drm/xe/xe_vm.c       | 29 ++++++++++++--
> > >  drivers/gpu/drm/xe/xe_vm_types.h |  2 +
> > >  4 files changed, 75 insertions(+), 24 deletions(-)
^ permalink raw reply	[flat|nested] 32+ messages in thread
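On the testing point in the exchange above, the behaviour described in the commit message implies a fairly small functional sequence. The sketch below is only my reading of what such a test has to exercise, not the IGT referenced in the thread; create_fault_mode_vm_with_scratch(), bind_on_fault(), submit_gpu_read(), and read_back() are hypothetical helpers standing in for whatever VM-setup, batch-submission, and readback infrastructure a real test would use.

#include <stdint.h>

/* Hypothetical helpers, declarations only; see the earlier uapi sketch. */
uint32_t create_fault_mode_vm_with_scratch(int fd);
void bind_on_fault(int fd, uint32_t vm, uint32_t bo, uint64_t addr, uint64_t range);
uint64_t submit_gpu_read(int fd, uint32_t vm, uint64_t addr);	/* value read by GPU */
uint64_t read_back(int fd, uint32_t bo, uint64_t offset);	/* CPU readback of BO */

static void scratch_then_fault_sequence(int fd, uint32_t bo)
{
	const uint64_t addr = 0x1a0000, range = 0x10000;
	uint32_t vm = create_fault_mode_vm_with_scratch(fd);

	/*
	 * 1. Touch the unbound range: with scratch pages enabled the read is
	 *    expected to complete against the scratch mapping (no page fault).
	 */
	(void)submit_gpu_read(fd, vm, addr);

	/*
	 * 2. Bind a BO over the same range without IMMEDIATE.  With this
	 *    series the scratch PTEs are zapped and the TLB invalidated as
	 *    part of the bind.
	 */
	bind_on_fault(fd, vm, bo, addr, range);

	/*
	 * 3. Touch the range again: the access should now take a recoverable
	 *    page fault and be served from the BO, which the test can verify
	 *    by comparing the GPU-visible data against a CPU readback.
	 */
	(void)submit_gpu_read(fd, vm, addr);
	(void)read_back(fd, bo, 0);
}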