[PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error

Linux kernel -stable discussions
 help / color / mirror / Atom feed

* [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error
@ 2025-02-14 17:05 Matthew Auld
  2025-02-14 17:05 ` [PATCH v2 2/3] drm/xe/userptr: fix EFAULT handling Matthew Auld
  2025-02-15  1:28 ` [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error Matthew Brost
  0 siblings, 2 replies; 12+ messages in thread
From: Matthew Auld @ 2025-02-14 17:05 UTC (permalink / raw)
  To: intel-xe; +Cc: Matthew Brost, Thomas Hellström, stable

On error restore anything still on the pin_list back to the invalidation
list on error. For the actual pin, so long as the vma is tracked on
either list it should get picked up on the next pin, however it looks
possible for the vma to get nuked but still be present on this per vm
pin_list leading to corruption. An alternative might be then to instead
just remove the link when destroying the vma.

Fixes: ed2bdf3b264d ("drm/xe/vm: Subclass userptr vmas")
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
---
 drivers/gpu/drm/xe/xe_vm.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index d664f2e418b2..668b0bde7822 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -670,12 +670,12 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
 	list_for_each_entry_safe(uvma, next, &vm->userptr.invalidated,
 				 userptr.invalidate_link) {
 		list_del_init(&uvma->userptr.invalidate_link);
-		list_move_tail(&uvma->userptr.repin_link,
-			       &vm->userptr.repin_list);
+		list_add_tail(&uvma->userptr.repin_link,
+			      &vm->userptr.repin_list);
 	}
 	spin_unlock(&vm->userptr.invalidated_lock);
 
-	/* Pin and move to temporary list */
+	/* Pin and move to bind list */
 	list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
 				 userptr.repin_link) {
 		err = xe_vma_userptr_pin_pages(uvma);
@@ -691,10 +691,10 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
 			err = xe_vm_invalidate_vma(&uvma->vma);
 			xe_vm_unlock(vm);
 			if (err)
-				return err;
+				break;
 		} else {
-			if (err < 0)
-				return err;
+			if (err)
+				break;
 
 			list_del_init(&uvma->userptr.repin_link);
 			list_move_tail(&uvma->vma.combined_links.rebind,
@@ -702,7 +702,19 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
 		}
 	}
 
-	return 0;
+	if (err) {
+		down_write(&vm->userptr.notifier_lock);
+		spin_lock(&vm->userptr.invalidated_lock);
+		list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
+					 userptr.repin_link) {
+			list_del_init(&uvma->userptr.repin_link);
+			list_move_tail(&uvma->userptr.invalidate_link,
+				       &vm->userptr.invalidated);
+		}
+		spin_unlock(&vm->userptr.invalidated_lock);
+		up_write(&vm->userptr.notifier_lock);
+	}
+	return err;
 }
 
 /**
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 2/3] drm/xe/userptr: fix EFAULT handling
  2025-02-14 17:05 [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error Matthew Auld
@ 2025-02-14 17:05 ` Matthew Auld
  2025-02-15  1:23   ` Matthew Brost
  2025-02-15  1:28 ` [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error Matthew Brost
  1 sibling, 1 reply; 12+ messages in thread
From: Matthew Auld @ 2025-02-14 17:05 UTC (permalink / raw)
  To: intel-xe; +Cc: Matthew Brost, Thomas Hellström, stable

Currently we treat EFAULT from hmm_range_fault() as a non-fatal error
when called from xe_vm_userptr_pin() with the idea that we want to avoid
killing the entire vm and chucking an error, under the assumption that
the user just did an unmap or something, and has no intention of
actually touching that memory from the GPU.  At this point we have
already zapped the PTEs so any access should generate a page fault, and
if the pin fails there also it will then become fatal.

However it looks like it's possible for the userptr vma to still be on
the rebind list in preempt_rebind_work_func(), if we had to retry the
pin again due to something happening in the caller before we did the
rebind step, but in the meantime needing to re-validate the userptr and
this time hitting the EFAULT.

This might explain an internal user report of hitting:

[  191.738349] WARNING: CPU: 1 PID: 157 at drivers/gpu/drm/xe/xe_res_cursor.h:158 xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
[  191.738551] Workqueue: xe-ordered-wq preempt_rebind_work_func [xe]
[  191.738616] RIP: 0010:xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
[  191.738690] Call Trace:
[  191.738692]  <TASK>
[  191.738694]  ? show_regs+0x69/0x80
[  191.738698]  ? __warn+0x93/0x1a0
[  191.738703]  ? xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
[  191.738759]  ? report_bug+0x18f/0x1a0
[  191.738764]  ? handle_bug+0x63/0xa0
[  191.738767]  ? exc_invalid_op+0x19/0x70
[  191.738770]  ? asm_exc_invalid_op+0x1b/0x20
[  191.738777]  ? xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
[  191.738834]  ? ret_from_fork_asm+0x1a/0x30
[  191.738849]  bind_op_prepare+0x105/0x7b0 [xe]
[  191.738906]  ? dma_resv_reserve_fences+0x301/0x380
[  191.738912]  xe_pt_update_ops_prepare+0x28c/0x4b0 [xe]
[  191.738966]  ? kmemleak_alloc+0x4b/0x80
[  191.738973]  ops_execute+0x188/0x9d0 [xe]
[  191.739036]  xe_vm_rebind+0x4ce/0x5a0 [xe]
[  191.739098]  ? trace_hardirqs_on+0x4d/0x60
[  191.739112]  preempt_rebind_work_func+0x76f/0xd00 [xe]

Followed by NPD, when running some workload, since the sg was never
actually populated but the vma is still marked for rebind when it should
be skipped for this special EFAULT case. And from the logs it does seem
like we hit this special EFAULT case before the explosions.

v2 (MattB):
 - Move earlier

Fixes: 521db22a1d70 ("drm/xe: Invalidate userptr VMA on page pin fault")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: <stable@vger.kernel.org> # v6.10+
---
 drivers/gpu/drm/xe/xe_vm.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 668b0bde7822..f36e2cc1d155 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -681,6 +681,18 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
 		err = xe_vma_userptr_pin_pages(uvma);
 		if (err == -EFAULT) {
 			list_del_init(&uvma->userptr.repin_link);
+			/*
+			 * We might have already done the pin once already, but
+			 * then had to retry before the re-bind happened, due
+			 * some other condition in the caller, but in the
+			 * meantime the userptr got dinged by the notifier such
+			 * that we need to revalidate here, but this time we hit
+			 * the EFAULT. In such a case make sure we remove
+			 * ourselves from the rebind list to avoid going down in
+			 * flames.
+			 */
+			if (!list_empty(&uvma->vma.combined_links.rebind))
+				list_del_init(&uvma->vma.combined_links.rebind);
 
 			/* Wait for pending binds */
 			xe_vm_lock(vm, false);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/3] drm/xe/userptr: fix EFAULT handling
  2025-02-14 17:05 ` [PATCH v2 2/3] drm/xe/userptr: fix EFAULT handling Matthew Auld
@ 2025-02-15  1:23   ` Matthew Brost
  2025-02-17  9:19     ` Matthew Auld
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Brost @ 2025-02-15  1:23 UTC (permalink / raw)
  To: Matthew Auld; +Cc: intel-xe, Thomas Hellström, stable

On Fri, Feb 14, 2025 at 05:05:29PM +0000, Matthew Auld wrote:
> Currently we treat EFAULT from hmm_range_fault() as a non-fatal error
> when called from xe_vm_userptr_pin() with the idea that we want to avoid
> killing the entire vm and chucking an error, under the assumption that
> the user just did an unmap or something, and has no intention of
> actually touching that memory from the GPU.  At this point we have
> already zapped the PTEs so any access should generate a page fault, and
> if the pin fails there also it will then become fatal.
> 
> However it looks like it's possible for the userptr vma to still be on
> the rebind list in preempt_rebind_work_func(), if we had to retry the
> pin again due to something happening in the caller before we did the
> rebind step, but in the meantime needing to re-validate the userptr and
> this time hitting the EFAULT.
> 
> This might explain an internal user report of hitting:
> 
> [  191.738349] WARNING: CPU: 1 PID: 157 at drivers/gpu/drm/xe/xe_res_cursor.h:158 xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
> [  191.738551] Workqueue: xe-ordered-wq preempt_rebind_work_func [xe]
> [  191.738616] RIP: 0010:xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
> [  191.738690] Call Trace:
> [  191.738692]  <TASK>
> [  191.738694]  ? show_regs+0x69/0x80
> [  191.738698]  ? __warn+0x93/0x1a0
> [  191.738703]  ? xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
> [  191.738759]  ? report_bug+0x18f/0x1a0
> [  191.738764]  ? handle_bug+0x63/0xa0
> [  191.738767]  ? exc_invalid_op+0x19/0x70
> [  191.738770]  ? asm_exc_invalid_op+0x1b/0x20
> [  191.738777]  ? xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
> [  191.738834]  ? ret_from_fork_asm+0x1a/0x30
> [  191.738849]  bind_op_prepare+0x105/0x7b0 [xe]
> [  191.738906]  ? dma_resv_reserve_fences+0x301/0x380
> [  191.738912]  xe_pt_update_ops_prepare+0x28c/0x4b0 [xe]
> [  191.738966]  ? kmemleak_alloc+0x4b/0x80
> [  191.738973]  ops_execute+0x188/0x9d0 [xe]
> [  191.739036]  xe_vm_rebind+0x4ce/0x5a0 [xe]
> [  191.739098]  ? trace_hardirqs_on+0x4d/0x60
> [  191.739112]  preempt_rebind_work_func+0x76f/0xd00 [xe]
> 
> Followed by NPD, when running some workload, since the sg was never
> actually populated but the vma is still marked for rebind when it should
> be skipped for this special EFAULT case. And from the logs it does seem
> like we hit this special EFAULT case before the explosions.
> 

It would be nice to verify if this fixes the bug report.

> v2 (MattB):
>  - Move earlier
> 
> Fixes: 521db22a1d70 ("drm/xe: Invalidate userptr VMA on page pin fault")
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>

Anyways, LGTM:
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v6.10+
> ---
>  drivers/gpu/drm/xe/xe_vm.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 668b0bde7822..f36e2cc1d155 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -681,6 +681,18 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
>  		err = xe_vma_userptr_pin_pages(uvma);
>  		if (err == -EFAULT) {
>  			list_del_init(&uvma->userptr.repin_link);
> +			/*
> +			 * We might have already done the pin once already, but
> +			 * then had to retry before the re-bind happened, due
> +			 * some other condition in the caller, but in the
> +			 * meantime the userptr got dinged by the notifier such
> +			 * that we need to revalidate here, but this time we hit
> +			 * the EFAULT. In such a case make sure we remove
> +			 * ourselves from the rebind list to avoid going down in
> +			 * flames.
> +			 */
> +			if (!list_empty(&uvma->vma.combined_links.rebind))
> +				list_del_init(&uvma->vma.combined_links.rebind);
>  
>  			/* Wait for pending binds */
>  			xe_vm_lock(vm, false);
> -- 
> 2.48.1
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/3] drm/xe/userptr: fix EFAULT handling
  2025-02-15  1:23   ` Matthew Brost
@ 2025-02-17  9:19     ` Matthew Auld
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Auld @ 2025-02-17  9:19 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe, Thomas Hellström, stable

On 15/02/2025 01:23, Matthew Brost wrote:
> On Fri, Feb 14, 2025 at 05:05:29PM +0000, Matthew Auld wrote:
>> Currently we treat EFAULT from hmm_range_fault() as a non-fatal error
>> when called from xe_vm_userptr_pin() with the idea that we want to avoid
>> killing the entire vm and chucking an error, under the assumption that
>> the user just did an unmap or something, and has no intention of
>> actually touching that memory from the GPU.  At this point we have
>> already zapped the PTEs so any access should generate a page fault, and
>> if the pin fails there also it will then become fatal.
>>
>> However it looks like it's possible for the userptr vma to still be on
>> the rebind list in preempt_rebind_work_func(), if we had to retry the
>> pin again due to something happening in the caller before we did the
>> rebind step, but in the meantime needing to re-validate the userptr and
>> this time hitting the EFAULT.
>>
>> This might explain an internal user report of hitting:
>>
>> [  191.738349] WARNING: CPU: 1 PID: 157 at drivers/gpu/drm/xe/xe_res_cursor.h:158 xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
>> [  191.738551] Workqueue: xe-ordered-wq preempt_rebind_work_func [xe]
>> [  191.738616] RIP: 0010:xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
>> [  191.738690] Call Trace:
>> [  191.738692]  <TASK>
>> [  191.738694]  ? show_regs+0x69/0x80
>> [  191.738698]  ? __warn+0x93/0x1a0
>> [  191.738703]  ? xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
>> [  191.738759]  ? report_bug+0x18f/0x1a0
>> [  191.738764]  ? handle_bug+0x63/0xa0
>> [  191.738767]  ? exc_invalid_op+0x19/0x70
>> [  191.738770]  ? asm_exc_invalid_op+0x1b/0x20
>> [  191.738777]  ? xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
>> [  191.738834]  ? ret_from_fork_asm+0x1a/0x30
>> [  191.738849]  bind_op_prepare+0x105/0x7b0 [xe]
>> [  191.738906]  ? dma_resv_reserve_fences+0x301/0x380
>> [  191.738912]  xe_pt_update_ops_prepare+0x28c/0x4b0 [xe]
>> [  191.738966]  ? kmemleak_alloc+0x4b/0x80
>> [  191.738973]  ops_execute+0x188/0x9d0 [xe]
>> [  191.739036]  xe_vm_rebind+0x4ce/0x5a0 [xe]
>> [  191.739098]  ? trace_hardirqs_on+0x4d/0x60
>> [  191.739112]  preempt_rebind_work_func+0x76f/0xd00 [xe]
>>
>> Followed by NPD, when running some workload, since the sg was never
>> actually populated but the vma is still marked for rebind when it should
>> be skipped for this special EFAULT case. And from the logs it does seem
>> like we hit this special EFAULT case before the explosions.
>>
> 
> It would be nice to verify if this fixes the bug report.

Yes, reporter said it fixes it. Or at least the previous version did. 
See GSD-10562 if you are curious. Will re-phrase the commit message to 
make that clear.

> 
>> v2 (MattB):
>>   - Move earlier
>>
>> Fixes: 521db22a1d70 ("drm/xe: Invalidate userptr VMA on page pin fault")
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
> 
> Anyways, LGTM:
> Reviewed-by: Matthew Brost <matthew.brost@intel.com>

Thanks.

> 
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: <stable@vger.kernel.org> # v6.10+
>> ---
>>   drivers/gpu/drm/xe/xe_vm.c | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>> index 668b0bde7822..f36e2cc1d155 100644
>> --- a/drivers/gpu/drm/xe/xe_vm.c
>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>> @@ -681,6 +681,18 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
>>   		err = xe_vma_userptr_pin_pages(uvma);
>>   		if (err == -EFAULT) {
>>   			list_del_init(&uvma->userptr.repin_link);
>> +			/*
>> +			 * We might have already done the pin once already, but
>> +			 * then had to retry before the re-bind happened, due
>> +			 * some other condition in the caller, but in the
>> +			 * meantime the userptr got dinged by the notifier such
>> +			 * that we need to revalidate here, but this time we hit
>> +			 * the EFAULT. In such a case make sure we remove
>> +			 * ourselves from the rebind list to avoid going down in
>> +			 * flames.
>> +			 */
>> +			if (!list_empty(&uvma->vma.combined_links.rebind))
>> +				list_del_init(&uvma->vma.combined_links.rebind);
>>   
>>   			/* Wait for pending binds */
>>   			xe_vm_lock(vm, false);
>> -- 
>> 2.48.1
>>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error
  2025-02-14 17:05 [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error Matthew Auld
  2025-02-14 17:05 ` [PATCH v2 2/3] drm/xe/userptr: fix EFAULT handling Matthew Auld
@ 2025-02-15  1:28 ` Matthew Brost
  2025-02-17  9:38   ` Matthew Auld
  1 sibling, 1 reply; 12+ messages in thread
From: Matthew Brost @ 2025-02-15  1:28 UTC (permalink / raw)
  To: Matthew Auld; +Cc: intel-xe, Thomas Hellström, stable

On Fri, Feb 14, 2025 at 05:05:28PM +0000, Matthew Auld wrote:
> On error restore anything still on the pin_list back to the invalidation
> list on error. For the actual pin, so long as the vma is tracked on
> either list it should get picked up on the next pin, however it looks
> possible for the vma to get nuked but still be present on this per vm
> pin_list leading to corruption. An alternative might be then to instead
> just remove the link when destroying the vma.
> 
> Fixes: ed2bdf3b264d ("drm/xe/vm: Subclass userptr vmas")
> Suggested-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v6.8+
> ---
>  drivers/gpu/drm/xe/xe_vm.c | 26 +++++++++++++++++++-------
>  1 file changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index d664f2e418b2..668b0bde7822 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -670,12 +670,12 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
>  	list_for_each_entry_safe(uvma, next, &vm->userptr.invalidated,
>  				 userptr.invalidate_link) {
>  		list_del_init(&uvma->userptr.invalidate_link);
> -		list_move_tail(&uvma->userptr.repin_link,
> -			       &vm->userptr.repin_list);
> +		list_add_tail(&uvma->userptr.repin_link,
> +			      &vm->userptr.repin_list);

Why this change?

>  	}
>  	spin_unlock(&vm->userptr.invalidated_lock);
>  
> -	/* Pin and move to temporary list */
> +	/* Pin and move to bind list */
>  	list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
>  				 userptr.repin_link) {
>  		err = xe_vma_userptr_pin_pages(uvma);
> @@ -691,10 +691,10 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
>  			err = xe_vm_invalidate_vma(&uvma->vma);
>  			xe_vm_unlock(vm);
>  			if (err)
> -				return err;
> +				break;
>  		} else {
> -			if (err < 0)
> -				return err;
> +			if (err)
> +				break;
>  
>  			list_del_init(&uvma->userptr.repin_link);
>  			list_move_tail(&uvma->vma.combined_links.rebind,
> @@ -702,7 +702,19 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
>  		}
>  	}
>  
> -	return 0;
> +	if (err) {
> +		down_write(&vm->userptr.notifier_lock);

Can you explain why you take the notifier lock here? I don't think this
required unless I'm missing something.

Matt 

> +		spin_lock(&vm->userptr.invalidated_lock);
> +		list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
> +					 userptr.repin_link) {
> +			list_del_init(&uvma->userptr.repin_link);
> +			list_move_tail(&uvma->userptr.invalidate_link,
> +				       &vm->userptr.invalidated);
> +		}
> +		spin_unlock(&vm->userptr.invalidated_lock);
> +		up_write(&vm->userptr.notifier_lock);
> +	}
> +	return err;
>  }
>  
>  /**
> -- 
> 2.48.1
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error
  2025-02-15  1:28 ` [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error Matthew Brost
@ 2025-02-17  9:38   ` Matthew Auld
  2025-02-18  3:58     ` Matthew Brost
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Auld @ 2025-02-17  9:38 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe, Thomas Hellström, stable

On 15/02/2025 01:28, Matthew Brost wrote:
> On Fri, Feb 14, 2025 at 05:05:28PM +0000, Matthew Auld wrote:
>> On error restore anything still on the pin_list back to the invalidation
>> list on error. For the actual pin, so long as the vma is tracked on
>> either list it should get picked up on the next pin, however it looks
>> possible for the vma to get nuked but still be present on this per vm
>> pin_list leading to corruption. An alternative might be then to instead
>> just remove the link when destroying the vma.
>>
>> Fixes: ed2bdf3b264d ("drm/xe/vm: Subclass userptr vmas")
>> Suggested-by: Matthew Brost <matthew.brost@intel.com>
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: <stable@vger.kernel.org> # v6.8+
>> ---
>>   drivers/gpu/drm/xe/xe_vm.c | 26 +++++++++++++++++++-------
>>   1 file changed, 19 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>> index d664f2e418b2..668b0bde7822 100644
>> --- a/drivers/gpu/drm/xe/xe_vm.c
>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>> @@ -670,12 +670,12 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
>>   	list_for_each_entry_safe(uvma, next, &vm->userptr.invalidated,
>>   				 userptr.invalidate_link) {
>>   		list_del_init(&uvma->userptr.invalidate_link);
>> -		list_move_tail(&uvma->userptr.repin_link,
>> -			       &vm->userptr.repin_list);
>> +		list_add_tail(&uvma->userptr.repin_link,
>> +			      &vm->userptr.repin_list);
> 
> Why this change?

Just that with this patch the repin_link should now always be empty at 
this point, I think. add should complain if that is not the case.

> 
>>   	}
>>   	spin_unlock(&vm->userptr.invalidated_lock);
>>   
>> -	/* Pin and move to temporary list */
>> +	/* Pin and move to bind list */
>>   	list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
>>   				 userptr.repin_link) {
>>   		err = xe_vma_userptr_pin_pages(uvma);
>> @@ -691,10 +691,10 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
>>   			err = xe_vm_invalidate_vma(&uvma->vma);
>>   			xe_vm_unlock(vm);
>>   			if (err)
>> -				return err;
>> +				break;
>>   		} else {
>> -			if (err < 0)
>> -				return err;
>> +			if (err)
>> +				break;
>>   
>>   			list_del_init(&uvma->userptr.repin_link);
>>   			list_move_tail(&uvma->vma.combined_links.rebind,
>> @@ -702,7 +702,19 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
>>   		}
>>   	}
>>   
>> -	return 0;
>> +	if (err) {
>> +		down_write(&vm->userptr.notifier_lock);
> 
> Can you explain why you take the notifier lock here? I don't think this
> required unless I'm missing something.

For the invalidated list, the docs say:

"Removing items from the list additionally requires @lock in write mode, 
and adding items to the list requires the @userptr.notifer_lock in write 
mode."

Not sure if the docs needs to be updated here?

> 
> Matt
> 
>> +		spin_lock(&vm->userptr.invalidated_lock);
>> +		list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
>> +					 userptr.repin_link) {
>> +			list_del_init(&uvma->userptr.repin_link);
>> +			list_move_tail(&uvma->userptr.invalidate_link,
>> +				       &vm->userptr.invalidated);
>> +		}
>> +		spin_unlock(&vm->userptr.invalidated_lock);
>> +		up_write(&vm->userptr.notifier_lock);
>> +	}
>> +	return err;
>>   }
>>   
>>   /**
>> -- 
>> 2.48.1
>>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error
  2025-02-17  9:38   ` Matthew Auld
@ 2025-02-18  3:58     ` Matthew Brost
  2025-02-20 23:52       ` Matthew Brost
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Brost @ 2025-02-18  3:58 UTC (permalink / raw)
  To: Matthew Auld; +Cc: intel-xe, Thomas Hellström, stable

On Mon, Feb 17, 2025 at 09:38:26AM +0000, Matthew Auld wrote:
> On 15/02/2025 01:28, Matthew Brost wrote:
> > On Fri, Feb 14, 2025 at 05:05:28PM +0000, Matthew Auld wrote:
> > > On error restore anything still on the pin_list back to the invalidation
> > > list on error. For the actual pin, so long as the vma is tracked on
> > > either list it should get picked up on the next pin, however it looks
> > > possible for the vma to get nuked but still be present on this per vm
> > > pin_list leading to corruption. An alternative might be then to instead
> > > just remove the link when destroying the vma.
> > > 
> > > Fixes: ed2bdf3b264d ("drm/xe/vm: Subclass userptr vmas")
> > > Suggested-by: Matthew Brost <matthew.brost@intel.com>
> > > Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: <stable@vger.kernel.org> # v6.8+
> > > ---
> > >   drivers/gpu/drm/xe/xe_vm.c | 26 +++++++++++++++++++-------
> > >   1 file changed, 19 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > > index d664f2e418b2..668b0bde7822 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > @@ -670,12 +670,12 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
> > >   	list_for_each_entry_safe(uvma, next, &vm->userptr.invalidated,
> > >   				 userptr.invalidate_link) {
> > >   		list_del_init(&uvma->userptr.invalidate_link);
> > > -		list_move_tail(&uvma->userptr.repin_link,
> > > -			       &vm->userptr.repin_list);
> > > +		list_add_tail(&uvma->userptr.repin_link,
> > > +			      &vm->userptr.repin_list);
> > 
> > Why this change?
> 
> Just that with this patch the repin_link should now always be empty at this
> point, I think. add should complain if that is not the case.
> 

If it is always expected to be empty, then yea maybe add a xe_assert for
this as the list management is pretty tricky. 

> > 
> > >   	}
> > >   	spin_unlock(&vm->userptr.invalidated_lock);
> > > -	/* Pin and move to temporary list */
> > > +	/* Pin and move to bind list */
> > >   	list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
> > >   				 userptr.repin_link) {
> > >   		err = xe_vma_userptr_pin_pages(uvma);
> > > @@ -691,10 +691,10 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
> > >   			err = xe_vm_invalidate_vma(&uvma->vma);
> > >   			xe_vm_unlock(vm);
> > >   			if (err)
> > > -				return err;
> > > +				break;
> > >   		} else {
> > > -			if (err < 0)
> > > -				return err;
> > > +			if (err)
> > > +				break;
> > >   			list_del_init(&uvma->userptr.repin_link);
> > >   			list_move_tail(&uvma->vma.combined_links.rebind,
> > > @@ -702,7 +702,19 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
> > >   		}
> > >   	}
> > > -	return 0;
> > > +	if (err) {
> > > +		down_write(&vm->userptr.notifier_lock);
> > 
> > Can you explain why you take the notifier lock here? I don't think this
> > required unless I'm missing something.
> 
> For the invalidated list, the docs say:
> 
> "Removing items from the list additionally requires @lock in write mode, and
> adding items to the list requires the @userptr.notifer_lock in write mode."
> 
> Not sure if the docs needs to be updated here?
> 

Oh. I believe the part of comment for 'adding items to the list
requires the @userptr.notifer_lock in write mode' really means something
like this:

'When adding to @vm->userptr.invalidated in the notifier the
@userptr.notifer_lock in write mode protects against concurrent VM binds
from setting up newly invalidated pages.'

So with above and since this code path is in the VM bind path (i.e. we
are not racing with other binds) I think the
vm->userptr.invalidated_lock is sufficient. Maybe ask Thomas if he
agrees here.

Matt

> > 
> > Matt
> > 
> > > +		spin_lock(&vm->userptr.invalidated_lock);
> > > +		list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
> > > +					 userptr.repin_link) {
> > > +			list_del_init(&uvma->userptr.repin_link);
> > > +			list_move_tail(&uvma->userptr.invalidate_link,
> > > +				       &vm->userptr.invalidated);
> > > +		}
> > > +		spin_unlock(&vm->userptr.invalidated_lock);
> > > +		up_write(&vm->userptr.notifier_lock);
> > > +	}
> > > +	return err;
> > >   }
> > >   /**
> > > -- 
> > > 2.48.1
> > > 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error
  2025-02-18  3:58     ` Matthew Brost
@ 2025-02-20 23:52       ` Matthew Brost
  2025-02-21 11:11         ` Matthew Auld
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Brost @ 2025-02-20 23:52 UTC (permalink / raw)
  To: Matthew Auld; +Cc: intel-xe, Thomas Hellström, stable

On Mon, Feb 17, 2025 at 07:58:11PM -0800, Matthew Brost wrote:
> On Mon, Feb 17, 2025 at 09:38:26AM +0000, Matthew Auld wrote:
> > On 15/02/2025 01:28, Matthew Brost wrote:
> > > On Fri, Feb 14, 2025 at 05:05:28PM +0000, Matthew Auld wrote:
> > > > On error restore anything still on the pin_list back to the invalidation
> > > > list on error. For the actual pin, so long as the vma is tracked on
> > > > either list it should get picked up on the next pin, however it looks
> > > > possible for the vma to get nuked but still be present on this per vm
> > > > pin_list leading to corruption. An alternative might be then to instead
> > > > just remove the link when destroying the vma.
> > > > 
> > > > Fixes: ed2bdf3b264d ("drm/xe/vm: Subclass userptr vmas")
> > > > Suggested-by: Matthew Brost <matthew.brost@intel.com>
> > > > Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > Cc: <stable@vger.kernel.org> # v6.8+
> > > > ---
> > > >   drivers/gpu/drm/xe/xe_vm.c | 26 +++++++++++++++++++-------
> > > >   1 file changed, 19 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > > > index d664f2e418b2..668b0bde7822 100644
> > > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > > @@ -670,12 +670,12 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
> > > >   	list_for_each_entry_safe(uvma, next, &vm->userptr.invalidated,
> > > >   				 userptr.invalidate_link) {
> > > >   		list_del_init(&uvma->userptr.invalidate_link);
> > > > -		list_move_tail(&uvma->userptr.repin_link,
> > > > -			       &vm->userptr.repin_list);
> > > > +		list_add_tail(&uvma->userptr.repin_link,
> > > > +			      &vm->userptr.repin_list);
> > > 
> > > Why this change?
> > 
> > Just that with this patch the repin_link should now always be empty at this
> > point, I think. add should complain if that is not the case.
> > 
> 
> If it is always expected to be empty, then yea maybe add a xe_assert for
> this as the list management is pretty tricky. 
> 
> > > 
> > > >   	}
> > > >   	spin_unlock(&vm->userptr.invalidated_lock);
> > > > -	/* Pin and move to temporary list */
> > > > +	/* Pin and move to bind list */
> > > >   	list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
> > > >   				 userptr.repin_link) {
> > > >   		err = xe_vma_userptr_pin_pages(uvma);
> > > > @@ -691,10 +691,10 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
> > > >   			err = xe_vm_invalidate_vma(&uvma->vma);
> > > >   			xe_vm_unlock(vm);
> > > >   			if (err)
> > > > -				return err;
> > > > +				break;
> > > >   		} else {
> > > > -			if (err < 0)
> > > > -				return err;
> > > > +			if (err)
> > > > +				break;
> > > >   			list_del_init(&uvma->userptr.repin_link);
> > > >   			list_move_tail(&uvma->vma.combined_links.rebind,
> > > > @@ -702,7 +702,19 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
> > > >   		}
> > > >   	}
> > > > -	return 0;
> > > > +	if (err) {
> > > > +		down_write(&vm->userptr.notifier_lock);
> > > 
> > > Can you explain why you take the notifier lock here? I don't think this
> > > required unless I'm missing something.
> > 
> > For the invalidated list, the docs say:
> > 
> > "Removing items from the list additionally requires @lock in write mode, and
> > adding items to the list requires the @userptr.notifer_lock in write mode."
> > 
> > Not sure if the docs needs to be updated here?
> > 
> 
> Oh. I believe the part of comment for 'adding items to the list
> requires the @userptr.notifer_lock in write mode' really means something
> like this:
> 
> 'When adding to @vm->userptr.invalidated in the notifier the
> @userptr.notifer_lock in write mode protects against concurrent VM binds
> from setting up newly invalidated pages.'
> 
> So with above and since this code path is in the VM bind path (i.e. we
> are not racing with other binds) I think the
> vm->userptr.invalidated_lock is sufficient. Maybe ask Thomas if he
> agrees here.
> 

After some discussion with Thomas, removing notifier lock here is safe.

However, for adding is either userptr.notifer_lock || vm->lock to also
avoid races between binds, execs, and rebind worker.

I'd like update the documentation and add a helper like this:

void xe_vma_userptr_add_invalidated(struct xe_userptr_vma *uvma)
{
       struct xe_vm *vm = xe_vma_vm(&uvma->vma);

       lockdep_assert(lock_is_held_type(&vm->lock.dep_map, 1) ||
                      lock_is_held_type(&vm->userptr.notifier_lock.dep_map, 1));

       spin_lock(&vm->userptr.invalidated_lock);
       list_move_tail(&uvma->userptr.invalidate_link,
                      &vm->userptr.invalidated);
       spin_unlock(&vm->userptr.invalidated_lock);
}

However, let's delay the helper until this series and recently post
series of mine [1] merge as both are fixes series and hoping for a clean
backport.

Matt

[1] https://patchwork.freedesktop.org/series/145198/

> Matt
> 
> > > 
> > > Matt
> > > 
> > > > +		spin_lock(&vm->userptr.invalidated_lock);
> > > > +		list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
> > > > +					 userptr.repin_link) {
> > > > +			list_del_init(&uvma->userptr.repin_link);
> > > > +			list_move_tail(&uvma->userptr.invalidate_link,
> > > > +				       &vm->userptr.invalidated);
> > > > +		}
> > > > +		spin_unlock(&vm->userptr.invalidated_lock);
> > > > +		up_write(&vm->userptr.notifier_lock);
> > > > +	}
> > > > +	return err;
> > > >   }
> > > >   /**
> > > > -- 
> > > > 2.48.1
> > > > 
> > 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error
  2025-02-20 23:52       ` Matthew Brost
@ 2025-02-21 11:11         ` Matthew Auld
  2025-02-21 11:20           ` Thomas Hellström
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Auld @ 2025-02-21 11:11 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe, Thomas Hellström, stable

On 20/02/2025 23:52, Matthew Brost wrote:
> On Mon, Feb 17, 2025 at 07:58:11PM -0800, Matthew Brost wrote:
>> On Mon, Feb 17, 2025 at 09:38:26AM +0000, Matthew Auld wrote:
>>> On 15/02/2025 01:28, Matthew Brost wrote:
>>>> On Fri, Feb 14, 2025 at 05:05:28PM +0000, Matthew Auld wrote:
>>>>> On error restore anything still on the pin_list back to the invalidation
>>>>> list on error. For the actual pin, so long as the vma is tracked on
>>>>> either list it should get picked up on the next pin, however it looks
>>>>> possible for the vma to get nuked but still be present on this per vm
>>>>> pin_list leading to corruption. An alternative might be then to instead
>>>>> just remove the link when destroying the vma.
>>>>>
>>>>> Fixes: ed2bdf3b264d ("drm/xe/vm: Subclass userptr vmas")
>>>>> Suggested-by: Matthew Brost <matthew.brost@intel.com>
>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>> Cc: <stable@vger.kernel.org> # v6.8+
>>>>> ---
>>>>>    drivers/gpu/drm/xe/xe_vm.c | 26 +++++++++++++++++++-------
>>>>>    1 file changed, 19 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>>>>> index d664f2e418b2..668b0bde7822 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>>> @@ -670,12 +670,12 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
>>>>>    	list_for_each_entry_safe(uvma, next, &vm->userptr.invalidated,
>>>>>    				 userptr.invalidate_link) {
>>>>>    		list_del_init(&uvma->userptr.invalidate_link);
>>>>> -		list_move_tail(&uvma->userptr.repin_link,
>>>>> -			       &vm->userptr.repin_list);
>>>>> +		list_add_tail(&uvma->userptr.repin_link,
>>>>> +			      &vm->userptr.repin_list);
>>>>
>>>> Why this change?
>>>
>>> Just that with this patch the repin_link should now always be empty at this
>>> point, I think. add should complain if that is not the case.
>>>
>>
>> If it is always expected to be empty, then yea maybe add a xe_assert for
>> this as the list management is pretty tricky.
>>
>>>>
>>>>>    	}
>>>>>    	spin_unlock(&vm->userptr.invalidated_lock);
>>>>> -	/* Pin and move to temporary list */
>>>>> +	/* Pin and move to bind list */
>>>>>    	list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
>>>>>    				 userptr.repin_link) {
>>>>>    		err = xe_vma_userptr_pin_pages(uvma);
>>>>> @@ -691,10 +691,10 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
>>>>>    			err = xe_vm_invalidate_vma(&uvma->vma);
>>>>>    			xe_vm_unlock(vm);
>>>>>    			if (err)
>>>>> -				return err;
>>>>> +				break;
>>>>>    		} else {
>>>>> -			if (err < 0)
>>>>> -				return err;
>>>>> +			if (err)
>>>>> +				break;
>>>>>    			list_del_init(&uvma->userptr.repin_link);
>>>>>    			list_move_tail(&uvma->vma.combined_links.rebind,
>>>>> @@ -702,7 +702,19 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
>>>>>    		}
>>>>>    	}
>>>>> -	return 0;
>>>>> +	if (err) {
>>>>> +		down_write(&vm->userptr.notifier_lock);
>>>>
>>>> Can you explain why you take the notifier lock here? I don't think this
>>>> required unless I'm missing something.
>>>
>>> For the invalidated list, the docs say:
>>>
>>> "Removing items from the list additionally requires @lock in write mode, and
>>> adding items to the list requires the @userptr.notifer_lock in write mode."
>>>
>>> Not sure if the docs needs to be updated here?
>>>
>>
>> Oh. I believe the part of comment for 'adding items to the list
>> requires the @userptr.notifer_lock in write mode' really means something
>> like this:
>>
>> 'When adding to @vm->userptr.invalidated in the notifier the
>> @userptr.notifer_lock in write mode protects against concurrent VM binds
>> from setting up newly invalidated pages.'
>>
>> So with above and since this code path is in the VM bind path (i.e. we
>> are not racing with other binds) I think the
>> vm->userptr.invalidated_lock is sufficient. Maybe ask Thomas if he
>> agrees here.
>>
> 
> After some discussion with Thomas, removing notifier lock here is safe.

Thanks for confirming.

> 
> However, for adding is either userptr.notifer_lock || vm->lock to also
> avoid races between binds, execs, and rebind worker.
> 
> I'd like update the documentation and add a helper like this:
> 
> void xe_vma_userptr_add_invalidated(struct xe_userptr_vma *uvma)
> {
>         struct xe_vm *vm = xe_vma_vm(&uvma->vma);
> 
>         lockdep_assert(lock_is_held_type(&vm->lock.dep_map, 1) ||
>                        lock_is_held_type(&vm->userptr.notifier_lock.dep_map, 1));
> 
>         spin_lock(&vm->userptr.invalidated_lock);
>         list_move_tail(&uvma->userptr.invalidate_link,
>                        &vm->userptr.invalidated);
>         spin_unlock(&vm->userptr.invalidated_lock);
> }

Sounds good.

> 
> However, let's delay the helper until this series and recently post
> series of mine [1] merge as both are fixes series and hoping for a clean
> backport.

Makes sense.

> 
> Matt
> 
> [1] https://patchwork.freedesktop.org/series/145198/
> 
>> Matt
>>
>>>>
>>>> Matt
>>>>
>>>>> +		spin_lock(&vm->userptr.invalidated_lock);
>>>>> +		list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
>>>>> +					 userptr.repin_link) {
>>>>> +			list_del_init(&uvma->userptr.repin_link);
>>>>> +			list_move_tail(&uvma->userptr.invalidate_link,
>>>>> +				       &vm->userptr.invalidated);
>>>>> +		}
>>>>> +		spin_unlock(&vm->userptr.invalidated_lock);
>>>>> +		up_write(&vm->userptr.notifier_lock);
>>>>> +	}
>>>>> +	return err;
>>>>>    }
>>>>>    /**
>>>>> -- 
>>>>> 2.48.1
>>>>>
>>>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error
  2025-02-21 11:11         ` Matthew Auld
@ 2025-02-21 11:20           ` Thomas Hellström
  2025-02-21 13:17             ` Matthew Auld
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Hellström @ 2025-02-21 11:20 UTC (permalink / raw)
  To: Matthew Auld, Matthew Brost; +Cc: intel-xe, stable

On Fri, 2025-02-21 at 11:11 +0000, Matthew Auld wrote:
> On 20/02/2025 23:52, Matthew Brost wrote:
> > On Mon, Feb 17, 2025 at 07:58:11PM -0800, Matthew Brost wrote:
> > > On Mon, Feb 17, 2025 at 09:38:26AM +0000, Matthew Auld wrote:
> > > > On 15/02/2025 01:28, Matthew Brost wrote:
> > > > > On Fri, Feb 14, 2025 at 05:05:28PM +0000, Matthew Auld wrote:
> > > > > > On error restore anything still on the pin_list back to the
> > > > > > invalidation
> > > > > > list on error. For the actual pin, so long as the vma is
> > > > > > tracked on
> > > > > > either list it should get picked up on the next pin,
> > > > > > however it looks
> > > > > > possible for the vma to get nuked but still be present on
> > > > > > this per vm
> > > > > > pin_list leading to corruption. An alternative might be
> > > > > > then to instead
> > > > > > just remove the link when destroying the vma.
> > > > > > 
> > > > > > Fixes: ed2bdf3b264d ("drm/xe/vm: Subclass userptr vmas")
> > > > > > Suggested-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> > > > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > Cc: <stable@vger.kernel.org> # v6.8+
> > > > > > ---
> > > > > >    drivers/gpu/drm/xe/xe_vm.c | 26 +++++++++++++++++++-----
> > > > > > --
> > > > > >    1 file changed, 19 insertions(+), 7 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c
> > > > > > b/drivers/gpu/drm/xe/xe_vm.c
> > > > > > index d664f2e418b2..668b0bde7822 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > > > > @@ -670,12 +670,12 @@ int xe_vm_userptr_pin(struct xe_vm
> > > > > > *vm)
> > > > > >    	list_for_each_entry_safe(uvma, next, &vm-
> > > > > > >userptr.invalidated,
> > > > > >    				 userptr.invalidate_link)
> > > > > > {
> > > > > >    		list_del_init(&uvma-
> > > > > > >userptr.invalidate_link);
> > > > > > -		list_move_tail(&uvma->userptr.repin_link,
> > > > > > -			       &vm->userptr.repin_list);
> > > > > > +		list_add_tail(&uvma->userptr.repin_link,
> > > > > > +			      &vm->userptr.repin_list);
> > > > > 
> > > > > Why this change?
> > > > 
> > > > Just that with this patch the repin_link should now always be
> > > > empty at this
> > > > point, I think. add should complain if that is not the case.
> > > > 
> > > 
> > > If it is always expected to be empty, then yea maybe add a
> > > xe_assert for
> > > this as the list management is pretty tricky.
> > > 
> > > > > 
> > > > > >    	}
> > > > > >    	spin_unlock(&vm->userptr.invalidated_lock);
> > > > > > -	/* Pin and move to temporary list */
> > > > > > +	/* Pin and move to bind list */
> > > > > >    	list_for_each_entry_safe(uvma, next, &vm-
> > > > > > >userptr.repin_list,
> > > > > >    				 userptr.repin_link) {
> > > > > >    		err = xe_vma_userptr_pin_pages(uvma);
> > > > > > @@ -691,10 +691,10 @@ int xe_vm_userptr_pin(struct xe_vm
> > > > > > *vm)
> > > > > >    			err = xe_vm_invalidate_vma(&uvma-
> > > > > > >vma);
> > > > > >    			xe_vm_unlock(vm);
> > > > > >    			if (err)
> > > > > > -				return err;
> > > > > > +				break;
> > > > > >    		} else {
> > > > > > -			if (err < 0)
> > > > > > -				return err;
> > > > > > +			if (err)
> > > > > > +				break;
> > > > > >    			list_del_init(&uvma-
> > > > > > >userptr.repin_link);
> > > > > >    			list_move_tail(&uvma-
> > > > > > >vma.combined_links.rebind,
> > > > > > @@ -702,7 +702,19 @@ int xe_vm_userptr_pin(struct xe_vm
> > > > > > *vm)
> > > > > >    		}
> > > > > >    	}
> > > > > > -	return 0;
> > > > > > +	if (err) {
> > > > > > +		down_write(&vm->userptr.notifier_lock);
> > > > > 
> > > > > Can you explain why you take the notifier lock here? I don't
> > > > > think this
> > > > > required unless I'm missing something.
> > > > 
> > > > For the invalidated list, the docs say:
> > > > 
> > > > "Removing items from the list additionally requires @lock in
> > > > write mode, and
> > > > adding items to the list requires the @userptr.notifer_lock in
> > > > write mode."
> > > > 
> > > > Not sure if the docs needs to be updated here?
> > > > 
> > > 
> > > Oh. I believe the part of comment for 'adding items to the list
> > > requires the @userptr.notifer_lock in write mode' really means
> > > something
> > > like this:
> > > 
> > > 'When adding to @vm->userptr.invalidated in the notifier the
> > > @userptr.notifer_lock in write mode protects against concurrent
> > > VM binds
> > > from setting up newly invalidated pages.'
> > > 
> > > So with above and since this code path is in the VM bind path
> > > (i.e. we
> > > are not racing with other binds) I think the
> > > vm->userptr.invalidated_lock is sufficient. Maybe ask Thomas if
> > > he
> > > agrees here.
> > > 
> > 
> > After some discussion with Thomas, removing notifier lock here is
> > safe.
> 
> Thanks for confirming.

So basically that was to protect exec when it takes the notifier lock
in read mode, and checks that there are no invalidated userptr, that
needs to stay true as lock as the notifier lock is held.

But as MBrost pointed out, the vm lock is also held, so I think the
kerneldoc should be updated so that the requirement is that either the
notifier lock is held in write mode, or the vm lock in write mode. 

As a general comment these locking protection docs are there to
simplify reading and writing of the code so that when new code is
written and reviewed, we should just keep to the rules to avoid
auditing all locations in the driver where the protected data-structure
is touched. If we want to update those docs I think a complete such
audit needs to be done and all use-cases are understood.

/Thomas


> 
> > 
> > However, for adding is either userptr.notifer_lock || vm->lock to
> > also
> > avoid races between binds, execs, and rebind worker.
> > 
> > I'd like update the documentation and add a helper like this:
> > 
> > void xe_vma_userptr_add_invalidated(struct xe_userptr_vma *uvma)
> > {
> >         struct xe_vm *vm = xe_vma_vm(&uvma->vma);
> > 
> >         lockdep_assert(lock_is_held_type(&vm->lock.dep_map, 1) ||
> >                        lock_is_held_type(&vm-
> > >userptr.notifier_lock.dep_map, 1));
> > 
> >         spin_lock(&vm->userptr.invalidated_lock);
> >         list_move_tail(&uvma->userptr.invalidate_link,
> >                        &vm->userptr.invalidated);
> >         spin_unlock(&vm->userptr.invalidated_lock);
> > }
> 
> Sounds good.
> 
> > 
> > However, let's delay the helper until this series and recently post
> > series of mine [1] merge as both are fixes series and hoping for a
> > clean
> > backport.
> 
> Makes sense.
> 
> > 
> > Matt
> > 
> > [1] https://patchwork.freedesktop.org/series/145198/
> > 
> > > Matt
> > > 
> > > > > 
> > > > > Matt
> > > > > 
> > > > > > +		spin_lock(&vm->userptr.invalidated_lock);
> > > > > > +		list_for_each_entry_safe(uvma, next, &vm-
> > > > > > >userptr.repin_list,
> > > > > > +					
> > > > > > userptr.repin_link) {
> > > > > > +			list_del_init(&uvma-
> > > > > > >userptr.repin_link);
> > > > > > +			list_move_tail(&uvma-
> > > > > > >userptr.invalidate_link,
> > > > > > +				       &vm-
> > > > > > >userptr.invalidated);
> > > > > > +		}
> > > > > > +		spin_unlock(&vm-
> > > > > > >userptr.invalidated_lock);
> > > > > > +		up_write(&vm->userptr.notifier_lock);
> > > > > > +	}
> > > > > > +	return err;
> > > > > >    }
> > > > > >    /**
> > > > > > -- 
> > > > > > 2.48.1
> > > > > > 
> > > > 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error
  2025-02-21 11:20           ` Thomas Hellström
@ 2025-02-21 13:17             ` Matthew Auld
  2025-02-21 13:23               ` Thomas Hellström
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Auld @ 2025-02-21 13:17 UTC (permalink / raw)
  To: Thomas Hellström, Matthew Brost; +Cc: intel-xe, stable

On 21/02/2025 11:20, Thomas Hellström wrote:
> On Fri, 2025-02-21 at 11:11 +0000, Matthew Auld wrote:
>> On 20/02/2025 23:52, Matthew Brost wrote:
>>> On Mon, Feb 17, 2025 at 07:58:11PM -0800, Matthew Brost wrote:
>>>> On Mon, Feb 17, 2025 at 09:38:26AM +0000, Matthew Auld wrote:
>>>>> On 15/02/2025 01:28, Matthew Brost wrote:
>>>>>> On Fri, Feb 14, 2025 at 05:05:28PM +0000, Matthew Auld wrote:
>>>>>>> On error restore anything still on the pin_list back to the
>>>>>>> invalidation
>>>>>>> list on error. For the actual pin, so long as the vma is
>>>>>>> tracked on
>>>>>>> either list it should get picked up on the next pin,
>>>>>>> however it looks
>>>>>>> possible for the vma to get nuked but still be present on
>>>>>>> this per vm
>>>>>>> pin_list leading to corruption. An alternative might be
>>>>>>> then to instead
>>>>>>> just remove the link when destroying the vma.
>>>>>>>
>>>>>>> Fixes: ed2bdf3b264d ("drm/xe/vm: Subclass userptr vmas")
>>>>>>> Suggested-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>>>> Cc: <stable@vger.kernel.org> # v6.8+
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/xe/xe_vm.c | 26 +++++++++++++++++++-----
>>>>>>> --
>>>>>>>     1 file changed, 19 insertions(+), 7 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
>>>>>>> b/drivers/gpu/drm/xe/xe_vm.c
>>>>>>> index d664f2e418b2..668b0bde7822 100644
>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>>>>> @@ -670,12 +670,12 @@ int xe_vm_userptr_pin(struct xe_vm
>>>>>>> *vm)
>>>>>>>     	list_for_each_entry_safe(uvma, next, &vm-
>>>>>>>> userptr.invalidated,
>>>>>>>     				 userptr.invalidate_link)
>>>>>>> {
>>>>>>>     		list_del_init(&uvma-
>>>>>>>> userptr.invalidate_link);
>>>>>>> -		list_move_tail(&uvma->userptr.repin_link,
>>>>>>> -			       &vm->userptr.repin_list);
>>>>>>> +		list_add_tail(&uvma->userptr.repin_link,
>>>>>>> +			      &vm->userptr.repin_list);
>>>>>>
>>>>>> Why this change?
>>>>>
>>>>> Just that with this patch the repin_link should now always be
>>>>> empty at this
>>>>> point, I think. add should complain if that is not the case.
>>>>>
>>>>
>>>> If it is always expected to be empty, then yea maybe add a
>>>> xe_assert for
>>>> this as the list management is pretty tricky.
>>>>
>>>>>>
>>>>>>>     	}
>>>>>>>     	spin_unlock(&vm->userptr.invalidated_lock);
>>>>>>> -	/* Pin and move to temporary list */
>>>>>>> +	/* Pin and move to bind list */
>>>>>>>     	list_for_each_entry_safe(uvma, next, &vm-
>>>>>>>> userptr.repin_list,
>>>>>>>     				 userptr.repin_link) {
>>>>>>>     		err = xe_vma_userptr_pin_pages(uvma);
>>>>>>> @@ -691,10 +691,10 @@ int xe_vm_userptr_pin(struct xe_vm
>>>>>>> *vm)
>>>>>>>     			err = xe_vm_invalidate_vma(&uvma-
>>>>>>>> vma);
>>>>>>>     			xe_vm_unlock(vm);
>>>>>>>     			if (err)
>>>>>>> -				return err;
>>>>>>> +				break;
>>>>>>>     		} else {
>>>>>>> -			if (err < 0)
>>>>>>> -				return err;
>>>>>>> +			if (err)
>>>>>>> +				break;
>>>>>>>     			list_del_init(&uvma-
>>>>>>>> userptr.repin_link);
>>>>>>>     			list_move_tail(&uvma-
>>>>>>>> vma.combined_links.rebind,
>>>>>>> @@ -702,7 +702,19 @@ int xe_vm_userptr_pin(struct xe_vm
>>>>>>> *vm)
>>>>>>>     		}
>>>>>>>     	}
>>>>>>> -	return 0;
>>>>>>> +	if (err) {
>>>>>>> +		down_write(&vm->userptr.notifier_lock);
>>>>>>
>>>>>> Can you explain why you take the notifier lock here? I don't
>>>>>> think this
>>>>>> required unless I'm missing something.
>>>>>
>>>>> For the invalidated list, the docs say:
>>>>>
>>>>> "Removing items from the list additionally requires @lock in
>>>>> write mode, and
>>>>> adding items to the list requires the @userptr.notifer_lock in
>>>>> write mode."
>>>>>
>>>>> Not sure if the docs needs to be updated here?
>>>>>
>>>>
>>>> Oh. I believe the part of comment for 'adding items to the list
>>>> requires the @userptr.notifer_lock in write mode' really means
>>>> something
>>>> like this:
>>>>
>>>> 'When adding to @vm->userptr.invalidated in the notifier the
>>>> @userptr.notifer_lock in write mode protects against concurrent
>>>> VM binds
>>>> from setting up newly invalidated pages.'
>>>>
>>>> So with above and since this code path is in the VM bind path
>>>> (i.e. we
>>>> are not racing with other binds) I think the
>>>> vm->userptr.invalidated_lock is sufficient. Maybe ask Thomas if
>>>> he
>>>> agrees here.
>>>>
>>>
>>> After some discussion with Thomas, removing notifier lock here is
>>> safe.
>>
>> Thanks for confirming.
> 
> So basically that was to protect exec when it takes the notifier lock
> in read mode, and checks that there are no invalidated userptr, that
> needs to stay true as lock as the notifier lock is held.
> 
> But as MBrost pointed out, the vm lock is also held, so I think the
> kerneldoc should be updated so that the requirement is that either the
> notifier lock is held in write mode, or the vm lock in write mode.
> 
> As a general comment these locking protection docs are there to
> simplify reading and writing of the code so that when new code is
> written and reviewed, we should just keep to the rules to avoid
> auditing all locations in the driver where the protected data-structure
> is touched. If we want to update those docs I think a complete such
> audit needs to be done and all use-cases are understood.

For this patch is the preference to go with the slightly overzealous 
locking for now? Circling back around later, fixing the doc when adding 
the new helper, and at the same time also audit all callers?

> 
> /Thomas
> 
> 
>>
>>>
>>> However, for adding is either userptr.notifer_lock || vm->lock to
>>> also
>>> avoid races between binds, execs, and rebind worker.
>>>
>>> I'd like update the documentation and add a helper like this:
>>>
>>> void xe_vma_userptr_add_invalidated(struct xe_userptr_vma *uvma)
>>> {
>>>          struct xe_vm *vm = xe_vma_vm(&uvma->vma);
>>>
>>>          lockdep_assert(lock_is_held_type(&vm->lock.dep_map, 1) ||
>>>                         lock_is_held_type(&vm-
>>>> userptr.notifier_lock.dep_map, 1));
>>>
>>>          spin_lock(&vm->userptr.invalidated_lock);
>>>          list_move_tail(&uvma->userptr.invalidate_link,
>>>                         &vm->userptr.invalidated);
>>>          spin_unlock(&vm->userptr.invalidated_lock);
>>> }
>>
>> Sounds good.
>>
>>>
>>> However, let's delay the helper until this series and recently post
>>> series of mine [1] merge as both are fixes series and hoping for a
>>> clean
>>> backport.
>>
>> Makes sense.
>>
>>>
>>> Matt
>>>
>>> [1] https://patchwork.freedesktop.org/series/145198/
>>>
>>>> Matt
>>>>
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>> +		spin_lock(&vm->userptr.invalidated_lock);
>>>>>>> +		list_for_each_entry_safe(uvma, next, &vm-
>>>>>>>> userptr.repin_list,
>>>>>>> +					
>>>>>>> userptr.repin_link) {
>>>>>>> +			list_del_init(&uvma-
>>>>>>>> userptr.repin_link);
>>>>>>> +			list_move_tail(&uvma-
>>>>>>>> userptr.invalidate_link,
>>>>>>> +				       &vm-
>>>>>>>> userptr.invalidated);
>>>>>>> +		}
>>>>>>> +		spin_unlock(&vm-
>>>>>>>> userptr.invalidated_lock);
>>>>>>> +		up_write(&vm->userptr.notifier_lock);
>>>>>>> +	}
>>>>>>> +	return err;
>>>>>>>     }
>>>>>>>     /**
>>>>>>> -- 
>>>>>>> 2.48.1
>>>>>>>
>>>>>
>>
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error
  2025-02-21 13:17             ` Matthew Auld
@ 2025-02-21 13:23               ` Thomas Hellström
  0 siblings, 0 replies; 12+ messages in thread
From: Thomas Hellström @ 2025-02-21 13:23 UTC (permalink / raw)
  To: Matthew Auld, Matthew Brost; +Cc: intel-xe, stable

On Fri, 2025-02-21 at 13:17 +0000, Matthew Auld wrote:
> On 21/02/2025 11:20, Thomas Hellström wrote:
> > On Fri, 2025-02-21 at 11:11 +0000, Matthew Auld wrote:
> > > On 20/02/2025 23:52, Matthew Brost wrote:
> > > > On Mon, Feb 17, 2025 at 07:58:11PM -0800, Matthew Brost wrote:
> > > > > On Mon, Feb 17, 2025 at 09:38:26AM +0000, Matthew Auld wrote:
> > > > > > On 15/02/2025 01:28, Matthew Brost wrote:
> > > > > > > On Fri, Feb 14, 2025 at 05:05:28PM +0000, Matthew Auld
> > > > > > > wrote:
> > > > > > > > On error restore anything still on the pin_list back to
> > > > > > > > the
> > > > > > > > invalidation
> > > > > > > > list on error. For the actual pin, so long as the vma
> > > > > > > > is
> > > > > > > > tracked on
> > > > > > > > either list it should get picked up on the next pin,
> > > > > > > > however it looks
> > > > > > > > possible for the vma to get nuked but still be present
> > > > > > > > on
> > > > > > > > this per vm
> > > > > > > > pin_list leading to corruption. An alternative might be
> > > > > > > > then to instead
> > > > > > > > just remove the link when destroying the vma.
> > > > > > > > 
> > > > > > > > Fixes: ed2bdf3b264d ("drm/xe/vm: Subclass userptr
> > > > > > > > vmas")
> > > > > > > > Suggested-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> > > > > > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > > > Cc: <stable@vger.kernel.org> # v6.8+
> > > > > > > > ---
> > > > > > > >     drivers/gpu/drm/xe/xe_vm.c | 26
> > > > > > > > +++++++++++++++++++-----
> > > > > > > > --
> > > > > > > >     1 file changed, 19 insertions(+), 7 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c
> > > > > > > > b/drivers/gpu/drm/xe/xe_vm.c
> > > > > > > > index d664f2e418b2..668b0bde7822 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > > > > > > @@ -670,12 +670,12 @@ int xe_vm_userptr_pin(struct
> > > > > > > > xe_vm
> > > > > > > > *vm)
> > > > > > > >     	list_for_each_entry_safe(uvma, next, &vm-
> > > > > > > > > userptr.invalidated,
> > > > > > > >     				
> > > > > > > > userptr.invalidate_link)
> > > > > > > > {
> > > > > > > >     		list_del_init(&uvma-
> > > > > > > > > userptr.invalidate_link);
> > > > > > > > -		list_move_tail(&uvma-
> > > > > > > > >userptr.repin_link,
> > > > > > > > -			       &vm-
> > > > > > > > >userptr.repin_list);
> > > > > > > > +		list_add_tail(&uvma-
> > > > > > > > >userptr.repin_link,
> > > > > > > > +			      &vm-
> > > > > > > > >userptr.repin_list);
> > > > > > > 
> > > > > > > Why this change?
> > > > > > 
> > > > > > Just that with this patch the repin_link should now always
> > > > > > be
> > > > > > empty at this
> > > > > > point, I think. add should complain if that is not the
> > > > > > case.
> > > > > > 
> > > > > 
> > > > > If it is always expected to be empty, then yea maybe add a
> > > > > xe_assert for
> > > > > this as the list management is pretty tricky.
> > > > > 
> > > > > > > 
> > > > > > > >     	}
> > > > > > > >     	spin_unlock(&vm->userptr.invalidated_lock);
> > > > > > > > -	/* Pin and move to temporary list */
> > > > > > > > +	/* Pin and move to bind list */
> > > > > > > >     	list_for_each_entry_safe(uvma, next, &vm-
> > > > > > > > > userptr.repin_list,
> > > > > > > >     				 userptr.repin_link) {
> > > > > > > >     		err = xe_vma_userptr_pin_pages(uvma);
> > > > > > > > @@ -691,10 +691,10 @@ int xe_vm_userptr_pin(struct
> > > > > > > > xe_vm
> > > > > > > > *vm)
> > > > > > > >     			err =
> > > > > > > > xe_vm_invalidate_vma(&uvma-
> > > > > > > > > vma);
> > > > > > > >     			xe_vm_unlock(vm);
> > > > > > > >     			if (err)
> > > > > > > > -				return err;
> > > > > > > > +				break;
> > > > > > > >     		} else {
> > > > > > > > -			if (err < 0)
> > > > > > > > -				return err;
> > > > > > > > +			if (err)
> > > > > > > > +				break;
> > > > > > > >     			list_del_init(&uvma-
> > > > > > > > > userptr.repin_link);
> > > > > > > >     			list_move_tail(&uvma-
> > > > > > > > > vma.combined_links.rebind,
> > > > > > > > @@ -702,7 +702,19 @@ int xe_vm_userptr_pin(struct xe_vm
> > > > > > > > *vm)
> > > > > > > >     		}
> > > > > > > >     	}
> > > > > > > > -	return 0;
> > > > > > > > +	if (err) {
> > > > > > > > +		down_write(&vm-
> > > > > > > > >userptr.notifier_lock);
> > > > > > > 
> > > > > > > Can you explain why you take the notifier lock here? I
> > > > > > > don't
> > > > > > > think this
> > > > > > > required unless I'm missing something.
> > > > > > 
> > > > > > For the invalidated list, the docs say:
> > > > > > 
> > > > > > "Removing items from the list additionally requires @lock
> > > > > > in
> > > > > > write mode, and
> > > > > > adding items to the list requires the @userptr.notifer_lock
> > > > > > in
> > > > > > write mode."
> > > > > > 
> > > > > > Not sure if the docs needs to be updated here?
> > > > > > 
> > > > > 
> > > > > Oh. I believe the part of comment for 'adding items to the
> > > > > list
> > > > > requires the @userptr.notifer_lock in write mode' really
> > > > > means
> > > > > something
> > > > > like this:
> > > > > 
> > > > > 'When adding to @vm->userptr.invalidated in the notifier the
> > > > > @userptr.notifer_lock in write mode protects against
> > > > > concurrent
> > > > > VM binds
> > > > > from setting up newly invalidated pages.'
> > > > > 
> > > > > So with above and since this code path is in the VM bind path
> > > > > (i.e. we
> > > > > are not racing with other binds) I think the
> > > > > vm->userptr.invalidated_lock is sufficient. Maybe ask Thomas
> > > > > if
> > > > > he
> > > > > agrees here.
> > > > > 
> > > > 
> > > > After some discussion with Thomas, removing notifier lock here
> > > > is
> > > > safe.
> > > 
> > > Thanks for confirming.
> > 
> > So basically that was to protect exec when it takes the notifier
> > lock
> > in read mode, and checks that there are no invalidated userptr,
> > that
> > needs to stay true as lock as the notifier lock is held.
> > 
> > But as MBrost pointed out, the vm lock is also held, so I think the
> > kerneldoc should be updated so that the requirement is that either
> > the
> > notifier lock is held in write mode, or the vm lock in write mode.
> > 
> > As a general comment these locking protection docs are there to
> > simplify reading and writing of the code so that when new code is
> > written and reviewed, we should just keep to the rules to avoid
> > auditing all locations in the driver where the protected data-
> > structure
> > is touched. If we want to update those docs I think a complete such
> > audit needs to be done and all use-cases are understood.
> 
> For this patch is the preference to go with the slightly overzealous 
> locking for now? Circling back around later, fixing the doc when
> adding 
> the new helper, and at the same time also audit all callers?

Since it's a -fixes patch I think we should keep the locking and
documentation consistent, so either update the docs also in the stable
backports or do the overzealous locking.

/Thomas



> 
> > 
> > /Thomas
> > 
> > 
> > > 
> > > > 
> > > > However, for adding is either userptr.notifer_lock || vm->lock
> > > > to
> > > > also
> > > > avoid races between binds, execs, and rebind worker.
> > > > 
> > > > I'd like update the documentation and add a helper like this:
> > > > 
> > > > void xe_vma_userptr_add_invalidated(struct xe_userptr_vma
> > > > *uvma)
> > > > {
> > > >          struct xe_vm *vm = xe_vma_vm(&uvma->vma);
> > > > 
> > > >          lockdep_assert(lock_is_held_type(&vm->lock.dep_map, 1)
> > > > ||
> > > >                         lock_is_held_type(&vm-
> > > > > userptr.notifier_lock.dep_map, 1));
> > > > 
> > > >          spin_lock(&vm->userptr.invalidated_lock);
> > > >          list_move_tail(&uvma->userptr.invalidate_link,
> > > >                         &vm->userptr.invalidated);
> > > >          spin_unlock(&vm->userptr.invalidated_lock);
> > > > }
> > > 
> > > Sounds good.
> > > 
> > > > 
> > > > However, let's delay the helper until this series and recently
> > > > post
> > > > series of mine [1] merge as both are fixes series and hoping
> > > > for a
> > > > clean
> > > > backport.
> > > 
> > > Makes sense.
> > > 
> > > > 
> > > > Matt
> > > > 
> > > > [1] https://patchwork.freedesktop.org/series/145198/
> > > > 
> > > > > Matt
> > > > > 
> > > > > > > 
> > > > > > > Matt
> > > > > > > 
> > > > > > > > +		spin_lock(&vm-
> > > > > > > > >userptr.invalidated_lock);
> > > > > > > > +		list_for_each_entry_safe(uvma, next,
> > > > > > > > &vm-
> > > > > > > > > userptr.repin_list,
> > > > > > > > +					
> > > > > > > > userptr.repin_link) {
> > > > > > > > +			list_del_init(&uvma-
> > > > > > > > > userptr.repin_link);
> > > > > > > > +			list_move_tail(&uvma-
> > > > > > > > > userptr.invalidate_link,
> > > > > > > > +				       &vm-
> > > > > > > > > userptr.invalidated);
> > > > > > > > +		}
> > > > > > > > +		spin_unlock(&vm-
> > > > > > > > > userptr.invalidated_lock);
> > > > > > > > +		up_write(&vm->userptr.notifier_lock);
> > > > > > > > +	}
> > > > > > > > +	return err;
> > > > > > > >     }
> > > > > > > >     /**
> > > > > > > > -- 
> > > > > > > > 2.48.1
> > > > > > > > 
> > > > > > 
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-02-21 13:23 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-14 17:05 [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error Matthew Auld
2025-02-14 17:05 ` [PATCH v2 2/3] drm/xe/userptr: fix EFAULT handling Matthew Auld
2025-02-15  1:23   ` Matthew Brost
2025-02-17  9:19     ` Matthew Auld
2025-02-15  1:28 ` [PATCH v2 1/3] drm/xe/userptr: restore invalidation list on error Matthew Brost
2025-02-17  9:38   ` Matthew Auld
2025-02-18  3:58     ` Matthew Brost
2025-02-20 23:52       ` Matthew Brost
2025-02-21 11:11         ` Matthew Auld
2025-02-21 11:20           ` Thomas Hellström
2025-02-21 13:17             ` Matthew Auld
2025-02-21 13:23               ` Thomas Hellström

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox