* dma_buf_detach lockdep splat
@ 2024-06-26 15:58 Thomas Hellström
From: Thomas Hellström @ 2024-06-26 15:58 UTC (permalink / raw)
To: Christian König, Dmitry Osipenko, dri-devel, intel-xe; +Cc: linaro-mm-sig
Hi!
I'm seeing the below lockdep splat [1] with the xe driver in an imported
dma-buf object destruction path.
It's not because we hold that object's dma_resv lock at that point, but
rather because we hold *another* dma_resv lock at that point. The
dma-buf detach happens when the object is idle; in this case it was idle
at the final put(), and dma_buf_detach() is called in the putting
process.
Holding another dma_resv lock might happen as part of
drm_exec_unlock_all(), or simply if the wider vm dma_resv was held at
object put time, so it's not an uncommon pattern, even if the drm_exec
instance can be fixed by putting all bos after unlocking them all.
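To illustrate, a minimal sketch of the pattern (hypothetical bo names,
not the actual xe call chain):

	/*
	 * Both locks are reservation_ww_class_mutex instances, so lockdep
	 * flags the second acquisition as possibly recursive.
	 */
	dma_resv_lock(vm_bo->base.resv, NULL);		/* e.g. via drm_exec */

	drm_gem_object_put(&imported_bo->base);		/* final put */
	/* -> xe_ttm_bo_destroy() -> drm_prime_gem_destroy()
	 *    -> dma_buf_detach()
	 *       -> dma_resv_lock(imported_bo->base.resv)  <-- splat
	 */

	dma_resv_unlock(vm_bo->base.resv);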
Two solutions coming to mind here:
1) Provide a dma_buf_detach_locked()
2) Have TTM always take the delayed delete path for imported dma-buf
objects.
I'd prefer 1), since I think the correct place to call this is in the
TTM callback delete_mem_notify(), where the bo is already locked, and I
figure non-TTM GEM backends may come to suffer from the same problem.
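For reference, a sketch of what 1) could look like
(dma_buf_detach_locked() is the hypothetical part, and the callback body
is simplified, not the actual xe one):

	static void bo_delete_mem_notify(struct ttm_buffer_object *ttm_bo)
	{
		struct drm_gem_object *obj = &ttm_bo->base;

		/* TTM calls this with the bo's dma_resv already held. */
		dma_resv_assert_held(obj->resv);

		if (obj->import_attach)
			dma_buf_detach_locked(obj->import_attach->dmabuf,
					      obj->import_attach);
	}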
Opinions, suggestions?
[1]
[ 99.136161] ============================================
[ 99.136162] WARNING: possible recursive locking detected
[ 99.136163] 6.10.0-rc2+ #6 Tainted: G U
[ 99.136165] --------------------------------------------
[ 99.136166] glxgears:sh0/4675 is trying to acquire lock:
[ 99.136167] ffff9967dcdd91a8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: dma_buf_detach+0x3b/0xf0
[ 99.136184]
but task is already holding lock:
[ 99.136186] ffff9967d8c145a8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[ 99.136191]
other info that might help us debug this:
[ 99.136192] Possible unsafe locking scenario:
[ 99.136194] CPU0
[ 99.136194] ----
[ 99.136195] lock(reservation_ww_class_mutex);
[ 99.136197] lock(reservation_ww_class_mutex);
[ 99.136199]
*** DEADLOCK ***
[ 99.136199] May be due to missing lock nesting notation
[ 99.136200] 5 locks held by glxgears:sh0/4675:
[ 99.136202] #0: ffff9967d8c104c8 (&xef->vm.lock){+.+.}-{3:3}, at: xe_file_close+0xde/0x1c0 [xe]
[ 99.136272] #1: ffff9967d5bb7480 (&vm->lock){++++}-{3:3}, at: xe_vm_close_and_put+0x161/0x9b0 [xe]
[ 99.136350] #2: ffff9967ef88a970 (&val->lock){.+.+}-{3:3}, at: xe_validation_ctx_init+0x6d/0x70 [xe]
[ 99.136440] #3: ffffbd6a085577b8 (reservation_ww_class_acquire){+.+.}-{0:0}, at: xe_vma_destroy_unlocked+0x7f/0xe0 [xe]
[ 99.136546] #4: ffff9967d8c145a8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[ 99.136552]
stack backtrace:
[ 99.136553] CPU: 10 PID: 4675 Comm: glxgears:sh0 Tainted: G U 6.10.0-rc2+ #6
[ 99.136555] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
[ 99.136557] Call Trace:
[ 99.136558] <TASK>
[ 99.136560] dump_stack_lvl+0x77/0xb0
[ 99.136564] __lock_acquire+0x1232/0x2160
[ 99.136569] lock_acquire+0xcb/0x2d0
[ 99.136570] ? dma_buf_detach+0x3b/0xf0
[ 99.136574] ? __lock_acquire+0x417/0x2160
[ 99.136577] __ww_mutex_lock.constprop.0+0xd0/0x13b0
[ 99.136580] ? dma_buf_detach+0x3b/0xf0
[ 99.136584] ? dma_buf_detach+0x3b/0xf0
[ 99.136588] ? ww_mutex_lock+0x2b/0x90
[ 99.136590] ww_mutex_lock+0x2b/0x90
[ 99.136592] dma_buf_detach+0x3b/0xf0
[ 99.136595] drm_prime_gem_destroy+0x2f/0x40 [drm]
[ 99.136638] xe_ttm_bo_destroy+0x32/0x220 [xe]
[ 99.136734] ? __mutex_unlock_slowpath+0x3a/0x290
[ 99.136738] drm_exec_unlock_all+0xa1/0xd0 [drm_exec]
[ 99.136741] drm_exec_fini+0x12/0xb0 [drm_exec]
[ 99.136743] xe_validation_ctx_fini+0x15/0x40 [xe]
[ 99.136848] xe_vma_destroy_unlocked+0xb1/0xe0 [xe]
[ 99.136954] xe_vm_close_and_put+0x41a/0x9b0 [xe]
[ 99.137056] ? xa_find+0xe3/0x1e0
[ 99.137060] xe_file_close+0x10a/0x1c0 [xe]
[ 99.137157] drm_file_free+0x22a/0x280 [drm]
[ 99.137193] drm_release_noglobal+0x22/0x70 [drm]
[ 99.137227] __fput+0xf1/0x2d0
[ 99.137231] task_work_run+0x59/0x90
[ 99.137235] do_exit+0x330/0xb40
[ 99.137238] do_group_exit+0x36/0xa0
[ 99.137241] get_signal+0xbd2/0xbe0
[ 99.137245] arch_do_signal_or_restart+0x3e/0x240
[ 99.137249] syscall_exit_to_user_mode+0x1e7/0x290
[ 99.137252] do_syscall_64+0xa1/0x180
[ 99.137255] ? _raw_spin_unlock+0x23/0x40
[ 99.137257] ? look_up_lock_class+0x6f/0x120
[ 99.137261] ? __lock_acquire+0x417/0x2160
[ 99.137264] ? lock_acquire+0xcb/0x2d0
[ 99.137266] ? __set_task_comm+0x28/0x1e0
[ 99.137268] ? find_held_lock+0x2b/0x80
[ 99.137271] ? __set_task_comm+0xe1/0x1e0
[ 99.137273] ? lock_release+0xca/0x290
[ 99.137277] ? __do_sys_prctl+0x245/0xab0
[ 99.137279] ? lockdep_hardirqs_on_prepare+0xde/0x190
[ 99.137281] ? syscall_exit_to_user_mode+0xb0/0x290
[ 99.137284] ? do_syscall_64+0xa1/0x180
[ 99.137286] ? cpuset_cpus_allowed+0x36/0x140
[ 99.137289] ? find_held_lock+0x2b/0x80
[ 99.137291] ? find_held_lock+0x2b/0x80
[ 99.137294] ? __sched_setaffinity+0x78/0x240
[ 99.137297] ? kfree+0xe2/0x310
[ 99.137301] ? kfree+0x202/0x310
[ 99.137303] ? __sched_setaffinity+0x78/0x240
[ 99.137305] ? __x64_sys_sched_setaffinity+0x69/0xb0
[ 99.137307] ? kfree+0xe2/0x310
[ 99.137310] ? lockdep_hardirqs_on_prepare+0xde/0x190
[ 99.137312] ? syscall_exit_to_user_mode+0xb0/0x290
[ 99.137315] ? do_syscall_64+0xa1/0x180
[ 99.137317] ? trace_hardirqs_off+0x4b/0xc0
[ 99.137321] ? clear_bhb_loop+0x45/0xa0
[ 99.137325] ? clear_bhb_loop+0x45/0xa0
[ 99.137327] ? clear_bhb_loop+0x45/0xa0
[ 99.137330] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 99.137333] RIP: 0033:0x7fda70ee6169
[ 99.137351] Code: Unable to access opcode bytes at 0x7fda70ee613f.
[ 99.137352] RSP: 002b:00007fda5fdffc80 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[ 99.137354] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007fda70ee6169
[ 99.137356] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 0000564a96f45b30
[ 99.137358] RBP: 00007fda5fdffcb0 R08: 0000000000000000 R09: 00000000ffffffff
[ 99.137359] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 99.137360] R13: 0000000000000000 R14: 0000000000000000 R15: 0000564a96f45b30
[ 99.137365] </TASK>
* Re: dma_buf_detach lockdep splat
@ 2024-06-27 8:04 ` Daniel Vetter
From: Daniel Vetter @ 2024-06-27 8:04 UTC (permalink / raw)
To: Thomas Hellström
Cc: Christian König, Dmitry Osipenko, dri-devel, intel-xe,
linaro-mm-sig
On Wed, Jun 26, 2024 at 05:58:02PM +0200, Thomas Hellström wrote:
> Hi!
>
> I'm seeing the below lockdep splat [1] with the xe driver in an imported
> dma-buf object destruction path.
>
> It's not because we hold the dma_resv lock at that point, but rather
> because we hold *another* dma_resv lock at that point, and the dma_resv
> detach happens when the object is idle, in this case it was idle at the
> final put(), and dma_buf_detach() is called in the putting process.
>
> Holding another dma-buf lock might happen as part of
> drm_exec_unlock_all, or simply if the wider vm dma_resv was held at
> object put time, so it's not an uncommon pattern, even if the drm_exec
> instance can be fixed by putting all bos after unlocking them all.
>
> Two solutions coming to mind here:
>
> 1) Provide a dma_buf_detach_locked()
This smells way too much like the endless headaches we had with
drm_gem_object_put_locked and friends against drm_device.struct_mutex. Or
I'm not understanding what you're doing, because I'm pretty sure you have
to take the dma_resv lock on final put() of imported objects. Because that
final put() is of the import wrapper, the exporter (and other importers)
can still get at that object and so dma_resv_lock is very much needed.
Or it's a completely different final put(), but I have no idea how you get
that on an imported dma_buf.
> 2) Have TTM always take the delayed delete path for imported dma-buf
> objects.
>
> I'd prefer 1) since I think the correct place to call this is in the
> TTM callback delete_mem_notify() where the bo is already locked, and I
> figure non-TTM gem backends may come to suffer from the same problem.
>
> Opinions, suggestions?
Imo 2) or trying to push the object puts outside of the dma_resv_lock. The
latter is imo natural, since usually you grab references, then lock. And
this even holds for at least the slow path of lru eviction, because you
need to drop all locks and then do a ww_mutex_lock_slow, and that requires
that you can hold references to unlocked objects.
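Roughly, the usual ordering (a hypothetical fragment, using only the
existing GEM/dma_resv API):

	drm_gem_object_get(obj);		/* 1. reference */
	dma_resv_lock(obj->resv, NULL);		/* 2. lock */
	/* ... work on the object ... */
	dma_resv_unlock(obj->resv);		/* 3. unlock */
	drm_gem_object_put(obj);		/* 4. put, possibly final,
						 *    no resv held */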
But 2) alone is imo fine; dma_bufs have become really big objects that
go across drivers, extremely similar to struct file, and that has been
doing the delayed final put unconditionally for years now, using
task_work. It's simply a solid design.
Cheers, Sima
> [1]
> [snip: lockdep splat quoted in full]
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: dma_buf_detach lockdep splat
@ 2024-06-27 8:25 ` Christian König
From: Christian König @ 2024-06-27 8:25 UTC (permalink / raw)
To: Daniel Vetter, Thomas Hellström
Cc: Dmitry Osipenko, dri-devel, intel-xe, linaro-mm-sig
Am 27.06.24 um 10:04 schrieb Daniel Vetter:
> On Wed, Jun 26, 2024 at 05:58:02PM +0200, Thomas Hellström wrote:
>> Hi!
>>
>> I'm seeing the below lockdep splat [1] with the xe driver in an imported
>> dma-buf object destruction path.
Mhm strange.
>>
>> It's not because we hold the dma_resv lock at that point, but rather
>> because we hold *another* dma_resv lock at that point, and the dma_resv
>> detach happens when the object is idle, in this case it was idle at the
>> final put(), and dma_buf_detach() is called in the putting process.
>>
>> Holding another dma-buf lock might happen as part of
>> drm_exec_unlock_all, or simply if the wider vm dma_resv was held at
>> object put time, so it's not an uncommon pattern, even if the drm_exec
>> instance can be fixed by putting all bos after unlocking them all.
>>
>> Two solutions coming to mind here:
>>
>> 1) Provide a dma_buf_detach_locked()
> This smells way too much like the endless headaches we had with
> drm_gem_object_put_locked and friends against drm_device.struct_mutex. Or
> I'm not understanding what you're doing, because I'm pretty sure you have
> to take the dma_resv lock on final put() of imported objects. Because that
> final put() is of the import wrapper, the exporter (and other importers)
> can still get at that object and so dma_resv_lock is very much needed.
>
> Or it's a completely different final put(), but I have no idea how you get
> that on an imported dma_buf.
>
>> 2) Have TTM always take the delayed delete path for imported dma-buf
>> objects.
>>
>> I'd prefer 1) since I think the correct place to call this is in the
>> TTM callback delete_mem_notify() where the bo is already locked, and I
>> figure non-TTM gem backends may come to suffer from the same problem.
>>
>> Opinions, suggestions?
> Imo 2) or trying to push the object puts outside of the dma_resv_lock.
IIRC I've stumbled over this issue before with TTM but thought that I'd
fixed it.
I mean, no objections from my side to changing drm_exec_fini() to
something like this:

	drm_exec_for_each_locked_object_reverse(exec, index, obj)
		dma_resv_unlock(obj->resv);

	drm_exec_for_each_locked_object_reverse(exec, index, obj)
		drm_gem_object_put(obj);
but in general, dropping the last reference while holding a different
reservation object is nothing special. For example, that happens all the
time in TTM's eviction code.

So at least for TTM I would say we should move cleanup of imported BOs
to the worker. But I'm not sure if that covers everything.
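As a rough sketch (simplified, not the actual ttm_bo_release(); it
assumes the existing delayed_delete worker and that imported objects are
created as ttm_bo_type_sg):

	if (bo->type == ttm_bo_type_sg ||	/* imported: always defer */
	    !dma_resv_test_signaled(bo->base.resv, DMA_RESV_USAGE_BOOKKEEP) ||
	    !dma_resv_trylock(bo->base.resv)) {
		/*
		 * The worker takes the bo's dma_resv with no other
		 * reservation lock held, so the dma_buf_detach() in the
		 * destroy path no longer nests inside another lock of
		 * the same ww class.
		 */
		INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
		queue_work(bdev->wq, &bo->delayed_delete);
		return;
	}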
Regards,
Christian.
> The
> latter is imo natural, since usually you grab references, then lock. And
> this even holds for at least the slow path of lru eviction, because you
> need to drop all locks and then do a ww_mutex_lock_slow, and that requires
> that you can hold references to unlocked objects.
>
> But 2) alone is imo fine; dma_bufs have become really big objects that
> go across drivers, extremely similar to struct file, and that has been
> doing the delayed final put unconditionally for years now, using
> task_work. It's simply a solid design.
>
> Cheers, Sima
>
>> [1]
>> [snip: lockdep splat quoted in full]
* Re: [Linaro-mm-sig] Re: dma_buf_detach lockdep splat
@ 2024-06-27 12:03 ` Thomas Hellström
From: Thomas Hellström @ 2024-06-27 12:03 UTC (permalink / raw)
To: Christian König, Daniel Vetter
Cc: Dmitry Osipenko, dri-devel, intel-xe, linaro-mm-sig
On Thu, 2024-06-27 at 10:25 +0200, Christian König wrote:
> Am 27.06.24 um 10:04 schrieb Daniel Vetter:
> > On Wed, Jun 26, 2024 at 05:58:02PM +0200, Thomas Hellström wrote:
> > > Hi!
> > >
> > > I'm seeing the below lockdep splat [1] with the xe driver in an
> > > imported
> > > dma-buf object destruction path.
>
> Mhm strange.
>
> > >
> > > It's not because we hold the dma_resv lock at that point, but
> > > rather
> > > because we hold *another* dma_resv lock at that point, and the
> > > dma_resv
> > > detach happens when the object is idle, in this case it was idle
> > > at the
> > > final put(), and dma_buf_detach() is called in the putting
> > > process.
> > >
> > > Holding another dma-buf lock might happen as part of
> > > drm_exec_unlock_all, or simply if the wider vm dma_resv was held
> > > at
> > > object put time, so it's not an uncommon pattern, even if the
> > > drm_exec
> > > instance can be fixed by putting all bos after unlocking them
> > > all.
> > >
> > > Two solutions coming to mind here:
> > >
> > > 1) Provide a dma_buf_detach_locked()
> > This smells way too much like the endless headaches we had with
> > drm_gem_object_put_locked and friends against
> > drm_device.struct_mutex. Or
> > I'm not understanding what you're doing, because I'm pretty sure
> > you have
> > to take the dma_resv lock on final put() of imported objects.
> > Because that
> > final put() is of the import wrapper, the exporter (and other
> > importers)
> > can still get at that object and so dma_resv_lock is very much
> > needed.
> >
> > Or it's a completely different final put(), but I have no idea how
> > you get
> > that on an imported dma_buf.
> >
> > > 2) Have TTM always take the delayed delete path for imported dma-
> > > buf
> > > objects.
> > >
> > > I'd prefer 1) since I think the correct place to call this is in
> > > the
> > > TTM callback delete_mem_notify() where the bo is already locked,
> > > and I
> > > figure non-TTM gem backends may come to suffer from the same
> > > problem.
> > >
> > > Opinions, suggestions?
> > Imo 2) or trying to push the object puts outside of the
> > dma_resv_lock.
>
> IIRC I've stumbled over this issue before with TTM but thought that
> I'd fixed it.
>
> I mean, no objections from my side to changing drm_exec_fini() to
> something like this:
>
> 	drm_exec_for_each_locked_object_reverse(exec, index, obj)
> 		dma_resv_unlock(obj->resv);
>
> 	drm_exec_for_each_locked_object_reverse(exec, index, obj)
> 		drm_gem_object_put(obj);
>
> but in general, dropping the last reference while holding a different
> reservation object is nothing special. For example, that happens all
> the time in TTM's eviction code.
>
> So at least for TTM I would say we should move cleanup of imported
> BOs to the worker. But I'm not sure if that covers everything.
I'm fine with this. It covers all the TTM use-cases, I think.
Thanks,
/Thomas
>
> Regards,
> Christian.
>
> > The
> > latter is imo natural, since usually you grab references, then
> > lock. And
> > this even holds for at least the slow path of lru eviction, because
> > you
> > need to drop all locks and then do a ww_mutex_lock_slow, and that
> > requires
> > that you can hold references to unlocked objects.
> >
> > But 2) alone is imo fine; dma_bufs have become really big objects
> > that go across drivers, extremely similar to struct file, and that
> > has been doing the delayed final put unconditionally for years now,
> > using task_work. It's simply a solid design.
> >
> > Cheers, Sima
> >
> > > [1]
> > > [snip: lockdep splat quoted in full]
* Re: [Linaro-mm-sig] Re: dma_buf_detach lockdep splat
@ 2024-06-27 12:18 ` Thomas Hellström
From: Thomas Hellström @ 2024-06-27 12:18 UTC (permalink / raw)
To: Daniel Vetter
Cc: Christian König, Dmitry Osipenko, dri-devel, intel-xe,
linaro-mm-sig
On Thu, 2024-06-27 at 10:04 +0200, Daniel Vetter wrote:
> On Wed, Jun 26, 2024 at 05:58:02PM +0200, Thomas Hellström wrote:
> > Hi!
> >
> > I'm seeing the below lockdep splat [1] with the xe driver in an
> > imported
> > dma-buf object destruction path.
> >
> > It's not because we hold the dma_resv lock at that point, but
> > rather
> > because we hold *another* dma_resv lock at that point, and the
> > dma_resv
> > detach happens when the object is idle, in this case it was idle at
> > the
> > final put(), and dma_buf_detach() is called in the putting process.
> >
> > Holding another dma-buf lock might happen as part of
> > drm_exec_unlock_all, or simply if the wider vm dma_resv was held at
> > object put time, so it's not an uncommon pattern, even if the
> > drm_exec
> > instance can be fixed by putting all bos after unlocking them all.
> >
> > Two solutions coming to mind here:
> >
> > 1) Provide a dma_buf_detach_locked()
>
> This smells way too much like the endless headaches we had with
> drm_gem_object_put_locked and friends against
> drm_device.struct_mutex. Or
> I'm not understanding what you're doing, because I'm pretty sure you
> have
> to take the dma_resv lock on final put() of imported objects. Because
> that
> final put() is of the import wrapper, the exporter (and other
> importers)
> can still get at that object and so dma_resv_lock is very much
> needed.
Yeah, the TTM final put looks like

	if (!dma_resv_trylock() || !idle) {
		queue_work(final_destruction);
		return;
	}

	dma_resv_unlock();
	dma_buf_detach(); <--- lockdep splat
Here's where a dma_buf_detach_locked() would've made sense before the
dma_resv_unlock().
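I.e., roughly, in the same pseudocode (dma_buf_detach_locked() being the
hypothetical part):

	if (!dma_resv_trylock() || !idle) {
		queue_work(final_destruction);
		return;
	}

	dma_buf_detach_locked(); <--- resv still held, no splat
	dma_resv_unlock();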
But if you think this will cause grief, I'm completely fine with
fixing this in TTM by always taking the deferred path.
/Thomas
>
> Or it's a completely different final put(), but I have no idea how
> you get
> that on an imported dma_buf.
>
> > 2) Have TTM always take the delayed delete path for imported dma-
> > buf
> > objects.
> >
> > I'd prefer 1) since I think the correct place to call this is in
> > the
> > TTM callback delete_mem_notify() where the bo is already locked,
> > and I
> > figure non-TTM gem backends may come to suffer from the same
> > problem.
> >
> > Opinions, suggestions?
>
> Imo 2) or trying to push the object puts outside of the
> dma_resv_lock. The
> latter is imo natural, since usually you grab references, then lock.
> And
> this even holds for at least the slow path of lru eviction, because
> you
> need to drop all locks and then do a ww_mutex_lock_slow, and that
> requires
> that you can hold references to unlocked objects.
>
> But 2) alone is imo fine; dma_bufs have become really big objects
> that go across drivers, extremely similar to struct file, and that
> has been doing the delayed final put unconditionally for years now,
> using task_work. It's simply a solid design.
>
> Cheers, Sima
>
> > [1]
> > [snip: lockdep splat quoted in full]
* Re: [Linaro-mm-sig] Re: dma_buf_detach lockdep splat
@ 2024-06-28 18:06 ` Daniel Vetter
From: Daniel Vetter @ 2024-06-28 18:06 UTC (permalink / raw)
To: Thomas Hellström
Cc: Daniel Vetter, Christian König, Dmitry Osipenko, dri-devel,
intel-xe, linaro-mm-sig
On Thu, Jun 27, 2024 at 02:18:44PM +0200, Thomas Hellström wrote:
> On Thu, 2024-06-27 at 10:04 +0200, Daniel Vetter wrote:
> > On Wed, Jun 26, 2024 at 05:58:02PM +0200, Thomas Hellström wrote:
> > > Hi!
> > >
> > > I'm seeing the below lockdep splat [1] with the xe driver in an
> > > imported
> > > dma-buf object destruction path.
> > >
> > > It's not because we hold the dma_resv lock at that point, but
> > > rather
> > > because we hold *another* dma_resv lock at that point, and the
> > > dma_resv
> > > detach happens when the object is idle, in this case it was idle at
> > > the
> > > final put(), and dma_buf_detach() is called in the putting process.
> > >
> > > Holding another dma-buf lock might happen as part of
> > > drm_exec_unlock_all, or simply if the wider vm dma_resv was held at
> > > object put time, so it's not an uncommon pattern, even if the
> > > drm_exec
> > > instance can be fixed by putting all bos after unlocking them all.
> > >
> > > Two solutions coming to mind here:
> > >
> > > 1) Provide a dma_buf_detach_locked()
> >
> > This smells way too much like the endless headaches we had with
> > drm_gem_object_put_locked and friends against
> > drm_device.struct_mutex. Or
> > I'm not understanding what you're doing, because I'm pretty sure you
> > have
> > to take the dma_resv lock on final put() of imported objects. Because
> > that
> > final put() is of the import wrapper, the exporter (and other
> > importers)
> > can still get at that object and so dma_resv_lock is very much
> > needed.
>
> Yeah, the TTM final put looks like
>
> 	if (!dma_resv_trylock() || !idle) {
> 		queue_work(final_destruction);
> 		return;
> 	}
>
> 	dma_resv_unlock();
> 	dma_buf_detach(); <--- lockdep splat
>
> Here's where a dma_buf_detach_locked() would've made sense before the
> dma_resv_unlock().
>
> But if you think this will cause grief, I'm completely fine with
> fixing this in TTM by always taking the deferred path.
Oh, I misunderstood what you meant; I thought you wanted to do a huge
exercise in passing a "do we know we're locked" flag all the way through
entire call chains to exporters.

If it's just so that the fast path of bypassing the worker can function
for imported buffers, then I think that's fine, as long as we just punt
to the worker if we can't get the lock.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [Linaro-mm-sig] Re: dma_buf_detach lockdep splat
@ 2024-07-01 8:56 ` Thomas Hellström
From: Thomas Hellström @ 2024-07-01 8:56 UTC (permalink / raw)
To: Daniel Vetter
Cc: Christian König, Dmitry Osipenko, dri-devel, intel-xe,
linaro-mm-sig
On Fri, 2024-06-28 at 20:06 +0200, Daniel Vetter wrote:
> On Thu, Jun 27, 2024 at 02:18:44PM +0200, Thomas Hellström wrote:
> > On Thu, 2024-06-27 at 10:04 +0200, Daniel Vetter wrote:
> > > On Wed, Jun 26, 2024 at 05:58:02PM +0200, Thomas Hellström wrote:
> > > > Hi!
> > > >
> > > > I'm seeing the below lockdep splat [1] with the xe driver in an
> > > > imported
> > > > dma-buf object destruction path.
> > > >
> > > > It's not because we hold the dma_resv lock at that point, but
> > > > rather
> > > > because we hold *another* dma_resv lock at that point, and the
> > > > dma_resv
> > > > detach happens when the object is idle, in this case it was
> > > > idle at
> > > > the
> > > > final put(), and dma_buf_detach() is called in the putting
> > > > process.
> > > >
> > > > Holding another dma-buf lock might happen as part of
> > > > drm_exec_unlock_all, or simply if the wider vm dma_resv was
> > > > held at
> > > > object put time, so it's not an uncommon pattern, even if the
> > > > drm_exec
> > > > instance can be fixed by putting all bos after unlocking them
> > > > all.
> > > >
> > > > Two solutions coming to mind here:
> > > >
> > > > 1) Provide a dma_buf_detach_locked()
> > >
> > > This smells way too much like the endless headaches we had with
> > > drm_gem_object_put_locked and friends against
> > > drm_device.struct_mutex. Or
> > > I'm not understanding what you're doing, because I'm pretty sure
> > > you
> > > have
> > > to take the dma_resv lock on final put() of imported objects.
> > > Because
> > > that
> > > final put() is of the import wrapper, the exporter (and other
> > > importers)
> > > can still get at that object and so dma_resv_lock is very much
> > > needed.
> >
> > Yeah, the TTM final put looks like
> >
> > 	if (!dma_resv_trylock() || !idle) {
> > 		queue_work(final_destruction);
> > 		return;
> > 	}
> >
> > 	dma_resv_unlock();
> > 	dma_buf_detach(); <--- lockdep splat
> >
> > Here's where a dma_buf_detach_locked() would've made sense before
> > the
> > dma_resv_unlock().
> >
> > But if you think this will cause grief, I'm completely fine with
> > fixing this in TTM by always taking the deferred path.
>
> Oh, I misunderstood what you meant; I thought you wanted to do a huge
> exercise in passing a "do we know we're locked" flag all the way
> through entire call chains to exporters.
>
> If it's just so that the fast path of bypassing the worker can
> function for imported buffers, then I think that's fine, as long as
> we just punt to the worker if we can't get the lock.
OK, TBH the driver would need a drm_prime_gem_destroy_locked() as well,
since that's the function that calls dma_buf_detach(). But I think it's
worth it anyway: if we just modify TTM to always take the delayed
destruction path, I figure much code will come to depend on that, and it
will be invasive to update later.

I'll take a quick stab at that to see how ugly it becomes.
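Roughly along these lines (a sketch only; both _locked() variants are
hypothetical, mirroring today's drm_prime_gem_destroy()):

	void drm_prime_gem_destroy_locked(struct drm_gem_object *obj,
					  struct sg_table *sg)
	{
		struct dma_buf_attachment *attach = obj->import_attach;
		struct dma_buf *dma_buf = attach->dmabuf;

		/* The importer's resv is the dma-buf's resv. */
		dma_resv_assert_held(dma_buf->resv);

		if (sg)
			dma_buf_unmap_attachment(attach, sg,
						 DMA_BIDIRECTIONAL);
		dma_buf_detach_locked(dma_buf, attach);
		dma_buf_put(dma_buf);
	}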
/Thomas
> -Sima