From: "Christian König" <christian.koenig@amd.com>
To: "Andrey Grodzovsky" <andrey.grodzovsky@amd.com>,
"Christian König" <ckoenig.leichtzumerken@gmail.com>,
"amd-gfx list" <amd-gfx@lists.freedesktop.org>,
dri-devel <dri-devel@lists.freedesktop.org>
Subject: Re: Lockdep spalt on killing a processes
Date: Thu, 21 Oct 2021 08:34:59 +0200 [thread overview]
Message-ID: <696ac3f9-4b45-b8a5-5300-79093d2cd1a5@amd.com> (raw)
In-Reply-To: <5e67d2d2-f5ce-2669-6ad9-7e40c74ce853@amd.com>
Am 20.10.21 um 21:32 schrieb Andrey Grodzovsky:
> On 2021-10-04 4:14 a.m., Christian König wrote:
>
>> The problem is a bit different.
>>
>> The callback is on the dependent fence, while we need to signal the
>> scheduler fence.
>>
>> Daniel is right that this needs an irq_work struct to handle this
>> properly.
>>
>> Christian.
>
>
> So we had some discussions with Christian regarding irq_work and
> agreed I should look into doing it but stepping back for a sec -
>
> Why we insist on calling the dma_fence_cb with fence->lock locked ?
> Is it because of dma_fence_add_callback ?
> Because we first test for DMA_FENCE_FLAG_SIGNALED_BIT and only after
> that lock the fence->lock ? If so, can't we
> move DMA_FENCE_FLAG_SIGNALED_BIT check inside the locked section ?
> Because if in theory
> we could call the cb with unlocked fence->lock (i.e. this kind of
> iteration
> https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/ttm/ttm_resource.c#L117)
> we wouldn't have the lockdep splat. And in general, is it really
> the correct approach to call a third party code from a call back with
> locked spinlock ? We don't know what the cb does inside
> and I don't see any explicit restrictions in documentation of
> dma_fence_func_t what can and cannot be done there.
Yeah, that's exactly what I meant with using the irq_work directly in
the fence code.
The problem is dma_fence_signal_locked() which is used by quite a number
of drivers to signal the fence while holding the lock.
Otherwise we could indeed simplify the fence handling a lot.
Christian.
>
> Andrey
>
>
>>
>> Am 01.10.21 um 17:10 schrieb Andrey Grodzovsky:
>>> From what I see here you supposed to have actual deadlock and not
>>> only warning, sched_fence->finished is first signaled from within
>>> hw fence done callback (drm_sched_job_done_cb) but then again from
>>> within it's own callback (drm_sched_entity_kill_jobs_cb) and so
>>> looks like same fence object is recursively signaled twice. This
>>> leads to attempt to lock fence->lock second time while it's already
>>> locked. I don't see a need to call drm_sched_fence_finished from
>>> within drm_sched_entity_kill_jobs_cb as this callback already
>>> registered
>>> on sched_fence->finished fence (entity->last_scheduled ==
>>> s_fence->finished) and hence the signaling already took place.
>>>
>>> Andrey
>>>
>>> On 2021-10-01 6:50 a.m., Christian König wrote:
>>>> Hey, Andrey.
>>>>
>>>> while investigating some memory management problems I've got the
>>>> logdep splat below.
>>>>
>>>> Looks like something is wrong with drm_sched_entity_kill_jobs_cb(),
>>>> can you investigate?
>>>>
>>>> Thanks,
>>>> Christian.
>>>>
>>>> [11176.741052] ============================================
>>>> [11176.741056] WARNING: possible recursive locking detected
>>>> [11176.741060] 5.15.0-rc1-00031-g9d546d600800 #171 Not tainted
>>>> [11176.741066] --------------------------------------------
>>>> [11176.741070] swapper/12/0 is trying to acquire lock:
>>>> [11176.741074] ffff9c337ed175a8 (&fence->lock){-.-.}-{3:3}, at:
>>>> dma_fence_signal+0x28/0x80
>>>> [11176.741088]
>>>> but task is already holding lock:
>>>> [11176.741092] ffff9c337ed172a8 (&fence->lock){-.-.}-{3:3}, at:
>>>> dma_fence_signal+0x28/0x80
>>>> [11176.741100]
>>>> other info that might help us debug this:
>>>> [11176.741104] Possible unsafe locking scenario:
>>>>
>>>> [11176.741108] CPU0
>>>> [11176.741110] ----
>>>> [11176.741113] lock(&fence->lock);
>>>> [11176.741118] lock(&fence->lock);
>>>> [11176.741122]
>>>> *** DEADLOCK ***
>>>>
>>>> [11176.741125] May be due to missing lock nesting notation
>>>>
>>>> [11176.741128] 2 locks held by swapper/12/0:
>>>> [11176.741133] #0: ffff9c339c30f768
>>>> (&ring->fence_drv.lock){-.-.}-{3:3}, at: dma_fence_signal+0x28/0x80
>>>> [11176.741142] #1: ffff9c337ed172a8 (&fence->lock){-.-.}-{3:3},
>>>> at: dma_fence_signal+0x28/0x80
>>>> [11176.741151]
>>>> stack backtrace:
>>>> [11176.741155] CPU: 12 PID: 0 Comm: swapper/12 Not tainted
>>>> 5.15.0-rc1-00031-g9d546d600800 #171
>>>> [11176.741160] Hardware name: System manufacturer System Product
>>>> Name/PRIME X399-A, BIOS 0808 10/12/2018
>>>> [11176.741165] Call Trace:
>>>> [11176.741169] <IRQ>
>>>> [11176.741173] dump_stack_lvl+0x5b/0x74
>>>> [11176.741181] dump_stack+0x10/0x12
>>>> [11176.741186] __lock_acquire.cold+0x208/0x2df
>>>> [11176.741197] lock_acquire+0xc6/0x2d0
>>>> [11176.741204] ? dma_fence_signal+0x28/0x80
>>>> [11176.741212] _raw_spin_lock_irqsave+0x4d/0x70
>>>> [11176.741219] ? dma_fence_signal+0x28/0x80
>>>> [11176.741225] dma_fence_signal+0x28/0x80
>>>> [11176.741230] drm_sched_fence_finished+0x12/0x20 [gpu_sched]
>>>> [11176.741240] drm_sched_entity_kill_jobs_cb+0x1c/0x50 [gpu_sched]
>>>> [11176.741248] dma_fence_signal_timestamp_locked+0xac/0x1a0
>>>> [11176.741254] dma_fence_signal+0x3b/0x80
>>>> [11176.741260] drm_sched_fence_finished+0x12/0x20 [gpu_sched]
>>>> [11176.741268] drm_sched_job_done.isra.0+0x7f/0x1a0 [gpu_sched]
>>>> [11176.741277] drm_sched_job_done_cb+0x12/0x20 [gpu_sched]
>>>> [11176.741284] dma_fence_signal_timestamp_locked+0xac/0x1a0
>>>> [11176.741290] dma_fence_signal+0x3b/0x80
>>>> [11176.741296] amdgpu_fence_process+0xd1/0x140 [amdgpu]
>>>> [11176.741504] sdma_v4_0_process_trap_irq+0x8c/0xb0 [amdgpu]
>>>> [11176.741731] amdgpu_irq_dispatch+0xce/0x250 [amdgpu]
>>>> [11176.741954] amdgpu_ih_process+0x81/0x100 [amdgpu]
>>>> [11176.742174] amdgpu_irq_handler+0x26/0xa0 [amdgpu]
>>>> [11176.742393] __handle_irq_event_percpu+0x4f/0x2c0
>>>> [11176.742402] handle_irq_event_percpu+0x33/0x80
>>>> [11176.742408] handle_irq_event+0x39/0x60
>>>> [11176.742414] handle_edge_irq+0x93/0x1d0
>>>> [11176.742419] __common_interrupt+0x50/0xe0
>>>> [11176.742426] common_interrupt+0x80/0x90
>>>> [11176.742431] </IRQ>
>>>> [11176.742436] asm_common_interrupt+0x1e/0x40
>>>> [11176.742442] RIP: 0010:cpuidle_enter_state+0xff/0x470
>>>> [11176.742449] Code: 0f a3 05 04 54 24 01 0f 82 70 02 00 00 31 ff
>>>> e8 37 5d 6f ff 80 7d d7 00 0f 85 e9 01 00 00 e8 58 a2 7f ff fb 66
>>>> 0f 1f 44 00 00 <45> 85 ff 0f 88 01 01 00 00 49 63 c7 4c 2b 75 c8 48
>>>> 8d 14 40 48 8d
>>>> [11176.742455] RSP: 0018:ffffb6970021fe48 EFLAGS: 00000202
>>>> [11176.742461] RAX: 000000000059be25 RBX: 0000000000000002 RCX:
>>>> 0000000000000000
>>>> [11176.742465] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
>>>> ffffffff9efeed78
>>>> [11176.742470] RBP: ffffb6970021fe80 R08: 0000000000000001 R09:
>>>> 0000000000000001
>>>> [11176.742473] R10: 0000000000000001 R11: 0000000000000001 R12:
>>>> ffff9c3350b0e800
>>>> [11176.742477] R13: ffffffffa00e9680 R14: 00000a2a49ada060 R15:
>>>> 0000000000000002
>>>> [11176.742483] ? cpuidle_enter_state+0xf8/0x470
>>>> [11176.742489] ? cpuidle_enter_state+0xf8/0x470
>>>> [11176.742495] cpuidle_enter+0x2e/0x40
>>>> [11176.742500] call_cpuidle+0x23/0x40
>>>> [11176.742506] do_idle+0x201/0x280
>>>> [11176.742512] cpu_startup_entry+0x20/0x30
>>>> [11176.742517] start_secondary+0x11f/0x160
>>>> [11176.742523] secondary_startup_64_no_verify+0xb0/0xbb
>>>>
>>
next prev parent reply other threads:[~2021-10-21 6:35 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-10-01 10:50 Lockdep spalt on killing a processes Christian König
2021-10-01 14:52 ` Daniel Vetter
2021-10-01 15:10 ` Andrey Grodzovsky
2021-10-04 8:14 ` Christian König
2021-10-04 15:27 ` Andrey Grodzovsky
2021-10-20 19:32 ` Andrey Grodzovsky
2021-10-21 6:34 ` Christian König [this message]
2021-10-25 19:10 ` Andrey Grodzovsky
2021-10-25 19:56 ` Christian König
2021-10-26 2:33 ` Andrey Grodzovsky
2021-10-26 10:54 ` Christian König
2021-10-27 14:27 ` Andrey Grodzovsky
2021-10-27 14:34 ` Christian König
2021-10-27 14:47 ` Andrey Grodzovsky
2021-10-27 14:50 ` Christian König
2021-10-27 19:58 ` Andrey Grodzovsky
2021-10-28 17:26 ` Andrey Grodzovsky
2021-10-29 7:07 ` Christian König
2021-11-01 15:24 ` Andrey Grodzovsky
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=696ac3f9-4b45-b8a5-5300-79093d2cd1a5@amd.com \
--to=christian.koenig@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=andrey.grodzovsky@amd.com \
--cc=ckoenig.leichtzumerken@gmail.com \
--cc=dri-devel@lists.freedesktop.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox