From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: Mani Milani <mani@chromium.org>
Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
LKML <linux-kernel@vger.kernel.org>,
"Chris Wilson" <chris@chris-wilson.co.uk>,
"Matthew Auld" <matthew.auld@intel.com>,
"Daniel Vetter" <daniel@ffwll.ch>,
"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
"David Airlie" <airlied@gmail.com>,
"Christian König" <christian.koenig@amd.com>,
"Nirmoy Das" <nirmoy.das@intel.com>
Subject: Re: [Intel-gfx] [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
Date: Mon, 14 Nov 2022 13:48:11 +0100
Message-ID: <9d0b5696-496f-a03a-2b5c-e38f36a02d86@linux.intel.com>
In-Reply-To: <CAHzEqDkFAiGkTFF3C--2NKt+ALjtfiNpWYca-Y-p=sekjQXGpw@mail.gmail.com>

Hi, Mani.

On 11/14/22 03:16, Mani Milani wrote:
> Thank you for your comments.
>
> To Thomas's point, the crash always seems to happen when the following
> sequence of events occurs:
>
> 1. When inside "i915_gem_evict_vm()", the call to
> "i915_gem_object_trylock(vma->obj, ww)" fails (due to deadlock), and
> eviction of a vma is skipped as a result. Basically if the code
> reaches here:
> https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_gem_evict.c#L468
> And here is the stack dump for this scenario:
> Call Trace:
> <TASK>
> dump_stack_lvl+0x68/0x95
> i915_gem_evict_vm+0x1d2/0x369
> eb_validate_vmas+0x54a/0x6ae
> eb_relocate_parse+0x4b/0xdb
> i915_gem_execbuffer2_ioctl+0x6f5/0xab6
> ? i915_gem_object_prepare_write+0xfb/0xfb
> drm_ioctl_kernel+0xda/0x14d
> drm_ioctl+0x27f/0x3b7
> ? i915_gem_object_prepare_write+0xfb/0xfb
> __se_sys_ioctl+0x7a/0xbc
> do_syscall_64+0x56/0xa1
> ? exit_to_user_mode_prepare+0x3d/0x8c
> entry_SYSCALL_64_after_hwframe+0x61/0xcb
> RIP: 0033:0x78302de5fae7
> Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
> RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
> RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
> RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
> R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
> R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
> </TASK>
> It is worth noting that "i915_gem_evict_vm()" still returns success in
> this case.
>
> 2. After step 1 occurs, the next call to "grab_vma()" always fails
> (with "i915_gem_object_trylock(vma->obj, ww)" failing also due to
> deadlock), which then results in the crash.
> Here is the stack dump for this scenario:
> Call Trace:
> <TASK>
> dump_stack_lvl+0x68/0x95
> grab_vma+0x6c/0xd0
> i915_gem_evict_for_node+0x178/0x23b
> i915_gem_gtt_reserve+0x5a/0x82
> i915_vma_insert+0x295/0x29e
> i915_vma_pin_ww+0x41e/0x5c7
> eb_validate_vmas+0x5f5/0x6ae
> eb_relocate_parse+0x4b/0xdb
> i915_gem_execbuffer2_ioctl+0x6f5/0xab6
> ? i915_gem_object_prepare_write+0xfb/0xfb
> drm_ioctl_kernel+0xda/0x14d
> drm_ioctl+0x27f/0x3b7
> ? i915_gem_object_prepare_write+0xfb/0xfb
> __se_sys_ioctl+0x7a/0xbc
> do_syscall_64+0x56/0xa1
> ? exit_to_user_mode_prepare+0x3d/0x8c
> entry_SYSCALL_64_after_hwframe+0x61/0xcb
> RIP: 0033:0x78302de5fae7
> Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
> RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
> RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
> RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
> R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
> R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
> </TASK>
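
The two-step sequence quoted above can be modeled in plain user-space C. This is only a minimal sketch, not the driver code: every `sim_*` name is hypothetical, the trylock is reduced to a boolean flag, and it assumes that a failed grab during node reservation surfaces as -ENOSPC (the error asked about later in this reply).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA occupying address space. */
struct sim_vma {
	bool obj_locked_elsewhere; /* a trylock on the backing object would fail */
	bool bound;                /* still occupies address space */
};

/* Models the trylock: fails immediately if the object lock is contended. */
bool sim_trylock(const struct sim_vma *vma)
{
	return !vma->obj_locked_elsewhere;
}

/*
 * Models step 1: evict every VMA whose object can be locked, silently skip
 * the rest, and report success regardless.
 */
int sim_evict_vm(struct sim_vma *vmas, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!sim_trylock(&vmas[i]))
			continue; /* skipped, but no error is recorded */
		vmas[i].bound = false;
	}
	return 0;
}

/*
 * Models step 2: reserving a node needs every still-bound VMA evicted.
 * A VMA whose object cannot be locked cannot be evicted, so the
 * reservation fails (assumed here to be reported as -ENOSPC).
 */
int sim_reserve(struct sim_vma *vmas, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!vmas[i].bound)
			continue;
		if (!sim_trylock(&vmas[i]))
			return -ENOSPC;
		vmas[i].bound = false;
	}
	return 0;
}
```

With two VMAs, one of which has a contended object lock, `sim_evict_vm()` returns 0 while leaving that VMA bound, and the following `sim_reserve()` then fails on the same VMA.
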
>
> My Notes:
> - I verified the two "i915_gem_object_trylock()" failures I mentioned
> above are due to deadlock by slightly modifying the code to call
> "i915_gem_object_lock()" only in those exact cases and subsequent to
> the trylock failure, only to look at the return error code.
> - The two cases mentioned above are the only cases where
> "i915_gem_object_trylock(obj, ww)" is called with the second argument
> not being forced to NULL.
> - When in either of the two cases above (i.e. inside "grab_vma()" or
> "i915_gem_evict_vm") I replace calling "i915_gem_object_trylock" with
> "i915_gem_object_lock", the issue gets resolved (because deadlock is
> detected and resolved).
>
> So if this matches the design better, another solution could be
> for "grab_vma" to continue to call "i915_gem_object_trylock", but for
> "i915_gem_evict_vm" to call "i915_gem_object_lock" instead.
No, i915_gem_object_lock() is not allowed when the vm mutex is held.
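
As a rough illustration of that constraint, here is a user-space sketch of the "trylock only while holding the inner lock" pattern. It rests on an assumption not spelled out in this message: that the object lock is the outer lock and the vm mutex is taken inside it, so a blocking acquire of the object lock under the vm mutex could wait on a holder that is itself waiting for the vm mutex. All `sim_*` names are hypothetical, and the vm mutex is only implied by the comments.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the (outer) object lock. */
atomic_flag sim_obj_lock = ATOMIC_FLAG_INIT;

/* Non-blocking attempt: returns true if the lock was taken. */
bool sim_obj_trylock(void)
{
	return !atomic_flag_test_and_set(&sim_obj_lock);
}

void sim_obj_unlock(void)
{
	atomic_flag_clear(&sim_obj_lock);
}

/*
 * Meant to be called with the (imagined) vm mutex held. Only a single
 * non-blocking attempt is made; the function never spins or sleeps on the
 * object lock. Returns 0 if the object is now locked, -EBUSY if the caller
 * has to skip it.
 */
int sim_grab_under_vm_mutex(void)
{
	return sim_obj_trylock() ? 0 : -EBUSY;
}
```

The cost of the pattern is visible in the return value: a contended object is skipped rather than waited for, which is the behavior the quoted analysis runs into.
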
>
> Further info:
> - Would you like any further info on the crash? If so, could you
> please advise 1) what exactly you need and 2) how I can share with you
> especially if it is big dumps?

Yes, I would like to know how the crash manifests itself. Is it a kernel
BUG, a kernel WARNING, or a user-space application crash caused by
receiving -ENOSPC?

Thanks,
Thomas
>
> Thanks.