From: Daniel Vetter <daniel@ffwll.ch>
To: John Harrison <john.c.harrison@intel.com>
Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
Daniel Vetter <daniel@ffwll.ch>
Subject: Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset
Date: Wed, 6 Sep 2023 08:50:03 +0200 [thread overview]
Message-ID: <ZPghG6GmhO4j/9qV@phenom.ffwll.local> (raw)
In-Reply-To: <2bd0fb41-58d5-9862-143e-34e31f6f791f@intel.com>
On Mon, Aug 28, 2023 at 04:01:38PM -0700, John Harrison wrote:
> On 8/23/2023 10:37, John Harrison wrote:
> > On 8/23/2023 09:00, Daniel Vetter wrote:
> > > On Tue, Aug 22, 2023 at 11:53:24AM -0700, John Harrison wrote:
> > > > On 8/11/2023 11:20, Zhanjun Dong wrote:
> > > > > This attempts to avoid circular locking dependency between
> > > > > flush delayed
> > > > > work and intel_gt_reset.
> > > > > When intel_gt_reset was called, task will hold a lock.
> > > > > To cacel delayed work here, the _sync version will also
> > > > > acquire a lock,
> > > > > which might trigger the possible cirular locking dependency warning.
> > > > > When intel_gt_reset called, reset_in_progress flag will be
> > > > > set, add code
> > > > > to check the flag, call async verion if reset is in progress.
> > > > >
> > > > > Signed-off-by: Zhanjun Dong<zhanjun.dong@intel.com>
> > > > > Cc: John Harrison<John.C.Harrison@Intel.com>
> > > > > Cc: Andi Shyti<andi.shyti@linux.intel.com>
> > > > > Cc: Daniel Vetter<daniel@ffwll.ch>
> > > > > ---
> > > > > drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++++++++++-
> > > > > 1 file changed, 10 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git
> > > > > a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > index a0e3ef1c65d2..600388c849f7 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > @@ -1359,7 +1359,16 @@ static void
> > > > > guc_enable_busyness_worker(struct intel_guc *guc)
> > > > > static void guc_cancel_busyness_worker(struct intel_guc *guc)
> > > > > {
> > > > > - cancel_delayed_work_sync(&guc->timestamp.work);
> > > > > + /*
> > > > > + * When intel_gt_reset was called, task will hold a lock.
> > > > > + * To cacel delayed work here, the _sync version will
> > > > > also acquire a lock, which might
> > > > > + * trigger the possible cirular locking dependency warning.
> > > > > + * Check the reset_in_progress flag, call async verion
> > > > > if reset is in progress.
> > > > > + */
> > > > This needs to explain in much more detail what is going on and
> > > > why it is not
> > > > a problem. E.g.:
> > > >
> > > > The busyness worker needs to be cancelled. In general that means
> > > > using the synchronous cancel version to ensure that an in-progress
> > > > worker will not keep executing beyond whatever is happening that
> > > > needs the cancel. E.g. suspend, driver unload, etc. However, in the
> > > > case of a reset, the synchronous version is not required and can
> > > > trigger a false deadlock detection warning.
> > > >
> > > > The business worker takes the reset mutex to protect against resets
> > > > interfering with it. However, it does a trylock and bails
> > > > out if the
> > > > reset lock is already acquired. Thus there is no actual deadlock or
> > > > other concern with the worker running concurrently with a reset. So
> > > > an asynchronous cancel is safe in the case of a reset rather than a
> > > > driver unload or suspend type operation. On the other hand, if the
> > > > cancel_sync version is used when a reset is in progress then the
> > > > mutex deadlock detection sees the mutex being acquired through
> > > > multiple paths and complains.
> > > >
> > > > So just don't bother. That keeps the detection code happy and is
> > > > safe because of the trylock code described above.
> > > So why do we even need to cancel anything if it doesn't do anything
> > > while
> > > the reset is in progress?
> > It still needs to be cancelled. The worker only aborts if it is actively
> > executing concurrently with the reset. It might not start to execute
> > until after the reset has completed. And there is presumably a reason
> > why the cancel is being called, a reason not necessarily related to
> > resets at all. Leaving the worker to run arbitrarily after the driver is
> > expecting it to be stopped will lead to much worse things than a fake
> > lockdep splat, e.g. a use after free pointer deref.
> >
> > John.
> @Daniel Vetter - ping? Is this explanation sufficient? Are you okay with
> this change now?
Sorry for the late reply, I'm constantly behind on mails :-/ Ping me on
irc next time around if I don't reply, that's quicker.
"presumably" isn't good enough for locking design. Either you know, and
can prove it all, or you shouldn't touch the code and its locking design
before you've figured this out.
Again, either this is a deadlock, race condition, or the cancel isn't
necessary. And this argument works in full generality. All this patch does
it replace the dealock with one of the other two, and that's not good
enough if you don't even know which one it is.
- if you need the cancel, you have a race condition
- if you don't have a race condition, you don't need the cancel
- currently you have the deadlock
"presumably" and "maybe" aint enoug for locking design.
Cheers, Daniel
>
> John.
>
> >
> > >
> > > Just remove the cancel from the reset path as uneeded instead, and
> > > explain
> > > why that's ok? Because that's defacto what the cancel_work with a
> > > potential deadlock scenario for cancel_work_sync does, you either don't
> > > need it at all, or the replacement creates a bug.
> > > -Daniel
> > >
> > > >
> > > > John.
> > > >
> > > >
> > > > > + if (guc_to_gt(guc)->uc.reset_in_progress)
> > > > > + cancel_delayed_work(&guc->timestamp.work);
> > > > > + else
> > > > > + cancel_delayed_work_sync(&guc->timestamp.work);
> > > > > }
> > > > > static void __reset_guc_busyness_stats(struct intel_guc *guc)
> >
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
next prev parent reply other threads:[~2023-09-06 6:50 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-08-11 18:20 [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset Zhanjun Dong
2023-08-11 19:15 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Avoid circular locking dependency when flush delayed work on gt reset (rev5) Patchwork
2023-08-11 19:15 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2023-08-11 19:34 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2023-08-21 14:09 ` [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset Andi Shyti
2023-08-22 13:50 ` Daniel Vetter
2023-08-22 14:14 ` Dong, Zhanjun
2023-08-22 14:28 ` Daniel Vetter
2023-08-22 18:53 ` John Harrison
2023-08-23 16:00 ` Daniel Vetter
2023-08-23 17:37 ` John Harrison
2023-08-28 23:01 ` John Harrison
2023-09-06 6:50 ` Daniel Vetter [this message]
2023-09-06 18:40 ` John Harrison
2023-08-31 14:00 ` Andi Shyti
2023-08-31 22:27 ` John Harrison
2023-09-06 9:17 ` Andi Shyti
2023-09-06 10:04 ` Jani Nikula
2023-09-06 11:02 ` Daniel Vetter
2023-09-06 18:49 ` John Harrison
2023-08-29 10:11 ` Andi Shyti
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZPghG6GmhO4j/9qV@phenom.ffwll.local \
--to=daniel@ffwll.ch \
--cc=dri-devel@lists.freedesktop.org \
--cc=intel-gfx@lists.freedesktop.org \
--cc=john.c.harrison@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox