Re: [Intel-xe] [PATCH] drm/xe: Fix an invalid locking wait context bug

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Matthew Auld <matthew.auld@intel.com>
Cc: intel-xe@lists.freedesktop.org
Subject: Re: [Intel-xe] [PATCH] drm/xe: Fix an invalid locking wait context bug
Date: Fri, 21 Jul 2023 14:44:11 -0400	[thread overview]
Message-ID: <ZLrR+xU1O1b4tUUY@intel.com> (raw)
In-Reply-To: <37cb1b64-95ff-2f0d-f802-27e79fc55574@intel.com>

On Fri, Jul 21, 2023 at 12:17:47PM +0100, Matthew Auld wrote:
> On 20/07/2023 16:42, Rodrigo Vivi wrote:
> > On Thu, Jul 20, 2023 at 01:38:18PM +0100, Matthew Auld wrote:
> > > On 20/07/2023 13:01, Rodrigo Vivi wrote:
> > > > On Thu, Jul 20, 2023 at 10:11:00AM +0100, Matthew Auld wrote:
> > > > > On 19/07/2023 20:27, Rodrigo Vivi wrote:
> > > > > > xe_irq_{suspend,resume} were incorrectly using the xe->irq.lock.
> > > > > > 
> > > > > > The lock was created to protect the gt irq handlers, and not
> > > > > > the irq.enabled. Since suspend/resume and other places touching
> > > > > > irq.enabled are already serialized they don't need protection.
> > > > > > (see other irq.enabled accesses).
> > > > > > 
> > > > > > Then with this spin lock around xe_irq_reset, we will end up
> > > > > > calling the intel_display_power_is_enabled() function, and
> > > > > > that needs a mutex lock. Hence causing the undesired
> > > > > > "[ BUG: Invalid wait context ]"
> > > > > > 
> > > > > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > > > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > > > > ---
> > > > > >     drivers/gpu/drm/xe/xe_irq.c | 5 -----
> > > > > >     1 file changed, 5 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
> > > > > > index eae190cb0969..df01af780a57 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_irq.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_irq.c
> > > > > > @@ -574,10 +574,8 @@ void xe_irq_shutdown(struct xe_device *xe)
> > > > > >     void xe_irq_suspend(struct xe_device *xe)
> > > > > >     {
> > > > > > -	spin_lock_irq(&xe->irq.lock);
> > > > > >     	xe->irq.enabled = false;
> > > > > >     	xe_irq_reset(xe);
> > > > > > -	spin_unlock_irq(&xe->irq.lock);
> > > > > 
> > > > > Do we not need something like:
> > > > > 
> > > > > spin_lock_irq(&xe->irq.lock);
> > > > > xe->irq.enabled = false; /* no new irqs */
> > > > > spin_unlock_irq(&xe->irq.lock);
> > > > > 
> > > > > synchronize_irq(...); /* flush irqs */
> > > > > xe_irq_reset(); /* turn off irqs */
> > > > > ....
> > > > > 
> > > > > And then at the start of the irq handler:
> > > > > 
> > > > > spin_lock_irq(&xe->irq.lock);
> > > > > if (!xe->irq.enabled) {
> > > > >       spin_unlock_irq(&xe->irq.lock);
> > > > >       return ....;
> > > > > }
> > > > > 
> > > > > Or did something happen prior to xe_irq_suspend() to ensure proper
> > > > > serialisation with irq and the above steps are not really needed?
> > > > 
> > > > the suspend and resume calls should be serialized by itself, no?!
> > > 
> > > Is it not possible for IRQs to still be firing or potentially be in-progress
> > > here as we are preparing to suspend?
> > 
> > yes, it is. We are letting the rpm to run with irq enabled otherwise
> > we will face the same invalid wait bug that this patch is trying to solve.
> > 
> > But I don't believe the right way is to use the lock to protect the irq.enabled.
> > 
> > Taking a look around I believe that what we are missing is the
> > synchronize_irq() call right after the reset. So we ensure that all the
> > racy handlers were properly processed before we allow the suspend.
> > 
> > So I believe we need something like i915 that would be:
> > 
> > xe_irq_suspend()
> > {
> > 	xe_irq_reset(xe);
> > 	xe->irq.enable = false;
> > 	synchronize_irq(pdev->irq);
> > }
> > 
> > xe_irq_resume()
> > {
> > 	xe->irq.enabled = true;
> > 	xe_irq_reset(xe);
> > 	xe_irq_postinstall(xe);
> > 
> > 	for_each_gt(gt, xe, id)
> > 			xe_irq_enable_hwe(gt);
> > }
> 
> Ok, but tbh I'm not completely sure what is going on with irq.enabled.
> AFAICT it looks to also be exposed into i915/display with
> intel_irqs_enabled() and that seems to have various callers. Are they not
> holding the lock?

Oh, they are really holding the lock.
I had missed the
#define irq_lock irq.lock
at xe/compat-i915-headers/i915_drv.h

In i915 I noticed something strange. instead of checking the irq_enable
they are using a secondary runtime_pm.irqs_enable and the only difference
is that this is set at the end of the irq reset function. while irq_enable
is set at the beginning.... it looks a hack to workaround races?!
at least very suspicious.

So, probably the right solution here lays in i915 side:

1. convert power_domains.lock from mutex to spinlock
(sent to trybot to get some feedback:
https://patchwork.freedesktop.org/series/121155/
)

2. then, add the spinlocks around the irq reset (or install and uninstall)

3. then get rid of this runtime_pm.irqs_enabled in favor
of the irq.enabled.

thoughts?

> Also the write here is to shared memory so without the
> lock it's unclear what is going on here wrt barriers and unmarked accesses
> (missing READ_ONCE/WRITE_ONCE?). I don't want to block you here since
> "Invalid wait context" is breaking CI, so I guess just go with the simplest
> thing for now.
> 
> > 
> > > 
> > > > 
> > > > no other place touching or inspecting irq.enable uses this lock
> > > > anyway, since it was created to serialize the gt_handler.
> > > > 
> > > > > 
> > > > > >     }
> > > > > >     void xe_irq_resume(struct xe_device *xe)
> > > > > > @@ -585,13 +583,10 @@ void xe_irq_resume(struct xe_device *xe)
> > > > > >     	struct xe_gt *gt;
> > > > > >     	int id;
> > > > > > -	spin_lock_irq(&xe->irq.lock);
> > > > > >     	xe->irq.enabled = true;
> > > > > >     	xe_irq_reset(xe);
> > > > > >     	xe_irq_postinstall(xe);
> > > > > >     	for_each_gt(gt, xe, id)
> > > > > >     		xe_irq_enable_hwe(gt);
> > > > > > -
> > > > > > -	spin_unlock_irq(&xe->irq.lock);
> > > > > >     }

next prev parent reply	other threads:[~2023-07-21 18:44 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-19 19:27 [Intel-xe] [PATCH] drm/xe: Fix an invalid locking wait context bug Rodrigo Vivi
2023-07-19 21:07 ` [Intel-xe] ✓ CI.Patch_applied: success for " Patchwork
2023-07-19 21:07 ` [Intel-xe] ✓ CI.checkpatch: " Patchwork
2023-07-19 21:08 ` [Intel-xe] ✓ CI.KUnit: " Patchwork
2023-07-19 21:12 ` [Intel-xe] ✓ CI.Build: " Patchwork
2023-07-19 21:12 ` [Intel-xe] ✓ CI.Hooks: " Patchwork
2023-07-19 21:14 ` [Intel-xe] ✓ CI.checksparse: " Patchwork
2023-07-19 21:14 ` [Intel-xe] [PATCH] " Matthew Brost
2023-07-20  9:11 ` Matthew Auld
2023-07-20 12:01   ` Rodrigo Vivi
2023-07-20 12:38     ` Matthew Auld
2023-07-20 15:42       ` Rodrigo Vivi
2023-07-21 11:17         ` Matthew Auld
2023-07-21 18:44           ` Rodrigo Vivi [this message]
2023-07-25 20:17   ` Rodrigo Vivi
2023-07-26 11:38     ` Matthew Auld
2023-07-26 17:24       ` Rodrigo Vivi
2023-07-26 21:30         ` Rodrigo Vivi
2023-07-27 16:08           ` Matthew Auld
2023-07-25 20:20 ` [Intel-xe] ✓ CI.Patch_applied: success for drm/xe: Fix an invalid locking wait context bug (rev2) Patchwork
2023-07-25 20:20 ` [Intel-xe] ✓ CI.checkpatch: " Patchwork
2023-07-25 20:21 ` [Intel-xe] ✓ CI.KUnit: " Patchwork
2023-07-25 20:25 ` [Intel-xe] ✓ CI.Build: " Patchwork
2023-07-25 20:25 ` [Intel-xe] ✓ CI.Hooks: " Patchwork
2023-07-25 20:27 ` [Intel-xe] ✓ CI.checksparse: " Patchwork
2023-07-25 20:45 ` [Intel-xe] ○ CI.BAT: info " Patchwork
2023-07-26 22:29 ` [Intel-xe] ✓ CI.Patch_applied: success for drm/xe: Fix an invalid locking wait context bug (rev3) Patchwork
2023-07-26 22:29 ` [Intel-xe] ✓ CI.checkpatch: " Patchwork
2023-07-26 22:30 ` [Intel-xe] ✓ CI.KUnit: " Patchwork
2023-07-26 22:34 ` [Intel-xe] ✓ CI.Build: " Patchwork
2023-07-26 22:34 ` [Intel-xe] ✓ CI.Hooks: " Patchwork
2023-07-26 22:35 ` [Intel-xe] ✓ CI.checksparse: " Patchwork
2023-07-26 23:10 ` [Intel-xe] ○ CI.BAT: info " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZLrR+xU1O1b4tUUY@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=matthew.auld@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.