All of lore.kernel.org
 help / color / mirror / Atom feed
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Matthew Auld <matthew.auld@intel.com>
Cc: intel-xe@lists.freedesktop.org,
	"Matthew Brost" <matthew.brost@intel.com>,
	"Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Subject: Re: [PATCH] drm/xe: Kill missing outer runtime PM protection warning
Date: Wed, 4 Sep 2024 11:03:52 -0400	[thread overview]
Message-ID: <Zth22JdcjlcSAzin@intel.com> (raw)
In-Reply-To: <23623c0f-f3e5-4227-9a84-7490ae3978ec@intel.com>

On Wed, Sep 04, 2024 at 01:49:47PM +0100, Matthew Auld wrote:
> On 03/09/2024 23:38, Rodrigo Vivi wrote:
> > This message was very useful to ensure that Xe was taking all
> > the needed outer runtime pm references. However, at this point
> > it is only a false positive. So, remove it.
> > 
> > False positive cases:
> > 
> > 1:
> > [184.983389] xe ...: [drm] Missing outer runtime PM protection
> > [snip]
> > [184.984096]  drm_ioctl+0x2cf/0x580 [drm]
> > [snip]
> > [184.984710] xe 0000:00:02.0: Runtime PM usage count underflow!
> > 
> > In this case the underflow is the problem since we are sure that
> > the ioctl is protected. But something else is abusing the 'put'
> > calls.
> > 
> > 2:
> > rpm_status:           0000:03:00.0 status=RPM_SUSPENDING
> > console:              xe_bo_evict_all (called from suspend)
> > xe_sched_job_create:  dev=0000:03:00.0, ...
> > xe_sched_job_exec:    dev=0000:03:00.0, ...
> > xe_pm_runtime_put:    dev=0000:03:00.0, ...
> > xe_sched_job_run:     dev=0000:03:00.0, ...
> > rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
> > rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
> > rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
> > console:              xe 0000:03:00.0: [drm] Missing outer runtime
> > 		      	 	       	     	     PM protection
> > console:               xe_guc_ct_send+0x15/0x50 [xe]
> > console:               guc_exec_queue_run_job+0x1509/0x3950 [xe]
> > [snip]
> > console:               drm_sched_run_job_work+0x649/0xc20
> > 
> > At this point, BOs are getting evicted from VRAM with rpm
> > usage-counter = 2, but rpm status = SUSPENDING.
> > The xe->pm_callback_task won't be equal 'current' because this call is
> > coming from a work queue.
> > 
> > So, pm_runtime_get_if_active() will be called and return 0 because rpm
> > status != ACTIVE (but equal SUSPENDING).
> > 
> > The only way out is to just grab the reference and move on.
> > 
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Matthew Auld <matthew.auld@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_pm.c | 10 ++--------
> >   1 file changed, 2 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
> > index da68cd689a96..e1a5e43b0f34 100644
> > --- a/drivers/gpu/drm/xe/xe_pm.c
> > +++ b/drivers/gpu/drm/xe/xe_pm.c
> > @@ -592,20 +592,14 @@ bool xe_pm_runtime_get_if_in_use(struct xe_device *xe)
> >    * xe_pm_runtime_get_noresume - Bump runtime PM usage counter without resuming
> >    * @xe: xe device instance
> >    *
> > - * This function should be used in inner places where it is surely already
> > + * This function should *only* be used in inner places where it is surely already
> >    * protected by outer-bound callers of `xe_pm_runtime_get`.
> > - * It will warn if not protected.
> >    * The reference should be put back after this function regardless, since it
> >    * will always bump the usage counter, regardless.
> >    */
> >   void xe_pm_runtime_get_noresume(struct xe_device *xe)
> >   {
> > -	bool ref;
> > -
> > -	ref = xe_pm_runtime_get_if_in_use(xe);
> > -
> > -	if (drm_WARN(&xe->drm, !ref, "Missing outer runtime PM protection\n"))
> > -		pm_runtime_get_noresume(xe->drm.dev);
> 
> This has proven to find real bugs in the past, right? If so, it seems
> unfortunate to drop this completely?

You do have a point....

> What about making it slightly more
> fuzzy with something like:
> 
> drm_WARN(!pm_read_callback_task() && !pm_runtime_active(), ...

I don't like inspecting the status directly because it can be racy,
but in this particular case it would for sure avoid the case-2 of
our false positives. But perhaps something like this:

@@ -600,12 +600,14 @@ bool xe_pm_runtime_get_if_in_use(struct xe_device *xe)
  */
 void xe_pm_runtime_get_noresume(struct xe_device *xe)
 {
+       struct device *dev = xe->drm.dev;
        bool ref;
 
        ref = xe_pm_runtime_get_if_in_use(xe);
 
-       if (drm_WARN(&xe->drm, !ref, "Missing outer runtime PM protection\n"))
-               pm_runtime_get_noresume(xe->drm.dev);
+       if (drm_WARN(&xe->drm, !ref && dev->power.runtime_status != RPM_SUSPENDING,
+                    "Missing outer runtime PM protection\n"))
+               pm_runtime_get_noresume(dev);


> 
> That should avoid the false positive, at the cost of not finding some real
> bugs, but at least gives us something?

Well, right, this eliminates the false positive case 2.
But still at least the false positive case 1.

The big problem that I saw with this warning at this moment, was that
CI-buglog classified all the "Missing outer runtime PM protection"
in a single bucket and that was ignored because we had the issue of
the pending ggtt_node removal case to handle.

Meanwhile we masked all these bugs that I'm hunting now, because
they all looked the ggtt_node removal.

But well, I guess we could workaround and continue the clean-up
on the CI-buglog side and workaround false positives instead of
simply removing this protection and going blindly forever on
the unprotected potential new inner callers...

thought on the diff above? (&& dev->power.runtime_status != RPM_SUSPENDING)

Thanks a lot,
Rodrigo.

> 
> > +	pm_runtime_get_noresume(xe->drm.dev);
> >   }
> >   /**

  parent reply	other threads:[~2024-09-04 15:04 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-03 22:38 [PATCH] drm/xe: Kill missing outer runtime PM protection warning Rodrigo Vivi
2024-09-03 23:25 ` ✓ CI.Patch_applied: success for " Patchwork
2024-09-03 23:25 ` ✓ CI.checkpatch: " Patchwork
2024-09-03 23:26 ` ✓ CI.KUnit: " Patchwork
2024-09-03 23:38 ` ✓ CI.Build: " Patchwork
2024-09-03 23:40 ` ✓ CI.Hooks: " Patchwork
2024-09-03 23:41 ` ✓ CI.checksparse: " Patchwork
2024-09-04  0:00 ` ✓ CI.BAT: " Patchwork
2024-09-04  4:19 ` ✗ CI.FULL: failure " Patchwork
2024-09-04 12:49 ` [PATCH] " Matthew Auld
2024-09-04 14:53   ` Matthew Brost
2024-09-04 15:03   ` Rodrigo Vivi [this message]
2024-09-04 15:42     ` Matthew Auld
2024-09-04 16:06       ` Rodrigo Vivi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zth22JdcjlcSAzin@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=matthew.auld@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=thomas.hellstrom@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.