intel-xe.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Matthew Auld <matthew.auld@intel.com>
Cc: intel-xe@lists.freedesktop.org,
	"Matthew Brost" <matthew.brost@intel.com>,
	"Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Subject: Re: [PATCH] drm/xe: Kill missing outer runtime PM protection warning
Date: Wed, 4 Sep 2024 11:03:52 -0400	[thread overview]
Message-ID: <Zth22JdcjlcSAzin@intel.com> (raw)
In-Reply-To: <23623c0f-f3e5-4227-9a84-7490ae3978ec@intel.com>

On Wed, Sep 04, 2024 at 01:49:47PM +0100, Matthew Auld wrote:
> On 03/09/2024 23:38, Rodrigo Vivi wrote:
> > This message was very useful to ensure that Xe was taking all
> > the needed outer runtime pm references. However, at this point
> > it is only a false positive. So, remove it.
> > 
> > False positive cases:
> > 
> > 1:
> > [184.983389] xe ...: [drm] Missing outer runtime PM protection
> > [snip]
> > [184.984096]  drm_ioctl+0x2cf/0x580 [drm]
> > [snip]
> > [184.984710] xe 0000:00:02.0: Runtime PM usage count underflow!
> > 
> > In this case the underflow is the problem since we are sure that
> > the ioctl is protected. But something else is abusing the 'put'
> > calls.
> > 
> > 2:
> > rpm_status:           0000:03:00.0 status=RPM_SUSPENDING
> > console:              xe_bo_evict_all (called from suspend)
> > xe_sched_job_create:  dev=0000:03:00.0, ...
> > xe_sched_job_exec:    dev=0000:03:00.0, ...
> > xe_pm_runtime_put:    dev=0000:03:00.0, ...
> > xe_sched_job_run:     dev=0000:03:00.0, ...
> > rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
> > rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
> > rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
> > console:              xe 0000:03:00.0: [drm] Missing outer runtime
> > 		      	 	       	     	     PM protection
> > console:               xe_guc_ct_send+0x15/0x50 [xe]
> > console:               guc_exec_queue_run_job+0x1509/0x3950 [xe]
> > [snip]
> > console:               drm_sched_run_job_work+0x649/0xc20
> > 
> > At this point, BOs are getting evicted from VRAM with rpm
> > usage-counter = 2, but rpm status = SUSPENDING.
> > The xe->pm_callback_task won't be equal 'current' because this call is
> > coming from a work queue.
> > 
> > So, pm_runtime_get_if_active() will be called and return 0 because rpm
> > status != ACTIVE (but equal SUSPENDING).
> > 
> > The only way out is to just grab the reference and move on.
> > 
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Matthew Auld <matthew.auld@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_pm.c | 10 ++--------
> >   1 file changed, 2 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
> > index da68cd689a96..e1a5e43b0f34 100644
> > --- a/drivers/gpu/drm/xe/xe_pm.c
> > +++ b/drivers/gpu/drm/xe/xe_pm.c
> > @@ -592,20 +592,14 @@ bool xe_pm_runtime_get_if_in_use(struct xe_device *xe)
> >    * xe_pm_runtime_get_noresume - Bump runtime PM usage counter without resuming
> >    * @xe: xe device instance
> >    *
> > - * This function should be used in inner places where it is surely already
> > + * This function should *only* be used in inner places where it is surely already
> >    * protected by outer-bound callers of `xe_pm_runtime_get`.
> > - * It will warn if not protected.
> >    * The reference should be put back after this function regardless, since it
> >    * will always bump the usage counter, regardless.
> >    */
> >   void xe_pm_runtime_get_noresume(struct xe_device *xe)
> >   {
> > -	bool ref;
> > -
> > -	ref = xe_pm_runtime_get_if_in_use(xe);
> > -
> > -	if (drm_WARN(&xe->drm, !ref, "Missing outer runtime PM protection\n"))
> > -		pm_runtime_get_noresume(xe->drm.dev);
> 
> This has proven to find real bugs in the past, right? If so, it seems
> unfortunate to drop this completely?

You do have a point....

> What about making it slightly more
> fuzzy with something like:
> 
> drm_WARN(!pm_read_callback_task() && !pm_runtime_active(), ...

I don't like inspecting the status directly because it can be racy,
but in this particular case it would for sure avoid the case-2 of
our false positives. But perhaps something like this:

@@ -600,12 +600,14 @@ bool xe_pm_runtime_get_if_in_use(struct xe_device *xe)
  */
 void xe_pm_runtime_get_noresume(struct xe_device *xe)
 {
+       struct device *dev = xe->drm.dev;
        bool ref;
 
        ref = xe_pm_runtime_get_if_in_use(xe);
 
-       if (drm_WARN(&xe->drm, !ref, "Missing outer runtime PM protection\n"))
-               pm_runtime_get_noresume(xe->drm.dev);
+       if (drm_WARN(&xe->drm, !ref && dev->power.runtime_status != RPM_SUSPENDING,
+                    "Missing outer runtime PM protection\n"))
+               pm_runtime_get_noresume(dev);


> 
> That should avoid the false positive, at the cost of not finding some real
> bugs, but at least gives us something?

Well, right, this eliminates the false positive case 2.
But still at least the false positive case 1.

The big problem that I saw with this warning at this moment, was that
CI-buglog classified all the "Missing outer runtime PM protection"
in a single bucket and that was ignored because we had the issue of
the pending ggtt_node removal case to handle.

Meanwhile we masked all these bugs that I'm hunting now, because
they all looked the ggtt_node removal.

But well, I guess we could workaround and continue the clean-up
on the CI-buglog side and workaround false positives instead of
simply removing this protection and going blindly forever on
the unprotected potential new inner callers...

thought on the diff above? (&& dev->power.runtime_status != RPM_SUSPENDING)

Thanks a lot,
Rodrigo.

> 
> > +	pm_runtime_get_noresume(xe->drm.dev);
> >   }
> >   /**

  parent reply	other threads:[~2024-09-04 15:04 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-03 22:38 [PATCH] drm/xe: Kill missing outer runtime PM protection warning Rodrigo Vivi
2024-09-03 23:25 ` ✓ CI.Patch_applied: success for " Patchwork
2024-09-03 23:25 ` ✓ CI.checkpatch: " Patchwork
2024-09-03 23:26 ` ✓ CI.KUnit: " Patchwork
2024-09-03 23:38 ` ✓ CI.Build: " Patchwork
2024-09-03 23:40 ` ✓ CI.Hooks: " Patchwork
2024-09-03 23:41 ` ✓ CI.checksparse: " Patchwork
2024-09-04  0:00 ` ✓ CI.BAT: " Patchwork
2024-09-04  4:19 ` ✗ CI.FULL: failure " Patchwork
2024-09-04 12:49 ` [PATCH] " Matthew Auld
2024-09-04 14:53   ` Matthew Brost
2024-09-04 15:03   ` Rodrigo Vivi [this message]
2024-09-04 15:42     ` Matthew Auld
2024-09-04 16:06       ` Rodrigo Vivi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zth22JdcjlcSAzin@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=matthew.auld@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=thomas.hellstrom@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).