Intel-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown
Date: Fri, 1 Jul 2022 11:09:47 -0700	[thread overview]
Message-ID: <Yr84azqlSWFdpQiJ@unerlige-desk> (raw)
In-Reply-To: <cbfb9255-bd95-87c7-aa9d-e3af56dffd76@linux.intel.com>

On Fri, Jul 01, 2022 at 09:37:20AM +0100, Tvrtko Ursulin wrote:
>
>On 01/07/2022 01:11, Umesh Nerlige Ramappa wrote:
>>On Thu, Jun 30, 2022 at 09:00:28PM +0000, Stuart Summers wrote:
>>>In the driver teardown, we are unregistering the gt prior
>>>to unregistering the PMU. This means there is a small window
>>>of time in which the application can request metrics from the
>>>PMU, some of which are calling into the uapi engines list,
>>>while the engines are not available. In this case we can
>>>see null pointer dereferences.
>>>
>>>Fix this ordering in both the driver load and unload sequences.
>>>
>>>Additionally add a check for engine presence to prevent this
>>>NPD in the event this ordering is accidentally reversed. Print
>>>a debug message indicating when they aren't available.
>>>
>>>v1: Actually address the driver load/unload ordering issue
>>>
>>>Signed-off-by: Stuart Summers <stuart.summers@intel.com>
>>>---
>>
>>I thought this is likely happening because intel_gpu_top is running 
>>in the background when i915 is unloaded. I tried a quick repro, I 
>>don't see the unload succeed ("fatal module in use", not sure if 
>>this was a partial unload), but when I try to kill intel_gpu_top, I 
>>get an NPD. This is in the event disable path - i915_pmu_event_stop 
>>-> i915_pmu_disable.
>
>So i915 failed to unload (as expected - with perf events open we 
>elevate the module ref count via i915_pmu_event_init -> drm_dev_get), 
>then you quit intel_gpu_top and get NPD?

I was just trying to point out an instance of the failure that I saw 
when running this execution sequence. This is without any of Stuart's 
changes.

> On the engine lookup?

Don't know if it is specifically engine lookup, but it's in the path of 
i915_pmu_event_stop which executes on killing intel_gpu_top.

> With the re-ordered init/fini sequence as from this patch?
No, without any changes from this thread.

>
>With elevated module count there shouldn't be any unloading happening 
>so I am intrigued.
>
>>It's likely that you are seeing a different path (unload) leading to 
>>the same issue.
>>
>>I think in i915_pmu_disable/disable should be aware of 
>>event->hw.state and or pmu->closed states before accessing the 
>>event. Maybe like,
>>
>>if (event->hw.state != PERF_HES_STOPPED && is_engine_event(event)) {
>>
>>@Tvrtko, wondering if this case is tested by 
>>igt@perf_pmu@module-unload.
>
>A bit yes. From what Stuart wrote it seems the test would need to be 
>extended to cover the case where PMU is getting opened while module 
>unload is in progress.
>
>But the NPD you saw is for the moment confusing so I don't know what 
>is happening.

I guess it's a separate issue. I can check if Stuart's patches resolve 
it and if not, will create a bug.

Regards,
Umesh

>
>>I am not clear if we should use event->hw.state or pmu->closed here 
>>and if/how they are related. IMO, for this issue, the engine check 
>>is good enough too, so we don't really need the pmu state checks.  
>>Thoughts?
>
>Engine check at the moment feels like papering.
>
>Indeed as you say I think the pmu->closed might be the solution. 
>Perhaps the race is as mentioned above. PMU open happening in parallel 
>to unload..
>
>If the sequence of events userspace triggers is:
>
>  i915_pmu_event_init
>  i915_pmu_event_start
>  i915_pmu_enable
>  i915_pmu_event_read
>
>I guess pmu->closed can get set halfway in i915_pmu_event_init. What 
>would be the effect of that.. We'd try to get a module reference while 
>in the process of unloading. Which is probably very bad.. So possibly 
>a final check on pmu->close is needed there. Ho hum.. can it be made 
>safe is the question.
>
>It doesn't explain the NPD on Ctrl-C though.. intel_gpu_top keeps the 
>evens open all the time. So I think more info needed, for me at least.
>
>Regards,
>
>Tvrtko
>
>>
>>Regards,
>>Umesh
>>
>>>drivers/gpu/drm/i915/i915_driver.c | 11 ++---
>>>drivers/gpu/drm/i915/i915_pmu.c    | 72 +++++++++++++++++-------------
>>>2 files changed, 48 insertions(+), 35 deletions(-)
>>>
>>>diff --git a/drivers/gpu/drm/i915/i915_driver.c 
>>>b/drivers/gpu/drm/i915/i915_driver.c
>>>index deb8a8b76965..ee4dcb85d206 100644
>>>--- a/drivers/gpu/drm/i915/i915_driver.c
>>>+++ b/drivers/gpu/drm/i915/i915_driver.c
>>>@@ -717,7 +717,6 @@ static void i915_driver_register(struct 
>>>drm_i915_private *dev_priv)
>>>    struct drm_device *dev = &dev_priv->drm;
>>>
>>>    i915_gem_driver_register(dev_priv);
>>>-    i915_pmu_register(dev_priv);
>>>
>>>    intel_vgpu_register(dev_priv);
>>>
>>>@@ -731,11 +730,12 @@ static void i915_driver_register(struct 
>>>drm_i915_private *dev_priv)
>>>    i915_debugfs_register(dev_priv);
>>>    i915_setup_sysfs(dev_priv);
>>>
>>>+    intel_gt_driver_register(to_gt(dev_priv));
>>>+
>>>    /* Depends on sysfs having been initialized */
>>>+    i915_pmu_register(dev_priv);
>>>    i915_perf_register(dev_priv);
>>>
>>>-    intel_gt_driver_register(to_gt(dev_priv));
>>>-
>>>    intel_display_driver_register(dev_priv);
>>>
>>>    intel_power_domains_enable(dev_priv);
>>>@@ -762,11 +762,12 @@ static void i915_driver_unregister(struct 
>>>drm_i915_private *dev_priv)
>>>
>>>    intel_display_driver_unregister(dev_priv);
>>>
>>>-    intel_gt_driver_unregister(to_gt(dev_priv));
>>>-
>>>    i915_perf_unregister(dev_priv);
>>>+    /* GT should be available until PMU is gone */
>>>    i915_pmu_unregister(dev_priv);
>>>
>>>+    intel_gt_driver_unregister(to_gt(dev_priv));
>>>+
>>>    i915_teardown_sysfs(dev_priv);
>>>    drm_dev_unplug(&dev_priv->drm);
>>>
>>>diff --git a/drivers/gpu/drm/i915/i915_pmu.c 
>>>b/drivers/gpu/drm/i915/i915_pmu.c
>>>index 958b37123bf1..796a1d8e36f2 100644
>>>--- a/drivers/gpu/drm/i915/i915_pmu.c
>>>+++ b/drivers/gpu/drm/i915/i915_pmu.c
>>>@@ -670,21 +670,28 @@ static void i915_pmu_enable(struct 
>>>perf_event *event)
>>>    if (is_engine_event(event)) {
>>>        u8 sample = engine_event_sample(event);
>>>        struct intel_engine_cs *engine;
>>>-
>>>-        engine = intel_engine_lookup_user(i915,
>>>-                          engine_event_class(event),
>>>-                          engine_event_instance(event));
>>>-
>>>-        BUILD_BUG_ON(ARRAY_SIZE(engine->pmu.enable_count) !=
>>>-                 I915_ENGINE_SAMPLE_COUNT);
>>>-        BUILD_BUG_ON(ARRAY_SIZE(engine->pmu.sample) !=
>>>-                 I915_ENGINE_SAMPLE_COUNT);
>>>-        GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.enable_count));
>>>-        GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.sample));
>>>-        GEM_BUG_ON(engine->pmu.enable_count[sample] == ~0);
>>>-
>>>-        engine->pmu.enable |= BIT(sample);
>>>-        engine->pmu.enable_count[sample]++;
>>>+        u8 class = engine_event_class(event);
>>>+        u8 instance = engine_event_instance(event);
>>>+
>>>+        engine = intel_engine_lookup_user(i915, class, instance);
>>>+        if (engine) {
>>>+            BUILD_BUG_ON(ARRAY_SIZE(engine->pmu.enable_count) !=
>>>+                     I915_ENGINE_SAMPLE_COUNT);
>>>+            BUILD_BUG_ON(ARRAY_SIZE(engine->pmu.sample) !=
>>>+                     I915_ENGINE_SAMPLE_COUNT);
>>>+            GEM_BUG_ON(sample >=
>>>+                   ARRAY_SIZE(engine->pmu.enable_count));
>>>+            GEM_BUG_ON(sample >=
>>>+                   ARRAY_SIZE(engine->pmu.sample));
>>>+            GEM_BUG_ON(engine->pmu.enable_count[sample] == ~0);
>>>+
>>>+            engine->pmu.enable |= BIT(sample);
>>>+            engine->pmu.enable_count[sample]++;
>>>+        } else {
>>>+            drm_dbg(&i915->drm,
>>>+                "Invalid engine event: { class:%d, inst:%d }\n",
>>>+                class, instance);
>>>+        }
>>>    }
>>>
>>>    spin_unlock_irqrestore(&pmu->lock, flags);
>>>@@ -714,21 +721,26 @@ static void i915_pmu_disable(struct 
>>>perf_event *event)
>>>    if (is_engine_event(event)) {
>>>        u8 sample = engine_event_sample(event);
>>>        struct intel_engine_cs *engine;
>>>-
>>>-        engine = intel_engine_lookup_user(i915,
>>>-                          engine_event_class(event),
>>>-                          engine_event_instance(event));
>>>-
>>>-        GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.enable_count));
>>>-        GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.sample));
>>>-        GEM_BUG_ON(engine->pmu.enable_count[sample] == 0);
>>>-
>>>-        /*
>>>-         * Decrement the reference count and clear the enabled
>>>-         * bitmask when the last listener on an event goes away.
>>>-         */
>>>-        if (--engine->pmu.enable_count[sample] == 0)
>>>-            engine->pmu.enable &= ~BIT(sample);
>>>+        u8 class = engine_event_class(event);
>>>+        u8 instance = engine_event_instance(event);
>>>+
>>>+        engine = intel_engine_lookup_user(i915, class, instance);
>>>+        if (engine) {
>>>+            GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.enable_count));
>>>+            GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.sample));
>>>+            GEM_BUG_ON(engine->pmu.enable_count[sample] == 0);
>>>+
>>>+            /*
>>>+             * Decrement the reference count and clear the enabled
>>>+             * bitmask when the last listener on an event goes away.
>>>+             */
>>>+            if (--engine->pmu.enable_count[sample] == 0)
>>>+                engine->pmu.enable &= ~BIT(sample);
>>>+        } else {
>>>+            drm_dbg(&i915->drm,
>>>+                "Invalid engine event: { class:%d, inst:%d }\n",
>>>+                class, instance);
>>>+        }
>>>    }
>>>
>>>    GEM_BUG_ON(bit >= ARRAY_SIZE(pmu->enable_count));
>>>-- 
>>>2.25.1
>>>

  parent reply	other threads:[~2022-07-01 18:09 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-30 21:00 [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown Stuart Summers
2022-06-30 21:39 ` [Intel-gfx] ✓ Fi.CI.BAT: success for drm/i915: Fix NPD in PMU during driver teardown (rev2) Patchwork
2022-07-01  0:11 ` [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown Umesh Nerlige Ramappa
2022-07-01  8:37   ` Tvrtko Ursulin
2022-07-01 14:54     ` Summers, Stuart
2022-07-04  8:31       ` Tvrtko Ursulin
2022-07-12 21:03         ` Umesh Nerlige Ramappa
2022-07-19  9:00           ` Tvrtko Ursulin
2022-07-20  0:22             ` Umesh Nerlige Ramappa
2022-07-20  8:14               ` Tvrtko Ursulin
2022-07-20 20:07                 ` Umesh Nerlige Ramappa
2022-07-21  4:30                   ` Summers, Stuart
2022-07-21  7:43                     ` Tvrtko Ursulin
2022-08-03 22:54                       ` Summers, Stuart
2022-07-01 18:09     ` Umesh Nerlige Ramappa [this message]
2022-07-01 13:54 ` [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915: Fix NPD in PMU during driver teardown (rev2) Patchwork
  -- strict thread matches above, loose matches on Subject: below --
2022-07-01 16:15 [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown Stuart Summers
2022-06-30 22:45 Stuart Summers
2022-06-29 18:46 Stuart Summers
2022-06-30 10:29 ` Tvrtko Ursulin
2022-06-30 19:04   ` Summers, Stuart

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yr84azqlSWFdpQiJ@unerlige-desk \
    --to=umesh.nerlige.ramappa@intel.com \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=tvrtko.ursulin@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox