From: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown
Date: Fri, 1 Jul 2022 11:09:47 -0700 [thread overview]
Message-ID: <Yr84azqlSWFdpQiJ@unerlige-desk> (raw)
In-Reply-To: <cbfb9255-bd95-87c7-aa9d-e3af56dffd76@linux.intel.com>
On Fri, Jul 01, 2022 at 09:37:20AM +0100, Tvrtko Ursulin wrote:
>
>On 01/07/2022 01:11, Umesh Nerlige Ramappa wrote:
>>On Thu, Jun 30, 2022 at 09:00:28PM +0000, Stuart Summers wrote:
>>>In the driver teardown, we are unregistering the gt prior
>>>to unregistering the PMU. This means there is a small window
>>>of time in which the application can request metrics from the
>>>PMU, some of which are calling into the uapi engines list,
>>>while the engines are not available. In this case we can
>>>see null pointer dereferences.
>>>
>>>Fix this ordering in both the driver load and unload sequences.
>>>
>>>Additionally add a check for engine presence to prevent this
>>>NPD in the event this ordering is accidentally reversed. Print
>>>a debug message indicating when they aren't available.
>>>
>>>v1: Actually address the driver load/unload ordering issue
>>>
>>>Signed-off-by: Stuart Summers <stuart.summers@intel.com>
>>>---
>>
>>I thought this is likely happening because intel_gpu_top is running
>>in the background when i915 is unloaded. I tried a quick repro, I
>>don't see the unload succeed ("fatal module in use", not sure if
>>this was a partial unload), but when I try to kill intel_gpu_top, I
>>get an NPD. This is in the event disable path - i915_pmu_event_stop
>>-> i915_pmu_disable.
>
>So i915 failed to unload (as expected - with perf events open we
>elevate the module ref count via i915_pmu_event_init -> drm_dev_get),
>then you quit intel_gpu_top and get NPD?
I was just trying to point out an instance of the failure that I saw
when running this execution sequence. This is without any of Stuart's
changes.
> On the engine lookup?
Don't know if it is specifically engine lookup, but it's in the path of
i915_pmu_event_stop which executes on killing intel_gpu_top.
> With the re-ordered init/fini sequence as from this patch?
No, without any changes from this thread.
>
>With elevated module count there shouldn't be any unloading happening
>so I am intrigued.
>
>>It's likely that you are seeing a different path (unload) leading to
>>the same issue.
>>
>>I think in i915_pmu_disable/disable should be aware of
>>event->hw.state and or pmu->closed states before accessing the
>>event. Maybe like,
>>
>>if (event->hw.state != PERF_HES_STOPPED && is_engine_event(event)) {
>>
>>@Tvrtko, wondering if this case is tested by
>>igt@perf_pmu@module-unload.
>
>A bit yes. From what Stuart wrote it seems the test would need to be
>extended to cover the case where PMU is getting opened while module
>unload is in progress.
>
>But the NPD you saw is for the moment confusing so I don't know what
>is happening.
I guess it's a separate issue. I can check if Stuart's patches resolve
it and if not, will create a bug.
Regards,
Umesh
>
>>I am not clear if we should use event->hw.state or pmu->closed here
>>and if/how they are related. IMO, for this issue, the engine check
>>is good enough too, so we don't really need the pmu state checks.
>>Thoughts?
>
>Engine check at the moment feels like papering.
>
>Indeed as you say I think the pmu->closed might be the solution.
>Perhaps the race is as mentioned above. PMU open happening in parallel
>to unload..
>
>If the sequence of events userspace triggers is:
>
> i915_pmu_event_init
> i915_pmu_event_start
> i915_pmu_enable
> i915_pmu_event_read
>
>I guess pmu->closed can get set halfway in i915_pmu_event_init. What
>would be the effect of that.. We'd try to get a module reference while
>in the process of unloading. Which is probably very bad.. So possibly
>a final check on pmu->close is needed there. Ho hum.. can it be made
>safe is the question.
>
>It doesn't explain the NPD on Ctrl-C though.. intel_gpu_top keeps the
>evens open all the time. So I think more info needed, for me at least.
>
>Regards,
>
>Tvrtko
>
>>
>>Regards,
>>Umesh
>>
>>>drivers/gpu/drm/i915/i915_driver.c | 11 ++---
>>>drivers/gpu/drm/i915/i915_pmu.c | 72 +++++++++++++++++-------------
>>>2 files changed, 48 insertions(+), 35 deletions(-)
>>>
>>>diff --git a/drivers/gpu/drm/i915/i915_driver.c
>>>b/drivers/gpu/drm/i915/i915_driver.c
>>>index deb8a8b76965..ee4dcb85d206 100644
>>>--- a/drivers/gpu/drm/i915/i915_driver.c
>>>+++ b/drivers/gpu/drm/i915/i915_driver.c
>>>@@ -717,7 +717,6 @@ static void i915_driver_register(struct
>>>drm_i915_private *dev_priv)
>>> struct drm_device *dev = &dev_priv->drm;
>>>
>>> i915_gem_driver_register(dev_priv);
>>>- i915_pmu_register(dev_priv);
>>>
>>> intel_vgpu_register(dev_priv);
>>>
>>>@@ -731,11 +730,12 @@ static void i915_driver_register(struct
>>>drm_i915_private *dev_priv)
>>> i915_debugfs_register(dev_priv);
>>> i915_setup_sysfs(dev_priv);
>>>
>>>+ intel_gt_driver_register(to_gt(dev_priv));
>>>+
>>> /* Depends on sysfs having been initialized */
>>>+ i915_pmu_register(dev_priv);
>>> i915_perf_register(dev_priv);
>>>
>>>- intel_gt_driver_register(to_gt(dev_priv));
>>>-
>>> intel_display_driver_register(dev_priv);
>>>
>>> intel_power_domains_enable(dev_priv);
>>>@@ -762,11 +762,12 @@ static void i915_driver_unregister(struct
>>>drm_i915_private *dev_priv)
>>>
>>> intel_display_driver_unregister(dev_priv);
>>>
>>>- intel_gt_driver_unregister(to_gt(dev_priv));
>>>-
>>> i915_perf_unregister(dev_priv);
>>>+ /* GT should be available until PMU is gone */
>>> i915_pmu_unregister(dev_priv);
>>>
>>>+ intel_gt_driver_unregister(to_gt(dev_priv));
>>>+
>>> i915_teardown_sysfs(dev_priv);
>>> drm_dev_unplug(&dev_priv->drm);
>>>
>>>diff --git a/drivers/gpu/drm/i915/i915_pmu.c
>>>b/drivers/gpu/drm/i915/i915_pmu.c
>>>index 958b37123bf1..796a1d8e36f2 100644
>>>--- a/drivers/gpu/drm/i915/i915_pmu.c
>>>+++ b/drivers/gpu/drm/i915/i915_pmu.c
>>>@@ -670,21 +670,28 @@ static void i915_pmu_enable(struct
>>>perf_event *event)
>>> if (is_engine_event(event)) {
>>> u8 sample = engine_event_sample(event);
>>> struct intel_engine_cs *engine;
>>>-
>>>- engine = intel_engine_lookup_user(i915,
>>>- engine_event_class(event),
>>>- engine_event_instance(event));
>>>-
>>>- BUILD_BUG_ON(ARRAY_SIZE(engine->pmu.enable_count) !=
>>>- I915_ENGINE_SAMPLE_COUNT);
>>>- BUILD_BUG_ON(ARRAY_SIZE(engine->pmu.sample) !=
>>>- I915_ENGINE_SAMPLE_COUNT);
>>>- GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.enable_count));
>>>- GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.sample));
>>>- GEM_BUG_ON(engine->pmu.enable_count[sample] == ~0);
>>>-
>>>- engine->pmu.enable |= BIT(sample);
>>>- engine->pmu.enable_count[sample]++;
>>>+ u8 class = engine_event_class(event);
>>>+ u8 instance = engine_event_instance(event);
>>>+
>>>+ engine = intel_engine_lookup_user(i915, class, instance);
>>>+ if (engine) {
>>>+ BUILD_BUG_ON(ARRAY_SIZE(engine->pmu.enable_count) !=
>>>+ I915_ENGINE_SAMPLE_COUNT);
>>>+ BUILD_BUG_ON(ARRAY_SIZE(engine->pmu.sample) !=
>>>+ I915_ENGINE_SAMPLE_COUNT);
>>>+ GEM_BUG_ON(sample >=
>>>+ ARRAY_SIZE(engine->pmu.enable_count));
>>>+ GEM_BUG_ON(sample >=
>>>+ ARRAY_SIZE(engine->pmu.sample));
>>>+ GEM_BUG_ON(engine->pmu.enable_count[sample] == ~0);
>>>+
>>>+ engine->pmu.enable |= BIT(sample);
>>>+ engine->pmu.enable_count[sample]++;
>>>+ } else {
>>>+ drm_dbg(&i915->drm,
>>>+ "Invalid engine event: { class:%d, inst:%d }\n",
>>>+ class, instance);
>>>+ }
>>> }
>>>
>>> spin_unlock_irqrestore(&pmu->lock, flags);
>>>@@ -714,21 +721,26 @@ static void i915_pmu_disable(struct
>>>perf_event *event)
>>> if (is_engine_event(event)) {
>>> u8 sample = engine_event_sample(event);
>>> struct intel_engine_cs *engine;
>>>-
>>>- engine = intel_engine_lookup_user(i915,
>>>- engine_event_class(event),
>>>- engine_event_instance(event));
>>>-
>>>- GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.enable_count));
>>>- GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.sample));
>>>- GEM_BUG_ON(engine->pmu.enable_count[sample] == 0);
>>>-
>>>- /*
>>>- * Decrement the reference count and clear the enabled
>>>- * bitmask when the last listener on an event goes away.
>>>- */
>>>- if (--engine->pmu.enable_count[sample] == 0)
>>>- engine->pmu.enable &= ~BIT(sample);
>>>+ u8 class = engine_event_class(event);
>>>+ u8 instance = engine_event_instance(event);
>>>+
>>>+ engine = intel_engine_lookup_user(i915, class, instance);
>>>+ if (engine) {
>>>+ GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.enable_count));
>>>+ GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.sample));
>>>+ GEM_BUG_ON(engine->pmu.enable_count[sample] == 0);
>>>+
>>>+ /*
>>>+ * Decrement the reference count and clear the enabled
>>>+ * bitmask when the last listener on an event goes away.
>>>+ */
>>>+ if (--engine->pmu.enable_count[sample] == 0)
>>>+ engine->pmu.enable &= ~BIT(sample);
>>>+ } else {
>>>+ drm_dbg(&i915->drm,
>>>+ "Invalid engine event: { class:%d, inst:%d }\n",
>>>+ class, instance);
>>>+ }
>>> }
>>>
>>> GEM_BUG_ON(bit >= ARRAY_SIZE(pmu->enable_count));
>>>--
>>>2.25.1
>>>
next prev parent reply other threads:[~2022-07-01 18:09 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-06-30 21:00 [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown Stuart Summers
2022-06-30 21:39 ` [Intel-gfx] ✓ Fi.CI.BAT: success for drm/i915: Fix NPD in PMU during driver teardown (rev2) Patchwork
2022-07-01 0:11 ` [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown Umesh Nerlige Ramappa
2022-07-01 8:37 ` Tvrtko Ursulin
2022-07-01 14:54 ` Summers, Stuart
2022-07-04 8:31 ` Tvrtko Ursulin
2022-07-12 21:03 ` Umesh Nerlige Ramappa
2022-07-19 9:00 ` Tvrtko Ursulin
2022-07-20 0:22 ` Umesh Nerlige Ramappa
2022-07-20 8:14 ` Tvrtko Ursulin
2022-07-20 20:07 ` Umesh Nerlige Ramappa
2022-07-21 4:30 ` Summers, Stuart
2022-07-21 7:43 ` Tvrtko Ursulin
2022-08-03 22:54 ` Summers, Stuart
2022-07-01 18:09 ` Umesh Nerlige Ramappa [this message]
2022-07-01 13:54 ` [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915: Fix NPD in PMU during driver teardown (rev2) Patchwork
-- strict thread matches above, loose matches on Subject: below --
2022-07-01 16:15 [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown Stuart Summers
2022-06-30 22:45 Stuart Summers
2022-06-29 18:46 Stuart Summers
2022-06-30 10:29 ` Tvrtko Ursulin
2022-06-30 19:04 ` Summers, Stuart
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Yr84azqlSWFdpQiJ@unerlige-desk \
--to=umesh.nerlige.ramappa@intel.com \
--cc=intel-gfx@lists.freedesktop.org \
--cc=tvrtko.ursulin@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.