Intel-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: "intel-gfx@lists.freedesktop.org" <intel-gfx@lists.freedesktop.org>
Subject: Re: [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown
Date: Tue, 19 Jul 2022 17:22:08 -0700	[thread overview]
Message-ID: <YtdKsIDWPLaHBKKs@orsosgc001.jf.intel.com> (raw)
In-Reply-To: <3ac56c34-85d8-bd06-e32a-fb341888f346@linux.intel.com>

On Tue, Jul 19, 2022 at 10:00:01AM +0100, Tvrtko Ursulin wrote:
>
>On 12/07/2022 22:03, Umesh Nerlige Ramappa wrote:
>>On Mon, Jul 04, 2022 at 09:31:55AM +0100, Tvrtko Ursulin wrote:
>>>
>>>On 01/07/2022 15:54, Summers, Stuart wrote:
>>>>On Fri, 2022-07-01 at 09:37 +0100, Tvrtko Ursulin wrote:
>>>>>On 01/07/2022 01:11, Umesh Nerlige Ramappa wrote:
>>>>>>On Thu, Jun 30, 2022 at 09:00:28PM +0000, Stuart Summers wrote:
>>>>>>>In the driver teardown, we are unregistering the gt prior
>>>>>>>to unregistering the PMU. This means there is a small window
>>>>>>>of time in which the application can request metrics from the
>>>>>>>PMU, some of which are calling into the uapi engines list,
>>>>>>>while the engines are not available. In this case we can
>>>>>>>see null pointer dereferences.
>>>>>>>
>>>>>>>Fix this ordering in both the driver load and unload sequences.
>>>>>>>
>>>>>>>Additionally add a check for engine presence to prevent this
>>>>>>>NPD in the event this ordering is accidentally reversed. Print
>>>>>>>a debug message indicating when they aren't available.
>>>>>>>
>>>>>>>v1: Actually address the driver load/unload ordering issue
>>>>>>>
>>>>>>>Signed-off-by: Stuart Summers <stuart.summers@intel.com>
>>>>>>>---
>>>>>>
>>>>>>I thought this is likely happening because intel_gpu_top is running
>>>>>>in
>>>>>>the background when i915 is unloaded. I tried a quick repro, I
>>>>>>don't see
>>>>>>the unload succeed ("fatal module in use", not sure if this was a
>>>>>>partial unload), but when I try to kill intel_gpu_top, I get an
>>>>>>NPD.
>>>>>>This is in the event disable path - i915_pmu_event_stop ->
>>>>>>i915_pmu_disable.
>>>>>
>>>>>So i915 failed to unload (as expected - with perf events open we
>>>>>elevate
>>>>>the module ref count via i915_pmu_event_init -> drm_dev_get), then
>>>>>you
>>>>>quit intel_gpu_top and get NPD? On the engine lookup? With the
>>>>>re-ordered init/fini sequence as from this patch?
>>>>>
>>>>>With elevated module count there shouldn't be any unloading happening
>>>>>so
>>>>>I am intrigued.
>>>>>
>>>>>>It's likely that you are seeing a different path (unload) leading
>>>>>>to the
>>>>>>same issue.
>>>>>>
>>>>>>I think in i915_pmu_disable/disable should be aware of event-
>>>>>>>hw.state
>>>>>>and or pmu->closed states before accessing the event. Maybe like,
>>>>>>
>>>>>>if (event->hw.state != PERF_HES_STOPPED && is_engine_event(event))
>>>>>>{
>>>>>>
>>>>>>@Tvrtko, wondering if this case is tested by igt@perf
>>>>>>_pmu@module-unload.
>>>>>
>>>>>A bit yes. From what Stuart wrote it seems the test would need to be
>>>>>extended to cover the case where PMU is getting opened while module
>>>>>unload is in progress.
>>>>>
>>>>>But the NPD you saw is for the moment confusing so I don't know what
>>>>>is
>>>>>happening.
>>>>>
>>>>>>I am not clear if we should use event->hw.state or pmu->closed here
>>>>>>and
>>>>>>if/how they are related. IMO, for this issue, the engine check is
>>>>>>good
>>>>>>enough too, so we don't really need the pmu state checks.
>>>>>>Thoughts?
>>>>>
>>>>>Engine check at the moment feels like papering.
>>>>>
>>>>>Indeed as you say I think the pmu->closed might be the solution.
>>>>>Perhaps
>>>>>the race is as mentioned above. PMU open happening in parallel to
>>>>>unload..
>>>>>
>>>>>If the sequence of events userspace triggers is:
>>>>>
>>>>>   i915_pmu_event_init
>>>>>   i915_pmu_event_start
>>>>>   i915_pmu_enable
>>>>>   i915_pmu_event_read
>>>>>
>>>>>I guess pmu->closed can get set halfway in i915_pmu_event_init. What
>>>>>would be the effect of that.. We'd try to get a module reference
>>>>>while
>>>>>in the process of unloading. Which is probably very bad.. So possibly
>>>>>a
>>>>>final check on pmu->close is needed there. Ho hum.. can it be made
>>>>>safe
>>>>>is the question.
>>>>>
>>>>>It doesn't explain the NPD on Ctrl-C though.. intel_gpu_top keeps
>>>>>the
>>>>>evens open all the time. So I think more info needed, for me at
>>>>>least.
>>>>
>>>>So one thing here is this doesn't have to do with module unload, but
>>>>module unbind specifically (while perf is open). I don't know if the
>>>>NPD from Umesh is the same as what we're seeing here. I'd really like
>>>>to separate these unless you know for sure that's related. Also it
>>>>would be interesting to know if this patch fixes your issue as well.
>>>>
>>>>I still think the re-ordering in i915_driver.c should be enough and we
>>>>shouldn't need to check pmu->closed. The unregister should be enough to
>>>>ensure the perf tools are notified that new events aren't allowed, and
>>>>at that time the engine structures are still intact. And even if for
>>>>some reason the perf code still calls in to our function pointers, we
>>>>have these engine checks as a failsafe.
>>>>
>>>>I'm by the way uploading one more version here with a drm_WARN_ONCE
>>>>instead of the debug print.
>>>
>>>Problem is I am not a fan of papering so lets get to the bottom of 
>>>the issue first. (In the meantime simple patch to re-order driver 
>>>fini is okay since that seems obvious enough, I tnink.)
>>>
>>>We need to see call traces from any oopses and try to extend 
>>>perf_pmu to catch them. And we need to understand the problem, if 
>>>it is a real problem, which I laid out last week about race 
>>>between module unload and elevating the module use count from our 
>>>perf event init.
>>>
>>>Without understanding the details of possible failure mode flows 
>>>we don't know how much the papering with engine checks solves and 
>>>how much it leaves broken.
>>>
>>>If you guys are too busy to tackle that I'll put it onto myself, 
>>>but help would certainly be appreciated.
>>
>>Looks like Stuart/Chris are pointing towards the unbind as an issue.
>>
>>I ran this sequence and only the modprobe showed an error (FATAL: 
>>... still in use). What happens with the unbind. Should pmu also 
>>handle the unbind somehow?
>>
>>- run intel_gpu_top
>>- unbind
>>- modprobe -r i915
>>- kill intel_gpu_top.
>
>And it crashes or survives in this scenario?

hangs on adlp, haven't been able to get the serial logs

>
>Module still in use here would be expected since intel_gpu_top is 
>holding a module reference.
>
>And pmu->closed should be set at the unbind step via i915_pci_remove 
>-> i915_driver_unregister -> i915_pmu_unregister.

After unbind, 

kill intel_gpu_top -> i915_pmu_event_del -> i915_pmu_event_stop -> 
i915_pmu_disable -> likely HANGs when dereferencing engine.

Can we can short circuit i915_pmu_disable with 

if (pmu->closed)
	return;

since this function is also adjusting pmu->enable_count. Does it matter 
after pmu is closed?

Umesh


>
>We also need to try a stress test with two threads:
>
>	Thread A		Thread B
>	-----------		-----------
>	loop:			loop:
>	  open pmu event	  rmmod
>	  close pmu event	  insmod
>
>To see if it can hit a problem with drm_dev_get from 
>i915_pmu_event_init being called at a bad moment relative to module 
>unload. Maybe I am confused but that seems a possibility and a serious 
>problem currently.
>
>Regards,
>
>Tvrtko

  reply	other threads:[~2022-07-20  0:22 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-30 21:00 [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown Stuart Summers
2022-06-30 21:39 ` [Intel-gfx] ✓ Fi.CI.BAT: success for drm/i915: Fix NPD in PMU during driver teardown (rev2) Patchwork
2022-07-01  0:11 ` [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown Umesh Nerlige Ramappa
2022-07-01  8:37   ` Tvrtko Ursulin
2022-07-01 14:54     ` Summers, Stuart
2022-07-04  8:31       ` Tvrtko Ursulin
2022-07-12 21:03         ` Umesh Nerlige Ramappa
2022-07-19  9:00           ` Tvrtko Ursulin
2022-07-20  0:22             ` Umesh Nerlige Ramappa [this message]
2022-07-20  8:14               ` Tvrtko Ursulin
2022-07-20 20:07                 ` Umesh Nerlige Ramappa
2022-07-21  4:30                   ` Summers, Stuart
2022-07-21  7:43                     ` Tvrtko Ursulin
2022-08-03 22:54                       ` Summers, Stuart
2022-07-01 18:09     ` Umesh Nerlige Ramappa
2022-07-01 13:54 ` [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915: Fix NPD in PMU during driver teardown (rev2) Patchwork
  -- strict thread matches above, loose matches on Subject: below --
2022-07-01 16:15 [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown Stuart Summers
2022-06-30 22:45 Stuart Summers
2022-06-29 18:46 Stuart Summers
2022-06-30 10:29 ` Tvrtko Ursulin
2022-06-30 19:04   ` Summers, Stuart

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YtdKsIDWPLaHBKKs@orsosgc001.jf.intel.com \
    --to=umesh.nerlige.ramappa@intel.com \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=tvrtko.ursulin@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox