From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Cc: "intel-gfx@lists.freedesktop.org" <intel-gfx@lists.freedesktop.org>
Subject: Re: [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown
Date: Wed, 20 Jul 2022 09:14:38 +0100 [thread overview]
Message-ID: <7ffd1214-e2ad-75ce-2234-a619396fe569@linux.intel.com> (raw)
In-Reply-To: <YtdKsIDWPLaHBKKs@orsosgc001.jf.intel.com>
On 20/07/2022 01:22, Umesh Nerlige Ramappa wrote:
> On Tue, Jul 19, 2022 at 10:00:01AM +0100, Tvrtko Ursulin wrote:
>>
>> On 12/07/2022 22:03, Umesh Nerlige Ramappa wrote:
>>> On Mon, Jul 04, 2022 at 09:31:55AM +0100, Tvrtko Ursulin wrote:
>>>>
>>>> On 01/07/2022 15:54, Summers, Stuart wrote:
>>>>> On Fri, 2022-07-01 at 09:37 +0100, Tvrtko Ursulin wrote:
>>>>>> On 01/07/2022 01:11, Umesh Nerlige Ramappa wrote:
>>>>>>> On Thu, Jun 30, 2022 at 09:00:28PM +0000, Stuart Summers wrote:
>>>>>>>> In the driver teardown, we are unregistering the gt prior
>>>>>>>> to unregistering the PMU. This means there is a small window
>>>>>>>> of time in which the application can request metrics from the
>>>>>>>> PMU, some of which are calling into the uapi engines list,
>>>>>>>> while the engines are not available. In this case we can
>>>>>>>> see null pointer dereferences.
>>>>>>>>
>>>>>>>> Fix this ordering in both the driver load and unload sequences.
>>>>>>>>
>>>>>>>> Additionally add a check for engine presence to prevent this
>>>>>>>> NPD in the event this ordering is accidentally reversed. Print
>>>>>>>> a debug message indicating when they aren't available.
>>>>>>>>
>>>>>>>> v1: Actually address the driver load/unload ordering issue
>>>>>>>>
>>>>>>>> Signed-off-by: Stuart Summers <stuart.summers@intel.com>
>>>>>>>> ---
>>>>>>>
>>>>>>> I thought this is likely happening because intel_gpu_top is running
>>>>>>> in
>>>>>>> the background when i915 is unloaded. I tried a quick repro, I
>>>>>>> don't see
>>>>>>> the unload succeed ("fatal module in use", not sure if this was a
>>>>>>> partial unload), but when I try to kill intel_gpu_top, I get an
>>>>>>> NPD.
>>>>>>> This is in the event disable path - i915_pmu_event_stop ->
>>>>>>> i915_pmu_disable.
>>>>>>
>>>>>> So i915 failed to unload (as expected - with perf events open we
>>>>>> elevate
>>>>>> the module ref count via i915_pmu_event_init -> drm_dev_get), then
>>>>>> you
>>>>>> quit intel_gpu_top and get NPD? On the engine lookup? With the
>>>>>> re-ordered init/fini sequence as from this patch?
>>>>>>
>>>>>> With elevated module count there shouldn't be any unloading happening
>>>>>> so
>>>>>> I am intrigued.
>>>>>>
>>>>>>> It's likely that you are seeing a different path (unload) leading
>>>>>>> to the
>>>>>>> same issue.
>>>>>>>
>>>>>>> I think in i915_pmu_disable/disable should be aware of event-
>>>>>>>> hw.state
>>>>>>> and or pmu->closed states before accessing the event. Maybe like,
>>>>>>>
>>>>>>> if (event->hw.state != PERF_HES_STOPPED && is_engine_event(event))
>>>>>>> {
>>>>>>>
>>>>>>> @Tvrtko, wondering if this case is tested by igt@perf
>>>>>>> _pmu@module-unload.
>>>>>>
>>>>>> A bit yes. From what Stuart wrote it seems the test would need to be
>>>>>> extended to cover the case where PMU is getting opened while module
>>>>>> unload is in progress.
>>>>>>
>>>>>> But the NPD you saw is for the moment confusing so I don't know what
>>>>>> is
>>>>>> happening.
>>>>>>
>>>>>>> I am not clear if we should use event->hw.state or pmu->closed here
>>>>>>> and
>>>>>>> if/how they are related. IMO, for this issue, the engine check is
>>>>>>> good
>>>>>>> enough too, so we don't really need the pmu state checks.
>>>>>>> Thoughts?
>>>>>>
>>>>>> Engine check at the moment feels like papering.
>>>>>>
>>>>>> Indeed as you say I think the pmu->closed might be the solution.
>>>>>> Perhaps
>>>>>> the race is as mentioned above. PMU open happening in parallel to
>>>>>> unload..
>>>>>>
>>>>>> If the sequence of events userspace triggers is:
>>>>>>
>>>>>> i915_pmu_event_init
>>>>>> i915_pmu_event_start
>>>>>> i915_pmu_enable
>>>>>> i915_pmu_event_read
>>>>>>
>>>>>> I guess pmu->closed can get set halfway in i915_pmu_event_init. What
>>>>>> would be the effect of that.. We'd try to get a module reference
>>>>>> while
>>>>>> in the process of unloading. Which is probably very bad.. So possibly
>>>>>> a
>>>>>> final check on pmu->close is needed there. Ho hum.. can it be made
>>>>>> safe
>>>>>> is the question.
>>>>>>
>>>>>> It doesn't explain the NPD on Ctrl-C though.. intel_gpu_top keeps
>>>>>> the
>>>>>> evens open all the time. So I think more info needed, for me at
>>>>>> least.
>>>>>
>>>>> So one thing here is this doesn't have to do with module unload, but
>>>>> module unbind specifically (while perf is open). I don't know if the
>>>>> NPD from Umesh is the same as what we're seeing here. I'd really like
>>>>> to separate these unless you know for sure that's related. Also it
>>>>> would be interesting to know if this patch fixes your issue as well.
>>>>>
>>>>> I still think the re-ordering in i915_driver.c should be enough and we
>>>>> shouldn't need to check pmu->closed. The unregister should be
>>>>> enough to
>>>>> ensure the perf tools are notified that new events aren't allowed, and
>>>>> at that time the engine structures are still intact. And even if for
>>>>> some reason the perf code still calls in to our function pointers, we
>>>>> have these engine checks as a failsafe.
>>>>>
>>>>> I'm by the way uploading one more version here with a drm_WARN_ONCE
>>>>> instead of the debug print.
>>>>
>>>> Problem is I am not a fan of papering so lets get to the bottom of
>>>> the issue first. (In the meantime simple patch to re-order driver
>>>> fini is okay since that seems obvious enough, I tnink.)
>>>>
>>>> We need to see call traces from any oopses and try to extend
>>>> perf_pmu to catch them. And we need to understand the problem, if it
>>>> is a real problem, which I laid out last week about race between
>>>> module unload and elevating the module use count from our perf event
>>>> init.
>>>>
>>>> Without understanding the details of possible failure mode flows we
>>>> don't know how much the papering with engine checks solves and how
>>>> much it leaves broken.
>>>>
>>>> If you guys are too busy to tackle that I'll put it onto myself, but
>>>> help would certainly be appreciated.
>>>
>>> Looks like Stuart/Chris are pointing towards the unbind as an issue.
>>>
>>> I ran this sequence and only the modprobe showed an error (FATAL: ...
>>> still in use). What happens with the unbind. Should pmu also handle
>>> the unbind somehow?
>>>
>>> - run intel_gpu_top
>>> - unbind
>>> - modprobe -r i915
>>> - kill intel_gpu_top.
>>
>> And it crashes or survives in this scenario?
>
> hangs on adlp, haven't been able to get the serial logs
>
>>
>> Module still in use here would be expected since intel_gpu_top is
>> holding a module reference.
>>
>> And pmu->closed should be set at the unbind step via i915_pci_remove
>> -> i915_driver_unregister -> i915_pmu_unregister.
>
> After unbind,
> kill intel_gpu_top -> i915_pmu_event_del -> i915_pmu_event_stop ->
> i915_pmu_disable -> likely HANGs when dereferencing engine.
>
> Can we can short circuit i915_pmu_disable with
> if (pmu->closed)
> return;
>
> since this function is also adjusting pmu->enable_count. Does it matter
> after pmu is closed?
Erm yes.. this sounds obvious now but why I did not put a pmu->closed check in i915_pmu_event_stop, since read and start/init have it!? Was it a simple oversight or something more I can't remember.
Try like this maybe:
diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
index 958b37123bf1..2399adf92cc0 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -760,9 +760,13 @@ static void i915_pmu_event_start(struct perf_event *event, int flags)
static void i915_pmu_event_stop(struct perf_event *event, int flags)
{
+ if (pmu->closed)
+ goto out;
+
if (flags & PERF_EF_UPDATE)
i915_pmu_event_read(event);
i915_pmu_disable(event);
+out:
event->hw.state = PERF_HES_STOPPED;
}
Fixes: b00bccb3f0bb ("drm/i915/pmu: Handle PCI unbind")
Enable count handling in i915_pmu_disable should not matter since the i915_pmu_unregister would have already been executed by this point so all we need to ensure is that pmu->closed is not use after free. And since open event hold the DRM device reference I think that is fine.
Regards,
Tvrtko
>
> Umesh
>
>
>>
>> We also need to try a stress test with two threads:
>>
>> Thread A Thread B
>> ----------- -----------
>> loop: loop:
>> open pmu event rmmod
>> close pmu event insmod
>>
>> To see if it can hit a problem with drm_dev_get from
>> i915_pmu_event_init being called at a bad moment relative to module
>> unload. Maybe I am confused but that seems a possibility and a serious
>> problem currently.
>>
>> Regards,
>>
>> Tvrtko
next prev parent reply other threads:[~2022-07-20 8:16 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-06-30 21:00 [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown Stuart Summers
2022-06-30 21:39 ` [Intel-gfx] ✓ Fi.CI.BAT: success for drm/i915: Fix NPD in PMU during driver teardown (rev2) Patchwork
2022-07-01 0:11 ` [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown Umesh Nerlige Ramappa
2022-07-01 8:37 ` Tvrtko Ursulin
2022-07-01 14:54 ` Summers, Stuart
2022-07-04 8:31 ` Tvrtko Ursulin
2022-07-12 21:03 ` Umesh Nerlige Ramappa
2022-07-19 9:00 ` Tvrtko Ursulin
2022-07-20 0:22 ` Umesh Nerlige Ramappa
2022-07-20 8:14 ` Tvrtko Ursulin [this message]
2022-07-20 20:07 ` Umesh Nerlige Ramappa
2022-07-21 4:30 ` Summers, Stuart
2022-07-21 7:43 ` Tvrtko Ursulin
2022-08-03 22:54 ` Summers, Stuart
2022-07-01 18:09 ` Umesh Nerlige Ramappa
2022-07-01 13:54 ` [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915: Fix NPD in PMU during driver teardown (rev2) Patchwork
-- strict thread matches above, loose matches on Subject: below --
2022-07-01 16:15 [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown Stuart Summers
2022-06-30 22:45 Stuart Summers
2022-06-29 18:46 Stuart Summers
2022-06-30 10:29 ` Tvrtko Ursulin
2022-06-30 19:04 ` Summers, Stuart
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7ffd1214-e2ad-75ce-2234-a619396fe569@linux.intel.com \
--to=tvrtko.ursulin@linux.intel.com \
--cc=intel-gfx@lists.freedesktop.org \
--cc=umesh.nerlige.ramappa@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.