From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 037F7C433EF for ; Tue, 12 Jul 2022 21:04:00 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 40E8910FB60; Tue, 12 Jul 2022 21:04:00 +0000 (UTC) Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by gabe.freedesktop.org (Postfix) with ESMTPS id DE58911A681 for ; Tue, 12 Jul 2022 21:03:58 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1657659839; x=1689195839; h=date:from:to:cc:subject:message-id:references: mime-version:in-reply-to; bh=rmQW3ehNPRI2KvBVSoPH2rajm062H1GvL4DZCnugl70=; b=ciVuLmHbkjCc6N/8tSsxawt4N35T5nj3GXBOVxP3SlS/GJmHv/f5rArK iCiA/9yfMa0YG7QHbw9JChmWYGbz6PLWxqRjh8Yg5/6a1XpFgmUEJhXBC 4rJQokPX13s2lKT6nc1tivflsbHY1uSRD+u9sgIsiO0A3Qb4ClSAhpuqG sBXBxLCzeDK+iM0UV+34kXsbGhGGCaI0n+bbvnMn66yB7thiYPbujcWkJ 3SUcI4hNjMxqkRFjC2OURCdYe1VllqAvS4/Eq/u7amlBq1zRzzpTCYI3a kZha+/lScCk3FkExBMPX1MzqYAIoHfKgB3qvn4TpbktLfHrDL+MuYb+Iu w==; X-IronPort-AV: E=McAfee;i="6400,9594,10406"; a="268079730" X-IronPort-AV: E=Sophos;i="5.92,266,1650956400"; d="scan'208";a="268079730" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jul 2022 14:03:58 -0700 X-IronPort-AV: E=Sophos;i="5.92,266,1650956400"; d="scan'208";a="570348605" Received: from orsosgc001.jf.intel.com ([10.165.21.135]) by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jul 2022 14:03:58 -0700 Date: Tue, 12 Jul 2022 14:03:58 -0700 From: Umesh Nerlige Ramappa To: Tvrtko Ursulin Message-ID: References: <5535d98d0c1f1fa22e6ca6e8973a05e58a097944.1656622601.git.stuart.summers@intel.com> <81c381a50536d23ec0922874e13df5e67ffcc3d7.camel@intel.com> <894bdfcb-c3c7-96ce-5ddc-a084ef04179d@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Disposition: inline In-Reply-To: <894bdfcb-c3c7-96ce-5ddc-a084ef04179d@linux.intel.com> Subject: Re: [Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: "intel-gfx@lists.freedesktop.org" Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" On Mon, Jul 04, 2022 at 09:31:55AM +0100, Tvrtko Ursulin wrote: > >On 01/07/2022 15:54, Summers, Stuart wrote: >>On Fri, 2022-07-01 at 09:37 +0100, Tvrtko Ursulin wrote: >>>On 01/07/2022 01:11, Umesh Nerlige Ramappa wrote: >>>>On Thu, Jun 30, 2022 at 09:00:28PM +0000, Stuart Summers wrote: >>>>>In the driver teardown, we are unregistering the gt prior >>>>>to unregistering the PMU. This means there is a small window >>>>>of time in which the application can request metrics from the >>>>>PMU, some of which are calling into the uapi engines list, >>>>>while the engines are not available. In this case we can >>>>>see null pointer dereferences. >>>>> >>>>>Fix this ordering in both the driver load and unload sequences. >>>>> >>>>>Additionally add a check for engine presence to prevent this >>>>>NPD in the event this ordering is accidentally reversed. Print >>>>>a debug message indicating when they aren't available. >>>>> >>>>>v1: Actually address the driver load/unload ordering issue >>>>> >>>>>Signed-off-by: Stuart Summers >>>>>--- >>>> >>>>I thought this is likely happening because intel_gpu_top is running >>>>in >>>>the background when i915 is unloaded. I tried a quick repro, I >>>>don't see >>>>the unload succeed ("fatal module in use", not sure if this was a >>>>partial unload), but when I try to kill intel_gpu_top, I get an >>>>NPD. >>>>This is in the event disable path - i915_pmu_event_stop -> >>>>i915_pmu_disable. >>> >>>So i915 failed to unload (as expected - with perf events open we >>>elevate >>>the module ref count via i915_pmu_event_init -> drm_dev_get), then >>>you >>>quit intel_gpu_top and get NPD? On the engine lookup? With the >>>re-ordered init/fini sequence as from this patch? >>> >>>With elevated module count there shouldn't be any unloading happening >>>so >>>I am intrigued. >>> >>>>It's likely that you are seeing a different path (unload) leading >>>>to the >>>>same issue. >>>> >>>>I think in i915_pmu_disable/disable should be aware of event- >>>>>hw.state >>>>and or pmu->closed states before accessing the event. Maybe like, >>>> >>>>if (event->hw.state != PERF_HES_STOPPED && is_engine_event(event)) >>>>{ >>>> >>>>@Tvrtko, wondering if this case is tested by igt@perf >>>>_pmu@module-unload. >>> >>>A bit yes. From what Stuart wrote it seems the test would need to be >>>extended to cover the case where PMU is getting opened while module >>>unload is in progress. >>> >>>But the NPD you saw is for the moment confusing so I don't know what >>>is >>>happening. >>> >>>>I am not clear if we should use event->hw.state or pmu->closed here >>>>and >>>>if/how they are related. IMO, for this issue, the engine check is >>>>good >>>>enough too, so we don't really need the pmu state checks. >>>>Thoughts? >>> >>>Engine check at the moment feels like papering. >>> >>>Indeed as you say I think the pmu->closed might be the solution. >>>Perhaps >>>the race is as mentioned above. PMU open happening in parallel to >>>unload.. >>> >>>If the sequence of events userspace triggers is: >>> >>> i915_pmu_event_init >>> i915_pmu_event_start >>> i915_pmu_enable >>> i915_pmu_event_read >>> >>>I guess pmu->closed can get set halfway in i915_pmu_event_init. What >>>would be the effect of that.. We'd try to get a module reference >>>while >>>in the process of unloading. Which is probably very bad.. So possibly >>>a >>>final check on pmu->close is needed there. Ho hum.. can it be made >>>safe >>>is the question. >>> >>>It doesn't explain the NPD on Ctrl-C though.. intel_gpu_top keeps >>>the >>>evens open all the time. So I think more info needed, for me at >>>least. >> >>So one thing here is this doesn't have to do with module unload, but >>module unbind specifically (while perf is open). I don't know if the >>NPD from Umesh is the same as what we're seeing here. I'd really like >>to separate these unless you know for sure that's related. Also it >>would be interesting to know if this patch fixes your issue as well. >> >>I still think the re-ordering in i915_driver.c should be enough and we >>shouldn't need to check pmu->closed. The unregister should be enough to >>ensure the perf tools are notified that new events aren't allowed, and >>at that time the engine structures are still intact. And even if for >>some reason the perf code still calls in to our function pointers, we >>have these engine checks as a failsafe. >> >>I'm by the way uploading one more version here with a drm_WARN_ONCE >>instead of the debug print. > >Problem is I am not a fan of papering so lets get to the bottom of the >issue first. (In the meantime simple patch to re-order driver fini is >okay since that seems obvious enough, I tnink.) > >We need to see call traces from any oopses and try to extend perf_pmu >to catch them. And we need to understand the problem, if it is a real >problem, which I laid out last week about race between module unload >and elevating the module use count from our perf event init. > >Without understanding the details of possible failure mode flows we >don't know how much the papering with engine checks solves and how >much it leaves broken. > >If you guys are too busy to tackle that I'll put it onto myself, but >help would certainly be appreciated. Looks like Stuart/Chris are pointing towards the unbind as an issue. I ran this sequence and only the modprobe showed an error (FATAL: ... still in use). What happens with the unbind. Should pmu also handle the unbind somehow? - run intel_gpu_top - unbind - modprobe -r i915 - kill intel_gpu_top. Thanks, Umesh > >Regards, > >Tvrtko