From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.8]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3BF2D2036F9; Mon, 3 Mar 2025 11:58:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.8 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1741003111; cv=none; b=UZxp1juwH/4H/KFqGkdrlsa5fAoHjvWblt13nz9++dYCNns0+f8RvOdF4vI36LiNvXKQuOWT/bQLdm+5mKpxC9aJZeKRk+p9zNAGvT0peB9frVUaCZe0xNNTalng2nfHnYhK3PX3hcINEUvG3/Ng1bsA591TBvXxp3cGe2WjsGQ= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1741003111; c=relaxed/simple; bh=e0Wy96/Vl/Za2xK/skILAtBW0xOPi/xeh93FvbxO05A=; h=From:Date:To:cc:Subject:In-Reply-To:Message-ID:References: MIME-Version:Content-Type; b=T3/KWn8r+cnDKua4eX/uZ4cSQQUFSCtS0WZ1OozirHOcq5eZqI24VbptJb7+TdGHacoGBobrMIf/wkoIbKYAZbaJqGKncmx8Z+9dojlWaqLOee/B4ifD2uTisnmWXYxi17D1Os6tcwKDBfxseiSlVYDfziSE5EjFjl/IYahS9KE= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=none smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=VL/7yigD; arc=none smtp.client-ip=192.198.163.8 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="VL/7yigD" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1741003110; x=1772539110; h=from:date:to:cc:subject:in-reply-to:message-id: references:mime-version; bh=e0Wy96/Vl/Za2xK/skILAtBW0xOPi/xeh93FvbxO05A=; b=VL/7yigDTIhgpVZFi2217on20CxNR/u0lVny0+G836xGjRe6sjoJotcV 1mYbOrTMUz2yCdvLV3eFi9015cO27PcN6+m1+67jubLnHk5FOlSqTjduI ILBL9y4wNiUwGt5v5+5mRbz4v9n+CJFDvSCd2MnafB7+j9A6yDFuRVJJM hLxuaV0a9QatQIYjDH8tej64heZEa7watkTYh/4r4YtP+2uEAFMM6Caai BbxE4BMwR+L3Ck6WonV4QrBQqu5O64kwjvyBpCsufuMwY61f72OfERBm6 NBg+ooIXGBiWeZYWXyrxbZZpwnUkOHa2QGNHxBAalbzn5f9nm+t5wpaaT Q==; X-CSE-ConnectionGUID: qO1VuDtXT+mwgkTN8qh9lg== X-CSE-MsgGUID: TVOE7AfYSmOPNiNw48XHOA== X-IronPort-AV: E=McAfee;i="6700,10204,11361"; a="59419267" X-IronPort-AV: E=Sophos;i="6.13,329,1732608000"; d="scan'208";a="59419267" Received: from orviesa003.jf.intel.com ([10.64.159.143]) by fmvoesa102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Mar 2025 03:58:29 -0800 X-CSE-ConnectionGUID: d9MuPnbBQimcIBMNmwtQFg== X-CSE-MsgGUID: 7CustbZ9QxqW5b+mhEq4NQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.13,329,1732608000"; d="scan'208";a="122939010" Received: from ijarvine-mobl1.ger.corp.intel.com (HELO localhost) ([10.245.245.14]) by ORVIESA003-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Mar 2025 03:58:26 -0800 From: =?UTF-8?q?Ilpo=20J=C3=A4rvinen?= Date: Mon, 3 Mar 2025 13:58:23 +0200 (EET) To: Lukas Wunner cc: Bjorn Helgaas , Jonathan Cameron , linux-pci@vger.kernel.org, LKML , Joel Mathew Thomas , stable@vger.kernel.org Subject: Re: [PATCH 1/1] PCI/bwctrl: Disable PCIe BW controller during reset In-Reply-To: Message-ID: <7fd2f9e9-9c31-abb0-d0c9-f9d0a0ac1bd6@linux.intel.com> References: <20250217165258.3811-1-ilpo.jarvinen@linux.intel.com> <14797a5a-6ded-bf8f-aa0c-128668ba608f@linux.intel.com> Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="8323328-551350541-1741003103=:33389" This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --8323328-551350541-1741003103=:33389 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE On Thu, 27 Feb 2025, Lukas Wunner wrote: > On Mon, Feb 24, 2025 at 05:13:15PM +0200, Ilpo J=E4rvinen wrote: > > On Tue, 18 Feb 2025, Lukas Wunner wrote: > > > On Mon, Feb 17, 2025 at 06:52:58PM +0200, Ilpo J=E4rvinen wrote: > > > > PCIe BW controller enables BW notifications for Downstream Ports by > > > > setting Link Bandwidth Management Interrupt Enable (LBMIE) and Link > > > > Autonomous Bandwidth Interrupt Enable (LABIE) (PCIe Spec. r6.2 sec. > > > > 7.5.3.7). > > > >=20 > > > > It was discovered that performing a reset can lead to the device > > > > underneath the Downstream Port becoming unavailable if BW notificat= ions > > > > are left enabled throughout the reset sequence (at least LBMIE was > > > > found to cause an issue). > > >=20 > > > What kind of reset? FLR? SBR? This needs to be specified in the > > > commit message so that the reader isn't forced to sift through a > > > bugzilla with dozens of comments and attachments. > >=20 > > Heh, I never really tried to figure out it because the reset disable=20 > > patch was just a stab into the dark style patch. To my surprise, it end= ed=20 > > up working (after the initial confusion was resolved) and I just starte= d=20 > > to prepare this patch from that knowledge. >=20 > If the present patch is of the type "changing this somehow makes the > problem go away" instead of a complete root-cause analysis, it would > have been appropriate to mark it as an RFC. I'll keep that mind in the future. I don't think your depiction is entirely accurate of the situtation=20 though. The reporter had confirmed that if bwctrl is change(s) are=20 reverted, the problem is not observed. So I set to understand why bwctrl has any impact here at all since it=20 should touch only a different device (and even that in relatively limited= =20 ways). And that is how I found that if bwctrl is not enabled during reset,= =20 the problem is also not observed. I also noted that the patch just works around the problems and there was=20 also informal speculation about the suspected root cause in the patch (the= =20 only other theory I've a about the root cause relates to extra interrupts= =20 causing a problem through hp/pme interrupt handlers). > I've started to dig into the bugzilla and the very first attachment > (dmesg for the non-working case) shows: >=20 > vfio-pci 0000:01:00.0: timed out waiting for pending transaction; perfo= rming function level reset anyway >=20 > That message is emitted by pcie_flr(). Perhaps the Nvidia GPU takes > more time than usual to finish pending transactions, so the first > thing I would have tried would be to raise the timeout significantly > and see if that helps. Yet I'm not seeing any patch or comment in > the bugzilla where this was attempted. Please provide a patch for > the reporter to verify this hypothesis. I've problem in understanding how reverting bwctrl change does "solve" to= =20 this. Bwctrl is not even supposed to touch Nvidia GPU at all AFAIK. > > Logs do mention this: > >=20 > > [ 21.560206] pcieport 0000:00:01.1: unlocked secondary bus reset via:= pciehp_reset_slot+0x98/0x140 > >=20 > > ...so it seems to be SBR. >=20 > Looking at the vfio code, vfio_pci_core_enable() (which is called on > binding the vfio driver to the GPU) invokes pci_try_reset_function(). > This will execute the reset method configured via sysfs. The same > is done on unbind via vfio_pci_core_disable(). >=20 > So you should have asked the reporter for the contents of: > /sys/bus/pci/devices/0000:01:00.0/reset_method > /sys/bus/pci/devices/0000:01:00.1/reset_method >=20 > In particular, I would like to know whether the contents differ across > different kernel versions. >=20 > There's another way to perform a reset: Via an ioctl. This ends up > calling vfio_pci_dev_set_hot_reset(), which invokes pci_reset_bus() > to perform an SBR. >=20 > Looking at dmesg output in log_linux_6.13.2-arch1-1_pcie_port_pm_off.log > it seems that vfio first performs a function reset of the GPU on bind... >=20 > [ 40.171564] vfio-pci 0000:01:00.0: resetting > [ 40.276485] vfio-pci 0000:01:00.0: reset done >=20 > ...and then goes on to perform an SBR both of the GPU and its audio > device... >=20 > [ 40.381082] vfio-pci 0000:01:00.0: resetting > [ 40.381180] vfio-pci 0000:01:00.1: resetting > [ 40.381228] pcieport 0000:00:01.1: unlocked secondary bus reset via: p= ciehp_reset_slot+0x98/0x140 > [ 40.620442] vfio-pci 0000:01:00.0: reset done > [ 40.620479] vfio-pci 0000:01:00.1: reset done >=20 > ...which is odd because the audio device apparently wasn't bound to > vfio-pci, otherwise there would have been a function reset. So why > does vfio think it can safely reset it? >=20 > Oddly, there is a third function reset of only the GPU: >=20 > [ 40.621894] vfio-pci 0000:01:00.0: resetting > [ 40.724430] vfio-pci 0000:01:00.0: reset done >=20 > The reporter writes that pcie_port_pm=3Doff avoids the PME messages. > If the reset_method is "pm", I could imagine that the Nvidia GPU > signals a PME event during the D0 -> D3hot -> D0 transition. > > I also note that the vfio-pci driver allows runtime PM. So both the > GPU and its audio device may runtime suspend to D3hot. This in turn > lets the Root Port runtime suspend to D3hot. It looks like the > reporter is using a laptop with an integrated AMD GPU and a > discrete Nvidia GPU. On such products the platform often allows > powering down the discrete GPU and this is usually controlled > through ACPI Power Resources attached to the Root Port. > Those are powered off after the Root Port goes to D3hot. > You should have asked the reporter for an acpidump. A lots of these suggestions do not make much sense to me given that=20 reverting bwctrl alone does not exhibit the problem. E.g., the reset=20 method should be exactly the same. I can see there could be some way through either hp or PME interrupt=20 handlers, where an extra interrupt that comes due to bwctrl (LBMIE) being= =20 enabled triggers one of the such behaviors. But none of the above=20 description theoritizes anything to that direction. > pcie_bwnotif_irq() accesses the Link Status register without > acquiring a runtime PM reference on the PCIe port. This feels > wrong and may also contribute to the issue reported here. > Acquiring a runtime PM ref may sleep, so I think you need to > change the driver to use a threaded IRQ handler. >=20 > Nvidia GPUs are known to hide the audio device if no audio-capable > display is attached (e.g. HDMI). quirk_nvidia_hda() unhides the > audio device on boot and resume. It might be necessary to also run > the quirk after resetting the GPU. Knowing which reset_method > was used is important to decide if that's necessary, and also > whether a display was attached. =2E..You seem to have a lot of ideas on this. ;-) > Moreover Nvidia GPUs are known to change the link speed on idle > to reduce power consumption. Perhaps resetting the GPU causes > a change of link speed and thus execution of pcie_bwnotif_irq()? Why would execution of pcie_bwnotif_irq() be a problem, it's not that complicated? I'm more worried that having LBMIE enabled causes also the other interrupt= =20 handlers to execute due to the shared interrupt. > > > This approach won't work if the reset is performed without software > > > intervention. E.g. if a DPC event occurs, the device likewise underg= oes > > > a reset but there is no prior system software involvement. Software = only > > > becomes involved *after* the reset has occurred. > > >=20 > > > I think it needs to be tested if that same issue occurs with DPC. > > > It's easy to simulate DPC by setting the Software Trigger bit: > > >=20 > > > setpci -s 00:01.1 ECAP_DPC+6.w=3D40:40 > > >=20 > > > If the issue does occur with DPC then this fix isn't sufficient. > >=20 > > Looking into lspci logs, I don't see DPC capability being there for=20 > > 00:01.1?! >=20 > Hm, so we can't verify whether your approach is safe for DPC. >=20 >=20 > > > Instead of putting this in the PCI core, amend pcie_portdrv_err_handl= er > > > with ->reset_prepare and ->reset_done callbacks which call down to al= l > > > the port service drivers, then amend bwctrl.c to disable/enable > > > interrupts in these callbacks. > >=20 > > Will it work? I mean if the port itself is not reset (0000:00:01.1 in t= his=20 > > case), do these callbacks get called for it? >=20 > Never mind, indeed this won't work. >=20 > Thanks, >=20 > Lukas >=20 I'm sorry for the delay, I've been sick for a while. --=20 i. --8323328-551350541-1741003103=:33389--