From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5F7991ABECA for ; Wed, 8 Jan 2025 12:55:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.16 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736340923; cv=none; b=LO0OvywUqvUrn+8oL3QS19opZSchcVn4/s+10356qWMRzPomujLqFNiwpXdGc1bohKKAjRvKHtyckPiJXZc/exFVCHz0SF8+MgtQlaILYBPD9H5MyXPj7hnoKQMD2H9lm19rEESYoFCnYKxhJTreL1H+d+8DirZqFTVJLBH7woc= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736340923; c=relaxed/simple; bh=pIyNoK1Ji+9gN6h0twjFK9XwCUvUdZ5bIgS+UjHi2pg=; h=From:Date:To:cc:Subject:In-Reply-To:Message-ID:References: MIME-Version:Content-Type; b=rIDuq8nHe+bPFYygCfSRMu6DmjiFUHgri0u985KEzuNrIu9lPUZb5rDo5bCwUD+Cbtsq+P3qauXXsQGtuZ2/79U/w42CNx1SDQnOKDb9QOgVRac3b1yh/NpqX7gMxqMzSuMoeNlFjUKqTw0K0KVvNRQqeXlDrEW/lElYTptQhwY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=none smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=ntV+OGu5; arc=none smtp.client-ip=192.198.163.16 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="ntV+OGu5" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1736340921; x=1767876921; h=from:date:to:cc:subject:in-reply-to:message-id: references:mime-version; bh=pIyNoK1Ji+9gN6h0twjFK9XwCUvUdZ5bIgS+UjHi2pg=; b=ntV+OGu5bEfcOFu1JXxtZN+bnrjO/FyM8kFRwLSjYsPehxduQnFjQ4kc 349YZkww3h/fj1vLgf2XnlVcbLCAJrUjNTdU+75w3TJzHNm6ykld9uTP2 QcfTgao0DVJCYcNlfMdv6tDsnL5wZY/t2mFIry6km5giF0lnfhb56W2RR X9cxQkO+5ESaHZyE2UUcp4bhIuOvOglP+Ap3sePzOjqKuBCbPS7yXWAOW 890esb/367Lab6tFO0a9CS4N0Z4X50OqqxvorJdv1FT5qc3iooz/WpJy8 +yaCGqKwtfXHetacu+YkJLF4pbdcO8qbDkMWKhZwoDy4ocSjluZ1jdGNb g==; X-CSE-ConnectionGUID: Oqkq0He1SF+NUWbBKlhPew== X-CSE-MsgGUID: 1o3Nt2GiRC2KNBhTSi0KBA== X-IronPort-AV: E=McAfee;i="6700,10204,11309"; a="24168489" X-IronPort-AV: E=Sophos;i="6.12,298,1728975600"; d="scan'208";a="24168489" Received: from orviesa010.jf.intel.com ([10.64.159.150]) by fmvoesa110.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jan 2025 04:55:21 -0800 X-CSE-ConnectionGUID: t0IabEpJRf+DXmu0S/WfqQ== X-CSE-MsgGUID: 1EIRcUduQHyI5Wn8qr05qA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.12,224,1728975600"; d="scan'208";a="102953375" Received: from ijarvine-mobl1.ger.corp.intel.com (HELO localhost) ([10.245.245.87]) by orviesa010-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jan 2025 04:55:18 -0800 From: =?UTF-8?q?Ilpo=20J=C3=A4rvinen?= Date: Wed, 8 Jan 2025 14:55:14 +0200 (EET) To: Lukas Wunner cc: Bjorn Helgaas , Krzysztof Wilczynski , linux-pci@vger.kernel.org, Niklas Schnelle , Jonathan Cameron , Mika Westerberg , "Maciej W. Rozycki" , Mario Limonciello , Evert Vorster Subject: Re: [PATCH for-linus] PCI/bwctrl: Fix NULL pointer deref on unbind and bind In-Reply-To: Message-ID: <60f0479e-7aac-1f6e-b667-605e283244e4@linux.intel.com> References: <0ee5faf5395cad8d29fb66e1ec444c8d882a4201.1735852688.git.lukas@wunner.de> Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="8323328-1163775918-1736340914=:1082" This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --8323328-1163775918-1736340914=:1082 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 8 Jan 2025, Lukas Wunner wrote: > On Tue, Jan 07, 2025 at 04:29:18PM +0200, Ilpo J=E4rvinen wrote: > > On Tue, 7 Jan 2025, Lukas Wunner wrote: > > > It seems pcie_bwctrl_setspeed_rwsem is only needed because > > > pcie_retrain_link() calls pcie_reset_lbms_count(), which > > > would recursively acquire pcie_bwctrl_lbms_rwsem. > > > > > > There are only two callers of pcie_retrain_link(), so I'm > > > wondering if the invocation of pcie_reset_lbms_count() > > > can be moved to them, thus avoiding the recursive lock > > > acquisition and allowing to get rid of pcie_bwctrl_setspeed_rwsem. > > > > > > An alternative would be to have a __pcie_retrain_link() helper > > > which doesn't call pcie_reset_lbms_count(). > >=20 > > I considered __pcie_retrain_link() variant but it felt like locking=20 > > details that are internal to bwctrl would be leaking into elsewhere in = the=20 > > code so I had some level of dislike towards this solution, but I'm not= =20 > > strictly against it. >=20 > That's a fair argument. >=20 > It seems the reason you're acquiring pcie_bwctrl_lbms_rwsem in > pcie_reset_lbms_count() is because you need to dereference > port->link_bwctrl so that you can access port->link_bwctrl->lbms_count. >=20 > If you get rid of lbms_count and instead use a flag in pci_dev->priv_flag= s, > then it seems you won't need to acquire the lock and this problem > solves itself. Agreed on both points. > > > I'm also wondering if the IRQ handler really needs to run in > > > hardirq context. Is there a reason it can't run in thread > > > context? Note that CONFIG_PREEMPT_RT=3Dy (as well as the > > > "threadirqs" command line option) cause the handler to be run > > > in thread context, so it must work properly in that situation > > > as well. > >=20 > > If thread context would work now, why was the fix in the commit=20 > > 3e82a7f9031f ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are= =20 > > acked")) needed (the commit is from the bwnotif era)? What has changed= =20 > > since that fix? >=20 > Nothing has changed, I had forgotten about that commit. >=20 > Basically you could move everything in pcie_bwnotif_irq() after clearing > the interrupt into an IRQ thread, but that would just be the access to th= e > atomic variable and the pcie_update_link_speed() call. That's not worth = it > because the overhead to wake the IRQ thread is bigger than just executing > those things in the hardirq handler. >=20 > So please ignore my comment. >=20 >=20 > > > Another oddity that caught my eye is the counting of the > > > interrupts. It seems the only place where lbms_count is read > > > is the pcie_failed_link_retrain() quirk, and it only cares > > > about the count being non-zero. So this could be a bit in > > > pci_dev->priv_flags that's accessed with set_bit() / test_bit() > > > similar to pci_dev_assign_added() / pci_dev_is_added(). > > >=20 > > > Are you planning on using the count for something else in the > > > future? If not, using a flag would be simpler and more economical > > > memory-wise. > >=20 > > Somebody requested having the count exposed. For troubleshooting HW=20 > > problems (IIRC), it was privately asked from me when I posted one of=20 > > the early versions of the bwctrl series (so quite long time ago). I've > > just not created that change yet to put it under sysfs. >=20 > There's a patch pending to add trace events support to native PCIe hotplu= g: >=20 > https://lore.kernel.org/all/20241123113108.29722-1-xueshuai@linux.alibaba= =2Ecom/ >=20 > If that somebody thinks they need to know how often LBMS triggered, > we could just add similar trace events for bandwidth notifications. > That gives us not only the count but also the precise time when the > bandwidth change happened, so it's arguably more useful for debugging. >=20 > Trace points are patched in and out of the code path at runtime, > so they have basically zero cost when not enabled (which would be the > default). >=20 >=20 > > > I'm also worried about the lbms_count overflowing. > > > > Should I perhaps simply do pci_warn() if it happens? >=20 > I'd prefere getting rid of the counter altogether. :) Okay, it's a good suggestion. Trace events seem like a much better=20 approach to the problem. > > > Because there's hardware which signals an interrupt before actually > > > setting one of the two bits in the Link Status Register, I'm > > > wondering if it would make sense to poll the register a couple > > > of times in the irq handler. Obviously this is only an option > > > if the handler is running in thread context. What was the maximum > > > time you saw during testing that it took to set the LBMS bit belatedl= y? > >=20 > > Is there some misunderstanding here between us because I don't think I'= ve=20 > > noticed delayed LBMS assertion? What I saw was the new Link Speed not y= et=20 > > updated when Link Training was already 0. In that case, the Link Status= =20 > > register was read inside the handler so I'd assume LBMS was set to=20 > > actually trigger the interrupt, thus, not set belatedly. >=20 > Evert's laptop has BWMgmt+ ABWMgmt+ bits set on Root Port 00:02.1 > in this lspci dump: >=20 > https://bugzilla.kernel.org/attachment.cgi?id=3D307419&action=3Dedit >=20 > 00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ri= dge GPP Bridge (prog-if 00 [Normal decode]) > LnkSta: Speed 8GT/s, Width x4 > TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+ >=20 > How can it be that BWMgmt+ is set? I would have expected the bandwidth > controller to clear that interrupt. I can only think of two explanations= : > Either BWMgmt+ was set but no interrupt was signaled. Or the interrupt > was signaled and handled before BWMgmt+ was set. Either one of those, or third alternative that it's set more than once=20 and no interrupt occurs on the second assertion (could be e.g. some race=20 due to write-1-to-clear and reasserting LBMS). But, I've not seen this behavior before that report so I cannot answer to= =20 your question about how long it took for LBMS to get asserted. This misbehavior is a bit problematic because those same bits are used to= =20 identify if the interrupt belongs for bwctrl or not. So if they're not=20 set by the time the handler runs, I don't have a way to identify I'd need= =20 to poll those bits in the first place (and it's also known the interrupt=20 is shared with other stuff :-(). > I'm guessing the latter is the case because /proc/irq/33/spurious > indicates 1 unhandled interrupt. >=20 > Back in March 2023 when you showed me your results with various Intel > chipsets, I thought you mentioned that you witnessed this too-early > interrupt situation a couple of times. But I may be misremembering. I've seen interrupts occur before the new Link Speed has been updated but= =20 those still had LT=3D1 and there was _another_ interrupt later. Handling that later interrupt read the new Link Speed and LT=3D0 so effectively=20 there was just one extra interrupt. I assume it's because LBMS is simply=20 asserted twice and the extra interrupt seemed harmless (no idea why HW=20 ends up doing it though). The other case, which is a real problem, had LT=3D0 but no new link speed= =20 in the Current Link Speed field. So it is "premature" in a sense but=20 seemingly the training had completed. In that case, there was no second=20 interrupt so the handler could not acquire the new link speed. bwctrl does read the new link speed outside the handler by calling=20 pcie_update_link_speed() which is mainly to handle misbehaviors where BW notifications are not coming at all (if the link speed changes=20 outside of bwctrl setting speed, those changes won't be detected). It will= =20 still be racy though if LT is deasserted before the Current Link Speed is= =20 set (I've not seen it to fail to get the new speed but I've not tried to=20 stress test it either). --=20 i. --8323328-1163775918-1736340914=:1082--