From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C6A861ABED5; Wed, 28 Aug 2024 21:11:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724879516; cv=none; b=T+BSi9NAfyRjrTIcfi/F7Fs5PdEtEg5dniiboTOvhfDY1pW8rBxZWhJluiA0r7VC7Bf+ZJONa7YfeE4sqJ/jpvEcM23yj8Ihlf4UccNfB1UkhiNDXpV6zEQqDNDQT8EneoIDLPFb0XITctrL8klMPm/ymrF8m5WrvxwhTANag5I= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724879516; c=relaxed/simple; bh=/Ps7/eGxQldl91CBy36YhIiYyuvPwv6AobwEG8m44OA=; h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID: MIME-Version:Content-Type; b=rb4EoCeA1StnUw04wmv5y06KxnL8QQS3SEzI+Qri/kWxiQcbaY483fjg6XrXfohT8gG/2jiZI62uH7e9k8bYzKSO4aMnVmNFFzL2tIe7PmzCGUIOFb/q8Vx9aMj3XU8ceWo1/4piq0jtUFcMeOI7ZpkigenMqGuK6WXZh/kmMLc= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=testeW97; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=Q7GXyTts; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="testeW97"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="Q7GXyTts" From: Thomas Gleixner DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1724879507; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=WNFOyNCt6wcdA2prJP3abR4fOjwg9PndIMrgckIdWFk=; b=testeW97CeCYgFwn5VRnbc/ueSlh+H8c91sRFGoIU8MdYyIKRZSEXo1B+jJC/fanY7Q2Gc pAQfbL07DzGPNeMwc7dLACKuj4d9eUFyMwjNvzSongsJIQlglVnT8zxSXPLuh8yYyavGh6 PRIMOqR4USTalgm9DnsIVAgE+PgQvbkuc40RjLoGyoEy+I6YPbq/W8ZWYaDCHNNotkzpTX diwTTlmRlQ6N2zPzWTXYa9JWK5wUv5Lv81S8tTgrkjF8taoIqyhwdb0AkC7RPEXMHlsyWd fGm2/CUgyGYGJpuUqgAYzceXuVxKtZg/2nI/WVSBNjW9FwhEnF3IMSyBCfPG1A== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1724879507; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=WNFOyNCt6wcdA2prJP3abR4fOjwg9PndIMrgckIdWFk=; b=Q7GXyTts76R5pCNANZOXeOXPY9DaGAmynJdErfyx+fotzbGK5IRl0aBEFYVh9m6O9WZI3+ P9a5AW5AFLTpLWBQ== To: Jonathan Cameron , Lukas Wunner Cc: Bjorn Helgaas , linuxarm@huawei.com, Mahesh J Salgaonkar , Bjorn Helgaas , linux-cxl@vger.kernel.org, linux-pci@vger.kernel.org, Davidlohr Bueso , Dave Jiang , Alison Schofield , Vishal Verma , Ira Weiny , Dan Williams , Will Deacon , Mark Rutland , Lorenzo Pieralisi , terry.bowman@amd.com, Kuppuswamy Sathyanarayanan , Ilpo =?utf-8?Q?J=C3=A4rvinen?= Subject: Re: [RFC PATCH 0/9] pci: portdrv: Add auxiliary bus and register CXL PMUs (and aer) In-Reply-To: <20240823120501.00004151@Huawei.com> References: <20240529164103.31671-1-Jonathan.Cameron@huawei.com> <20240605180409.GA520888@bhelgaas> <20240605204428.00001cb2@Huawei.com> <20240605213910.00003034@huawei.com> <20240823120501.00004151@Huawei.com> Date: Wed, 28 Aug 2024 23:11:46 +0200 Message-ID: <87plpsbbe5.ffs@tglx> Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain Jonathan! On Fri, Aug 23 2024 at 12:05, Jonathan Cameron wrote: > On Thu, 6 Jun 2024 15:21:02 +0200 Lukas Wunner wrote: >> On Thu, Jun 06, 2024 at 01:57:56PM +0100, Jonathan Cameron wrote: >> > Or are you thinking we can make something like the following work >> > (even when we can't do dynamic msix)? >> > >> > Core bring up facilities and interrupt etc. That includes bus master >> > so msi/msix can be issued (which makes me nervous - putting aside other >> > concerns as IIRC there are drivers where you have to be careful to >> > tweak registers to ensure you don't get a storm of traffic when you >> > hit bus master enable. Correct. You have to be really careful about this. >> > Specific driver may bind later - everything keeps running until >> > the specific driver calls pci_alloc_irq_vectors(), then we have to disable all >> > interrupts for core managed services, change the number of vectors and >> > then move them to wherever they end up before starting them again. >> > We have to do the similar in the unwind path. >> >> My recollection is that Thomas Gleixner has brought up dynamic MSI-X >> allocation. So if both the PCI core services as well as the driver >> use MSI-X, there's no problem. Correct. If the core sets up N interrupts for core usage, then devices can later on allocate MSI-X vectors on top if the underlying interrupt domain hierarchy supports it. But see below. >> For MSI, one approach might be to reserve a certain number of vectors >> in the core for later use by the driver. MSI is a problem in several aspects. 1) You really have to know how many vectors you need at the end as you need to disable MSI in order to change the number of vectors in the config space. So if you do that after interrupts are already in use this can cause lost interrupts and stale devices. When MSI is disabled the interrupt might be routed to the legacy intX and aside of an eventual spurious interrupt message there is no trace of it. 2) Allocating N vectors upfront and only using a small portion for the core and have the rest handy for drivers is problematic when the MSI implementation does not provide masking of the individual MSI interrupts. The driver probing might result in an interrupt storm for obvious reasons. Aside of that if the core allocates N interrupts then all the resources required (interrupt descriptors....CPU vectors) are allocated right away. There is no reasonable way to do that post factum like we can do with MSI-X. Any attempt to hack that into submission is futile and I have zero interrest to even look at it. The shortcomings of MSI are known for two decades and it's sufficiently documented that it's a horrorshow for software to deal with it. Though the 'save 5 gates and deal with it in software' camp still has not got the memo. TBH, I don't care at all. We can talk about MSI-X, but MSI is not going to gain support for any of this. That's the only leverage we have to educate people who refuse to accept reality. > So, my first attempt at doing things in the core ran into what I think > is probably a blocker. It's not entirely technical... > > +CC Thomas who can probably very quickly confirm if my reasoning is > correct. > > If we move these services into the PCI core itself (as opposed keeping > a PCI portdrv layer), then we need to bring up MSI for a device before > we bind a driver to that struct pci_dev / struct device. > > If we follow through > pci_alloc_irq_vectors_affinity() > -> __pci_enable_msix_range() / __pci_enable_msi_range() > -> pci_setup_msi_context() > -> msi_setup_device_data() > -> devres_add() > //there is actually devres usage before this in msi_sysfs_create_group() > but this is a shorter path to illustrate the point. > > We will have registered a dev_res callback on the struct pci_dev > that we later want to be able to bind a driver to. That driver > won't bind - because lifetimes of devres will be garbage > (see really_probe() for the specific check and resulting > "Resources present before probing") That's because you are violating all layering rules. See below. > So we in theory 'could' provide a non devres path through > the MSI code, but I'm going to guess that Thomas will not > be particularly keen on that to support this single use case. Correct. > Thomas, can you confirm if this is something you'd consider > or outright reject? Or is there an existing route to have > MSI/X working pre driver bind and still be able to bind > the driver later. The initial enable has to set up at least one vector. That vector does not have to be requested by anything, but that's it. After that you can allocate with pci_msix_alloc_irq_at(). So looking at the ASCII art of the cover letter: _____________ __ ________ __________ | Port | | creates | | | | PCI Dev | |--------->| CPMU A | | |_____________| | |________| | |portdrv binds | | perf/cxlpmu binds | |________________| |___________________| \ \ \ ________ __________ \ | | | ------>| CPMU B | | |________| | | perf/cxlpmu binds | |___________________| AND _____________ __ | Type 3 | | creates ________ _________ | PCI Dev | |--------------------------------------->| | | |_____________| | | CPMU C | | | cxl_pci binds | |________| | |________________| | perf/cxpmu binds | |__________________| If I understand it correctly then both the portdrv and the cxl_pci drivers create a "bus". The CPMU devices are attached to these buses. So we are talking about distinctly different devices with the twist that these devices somehow need to utilize the MSI/X (forget MSI) facility of the device which creates the bus. >From the devres perspective we look at separate devices and that's what the interrupt code expects too. This reminds me of the lengthy discussion we had about IMS a couple of years ago. https://lore.kernel.org/all/87bljg7u4f.fsf@nanos.tec.linutronix.de/ My view on that issue was wrong because the Intel people described the problem wrong. But the above pretty much begs for a proper separation and hierarchical design because you provide an actual bus and distinct devices. Reusing the ASCII art from that old thread for the second case, but it's probably the same for the first one: ]-------------------------------------------| | PCI device | ]-------------------| | | Physical function | | ]-------------------| | ]-------------------|----------| | | Control block for subdevices | | ]------------------------------| | | | <- "Subdevice BUS" | | | | | |-- Subddevice 0 | | |-- Subddevice 1 | | |-- ... | | |-- Subddevice N | ]-------------------------------------------| 1) cxl_pci driver binds to the PCI device. 2) cxl_pci driver creates AUXbus 3) Bus enumerates devices on AUXbus 4) Drivers bind to the AUXbus devices So you have a clear provider consumer relationship. Whether the subdevice utilizes resources of the PCI device or not is a hardware implementation detail. The important aspect is that you want to associate the subdevice resources to the subdevice instances and not to the PCI device which provides them. Let me focus on interrupts, but it's probably the same for everything else which is shared. Look at how the PCI device manages interrupts with the per device MSI mechanism. Forget doing this with either one of the legacy mechanisms. 1) It creates a hierarchical interrupt domain and gets the required resources from the provided parent domain. The PCI side does not care whether this is x86 or arm64 and it neither cares whether the parent domain does remapping or not. The only way it cares is about the features supported by the different parent domains (think multi-MSI). 2) Driver side allocations go through the per device domain That AUXbus is not any different. When the CPMU driver binds it wants to allocate interrupts. So instead of having a convoluted construct reaching into the parent PCI device, you simply can do: 1) Let the cxl_pci driver create a MSI parent domain and set that in the subdevice::msi::irqdomain pointer. 2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device driver to create a per device interrupt domain. 3) Let the CPMU driver create its per device interrupt domain with the provided parent domain 4) Let the CPMU driver allocate its MSI-X interrupts through the per device domain Now on the cxl_pci side the AUXbus parent interrupt domain allocation function does: if (!pci_msix_enabled(pdev)) return pci_msix_enable_and_alloc_irq_at(pdev, ....); else return pci_msix_alloc_irq_at(pdev, ....); The condition can be avoided if the clx_pci driver enables MSI-X anyway for it's own purposes. Obviously you need the corresponding counterpart for free(), but that's left as an exercise for the reader. That still associates the subdevice specific MSI-X interrups to the underlying PCI device, but then you need to look at teardown. The cxl_pci driver cannot go away before the CPMU drivers are shutdown. The CPMU driver shutdown releases the interrupts through the interrupt domain hierarchy, which removes them from the parent PCI device. Once all CPMU drivers are gone, the shutdown of the cxl_pci driver gets rid of the cxl_pci driver specific interrupts and everything is cleaned up. This works out of the box on x86. The rest needs to gain support for MSI_FLAG_PCI_MSIX_ALLOC_DYN. ARM64 and parts of ARM32 are already converted over to the MSI parent domain concept, they just lack support for dynamic allocation. The rest of the arch/ world needs some more work. That's the problem of the architecture folks and that CXL auxbus thing can simply tell them if (!pci_msix_can_alloc_dyn(pdev)) return -E_MAKE_YOUR_HOME_WORK; and be done with it. Proliferating support for "two" legacy PCI/MSI mechanisms by violating layering is not going to happen. Neither I'm interrested to have creative workarounds for MSI as they are going to create more problems than they solve, unless you come up with a clean, simple and maintainable solution which works under all circumstances. I'm not holding my breath... I might not have all the crucial information as it happened in the previous IMS discussion, but looking at your ASCII art makes me halfways confident that I'm on the right track. Thanks, tglx