Date: Thu, 7 May 2026 18:53:03 +0100
From: Jonathan Cameron
To: Terry Bowman
Subject: Re: [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo
Message-ID: <20260507185303.329cf964@jic23-huawei>
In-Reply-To: <20260505173029.2718246-2-terry.bowman@amd.com>
References: <20260505173029.2718246-1-terry.bowman@amd.com>
 <20260505173029.2718246-2-terry.bowman@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 5 May 2026 12:30:19 -0500
Terry Bowman wrote:

> CXL virtual hierarchy (VH) native RAS handling for CXL Port devices will be
> added soon. This requires a notification mechanism for the AER driver to
> share the AER interrupt with the CXL driver. The CXL drivers use the
> notification to handle and log the CXL RAS errors.
>
> Note, 'CXL protocol error' terminology refers to CXL VH and not CXL RCH
> errors unless specifically noted going forward.
>
> Introduce a new file in the AER driver to handle the CXL protocol
> errors: pci/pcie/aer_cxl_vh.c.
>
> Add a kfifo work queue to be used by the AER and CXL drivers. Multiple
> AER IRQ worker threads can be running and enqueueing concurrently, so
> include write path synchronization. Pack the kfifo, the spinlock, the
> rwsem, and the work pointer into a single structure. Initialize the
> kfifo with INIT_KFIFO() from a subsys_initcall so its mask, esize and
> data fields are valid before any producer or consumer runs.
>
> Add CXL work queue handler registration functions in the AER driver.
> Export them so the CXL driver can assign or clear the work handler.
>
> Introduce 'struct cxl_proto_err_work_data' to serve as the kfifo work
> data. It contains a reference to the PCI error source device and the
> error severity. The cxl_core driver uses this when dequeuing the work.
>
> Introduce cxl_forward_error() to add a given CXL protocol error to a
> work structure and push it onto the AER-CXL kfifo. This function takes
> a pci_dev_get() on the source device. The kfifo consumer is responsible
> for the matching pci_dev_put() after dequeue. On enqueue failure
> cxl_forward_error() does the put itself.
>
> Synchronize accesses to the work function pointer during registration,
> deregistration, enqueue, and dequeue.
>
> handle_error_source() is intentionally not changed here. The is_cxl_error()
> switch that routes errors to cxl_forward_error() is added in a later patch
> together with the kfifo consumer registration. This way the producer and
> consumer land in the same commit, so CXL errors are not silently dropped
> during bisect.
>
> Also add MAINTAINERS entries for both drivers/pci/pcie/aer_cxl_vh.c
> (new in this patch) and drivers/pci/pcie/aer_cxl_rch.c (already in tree
> but previously unlisted) under the existing CXL entry. This way the CXL
> maintainers are CC'd on changes to the AER-CXL bridging code.
>
> Co-developed-by: Dan Williams
> Signed-off-by: Dan Williams
> Signed-off-by: Terry Bowman

Sashiko did have one comment on what happens if there are multiple
things in the kfifo and fn fails. At that point I think we are in the
all bets are off corner and stranding a driver is fine, but open to
other opinions!
https://sashiko.dev/#/patchset/20260505173029.2718246-1-terry.bowman%40amd.com

So with that in mind
Reviewed-by: Jonathan Cameron

> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
> new file mode 100644
> index 000000000000..c0fea2c2b9bc
> --- /dev/null
> +++ b/drivers/pci/pcie/aer_cxl_vh.c
> +int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
> +			   cxl_proto_err_fn_t fn)
> +{
> +	int rc;
> +
> +	guard(rwsem_read)(&cxl_proto_err_kfifo.rwsem);
> +	while (kfifo_get(&cxl_proto_err_kfifo.fifo, wd)) {
> +		rc = fn(wd);
> +		pci_dev_put(wd->pdev);
> +		if (rc)
> +			return rc;

This is where Sashiko complains. Specifically:

"If the consumer callback fn() returns an error, does this early return
strand the remaining items in the kfifo? Because cxl_forward_error()
takes a pci_dev reference for each enqueued item, it looks like these
stranded items might leak their pci_dev references and prevent clean
unbinding or hot-unplug until a new error triggers the queue again."

I'd go with: indeed it does, but there is no right thing to do here.
I guess we could flush the kfifo and call pci_dev_put() on each
remaining entry, but that's horrible. It would basically mean calling
the same code you have for cancelling outstanding entries on exit().

> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_FOR_MODULES(for_each_cxl_proto_err, "cxl_core");
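
To make the trade-off above concrete, here is a minimal userspace
sketch of the two options: returning on the first fn() failure versus
draining the queue and dropping the remaining references. It is not
kernel code. struct fake_dev, struct work_item, struct fifo and every
function name below are hypothetical stand-ins for pci_dev, the work
data, the kfifo and the get/put helpers, and the locking (spinlock,
rwsem) is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pci_dev: only a reference count. */
struct fake_dev {
	int refs;
};

/* Hypothetical stand-in for the kfifo work data. */
struct work_item {
	struct fake_dev *dev;
	int severity;
};

#define FIFO_SIZE 8	/* power of two, so the index wraps cleanly */

/* Hypothetical ring buffer; the real code uses a kfifo plus locking. */
struct fifo {
	struct work_item slot[FIFO_SIZE];
	unsigned int in, out;
};

static void dev_get(struct fake_dev *d) { d->refs++; }
static void dev_put(struct fake_dev *d) { assert(d->refs > 0); d->refs--; }

/* Producer: take a reference, and drop it again if the queue is full. */
static bool forward_error(struct fifo *f, struct fake_dev *d, int severity)
{
	dev_get(d);
	if (f->in - f->out == FIFO_SIZE) {
		dev_put(d);
		return false;
	}
	f->slot[f->in++ % FIFO_SIZE] =
		(struct work_item){ .dev = d, .severity = severity };
	return true;
}

static bool fifo_get(struct fifo *f, struct work_item *wi)
{
	if (f->in == f->out)
		return false;
	*wi = f->slot[f->out++ % FIFO_SIZE];
	return true;
}

typedef int (*err_fn_t)(struct work_item *wi);

/* Same shape as the quoted loop: stop at the first failure. */
static int consume_early_return(struct fifo *f, err_fn_t fn)
{
	struct work_item wi;
	int rc;

	while (fifo_get(f, &wi)) {
		rc = fn(&wi);
		dev_put(wi.dev);
		if (rc)
			return rc;	/* later entries keep their references */
	}
	return 0;
}

/* Alternative: after a failure, skip fn() but still drop every reference. */
static int consume_drain_on_error(struct fifo *f, err_fn_t fn)
{
	struct work_item wi;
	int rc = 0;

	while (fifo_get(f, &wi)) {
		if (!rc)
			rc = fn(&wi);
		dev_put(wi.dev);
	}
	return rc;
}

/* Hypothetical handler that fails on severity 2. */
static int fail_on_fatal(struct work_item *wi)
{
	return wi->severity == 2 ? -1 : 0;
}
```

With three queued entries where the second one fails, the first
variant leaves the third device holding one reference until the queue
is run again, while the second variant leaves all counts at zero but
silently discards the third error. Which of those is less bad is
exactly the open question.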