Linux PCI subsystem development
From: "Bowman, Terry" <terry.bowman@amd.com>
To: Jonathan Cameron <jic23@kernel.org>
Cc: dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, djbw@kernel.org, bhelgaas@google.com,
	shiju.jose@huawei.com, ming.li@zohomail.com,
	Smita.KoralahalliChannabasappa@amd.com, rrichter@amd.com,
	dan.carpenter@linaro.org, PradeepVineshReddy.Kodamati@amd.com,
	lukas@wunner.de, Benjamin.Cheatham@amd.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	vishal.l.verma@intel.com, alucerop@amd.com, ira.weiny@intel.com,
	corbet@lwn.net, rafael@kernel.org, xueshuai@linux.alibaba.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-doc@vger.kernel.org
Subject: Re: [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo
Date: Thu, 7 May 2026 13:26:51 -0500
Message-ID: <aa33879f-6ddb-41f3-ab8f-7896c71bcbde@amd.com>
In-Reply-To: <20260507185303.329cf964@jic23-huawei>

On 5/7/2026 12:53 PM, Jonathan Cameron wrote:
> On Tue, 5 May 2026 12:30:19 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> CXL virtual hierarchy (VH) native RAS handling for CXL Port devices will be
>> added soon. This requires a notification mechanism for the AER driver to
>> share the AER interrupt with the CXL driver. The CXL drivers use the
>> notification to handle and log the CXL RAS errors.
>>
>> Note, 'CXL protocol error' terminology refers to CXL VH and not CXL RCH
>> errors unless specifically noted going forward.
>>
>> Introduce a new file in the AER driver to handle the CXL protocol
>> errors: pci/pcie/aer_cxl_vh.c.
>>
>> Add a kfifo work queue to be used by the AER and CXL drivers. Multiple
>> AER IRQ worker threads can be running and enqueueing concurrently, so
>> include write path synchronization. Pack the kfifo, the spinlock, the
>> rwsem, and the work pointer into a single structure. Initialize the
>> kfifo with INIT_KFIFO() from a subsys_initcall so its mask, esize and
>> data fields are valid before any producer or consumer runs.
>>
>> Add CXL work queue handler registration functions in the AER driver.
>> Export them so the CXL driver can assign or clear the work handler.
>>
>> Introduce 'struct cxl_proto_err_work_data' to serve as the kfifo work
>> data. It contains a reference to the PCI error source device and the
>> error severity. The cxl_core driver uses this when dequeuing the work.
>>
>> Introduce cxl_forward_error() to add a given CXL protocol error to a
>> work structure and push it onto the AER-CXL kfifo. This function takes
>> a pci_dev_get() on the source device. The kfifo consumer is responsible
>> for the matching pci_dev_put() after dequeue. On enqueue failure
>> cxl_forward_error() does the put itself.
>>
>> Synchronize accesses to the work function pointer during registration,
>> deregistration, enqueue, and dequeue.
>>
>> handle_error_source() is intentionally not changed here. The is_cxl_error()
>> switch that routes errors to cxl_forward_error() is added in a later patch
>> together with the kfifo consumer registration. This way the producer and
>> consumer land in the same commit, so CXL errors are not silently dropped
>> during bisect.
>>
>> Also add MAINTAINERS entries for both drivers/pci/pcie/aer_cxl_vh.c
>> (new in this patch) and drivers/pci/pcie/aer_cxl_rch.c (already in tree
>> but previously unlisted) under the existing CXL entry. This way the CXL
>> maintainers are CC'd on changes to the AER-CXL bridging code.
>>
>> Co-developed-by: Dan Williams <djbw@kernel.org>
>> Signed-off-by: Dan Williams <djbw@kernel.org>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> 
> Sashiko did have one comment on what happens if there are multiple things
> in the kfifo and fn fails.  At that point I think we are in the all
> bets are off corner and stranding a driver is fine, but open to other opinions!
> 
> https://sashiko.dev/#/patchset/20260505173029.2718246-1-terry.bowman%40amd.com
> 
> So with that in mind
> 
> Reviewed-by: Jonathan Cameron <jic23@kernel.org>
> 

Hi Jonathan,

I resolved this for the next series by changing __cxl_proto_err_work_fn() to
return void, since the error case was unnecessary and only added complexity.



>> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
>> new file mode 100644
>> index 000000000000..c0fea2c2b9bc
>> --- /dev/null
>> +++ b/drivers/pci/pcie/aer_cxl_vh.c
> 
> 
>> +int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
>> +			   cxl_proto_err_fn_t fn)
>> +{
>> +	int rc;
>> +
>> +	guard(rwsem_read)(&cxl_proto_err_kfifo.rwsem);
>> +	while (kfifo_get(&cxl_proto_err_kfifo.fifo, wd)) {
>> +		rc = fn(wd);
>> +		pci_dev_put(wd->pdev);
>> +		if (rc)
>> +			return rc;
> This is where Sashiko complains. Specifically:
> "If the consumer callback fn() returns an error, does this early return
> strand the remaining items in the kfifo?
> Because cxl_forward_error() takes a pci_dev reference for each enqueued
> item, it looks like these stranded items might leak their pci_dev references
> and prevent clean unbinding or hot-unplug until a new error triggers the
> queue again."
> 
> I'd go with indeed it does, but there is no right thing to do here. I guess
> we could flush the kfifo and call pci_dev_put() on each of them, but that's horrible.
> Would basically mean calling the same stuff you have for cancelling outstanding
> entries on exit().
> 
> 

Yes, that is an idea. But until an error discriminator is needed, this can
return void. Clearing will be necessary, but I think that will fit within the
call path.

-Terry


>> +	}
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_FOR_MODULES(for_each_cxl_proto_err, "cxl_core");
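
For illustration, a minimal userspace sketch of the reference flow discussed
above. This is not the kernel code: it assumes a toy power-of-two ring buffer
in place of kfifo, a plain counter in place of the pci_dev refcount, and
hypothetical names (fake_dev, forward_error, drain, demo). Locking is not
modeled. The producer takes a reference and drops it again if the enqueue
fails. The consumer callback returns void, so the drain loop has no early
return and every dequeued entry gets its matching put.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define FIFO_SIZE 4u	/* power of two, mirroring the kfifo requirement */

struct fake_dev { int refcount; };	/* hypothetical stand-in for struct pci_dev */
struct work_data { struct fake_dev *dev; int severity; };

struct fifo {
	struct work_data slots[FIFO_SIZE];
	unsigned int in, out;	/* free-running indices, masked on access */
};

static void dev_get(struct fake_dev *d) { d->refcount++; }
static void dev_put(struct fake_dev *d) { d->refcount--; }

static bool fifo_put(struct fifo *f, const struct work_data *wd)
{
	if (f->in - f->out == FIFO_SIZE)
		return false;	/* full */
	f->slots[f->in++ & (FIFO_SIZE - 1)] = *wd;
	return true;
}

static bool fifo_get(struct fifo *f, struct work_data *wd)
{
	if (f->in == f->out)
		return false;	/* empty */
	*wd = f->slots[f->out++ & (FIFO_SIZE - 1)];
	return true;
}

/* Producer: take a reference, and drop it again if the enqueue fails. */
static bool forward_error(struct fifo *f, struct fake_dev *d, int severity)
{
	struct work_data wd = { .dev = d, .severity = severity };

	dev_get(d);
	if (!fifo_put(f, &wd)) {
		dev_put(d);
		return false;
	}
	return true;
}

typedef void (*err_fn_t)(struct work_data *wd);

/* Consumer: void callback, no early return, one put per dequeued entry. */
static int drain(struct fifo *f, err_fn_t fn)
{
	struct work_data wd;
	int handled = 0;

	while (fifo_get(f, &wd)) {
		fn(&wd);
		dev_put(wd.dev);
		handled++;
	}
	return handled;
}

static void log_error(struct work_data *wd)
{
	printf("handled error, severity %d\n", wd->severity);
}

/* Overfill the FIFO, drain it, and return the device's final refcount. */
static int demo(void)
{
	struct fifo f = { 0 };
	struct fake_dev d = { .refcount = 1 };	/* one reference held by the caller */

	for (int i = 0; i < 6; i++)
		forward_error(&f, &d, i);	/* the last two enqueues fail */
	drain(&f, log_error);
	return d.refcount;
}
```

Calling demo() from a main() queues six errors into a four-slot FIFO and then
drains it. The two failed enqueues drop their own references and the four
queued entries are put by the consumer, so the count returns to the caller's
single reference. With an int-returning callback and an early return, any
entries left behind the failing one would keep their references until the
next drain.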

