linux-kernel.vger.kernel.org archive mirror
* [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
@ 2025-08-04  9:17 Breno Leitao
  2025-08-04 13:50 ` Sathyanarayanan Kuppuswamy
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Breno Leitao @ 2025-08-04  9:17 UTC (permalink / raw)
  To: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
	Kuppuswamy Sathyanarayanan, Jon Pan-Doh
  Cc: linuxppc-dev, linux-pci, linux-kernel, kernel-team, Breno Leitao

Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
does not rate limit, given this is fatal.

This prevents a kernel crash triggered by dereferencing a NULL pointer
in aer_ratelimit(), ensuring safer handling of PCI devices that lack
AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
which already performs this NULL check.

Signed-off-by: Breno Leitao <leitao@debian.org>
Fixes: a57f2bfb4a5863 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
---
 drivers/pci/pcie/aer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 70ac661883672..b5f96fde4dcda 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -786,6 +786,9 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
 
 static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
 {
+	if (!dev->aer_info)
+		return 1;
+
 	switch (severity) {
 	case AER_NONFATAL:
 		return __ratelimit(&dev->aer_info->nonfatal_ratelimit);

---
base-commit: 89748acdf226fd1a8775ff6fa2703f8412b286c8
change-id: 20250801-aer_crash_2-b21cc2ef0d00

Best regards,
--  
Breno Leitao <leitao@debian.org>
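
For readers following along, the effect of the guard can be modelled in
userspace. This is only a sketch under stated assumptions, not the kernel
code: every toy_* name is hypothetical, the "budget" counter stands in for
struct ratelimit_state and __ratelimit(), and the correctable_ratelimit
field and the default branch are assumed rather than shown in the hunk above.

```c
#include <stddef.h>

/* Toy severity codes for this sketch; named after the kernel's AER_* values. */
enum toy_severity { TOY_AER_NONFATAL, TOY_AER_FATAL, TOY_AER_CORRECTABLE };

/* Hypothetical stand-in for struct ratelimit_state: a simple message budget. */
struct toy_ratelimit {
	int budget;
};

struct toy_aer_info {
	struct toy_ratelimit correctable_ratelimit;	/* assumed field name */
	struct toy_ratelimit nonfatal_ratelimit;
};

struct toy_dev {
	struct toy_aer_info *aer_info;	/* may be NULL, as in the report */
};

/* Stand-in for __ratelimit(): returns 1 if the message may be printed. */
static int toy_ratelimit(struct toy_ratelimit *rl)
{
	if (rl->budget <= 0)
		return 0;
	rl->budget--;
	return 1;
}

static int toy_aer_ratelimit(struct toy_dev *dev, unsigned int severity)
{
	/* The guard from the patch: without aer_info, never suppress. */
	if (!dev->aer_info)
		return 1;

	switch (severity) {
	case TOY_AER_NONFATAL:
		return toy_ratelimit(&dev->aer_info->nonfatal_ratelimit);
	case TOY_AER_CORRECTABLE:
		return toy_ratelimit(&dev->aer_info->correctable_ratelimit);
	default:
		return 1;	/* fatal errors are not ratelimited in this model */
	}
}
```

With the guard in place, a device whose aer_info is NULL takes the early
return and the ratelimit state is never dereferenced; the message is simply
allowed through.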



* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
  2025-08-04  9:17 [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer() Breno Leitao
@ 2025-08-04 13:50 ` Sathyanarayanan Kuppuswamy
  2025-08-04 15:35   ` Breno Leitao
  2025-08-05 14:25 ` Ethan Zhao
  2025-08-06  1:55 ` Ethan Zhao
  2 siblings, 1 reply; 11+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-08-04 13:50 UTC (permalink / raw)
  To: Breno Leitao, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, Jon Pan-Doh
  Cc: linuxppc-dev, linux-pci, linux-kernel, kernel-team


On 8/4/25 2:17 AM, Breno Leitao wrote:
> Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
> when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
> calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
> does not rate limit, given this is fatal.

Why not add it to pci_print_aer() ?

>
> This prevents a kernel crash triggered by dereferencing a NULL pointer
> in aer_ratelimit(), ensuring safer handling of PCI devices that lack
> AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
> which already performs this NULL check.

Is this happening during kernel boot? What are the frequency and the steps
to reproduce? I am curious about why pci_print_aer() is called for a PCI device
without aer_info. No aer_info means that the particular device is already released
or in the process of being released (pci_release_dev()). Is this triggered by
using a stale pci_dev pointer?

>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> Fixes: a57f2bfb4a5863 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
> ---
>   drivers/pci/pcie/aer.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 70ac661883672..b5f96fde4dcda 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -786,6 +786,9 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>   
>   static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
>   {
> +	if (!dev->aer_info)
> +		return 1;
> +
>   	switch (severity) {
>   	case AER_NONFATAL:
>   		return __ratelimit(&dev->aer_info->nonfatal_ratelimit);
>
> ---
> base-commit: 89748acdf226fd1a8775ff6fa2703f8412b286c8
> change-id: 20250801-aer_crash_2-b21cc2ef0d00
>
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
>
-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer



* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
  2025-08-04 13:50 ` Sathyanarayanan Kuppuswamy
@ 2025-08-04 15:35   ` Breno Leitao
  2025-08-04 16:11     ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 11+ messages in thread
From: Breno Leitao @ 2025-08-04 15:35 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
	Jon Pan-Doh, linuxppc-dev, linux-pci, linux-kernel, kernel-team

Hello Sathyanarayanan,

On Mon, Aug 04, 2025 at 06:50:30AM -0700, Sathyanarayanan Kuppuswamy wrote:
> 
> On 8/4/25 2:17 AM, Breno Leitao wrote:
> > Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
> > when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
> > calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
> > does not rate limit, given this is fatal.
> 
> Why not add it to pci_print_aer() ?
> 
> > 
> > This prevents a kernel crash triggered by dereferencing a NULL pointer
> > in aer_ratelimit(), ensuring safer handling of PCI devices that lack
> > AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
> > which already performs this NULL check.
> 
> Is this happening during kernel boot? What are the frequency and the steps
> to reproduce? I am curious about why pci_print_aer() is called for a PCI device
> without aer_info. No aer_info means that the particular device is already released
> or in the process of being released (pci_release_dev()). Is this triggered by
> using a stale pci_dev pointer?

I've reported some of these investigations in here:

https://lore.kernel.org/all/buduna6darbvwfg3aogl5kimyxkggu3n4romnmq6sozut6axeu@clnx7sfsy457/


* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
  2025-08-04 15:35   ` Breno Leitao
@ 2025-08-04 16:11     ` Sathyanarayanan Kuppuswamy
  2025-08-04 16:47       ` Breno Leitao
  0 siblings, 1 reply; 11+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-08-04 16:11 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
	Jon Pan-Doh, linuxppc-dev, linux-pci, linux-kernel, kernel-team


On 8/4/25 8:35 AM, Breno Leitao wrote:
> Hello Sathyanarayanan,
>
> On Mon, Aug 04, 2025 at 06:50:30AM -0700, Sathyanarayanan Kuppuswamy wrote:
>> On 8/4/25 2:17 AM, Breno Leitao wrote:
>>> Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
>>> when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
>>> calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
>>> does not rate limit, given this is fatal.
>> Why not add it to pci_print_aer() ?
>>
>>> This prevents a kernel crash triggered by dereferencing a NULL pointer
>>> in aer_ratelimit(), ensuring safer handling of PCI devices that lack
>>> AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
>>> which already performs this NULL check.
>> Is this happening during kernel boot? What are the frequency and the steps
>> to reproduce? I am curious about why pci_print_aer() is called for a PCI device
>> without aer_info. No aer_info means that the particular device is already released
>> or in the process of being released (pci_release_dev()). Is this triggered by
>> using a stale pci_dev pointer?
> I've reported some of these investigations in here:
>
> https://lore.kernel.org/all/buduna6darbvwfg3aogl5kimyxkggu3n4romnmq6sozut6axeu@clnx7sfsy457/

It has some details, but you did not mention your environment, the steps to
reproduce, or how often you see it. I just want to understand in what scenario
pci_print_aer() is triggered while the device is being released. I am wondering
whether we are missing proper locking somewhere.


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer



* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
  2025-08-04 16:11     ` Sathyanarayanan Kuppuswamy
@ 2025-08-04 16:47       ` Breno Leitao
  0 siblings, 0 replies; 11+ messages in thread
From: Breno Leitao @ 2025-08-04 16:47 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
	Jon Pan-Doh, linuxppc-dev, linux-pci, linux-kernel, kernel-team

Hello Sathyanarayanan

On Mon, Aug 04, 2025 at 09:11:27AM -0700, Sathyanarayanan Kuppuswamy wrote:
> 
> On 8/4/25 8:35 AM, Breno Leitao wrote:
> > Hello Sathyanarayanan,
> > 
> > On Mon, Aug 04, 2025 at 06:50:30AM -0700, Sathyanarayanan Kuppuswamy wrote:
> > > On 8/4/25 2:17 AM, Breno Leitao wrote:
> > > > Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
> > > > when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
> > > > calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
> > > > does not rate limit, given this is fatal.
> > > Why not add it to pci_print_aer() ?
> > > 
> > > > This prevents a kernel crash triggered by dereferencing a NULL pointer
> > > > in aer_ratelimit(), ensuring safer handling of PCI devices that lack
> > > > AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
> > > > which already performs this NULL check.
> > > Is this happening during kernel boot? What are the frequency and the steps
> > > to reproduce? I am curious about why pci_print_aer() is called for a PCI device
> > > without aer_info. No aer_info means that the particular device is already released
> > > or in the process of being released (pci_release_dev()). Is this triggered by
> > > using a stale pci_dev pointer?
> > I've reported some of these investigations in here:
> > 
> > https://lore.kernel.org/all/buduna6darbvwfg3aogl5kimyxkggu3n4romnmq6sozut6axeu@clnx7sfsy457/
> 
> It has some details, but you did not mention your environment, the steps to
> reproduce, or how often you see it. I just want to understand in what scenario
> pci_print_aer() is triggered while the device is being released. I am wondering
> whether we are missing proper locking somewhere.

Oh, unfortunately I don't have these details.

I have a bunch of machines in "prod" running 6.16, and they crash from
time to time, and then I have the crashdumps.

I can get anything that a crashdump provides, but I don't have
a reproducer or the exact steps that trigger it.

If I can get this information from a crashdump, I am more than happy to
investigate. Can we get this information from a crashdump?

Thanks,
--breno


* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
  2025-08-04  9:17 [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer() Breno Leitao
  2025-08-04 13:50 ` Sathyanarayanan Kuppuswamy
@ 2025-08-05 14:25 ` Ethan Zhao
  2025-08-05 15:18   ` Breno Leitao
  2025-08-06  1:55 ` Ethan Zhao
  2 siblings, 1 reply; 11+ messages in thread
From: Ethan Zhao @ 2025-08-05 14:25 UTC (permalink / raw)
  To: Breno Leitao, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, Kuppuswamy Sathyanarayanan, Jon Pan-Doh
  Cc: linuxppc-dev, linux-pci, linux-kernel, kernel-team



On 8/4/2025 5:17 PM, Breno Leitao wrote:
> Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
> when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
> calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
> does not rate limit, given this is fatal.
> 
> This prevents a kernel crash triggered by dereferencing a NULL pointer
> in aer_ratelimit(), ensuring safer handling of PCI devices that lack
> AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
> which already performs this NULL check.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> Fixes: a57f2bfb4a5863 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
> ---
>   drivers/pci/pcie/aer.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 70ac661883672..b5f96fde4dcda 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -786,6 +786,9 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>   
>   static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
>   {
> +	if (!dev->aer_info)
> +		return 1;
> +
>   	switch (severity) {
>   	case AER_NONFATAL:
>   		return __ratelimit(&dev->aer_info->nonfatal_ratelimit);
> 
> ---
It seems you are using the arm64 platform's default config item
arch/arm64/configs/defconfig:CONFIG_ACPI_APEI_PCIEAER=y
So the issue wouldn't be triggered on x86_64 with the default config.


Thanks,
Ethan

> base-commit: 89748acdf226fd1a8775ff6fa2703f8412b286c8
> change-id: 20250801-aer_crash_2-b21cc2ef0d00
> 
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
> 
> 



* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
  2025-08-05 14:25 ` Ethan Zhao
@ 2025-08-05 15:18   ` Breno Leitao
  2025-08-06  1:36     ` Ethan Zhao
  0 siblings, 1 reply; 11+ messages in thread
From: Breno Leitao @ 2025-08-05 15:18 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
	Kuppuswamy Sathyanarayanan, Jon Pan-Doh, linuxppc-dev, linux-pci,
	linux-kernel, kernel-team

On Tue, Aug 05, 2025 at 10:25:11PM +0800, Ethan Zhao wrote:
> 
> Seems you are using arm64 platform default config item
> arch/arm64/configs/defconfig:CONFIG_ACPI_APEI_PCIEAER=y
> So the issue wouldn't be triggered on X86_64 with default config.

Not really, I am running on x86 hosts. Here is the AER part of my
.config:

	# cat .config | grep AER
	CONFIG_ACPI_APEI_PCIEAER=y
	CONFIG_PCIEAER=y
	# CONFIG_PCIEAER_INJECT is not set
	CONFIG_PCIEAER_CXL=y


* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
  2025-08-05 15:18   ` Breno Leitao
@ 2025-08-06  1:36     ` Ethan Zhao
  0 siblings, 0 replies; 11+ messages in thread
From: Ethan Zhao @ 2025-08-06  1:36 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
	Kuppuswamy Sathyanarayanan, Jon Pan-Doh, linuxppc-dev, linux-pci,
	linux-kernel, kernel-team



On 8/5/2025 11:18 PM, Breno Leitao wrote:
> On Tue, Aug 05, 2025 at 10:25:11PM +0800, Ethan Zhao wrote:
>>
>> Seems you are using arm64 platform default config item
>> arch/arm64/configs/defconfig:CONFIG_ACPI_APEI_PCIEAER=y
>> So the issue wouldn't be triggered on X86_64 with default config.
> 
> Not really, I am running on x86 hosts. Here is the AER part of my
> .config:
> 
> 	# cat .config | grep AER
> 	CONFIG_ACPI_APEI_PCIEAER=y
> 	CONFIG_PCIEAER=y
> 	# CONFIG_PCIEAER_INJECT is not set
> 	CONFIG_PCIEAER_CXL=y
Okay. If so, I would suggest checking and validating the
struct aer_capability_regs *aer_regs before or in the enqueue function,
aer_recover_queue().

e.g.

static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
{
	...
	memcpy(aer_info, pcie_err->aer_info, sizeof(struct aer_capability_regs));

	/* validate the aer_info here */

	aer_recover_queue(pcie_err->device_id.segment, ...);
}

or

void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
		       int severity, struct aer_capability_regs *aer_regs)
{
	/* check and validate aer_regs first here */

}

Wouldn't that be better than checking on the dequeue side, in
aer_recover_work_func()?
BTW, the cause seems to be a buggy BIOS.


Thanks,
Ethan



* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
  2025-08-04  9:17 [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer() Breno Leitao
  2025-08-04 13:50 ` Sathyanarayanan Kuppuswamy
  2025-08-05 14:25 ` Ethan Zhao
@ 2025-08-06  1:55 ` Ethan Zhao
  2025-08-06  8:45   ` Breno Leitao
  2 siblings, 1 reply; 11+ messages in thread
From: Ethan Zhao @ 2025-08-06  1:55 UTC (permalink / raw)
  To: Breno Leitao, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, Kuppuswamy Sathyanarayanan, Jon Pan-Doh
  Cc: linuxppc-dev, linux-pci, linux-kernel, kernel-team



On 8/4/2025 5:17 PM, Breno Leitao wrote:
> Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
> when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
> calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
> does not rate limit, given this is fatal.
> 
> This prevents a kernel crash triggered by dereferencing a NULL pointer
> in aer_ratelimit(), ensuring safer handling of PCI devices that lack
> AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
> which already performs this NULL check.
> 
The enqueue side takes a lock to protect the ring, but the dequeue side
holds no lock.

The kfifo_get() in

static void aer_recover_work_func(struct work_struct *work)
{
	...
	while (kfifo_get(&aer_recover_ring, &entry)) {
		...
	}
}

should be replaced by kfifo_out_spinlocked(), as in

static void aer_recover_work_func(struct work_struct *work)
{
	...
	while (kfifo_out_spinlocked(&aer_recover_ring, &entry, 1,
				    &aer_recover_ring_lock)) {
		...
	}
}


Thanks,
Ethan

> Signed-off-by: Breno Leitao <leitao@debian.org>
> Fixes: a57f2bfb4a5863 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
> ---
>   drivers/pci/pcie/aer.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 70ac661883672..b5f96fde4dcda 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -786,6 +786,9 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>   
>   static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
>   {
> +	if (!dev->aer_info)
> +		return 1;
> +
>   	switch (severity) {
>   	case AER_NONFATAL:
>   		return __ratelimit(&dev->aer_info->nonfatal_ratelimit);
> 
> ---
> base-commit: 89748acdf226fd1a8775ff6fa2703f8412b286c8
> change-id: 20250801-aer_crash_2-b21cc2ef0d00
> 
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
> 
> 



* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
  2025-08-06  1:55 ` Ethan Zhao
@ 2025-08-06  8:45   ` Breno Leitao
  2025-08-07  0:46     ` Ethan Zhao
  0 siblings, 1 reply; 11+ messages in thread
From: Breno Leitao @ 2025-08-06  8:45 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
	Kuppuswamy Sathyanarayanan, Jon Pan-Doh, linuxppc-dev, linux-pci,
	linux-kernel, kernel-team

Hello Ethan,

On Wed, Aug 06, 2025 at 09:55:05AM +0800, Ethan Zhao wrote:
> On 8/4/2025 5:17 PM, Breno Leitao wrote:
> > Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
> > when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
> > calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
> > does not rate limit, given this is fatal.
> > 
> > This prevents a kernel crash triggered by dereferencing a NULL pointer
> > in aer_ratelimit(), ensuring safer handling of PCI devices that lack
> > AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
> > which already performs this NULL check.
> > 
> The enqueue side has lock to protect the ring, but the dequeue side no lock
> held.
> 
> The kfifo_get in
> static void aer_recover_work_func(struct work_struct *work)
> {
> ...
> while (kfifo_get(&aer_recover_ring, &entry)) {
> ...
> }
> should be replaced by
> kfifo_out_spinlocked()

The design seems not to need the lock on the reader side. There is just
one reader, which is the aer_recover_work. aer_recover_work runs
aer_recover_work_func(). So, if we just have one reader, we do not need
to protect the kfifo by spinlock, right?

In fact, the code documents it in the aer_recover_ring_lock.

	/*
	 * Mutual exclusion for writers of aer_recover_ring, reader side don't
	 * need lock, because there is only one reader and lock is not needed
	 * between reader and writer.
	 */
	static DEFINE_SPINLOCK(aer_recover_ring_lock);
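
The design that comment describes (writers serialized by a lock, one reader
taking no lock) can be sketched in userspace C. This is a minimal model under
stated assumptions, not the kfifo implementation: all toy_* names are
hypothetical, C11 atomics stand in for the kernel's memory barriers, and an
atomic_flag stands in for the spinlock. It only illustrates index handling on
the ring; it says nothing about the lifetime of the pci_dev that a dequeued
entry refers to.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_RING_SIZE 4u	/* must be a power of two */

struct toy_ring {
	int slots[TOY_RING_SIZE];
	atomic_uint in;			/* advanced only by writers, under writer_lock */
	atomic_uint out;		/* advanced only by the single reader */
	atomic_flag writer_lock;	/* stand-in for the writers' spinlock */
};

static void toy_ring_init(struct toy_ring *r)
{
	atomic_init(&r->in, 0);
	atomic_init(&r->out, 0);
	atomic_flag_clear(&r->writer_lock);
}

/* Writer side: several writers may call this, so they exclude each other. */
static bool toy_ring_put(struct toy_ring *r, int value)
{
	bool ok = false;
	unsigned int in, out;

	while (atomic_flag_test_and_set_explicit(&r->writer_lock,
						 memory_order_acquire))
		;	/* spin until the other writer is done */

	in = atomic_load_explicit(&r->in, memory_order_relaxed);
	out = atomic_load_explicit(&r->out, memory_order_acquire);
	if (in - out < TOY_RING_SIZE) {
		r->slots[in & (TOY_RING_SIZE - 1)] = value;
		/* Publish the slot contents before the new index. */
		atomic_store_explicit(&r->in, in + 1, memory_order_release);
		ok = true;
	}

	atomic_flag_clear_explicit(&r->writer_lock, memory_order_release);
	return ok;
}

/* Reader side: exactly one reader, so no lock is taken here. */
static bool toy_ring_get(struct toy_ring *r, int *value)
{
	unsigned int out = atomic_load_explicit(&r->out, memory_order_relaxed);
	unsigned int in = atomic_load_explicit(&r->in, memory_order_acquire);

	if (in == out)
		return false;	/* ring is empty */

	*value = r->slots[out & (TOY_RING_SIZE - 1)];
	/* Release the slot back to the writers only after reading it. */
	atomic_store_explicit(&r->out, out + 1, memory_order_release);
	return true;
}
```

In this model the reader and the writers never modify the same index, which
is why only the writers need mutual exclusion among themselves.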


* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
  2025-08-06  8:45   ` Breno Leitao
@ 2025-08-07  0:46     ` Ethan Zhao
  0 siblings, 0 replies; 11+ messages in thread
From: Ethan Zhao @ 2025-08-07  0:46 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
	Kuppuswamy Sathyanarayanan, Jon Pan-Doh, linuxppc-dev, linux-pci,
	linux-kernel, kernel-team



On 8/6/2025 4:45 PM, Breno Leitao wrote:
> Hello Ethan,
> 
> On Wed, Aug 06, 2025 at 09:55:05AM +0800, Ethan Zhao wrote:
>> On 8/4/2025 5:17 PM, Breno Leitao wrote:
>>> Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
>>> when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
>>> calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
>>> does not rate limit, given this is fatal.
>>>
>>> This prevents a kernel crash triggered by dereferencing a NULL pointer
>>> in aer_ratelimit(), ensuring safer handling of PCI devices that lack
>>> AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
>>> which already performs this NULL check.
>>>
>> The enqueue side has lock to protect the ring, but the dequeue side no lock
>> held.
>>
>> The kfifo_get in
>> static void aer_recover_work_func(struct work_struct *work)
>> {
>> ...
>> while (kfifo_get(&aer_recover_ring, &entry)) {
>> ...
>> }
>> should be replaced by
>> kfifo_out_spinlocked()
> 
> The design seems not to need the lock on the reader side. There is just
> one reader, which is the aer_recover_work. aer_recover_work runs
> aer_recover_work_func(). So, if we just have one reader, we do not need
> to protect the kfifo by spinlock, right?

Not exactly.
If the writer and reader are serialized, no lock is needed. However,
here the writer, kfifo_in_spinlocked(), and the system workqueue task,
aer_recover_work, are not guaranteed to execute serially.

@Bjorn, could you help check this?


Thanks,
Ethan

>
> In fact, the code documents it in the aer_recover_ring_lock.
> 
> 	/*
> 	* Mutual exclusion for writers of aer_recover_ring, reader side don't
> 	* need lock, because there is only one reader and lock is not needed
> 	* between reader and writer.
> 	*/
> 	static DEFINE_SPINLOCK(aer_recover_ring_lock);

