* [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
From: Breno Leitao @ 2025-08-04 9:17 UTC (permalink / raw)
To: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
Kuppuswamy Sathyanarayanan, Jon Pan-Doh
Cc: linuxppc-dev, linux-pci, linux-kernel, kernel-team, Breno Leitao
Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
does not rate limit, given this is fatal.
This prevents a kernel crash triggered by dereferencing a NULL pointer
in aer_ratelimit(), ensuring safer handling of PCI devices that lack
AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
which already performs this NULL check.
Signed-off-by: Breno Leitao <leitao@debian.org>
Fixes: a57f2bfb4a5863 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
---
drivers/pci/pcie/aer.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 70ac661883672..b5f96fde4dcda 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -786,6 +786,9 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
{
+ if (!dev->aer_info)
+ return 1;
+
switch (severity) {
case AER_NONFATAL:
return __ratelimit(&dev->aer_info->nonfatal_ratelimit);
---
base-commit: 89748acdf226fd1a8775ff6fa2703f8412b286c8
change-id: 20250801-aer_crash_2-b21cc2ef0d00
Best regards,
--
Breno Leitao <leitao@debian.org>
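The effect of the guard in the patch above can be modelled outside the kernel. The sketch below is a userspace illustration only, not the code in drivers/pci/pcie/aer.c: the model_* types and model_ratelimit_ok() are hypothetical stand-ins (a simple countdown budget takes the place of the kernel's __ratelimit()), and the numeric severity values are assumed for illustration. Only the control flow of aer_ratelimit() with the NULL check applied is mirrored.

```c
#include <stddef.h>

/* Severity values assumed for illustration. */
enum { AER_NONFATAL = 0, AER_FATAL = 1, AER_CORRECTABLE = 2 };

/* Hypothetical stand-in for a ratelimit state: messages still allowed. */
struct model_ratelimit {
	int budget;
};

/* Hypothetical stand-in for the per-device AER bookkeeping. */
struct model_aer_info {
	struct model_ratelimit correctable;
	struct model_ratelimit nonfatal;
};

/* Hypothetical stand-in for struct pci_dev; aer_info may be NULL. */
struct model_pci_dev {
	struct model_aer_info *aer_info;
};

/* Returns 1 while budget remains (message may be printed), else 0. */
int model_ratelimit_ok(struct model_ratelimit *rs)
{
	if (rs->budget <= 0)
		return 0;
	rs->budget--;
	return 1;
}

/* Mirrors the patched control flow: 1 means "print", 0 means "suppress". */
int model_aer_ratelimit(struct model_pci_dev *dev, unsigned int severity)
{
	/* The added guard: without aer_info there is nothing to dereference. */
	if (!dev->aer_info)
		return 1;

	switch (severity) {
	case AER_NONFATAL:
		return model_ratelimit_ok(&dev->aer_info->nonfatal);
	case AER_CORRECTABLE:
		return model_ratelimit_ok(&dev->aer_info->correctable);
	default:
		return 1;	/* fatal errors are never rate limited */
	}
}
```

In this model, a device without aer_info is always logged, whatever the severity, instead of faulting on the dereference.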
* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
From: Sathyanarayanan Kuppuswamy @ 2025-08-04 13:50 UTC (permalink / raw)
To: Breno Leitao, Mahesh J Salgaonkar, Oliver O'Halloran,
Bjorn Helgaas, Jon Pan-Doh
Cc: linuxppc-dev, linux-pci, linux-kernel, kernel-team
On 8/4/25 2:17 AM, Breno Leitao wrote:
> Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
> when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
> calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
> does not rate limit, given this is fatal.
Why not add it to pci_print_aer() ?
>
> This prevents a kernel crash triggered by dereferencing a NULL pointer
> in aer_ratelimit(), ensuring safer handling of PCI devices that lack
> AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
> which already performs this NULL check.
Is this happening during the kernel boot ? What is the frequency and steps
to reproduce? I am curious about why pci_print_aer() is called for a PCI device
without aer_info. Not aer_info means, that particular device is already released
or in the process of release (pci_release_dev()). Is this triggered by using a stale
pci_dev pointer?
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> Fixes: a57f2bfb4a5863 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
> ---
> drivers/pci/pcie/aer.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 70ac661883672..b5f96fde4dcda 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -786,6 +786,9 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>
> static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
> {
> + if (!dev->aer_info)
> + return 1;
> +
> switch (severity) {
> case AER_NONFATAL:
> return __ratelimit(&dev->aer_info->nonfatal_ratelimit);
>
> ---
> base-commit: 89748acdf226fd1a8775ff6fa2703f8412b286c8
> change-id: 20250801-aer_crash_2-b21cc2ef0d00
>
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
>
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
From: Breno Leitao @ 2025-08-04 15:35 UTC (permalink / raw)
To: Sathyanarayanan Kuppuswamy
Cc: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
Jon Pan-Doh, linuxppc-dev, linux-pci, linux-kernel, kernel-team
Hello Sathyanarayanan,
On Mon, Aug 04, 2025 at 06:50:30AM -0700, Sathyanarayanan Kuppuswamy wrote:
>
> On 8/4/25 2:17 AM, Breno Leitao wrote:
> > Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
> > when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
> > calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
> > does not rate limit, given this is fatal.
>
> Why not add it to pci_print_aer() ?
>
> >
> > This prevents a kernel crash triggered by dereferencing a NULL pointer
> > in aer_ratelimit(), ensuring safer handling of PCI devices that lack
> > AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
> > which already performs this NULL check.
>
> Is this happening during the kernel boot ? What is the frequency and steps
> to reproduce? I am curious about why pci_print_aer() is called for a PCI device
> without aer_info. Not aer_info means, that particular device is already released
> or in the process of release (pci_release_dev()). Is this triggered by using a stale
> pci_dev pointer?
I've reported some of these investigations in here:
https://lore.kernel.org/all/buduna6darbvwfg3aogl5kimyxkggu3n4romnmq6sozut6axeu@clnx7sfsy457/
* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
From: Sathyanarayanan Kuppuswamy @ 2025-08-04 16:11 UTC (permalink / raw)
To: Breno Leitao
Cc: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
Jon Pan-Doh, linuxppc-dev, linux-pci, linux-kernel, kernel-team
On 8/4/25 8:35 AM, Breno Leitao wrote:
> Hello Sathyanarayanan,
>
> On Mon, Aug 04, 2025 at 06:50:30AM -0700, Sathyanarayanan Kuppuswamy wrote:
>> On 8/4/25 2:17 AM, Breno Leitao wrote:
>>> Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
>>> when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
>>> calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
>>> does not rate limit, given this is fatal.
>> Why not add it to pci_print_aer() ?
>>
>>> This prevents a kernel crash triggered by dereferencing a NULL pointer
>>> in aer_ratelimit(), ensuring safer handling of PCI devices that lack
>>> AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
>>> which already performs this NULL check.
>> Is this happening during the kernel boot ? What is the frequency and steps
>> to reproduce? I am curious about why pci_print_aer() is called for a PCI device
>> without aer_info. Not aer_info means, that particular device is already released
>> or in the process of release (pci_release_dev()). Is this triggered by using a stale
>> pci_dev pointer?
> I've reported some of these investigations in here:
>
> https://lore.kernel.org/all/buduna6darbvwfg3aogl5kimyxkggu3n4romnmq6sozut6axeu@clnx7sfsy457/
It has some details. But you did not mention details like your environment, steps to
reproduce and how often you see it. I just want to understand in what scenario
pci_print_aer() is triggered, when releasing the device. I am wondering whether we
are missing proper locking some where.
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
From: Breno Leitao @ 2025-08-04 16:47 UTC (permalink / raw)
To: Sathyanarayanan Kuppuswamy
Cc: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
Jon Pan-Doh, linuxppc-dev, linux-pci, linux-kernel, kernel-team
Hello Sathyanarayanan
On Mon, Aug 04, 2025 at 09:11:27AM -0700, Sathyanarayanan Kuppuswamy wrote:
>
> On 8/4/25 8:35 AM, Breno Leitao wrote:
> > Hello Sathyanarayanan,
> >
> > On Mon, Aug 04, 2025 at 06:50:30AM -0700, Sathyanarayanan Kuppuswamy wrote:
> > > On 8/4/25 2:17 AM, Breno Leitao wrote:
> > > > Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
> > > > when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
> > > > calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
> > > > does not rate limit, given this is fatal.
> > > Why not add it to pci_print_aer() ?
> > >
> > > > This prevents a kernel crash triggered by dereferencing a NULL pointer
> > > > in aer_ratelimit(), ensuring safer handling of PCI devices that lack
> > > > AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
> > > > which already performs this NULL check.
> > > Is this happening during the kernel boot ? What is the frequency and steps
> > > to reproduce? I am curious about why pci_print_aer() is called for a PCI device
> > > without aer_info. Not aer_info means, that particular device is already released
> > > or in the process of release (pci_release_dev()). Is this triggered by using a stale
> > > pci_dev pointer?
> > I've reported some of these investigations in here:
> >
> > https://lore.kernel.org/all/buduna6darbvwfg3aogl5kimyxkggu3n4romnmq6sozut6axeu@clnx7sfsy457/
>
> It has some details. But you did not mention details like your environment, steps to
> reproduce and how often you see it. I just want to understand in what scenario
> pci_print_aer() is triggered, when releasing the device. I am wondering whether we
> are missing proper locking some where.
Oh, unfortunately I don't have these details.
I have a bunch of machines in "prod" running 6.16, and they crash from
time to time, and then I have the crashdumps.
I can get anything that a crashdump provides, but I don't have
a reproducer or the exact steps that are triggering it.
If I can get this information from a crashdump, I am more than happy to
investigate. Can we get this information from a crashdump?
Thanks,
--breno
* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
From: Ethan Zhao @ 2025-08-05 14:25 UTC (permalink / raw)
To: Breno Leitao, Mahesh J Salgaonkar, Oliver O'Halloran,
Bjorn Helgaas, Kuppuswamy Sathyanarayanan, Jon Pan-Doh
Cc: linuxppc-dev, linux-pci, linux-kernel, kernel-team
On 8/4/2025 5:17 PM, Breno Leitao wrote:
> Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
> when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
> calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
> does not rate limit, given this is fatal.
>
> This prevents a kernel crash triggered by dereferencing a NULL pointer
> in aer_ratelimit(), ensuring safer handling of PCI devices that lack
> AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
> which already performs this NULL check.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> Fixes: a57f2bfb4a5863 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
> ---
> drivers/pci/pcie/aer.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 70ac661883672..b5f96fde4dcda 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -786,6 +786,9 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>
> static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
> {
> + if (!dev->aer_info)
> + return 1;
> +
> switch (severity) {
> case AER_NONFATAL:
> return __ratelimit(&dev->aer_info->nonfatal_ratelimit);
>
> ---
Seems you are using arm64 platform default config item
arch/arm64/configs/defconfig:CONFIG_ACPI_APEI_PCIEAER=y
So the issue wouldn't be triggered on X86_64 with default config.
Thanks,
Ethan
> base-commit: 89748acdf226fd1a8775ff6fa2703f8412b286c8
> change-id: 20250801-aer_crash_2-b21cc2ef0d00
>
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
>
>
* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
From: Breno Leitao @ 2025-08-05 15:18 UTC (permalink / raw)
To: Ethan Zhao
Cc: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
Kuppuswamy Sathyanarayanan, Jon Pan-Doh, linuxppc-dev, linux-pci,
linux-kernel, kernel-team
On Tue, Aug 05, 2025 at 10:25:11PM +0800, Ethan Zhao wrote:
>
> Seems you are using arm64 platform default config item
> arch/arm64/configs/defconfig:CONFIG_ACPI_APEI_PCIEAER=y
> So the issue wouldn't be triggered on X86_64 with default config.
Not really, I am running on x86 hosts. Here is the AER part of my
.config:
# cat .config | grep AER
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_PCIEAER=y
# CONFIG_PCIEAER_INJECT is not set
CONFIG_PCIEAER_CXL=y
* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
From: Ethan Zhao @ 2025-08-06 1:36 UTC (permalink / raw)
To: Breno Leitao
Cc: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
Kuppuswamy Sathyanarayanan, Jon Pan-Doh, linuxppc-dev, linux-pci,
linux-kernel, kernel-team
On 8/5/2025 11:18 PM, Breno Leitao wrote:
> On Tue, Aug 05, 2025 at 10:25:11PM +0800, Ethan Zhao wrote:
>>
>> Seems you are using arm64 platform default config item
>> arch/arm64/configs/defconfig:CONFIG_ACPI_APEI_PCIEAER=y
>> So the issue wouldn't be triggered on X86_64 with default config.
>
> Not really, I am running on x86 hosts. There are the AER part of my
> .config.
>
> # cat .config | grep AER
> CONFIG_ACPI_APEI_PCIEAER=y
> CONFIG_PCIEAER=y
> # CONFIG_PCIEAER_INJECT is not set
> CONFIG_PCIEAER_CXL=y
Okay. If so, I would suggest checking and validating the
struct aer_capability_regs *aer_regs before or in the enqueue function
aer_recover_queue().
e.g.
static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
{
...
memcpy(aer_info, pcie_err->aer_info, sizeof(struct aer_capability_regs));
//validate the aer_info here
aer_recover_queue(pcie_err->device_id.segment
}
or
void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
int severity, struct aer_capability_regs *aer_regs)
{
//check and validate aer_regs first here
}
Wouldn't that be better than checking on the dequeue side, in aer_recover_work_func()?
BTW, the cause seems to be a buggy BIOS.
Thanks,
Ethan
* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
From: Ethan Zhao @ 2025-08-06 1:55 UTC (permalink / raw)
To: Breno Leitao, Mahesh J Salgaonkar, Oliver O'Halloran,
Bjorn Helgaas, Kuppuswamy Sathyanarayanan, Jon Pan-Doh
Cc: linuxppc-dev, linux-pci, linux-kernel, kernel-team
On 8/4/2025 5:17 PM, Breno Leitao wrote:
> Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
> when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
> calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
> does not rate limit, given this is fatal.
>
> This prevents a kernel crash triggered by dereferencing a NULL pointer
> in aer_ratelimit(), ensuring safer handling of PCI devices that lack
> AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
> which already performs this NULL check.
>
The enqueue side has lock to protect the ring, but the dequeue side no
lock held.
The kfifo_get in
static void aer_recover_work_func(struct work_struct *work)
{
...
while (kfifo_get(&aer_recover_ring, &entry)) {
...
}
should be replaced by
kfifo_out_spinlocked()
as
static void aer_recover_work_func(struct work_struct *work)
{
...
while (kfifo_out_spinlocked(&aer_recover_ring,
&entry, 1, &aer_recover_ring_lock)) {
...
}
Thanks,
Ethan
> Signed-off-by: Breno Leitao <leitao@debian.org>
> Fixes: a57f2bfb4a5863 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
> ---
> drivers/pci/pcie/aer.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 70ac661883672..b5f96fde4dcda 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -786,6 +786,9 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>
> static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
> {
> + if (!dev->aer_info)
> + return 1;
> +
> switch (severity) {
> case AER_NONFATAL:
> return __ratelimit(&dev->aer_info->nonfatal_ratelimit);
>
> ---
> base-commit: 89748acdf226fd1a8775ff6fa2703f8412b286c8
> change-id: 20250801-aer_crash_2-b21cc2ef0d00
>
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
>
>
* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
From: Breno Leitao @ 2025-08-06 8:45 UTC (permalink / raw)
To: Ethan Zhao
Cc: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
Kuppuswamy Sathyanarayanan, Jon Pan-Doh, linuxppc-dev, linux-pci,
linux-kernel, kernel-team
Hello Ethan,
On Wed, Aug 06, 2025 at 09:55:05AM +0800, Ethan Zhao wrote:
> On 8/4/2025 5:17 PM, Breno Leitao wrote:
> > Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
> > when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
> > calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
> > does not rate limit, given this is fatal.
> >
> > This prevents a kernel crash triggered by dereferencing a NULL pointer
> > in aer_ratelimit(), ensuring safer handling of PCI devices that lack
> > AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
> > which already performs this NULL check.
> >
> The enqueue side has lock to protect the ring, but the dequeue side no lock
> held.
>
> The kfifo_get in
> static void aer_recover_work_func(struct work_struct *work)
> {
> ...
> while (kfifo_get(&aer_recover_ring, &entry)) {
> ...
> }
> should be replaced by
> kfifo_out_spinlocked()
The design seems not to need the lock on the reader side. There is just
one reader, which is the aer_recover_work. aer_recover_work runs
aer_recover_work_func(). So, if we just have one reader, we do not need
to protect the kfifo by spinlock, right?
In fact, the code documents it in the aer_recover_ring_lock.
/*
* Mutual exclusion for writers of aer_recover_ring, reader side don't
* need lock, because there is only one reader and lock is not needed
* between reader and writer.
*/
static DEFINE_SPINLOCK(aer_recover_ring_lock);
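The single-reader argument can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kfifo implementation: struct spsc_ring, ring_put() and ring_get() are hypothetical names, and C11 atomics stand in for the kernel's memory barriers. The point it shows is that the writer only ever advances `in` and the reader only ever advances `out`, so one writer and one reader never modify the same index. With several writers, calls to ring_put() would still need to be serialized against each other, which is the role the comment above assigns to aer_recover_ring_lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 8u	/* must be a power of two */

/* Hypothetical single-producer/single-consumer ring, for illustration. */
struct spsc_ring {
	int slots[RING_SIZE];
	atomic_uint in;		/* advanced only by the writer */
	atomic_uint out;	/* advanced only by the reader */
};

/* Writer side. Multiple writers must serialize calls with a lock. */
bool ring_put(struct spsc_ring *r, int value)
{
	unsigned int in = atomic_load_explicit(&r->in, memory_order_relaxed);
	unsigned int out = atomic_load_explicit(&r->out, memory_order_acquire);

	if (in - out == RING_SIZE)
		return false;	/* full */

	r->slots[in & (RING_SIZE - 1)] = value;
	/* Publish the slot contents before making the new index visible. */
	atomic_store_explicit(&r->in, in + 1, memory_order_release);
	return true;
}

/* Reader side. Safe without a lock as long as there is one reader. */
bool ring_get(struct spsc_ring *r, int *value)
{
	unsigned int out = atomic_load_explicit(&r->out, memory_order_relaxed);
	unsigned int in = atomic_load_explicit(&r->in, memory_order_acquire);

	if (in == out)
		return false;	/* empty */

	*value = r->slots[out & (RING_SIZE - 1)];
	/* Release the slot only after its contents have been read. */
	atomic_store_explicit(&r->out, out + 1, memory_order_release);
	return true;
}
```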
* Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in pci_print_aer()
From: Ethan Zhao @ 2025-08-07 0:46 UTC (permalink / raw)
To: Breno Leitao
Cc: Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
Kuppuswamy Sathyanarayanan, Jon Pan-Doh, linuxppc-dev, linux-pci,
linux-kernel, kernel-team
On 8/6/2025 4:45 PM, Breno Leitao wrote:
> Hello Ethan,
>
> On Wed, Aug 06, 2025 at 09:55:05AM +0800, Ethan Zhao wrote:
>> On 8/4/2025 5:17 PM, Breno Leitao wrote:
>>> Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
>>> when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
>>> calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
>>> does not rate limit, given this is fatal.
>>>
>>> This prevents a kernel crash triggered by dereferencing a NULL pointer
>>> in aer_ratelimit(), ensuring safer handling of PCI devices that lack
>>> AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
>>> which already performs this NULL check.
>>>
>> The enqueue side has lock to protect the ring, but the dequeue side no lock
>> held.
>>
>> The kfifo_get in
>> static void aer_recover_work_func(struct work_struct *work)
>> {
>> ...
>> while (kfifo_get(&aer_recover_ring, &entry)) {
>> ...
>> }
>> should be replaced by
>> kfifo_out_spinlocked()
>
> The design seems not to need the lock on the reader side. There is just
> one reader, which is the aer_recover_work. aer_recover_work runs
> aer_recover_work_func(). So, if we just have one reader, we do not need
> to protect the kfifo by spinlock, right?
Not exactly,
If the writer and reader are serialized, no lock is needed. However,
here the writer kfifo_in_spinlocked() and the system work queue task
aer_recover_work() cannot guarantee serialized execution.
@Bjorn, help to check it out.
Thanks,
Ethan
> In fact, the code documents it in the aer_recover_ring_lock.
>
> /*
> * Mutual exclusion for writers of aer_recover_ring, reader side don't
> * need lock, because there is only one reader and lock is not needed
> * between reader and writer.
> */
> static DEFINE_SPINLOCK(aer_recover_ring_lock);