Linux kernel -stable discussions
From: Bjorn Helgaas <helgaas@kernel.org>
To: Bandhan Pramanik <bandhanpramanik06.foss@gmail.com>
Cc: Jeff Johnson <jjohnson@kernel.org>,
	linux-pci@vger.kernel.org, linux-acpi@vger.kernel.org,
	ath10k@lists.infradead.org, linux-wireless@vger.kernel.org,
	stable@vger.kernel.org
Subject: Re: Instability in ALL stable and LTS distro kernels (IRQ #16 being disabled, PCIe bus errors, ath10k_pci) in Dell Inspiron 5567
Date: Sat, 5 Jul 2025 14:58:46 -0500
Message-ID: <20250705195846.GA2011829@bhelgaas>
In-Reply-To: <CAEmM+QjHnU0h3HtWH8AXP05k2dTYozu81eRxn45HVEUSRG8jLw@mail.gmail.com>

On Sat, Jul 05, 2025 at 08:30:46PM +0530, Bandhan Pramanik wrote:
> Hello,
> 
> The dmesg log (the older one) is present here:

[1]:
> https://gist.githubusercontent.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d832a16180/raw/78460e6931a055b6776afe756a95d467913d5ebd/dmesg.log
> 
> The newer dmesg log includes the first line and is not overwritten by
> the ring buffer (used pci=noaer in this case):
> https://gist.githubusercontent.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d832a16180/raw/78460e6931a055b6776afe756a95d467913d5ebd/updated-dmesg
>  (The newer one doesn't have the error recorded).
> 
> You should check out the older dmesg, the quoted line was taken from
> there verbatim, including any additional details.
> 
> Bandhan
> 
> On Sat, Jul 5, 2025 at 7:20 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Sat, Jul 05, 2025 at 01:00:23AM +0530, Bandhan Pramanik wrote:
> > > Hi everyone,
> > >
> > > Here after a week. I did my research.
> > >
> > > I talked to some folks on IRC and the glaring issue was basically this:
> > >
> > > > [ 1146.810055] pcieport 0000:00:1c.0: AER: Uncorrectable (Fatal) error message received from 0000:01:00.0

From [1]:

  [ 1146.810055] pcieport 0000:00:1c.0: AER: Uncorrectable (Fatal) error message received from 0000:01:00.0
  [ 1146.810069] ath10k_pci 0000:01:00.0: AER: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
  [ 1146.813130] ath10k_pci 0000:01:00.0: AER: can't recover (no error_detected callback)
  [ 1146.948066] pcieport 0000:00:1c.0: AER: Root Port link has been reset (0)
  [ 1146.948112] pcieport 0000:00:1c.0: AER: device recovery failed
  [ 1146.949480] ath10k_pci 0000:01:00.0: failed to wake target for read32 at 0x0003a028: -110

I don't think Linux is doing a very good job of extracting error
information here.  It looks like is_error_source() read
PCI_ERR_UNCOR_STATUS from 01:00.0 and saw an error logged, but
aer_get_device_error_info() declined to read PCI_ERR_UNCOR_STATUS
again because it assumed the link was unusable.  That left
aer_print_error() with no info to print, hence the "Inaccessible"
message.
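
As a rough illustration of that gating, here is a standalone sketch of
the condition the patch below removes.  It is not kernel code: the
enums and the function name are hypothetical stand-ins, and it only
models the uncorrectable-error branch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's PCIe port types and AER severities. */
enum port_type { TYPE_ENDPOINT, TYPE_ROOT_PORT, TYPE_RC_EC, TYPE_DOWNSTREAM };
enum severity { SEV_NONFATAL, SEV_FATAL };

/*
 * Models the "else if" condition in aer_get_device_error_info() before
 * the patch: for an uncorrectable error, PCI_ERR_UNCOR_STATUS is read
 * only for Root Ports, RC Event Collectors, and Downstream Ports, or
 * when the error is non-fatal.
 */
static bool reads_uncor_status_before_patch(enum port_type type,
					    enum severity sev)
{
	return type == TYPE_ROOT_PORT ||
	       type == TYPE_RC_EC ||
	       type == TYPE_DOWNSTREAM ||
	       sev == SEV_NONFATAL;
}
```

Under this model, an endpoint such as 01:00.0 reporting a fatal error
falls through without its status being read, which matches the
"Inaccessible" output above.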

Are you able to rebuild a kernel with the patch below?  This is based
on v6.16-rc1 and likely wouldn't apply cleanly to your v6.14 kernel.
But if you are able to build v6.16-rc1 with this patch, or adapt it to
v6.14, I'd be interested in the output.

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 70ac66188367..99acb1e1946e 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -990,6 +990,8 @@ static bool is_error_source(struct pci_dev *dev, struct aer_err_info *e_info)
 	if ((PCI_BUS_NUM(e_info->id) != 0) &&
 	    !(dev->bus->bus_flags & PCI_BUS_FLAGS_NO_AERSID)) {
 		/* Device ID match? */
+		pci_info(dev, "%s: bus_flags %#x e_info->id %#04x\n",
+			 __func__, dev->bus->bus_flags, e_info->id);
 		if (e_info->id == pci_dev_id(dev))
 			return true;
 
@@ -1025,6 +1027,10 @@ static bool is_error_source(struct pci_dev *dev, struct aer_err_info *e_info)
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, &status);
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
 	}
+	pci_info(dev, "%s: %s STATUS %#010x MASK %#010x\n",
+		 __func__,
+		 e_info->severity == AER_CORRECTABLE ? "COR" : "UNCOR",
+		 status, mask);
 	if (status & ~mask)
 		return true;
 
@@ -1368,6 +1374,8 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
 	aer = dev->aer_cap;
 	type = pci_pcie_type(dev);
 
+	pci_info(dev, "%s: type %#x cap %#04x\n", __func__, type, aer);
+
 	/* Must reset in this function */
 	info->status = 0;
 	info->tlp_header_valid = 0;
@@ -1383,16 +1391,14 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
 			&info->mask);
 		if (!(info->status & ~info->mask))
 			return 0;
-	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
-		   type == PCI_EXP_TYPE_RC_EC ||
-		   type == PCI_EXP_TYPE_DOWNSTREAM ||
-		   info->severity == AER_NONFATAL) {
-
+	} else {
 		/* Link is still healthy for IO reads */
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
 			&info->status);
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK,
 			&info->mask);
+		pci_info(dev, "%s: UNCOR STATUS %#010x MASK %#010x\n",
+			 __func__, info->status, info->mask);
 		if (!(info->status & ~info->mask))
 			return 0;
 
@@ -1471,6 +1477,8 @@ static void aer_isr_one_error(struct pci_dev *root,
 {
 	u32 status = e_src->status;
 
+	pci_info(root, "%s: ROOT_STATUS %#010x ROOT_ERR_SRC %#010x\n",
+		 __func__, e_src->status, e_src->id);
 	pci_rootport_aer_stats_incr(root, e_src);
 
 	/*
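
To help read the debug output from the patch above, here is a small
userspace sketch for decoding the printed IDs.  It assumes the usual
PCIe layout: the Error Source Identification register holds the
ERR_COR source in bits 15:0 and the ERR_FATAL/NONFATAL source in bits
31:16, and each source ID is bus (15:8), device (7:3), function (2:0).
The struct and function names are made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper type: a decoded bus/device/function triple. */
struct bdf {
	unsigned int bus;
	unsigned int dev;
	unsigned int fn;
};

/* Split a 16-bit source (requester) ID into bus, device, and function. */
static struct bdf decode_source_id(uint16_t id)
{
	struct bdf b;

	b.bus = id >> 8;
	b.dev = (id >> 3) & 0x1f;
	b.fn = id & 0x7;
	return b;
}

/* Low half of ROOT_ERR_SRC: source of the first correctable error. */
static uint16_t cor_source(uint32_t root_err_src)
{
	return (uint16_t)(root_err_src & 0xffff);
}

/* High half of ROOT_ERR_SRC: source of the first fatal/non-fatal error. */
static uint16_t uncor_source(uint32_t root_err_src)
{
	return (uint16_t)(root_err_src >> 16);
}
```

For example, a source ID of 0x0100 decodes to 01:00.0 (the ath10k
device) and 0x00e0 decodes to 00:1c.0 (the Root Port).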
