* [PATCHv2 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO @ 2023-03-17 17:51 Grant Grundler 2023-03-17 17:51 ` [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors Grant Grundler 2023-03-17 18:50 ` [PATCHv2 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO Sathyanarayanan Kuppuswamy 0 siblings, 2 replies; 15+ messages in thread From: Grant Grundler @ 2023-03-17 17:51 UTC (permalink / raw) To: Mahesh J Salgaonkar, Oliver O 'Halloran, Bjorn Helgaas Cc: Rajat Khandelwal, linuxppc-dev, linux-pci, linux-kernel, Rajat Jain, Grant Grundler Since correctable errors have been corrected (and counted), the dmesg output should not be reported as a warning, but rather as "informational". Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking station, the dmesg buffer can be spammed with correctable errors, 717 bytes per instance, potentially many MB per day. Given the "WARN" priority, these messages have already confused the typical user that stumbles across them, support staff (triaging feedback reports), and more than a few linux kernel devs. Changing to INFO will hide these messages from most audiences. Signed-off-by: Grant Grundler <grundler@chromium.org> --- drivers/pci/pcie/aer.c | 29 +++++++++++++++++++---------- 1 file changed, 19 insertions(+), 10 deletions(-) diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index f6c24ded134c..cb6b96233967 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -687,23 +687,29 @@ static void __aer_print_error(struct pci_dev *dev, { const char **strings; unsigned long status = info->status & ~info->mask; - const char *level, *errmsg; int i; if (info->severity == AER_CORRECTABLE) { strings = aer_correctable_error_string; - level = KERN_WARNING; + pci_info(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", + info->status, info->mask); } else { strings = aer_uncorrectable_error_string; - level = KERN_ERR; + pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", + info->status, info->mask); } for_each_set_bit(i, &status, 32) { - errmsg = strings[i]; + const char *errmsg = strings[i]; + if (!errmsg) errmsg = "Unknown Error Bit"; - pci_printk(level, dev, " [%2d] %-22s%s\n", i, errmsg, + if (info->severity == AER_CORRECTABLE) + pci_info(dev, " [%2d] %-22s%s\n", i, errmsg, + info->first_error == i ? " (First)" : ""); + else + pci_err(dev, " [%2d] %-22s%s\n", i, errmsg, info->first_error == i ? " (First)" : ""); } pci_dev_aer_stats_incr(dev, info); @@ -724,7 +730,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) layer = AER_GET_LAYER_ERROR(info->severity, info->status); agent = AER_GET_AGENT(info->severity, info->status); - level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR; + level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR; pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", aer_error_severity_string[info->severity], @@ -797,14 +803,17 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity, info.mask = mask; info.first_error = PCI_ERR_CAP_FEP(aer->cap_control); - pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask); __aer_print_error(dev, &info); - pci_err(dev, "aer_layer=%s, aer_agent=%s\n", - aer_error_layer[layer], aer_agent_string[agent]); - if (aer_severity != AER_CORRECTABLE) + if (aer_severity == AER_CORRECTABLE) { + pci_info(dev, "aer_layer=%s, aer_agent=%s\n", + aer_error_layer[layer], aer_agent_string[agent]); + } else { + pci_err(dev, "aer_layer=%s, aer_agent=%s\n", + aer_error_layer[layer], aer_agent_string[agent]); pci_err(dev, "aer_uncor_severity: 0x%08x\n", aer->uncor_severity); + } if (tlp_header_valid) __print_tlp_header(dev, &aer->header_log); -- 2.40.0.rc1.284.g88254d51c5-goog ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors 2023-03-17 17:51 [PATCHv2 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO Grant Grundler @ 2023-03-17 17:51 ` Grant Grundler 2023-04-06 19:50 ` Bjorn Helgaas 2023-03-17 18:50 ` [PATCHv2 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO Sathyanarayanan Kuppuswamy 1 sibling, 1 reply; 15+ messages in thread From: Grant Grundler @ 2023-03-17 17:51 UTC (permalink / raw) To: Mahesh J Salgaonkar, Oliver O 'Halloran, Bjorn Helgaas Cc: Rajat Khandelwal, linuxppc-dev, linux-pci, linux-kernel, Rajat Jain, Grant Grundler From: Rajat Khandelwal <rajat.khandelwal@linux.intel.com> There are many instances where correctable errors tend to inundate the message buffer. We observe such instances during thunderbolt PCIe tunneling. It's true that they are mitigated by the hardware and are non-fatal but we shouldn't be spamming the logs with such correctable errors as it confuses other kernel developers less familiar with PCI errors, support staff, and users who happen to look at the logs, hence rate limit them. A typical example log inside an HP TBT4 dock: [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0 [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) [54912.661203] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001100/00002000 [54912.661211] igc 0000:2b:00.0: [ 8] Rollover [54912.661219] igc 0000:2b:00.0: [12] Timeout [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0 [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) [54982.838808] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001000/00002000 [54982.838817] igc 0000:2b:00.0: [12] Timeout This gets repeated continuously, thus inundating the buffer. Signed-off-by: Rajat Khandelwal <rajat.khandelwal@linux.intel.com> Signed-off-by: Grant Grundler <grundler@chromium.org> --- drivers/pci/pcie/aer.c | 42 ++++++++++++++++++++++++++++-------------- 1 file changed, 28 insertions(+), 14 deletions(-) diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index cb6b96233967..b592cea8bffe 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -706,8 +706,8 @@ static void __aer_print_error(struct pci_dev *dev, errmsg = "Unknown Error Bit"; if (info->severity == AER_CORRECTABLE) - pci_info(dev, " [%2d] %-22s%s\n", i, errmsg, - info->first_error == i ? " (First)" : ""); + pci_info_ratelimited(dev, " [%2d] %-22s%s\n", i, errmsg, + info->first_error == i ? " (First)" : ""); else pci_err(dev, " [%2d] %-22s%s\n", i, errmsg, info->first_error == i ? " (First)" : ""); @@ -719,7 +719,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) { int layer, agent; int id = ((dev->bus->number << 8) | dev->devfn); - const char *level; if (!info->status) { pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n", @@ -730,14 +729,21 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) layer = AER_GET_LAYER_ERROR(info->severity, info->status); agent = AER_GET_AGENT(info->severity, info->status); - level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR; + if (info->severity == AER_CORRECTABLE) { + pci_info_ratelimited(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", + aer_error_severity_string[info->severity], + aer_error_layer[layer], aer_agent_string[agent]); - pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", - aer_error_severity_string[info->severity], - aer_error_layer[layer], aer_agent_string[agent]); + pci_info_ratelimited(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n", + dev->vendor, dev->device, info->status, info->mask); + } else { + pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", + aer_error_severity_string[info->severity], + aer_error_layer[layer], aer_agent_string[agent]); - pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n", - dev->vendor, dev->device, info->status, info->mask); + pci_err(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n", + dev->vendor, dev->device, info->status, info->mask); + } __aer_print_error(dev, info); @@ -757,11 +763,19 @@ static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info) u8 bus = info->id >> 8; u8 devfn = info->id & 0xff; - pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n", - info->multi_error_valid ? "Multiple " : "", - aer_error_severity_string[info->severity], - pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), - PCI_FUNC(devfn)); + if (info->severity == AER_CORRECTABLE) + pci_info_ratelimited(dev, "%s%s error received: %04x:%02x:%02x.%d\n", + info->multi_error_valid ? "Multiple " : "", + aer_error_severity_string[info->severity], + pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), + PCI_FUNC(devfn)); + else + pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n", + info->multi_error_valid ? "Multiple " : "", + aer_error_severity_string[info->severity], + pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), + PCI_FUNC(devfn)); + } #ifdef CONFIG_ACPI_APEI_PCIEAER -- 2.40.0.rc1.284.g88254d51c5-goog ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors 2023-03-17 17:51 ` [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors Grant Grundler @ 2023-04-06 19:50 ` Bjorn Helgaas 2023-04-07 18:53 ` Grant Grundler 0 siblings, 1 reply; 15+ messages in thread From: Bjorn Helgaas @ 2023-04-06 19:50 UTC (permalink / raw) To: Grant Grundler Cc: Mahesh J Salgaonkar, Oliver O 'Halloran, Bjorn Helgaas, Rajat Jain, Rajat Khandelwal, linux-pci, linux-kernel, linuxppc-dev On Fri, Mar 17, 2023 at 10:51:09AM -0700, Grant Grundler wrote: > From: Rajat Khandelwal <rajat.khandelwal@linux.intel.com> > > There are many instances where correctable errors tend to inundate > the message buffer. We observe such instances during thunderbolt PCIe > tunneling. > > It's true that they are mitigated by the hardware and are non-fatal > but we shouldn't be spamming the logs with such correctable errors as it > confuses other kernel developers less familiar with PCI errors, support > staff, and users who happen to look at the logs, hence rate limit them. > > A typical example log inside an HP TBT4 dock: > [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0 > [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) > [54912.661203] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001100/00002000 > [54912.661211] igc 0000:2b:00.0: [ 8] Rollover > [54912.661219] igc 0000:2b:00.0: [12] Timeout > [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0 > [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) > [54982.838808] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001000/00002000 > [54982.838817] igc 0000:2b:00.0: [12] Timeout The timestamps don't contribute to understanding the problem, so we can omit them. > This gets repeated continuously, thus inundating the buffer. > > Signed-off-by: Rajat Khandelwal <rajat.khandelwal@linux.intel.com> > Signed-off-by: Grant Grundler <grundler@chromium.org> > --- > drivers/pci/pcie/aer.c | 42 ++++++++++++++++++++++++++++-------------- > 1 file changed, 28 insertions(+), 14 deletions(-) > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index cb6b96233967..b592cea8bffe 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -706,8 +706,8 @@ static void __aer_print_error(struct pci_dev *dev, > errmsg = "Unknown Error Bit"; > > if (info->severity == AER_CORRECTABLE) > - pci_info(dev, " [%2d] %-22s%s\n", i, errmsg, > - info->first_error == i ? " (First)" : ""); > + pci_info_ratelimited(dev, " [%2d] %-22s%s\n", i, errmsg, > + info->first_error == i ? " (First)" : ""); I don't think this is going to reliably work the way we want. We have a bunch of pci_info_ratelimited() calls, and each caller has its own ratelimit_state data. Unless we call pci_info_ratelimited() exactly the same number of times for each error, the ratelimit counters will get out of sync and we'll end up printing fragments from error A mixed with fragments from error B. I think we need to explicitly manage the ratelimiting ourselves, similar to print_hmi_event_info() or print_extlog_rcd(). Then we can have a *single* ratelimit_state, and we can check it once to determine whether to log this correctable error. > else > pci_err(dev, " [%2d] %-22s%s\n", i, errmsg, > info->first_error == i ? " (First)" : ""); > @@ -719,7 +719,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) > { > int layer, agent; > int id = ((dev->bus->number << 8) | dev->devfn); > - const char *level; > > if (!info->status) { > pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n", > @@ -730,14 +729,21 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) > layer = AER_GET_LAYER_ERROR(info->severity, info->status); > agent = AER_GET_AGENT(info->severity, info->status); > > - level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR; > + if (info->severity == AER_CORRECTABLE) { > + pci_info_ratelimited(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", > + aer_error_severity_string[info->severity], > + aer_error_layer[layer], aer_agent_string[agent]); > > - pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", > - aer_error_severity_string[info->severity], > - aer_error_layer[layer], aer_agent_string[agent]); > + pci_info_ratelimited(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n", > + dev->vendor, dev->device, info->status, info->mask); > + } else { > + pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", > + aer_error_severity_string[info->severity], > + aer_error_layer[layer], aer_agent_string[agent]); > > - pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n", > - dev->vendor, dev->device, info->status, info->mask); > + pci_err(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n", > + dev->vendor, dev->device, info->status, info->mask); > + } > > __aer_print_error(dev, info); > > @@ -757,11 +763,19 @@ static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info) > u8 bus = info->id >> 8; > u8 devfn = info->id & 0xff; > > - pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n", > - info->multi_error_valid ? "Multiple " : "", > - aer_error_severity_string[info->severity], > - pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), > - PCI_FUNC(devfn)); > + if (info->severity == AER_CORRECTABLE) > + pci_info_ratelimited(dev, "%s%s error received: %04x:%02x:%02x.%d\n", > + info->multi_error_valid ? "Multiple " : "", > + aer_error_severity_string[info->severity], > + pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), > + PCI_FUNC(devfn)); > + else > + pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n", > + info->multi_error_valid ? "Multiple " : "", > + aer_error_severity_string[info->severity], > + pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), > + PCI_FUNC(devfn)); > + > } > > #ifdef CONFIG_ACPI_APEI_PCIEAER > -- > 2.40.0.rc1.284.g88254d51c5-goog > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors 2023-04-06 19:50 ` Bjorn Helgaas @ 2023-04-07 18:53 ` Grant Grundler 2023-04-07 19:46 ` Bjorn Helgaas 0 siblings, 1 reply; 15+ messages in thread From: Grant Grundler @ 2023-04-07 18:53 UTC (permalink / raw) To: Bjorn Helgaas Cc: Grant Grundler, Mahesh J Salgaonkar, Oliver O 'Halloran, Bjorn Helgaas, Rajat Jain, Rajat Khandelwal, linux-pci, linux-kernel, linuxppc-dev On Thu, Apr 6, 2023 at 12:50 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Mar 17, 2023 at 10:51:09AM -0700, Grant Grundler wrote: > > From: Rajat Khandelwal <rajat.khandelwal@linux.intel.com> > > > > There are many instances where correctable errors tend to inundate > > the message buffer. We observe such instances during thunderbolt PCIe > > tunneling. > > > > It's true that they are mitigated by the hardware and are non-fatal > > but we shouldn't be spamming the logs with such correctable errors as it > > confuses other kernel developers less familiar with PCI errors, support > > staff, and users who happen to look at the logs, hence rate limit them. > > > > A typical example log inside an HP TBT4 dock: > > [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0 > > [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) > > [54912.661203] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001100/00002000 > > [54912.661211] igc 0000:2b:00.0: [ 8] Rollover > > [54912.661219] igc 0000:2b:00.0: [12] Timeout > > [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0 > > [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) > > [54982.838808] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001000/00002000 > > [54982.838817] igc 0000:2b:00.0: [12] Timeout > > The timestamps don't contribute to understanding the problem, so we > can omit them. Ok. > > This gets repeated continuously, thus inundating the buffer. > > > > Signed-off-by: Rajat Khandelwal <rajat.khandelwal@linux.intel.com> > > Signed-off-by: Grant Grundler <grundler@chromium.org> > > --- > > drivers/pci/pcie/aer.c | 42 ++++++++++++++++++++++++++++-------------- > > 1 file changed, 28 insertions(+), 14 deletions(-) > > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > > index cb6b96233967..b592cea8bffe 100644 > > --- a/drivers/pci/pcie/aer.c > > +++ b/drivers/pci/pcie/aer.c > > @@ -706,8 +706,8 @@ static void __aer_print_error(struct pci_dev *dev, > > errmsg = "Unknown Error Bit"; > > > > if (info->severity == AER_CORRECTABLE) > > - pci_info(dev, " [%2d] %-22s%s\n", i, errmsg, > > - info->first_error == i ? " (First)" : ""); > > + pci_info_ratelimited(dev, " [%2d] %-22s%s\n", i, errmsg, > > + info->first_error == i ? " (First)" : ""); > > I don't think this is going to reliably work the way we want. We have > a bunch of pci_info_ratelimited() calls, and each caller has its own > ratelimit_state data. Unless we call pci_info_ratelimited() exactly > the same number of times for each error, the ratelimit counters will > get out of sync and we'll end up printing fragments from error A mixed > with fragments from error B. Ok - what I'm reading between the lines here is the output should be emitted in one step, not multiple pci_info_ratelimited() calls. if the code built an output string (using sprintnf()), and then called pci_info_ratelimited() exactly once at the bottom, would that be sufficient? > I think we need to explicitly manage the ratelimiting ourselves, > similar to print_hmi_event_info() or print_extlog_rcd(). Then we can > have a *single* ratelimit_state, and we can check it once to determine > whether to log this correctable error. Is the rate limiting per call location or per device? From above, I understood rate limiting is "per call location". If the code only has one call location, it should achieve the same goal, right? cheers, grant > > > else > > pci_err(dev, " [%2d] %-22s%s\n", i, errmsg, > > info->first_error == i ? " (First)" : ""); > > @@ -719,7 +719,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) > > { > > int layer, agent; > > int id = ((dev->bus->number << 8) | dev->devfn); > > - const char *level; > > > > if (!info->status) { > > pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n", > > @@ -730,14 +729,21 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) > > layer = AER_GET_LAYER_ERROR(info->severity, info->status); > > agent = AER_GET_AGENT(info->severity, info->status); > > > > - level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR; > > + if (info->severity == AER_CORRECTABLE) { > > + pci_info_ratelimited(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", > > + aer_error_severity_string[info->severity], > > + aer_error_layer[layer], aer_agent_string[agent]); > > > > - pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", > > - aer_error_severity_string[info->severity], > > - aer_error_layer[layer], aer_agent_string[agent]); > > + pci_info_ratelimited(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n", > > + dev->vendor, dev->device, info->status, info->mask); > > + } else { > > + pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", > > + aer_error_severity_string[info->severity], > > + aer_error_layer[layer], aer_agent_string[agent]); > > > > - pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n", > > - dev->vendor, dev->device, info->status, info->mask); > > + pci_err(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n", > > + dev->vendor, dev->device, info->status, info->mask); > > + } > > > > __aer_print_error(dev, info); > > > > @@ -757,11 +763,19 @@ static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info) > > u8 bus = info->id >> 8; > > u8 devfn = info->id & 0xff; > > > > - pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n", > > - info->multi_error_valid ? "Multiple " : "", > > - aer_error_severity_string[info->severity], > > - pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), > > - PCI_FUNC(devfn)); > > + if (info->severity == AER_CORRECTABLE) > > + pci_info_ratelimited(dev, "%s%s error received: %04x:%02x:%02x.%d\n", > > + info->multi_error_valid ? "Multiple " : "", > > + aer_error_severity_string[info->severity], > > + pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), > > + PCI_FUNC(devfn)); > > + else > > + pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n", > > + info->multi_error_valid ? "Multiple " : "", > > + aer_error_severity_string[info->severity], > > + pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), > > + PCI_FUNC(devfn)); > > + > > } > > > > #ifdef CONFIG_ACPI_APEI_PCIEAER > > -- > > 2.40.0.rc1.284.g88254d51c5-goog > > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors 2023-04-07 18:53 ` Grant Grundler @ 2023-04-07 19:46 ` Bjorn Helgaas 2023-04-07 23:47 ` Grant Grundler ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: Bjorn Helgaas @ 2023-04-07 19:46 UTC (permalink / raw) To: Grant Grundler Cc: Mahesh J Salgaonkar, Oliver O 'Halloran, Bjorn Helgaas, Rajat Jain, Rajat Khandelwal, linux-pci, linux-kernel, linuxppc-dev On Fri, Apr 07, 2023 at 11:53:27AM -0700, Grant Grundler wrote: > On Thu, Apr 6, 2023 at 12:50 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Mar 17, 2023 at 10:51:09AM -0700, Grant Grundler wrote: > > > From: Rajat Khandelwal <rajat.khandelwal@linux.intel.com> > > > > > > There are many instances where correctable errors tend to inundate > > > the message buffer. We observe such instances during thunderbolt PCIe > > > tunneling. > ... > > > if (info->severity == AER_CORRECTABLE) > > > - pci_info(dev, " [%2d] %-22s%s\n", i, errmsg, > > > - info->first_error == i ? " (First)" : ""); > > > + pci_info_ratelimited(dev, " [%2d] %-22s%s\n", i, errmsg, > > > + info->first_error == i ? " (First)" : ""); > > > > I don't think this is going to reliably work the way we want. We have > > a bunch of pci_info_ratelimited() calls, and each caller has its own > > ratelimit_state data. Unless we call pci_info_ratelimited() exactly > > the same number of times for each error, the ratelimit counters will > > get out of sync and we'll end up printing fragments from error A mixed > > with fragments from error B. > > Ok - what I'm reading between the lines here is the output should be > emitted in one step, not multiple pci_info_ratelimited() calls. if the > code built an output string (using sprintnf()), and then called > pci_info_ratelimited() exactly once at the bottom, would that be > sufficient? > > > I think we need to explicitly manage the ratelimiting ourselves, > > similar to print_hmi_event_info() or print_extlog_rcd(). Then we can > > have a *single* ratelimit_state, and we can check it once to determine > > whether to log this correctable error. > > Is the rate limiting per call location or per device? From above, I > understood rate limiting is "per call location". If the code only > has one call location, it should achieve the same goal, right? Rate-limiting is per call location, so yes, if we only have one call location, that would solve it. It would also have the nice property that all the output would be atomic so it wouldn't get mixed with other stuff, and it might encourage us to be a little less wordy in the output. But I don't think we need output in a single step; we just need a single instance of ratelimit_state (or one for CPER path and another for native AER path), and that can control all the output for a single error. E.g., print_hmi_event_info() looks like this: static void print_hmi_event_info(...) { static DEFINE_RATELIMIT_STATE(rs, ...); if (__ratelimit(&rs)) { printk("%s%s Hypervisor Maintenance interrupt ..."); printk("%s Error detail: %s\n", ...); printk("%s HMER: %016llx\n", ...); } } I think it's nice that the struct ratelimit_state is explicit and there's no danger of breaking it when adding another printk later. It *could* be per pci_dev, too, but I suspect it's not worth spending 40ish bytes per device for the ratelimit data. Bjorn ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors 2023-04-07 19:46 ` Bjorn Helgaas @ 2023-04-07 23:47 ` Grant Grundler 2023-04-07 23:49 ` Grant Grundler 2023-05-18 6:11 ` Grant Grundler 2 siblings, 0 replies; 15+ messages in thread From: Grant Grundler @ 2023-04-07 23:47 UTC (permalink / raw) To: Bjorn Helgaas Cc: Grant Grundler, Mahesh J Salgaonkar, Oliver O 'Halloran, Bjorn Helgaas, Rajat Jain, Rajat Khandelwal, linux-pci, linux-kernel, linuxppc-dev [reposting in plain text mode] On Fri, Apr 7, 2023 at 12:46 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Apr 07, 2023 at 11:53:27AM -0700, Grant Grundler wrote: > > On Thu, Apr 6, 2023 at 12:50 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > On Fri, Mar 17, 2023 at 10:51:09AM -0700, Grant Grundler wrote: > > > > From: Rajat Khandelwal <rajat.khandelwal@linux.intel.com> > > > > > > > > There are many instances where correctable errors tend to inundate > > > > the message buffer. We observe such instances during thunderbolt PCIe > > > > tunneling. > > ... > > > > > if (info->severity == AER_CORRECTABLE) > > > > - pci_info(dev, " [%2d] %-22s%s\n", i, errmsg, > > > > - info->first_error == i ? " (First)" : ""); > > > > + pci_info_ratelimited(dev, " [%2d] %-22s%s\n", i, errmsg, > > > > + info->first_error == i ? " (First)" : ""); > > > > > > I don't think this is going to reliably work the way we want. We have > > > a bunch of pci_info_ratelimited() calls, and each caller has its own > > > ratelimit_state data. Unless we call pci_info_ratelimited() exactly > > > the same number of times for each error, the ratelimit counters will > > > get out of sync and we'll end up printing fragments from error A mixed > > > with fragments from error B. > > > > Ok - what I'm reading between the lines here is the output should be > > emitted in one step, not multiple pci_info_ratelimited() calls. if the > > code built an output string (using sprintnf()), and then called > > pci_info_ratelimited() exactly once at the bottom, would that be > > sufficient? > > > > > I think we need to explicitly manage the ratelimiting ourselves, > > > similar to print_hmi_event_info() or print_extlog_rcd(). Then we can > > > have a *single* ratelimit_state, and we can check it once to determine > > > whether to log this correctable error. > > > > Is the rate limiting per call location or per device? From above, I > > understood rate limiting is "per call location". If the code only > > has one call location, it should achieve the same goal, right? > > Rate-limiting is per call location, so yes, if we only have one call > location, that would solve it. It would also have the nice property > that all the output would be atomic so it wouldn't get mixed with > other stuff, and it might encourage us to be a little less wordy in > the output. > > But I don't think we need output in a single step; we just need a > single instance of ratelimit_state (or one for CPER path and another > for native AER path), and that can control all the output for a single > error. E.g., print_hmi_event_info() looks like this: > > static void print_hmi_event_info(...) > { > static DEFINE_RATELIMIT_STATE(rs, ...); > > if (__ratelimit(&rs)) { > printk("%s%s Hypervisor Maintenance interrupt ..."); > printk("%s Error detail: %s\n", ...); > printk("%s HMER: %016llx\n", ...); > } > } > > I think it's nice that the struct ratelimit_state is explicit and > there's no danger of breaking it when adding another printk later. > > It *could* be per pci_dev, too, but I suspect it's not worth spending > 40ish bytes per device for the ratelimit data. > > Bjorn ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors 2023-04-07 19:46 ` Bjorn Helgaas 2023-04-07 23:47 ` Grant Grundler @ 2023-04-07 23:49 ` Grant Grundler 2023-05-18 6:11 ` Grant Grundler 2 siblings, 0 replies; 15+ messages in thread From: Grant Grundler @ 2023-04-07 23:49 UTC (permalink / raw) To: Bjorn Helgaas Cc: Grant Grundler, Mahesh J Salgaonkar, Oliver O 'Halloran, Bjorn Helgaas, Rajat Jain, Rajat Khandelwal, linux-pci, linux-kernel, linuxppc-dev [reposting in plain text mode] On Fri, Apr 7, 2023 at 12:46 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Apr 07, 2023 at 11:53:27AM -0700, Grant Grundler wrote: > > On Thu, Apr 6, 2023 at 12:50 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > On Fri, Mar 17, 2023 at 10:51:09AM -0700, Grant Grundler wrote: > > > > From: Rajat Khandelwal <rajat.khandelwal@linux.intel.com> > > > > > > > > There are many instances where correctable errors tend to inundate > > > > the message buffer. We observe such instances during thunderbolt PCIe > > > > tunneling. > > ... > > > > > if (info->severity == AER_CORRECTABLE) > > > > - pci_info(dev, " [%2d] %-22s%s\n", i, errmsg, > > > > - info->first_error == i ? " (First)" : ""); > > > > + pci_info_ratelimited(dev, " [%2d] %-22s%s\n", i, errmsg, > > > > + info->first_error == i ? " (First)" : ""); > > > > > > I don't think this is going to reliably work the way we want. We have > > > a bunch of pci_info_ratelimited() calls, and each caller has its own > > > ratelimit_state data. Unless we call pci_info_ratelimited() exactly > > > the same number of times for each error, the ratelimit counters will > > > get out of sync and we'll end up printing fragments from error A mixed > > > with fragments from error B. > > > > Ok - what I'm reading between the lines here is the output should be > > emitted in one step, not multiple pci_info_ratelimited() calls. if the > > code built an output string (using sprintnf()), and then called > > pci_info_ratelimited() exactly once at the bottom, would that be > > sufficient? > > > > > I think we need to explicitly manage the ratelimiting ourselves, > > > similar to print_hmi_event_info() or print_extlog_rcd(). Then we can > > > have a *single* ratelimit_state, and we can check it once to determine > > > whether to log this correctable error. > > > > Is the rate limiting per call location or per device? From above, I > > understood rate limiting is "per call location". If the code only > > has one call location, it should achieve the same goal, right? > > Rate-limiting is per call location, so yes, if we only have one call > location, that would solve it. It would also have the nice property > that all the output would be atomic so it wouldn't get mixed with > other stuff, and it might encourage us to be a little less wordy in > the output. +1 to all of those reasons. Especially reducing the number of lines output. I'm going to be out for the next week. If someone else (Rajat Kendalwal maybe?) wants to rework this to use one call location it should be fairly straight forward. If not, I'll tackle this when I'm back (in 2 weeks essentially). > > But I don't think we need output in a single step; we just need a > single instance of ratelimit_state (or one for CPER path and another > for native AER path), and that can control all the output for a single > error. E.g., print_hmi_event_info() looks like this: > > static void print_hmi_event_info(...) > { > static DEFINE_RATELIMIT_STATE(rs, ...); > > if (__ratelimit(&rs)) { > printk("%s%s Hypervisor Maintenance interrupt ..."); > printk("%s Error detail: %s\n", ...); > printk("%s HMER: %016llx\n", ...); > } > } > > I think it's nice that the struct ratelimit_state is explicit and > there's no danger of breaking it when adding another printk later. True. But a single call to a "well documented" API is my preference (assuming this is my choice). > It *could* be per pci_dev, too, but I suspect it's not worth spending > 40ish bytes per device for the ratelimit data. Good - I don't think we need to make this per device - I had assumed it was but also currently don't see a need for this. cheers, grant ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors 2023-04-07 19:46 ` Bjorn Helgaas 2023-04-07 23:47 ` Grant Grundler 2023-04-07 23:49 ` Grant Grundler @ 2023-05-18 6:11 ` Grant Grundler 2023-06-06 3:45 ` Grant Grundler 2 siblings, 1 reply; 15+ messages in thread From: Grant Grundler @ 2023-05-18 6:11 UTC (permalink / raw) To: Bjorn Helgaas Cc: Grant Grundler, Mahesh J Salgaonkar, Oliver O 'Halloran, Bjorn Helgaas, Rajat Jain, Rajat Khandelwal, linux-pci, linux-kernel, linuxppc-dev On Fri, Apr 7, 2023 at 12:46 PM Bjorn Helgaas <helgaas@kernel.org> wrote: ... > But I don't think we need output in a single step; we just need a > single instance of ratelimit_state (or one for CPER path and another > for native AER path), and that can control all the output for a single > error. E.g., print_hmi_event_info() looks like this: > > static void print_hmi_event_info(...) > { > static DEFINE_RATELIMIT_STATE(rs, ...); > > if (__ratelimit(&rs)) { > printk("%s%s Hypervisor Maintenance interrupt ..."); > printk("%s Error detail: %s\n", ...); > printk("%s HMER: %016llx\n", ...); > } > } > > I think it's nice that the struct ratelimit_state is explicit and > there's no danger of breaking it when adding another printk later. Since the output is spread across at least two functions, I think your proposal is a better solution. I'm not happy with the patch series I sent in my previous reply as an attachment. It's only marginally better than the original code. I need another day or two to see if I can implement your proposal correctly. cheers, grant ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors 2023-05-18 6:11 ` Grant Grundler @ 2023-06-06 3:45 ` Grant Grundler 0 siblings, 0 replies; 15+ messages in thread From: Grant Grundler @ 2023-06-06 3:45 UTC (permalink / raw) To: Grant Grundler Cc: Bjorn Helgaas, Mahesh J Salgaonkar, Oliver O 'Halloran, Bjorn Helgaas, Rajat Jain, Rajat Khandelwal, linux-pci, linux-kernel, linuxppc-dev [plain text only this time...] On Wed, May 17, 2023 at 11:11 PM Grant Grundler <grundler@chromium.org> wrote: > > On Fri, Apr 7, 2023 at 12:46 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > ... > > But I don't think we need output in a single step; we just need a > > single instance of ratelimit_state (or one for CPER path and another > > for native AER path), and that can control all the output for a single > > error. E.g., print_hmi_event_info() looks like this: > > > > static void print_hmi_event_info(...) > > { > > static DEFINE_RATELIMIT_STATE(rs, ...); > > > > if (__ratelimit(&rs)) { > > printk("%s%s Hypervisor Maintenance interrupt ..."); > > printk("%s Error detail: %s\n", ...); > > printk("%s HMER: %016llx\n", ...); > > } > > } > > > > I think it's nice that the struct ratelimit_state is explicit and > > there's no danger of breaking it when adding another printk later. > > Since the output is spread across at least two functions, I think your > proposal is a better solution. > > I'm not happy with the patch series I sent in my previous reply as an > attachment. It's only marginally better than the original code. Despite not being happy about it, after a week of vacation I now think it would be better to include them as is since they solve the immediate problems and then solve the above two issues in additional patches. The two changes I have prepared so far correctly fix the original issues they intended to fix and don't affect the new issues we've found. I'll post a V3 of this series tonight after making sure it at least compiles and "looks right". cheers, grant > > I need another day or two to see if I can implement your proposal correctly. > > cheers, > grant ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv2 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO 2023-03-17 17:51 [PATCHv2 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO Grant Grundler 2023-03-17 17:51 ` [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors Grant Grundler @ 2023-03-17 18:50 ` Sathyanarayanan Kuppuswamy 2023-03-17 19:30 ` Bjorn Helgaas 1 sibling, 1 reply; 15+ messages in thread From: Sathyanarayanan Kuppuswamy @ 2023-03-17 18:50 UTC (permalink / raw) To: Grant Grundler, Mahesh J Salgaonkar, Oliver O 'Halloran, Bjorn Helgaas Cc: Rajat Khandelwal, linuxppc-dev, linux-pci, linux-kernel, Rajat Jain On 3/17/23 10:51 AM, Grant Grundler wrote: > Since correctable errors have been corrected (and counted), the dmesg output > should not be reported as a warning, but rather as "informational". > > Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking > station, the dmesg buffer can be spammed with correctable errors, 717 bytes > per instance, potentially many MB per day. Why don't you investigate why you are getting so many correctable errors? Isn't solving the problem preferable to hiding the logs? > > Given the "WARN" priority, these messages have already confused the typical > user that stumbles across them, support staff (triaging feedback reports), > and more than a few linux kernel devs. Changing to INFO will hide these > messages from most audiences. > > Signed-off-by: Grant Grundler <grundler@chromium.org> > --- > drivers/pci/pcie/aer.c | 29 +++++++++++++++++++---------- > 1 file changed, 19 insertions(+), 10 deletions(-) > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index f6c24ded134c..cb6b96233967 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -687,23 +687,29 @@ static void __aer_print_error(struct pci_dev *dev, > { > const char **strings; > unsigned long status = info->status & ~info->mask; > - const char *level, *errmsg; > int i; > > if (info->severity == AER_CORRECTABLE) { > strings = aer_correctable_error_string; > - level = KERN_WARNING; > + pci_info(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", > + info->status, info->mask); > } else { > strings = aer_uncorrectable_error_string; > - level = KERN_ERR; > + pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", > + info->status, info->mask); > } > > for_each_set_bit(i, &status, 32) { > - errmsg = strings[i]; > + const char *errmsg = strings[i]; > + > if (!errmsg) > errmsg = "Unknown Error Bit"; > > - pci_printk(level, dev, " [%2d] %-22s%s\n", i, errmsg, > + if (info->severity == AER_CORRECTABLE) > + pci_info(dev, " [%2d] %-22s%s\n", i, errmsg, > + info->first_error == i ? " (First)" : ""); > + else > + pci_err(dev, " [%2d] %-22s%s\n", i, errmsg, > info->first_error == i ? " (First)" : ""); > } > pci_dev_aer_stats_incr(dev, info); > @@ -724,7 +730,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) > layer = AER_GET_LAYER_ERROR(info->severity, info->status); > agent = AER_GET_AGENT(info->severity, info->status); > > - level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR; > + level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR; > > pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", > aer_error_severity_string[info->severity], > @@ -797,14 +803,17 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity, > info.mask = mask; > info.first_error = PCI_ERR_CAP_FEP(aer->cap_control); > > - pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask); > __aer_print_error(dev, &info); > - pci_err(dev, "aer_layer=%s, aer_agent=%s\n", > - aer_error_layer[layer], aer_agent_string[agent]); > > - if (aer_severity != AER_CORRECTABLE) > + if (aer_severity == AER_CORRECTABLE) { > + pci_info(dev, "aer_layer=%s, aer_agent=%s\n", > + aer_error_layer[layer], aer_agent_string[agent]); > + } else { > + pci_err(dev, "aer_layer=%s, aer_agent=%s\n", > + aer_error_layer[layer], aer_agent_string[agent]); > pci_err(dev, "aer_uncor_severity: 0x%08x\n", > aer->uncor_severity); > + } > > if (tlp_header_valid) > __print_tlp_header(dev, &aer->header_log); -- Sathyanarayanan Kuppuswamy Linux Kernel Developer ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv2 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO 2023-03-17 18:50 ` [PATCHv2 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO Sathyanarayanan Kuppuswamy @ 2023-03-17 19:30 ` Bjorn Helgaas 2023-03-19 6:00 ` Grant Grundler 0 siblings, 1 reply; 15+ messages in thread From: Bjorn Helgaas @ 2023-03-17 19:30 UTC (permalink / raw) To: Sathyanarayanan Kuppuswamy Cc: Grant Grundler, Mahesh J Salgaonkar, Oliver O 'Halloran, Bjorn Helgaas, linux-pci, Rajat Khandelwal, linuxppc-dev, linux-kernel, Rajat Jain On Fri, Mar 17, 2023 at 11:50:22AM -0700, Sathyanarayanan Kuppuswamy wrote: > On 3/17/23 10:51 AM, Grant Grundler wrote: > > Since correctable errors have been corrected (and counted), the dmesg output > > should not be reported as a warning, but rather as "informational". > > > > Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking > > station, the dmesg buffer can be spammed with correctable errors, 717 bytes > > per instance, potentially many MB per day. > > Why don't you investigate why you are getting so many correctable errors? > Isn't solving the problem preferable to hiding the logs? I hope there's some effort to find the cause of the errors, too. But I do think KERN_INFO is a reasonable level for errors that have already been corrected. KERN_ERR seems a little bit too severe to me. Does changing to KERN_INFO keep the messages out of the dmesg log? I don't think it does, because *most* kernel messages are at KERN_INFO. This may be just a commit log clarification. I would like to know *which* devices are involved. Is there some reason for weasel-wording this? Knowing which devices are involved helps in triaging issue reports. If there are any public reports on mailing lists, etc, we could also cite those here to help users find this solution. > > Given the "WARN" priority, these messages have already confused the typical > > user that stumbles across them, support staff (triaging feedback reports), > > and more than a few linux kernel devs. Changing to INFO will hide these > > messages from most audiences. > > > > Signed-off-by: Grant Grundler <grundler@chromium.org> > > --- > > drivers/pci/pcie/aer.c | 29 +++++++++++++++++++---------- > > 1 file changed, 19 insertions(+), 10 deletions(-) > > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > > index f6c24ded134c..cb6b96233967 100644 > > --- a/drivers/pci/pcie/aer.c > > +++ b/drivers/pci/pcie/aer.c > > @@ -687,23 +687,29 @@ static void __aer_print_error(struct pci_dev *dev, > > { > > const char **strings; > > unsigned long status = info->status & ~info->mask; > > - const char *level, *errmsg; > > int i; > > > > if (info->severity == AER_CORRECTABLE) { > > strings = aer_correctable_error_string; > > - level = KERN_WARNING; > > + pci_info(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", > > + info->status, info->mask); > > } else { > > strings = aer_uncorrectable_error_string; > > - level = KERN_ERR; > > + pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", > > + info->status, info->mask); > > } > > > > for_each_set_bit(i, &status, 32) { > > - errmsg = strings[i]; > > + const char *errmsg = strings[i]; > > + > > if (!errmsg) > > errmsg = "Unknown Error Bit"; > > > > - pci_printk(level, dev, " [%2d] %-22s%s\n", i, errmsg, > > + if (info->severity == AER_CORRECTABLE) > > + pci_info(dev, " [%2d] %-22s%s\n", i, errmsg, > > + info->first_error == i ? " (First)" : ""); > > + else > > + pci_err(dev, " [%2d] %-22s%s\n", i, errmsg, > > info->first_error == i ? " (First)" : ""); The - 5 lines, + 11 lines diff and repetition of the printk strings doesn't seem like an improvement compared to the -1, +1 in the v1 patch: @@ -692,7 +692,7 @@ static void __aer_print_error(struct pci_dev *dev, if (info->severity == AER_CORRECTABLE) { strings = aer_correctable_error_string; - level = KERN_WARNING; + level = KERN_INFO; } else { But maybe there's a reason? > > } > > pci_dev_aer_stats_incr(dev, info); > > @@ -724,7 +730,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) > > layer = AER_GET_LAYER_ERROR(info->severity, info->status); > > agent = AER_GET_AGENT(info->severity, info->status); > > > > - level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR; > > + level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR; > > > > pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", > > aer_error_severity_string[info->severity], > > @@ -797,14 +803,17 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity, > > info.mask = mask; > > info.first_error = PCI_ERR_CAP_FEP(aer->cap_control); > > > > - pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask); > > __aer_print_error(dev, &info); > > - pci_err(dev, "aer_layer=%s, aer_agent=%s\n", > > - aer_error_layer[layer], aer_agent_string[agent]); > > > > - if (aer_severity != AER_CORRECTABLE) > > + if (aer_severity == AER_CORRECTABLE) { > > + pci_info(dev, "aer_layer=%s, aer_agent=%s\n", > > + aer_error_layer[layer], aer_agent_string[agent]); > > + } else { > > + pci_err(dev, "aer_layer=%s, aer_agent=%s\n", > > + aer_error_layer[layer], aer_agent_string[agent]); > > pci_err(dev, "aer_uncor_severity: 0x%08x\n", > > aer->uncor_severity); > > + } > > > > if (tlp_header_valid) > > __print_tlp_header(dev, &aer->header_log); > > -- > Sathyanarayanan Kuppuswamy > Linux Kernel Developer ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv2 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO 2023-03-17 19:30 ` Bjorn Helgaas @ 2023-03-19 6:00 ` Grant Grundler 0 siblings, 0 replies; 15+ messages in thread From: Grant Grundler @ 2023-03-19 6:00 UTC (permalink / raw) To: Bjorn Helgaas Cc: Sathyanarayanan Kuppuswamy, Grant Grundler, Mahesh J Salgaonkar, Oliver O 'Halloran, Bjorn Helgaas, linux-pci, Rajat Khandelwal, linuxppc-dev, linux-kernel, Rajat Jain On Fri, Mar 17, 2023 at 12:31 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Mar 17, 2023 at 11:50:22AM -0700, Sathyanarayanan Kuppuswamy wrote: > > On 3/17/23 10:51 AM, Grant Grundler wrote: > > > Since correctable errors have been corrected (and counted), the dmesg output > > > should not be reported as a warning, but rather as "informational". > > > > > > Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking > > > station, the dmesg buffer can be spammed with correctable errors, 717 bytes > > > per instance, potentially many MB per day. > > > > Why don't you investigate why you are getting so many correctable errors? It's a bit difficult to insert a PCIe logic analyzer on someone else's soldered down parts. Intel is the only party with all the HW in a form that allows that level of debugging. > > Isn't solving the problem preferable to hiding the logs? Of course! But note that spamming the logs with low priority messages isn't acceptable either. So we should be doing both. > I hope there's some effort to find the cause of the errors, too. Rajat (from Intel) has told me that Intel is investigating this. I was told the devices are reporting two kinds of correctable errors:replay rollover and replay timeout. This may be related to a replay timer value (PCIERTP1) in the PCI device is too short for PCIe tunneling (USB4). It's not clear to me if this value is hardcoded in the Intel NIC or if it's configurable. But I've been waiting since last December for Intel to give me guidance on what is implemented in their NIC. > But I do think KERN_INFO is a reasonable level for errors that have > already been corrected. KERN_ERR seems a little bit too severe to me. > > Does changing to KERN_INFO keep the messages out of the dmesg log? I > don't think it does, because *most* kernel messages are at KERN_INFO. > This may be just a commit log clarification. True. I'm not sure why I thought KERN_INFO would not show up. If they do, I'll depend on the rate limiting patch to not spam the system logs. > I would like to know *which* devices are involved. > Is there some reason for weasel-wording this? I don't think there is. Rajat previously mentioned this was for the HP TBT4 docking station. The two USB4 docking station vendors I'm familiar with are both using Intel i225 NIC (foxville). This appears to be a problem only with the NIC behind a PCIe tunnel and not a HW problem with the docking station overall (which is what I previously assumed.) FTR, this same NIC works and does not report correctable errors when directly connected to a PCI host controller. It's not as happy on the other side of a USB4 PCIe tunnel but still appears to work correctly (retries are succeeding). I'm not able to point at any NIC failures (yet). But I'm certainly not very happy that the same company which is the primary backer of thunderbolt and USB4 isn't able to design/configure parts so they play well together. > Knowing which devices are involved > helps in triaging issue reports. If there are any public reports on > mailing lists, etc, we could also cite those here to help users find > this solution. Yup - that's exactly how I've been helped so often in the past as well. > > > > Given the "WARN" priority, these messages have already confused the typical > > > user that stumbles across them, support staff (triaging feedback reports), > > > and more than a few linux kernel devs. Changing to INFO will hide these > > > messages from most audiences. > > > > > > Signed-off-by: Grant Grundler <grundler@chromium.org> > > > --- > > > drivers/pci/pcie/aer.c | 29 +++++++++++++++++++---------- > > > 1 file changed, 19 insertions(+), 10 deletions(-) > > > > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > > > index f6c24ded134c..cb6b96233967 100644 > > > --- a/drivers/pci/pcie/aer.c > > > +++ b/drivers/pci/pcie/aer.c > > > @@ -687,23 +687,29 @@ static void __aer_print_error(struct pci_dev *dev, > > > { > > > const char **strings; > > > unsigned long status = info->status & ~info->mask; > > > - const char *level, *errmsg; > > > int i; > > > > > > if (info->severity == AER_CORRECTABLE) { > > > strings = aer_correctable_error_string; > > > - level = KERN_WARNING; > > > + pci_info(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", > > > + info->status, info->mask); > > > } else { > > > strings = aer_uncorrectable_error_string; > > > - level = KERN_ERR; > > > + pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", > > > + info->status, info->mask); > > > } > > > > > > for_each_set_bit(i, &status, 32) { > > > - errmsg = strings[i]; > > > + const char *errmsg = strings[i]; > > > + > > > if (!errmsg) > > > errmsg = "Unknown Error Bit"; > > > > > > - pci_printk(level, dev, " [%2d] %-22s%s\n", i, errmsg, > > > + if (info->severity == AER_CORRECTABLE) > > > + pci_info(dev, " [%2d] %-22s%s\n", i, errmsg, > > > + info->first_error == i ? " (First)" : ""); > > > + else > > > + pci_err(dev, " [%2d] %-22s%s\n", i, errmsg, > > > info->first_error == i ? " (First)" : ""); > > The - 5 lines, + 11 lines diff and repetition of the printk strings > doesn't seem like an improvement compared to the -1, +1 in the v1 > patch: > > @@ -692,7 +692,7 @@ static void __aer_print_error(struct pci_dev *dev, > > if (info->severity == AER_CORRECTABLE) { > strings = aer_correctable_error_string; > - level = KERN_WARNING; > + level = KERN_INFO; > } else { > > But maybe there's a reason? Not intentionally - but Yes - the next patch is simpler when changing to pci_info_ratelimit. This change will show up in either the first or second patch. I can switch that hunk back to V1 code if you prefer that for this patch. cheers, grant > > > > } > > > pci_dev_aer_stats_incr(dev, info); > > > @@ -724,7 +730,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) > > > layer = AER_GET_LAYER_ERROR(info->severity, info->status); > > > agent = AER_GET_AGENT(info->severity, info->status); > > > > > > - level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR; > > > + level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR; > > > > > > pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", > > > aer_error_severity_string[info->severity], > > > @@ -797,14 +803,17 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity, > > > info.mask = mask; > > > info.first_error = PCI_ERR_CAP_FEP(aer->cap_control); > > > > > > - pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask); > > > __aer_print_error(dev, &info); > > > - pci_err(dev, "aer_layer=%s, aer_agent=%s\n", > > > - aer_error_layer[layer], aer_agent_string[agent]); > > > > > > - if (aer_severity != AER_CORRECTABLE) > > > + if (aer_severity == AER_CORRECTABLE) { > > > + pci_info(dev, "aer_layer=%s, aer_agent=%s\n", > > > + aer_error_layer[layer], aer_agent_string[agent]); > > > + } else { > > > + pci_err(dev, "aer_layer=%s, aer_agent=%s\n", > > > + aer_error_layer[layer], aer_agent_string[agent]); > > > pci_err(dev, "aer_uncor_severity: 0x%08x\n", > > > aer->uncor_severity); > > > + } > > > > > > if (tlp_header_valid) > > > __print_tlp_header(dev, &aer->header_log); > > > > -- > > Sathyanarayanan Kuppuswamy > > Linux Kernel Developer ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <CANEJEGsE6KS484iSLkKV8hx2nNThZGfaaz+u+R-A3X5nRev6Gg@mail.gmail.com>]
* Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors [not found] <CANEJEGsE6KS484iSLkKV8hx2nNThZGfaaz+u+R-A3X5nRev6Gg@mail.gmail.com> @ 2023-05-17 16:02 ` Bjorn Helgaas 2023-05-17 21:02 ` Grant Grundler 2023-05-18 5:58 ` Grant Grundler 0 siblings, 2 replies; 15+ messages in thread From: Bjorn Helgaas @ 2023-05-17 16:02 UTC (permalink / raw) To: Grant Grundler Cc: Rajat Jain, Rajat Khandelwal, linux-pci, Mahesh J Salgaonkar, linux-kernel, Oliver O 'Halloran, Bjorn Helgaas, linuxppc-dev On Fri, Apr 07, 2023 at 04:46:03PM -0700, Grant Grundler wrote: > On Fri, Apr 7, 2023 at 12:46 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Apr 07, 2023 at 11:53:27AM -0700, Grant Grundler wrote: > > > On Thu, Apr 6, 2023 at 12:50 PM Bjorn Helgaas <helgaas@kernel.org> > > wrote: > > > > On Fri, Mar 17, 2023 at 10:51:09AM -0700, Grant Grundler wrote: > > > > > From: Rajat Khandelwal <rajat.khandelwal@linux.intel.com> > > > > > > > > > > There are many instances where correctable errors tend to inundate > > > > > the message buffer. We observe such instances during thunderbolt PCIe > > > > > tunneling. > > > ... > > > > > > > if (info->severity == AER_CORRECTABLE) > > > > > - pci_info(dev, " [%2d] %-22s%s\n", i, errmsg, > > > > > - info->first_error == i ? " (First)" : > > ""); > > > > > + pci_info_ratelimited(dev, " [%2d] > > %-22s%s\n", i, errmsg, > > > > > + info->first_error == i ? > > " (First)" : ""); > > > > > > > > I don't think this is going to reliably work the way we want. We have > > > > a bunch of pci_info_ratelimited() calls, and each caller has its own > > > > ratelimit_state data. Unless we call pci_info_ratelimited() exactly > > > > the same number of times for each error, the ratelimit counters will > > > > get out of sync and we'll end up printing fragments from error A mixed > > > > with fragments from error B. > > > > > > Ok - what I'm reading between the lines here is the output should be > > > emitted in one step, not multiple pci_info_ratelimited() calls. if the > > > code built an output string (using sprintnf()), and then called > > > pci_info_ratelimited() exactly once at the bottom, would that be > > > sufficient? > > > > > > > I think we need to explicitly manage the ratelimiting ourselves, > > > > similar to print_hmi_event_info() or print_extlog_rcd(). Then we can > > > > have a *single* ratelimit_state, and we can check it once to determine > > > > whether to log this correctable error. > > > > > > Is the rate limiting per call location or per device? From above, I > > > understood rate limiting is "per call location". If the code only > > > has one call location, it should achieve the same goal, right? > > > > Rate-limiting is per call location, so yes, if we only have one call > > location, that would solve it. It would also have the nice property > > that all the output would be atomic so it wouldn't get mixed with > > other stuff, and it might encourage us to be a little less wordy in > > the output. > > > > +1 to all of those reasons. Especially reducing the number of lines output. > > I'm going to be out for the next week. If someone else (Rajat Kendalwal > maybe?) wants to rework this to use one call location it should be fairly > straight forward. If not, I'll tackle this when I'm back (in 2 weeks > essentially). Ping? Really hoping to merge this for v6.5. Bjorn ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors 2023-05-17 16:02 ` [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors Bjorn Helgaas @ 2023-05-17 21:02 ` Grant Grundler 2023-05-18 5:58 ` Grant Grundler 1 sibling, 0 replies; 15+ messages in thread From: Grant Grundler @ 2023-05-17 21:02 UTC (permalink / raw) To: Bjorn Helgaas Cc: Grant Grundler, Rajat Jain, Rajat Khandelwal, linux-pci, Mahesh J Salgaonkar, linux-kernel, Oliver O 'Halloran, Bjorn Helgaas, linuxppc-dev On Wed, May 17, 2023 at 9:03 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Apr 07, 2023 at 04:46:03PM -0700, Grant Grundler wrote: > > On Fri, Apr 7, 2023 at 12:46 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > On Fri, Apr 07, 2023 at 11:53:27AM -0700, Grant Grundler wrote: > > > > On Thu, Apr 6, 2023 at 12:50 PM Bjorn Helgaas <helgaas@kernel.org> > > > wrote: > > > > > On Fri, Mar 17, 2023 at 10:51:09AM -0700, Grant Grundler wrote: > > > > > > From: Rajat Khandelwal <rajat.khandelwal@linux.intel.com> > > > > > > > > > > > > There are many instances where correctable errors tend to inundate > > > > > > the message buffer. We observe such instances during thunderbolt PCIe > > > > > > tunneling. > > > > ... > > > > > > > > > if (info->severity == AER_CORRECTABLE) > > > > > > - pci_info(dev, " [%2d] %-22s%s\n", i, errmsg, > > > > > > - info->first_error == i ? " (First)" : > > > ""); > > > > > > + pci_info_ratelimited(dev, " [%2d] > > > %-22s%s\n", i, errmsg, > > > > > > + info->first_error == i ? > > > " (First)" : ""); > > > > > > > > > > I don't think this is going to reliably work the way we want. We have > > > > > a bunch of pci_info_ratelimited() calls, and each caller has its own > > > > > ratelimit_state data. Unless we call pci_info_ratelimited() exactly > > > > > the same number of times for each error, the ratelimit counters will > > > > > get out of sync and we'll end up printing fragments from error A mixed > > > > > with fragments from error B. > > > > > > > > Ok - what I'm reading between the lines here is the output should be > > > > emitted in one step, not multiple pci_info_ratelimited() calls. if the > > > > code built an output string (using sprintnf()), and then called > > > > pci_info_ratelimited() exactly once at the bottom, would that be > > > > sufficient? > > > > > > > > > I think we need to explicitly manage the ratelimiting ourselves, > > > > > similar to print_hmi_event_info() or print_extlog_rcd(). Then we can > > > > > have a *single* ratelimit_state, and we can check it once to determine > > > > > whether to log this correctable error. > > > > > > > > Is the rate limiting per call location or per device? From above, I > > > > understood rate limiting is "per call location". If the code only > > > > has one call location, it should achieve the same goal, right? > > > > > > Rate-limiting is per call location, so yes, if we only have one call > > > location, that would solve it. It would also have the nice property > > > that all the output would be atomic so it wouldn't get mixed with > > > other stuff, and it might encourage us to be a little less wordy in > > > the output. > > > > > > > +1 to all of those reasons. Especially reducing the number of lines output. > > > > I'm going to be out for the next week. If someone else (Rajat Kendalwal > > maybe?) wants to rework this to use one call location it should be fairly > > straight forward. If not, I'll tackle this when I'm back (in 2 weeks > > essentially). > > Ping? Really hoping to merge this for v6.5. Sorry - I forgot about this... I'll take a shot at it. Should have something by this evening. cheers, grant > > Bjorn ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors 2023-05-17 16:02 ` [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors Bjorn Helgaas 2023-05-17 21:02 ` Grant Grundler @ 2023-05-18 5:58 ` Grant Grundler 1 sibling, 0 replies; 15+ messages in thread From: Grant Grundler @ 2023-05-18 5:58 UTC (permalink / raw) To: Bjorn Helgaas Cc: Grant Grundler, Rajat Jain, Rajat Khandelwal, linux-pci, Mahesh J Salgaonkar, linux-kernel, Oliver O 'Halloran, Bjorn Helgaas, linuxppc-dev [-- Attachment #1: Type: text/plain, Size: 3007 bytes --] On Wed, May 17, 2023 at 9:03 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Apr 07, 2023 at 04:46:03PM -0700, Grant Grundler wrote: > > On Fri, Apr 7, 2023 at 12:46 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > On Fri, Apr 07, 2023 at 11:53:27AM -0700, Grant Grundler wrote: > > > > On Thu, Apr 6, 2023 at 12:50 PM Bjorn Helgaas <helgaas@kernel.org> > > > wrote: > > > > > On Fri, Mar 17, 2023 at 10:51:09AM -0700, Grant Grundler wrote: > > > > > > From: Rajat Khandelwal <rajat.khandelwal@linux.intel.com> > > > > > > > > > > > > There are many instances where correctable errors tend to inundate > > > > > > the message buffer. We observe such instances during thunderbolt PCIe > > > > > > tunneling. > > > > ... > > > > > > > > > if (info->severity == AER_CORRECTABLE) > > > > > > - pci_info(dev, " [%2d] %-22s%s\n", i, errmsg, > > > > > > - info->first_error == i ? " (First)" : > > > ""); > > > > > > + pci_info_ratelimited(dev, " [%2d] > > > %-22s%s\n", i, errmsg, > > > > > > + info->first_error == i ? > > > " (First)" : ""); > > > > > > > > > > I don't think this is going to reliably work the way we want. We have > > > > > a bunch of pci_info_ratelimited() calls, and each caller has its own > > > > > ratelimit_state data. Unless we call pci_info_ratelimited() exactly > > > > > the same number of times for each error, the ratelimit counters will > > > > > get out of sync and we'll end up printing fragments from error A mixed > > > > > with fragments from error B. Despite consolidating the error output, my impression is this is still possible. :( ... > > > Rate-limiting is per call location, so yes, if we only have one call > > > location, that would solve it. It would also have the nice property > > > that all the output would be atomic so it wouldn't get mixed with > > > other stuff, and it might encourage us to be a little less wordy in > > > the output. Unfortunately, I think this needs further surgery. > > +1 to all of those reasons. Especially reducing the number of lines output. > > > > I'm going to be out for the next week. If someone else (Rajat Kendalwal > > maybe?) wants to rework this to use one call location it should be fairly > > straight forward. If not, I'll tackle this when I'm back (in 2 weeks > > essentially). > > Ping? Really hoping to merge this for v6.5. I've appended what I have now... but there are still two issues: 1) we still end up with two "pci_info_ratelimited" call locations: one in aer_print_err() and another in __aer_print_err(). 2) I just noticed both functions output info->status and info->mask (so this ends up getting printed twice in different formats). and that's not really even looking carefully at the other call site: cper_print_aer() If this is "good enough" for now, I can repost as v3. cheers, grant [-- Attachment #2: 0001-PCI-AER-correctable-error-message-as-KERN_INFO.patch --] [-- Type: text/x-patch, Size: 2935 bytes --] From 5ee8c86ce0496be66784eff94e0b165be33f83f4 Mon Sep 17 00:00:00 2001 From: Grant Grundler <grundler@chromium.org> Date: Tue, 28 Feb 2023 22:04:53 -0800 Subject: [PATCH 1/2] PCI/AER: correctable error message as KERN_INFO Since correctable errors have been corrected (and counted), the dmesg output should not be reported as a warning, but rather as "informational". Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking station, the dmesg buffer can be spammed with correctable errors, 717 bytes per instance, potentially many MB per day. Given the "WARN" priority, these messages have already confused the typical user that stumbles across them, support staff (triaging feedback reports), and more than a few linux kernel devs. Changing to INFO will hide these messages from most audiences. Signed-off-by: Grant Grundler <grundler@chromium.org> --- drivers/pci/pcie/aer.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index f6c24ded134c..d7bfc6070ddb 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -692,7 +692,7 @@ static void __aer_print_error(struct pci_dev *dev, if (info->severity == AER_CORRECTABLE) { strings = aer_correctable_error_string; - level = KERN_WARNING; + level = KERN_INFO; } else { strings = aer_uncorrectable_error_string; level = KERN_ERR; @@ -724,7 +724,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) layer = AER_GET_LAYER_ERROR(info->severity, info->status); agent = AER_GET_AGENT(info->severity, info->status); - level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR; + level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR; pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", aer_error_severity_string[info->severity], @@ -797,14 +797,22 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity, info.mask = mask; info.first_error = PCI_ERR_CAP_FEP(aer->cap_control); - pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask); + if (aer_severity == AER_CORRECTABLE) + pci_info(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask); + else + pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask); + __aer_print_error(dev, &info); - pci_err(dev, "aer_layer=%s, aer_agent=%s\n", - aer_error_layer[layer], aer_agent_string[agent]); - if (aer_severity != AER_CORRECTABLE) + if (aer_severity == AER_CORRECTABLE) { + pci_info(dev, "aer_layer=%s, aer_agent=%s\n", + aer_error_layer[layer], aer_agent_string[agent]); + } else { + pci_err(dev, "aer_layer=%s, aer_agent=%s\n", + aer_error_layer[layer], aer_agent_string[agent]); pci_err(dev, "aer_uncor_severity: 0x%08x\n", aer->uncor_severity); + } if (tlp_header_valid) __print_tlp_header(dev, &aer->header_log); -- 2.40.1.606.ga4b1b128d6-goog [-- Attachment #3: 0002-PCI-AER-Rate-limit-the-reporting-of-the-correctable-.patch --] [-- Type: text/x-patch, Size: 6811 bytes --] From 70821bbf58a9ff1ea33dcdf8be17023f2437d4eb Mon Sep 17 00:00:00 2001 From: Rajat Khandelwal <rajat.khandelwal@linux.intel.com> Date: Tue, 3 Jan 2023 22:25:48 +0530 Subject: [PATCH 2/2] PCI/AER: Rate limit the reporting of the correctable errors There are many instances where correctable errors tend to inundate the message buffer. We observe such instances during thunderbolt PCIe tunneling. It's true that they are mitigated by the hardware and are non-fatal but we shouldn't be spamming the logs with such correctable errors as it confuses other kernel developers less familiar with PCI errors, support staff, and users who happen to look at the logs, hence rate limit them. A typical example log inside an HP TBT4 dock: [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0 [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) [54912.661203] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001100/00002000 [54912.661211] igc 0000:2b:00.0: [ 8] Rollover [54912.661219] igc 0000:2b:00.0: [12] Timeout [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0 [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) [54982.838808] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001000/00002000 [54982.838817] igc 0000:2b:00.0: [12] Timeout This gets repeated continuously, thus inundating the buffer. Signed-off-by: Rajat Khandelwal <rajat.khandelwal@linux.intel.com> Signed-off-by: Grant Grundler <grundler@chromium.org> --- drivers/pci/pcie/aer.c | 80 +++++++++++++++++++++++++++--------------- 1 file changed, 51 insertions(+), 29 deletions(-) diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index d7bfc6070ddb..830f5a1261c9 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -686,26 +686,36 @@ static void __aer_print_error(struct pci_dev *dev, struct aer_err_info *info) { const char **strings; + char aer_msg[512]; unsigned long status = info->status & ~info->mask; - const char *level, *errmsg; int i; - if (info->severity == AER_CORRECTABLE) { - strings = aer_correctable_error_string; - level = KERN_INFO; - } else { - strings = aer_uncorrectable_error_string; - level = KERN_ERR; - } + memset(aer_msg, 0, sizeof(*aer_msg)); + snprintf(aer_msg, sizeof(*aer_msg), "aer_status: 0x%08x, aer_mask: 0x%08x\n", + info->status, info->mask); + + strings = (info->severity == AER_CORRECTABLE) ? + aer_correctable_error_string : aer_uncorrectable_error_string; for_each_set_bit(i, &status, 32) { - errmsg = strings[i]; + const char *errmsg = strings[i]; + char bitmsg[64]; + memset(bitmsg, 0, sizeof(*bitmsg)); + if (!errmsg) errmsg = "Unknown Error Bit"; - pci_printk(level, dev, " [%2d] %-22s%s\n", i, errmsg, - info->first_error == i ? " (First)" : ""); + snprintf(bitmsg, sizeof(*bitmsg), " [%2d] %-22s%s\n", i, errmsg, + info->first_error == i ? " (First)" : ""); + + strlcat(aer_msg, bitmsg, sizeof(*aer_msg)); } + + if (info->severity == AER_CORRECTABLE) + pci_info_ratelimited(dev, "%s", aer_msg); + else + pci_err(dev, "%s", aer_msg): + pci_dev_aer_stats_incr(dev, info); } @@ -713,7 +723,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) { int layer, agent; int id = ((dev->bus->number << 8) | dev->devfn); - const char *level; if (!info->status) { pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n", @@ -724,14 +733,19 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) layer = AER_GET_LAYER_ERROR(info->severity, info->status); agent = AER_GET_AGENT(info->severity, info->status); - level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR; - - pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", - aer_error_severity_string[info->severity], - aer_error_layer[layer], aer_agent_string[agent]); - - pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n", - dev->vendor, dev->device, info->status, info->mask); + if (info->severity == AER_CORRECTABLE) { + pci_info_ratelimited(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n" + " device [%04x:%04x] error status/mask=%08x/%08x\n", + aer_error_severity_string[info->severity], + aer_error_layer[layer], aer_agent_string[agent], + dev->vendor, dev->device, info->status, info->mask); + } else { + pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n", + " device [%04x:%04x] error status/mask=%08x/%08x\n", + aer_error_severity_string[info->severity], + aer_error_layer[layer], aer_agent_string[agent], + dev->vendor, dev->device, info->status, info->mask); + } __aer_print_error(dev, info); @@ -751,11 +765,19 @@ static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info) u8 bus = info->id >> 8; u8 devfn = info->id & 0xff; - pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n", - info->multi_error_valid ? "Multiple " : "", - aer_error_severity_string[info->severity], - pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), - PCI_FUNC(devfn)); + if (info->severity == AER_CORRECTABLE) + pci_info_ratelimited(dev, "%s%s error received: %04x:%02x:%02x.%d\n", + info->multi_error_valid ? "Multiple " : "", + aer_error_severity_string[info->severity], + pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), + PCI_FUNC(devfn)); + else + pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n", + info->multi_error_valid ? "Multiple " : "", + aer_error_severity_string[info->severity], + pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), + PCI_FUNC(devfn)); + } #ifdef CONFIG_ACPI_APEI_PCIEAER @@ -798,7 +820,7 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity, info.first_error = PCI_ERR_CAP_FEP(aer->cap_control); if (aer_severity == AER_CORRECTABLE) - pci_info(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask); + pci_info_ratelimited(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask); else pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask); @@ -808,9 +830,9 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity, pci_info(dev, "aer_layer=%s, aer_agent=%s\n", aer_error_layer[layer], aer_agent_string[agent]); } else { - pci_err(dev, "aer_layer=%s, aer_agent=%s\n", - aer_error_layer[layer], aer_agent_string[agent]); - pci_err(dev, "aer_uncor_severity: 0x%08x\n", + pci_err(dev, "aer_layer=%s, aer_agent=%s," + " aer_uncor_severity=0x%08x\n", + aer_error_layer[layer], aer_agent_string[agent], aer->uncor_severity); } -- 2.40.1.606.ga4b1b128d6-goog ^ permalink raw reply related [flat|nested] 15+ messages in thread
end of thread, other threads:[~2023-06-06 3:45 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-03-17 17:51 [PATCHv2 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO Grant Grundler 2023-03-17 17:51 ` [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors Grant Grundler 2023-04-06 19:50 ` Bjorn Helgaas 2023-04-07 18:53 ` Grant Grundler 2023-04-07 19:46 ` Bjorn Helgaas 2023-04-07 23:47 ` Grant Grundler 2023-04-07 23:49 ` Grant Grundler 2023-05-18 6:11 ` Grant Grundler 2023-06-06 3:45 ` Grant Grundler 2023-03-17 18:50 ` [PATCHv2 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO Sathyanarayanan Kuppuswamy 2023-03-17 19:30 ` Bjorn Helgaas 2023-03-19 6:00 ` Grant Grundler [not found] <CANEJEGsE6KS484iSLkKV8hx2nNThZGfaaz+u+R-A3X5nRev6Gg@mail.gmail.com> 2023-05-17 16:02 ` [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors Bjorn Helgaas 2023-05-17 21:02 ` Grant Grundler 2023-05-18 5:58 ` Grant Grundler
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).