public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 0/2] PCI: hotplug: Add a generic RAS tracepoint for hotplug event
@ 2024-11-12 11:58 Shuai Xue
  2024-11-12 11:58 ` [PATCH v2 1/2] " Shuai Xue
  2024-11-12 11:58 ` [PATCH v2 2/2] pci: pciehp: Generate tracepoints " Shuai Xue
From: Shuai Xue @ 2024-11-12 11:58 UTC
  To: lukas, linux-pci, linux-kernel, linux-edac
  Cc: bhelgaas, tony.luck, bp, xueshuai

Hotplug events are critical indicators for analyzing hardware health,
particularly in AI supercomputers, where surprise link-down events can
significantly impact system performance and reliability. A failure
characterization analysis illustrates the significance of failures
caused by InfiniBand link errors. Meta observes that 2% of InfiniBand
failures in a machine learning cluster and 6% in a vision application
cluster co-occur with GPU failures, such as falling off the bus, which
may indicate a correlation with PCIe. [1]

PATCH 1/2: Add a generic RAS tracepoint for hotplug events to help with health checks.
PATCH 2/2: Generate tracepoints for PCIe hotplug events.

The output looks like this:

$ echo 1 > /sys/kernel/debug/tracing/events/hotplug/pci_hp_event/enable
$ cat /sys/kernel/debug/tracing/trace_pipe
           <...>-206     [001] .....    40.373870: pci_hp_event: 0000:00:02.0 slot:10, trans_state:Link Down

           <...>-206     [001] .....    40.374871: pci_hp_event: 0000:00:02.0 slot:10, trans_state:Card not present
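For health-check tooling that consumes these events, the trace lines can
be split into fields. The following is a minimal sketch in Python, not
part of the patch set: the regular expression is inferred only from the
two sample lines above, and the names HotplugEvent, parse_pci_hp_event
and the "device" field label are hypothetical. The format may change in
later revisions of the series.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the sample trace_pipe output above. The field
# layout is an assumption and may not match the final tracepoint format.
EVENT_RE = re.compile(
    r"(?P<timestamp>\d+\.\d+):\s+pci_hp_event:\s+"
    r"(?P<device>\S+)\s+slot:(?P<slot>[^,]+),\s+"
    r"trans_state:(?P<state>.+?)\s*$"
)


class HotplugEvent(NamedTuple):
    """One parsed pci_hp_event line (hypothetical helper type)."""
    timestamp: float  # seconds, as printed by ftrace
    device: str       # PCI address as printed, e.g. 0000:00:02.0
    slot: str         # slot name as printed after "slot:"
    state: str        # text after "trans_state:"


def parse_pci_hp_event(line: str) -> Optional[HotplugEvent]:
    """Return the parsed event, or None if the line is not a pci_hp_event."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    return HotplugEvent(
        float(m["timestamp"]), m["device"], m["slot"], m["state"]
    )


if __name__ == "__main__":
    sample = (
        "           <...>-206     [001] .....    40.373870: "
        "pci_hp_event: 0000:00:02.0 slot:10, trans_state:Link Down"
    )
    print(parse_pci_hp_event(sample))
```

Lines that do not match, such as blank lines or other trace events, yield
None, so the function can be applied to every line read from trace_pipe.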

[1] https://arxiv.org/abs/2410.21680


Shuai Xue (2):
  PCI: hotplug: Add a generic RAS tracepoint for hotplug event
  pci: pciehp: Generate tracepoints for hotplug event

 drivers/pci/hotplug/pciehp_ctrl.c | 33 ++++++++++++---
 drivers/pci/hotplug/trace.h       | 68 +++++++++++++++++++++++++++++++
 include/uapi/linux/pci.h          |  7 ++++
 3 files changed, 102 insertions(+), 6 deletions(-)
 create mode 100644 drivers/pci/hotplug/trace.h

-- 
2.39.3



Thread overview: 5+ messages
2024-11-12 11:58 [PATCH v2 0/2] PCI: hotplug: Add a generic RAS tracepoint for hotplug event Shuai Xue
2024-11-12 11:58 ` [PATCH v2 1/2] " Shuai Xue
2024-11-13 16:27   ` Lukas Wunner
2024-11-14  1:39     ` Shuai Xue
2024-11-12 11:58 ` [PATCH v2 2/2] pci: pciehp: Generate tracepoints " Shuai Xue