public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH] PCI: pciehp: Generate a RAS tracepoint for hotplug event
@ 2024-11-08  3:09 Shuai Xue
From: Shuai Xue @ 2024-11-08  3:09 UTC (permalink / raw)
  To: linux-pci, linux-kernel, linux-edac; +Cc: bhelgaas, tony.luck, bp, xueshuai

Hotplug events are critical indicators for analyzing hardware health,
particularly in AI supercomputers, where surprise link downs can
significantly impact system performance and reliability. A failure
characterization study highlights the significance of failures caused
by InfiniBand link errors: Meta observes that 2% of InfiniBand failures
in a machine learning cluster and 6% in a vision application cluster
co-occur with GPU failures, such as GPUs falling off the bus, which may
indicate a correlation with PCIe. [1]

Generate a RAS tracepoint for hotplug events to help with health checks.

The output looks like this:
$ echo 1 > /sys/kernel/debug/tracing/events/ras/pciehp_event/enable
$ cat /sys/kernel/debug/tracing/trace_pipe
           <...>-213     [001] .....    43.762740: pciehp_event: 0000:00:02.0 slot:10, state:5, events:65792

[1] https://arxiv.org/abs/2410.21680

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 drivers/pci/hotplug/pciehp_ctrl.c |  5 +++++
 include/ras/ras_event.h           | 29 +++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c
index dcdbfcf404dd..ec9285e3b9b5 100644
--- a/drivers/pci/hotplug/pciehp_ctrl.c
+++ b/drivers/pci/hotplug/pciehp_ctrl.c
@@ -19,6 +19,7 @@
 #include <linux/types.h>
 #include <linux/pm_runtime.h>
 #include <linux/pci.h>
+#include <ras/ras_event.h>
 #include "pciehp.h"
 
 /* The following routines constitute the bulk of the
@@ -245,6 +246,8 @@ void pciehp_handle_presence_or_link_change(struct controller *ctrl, u32 events)
 		if (events & PCI_EXP_SLTSTA_PDC)
 			ctrl_info(ctrl, "Slot(%s): Card not present\n",
 				  slot_name(ctrl));
+		trace_pciehp_event(dev_name(&ctrl->pcie->port->dev),
+				   slot_name(ctrl), ON_STATE, events);
 		pciehp_disable_slot(ctrl, SURPRISE_REMOVAL);
 		break;
 	default:
@@ -282,6 +285,8 @@ void pciehp_handle_presence_or_link_change(struct controller *ctrl, u32 events)
 		if (link_active)
 			ctrl_info(ctrl, "Slot(%s): Link Up\n",
 				  slot_name(ctrl));
+		trace_pciehp_event(dev_name(&ctrl->pcie->port->dev),
+				   slot_name(ctrl), OFF_STATE, events);
 		ctrl->request_result = pciehp_enable_slot(ctrl);
 		break;
 	default:
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index e5f7ee0864e7..5013d6ff920e 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -338,6 +338,35 @@ TRACE_EVENT(aer_event,
 			"Not available")
 );
 
+TRACE_EVENT(pciehp_event,
+	TP_PROTO(const char *port_name,
+		 const char *slot,
+		 const u8 state,
+		 const u32 events),
+
+	TP_ARGS(port_name, slot, state, events),
+
+	TP_STRUCT__entry(
+		__string(	port_name,	port_name	)
+		__string(	slot,		slot		)
+		__field(	u8,		state		)
+		__field(	u32,		events		)
+	),
+
+	TP_fast_assign(
+		__assign_str(port_name);
+		__assign_str(slot);
+		__entry->state		= state;
+		__entry->events		= events;
+	),
+
+	TP_printk("%s slot:%s, state:%d, events:%u",
+		__get_str(port_name),
+		__get_str(slot),
+		__entry->state,
+		__entry->events)
+);
+
 /*
  * memory-failure recovery action result event
  *
-- 
2.39.3
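
For readers who want to consume the tracepoint from userspace, the sample line in the commit message can be parsed and its events field partially decoded. The following is a minimal sketch, not part of the patch. It assumes the line format proposed in this RFC ("<port> slot:<slot>, state:<n>, events:<n>"), which may change in later revisions, and it assumes the low 16 bits of events are PCIe Slot Status bits, as suggested by the handler testing events & PCI_EXP_SLTSTA_PDC. The helper names parse_pciehp_event() and decode_events() are hypothetical. Bits above bit 15 are printed raw and not interpreted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Slot Status event bits, values as in include/uapi/linux/pci_regs.h. */
struct sltsta_bit {
	uint32_t mask;
	const char *name;
};

static const struct sltsta_bit sltsta_bits[] = {
	{ 0x0001, "ABP"   },	/* Attention Button Pressed */
	{ 0x0002, "PFD"   },	/* Power Fault Detected */
	{ 0x0004, "MRLSC" },	/* MRL Sensor Changed */
	{ 0x0008, "PDC"   },	/* Presence Detect Changed */
	{ 0x0010, "CC"    },	/* Command Completed */
	{ 0x0100, "DLLSC" },	/* Data Link Layer State Changed */
};

/*
 * Hypothetical helper: parse one trace line containing "pciehp_event: ".
 * port must hold at least 32 bytes and slot at least 16 bytes.
 * Returns 0 on success, -1 if the line does not match the assumed format.
 */
static int parse_pciehp_event(const char *line, char *port, char *slot,
			      unsigned int *state, uint32_t *events)
{
	const char *p = strstr(line, "pciehp_event: ");
	unsigned int ev;

	if (!p)
		return -1;
	p += strlen("pciehp_event: ");
	if (sscanf(p, "%31s slot:%15[^,], state:%u, events:%u",
		   port, slot, state, &ev) != 4)
		return -1;
	*events = ev;
	return 0;
}

/*
 * Hypothetical helper: write the names of the Slot Status event bits set
 * in events into buf, separated by '|'. Bits 16 and above are appended as
 * a raw hex value; low bits not listed in the table are ignored.
 */
static void decode_events(uint32_t events, char *buf, size_t len)
{
	size_t i, used = 0;
	uint32_t high = events & 0xffff0000u;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(sltsta_bits) / sizeof(sltsta_bits[0]); i++) {
		if (!(events & sltsta_bits[i].mask))
			continue;
		used += (size_t)snprintf(buf + used, len - used, "%s%s",
					 used ? "|" : "", sltsta_bits[i].name);
		if (used >= len)
			return;	/* output truncated */
	}
	if (high)
		snprintf(buf + used, len - used, "%sflags:0x%x",
			 used ? "|" : "", (unsigned int)high);
}
```

Applied to the sample line above, parse_pciehp_event() yields port 0000:00:02.0, slot 10, state 5 and events 65792 (0x10100), which decode_events() renders as DLLSC plus an uninterpreted upper flag value.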

