From: Bjorn Helgaas <helgaas@kernel.org>
To: Shuai Xue <xueshuai@linux.alibaba.com>
Cc: rostedt@goodmis.org, lukas@wunner.de, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
linux-trace-kernel@vger.kernel.org, bhelgaas@google.com,
tony.luck@intel.com, bp@alien8.de, mhiramat@kernel.org,
mathieu.desnoyers@efficios.com, oleg@redhat.com,
naveen@kernel.org, davem@davemloft.net,
anil.s.keshavamurthy@intel.com, mark.rutland@arm.com,
peterz@infradead.org
Subject: Re: [PATCH v4] PCI: hotplug: Add a generic RAS tracepoint for hotplug event
Date: Wed, 8 Jan 2025 11:59:06 -0600
Message-ID: <20250108175906.GA219807@bhelgaas>
In-Reply-To: <309dd6e6-53ec-4f82-94ca-242941bd7136@linux.alibaba.com>

On Wed, Jan 08, 2025 at 05:04:25PM +0800, Shuai Xue wrote:
> On 2025/1/8 07:19, Bjorn Helgaas wrote:
> > On Sat, Nov 23, 2024 at 07:31:08PM +0800, Shuai Xue wrote:
> > > Hotplug events are critical indicators for analyzing hardware health,
> > > particularly in AI supercomputers where surprise link downs can
> > > significantly impact system performance and reliability. The failure
> > > characterization analysis illustrates the significance of failures
> > > caused by Infiniband link errors. Meta observes that 2% of Infiniband
> > > failures in a machine learning cluster and 6% in a vision application
> > > cluster co-occur with GPU failures, such as falling off the bus, which
> > > may indicate a correlation with PCIe.[1]
> > >
> > > To this end, define a new TRACING_SYSTEM named pci, add a generic RAS
> > > tracepoint for hotplug events to help with health checks, and generate
> > > tracepoints for PCIe hotplug events. To monitor these tracepoints in
> > > userspace, e.g. with rasdaemon, put `enum pci_hotplug_event` in a uapi
> > > header.
> > >
> > > The output looks like this:
> > > $ echo 1 > /sys/kernel/debug/tracing/events/pci/pci_hp_event/enable
> > > $ cat /sys/kernel/debug/tracing/trace_pipe
> > > <...>-206 [001] ..... 40.373870: pci_hp_event: 0000:00:02.0 slot:10, event:Link Down
> > >
> > > <...>-206 [001] ..... 40.374871: pci_hp_event: 0000:00:02.0 slot:10, event:Card not present
> > >
> > > [1]https://arxiv.org/abs/2410.21680
> >
> > Doesn't apply on pci/main (v6.13-rc1); can you rebase it?
>
> Sure. Do you mean the Git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git
> branch main

Yes. The most recent -rc1 is generally a safe bet for basing patches.

> > Probably more detail than necessary about AI supercomputers,
> > Infiniband, vision applications, etc. This is a very generic issue.
>
> Agreed. It is generic. Are you asking for the first background paragraph to be
> deleted?

I think the important part is that hotplug and link down events are
critical indicators of hardware health. That's enough to motivate
this patch.

> > "Falling off the bus" doesn't really mean anything to me. I suppose
> > it's another way to describe a "link down" event that leads to UR
> > errors when trying to access the device?
>
> Sorry for the confusion. "Falling off the bus" is a common error for
> NVIDIA GPUs observed in production. The GPU driver logs such a
> message when the GPU is not accessible.

Yep, I see those too, and I wish the message weren't phrased so
casually. IIRC this is typically logged when an MMIO read returns ~0,
which happens when a UR or similar error occurs.

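
To make the "~0" case concrete, here is a minimal sketch of the kind of
check involved. The helper name is hypothetical and is not taken from any
driver; I believe the kernel's PCI_POSSIBLE_ERROR() macro in
include/linux/pci.h expresses the same test. A register can legitimately
hold all ones, so a match is only a hint that the device may be gone.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only: report whether a 32-bit
 * MMIO read returned all ones, which is what the read typically
 * yields after a UR or similar error.
 */
static bool mmio_read_looks_failed(uint32_t val)
{
	return val == UINT32_MAX;	/* UINT32_MAX is ~0 for 32 bits */
}
```

A caller would normally confirm with a second read of a register that can
never be all ones before concluding the device is unreachable.
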
> > I'm guessing that monitoring these via rasdaemon requires more than
> > just adding "enum pci_hotplug_event"? Or does rasdaemon read
> > include/uapi/linux/pci.h and automagically incorporate new events?
> > Maybe there's at least a rebuild involved?
>
> Yes, a rebuild is needed. Rasdaemon has a basic infrastructure to manually
> register a tracepoint event handler. For example, for this new event, we can
> register to handle pci_hp_event:
>
> rc = add_event_handler(ras, pevent, page_size, "pci", "pci_hp_event",
> ras_pci_hp_event_handler, NULL, PCI_HOTPLUG_EVENT);

I would say something like "Add enum pci_hotplug_event in
include/uapi/linux/pci.h so applications like rasdaemon can register
tracepoint event handlers for it."

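
For a quick userspace check that does not involve rasdaemon, the text
output shown in the commit log can be parsed directly. The following is
only a sketch under assumptions: the struct and function names are
hypothetical, and the line format is the one quoted above
("pci_hp_event: <device> slot:<n>, event:<name>"). rasdaemon itself
registers a handler on the tracepoint as in the add_event_handler() call
above; this sketch is not how rasdaemon works.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record type for this sketch; not a rasdaemon structure. */
struct hp_record {
	char dev[16];	/* PCI address, e.g. "0000:00:02.0" */
	int slot;	/* slot number */
	char event[32];	/* event name, e.g. "Link Down" */
};

/*
 * Parse one trace_pipe line containing a pci_hp_event record.
 * Returns 0 on success, -1 if the line does not match the format.
 */
static int parse_hp_line(const char *line, struct hp_record *rec)
{
	const char *tag = "pci_hp_event: ";
	const char *p = strstr(line, tag);

	if (!p)
		return -1;
	p += strlen(tag);

	/* %31[^\n] keeps spaces inside names like "Card not present" */
	if (sscanf(p, "%15s slot:%d, event:%31[^\n]",
		   rec->dev, &rec->slot, rec->event) != 3)
		return -1;

	return 0;
}
```

Feeding it each line read from trace_pipe is enough to count link down
events per device, though the text format is not a stable interface the
way the uapi enum is meant to be.
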
Bjorn