From: Lukas Wunner <lukas@wunner.de>
To: Shuai Xue <xueshuai@linux.alibaba.com>
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, bhelgaas@google.com,
tony.luck@intel.com, bp@alien8.de
Subject: Re: [PATCH v2 1/2] PCI: hotplug: Add a generic RAS tracepoint for hotplug event
Date: Wed, 13 Nov 2024 17:27:44 +0100 [thread overview]
Message-ID: <ZzTTgOoCSqfC4cVR@wunner.de> (raw)
In-Reply-To: <20241112115852.52980-3-xueshuai@linux.alibaba.com> <20241112115852.52980-2-xueshuai@linux.alibaba.com>
On Tue, Nov 12, 2024 at 07:58:51PM +0800, Shuai Xue wrote:
> Hotplug events are critical indicators for analyzing hardware health,
> particularly in AI supercomputers where surprise link downs can
> significantly impact system performance and reliability. The failure
> characterization analysis illustrates the significance of failures
> caused by the Infiniband link errors. Meta observes that 2% in a machine
> learning cluster and 6% in a vision application cluster of Infiniband
> failures co-occur with GPU failures, such as falling off the bus, which
> may indicate a correlation with PCIe.[1]
>
> Add a generic RAS tracepoint for hotplug event to help healthy check.
It would be good if you could mention in the commit message that
you intend to use rasdaemon for monitoring these tracepoints
and that you're hence adding the enum to a uapi header.
So that if someone wonders later on why you chose a uapi header,
they will find breadcrumbs in the commit message.
I think both patches can be squashed into one.
I'm not an expert on tracepoints, so when respinning, perhaps you
could cc linux-trace-kernel@vger.kernel.org and maybe also tracing
maintainers so that subject matter experts can look the patch over
and ack it.
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM hotplug
Maybe "pci_hotplug" to differentiate this from, say, cpu hotplug?
> +#define PCI_HP_TRANS_STATE \
> + EM(PCI_HOTPLUG_LINK_UP, "Link Up") \
> + EM(PCI_HOTPLUG_LINK_DOWN, "Link Down") \
> + EM(PCI_HOTPLUG_CARD_PRESENT, "Card present") \
> + EMe(PCI_HOTPLUG_CARD_NO_PRESENT, "Card not present")
PCI_HOTPLUG_CARD_NOT_PRESENT would be neater I think.
^
> +PCI_HP_TRANS_STATE
Not sure what "trans state" stands for, maybe "state transition"?
What if we add tracepoints going forward which aren't for state
transitions but other types of events, such as "Power Failure"?
Perhaps something more generic such as "PCI_HOTPLUG_EVENT" would
be more apt?
> +enum pci_hotplug_trans_type {
> + PCI_HOTPLUG_LINK_UP,
> + PCI_HOTPLUG_LINK_DOWN,
> + PCI_HOTPLUG_CARD_PRESENT,
> + PCI_HOTPLUG_CARD_NO_PRESENT,
> +};
I note that this is called "pci_hotplug_trans_type", perhaps for
consistency use "pci_hotplug_trans_state" as everywhere else
(or whatever you choose to substitute the name for, see above).
Other than these cosmetic things, the patches LGTM.
Again, my knowledge on tracepoints is superficial, but broadly
it looks reasonable.
Thanks,
Lukas
next prev parent reply other threads:[~2024-11-13 16:27 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-12 11:58 [PATCH v2 0/2] PCI: hotplug: Add a generic RAS tracepoint for hotplug event Shuai Xue
2024-11-12 11:58 ` [PATCH v2 1/2] " Shuai Xue
2024-11-13 16:27 ` Lukas Wunner [this message]
2024-11-14 1:39 ` Shuai Xue
2024-11-12 11:58 ` [PATCH v2 2/2] pci: pciehp: Generate tracepoints " Shuai Xue
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZzTTgOoCSqfC4cVR@wunner.de \
--to=lukas@wunner.de \
--cc=bhelgaas@google.com \
--cc=bp@alien8.de \
--cc=linux-edac@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=tony.luck@intel.com \
--cc=xueshuai@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox