intel-xe.lists.freedesktop.org archive mirror
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Dave Airlie <airlied@gmail.com>
Cc: "Zhang, Hawking" <Hawking.Zhang@amd.com>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"Joonas Lahtinen" <joonas.lahtinen@linux.intel.com>,
	Simona Vetter <simona.vetter@ffwll.ch>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>,
	Lukas Wunner <lukas@wunner.de>,
	Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>,
	"Zhou1, Tao" <Tao.Zhou1@amd.com>,
	"Liu, Xiang(Dean)" <Xiang.Liu@amd.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	"John Hubbard" <jhubbard@nvidia.com>
Subject: Re: DRM_RAS for CPER Error logging?!
Date: Mon, 10 Nov 2025 15:35:27 -0500
Message-ID: <aRJMjyYLo5_rFbzP@intel.com>
In-Reply-To: <CAPM=9tybY_LECdMNH6iw5pzxtd2=Z+4vwLt-_kuMQFUaEXsdpw@mail.gmail.com>

On Mon, Nov 10, 2025 at 01:34:22PM +1000, Dave Airlie wrote:
> On Thu, 6 Nov 2025 at 23:16, Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 02:00:38AM +0000, Zhang, Hawking wrote:
> > >    [AMD Official Use Only - AMD Internal Distribution Only]
> > >    + @Zhou1, Tao and @Liu, Xiang(Dean) for the awareness.
> > >
> > >    RE - AMD folks, would you consider this to replace the current debugfs you
> > >    have?
> > >
> > >    [Hawking]:
> > >
> > >    Replacing the debugfs is not the primary concern.
> >
> > My initial plan was to go with debugfs like you are doing, but
> > I keep hearing complaints that debugfs is not global and we need
> > to take into account some cases where debugfs is not available
> > in production images.
> >
> > > The main concern is
> > >    whether drm_ras can effectively support the necessary RAS information for
> > >    all device vendors, as this largely depends on the design of the hardware
> > >    and firmware.
> >
> > I fully agree. This is the main reason I'm doing my best to make drm-ras
> > as generic and extensible as possible.
> >
> > node registration with different node types, and names.
> >
> > I imagined something like:
> >
> > [{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> >
> > based on the format that the current non-standard-cper tracefs uses, with
> > the FRU + CPER. But we could avoid the FRU and make the FRU as node name.
> >
> > >
> > >    AMD is currently evaluating the proposed interface for error logging.
> >
> > The design of the details and the implementation is pretty much open for discussion
> > at this point.
> >
> > What I'm really looking for is:
> >
> > to know if the path is acceptable overall
> > even if different drivers are opting for different node types?
> >
> > Is there any blocker on using this drm-ras/netlink for the CPER?
> 
> sorry for delay on this, I just had to read what CPER was :-)
> 
> I'm not offended by the idea of using tracefs here,

Right, that was my first thought as well.
Perhaps we simply use the

log_non_standard_event(sec_type, fru_id, fru_text, sec_sev, cper_data, cper_length)

provided directly by drivers/ras/ras.c

But one limitation with that is that it is one-way: HW/FW -> Kernel -> User Space.

There is no way for user space to query for the current/last log available.

I mean, we would only generate the CPER after passing a certain threshold, to avoid
a flood in case of a memory error storm. So, in this case, the user needs a way
to query the most recent log.

I believe it gets a bit ugly if we tell the admin that, in order to get the most
recent CPER log, they need to query the error counter through netlink, and
that on every single error counter query we also emit the tracefs event.

Then I thought about using netlink to query the CPER, but with a separate
node, exclusively for the error log, instead of abusing the error counter API.

But if you believe it is okay to emit a tracefs event on every counter check, then
we can take that path.
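
To make the push vs. query distinction concrete, here is a rough toy model of
the two paths. It is only a sketch: ErrorLogNode, report_error, query_last and
the "every Nth error" threshold rule are hypothetical names and assumptions
made up for illustration, not the drm_ras or tracefs API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ErrorLogNode:
    """Toy model of a per-device error-log node (hypothetical, not the drm_ras API)."""
    threshold: int                       # assumed rule: one record per `threshold` errors
    emit: Callable[[bytes], None]        # push path, stands in for a trace event
    count: int = 0
    last_record: Optional[bytes] = None  # most recent generated record, kept for queries

    def report_error(self, payload: bytes) -> None:
        """Called once per hardware error; a record is generated only at the threshold."""
        self.count += 1
        if self.count % self.threshold == 0:
            self.last_record = payload
            self.emit(payload)

    def query_last(self) -> Optional[bytes]:
        """Pull path, stands in for a netlink query; it never emits anything."""
        return self.last_record


pushed = []  # what a trace listener would have seen
node = ErrorLogNode(threshold=3, emit=pushed.append)
for i in range(7):
    node.report_error(f"record-{i}".encode())

print(pushed)
print(node.query_last())
```

In this model a listener that attaches late misses the pushed records, while
query_last() still returns the latest one, and repeated queries do not generate
new events.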

> I definitely think
> debugfs is a bad idea coming from the enterprise distro land where we
> don't like having it.

Yep, this is why I thought that AMD was trying to find alternatives to
their debugfs solution. But the debugfs solution does have this ability
to query...

> 
> I'm ccing a few other people that might have opinions on exposing CPER
> compatible logs for RAS purposes from devices, I assume there might be
> more than GPUs wanting to do something like this,

Thank you!

> 
> Dave.
