dri-devel.lists.freedesktop.org archive mirror
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: <dri-devel@lists.freedesktop.org>,
	<intel-xe@lists.freedesktop.org>,
	"Dave Airlie" <airlied@gmail.com>,
	Joonas Lahtinen <joonas.lahtinen@linux.intel.com>,
	Simona Vetter <simona.vetter@ffwll.ch>,
	Hawking Zhang <Hawking.Zhang@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>,
	Lukas Wunner <lukas@wunner.de>,
	"Aravind Iddamsetty" <aravind.iddamsetty@linux.intel.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>,
	Lukas Wunner <lukas@wunner.de>, "Dave Airlie" <airlied@gmail.com>,
	Simona Vetter <simona.vetter@ffwll.ch>,
	"Aravind Iddamsetty" <aravind.iddamsetty@linux.intel.com>,
	Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Subject: DRM_RAS for CPER Error logging?!
Date: Tue, 28 Oct 2025 15:13:15 -0400
Message-ID: <aQEVy1qjaDCwL_cc@intel.com>
In-Reply-To: <20250929214415.326414-4-rodrigo.vivi@intel.com>

On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:

Hey Dave, Sima, AMD folks, Qualcomm folks,

I have a key question for you below.

> This work is a continuation of the great work started by Aravind ([1] and [2])
> in order to fulfill the RAS requirements and proposal as previously discussed
> and agreed in the Linux Plumbers accelerator's bof of 2022 [3].
> 
> [1]: https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.iddamsetty@linux.intel.com/
> [2]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
> [3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
> 
> During the past review round, Lukas pointed out that netlink had evolved
> in parallel during these years and that now, any new usage of netlink families
> would require the usage of the YAML description and scripts.
> 
> With this new requirement in place, the family name is hardcoded in the yaml file,
> so we are forced to have a single family name for the entire drm, and then we now
> we are forced to have a registration.
> 
> So, while doing the registration, we now created the concept of drm-ras-node.
> For now the only node type supported is the agreed error-counter. But that could
> be expanded for other cases like telemetry, requested by Zack for the qualcomm accel
> driver.
> 
> In this first version, only querying counter is supported. But also this is expandable
> to future introduction of multicast notification and also clearing the counters.
> 
> This design with multiple nodes per device is already flexible enough for driver
> to decide if it wants to handle error per device, or per IP block, or per error
> category. I believe this fully attend to the requested AMD feedback in the earlier
> reviews.
> 
> So, my proposal is to start simple with this case as is, and then iterate over
> with the drm-ras in tree so we evolve together according to various driver's RAS
> needs.
> 
> I have provided a documentation and the first Xe implementation of the counter
> as reference.
> 
> Also, it is worth to mention that we have a in-tree pyynl/cli.py tool that entirely
> exercises this new API, hence I hope this can be the reference code for the uAPI
> usage, while we continue with the plan of introducing IGT tests and tools for this
> and adjusting the internal vendor tools to open with open source developments and
> changing them to support these flows.
> 
> Example on MTL:
> 
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>   --spec Documentation/netlink/specs/drm_ras.yaml \
>   --dump list-nodes
> [{'device-name': '00:02.0',
>   'node-id': 0,
>   'node-name': 'non-fatal',
>   'node-type': 'error-counter'},
>  {'device-name': '00:02.0',
>   'node-id': 1,
>   'node-name': 'correctable',
>   'node-type': 'error-counter'}]

As you can see in the drm-ras patch, we now have a single family called
'drm-ras'. Drivers register entry points, called 'nodes', under that family,
and for now only one node type exists: 'error-counter', as I believe was
agreed in the Linux Plumbers accelerators BoF of 2022 [3].
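
To make the node model concrete, here is a toy userspace model in Python. It is
not the kernel code: the class and method names are invented for illustration,
and allocating node ids from one global counter is an assumption. Only the
attribute names mirror the list-nodes output quoted above.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class RasNode:
    device_name: str
    node_id: int
    node_name: str
    node_type: str


class RasFamily:
    """Toy stand-in for the single 'drm-ras' family (hypothetical names)."""

    KNOWN_TYPES = {"error-counter"}  # the only node type defined so far

    def __init__(self):
        self._nodes = {}
        self._next_id = count()  # assumption: ids are global, not per device

    def register_node(self, device_name, node_name, node_type):
        if node_type not in self.KNOWN_TYPES:
            raise ValueError(f"unsupported node type: {node_type}")
        node = RasNode(device_name, next(self._next_id), node_name, node_type)
        self._nodes[node.node_id] = node
        return node.node_id

    def list_nodes(self):
        # Same attribute names as the list-nodes dump shown earlier.
        return [
            {"device-name": n.device_name, "node-id": n.node_id,
             "node-name": n.node_name, "node-type": n.node_type}
            for n in self._nodes.values()
        ]


family = RasFamily()
family.register_node("00:02.0", "non-fatal", "error-counter")
family.register_node("00:02.0", "correctable", "error-counter")
print(family.list_nodes())
```

Adding a new node type in this model would only mean extending KNOWN_TYPES,
which is the kind of extension the question below is about.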

Zack already indicated that for Qualcomm he doesn't need the error counters,
but another type, perhaps telemetry.

I need your feedback and input on yet another case here that goes side
by side with error-counters: Error logging.

One of our RAS requirements is to emit CPER logs in certain cases. AMD
currently uses debugfs to print the CPER entries that accumulate in a
ring buffer (iiuc).

Some folks are asking us to emit the CPER records through tracefs, because
debugfs might not be available in some enterprise production images.

However, there is a concern with using tracefs for the error-logging case:
tracefs has no active query path. If a user needs to poll for the latest
CPER records, they would need to piggyback on some other API that forces
the emit-trace(cper).

I believe the cleanest way is to add another drm-ras node type named
'error-logging' with a single operation, query-logs, that dumps the
available ring buffer with the latest known CPER records. Is this
acceptable?
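
As a rough illustration of the semantics I have in mind (a userspace toy in
Python, not the proposed kernel code), a bounded ring keeps only the newest
records and a dump returns a snapshot of whatever is still there. The class
name, the capacity, and the choice of a non-destructive dump are all
assumptions made for this sketch.

```python
from collections import deque


class CperRing:
    """Bounded log: once full, each new record evicts the oldest one."""

    def __init__(self, capacity):
        self._records = deque(maxlen=capacity)

    def push(self, fru, cper):
        self._records.append({"FRU": fru, "CPER": cper})

    def dump(self):
        # Snapshot, oldest first. Assumption: dumping does not clear the ring.
        return list(self._records)


ring = CperRing(capacity=3)
for i in range(5):
    ring.push("00:02.0", bytes([i]))  # placeholder payloads, not real CPER

print([record["CPER"] for record in ring.dump()])
```

With a capacity of 3 and five pushes, only the last three payloads survive,
which is exactly why a user polling too slowly could miss records.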

AMD folks, would you consider this to replace the current debugfs you
have?

Please let me know your thoughts.

We don't have a working example yet, but it would look something like this:

$ sudo ./tools/net/ynl/pyynl/cli.py \
  --spec Documentation/netlink/specs/drm_ras.yaml \
  --dump list-nodes
[{'device-name': '00:02.0',
  'node-id': 0,
  'node-name': 'non-fatal',
  'node-type': 'error-counter'},
 {'device-name': '00:02.0',
  'node-id': 1,
  'node-name': 'correctable',
  'node-type': 'error-counter'},
 {'device-name': '00:02.0',
  'node-id': 2,
  'node-name': 'non-fatal',
  'node-type': 'error-logging'},
 {'device-name': '00:02.0',
  'node-id': 3,
  'node-name': 'correctable',
  'node-type': 'error-logging'}]

$ sudo ./tools/net/ynl/pyynl/cli.py \
  --spec Documentation/netlink/specs/drm_ras.yaml \
  --dump get-logs --json '{"node-id":3}'
[{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
 {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
 {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
 {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
 {'FRU': 'String with device info', 'CPER': !@#$#!@#$}]

Of course, the details of the error-logging fields, along with the CPER
binary, are yet to be defined.

Oh, and the node names and the split are device specific. The infra is
flexible enough that each driver can do whatever makes sense for its device.

Any feedback or comment is really appreciated.

Thanks in advance,
Rodrigo.

> 
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>   --spec Documentation/netlink/specs/drm_ras.yaml \
>   --dump get-error-counters --json '{"node-id":1}'
> [{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},
>  {'error-id': 4, 'error-name': 'Display Error', 'error-value': 0},
>  {'error-id': 8, 'error-name': 'GSC Error', 'error-value': 0},
>  {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0},
>  {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0},
>  {'error-id': 17, 'error-name': 'CSC Error', 'error-value': 0}]
> 
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>   --spec Documentation/netlink/specs/drm_ras.yaml \
>   --do query-error-counter --json '{"node-id": 0, "error-id": 12}'
> {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0}
> 
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>   --spec Documentation/netlink/specs/drm_ras.yaml \
>   --do query-error-counter --json '{"node-id": 1, "error-id": 16}'
> {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0}
> 
> Thanks,
> Rodrigo.
> 
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> Cc: Lukas Wunner <lukas@wunner.de>
> Cc: Dave Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona.vetter@ffwll.ch>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> 
> Rodrigo Vivi (2):
>   drm/ras: Introduce the DRM RAS infrastructure over generic netlink
>   drm/xe: Introduce the usage of drm_ras with supported HW errors
> 
>  Documentation/gpu/drm-ras.rst              | 109 +++++++
>  Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++++
>  drivers/gpu/drm/Kconfig                    |   9 +
>  drivers/gpu/drm/Makefile                   |   1 +
>  drivers/gpu/drm/drm_drv.c                  |   6 +
>  drivers/gpu/drm/drm_ras.c                  | 357 +++++++++++++++++++++
>  drivers/gpu/drm/drm_ras_genl_family.c      |  42 +++
>  drivers/gpu/drm/drm_ras_nl.c               |  54 ++++
>  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  22 ++
>  drivers/gpu/drm/xe/xe_hw_error.c           | 155 ++++++++-
>  include/drm/drm_ras.h                      |  76 +++++
>  include/drm/drm_ras_genl_family.h          |  17 +
>  include/drm/drm_ras_nl.h                   |  24 ++
>  include/uapi/drm/drm_ras.h                 |  49 +++
>  14 files changed, 1049 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/gpu/drm-ras.rst
>  create mode 100644 Documentation/netlink/specs/drm_ras.yaml
>  create mode 100644 drivers/gpu/drm/drm_ras.c
>  create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
>  create mode 100644 drivers/gpu/drm/drm_ras_nl.c
>  create mode 100644 include/drm/drm_ras.h
>  create mode 100644 include/drm/drm_ras_genl_family.h
>  create mode 100644 include/drm/drm_ras_nl.h
>  create mode 100644 include/uapi/drm/drm_ras.h
> 
> -- 
> 2.51.0
> 
