* Re: DRM_RAS for CPER Error logging?!
[not found] ` <aQEVy1qjaDCwL_cc@intel.com>
@ 2025-10-30 14:47 ` Rodrigo Vivi
2025-10-30 15:37 ` DRM_RAS (netlink genl family) " Rodrigo Vivi
2025-10-31 5:38 ` DRM_RAS " Lukas Wunner
0 siblings, 2 replies; 7+ messages in thread
From: Rodrigo Vivi @ 2025-10-30 14:47 UTC (permalink / raw)
To: dri-devel, intel-xe, Dave Airlie, Joonas Lahtinen, Simona Vetter,
Hawking Zhang, Alex Deucher, Zack McKevitt, Lukas Wunner,
Aravind Iddamsetty, netdev, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
On Tue, Oct 28, 2025 at 03:13:15PM -0400, Rodrigo Vivi wrote:
> On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:
>
> Hey Dave, Sima, AMD folks, Qualcomm folks,
+ Netlink list and maintainers to get some feedback on the netlink usage
proposed here.
Especially to check whether there is any concern with a CPER blob going through
netlink, or any size limitation.
>
> I have a key question to you below here.
>
> > This work is a continuation of the great work started by Aravind ([1] and [2])
> > in order to fulfill the RAS requirements and proposal as previously discussed
> > and agreed in the Linux Plumbers accelerator's bof of 2022 [3].
> >
> > [1]: https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.iddamsetty@linux.intel.com/
> > [2]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
> > [3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
> >
> > During the past review round, Lukas pointed out that netlink had evolved
> > in parallel during these years and that now, any new usage of netlink families
> > would require the usage of the YAML description and scripts.
> >
> > With this new requirement in place, the family name is hardcoded in the yaml file,
> > so we are forced to have a single family name for the entire drm, and we are now
> > also forced to have a registration.
> >
> > So, while doing the registration, we now created the concept of drm-ras-node.
> > For now the only node type supported is the agreed error-counter. But that could
> > be expanded for other cases like telemetry, requested by Zack for the qualcomm accel
> > driver.
> >
> > In this first version, only querying counter is supported. But also this is expandable
> > to future introduction of multicast notification and also clearing the counters.
> >
> > This design with multiple nodes per device is already flexible enough for driver
> > to decide if it wants to handle error per device, or per IP block, or per error
> > category. I believe this fully addresses the feedback AMD gave in the earlier
> > reviews.
> >
> > So, my proposal is to start simple with this case as is, and then iterate over
> > with the drm-ras in tree so we evolve together according to various driver's RAS
> > needs.
> >
> > I have provided a documentation and the first Xe implementation of the counter
> > as reference.
> >
> > Also, it is worth mentioning that we have an in-tree pyynl/cli.py tool that entirely
> > exercises this new API, hence I hope this can be the reference code for the uAPI
> > usage, while we continue with the plan of introducing IGT tests and tools for this
> > and adjusting the internal vendor tools to open with open source developments and
> > changing them to support these flows.
> >
> > Example on MTL:
> >
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
> > --spec Documentation/netlink/specs/drm_ras.yaml \
> > --dump list-nodes
> > [{'device-name': '00:02.0',
> > 'node-id': 0,
> > 'node-name': 'non-fatal',
> > 'node-type': 'error-counter'},
> > {'device-name': '00:02.0',
> > 'node-id': 1,
> > 'node-name': 'correctable',
> > 'node-type': 'error-counter'}]
>
> As you can see on the drm-ras patch, we now have only a single family called
> 'drm-ras', with that we have to register entry points, called 'nodes'
> and for now only one type exists: 'error-counter'
>
> As I believe it was agreed in the Linux Plumbers accelerator's bof of 2022 [3].
>
> Zack already indicated that for Qualcomm he doesn't need the error counters,
> but another type, perhaps telemetry.
>
> I need your feedback and input on yet another case here that goes side
> by side with error-counters: Error logging.
>
> One of the RAS requirements that we have is to emit CPER logs in certain
> cases. AMD is currently using debugfs for printing the CPER entries that
> accumulate in a ring buffer (iiuc).
>
> Some folks are asking us to emit the CPER in the tracefs because
> debugfs might not be available in some enterprise production images.
>
> However, there's a concern on the tracefs usage for the error-logging case.
> There is no active query path in the tracefs. If a user needs to poll for
> the latest CPER records, they would need to piggyback on some other API
> that would force the emit-trace(cper).
>
> I believe that the cleanest way is to have another drm-ras node type
> named 'error-logging' with a single operation that is query-logs,
> that would be a dump of the available ring-buffer with latest known
> cper records. Is this acceptable?
>
> AMD folks, would you consider this to replace the current debugfs you
> have?
>
> Please let me know your thoughts.
>
> We won't have an example for now, but it would be something like:
>
> Thanks,
> Rodrigo.
>
> $ sudo ./tools/net/ynl/pyynl/cli.py \
> --spec Documentation/netlink/specs/drm_ras.yaml \
> --dump list-nodes
> [{'device-name': '00:02.0',
> 'node-id': 0,
> 'node-name': 'non-fatal',
> 'node-type': 'error-counter'},
> {'device-name': '00:02.0',
> 'node-id': 1,
> 'node-name': 'correctable',
> 'node-type': 'error-counter'},
> {'device-name': '00:02.0',
> 'node-id': 2,
> 'node-name': 'non-fatal',
> 'node-type': 'error-logging'},
> {'device-name': '00:02.0',
> 'node-id': 3,
> 'node-name': 'correctable',
> 'node-type': 'error-logging'}]
>
> $ sudo ./tools/net/ynl/pyynl/cli.py \
> --spec Documentation/netlink/specs/drm_ras.yaml \
> --dump get-logs --json '{"node-id":3}'
> [{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> ]
>
> Of course, the details of the error-logging fields, along with the CPER binary,
> are yet to be defined.
>
> Oh, and the node names and split are device specific. The infra is flexible
> enough. A driver can do whatever makes sense for its device.
>
> Any feedback or comment is really appreciated.
>
> Thanks in advance,
> Rodrigo.
>
> >
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
> > --spec Documentation/netlink/specs/drm_ras.yaml \
> > --dump get-error-counters --json '{"node-id":1}'
> > [{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},
> > {'error-id': 4, 'error-name': 'Display Error', 'error-value': 0},
> > {'error-id': 8, 'error-name': 'GSC Error', 'error-value': 0},
> > {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0},
> > {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0},
> > {'error-id': 17, 'error-name': 'CSC Error', 'error-value': 0}]
> >
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
> > --spec Documentation/netlink/specs/drm_ras.yaml \
> > --do query-error-counter --json '{"node-id": 0, "error-id": 12}'
> > {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0}
> >
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
> > --spec Documentation/netlink/specs/drm_ras.yaml \
> > --do query-error-counter --json '{"node-id": 1, "error-id": 16}'
> > {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0}
> >
> > Thanks,
> > Rodrigo.
> >
> > Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> > Cc: Alex Deucher <alexander.deucher@amd.com>
> > Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> > Cc: Lukas Wunner <lukas@wunner.de>
> > Cc: Dave Airlie <airlied@gmail.com>
> > Cc: Simona Vetter <simona.vetter@ffwll.ch>
> > Cc: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> >
> > Rodrigo Vivi (2):
> > drm/ras: Introduce the DRM RAS infrastructure over generic netlink
> > drm/xe: Introduce the usage of drm_ras with supported HW errors
> >
> > Documentation/gpu/drm-ras.rst | 109 +++++++
> > Documentation/netlink/specs/drm_ras.yaml | 130 ++++++++
> > drivers/gpu/drm/Kconfig | 9 +
> > drivers/gpu/drm/Makefile | 1 +
> > drivers/gpu/drm/drm_drv.c | 6 +
> > drivers/gpu/drm/drm_ras.c | 357 +++++++++++++++++++++
> > drivers/gpu/drm/drm_ras_genl_family.c | 42 +++
> > drivers/gpu/drm/drm_ras_nl.c | 54 ++++
> > drivers/gpu/drm/xe/regs/xe_hw_error_regs.h | 22 ++
> > drivers/gpu/drm/xe/xe_hw_error.c | 155 ++++++++-
> > include/drm/drm_ras.h | 76 +++++
> > include/drm/drm_ras_genl_family.h | 17 +
> > include/drm/drm_ras_nl.h | 24 ++
> > include/uapi/drm/drm_ras.h | 49 +++
> > 14 files changed, 1049 insertions(+), 2 deletions(-)
> > create mode 100644 Documentation/gpu/drm-ras.rst
> > create mode 100644 Documentation/netlink/specs/drm_ras.yaml
> > create mode 100644 drivers/gpu/drm/drm_ras.c
> > create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
> > create mode 100644 drivers/gpu/drm/drm_ras_nl.c
> > create mode 100644 include/drm/drm_ras.h
> > create mode 100644 include/drm/drm_ras_genl_family.h
> > create mode 100644 include/drm/drm_ras_nl.h
> > create mode 100644 include/uapi/drm/drm_ras.h
> >
> > --
> > 2.51.0
> >
* Re: DRM_RAS (netlink genl family) for CPER Error logging?!
2025-10-30 14:47 ` DRM_RAS for CPER Error logging?! Rodrigo Vivi
@ 2025-10-30 15:37 ` Rodrigo Vivi
2025-10-31 5:38 ` DRM_RAS " Lukas Wunner
1 sibling, 0 replies; 7+ messages in thread
From: Rodrigo Vivi @ 2025-10-30 15:37 UTC (permalink / raw)
To: dri-devel, intel-xe, Dave Airlie, Joonas Lahtinen, Simona Vetter,
Hawking Zhang, Alex Deucher, Zack McKevitt, Lukas Wunner,
Aravind Iddamsetty, netdev, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
On Thu, Oct 30, 2025 at 10:47:18AM -0400, Rodrigo Vivi wrote:
> On Tue, Oct 28, 2025 at 03:13:15PM -0400, Rodrigo Vivi wrote:
> > On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:
> >
> > Hey Dave, Sima, AMD folks, Qualcomm folks,
>
> + Netlink list and maintainers to get some feedback on the netlink usage
> proposed here.
The netdev mailing list blocked my bounces of the original discussions,
so for the overall context:
Usage of netlink as a drm-ras solution (with error counters in mind):
https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
Proposal for error-counters with drm-ras generic netlink:
https://lore.kernel.org/dri-devel/20250929214415.326414-4-rodrigo.vivi@intel.com/
Question about the error-logging RAS sub-case with CPER over this drm-ras netlink:
https://lore.kernel.org/dri-devel/aQEVy1qjaDCwL_cc@intel.com/
>
> Specially to check if there's any concern with CPER blob going through
> netlink or if there's any size limitation or concern.
* Re: DRM_RAS for CPER Error logging?!
2025-10-30 14:47 ` DRM_RAS for CPER Error logging?! Rodrigo Vivi
2025-10-30 15:37 ` DRM_RAS (netlink genl family) " Rodrigo Vivi
@ 2025-10-31 5:38 ` Lukas Wunner
2025-11-06 13:08 ` Rodrigo Vivi
1 sibling, 1 reply; 7+ messages in thread
From: Lukas Wunner @ 2025-10-31 5:38 UTC (permalink / raw)
To: Rodrigo Vivi
Cc: dri-devel, intel-xe, Dave Airlie, Joonas Lahtinen, Simona Vetter,
Hawking Zhang, Alex Deucher, Zack McKevitt, Aravind Iddamsetty,
netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman
On Thu, Oct 30, 2025 at 10:47:18AM -0400, Rodrigo Vivi wrote:
> On Tue, Oct 28, 2025 at 03:13:15PM -0400, Rodrigo Vivi wrote:
> > On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:
> >
> > Hey Dave, Sima, AMD folks, Qualcomm folks,
>
> + Netlink list and maintainers to get some feedback on the netlink usage
> proposed here.
>
> Specially to check if there's any concern with CPER blob going through
> netlink or if there's any size limitation or concern.
How large are those blobs? If the netlink message exceeds PAGE_SIZE
because of the CPER blob, a workaround might be to attach it to the
skb as fragments with skb_add_rx_frag().
Thanks,
Lukas
* Re: DRM_RAS for CPER Error logging?!
2025-10-31 5:38 ` DRM_RAS " Lukas Wunner
@ 2025-11-06 13:08 ` Rodrigo Vivi
0 siblings, 0 replies; 7+ messages in thread
From: Rodrigo Vivi @ 2025-11-06 13:08 UTC (permalink / raw)
To: Lukas Wunner
Cc: dri-devel, intel-xe, Dave Airlie, Joonas Lahtinen, Simona Vetter,
Hawking Zhang, Alex Deucher, Zack McKevitt, Aravind Iddamsetty,
netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman
On Fri, Oct 31, 2025 at 06:38:57AM +0100, Lukas Wunner wrote:
> On Thu, Oct 30, 2025 at 10:47:18AM -0400, Rodrigo Vivi wrote:
> > On Tue, Oct 28, 2025 at 03:13:15PM -0400, Rodrigo Vivi wrote:
> > > On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:
> > >
> > > Hey Dave, Sima, AMD folks, Qualcomm folks,
> >
> > + Netlink list and maintainers to get some feedback on the netlink usage
> > proposed here.
> >
> > Specially to check if there's any concern with CPER blob going through
> > netlink or if there's any size limitation or concern.
>
> How large are those blobs?
The honest answer is: I don't know!
By spec there is no size limit, but since CPER is generally made
for FW storage, the records are usually not very big.
From what I could see, the usual max seems to be around 64Kb, but
for our case we are looking at something much smaller than that.
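
To put rough numbers on the size question, here is a minimal sketch of the
attribute arithmetic. It assumes that "64Kb" means 64 KiB and that PAGE_SIZE
is 4096 (it is architecture dependent); the helper names are made up for
illustration. The only uAPI facts used are that struct nlattr is a 4-byte
header whose 16-bit nla_len field counts the header too, and that attributes
are padded to 4 bytes.

```python
# Constants mirroring the netlink uAPI layout of struct nlattr
# (two 16-bit fields: nla_len, nla_type).
NLA_HDRLEN = 4
NLA_ALIGNTO = 4
NLA_LEN_MAX = 0xFFFF  # nla_len is 16 bits wide and includes the header


def nla_total_size(payload_len: int) -> int:
    """Bytes one attribute occupies in a message: header + payload, padded to 4."""
    return (NLA_HDRLEN + payload_len + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def fits_in_one_attr(payload_len: int) -> bool:
    """True if header + payload can be expressed in the 16-bit nla_len field."""
    return NLA_HDRLEN + payload_len <= NLA_LEN_MAX


def pages_needed(payload_len: int, page_size: int = 4096) -> int:
    """Number of page-sized chunks needed to hold the payload (ceiling division)."""
    return -(-payload_len // page_size)


if __name__ == "__main__":
    # Hypothetical blob sizes: a small record, one page, and the 64 KiB upper end.
    for size in (256, 4096, 64 * 1024):
        print(size, nla_total_size(size), fits_in_one_attr(size), pages_needed(size))
```

Under those assumptions, a full 64 KiB record does not fit in one plain
attribute at all, because the length field tops out at 65535 bytes including
the header, so the larger records would need to be split or carried some
other way.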
> If the netlink message exceeds PAGE_SIZE
> because of the CPER blob, a workaround might be to attach it to the
> skb as fragments with skb_add_rx_frag().
Yeap, I imagined that there should be a way.
Thank you
>
> Thanks,
>
> Lukas
* Re: [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS
[not found] ` <c8caad3b-d7b9-4e0c-8d90-5b2bc576cabf@oss.qualcomm.com>
@ 2025-11-06 13:42 ` Rodrigo Vivi
2025-11-07 20:20 ` Zack McKevitt
0 siblings, 1 reply; 7+ messages in thread
From: Rodrigo Vivi @ 2025-11-06 13:42 UTC (permalink / raw)
To: Zack McKevitt, netdev, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: dri-devel, intel-xe, Hawking Zhang, Alex Deucher, Lukas Wunner,
Dave Airlie, Simona Vetter, Aravind Iddamsetty, Joonas Lahtinen
On Thu, Oct 02, 2025 at 02:38:47PM -0600, Zack McKevitt wrote:
> I think this looks good, adding telemetry functionality as a node type and
> in the yaml spec looks straightforward (despite some potential naming
> awkwardness with the RAS module). Thanks for adding this.
>
> Have you considered how this might work for containerized workloads?
From the use cases that we have, we are already expecting network=host,
so there shouldn't be any problem for this usage.
> Specifically, I think it would be best if the underlying drm_ras nodes are
> only accessible for containerized workloads where the device has been
> explicitly passed in. Do you know if this is handled automatically with the
> existing netlink implementation? I imagine that this would be of interest to
> the broader community outside of Qualcomm as well.
My understanding is that it is. But adding the netlink mailing list and maintainers
here for more specialized eyes.
>
> > Also, it is worth mentioning that we have an in-tree pyynl/cli.py tool that entirely
> > exercises this new API, hence I hope this can be the reference code for the uAPI
> > usage, while we continue with the plan of introducing IGT tests and tools for this
> > and adjusting the internal vendor tools to open with open source developments and
> > changing them to support these flows.
>
> I think it would be nice to see some accompanying userspace code that makes
> use of this implementation to have as a reference if at all possible.
We have some folks working on the userspace tools, but I just realized that
perhaps we don't even need that, and we could simply use the
kernel-tools/ynl as the official drm-ras consumer?
$ sudo ynl --family drm_ras --dump list-nodes
[{'device-name': '00:02.0',
'node-id': 0,
'node-name': 'non-fatal',
'node-type': 'error-counter'},
{'device-name': '00:02.0',
'node-id': 1,
'node-name': 'correctable',
'node-type': 'error-counter'}]
thoughts?
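
For what a consumer of this dump might do with it, a minimal sketch follows.
It assumes the list-nodes reply has already been decoded into a list of
dicts shaped like the output above; the helper name node_ids_by_type is
hypothetical and not part of ynl.

```python
from collections import defaultdict

# Sample copied from the list-nodes dump shown above.
nodes = [
    {'device-name': '00:02.0', 'node-id': 0,
     'node-name': 'non-fatal', 'node-type': 'error-counter'},
    {'device-name': '00:02.0', 'node-id': 1,
     'node-name': 'correctable', 'node-type': 'error-counter'},
]


def node_ids_by_type(entries):
    """Group node ids by (device-name, node-type) so a tool can pick what to query."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[(entry['device-name'], entry['node-type'])].append(entry['node-id'])
    return dict(grouped)


if __name__ == "__main__":
    for (device, node_type), ids in node_ids_by_type(nodes).items():
        print(device, node_type, ids)
```

Each node id found this way would then be the "node-id" argument of a
follow-up query for that node.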
>
> As a side note, I will be on vacation for a couple of weeks as of this
> weekend and my response time will be affected.
Thank you,
Please let me know if you have further thoughts here, or if you see any blocker
or an ack to move forward with this path.
Thanks,
Rodrigo.
>
> Thanks,
>
> Zack
* Re: [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS
2025-11-06 13:42 ` [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS Rodrigo Vivi
@ 2025-11-07 20:20 ` Zack McKevitt
2025-11-08 3:01 ` Rodrigo Vivi
0 siblings, 1 reply; 7+ messages in thread
From: Zack McKevitt @ 2025-11-07 20:20 UTC (permalink / raw)
To: Rodrigo Vivi, netdev, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: dri-devel, intel-xe, Hawking Zhang, Alex Deucher, Lukas Wunner,
Dave Airlie, Simona Vetter, Aravind Iddamsetty, Joonas Lahtinen
On 11/6/2025 6:42 AM, Rodrigo Vivi wrote:
>>
>>> Also, it is worth mentioning that we have an in-tree pyynl/cli.py tool that entirely
>>> exercises this new API, hence I hope this can be the reference code for the uAPI
>>> usage, while we continue with the plan of introducing IGT tests and tools for this
>>> and adjusting the internal vendor tools to open with open source developments and
>>> changing them to support these flows.
>>
>> I think it would be nice to see some accompanying userspace code that makes
>> use of this implementation to have as a reference if at all possible.
>
> We have some folks working on the userspace tools, but I just realized that
> perhaps we don't even need that, and we could simply use the
> kernel-tools/ynl as the official drm-ras consumer?
>
> $ sudo ynl --family drm_ras --dump list-nodes
> [{'device-name': '00:02.0',
> 'node-id': 0,
> 'node-name': 'non-fatal',
> 'node-type': 'error-counter'},
> {'device-name': '00:02.0',
> 'node-id': 1,
> 'node-name': 'correctable',
> 'node-type': 'error-counter'}]
>
> thoughts?
>
I think this is probably ok for demonstrating this patch's
functionality, but some userspace code would be helpful as a reference
for applications that might want to integrate this directly instead of
relying on CLI tools.
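
As an illustration of what direct integration involves at the lowest level,
below is a minimal sketch that only builds the bytes of the generic netlink
CTRL_CMD_GETFAMILY request an application would send first to resolve the
family id. The family name "drm-ras" is taken from this thread and is an
assumption about what the final spec registers; the function names are
hypothetical. Nothing is sent here: actually talking to the kernel would
need an AF_NETLINK socket bound to NETLINK_GENERIC plus reply parsing, which
this sketch leaves out.

```python
import struct

# Values from the netlink / generic netlink uAPI headers.
NLM_F_REQUEST = 0x01
NLM_F_ACK = 0x04
GENL_ID_CTRL = 0x10        # fixed message type of the genetlink controller
CTRL_CMD_GETFAMILY = 3
CTRL_ATTR_FAMILY_NAME = 2
NLMSG_HDRLEN = 16


def nlattr(attr_type: int, payload: bytes) -> bytes:
    """Encode one attribute: 16-bit length, 16-bit type, payload, pad to 4 bytes."""
    length = 4 + len(payload)          # nla_len covers header + payload, not padding
    pad = (-length) % 4
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * pad


def getfamily_request(family_name: str, seq: int = 1) -> bytes:
    """Build nlmsghdr + genlmsghdr + CTRL_ATTR_FAMILY_NAME for a family lookup."""
    genl_hdr = struct.pack("=BBH", CTRL_CMD_GETFAMILY, 1, 0)  # cmd, version, reserved
    attrs = nlattr(CTRL_ATTR_FAMILY_NAME, family_name.encode() + b"\0")
    body = genl_hdr + attrs
    nl_hdr = struct.pack("=IHHII", NLMSG_HDRLEN + len(body), GENL_ID_CTRL,
                         NLM_F_REQUEST | NLM_F_ACK, seq, 0)
    return nl_hdr + body


if __name__ == "__main__":
    msg = getfamily_request("drm-ras")  # assumed family name, see note above
    print(len(msg), msg.hex())
```

Everything after this step (family id, operations, attribute numbers) comes
from the YAML spec, which is why generated or spec-driven code is usually
preferable to hand-packing messages like this.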
>>
>> As a side note, I will be on vacation for a couple of weeks as of this
>> weekend and my response time will be affected.
>
> Thank you,
> Please let me know if you have further thoughts here, or if you see any blocker
> or an ack to move forward with this path.
>
> Thanks,
> Rodrigo.
>
No further thoughts on the patch contents, I think it looks good. I see
that Jakub posted some TODOs while I was away, so I assume there will be
another iteration that I will take a look at if/when that comes in.
>>
>> Thanks,
>>
>> Zack
* Re: [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS
2025-11-07 20:20 ` Zack McKevitt
@ 2025-11-08 3:01 ` Rodrigo Vivi
0 siblings, 0 replies; 7+ messages in thread
From: Rodrigo Vivi @ 2025-11-08 3:01 UTC (permalink / raw)
To: Zack McKevitt
Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, dri-devel, intel-xe, Hawking Zhang,
Alex Deucher, Lukas Wunner, Dave Airlie, Simona Vetter,
Aravind Iddamsetty, Joonas Lahtinen
On Fri, Nov 07, 2025 at 01:20:03PM -0700, Zack McKevitt wrote:
>
>
> On 11/6/2025 6:42 AM, Rodrigo Vivi wrote:
> > >
> > > > Also, it is worth mentioning that we have an in-tree pyynl/cli.py tool that entirely
> > > > exercises this new API, hence I hope this can be the reference code for the uAPI
> > > > usage, while we continue with the plan of introducing IGT tests and tools for this
> > > > and adjusting the internal vendor tools to open with open source developments and
> > > > changing them to support these flows.
> > >
> > > I think it would be nice to see some accompanying userspace code that makes
> > > use of this implementation to have as a reference if at all possible.
> >
> > We have some folks working on the userspace tools, but I just realized that
> > perhaps we don't even need that, and we could simply use the
> > kernel-tools/ynl as the official drm-ras consumer?
> >
> > $ sudo ynl --family drm_ras --dump list-nodes
> > [{'device-name': '00:02.0',
> > 'node-id': 0,
> > 'node-name': 'non-fatal',
> > 'node-type': 'error-counter'},
> > {'device-name': '00:02.0',
> > 'node-id': 1,
> > 'node-name': 'correctable',
> > 'node-type': 'error-counter'}]
> >
> > thoughts?
> >
>
> I think this is probably ok for demonstrating this patch's functionality,
> but some userspace code would be helpful as a reference for applications
> that might want to integrate this directly instead of relying on CLI tools.
It makes sense. So let's continue to have some IGT tool for this.
>
> > >
> > > As a side note, I will be on vacation for a couple of weeks as of this
> > > weekend and my response time will be affected.
> >
> > Thank you,
> > Please let me know if you have further thoughts here, or if you see any blocker
> > or an ack to move forward with this path.
> >
> > Thanks,
> > Rodrigo.
> >
>
> No further thoughts on the patch contents, I think it looks good. I see that
> Jakub posted some TODOs while I was away, so I assume there will be another
> iteration that I will take a look at if/when that comes in.
Yes, but the changes in the error counter are not that big: just some better iteration,
small fixes, and a fixed driver API regarding the error ID and error string.
>
> > >
> > > Thanks,
> > >
> > > Zack