From: Raag Jadav <raag.jadav@intel.com>
To: Riana Tauro <riana.tauro@intel.com>
Cc: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
netdev@vger.kernel.org, aravind.iddamsetty@linux.intel.com,
anshuman.gupta@intel.com, rodrigo.vivi@intel.com,
joonas.lahtinen@linux.intel.com, kuba@kernel.org,
simona.vetter@ffwll.ch, airlied@gmail.com, pratik.bari@intel.com,
joshua.santosh.ranjan@intel.com, ashwin.kumar.kulkarni@intel.com,
shubham.kumar@intel.com, ravi.kishore.koppuravuri@intel.com,
maarten.lankhorst@linux.intel.com, mallesh.koujalagi@intel.com,
soham.purkait@intel.com,
Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>,
Lijo Lazar <lijo.lazar@amd.com>,
Hawking Zhang <Hawking.Zhang@amd.com>,
"David S. Miller" <davem@davemloft.net>,
Paolo Abeni <pabeni@redhat.com>,
Eric Dumazet <edumazet@google.com>
Subject: Re: [PATCH 1/2] drm/drm_ras: Add drm_ras netlink error event
Date: Mon, 1 Jun 2026 08:22:33 +0200
Message-ID: <ah0lKbnM9CD3JQoG@black.igk.intel.com>
In-Reply-To: <20260518112048.1746280-5-riana.tauro@intel.com>
On Mon, May 18, 2026 at 04:50:50PM +0530, Riana Tauro wrote:
> Define a new netlink event 'error-event' and a new multicast group
> 'error-notify' in drm_ras. Each event contains device name, node and
> error information to identify the error triggering the event.
>
> Add drm_ras_nl_error_event() to trigger an event from the driver.
> Userspace must subscribe to 'error-notify' to receive 'error-event'
> notifications.
>
> Usage:
>
> $ sudo ./tools/net/ynl/pyynl/cli.py --family drm_ras \
Nit: Make the leading space consistent with other patches.
> --subscribe error-notify
>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> Cc: Lijo Lazar <lijo.lazar@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> Documentation/gpu/drm-ras.rst | 21 ++++++
> Documentation/netlink/specs/drm_ras.yaml | 50 ++++++++++++++
> drivers/gpu/drm/drm_ras.c | 86 ++++++++++++++++++++++++
> drivers/gpu/drm/drm_ras_nl.c | 6 ++
> drivers/gpu/drm/drm_ras_nl.h | 4 ++
> include/drm/drm_ras.h | 5 ++
> include/uapi/drm/drm_ras.h | 15 +++++
> 7 files changed, 187 insertions(+)
>
> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
> index 83c21853b74b..5a96dde75539 100644
> --- a/Documentation/gpu/drm-ras.rst
> +++ b/Documentation/gpu/drm-ras.rst
> @@ -56,6 +56,7 @@ User space tools can:
> ``node-id`` and ``error-id`` as parameters.
> * Clear specific error counters with the ``clear-error-counter`` command, using both
> ``node-id`` and ``error-id`` as parameters.
> +* Subscribe to the ``error-notify`` multicast group to receive ``error-event`` notifications.
>
> YAML-based Interface
> --------------------
> @@ -111,3 +112,23 @@ Example: Clear an error counter for a given node
>
> sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
> None
> +
> +Example: Subscribe to ``error-notify`` multicast group
> +
> +.. code-block:: bash
> +
> + sudo ./tools/net/ynl/pyynl/cli.py --family drm_ras --output-json --subscribe error-notify
Can the ynl wrapper used in the other examples not do this? If it can,
use it here (and in the commit message) for consistency with the other
commands. If it cannot, please document that.
> +
> +.. code-block:: json
> +
> + {
> + "name": "error-event",
> + "msg": {
> + "device-name": "0000:03:00.0",
> + "node-id": 1,
> + "node-name": "uncorrectable-errors",
> + "error-id": 1,
> + "error-name": "error_name1",
> + "error-value": 1
> + }
> + }
> diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
> index e113056f8c01..d94c73a61aea 100644
> --- a/Documentation/netlink/specs/drm_ras.yaml
> +++ b/Documentation/netlink/specs/drm_ras.yaml
> @@ -69,6 +69,35 @@ attribute-sets:
> name: error-value
> type: u32
> doc: Current value of the requested error counter.
> + -
> + name: error-event-attrs
> + attributes:
> + -
> + name: device-name
> + type: string
> + doc: >-
> + Device name chosen by the driver at registration.
> + Can be a PCI BDF, UUID, or module name if unique.
> + -
> + name: node-id
Curious: can we partially reuse the existing attr-set here instead of
defining these attributes again?
> + type: u32
> + doc: Node ID of the node that triggered the event.
> + -
> + name: node-name
> + type: string
> + doc: Node name of the node that triggered the event.
> + -
> + name: error-id
> + type: u32
> + doc: Error ID of the counter that triggered the event.
> + -
> + name: error-name
> + type: string
> + doc: Name of the error that triggered the event.
> + -
> + name: error-value
> + type: u32
> + doc: Current value of the error counter.
>
> operations:
> list:
> @@ -124,3 +153,24 @@ operations:
> do:
> request:
> attributes: *id-attrs
> + -
> + name: error-event
> + doc: >-
> + Notify userspace of an error event.
> + The event includes the device, node and error information
> + of the error that triggered the event.
> + attribute-set: error-event-attrs
> + mcgrp: error-notify
> + event:
> + attributes:
> + - device-name
> + - node-id
> + - node-name
> + - error-id
> + - error-name
> + - error-value
> +
> +mcast-groups:
> + list:
> + -
> + name: error-notify
> diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
> index d6eab29a1394..6696ec21782e 100644
> --- a/drivers/gpu/drm/drm_ras.c
> +++ b/drivers/gpu/drm/drm_ras.c
> @@ -41,6 +41,11 @@
> * Userspace must provide Node ID, Error ID.
> * Clears specific error counter of a node if supported.
> *
> + * 4. ERROR_NOTIFY: Subscribe to this multicast group to receive error events
> + *
> + * 5. ERROR_EVENT: Notify userspace of an error event. The event contains device, node
> + * and error information that triggered the event.
> + *
> * Node registration:
> *
> * - drm_ras_node_register(): Registers a new node and assigns
> @@ -186,6 +191,34 @@ static int msg_reply_value(struct sk_buff *msg, u32 error_id,
> value);
> }
>
> +static int msg_put_error_event_attrs(struct sk_buff *msg, struct drm_ras_node *node,
> + u32 error_id, const char *error_name, u32 value)
> +{
> + int ret;
> +
> + ret = nla_put_string(msg, DRM_RAS_A_ERROR_EVENT_ATTRS_DEVICE_NAME, node->device_name);
> + if (ret)
> + return ret;
> +
> + ret = nla_put_u32(msg, DRM_RAS_A_ERROR_EVENT_ATTRS_NODE_ID, node->id);
> + if (ret)
> + return ret;
> +
> + ret = nla_put_string(msg, DRM_RAS_A_ERROR_EVENT_ATTRS_NODE_NAME, node->node_name);
> + if (ret)
> + return ret;
> +
> + ret = nla_put_u32(msg, DRM_RAS_A_ERROR_EVENT_ATTRS_ERROR_ID, error_id);
> + if (ret)
> + return ret;
> +
> + ret = nla_put_string(msg, DRM_RAS_A_ERROR_EVENT_ATTRS_ERROR_NAME, error_name);
> + if (ret)
> + return ret;
> +
> + return nla_put_u32(msg, DRM_RAS_A_ERROR_EVENT_ATTRS_ERROR_VALUE, value);
> +}
> +
> static int doit_reply_value(struct genl_info *info, u32 node_id,
> u32 error_id)
> {
> @@ -222,6 +255,59 @@ static int doit_reply_value(struct genl_info *info, u32 node_id,
> return genlmsg_reply(msg, info);
> }
>
> +/**
> + * drm_ras_nl_error_event() - Notify listeners of an error event
> + * @node: Node structure
> + * @error_id: ID of the error
> + * @error_name: Name of the error
> + * @value: Value associated with the error
> + * @flags: GFP flags for memory allocation
> + *
> + * Sends a notification to all listeners about an error event on a specific
> + * RAS node.
> + *
> + * Return: 0 on success, or negative errno on failure.
> + */
> +int drm_ras_nl_error_event(struct drm_ras_node *node, u32 error_id, const char *error_name,
> + u32 value, gfp_t flags)
> +{
> + struct genl_info info;
> + struct sk_buff *msg;
> + struct nlattr *hdr;
> + int err = -EMSGSIZE;
Redundant initialization, see below.
> + if (!error_name)
> + return -EINVAL;
> +
> + if (!genl_has_listeners(&drm_ras_nl_family, &init_net, DRM_RAS_NLGRP_ERROR_NOTIFY))
> + return 0;
> +
> + genl_info_init_ntf(&info, &drm_ras_nl_family, DRM_RAS_CMD_ERROR_EVENT);
> + msg = genlmsg_new(NLMSG_GOODSIZE, flags);
> + if (!msg)
> + return -ENOMEM;
> +
> + hdr = genlmsg_iput(msg, &info);
Make this part of below and return err directly.
> + if (!hdr)
> + goto err_free_msg;
> +
> + err = msg_put_error_event_attrs(msg, node, error_id, error_name, value);
> + if (err)
> + goto err_cancel;
> +
> + genlmsg_end(msg, hdr);
> + genlmsg_multicast(&drm_ras_nl_family, msg, 0, DRM_RAS_NLGRP_ERROR_NOTIFY, flags);
> + return 0;
> +
> +err_cancel:
> + genlmsg_cancel(msg, hdr);
> +err_free_msg:
> + nlmsg_free(msg);
> + return err;
> +}
> +EXPORT_SYMBOL(drm_ras_nl_error_event);
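One possible reading of the two comments above (drop the initializer and
set err where genlmsg_iput() fails) can be sketched as a userspace mock.
This is only an illustration under stated assumptions: every mock_* name
below is a hypothetical stand-in, not the kernel netlink API, and the
real function may end up structured differently.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the sk_buff; the flags steer the mock helpers. */
struct mock_msg {
	bool header_ok;	/* would the header fit? */
	int attrs_err;	/* 0, or the errno the attribute fill should return */
	bool freed;
	bool sent;
};

/* Hypothetical stand-ins for genlmsg_iput(), the attr fill, cancel, free, multicast. */
static void *mock_iput(struct mock_msg *msg) { return msg->header_ok ? msg : NULL; }
static int mock_put_attrs(struct mock_msg *msg) { return msg->attrs_err; }
static void mock_cancel(struct mock_msg *msg, void *hdr) { (void)msg; (void)hdr; }
static void mock_free(struct mock_msg *msg) { msg->freed = true; }
static void mock_multicast(struct mock_msg *msg) { msg->sent = true; }

static int mock_error_event(struct mock_msg *msg)
{
	void *hdr;
	int err;	/* no initializer: every path to the error labels assigns it */

	hdr = mock_iput(msg);
	if (!hdr) {
		/* The error code now lives next to the check that needs it. */
		err = -EMSGSIZE;
		goto err_free_msg;
	}

	err = mock_put_attrs(msg);
	if (err)
		goto err_cancel;

	mock_multicast(msg);
	return 0;

err_cancel:
	mock_cancel(msg, hdr);
err_free_msg:
	mock_free(msg);
	return err;
}
```

With this shape the compiler can flag any new error path that forgets to
set err, which a blanket initializer would hide.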
> +
> /**
> * drm_ras_nl_get_error_counter_dumpit() - Dump all Error Counters
> * @skb: Netlink message buffer
> diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
> index dea1c1b2494e..ac724bb87a3b 100644
> --- a/drivers/gpu/drm/drm_ras_nl.c
> +++ b/drivers/gpu/drm/drm_ras_nl.c
> @@ -58,6 +58,10 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
> },
> };
>
> +static const struct genl_multicast_group drm_ras_nl_mcgrps[] = {
> + [DRM_RAS_NLGRP_ERROR_NOTIFY] = { "error-notify", },
> +};
> +
> struct genl_family drm_ras_nl_family __ro_after_init = {
> .name = DRM_RAS_FAMILY_NAME,
> .version = DRM_RAS_FAMILY_VERSION,
> @@ -66,4 +70,6 @@ struct genl_family drm_ras_nl_family __ro_after_init = {
> .module = THIS_MODULE,
> .split_ops = drm_ras_nl_ops,
> .n_split_ops = ARRAY_SIZE(drm_ras_nl_ops),
> + .mcgrps = drm_ras_nl_mcgrps,
> + .n_mcgrps = ARRAY_SIZE(drm_ras_nl_mcgrps),
> };
> diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
> index a398643572a5..17e1af8cc3b3 100644
> --- a/drivers/gpu/drm/drm_ras_nl.h
> +++ b/drivers/gpu/drm/drm_ras_nl.h
> @@ -21,6 +21,10 @@ int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
> int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
> struct genl_info *info);
>
> +enum {
> + DRM_RAS_NLGRP_ERROR_NOTIFY,
> +};
> +
> extern struct genl_family drm_ras_nl_family;
>
> #endif /* _LINUX_DRM_RAS_GEN_H */
> diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
> index f2a787bc4f64..d4a275efdbb0 100644
> --- a/include/drm/drm_ras.h
> +++ b/include/drm/drm_ras.h
> @@ -78,9 +78,14 @@ struct drm_device;
> #if IS_ENABLED(CONFIG_DRM_RAS)
> int drm_ras_node_register(struct drm_ras_node *node);
> void drm_ras_node_unregister(struct drm_ras_node *node);
> +int drm_ras_nl_error_event(struct drm_ras_node *node, u32 error_id, const char *error_name,
> + u32 value, gfp_t flags);
> #else
> static inline int drm_ras_node_register(struct drm_ras_node *node) { return 0; }
> static inline void drm_ras_node_unregister(struct drm_ras_node *node) { }
> +static inline int drm_ras_nl_error_event(struct drm_ras_node *node, u32 error_id,
> + const char *error_name, u32 value, gfp_t flags)
> +{ return 0; }
> #endif
>
> #endif
> diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
> index 218a3ee86805..bb2a8a872a44 100644
> --- a/include/uapi/drm/drm_ras.h
> +++ b/include/uapi/drm/drm_ras.h
> @@ -38,13 +38,28 @@ enum {
> DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
> };
>
> +enum {
> + DRM_RAS_A_ERROR_EVENT_ATTRS_DEVICE_NAME = 1,
> + DRM_RAS_A_ERROR_EVENT_ATTRS_NODE_ID,
> + DRM_RAS_A_ERROR_EVENT_ATTRS_NODE_NAME,
> + DRM_RAS_A_ERROR_EVENT_ATTRS_ERROR_ID,
> + DRM_RAS_A_ERROR_EVENT_ATTRS_ERROR_NAME,
> + DRM_RAS_A_ERROR_EVENT_ATTRS_ERROR_VALUE,
> +
> + __DRM_RAS_A_ERROR_EVENT_ATTRS_MAX,
> + DRM_RAS_A_ERROR_EVENT_ATTRS_MAX = (__DRM_RAS_A_ERROR_EVENT_ATTRS_MAX - 1)
> +};
> +
> enum {
> DRM_RAS_CMD_LIST_NODES = 1,
> DRM_RAS_CMD_GET_ERROR_COUNTER,
> DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
> + DRM_RAS_CMD_ERROR_EVENT,
>
> __DRM_RAS_CMD_MAX,
> DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
> };
>
> +#define DRM_RAS_MCGRP_ERROR_NOTIFY "error-notify"
Where is this used?
Raag
> #endif /* _UAPI_LINUX_DRM_RAS_H */
> --
> 2.47.1
>