dri-devel Archive on lore.kernel.org
* Re: [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
       [not found] ` <20260311102913.3387468-7-riana.tauro@intel.com>
@ 2026-03-25 12:40   ` Raag Jadav
  0 siblings, 0 replies; 9+ messages in thread
From: Raag Jadav @ 2026-03-25 12:40 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, netdev, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, anvesh.bakwad,
	maarten.lankhorst, Jakub Kicinski, Zack McKevitt, Lijo Lazar,
	Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet

On Wed, Mar 11, 2026 at 03:59:15PM +0530, Riana Tauro wrote:
> Introduce a new 'clear-error-counter' DRM RAS command to reset the counter
> value for a specific error counter of a given node.
> 
> The command is a 'do' netlink request with 'node-id' and 'error-id'
> as parameters with no additional response payload.
> 
> Usage

Missing ":"

> $ sudo ynl --family drm_ras  --do clear-error-counter --json \
> '{"node-id":1, "error-id":1}'
> None
> 
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> Cc: Lijo Lazar <lijo.lazar@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>

Reviewed-by: Raag Jadav <raag.jadav@intel.com>


* Re: [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
       [not found] ` <20260311102913.3387468-9-riana.tauro@intel.com>
@ 2026-03-25 13:31   ` Raag Jadav
  2026-04-08 14:29     ` Tauro, Riana
  0 siblings, 1 reply; 9+ messages in thread
From: Raag Jadav @ 2026-03-25 13:31 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, netdev, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, anvesh.bakwad,
	maarten.lankhorst, Jakub Kicinski, Zack McKevitt, Lijo Lazar,
	Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet

On Wed, Mar 11, 2026 at 03:59:17PM +0530, Riana Tauro wrote:
> Add support for asynchronous error notifications in drm_ras.

It's either drm_ras or DRM RAS, make it consistent in all patches
(both commit message and subject).

> Define a new `error-event` netlink event and a new multicast
> group `error-notify` in drm_ras spec. Each event contains
> a node-id and error-id to identify the type and source
> of error.
> 
> Add drm_ras_error_notify() to trigger this event from drivers.
> Userspace can receive this event by subscribing to the
> multicast group error-notify.
> 
> Example: Using ynl tool

Ditto. Either Usage or Example, make it consistent in all patches.

Also, please utilize the full 75 character space where possible.

> $ sudo ynl --family drm_ras --subscribe error-notify
> 
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> Cc: Lijo Lazar <lijo.lazar@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  Documentation/gpu/drm-ras.rst            |  9 +++++
>  Documentation/netlink/specs/drm_ras.yaml | 14 +++++++
>  drivers/gpu/drm/drm_ras.c                | 48 ++++++++++++++++++++++++
>  drivers/gpu/drm/drm_ras_nl.c             |  6 +++
>  drivers/gpu/drm/drm_ras_nl.h             |  4 ++
>  include/drm/drm_ras.h                    |  2 +
>  include/uapi/drm/drm_ras.h               |  3 ++
>  7 files changed, 86 insertions(+)
> 
> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
> index 4636e68f5678..09b2918f67bd 100644
> --- a/Documentation/gpu/drm-ras.rst
> +++ b/Documentation/gpu/drm-ras.rst
> @@ -54,6 +54,8 @@ User space tools can:
>    ``node-id`` and ``error-id`` as parameters.
>  * Clear specific error counters with the ``clear-error-counter`` command, using both
>    ``node-id`` and ``error-id`` as parameters.
> +* Listen to ``error-event`` notifications for error events by subscribing to the
> +  ``error-notify`` multicast group.
>  
>  YAML-based Interface
>  --------------------
> @@ -109,3 +111,10 @@ Example: Clear an error counter for a given node
>  
>      sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
>      None
> +
> +Example: Listen to error events
> +
> +.. code-block:: bash
> +
> +    sudo ynl --family drm_ras --subscribe error-notify
> +    {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}

Can we also have error-name and node-name? I'd be pulling my hair out
if I need to remember all the ids.

On that note, I think it'll be good to have them as part of request
attributes as an alternative to ids (also for existing commands) but
that can be done as a follow up.

Also, what if I have multiple devices with multiple nodes? Do they need
separate subscriptions?

Raag


* Re: [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
  2026-03-25 13:31   ` [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification Raag Jadav
@ 2026-04-08 14:29     ` Tauro, Riana
  2026-04-09  5:35       ` Raag Jadav
  0 siblings, 1 reply; 9+ messages in thread
From: Tauro, Riana @ 2026-04-08 14:29 UTC (permalink / raw)
  To: Raag Jadav, aravind.iddamsetty, rodrigo.vivi
  Cc: intel-xe, dri-devel, netdev, anshuman.gupta, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	anvesh.bakwad, maarten.lankhorst, Zack McKevitt, Lijo Lazar,
	Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet,
	Jakub Kicinski


On 3/25/2026 7:01 PM, Raag Jadav wrote:
> On Wed, Mar 11, 2026 at 03:59:17PM +0530, Riana Tauro wrote:
>> Add support for asynchronous error notifications in drm_ras.
> It's either drm_ras or DRM RAS, make it consistent in all patches
> (both commit message and subject).
Sure.
>
>> Define a new `error-event` netlink event and a new multicast
>> group `error-notify` in drm_ras spec. Each event contains
>> a node-id and error-id to identify the type and source
>> of error.
>>
>> Add drm_ras_error_notify() to trigger this event from drivers.
>> Userspace can receive this event by subscribing to the
>> multicast group error-notify.
>>
>> Example: Using ynl tool
> Ditto. Either Usage or Example, make it consistent in all patches.
>
> Also, please utilize the full 75 character space where possible.

Will fix.

>
>> $ sudo ynl --family drm_ras --subscribe error-notify
>>
>> Cc: Jakub Kicinski <kuba@kernel.org>
>> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
>> Cc: Lijo Lazar <lijo.lazar@amd.com>
>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>> Cc: David S. Miller <davem@davemloft.net>
>> Cc: Paolo Abeni <pabeni@redhat.com>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   Documentation/gpu/drm-ras.rst            |  9 +++++
>>   Documentation/netlink/specs/drm_ras.yaml | 14 +++++++
>>   drivers/gpu/drm/drm_ras.c                | 48 ++++++++++++++++++++++++
>>   drivers/gpu/drm/drm_ras_nl.c             |  6 +++
>>   drivers/gpu/drm/drm_ras_nl.h             |  4 ++
>>   include/drm/drm_ras.h                    |  2 +
>>   include/uapi/drm/drm_ras.h               |  3 ++
>>   7 files changed, 86 insertions(+)
>>
>> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
>> index 4636e68f5678..09b2918f67bd 100644
>> --- a/Documentation/gpu/drm-ras.rst
>> +++ b/Documentation/gpu/drm-ras.rst
>> @@ -54,6 +54,8 @@ User space tools can:
>>     ``node-id`` and ``error-id`` as parameters.
>>   * Clear specific error counters with the ``clear-error-counter`` command, using both
>>     ``node-id`` and ``error-id`` as parameters.
>> +* Listen to ``error-event`` notifications for error events by subscribing to the
>> +  ``error-notify`` multicast group.
>>   
>>   YAML-based Interface
>>   --------------------
>> @@ -109,3 +111,10 @@ Example: Clear an error counter for a given node
>>   
>>       sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
>>       None
>> +
>> +Example: Listen to error events
>> +
>> +.. code-block:: bash
>> +
>> +    sudo ynl --family drm_ras --subscribe error-notify
>> +    {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
> Can we also have error-name and node-name? I'd be pulling my hair off
> if I need to remember all the ids.

Yeah, makes sense. We can add node_name and error_name.
Adding device_name to the event would also be useful.

@Rodrigo/@aravind thoughts?

>
> On that note, I think it'll be good to have them as part of request
> attributes as an alternative to ids (also for existing commands) but
> that can done as a follow up.
>
We cannot use names as an alternative because that won't work with multiple
cards. For example, in xe: suppose there are 2 cards and each has 2 nodes.
The node names repeat across cards (see the output below), so we cannot
query using node_name+error_name.
Also, most netlink implementations use ids as unique identifiers.

$ sudo ./cli.py --family drm_ras --dump list-nodes
[{'device-name': 'bdf_1', 'node-id': 0, 'node-name': 'correctable-errors',
  'node-type': 'error-counter'},
 {'device-name': 'bdf_1', 'node-id': 1, 'node-name': 'uncorrectable-errors',
  'node-type': 'error-counter'},
 {'device-name': 'bdf_2', 'node-id': 2, 'node-name': 'correctable-errors',
  'node-type': 'error-counter'},
 {'device-name': 'bdf_2', 'node-id': 3, 'node-name': 'uncorrectable-errors',
  'node-type': 'error-counter'}]
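
To make the ambiguity concrete, here is a minimal Python sketch that works on
the list-nodes output above as plain dictionaries. The helper names
(index_by_id, ids_by_name, find_node_id) are hypothetical and not part of ynl
or drm_ras. The sketch also assumes that a (device-name, node-name) pair is
unique, which holds for this sample but is not something the thread confirms.

```python
from collections import defaultdict

# Sample data copied from the list-nodes output above.
NODES = [
    {"device-name": "bdf_1", "node-id": 0,
     "node-name": "correctable-errors", "node-type": "error-counter"},
    {"device-name": "bdf_1", "node-id": 1,
     "node-name": "uncorrectable-errors", "node-type": "error-counter"},
    {"device-name": "bdf_2", "node-id": 2,
     "node-name": "correctable-errors", "node-type": "error-counter"},
    {"device-name": "bdf_2", "node-id": 3,
     "node-name": "uncorrectable-errors", "node-type": "error-counter"},
]


def index_by_id(nodes):
    """Map each node-id to its full node entry (ids are unique)."""
    return {node["node-id"]: node for node in nodes}


def ids_by_name(nodes):
    """Group node-ids by node-name to expose names shared across cards."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node["node-name"]].append(node["node-id"])
    return dict(groups)


def find_node_id(nodes, device_name, node_name):
    """Resolve (device-name, node-name) to a node-id, or None if absent.

    Assumption: the pair is unique, as it is in the sample data.
    """
    for node in nodes:
        if (node["device-name"] == device_name
                and node["node-name"] == node_name):
            return node["node-id"]
    return None


# A node-name alone matches one node per card, so it cannot select a node.
print(ids_by_name(NODES))
# Adding the device name narrows the lookup to a single id.
print(find_node_id(NODES, "bdf_2", "uncorrectable-errors"))
```

A tool could run such a lookup once at startup and use the resulting ids in
later requests.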

>
> Also, what if I have multiple devices with multiple nodes. Do they need
> separate subscription?
>
No, we subscribe only to the group, not to individual nodes. In this case
the group is 'error-notify'.

$ sudo ./cli.py --family drm_ras --subscribe error-notify
{'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
{'msg': {'error-id': 1, 'node-id': 3}, 'name': 'error-event'}
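
Since a single subscription delivers events from every node, a consumer has
to map each node-id back to its device. A minimal Python sketch of that step
follows, operating on events shaped like the two lines above. The
NODES_BY_ID table and the describe() helper are hypothetical; the table is
assumed to have been filled from an earlier list-nodes dump.

```python
# Hypothetical lookup table, assumed to come from a prior list-nodes dump.
NODES_BY_ID = {
    1: {"device-name": "bdf_1", "node-name": "uncorrectable-errors"},
    3: {"device-name": "bdf_2", "node-name": "uncorrectable-errors"},
}

# Events shaped like the subscription output above.
EVENTS = [
    {"msg": {"error-id": 1, "node-id": 1}, "name": "error-event"},
    {"msg": {"error-id": 1, "node-id": 3}, "name": "error-event"},
]


def describe(event, nodes_by_id):
    """Return a readable line for one error-event, resolving the node-id."""
    msg = event["msg"]
    node = nodes_by_id.get(msg["node-id"], {})
    device = node.get("device-name", "unknown-device")
    node_name = node.get("node-name", "unknown-node")
    return "{}: {} error-id={}".format(device, node_name, msg["error-id"])


for event in EVENTS:
    print(describe(event, NODES_BY_ID))
```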


Thanks
Riana
>
> Raag


* Re: [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
  2026-04-08 14:29     ` Tauro, Riana
@ 2026-04-09  5:35       ` Raag Jadav
  2026-04-10  6:20         ` Tauro, Riana
  0 siblings, 1 reply; 9+ messages in thread
From: Raag Jadav @ 2026-04-09  5:35 UTC (permalink / raw)
  To: Tauro, Riana
  Cc: aravind.iddamsetty, rodrigo.vivi, intel-xe, dri-devel, netdev,
	anshuman.gupta, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, anvesh.bakwad,
	maarten.lankhorst, Zack McKevitt, Lijo Lazar, Hawking Zhang,
	David S. Miller, Paolo Abeni, Eric Dumazet, Jakub Kicinski

On Wed, Apr 08, 2026 at 07:59:33PM +0530, Tauro, Riana wrote:
> On 3/25/2026 7:01 PM, Raag Jadav wrote:
> > On Wed, Mar 11, 2026 at 03:59:17PM +0530, Riana Tauro wrote:

...

> > > +Example: Listen to error events
> > > +
> > > +.. code-block:: bash
> > > +
> > > +    sudo ynl --family drm_ras --subscribe error-notify
> > > +    {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
> > Can we also have error-name and node-name? I'd be pulling my hair off
> > if I need to remember all the ids.
> 
> Yeah makes sense. We can add the node_name, error_name.
> Adding device_name would also be more useful in the event.
> 
> @Rodrigo/@aravind thoughts?
> 
> > 
> > On that note, I think it'll be good to have them as part of request
> > attributes as an alternative to ids (also for existing commands) but
> > that can done as a follow up.
> > 
> We cannot use names as alternative because it won't work for multiple cards.
> example in xe: Suppose there are 2 cards and each has 2 nodes. We cannot
> query using node_name+error_name.
> Also most of the netlink implementations use id's as unique identifiers.
> 
> $ sudo ./cli.py --family drm_ras  --dump list-nodes
> [{'device-name': 'bdf_1', 'node-id': 0, 'node-name': 'correctable-errors',
> 'node-type': 'error-counter'},
>  {'device-name': 'bdf_1, 'node-id': 1, 'node-name': 'uncorrectable-errors',
> 'node-type': 'error-counter'},
>  {'device-name': 'bdf_2', 'node-id': 2, 'node-name': 'correctable-errors',
> 'node-type': 'error-counter'},
>  {'device-name': 'bdf_2', 'node-id': 3, 'node-name': 'uncorrectable-errors',
> 'node-type': 'error-counter'}]

This means the ids don't persist, so the user needs to figure out all the ids
before anything can happen. In the device node world we have
/dev/dri/by-path/<bdf>, which makes it much easier.

Also, I'm not well informed about the history, and it's still unclear to me
what problem netlink solves here that cannot be solved by anything else. But
we're too late for that discussion, and again, not my call.

> > Also, what if I have multiple devices with multiple nodes. Do they need
> > separate subscription?
> > 
> No, we subscribe only to the group not the nodes. In this case the group is
> 'error-notify'
> 
> $ sudo ./cli.py --family drm_ras --subscribe error-notify
> {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
> {'msg': {'error-id': 1, 'node-id': 3}, 'name': 'error-event'}

Hm, perhaps I need to spend some time wrapping my head around the new concept.
Let's catch up sometime this week.

Raag


* Re: [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
  2026-04-09  5:35       ` Raag Jadav
@ 2026-04-10  6:20         ` Tauro, Riana
  2026-04-23  5:52           ` Tauro, Riana
  0 siblings, 1 reply; 9+ messages in thread
From: Tauro, Riana @ 2026-04-10  6:20 UTC (permalink / raw)
  To: Raag Jadav, aravind.iddamsetty, intel-xe
  Cc: rodrigo.vivi, dri-devel, netdev, anshuman.gupta, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	anvesh.bakwad, maarten.lankhorst, Zack McKevitt, Lijo Lazar,
	Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet,
	Jakub Kicinski


On 4/9/2026 11:05 AM, Raag Jadav wrote:
> On Wed, Apr 08, 2026 at 07:59:33PM +0530, Tauro, Riana wrote:
>> On 3/25/2026 7:01 PM, Raag Jadav wrote:
>>> On Wed, Mar 11, 2026 at 03:59:17PM +0530, Riana Tauro wrote:
> ...
>
>>>> +Example: Listen to error events
>>>> +
>>>> +.. code-block:: bash
>>>> +
>>>> +    sudo ynl --family drm_ras --subscribe error-notify
>>>> +    {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
>>> Can we also have error-name and node-name? I'd be pulling my hair off
>>> if I need to remember all the ids.
>> Yeah makes sense. We can add the node_name, error_name.
>> Adding device_name would also be more useful in the event.
>>
>> @Rodrigo/@aravind thoughts?
>>
I tried adding all parameters, but the event response seems overloaded.
I think node-name and error-name are not necessary, since this will mostly
be used by tools and scripts that get the node and error ids prior to
subscribing.

Let me know your thoughts.

$ sudo ./tools/net/ynl/pyynl/cli.py --family drm_ras --subscribe error-notify
{'msg': {'device-name': '0000:03:00.0', 'error-id': 1,
         'error-name': 'core-compute', 'node-id': 3,
         'node-name': 'uncorrectable-errors'},
 'name': 'error-event'}
{'msg': {'device-name': '0000:04:00.0', 'error-id': 1,
         'error-name': 'core-compute', 'node-id': 1,
         'node-name': 'uncorrectable-errors'},
 'name': 'error-event'}

Thanks
Riana

>>
>>> On that note, I think it'll be good to have them as part of request
>>> attributes as an alternative to ids (also for existing commands) but
>>> that can done as a follow up.
>>>
>> We cannot use names as alternative because it won't work for multiple cards.
>> example in xe: Suppose there are 2 cards and each has 2 nodes. We cannot
>> query using node_name+error_name.
>> Also most of the netlink implementations use id's as unique identifiers.
>>
>> $ sudo ./cli.py --family drm_ras  --dump list-nodes
>> [{'device-name': 'bdf_1', 'node-id': 0, 'node-name': 'correctable-errors',
>> 'node-type': 'error-counter'},
>>   {'device-name': 'bdf_1, 'node-id': 1, 'node-name': 'uncorrectable-errors',
>> 'node-type': 'error-counter'},
>>   {'device-name': 'bdf_2', 'node-id': 2, 'node-name': 'correctable-errors',
>> 'node-type': 'error-counter'},
>>   {'device-name': 'bdf_2', 'node-id': 3, 'node-name': 'uncorrectable-errors',
>> 'node-type': 'error-counter'}]
> This means they don't persist the user needs to figures out all the ids before
> anything can happen. In device node world we have /dev/dri/by-path/<bdf> which
> makes it much easier.
>
> Also, I'm not much informed about the history and it's still unclear to me what
> problem did netlink solve here that cannot be solved by anything else? But we're
> too late for that discussion, and again, not my call.
>
>>> Also, what if I have multiple devices with multiple nodes. Do they need
>>> separate subscription?
>>>
>> No, we subscribe only to the group not the nodes. In this case the group is
>> 'error-notify'
>>
>> $ sudo ./cli.py --family drm_ras --subscribe error-notify
>> {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
>> {'msg': {'error-id': 1, 'node-id': 3}, 'name': 'error-event'}
> Hm, perhaps I need to spend some time wrapping my head around the new concept.
> Let's catch up sometime this week.
>
> Raag


* Re: [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
  2026-04-10  6:20         ` Tauro, Riana
@ 2026-04-23  5:52           ` Tauro, Riana
  2026-04-23 16:29             ` Jakub Kicinski
  0 siblings, 1 reply; 9+ messages in thread
From: Tauro, Riana @ 2026-04-23  5:52 UTC (permalink / raw)
  To: Raag Jadav, aravind.iddamsetty, Jakub Kicinski, rodrigo.vivi
  Cc: dri-devel, netdev, anshuman.gupta, joonas.lahtinen, simona.vetter,
	airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	anvesh.bakwad, maarten.lankhorst, Zack McKevitt, Lijo Lazar,
	Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet,
	intel-xe

Hi Jakub

We had some questions regarding events in netlink.

1) According to the netlink spec, "Events are considered less idiomatic for
netlink and notifications should be preferred."
Since we currently don't have a response that matches a get operation to
use with notify, would using an event be acceptable?

2) Is there a way to check if there are subscribers to a group before
creating the event message?
    Currently the subscriber check happens in netlink_broadcast_filtered,
    but a reviewer suggested we could optimize by skipping message creation
    if there are no subscribers.

Thanks
Riana

On 4/10/2026 11:50 AM, Tauro, Riana wrote:
>
> On 4/9/2026 11:05 AM, Raag Jadav wrote:
>> On Wed, Apr 08, 2026 at 07:59:33PM +0530, Tauro, Riana wrote:
>>> On 3/25/2026 7:01 PM, Raag Jadav wrote:
>>>> On Wed, Mar 11, 2026 at 03:59:17PM +0530, Riana Tauro wrote:
>> ...
>>
>>>>> +Example: Listen to error events
>>>>> +
>>>>> +.. code-block:: bash
>>>>> +
>>>>> +    sudo ynl --family drm_ras --subscribe error-notify
>>>>> +    {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
>>>> Can we also have error-name and node-name? I'd be pulling my hair off
>>>> if I need to remember all the ids.
>>> Yeah makes sense. We can add the node_name, error_name.
>>> Adding device_name would also be more useful in the event.
>>>
>>> @Rodrigo/@aravind thoughts?
>>>
> I tried adding all parameters, but the event response seems overloaded.
> I think node-name, error-name is not necessary since this will be 
> mostly used
>
> by tools and scripts that get the nodes and error-ids prior to 
> subscribing.
>
> Let me know your thoughts
>
> $ sudo ./tools/net/ynl/pyynl/cli.py --family drm_ras --subscribe 
> error-notify
>
> {'msg': {'device-name': '0000:03:00.0', 'error-id': 1, 'error-name': 
> 'core-compute', 'node-id': 3, 'node-name': 'uncorrectable-errors'},
>
> 'name': 'error-event'}
>
> {'msg': {'device-name': '0000:04:00.0', 'error-id': 1, 'error-name': 
> 'core-compute', 'node-id': 1, 'node-name': 'uncorrectable-errors'},
>
> 'name': 'error-event'}
>
> Thanks
> Riana
>
>>>
>>>> On that note, I think it'll be good to have them as part of request
>>>> attributes as an alternative to ids (also for existing commands) but
>>>> that can done as a follow up.
>>>>
>>> We cannot use names as alternative because it won't work for 
>>> multiple cards.
>>> example in xe: Suppose there are 2 cards and each has 2 nodes. We 
>>> cannot
>>> query using node_name+error_name.
>>> Also most of the netlink implementations use id's as unique 
>>> identifiers.
>>>
>>> $ sudo ./cli.py --family drm_ras  --dump list-nodes
>>> [{'device-name': 'bdf_1', 'node-id': 0, 'node-name': 
>>> 'correctable-errors',
>>> 'node-type': 'error-counter'},
>>>   {'device-name': 'bdf_1, 'node-id': 1, 'node-name': 
>>> 'uncorrectable-errors',
>>> 'node-type': 'error-counter'},
>>>   {'device-name': 'bdf_2', 'node-id': 2, 'node-name': 
>>> 'correctable-errors',
>>> 'node-type': 'error-counter'},
>>>   {'device-name': 'bdf_2', 'node-id': 3, 'node-name': 
>>> 'uncorrectable-errors',
>>> 'node-type': 'error-counter'}]
>> This means they don't persist the user needs to figures out all the 
>> ids before
>> anything can happen. In device node world we have 
>> /dev/dri/by-path/<bdf> which
>> makes it much easier.
>>
>> Also, I'm not much informed about the history and it's still unclear 
>> to me what
>> problem did netlink solve here that cannot be solved by anything 
>> else? But we're
>> too late for that discussion, and again, not my call.
>>
>>>> Also, what if I have multiple devices with multiple nodes. Do they 
>>>> need
>>>> separate subscription?
>>>>
>>> No, we subscribe only to the group not the nodes. In this case the 
>>> group is
>>> 'error-notify'
>>>
>>> $ sudo ./cli.py --family drm_ras --subscribe error-notify
>>> {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
>>> {'msg': {'error-id': 1, 'node-id': 3}, 'name': 'error-event'}
>> Hm, perhaps I need to spend some time wrapping my head around the new 
>> concept.
>> Let's catch up sometime this week.
>>
>> Raag


* Re: [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
  2026-04-23  5:52           ` Tauro, Riana
@ 2026-04-23 16:29             ` Jakub Kicinski
  2026-05-04  7:28               ` Tauro, Riana
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2026-04-23 16:29 UTC (permalink / raw)
  To: Tauro, Riana
  Cc: Raag Jadav, aravind.iddamsetty, rodrigo.vivi, dri-devel, netdev,
	anshuman.gupta, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, anvesh.bakwad,
	maarten.lankhorst, Zack McKevitt, Lijo Lazar, Hawking Zhang,
	David S. Miller, Paolo Abeni, Eric Dumazet, intel-xe

On Thu, 23 Apr 2026 11:22:45 +0530 Tauro, Riana wrote:
> We had some questions regarding events in netlink
> 
> 1) According to netlink spec, "Events are considered less idiomatic for 
> netlink and notifications should be preferred."
> Since we currently don't have a response that matches a get operation to 
> use notify. Would using an event be acceptable?

I have only glanced at previous messages but would it not work to
include the counter values in the "event" as if it were a
counter get?

If it makes no sense - it's fine. We have been adding events in some
places, too, in networking. The thinking is basically that when the app
that does the user space monitoring starts it will often want to know
the state of the world. So we end up with "get state of the world"
and "notify about state of the world" which should really be the same
format.
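
A minimal Python sketch of why a shared format helps a monitoring app: one
handler can consume both the initial "get" snapshot and later notifications.
The CounterMonitor class is hypothetical. The 'error-value' attribute and the
presence of 'node-id' in each record are assumptions made for illustration,
not the agreed event layout.

```python
class CounterMonitor:
    """Tracks the latest known value per (node-id, error-id)."""

    def __init__(self):
        self.counters = {}

    def apply(self, record):
        """Handle one record, whether it came from a get or a notification."""
        key = (record["node-id"], record["error-id"])
        self.counters[key] = record["error-value"]

    def seed(self, snapshot):
        """Load the state of the world at startup using the same handler."""
        for record in snapshot:
            self.apply(record)


monitor = CounterMonitor()

# Startup: "get state of the world" (hypothetical dump result).
monitor.seed([
    {"node-id": 1, "error-id": 1, "error-value": 0},
    {"node-id": 3, "error-id": 1, "error-value": 2},
])

# Later: "notify about state of the world", in the same record format.
monitor.apply({"node-id": 1, "error-id": 1, "error-value": 5})

print(monitor.counters)
```

If the two formats differed, the app would need a second parser and a
translation step between them.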

> 2) Is there a way to check if there are subscribers to a group before 
> creating the event message?
>     Currently the subscriber check happens in 
> netlink_broadcast_filtered, but a reviewer suggested we could
>     optimize by skipping message creation if there are no subscribers.

Yes, genl_has_listeners()
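
genl_has_listeners() is a kernel C helper; the Python toy below only models
the control flow being discussed, that is, checking for listeners before
paying the cost of building a message. Group, notify() and build_event() are
hypothetical names and do not correspond to kernel or ynl APIs.

```python
class Group:
    """Toy stand-in for a multicast group with zero or more listeners."""

    def __init__(self):
        self.listeners = []

    def has_listeners(self):
        return bool(self.listeners)


def notify(group, build_message):
    """Build and deliver a message only if someone is listening.

    Returns the number of listeners that received the message.
    """
    if not group.has_listeners():
        return 0  # skip message construction entirely
    message = build_message()
    for listener in group.listeners:
        listener(message)
    return len(group.listeners)


build_calls = []


def build_event():
    """Pretend this is the expensive part (allocation, attribute fill)."""
    build_calls.append(1)
    return {"name": "error-event", "msg": {"node-id": 1, "error-id": 1}}


group = Group()
delivered_without = notify(group, build_event)  # no listeners: not built

received = []
group.listeners.append(received.append)
delivered_with = notify(group, build_event)     # one listener: built once

print(delivered_without, delivered_with, len(build_calls))
```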


* Re: [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
  2026-04-23 16:29             ` Jakub Kicinski
@ 2026-05-04  7:28               ` Tauro, Riana
  2026-05-04 18:35                 ` Rodrigo Vivi
  0 siblings, 1 reply; 9+ messages in thread
From: Tauro, Riana @ 2026-05-04  7:28 UTC (permalink / raw)
  To: Jakub Kicinski, aravind.iddamsetty, rodrigo.vivi
  Cc: Raag Jadav, dri-devel, netdev, anshuman.gupta, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	anvesh.bakwad, maarten.lankhorst, Zack McKevitt, Lijo Lazar,
	Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet,
	intel-xe


On 4/23/2026 9:59 PM, Jakub Kicinski wrote:
> On Thu, 23 Apr 2026 11:22:45 +0530 Tauro, Riana wrote:
>> We had some questions regarding events in netlink
>>
>> 1) According to netlink spec, "Events are considered less idiomatic for
>> netlink and notifications should be preferred."
>> Since we currently don't have a response that matches a get operation to
>> use notify. Would using an event be acceptable?
> I have only glanced at previous messages but would it not work to
> include the counter values into the "event" as if it was it was a
> counter get?

We wanted to keep the notification general, with error name and node name.
Other series will re-use this event:

[PATCH v7 3/3] drm/xe/ras: Introduce correctable error handling - Raag Jadav
https://lore.kernel.org/intel-xe/20260428054826.1202076-4-raag.jadav@intel.com/

[PATCH v4 00/13] Introduce Xe Uncorrectable Error Handling - Riana Tauro
https://lore.kernel.org/intel-xe/20260417085812.4013309-15-riana.tauro@intel.com/

We do not have the complete counter value there.
@Aravind/@Rodrigo Do you think we should add the counter value to the event?

Also, the review comments request additional parameters, so we cannot
re-use the get-counter response.

>
> If it makes no sense - it's fine. We have been adding events in some
> places, too, in networking. The thinking is basically that when the app
> that does the user space monitoring starts it will often want to know
> the state of the world. So we end up with "get state of the world"
> and "notify about state of the world" which should really be the same
> format.

Sure. Thank you

>
>> 2) Is there a way to check if there are subscribers to a group before
>> creating the event message?
>>      Currently the subscriber check happens in
>> netlink_broadcast_filtered, but a reviewer suggested we could
>>      optimize by skipping message creation if there are no subscribers.
> Yes, genl_has_listeners()

Thank you

Riana




* Re: [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
  2026-05-04  7:28               ` Tauro, Riana
@ 2026-05-04 18:35                 ` Rodrigo Vivi
  0 siblings, 0 replies; 9+ messages in thread
From: Rodrigo Vivi @ 2026-05-04 18:35 UTC (permalink / raw)
  To: Tauro, Riana
  Cc: Jakub Kicinski, aravind.iddamsetty, Raag Jadav, dri-devel, netdev,
	anshuman.gupta, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, anvesh.bakwad,
	maarten.lankhorst, Zack McKevitt, Lijo Lazar, Hawking Zhang,
	David S. Miller, Paolo Abeni, Eric Dumazet, intel-xe

On Mon, May 04, 2026 at 12:58:13PM +0530, Tauro, Riana wrote:
> 
> On 4/23/2026 9:59 PM, Jakub Kicinski wrote:
> > On Thu, 23 Apr 2026 11:22:45 +0530 Tauro, Riana wrote:
> > > We had some questions regarding events in netlink
> > > 
> > > 1) According to netlink spec, "Events are considered less idiomatic for
> > > netlink and notifications should be preferred."
> > > Since we currently don't have a response that matches a get operation to
> > > use notify. Would using an event be acceptable?
> > I have only glanced at previous messages but would it not work to
> > include the counter values into the "event" as if it was it was a
> > counter get?
> 
> We wanted to keep the notification general with error name and node name.
> There is another series which will re-use this event [PATCH v7 3/3]
> drm/xe/ras: Introduce correctable error handling - Raag Jadav <https://lore.kernel.org/intel-xe/20260428054826.1202076-4-raag.jadav@intel.com/>
> <https://lore.kernel.org/intel-xe/20260417085812.4013309-15-riana.tauro@intel.com/>[PATCH
> v4 00/13] Introduce Xe Uncorrectable Error Handling - Riana Tauro <https://lore.kernel.org/intel-xe/20260417085812.4013309-15-riana.tauro@intel.com/> .
> We do not have the complete counter value there.
> @Aravind/@Rodrigo Do you think we should add the counter value to the event?

I don't see any reason why we couldn't include the counter value in the event.
Although we know that this value is likely the threshold, I don't see any problem
with this flow. In many cases it could mean one less call from user space.

> 
> Also, there is a request to add other parameters in the review comments
> so cannot re-use get-counter response
> 
> > 
> > If it makes no sense - it's fine. We have been adding events in some
> > places, too, in networking. The thinking is basically that when the app
> > that does the user space monitoring starts it will often want to know
> > the state of the world. So we end up with "get state of the world"
> > and "notify about state of the world" which should really be the same
> > format.
> 
> Sure. Thank you
> 
> > 
> > > 2) Is there a way to check if there are subscribers to a group before
> > > creating the event message?
> > >      Currently the subscriber check happens in
> > > netlink_broadcast_filtered, but a reviewer suggested we could
> > >      optimize by skipping message creation if there are no subscribers.
> > Yes, genl_has_listeners()
> 
> Thank you
> 
> Riana
> 
> 


end of thread, other threads:[~2026-05-04 18:35 UTC | newest]

Thread overview: 9+ messages
     [not found] <20260311102913.3387468-6-riana.tauro@intel.com>
     [not found] ` <20260311102913.3387468-7-riana.tauro@intel.com>
2026-03-25 12:40   ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Raag Jadav
     [not found] ` <20260311102913.3387468-9-riana.tauro@intel.com>
2026-03-25 13:31   ` [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification Raag Jadav
2026-04-08 14:29     ` Tauro, Riana
2026-04-09  5:35       ` Raag Jadav
2026-04-10  6:20         ` Tauro, Riana
2026-04-23  5:52           ` Tauro, Riana
2026-04-23 16:29             ` Jakub Kicinski
2026-05-04  7:28               ` Tauro, Riana
2026-05-04 18:35                 ` Rodrigo Vivi
