linux-kernel.vger.kernel.org archive mirror
* [RFC net-next] net: devlink: add port function attr for vport ↔ eswitch metadata forwarding
@ 2025-08-28  6:52 Shay Drory
  2025-08-28  9:03 ` Jiri Pirko
  0 siblings, 1 reply; 4+ messages in thread
From: Shay Drory @ 2025-08-28  6:52 UTC
  To: jiri, davem, edumazet, kuba, pabeni, horms, netdev, linux-kernel
  Cc: ozsh, mbloch, tariqt, saeedm, Shay Drory

In some product architectures, the eswitch manager and the exception
handler run as separate user space processes. The eswitch manager uses
the physical uplink device, while the slow path handler uses a virtual
device.

In these architectures, the eswitch manager application programs the HW
to send exception packets to a specific vport, and the exception
application runs on top of this vport's virtual device and handles
these packets.

Currently, when packets are forwarded between the eswitch and a vport,
no per-packet metadata is preserved. As a result, the slow path handler
cannot implement features that require visibility into the packet's
hardware context.

This RFC introduces two optional devlink port-function attributes. When
these two capabilities are enabled for a function of the port, the
device makes the necessary preparations for the function to exchange
metadata with the eswitch.

rx_metadata
When enabled, packets received by the vport from the eswitch will be
prepended with a device-specific metadata header. This allows the slow
path application to receive the full context of the packet as seen by
the hardware.

tx_metadata
When enabled, the vport can send a packet prepended with a metadata
header. The eswitch hardware consumes this metadata to steer the packet.

Together they allow the slow path application to process slow-path
events in user space at line rate while leaving the common fast path in
hardware.

User-space interface
Enabling and disabling is done with the existing devlink port-function
syntax:

$ devlink port function set pci/0000:06:00.0/3 rx_metadata enable
$ devlink port function set pci/0000:06:00.0/3 tx_metadata enable

Querying the state shows the new knobs:

$ devlink port function show pci/0000:06:00.0/3
  pci/0000:06:00.0/3:
   roce enabled rx_metadata enabled tx_metadata enabled
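
As a rough sketch of how a script might check one of these knobs, the
helper below reads `devlink port function show` output on stdin and
prints the word that follows a given attribute name. It assumes the
output keeps the "name state" pair layout shown above; `attr_state` is
a hypothetical helper name, not part of devlink.

```shell
#!/bin/sh
# Sketch only. attr_state is a hypothetical helper; it assumes the
# "name state name state ..." layout shown in the RFC's sample output.
#
# attr_state ATTR
#   Reads `devlink port function show` output on stdin and prints the
#   word that follows ATTR (e.g. "enabled"). Fails if ATTR is absent.
attr_state() {
    awk -v attr="$1" '
        {
            for (i = 1; i < NF; i++) {
                if ($i == attr) {
                    print $(i + 1)
                    found = 1
                    exit
                }
            }
        }
        END { exit found ? 0 : 1 }
    '
}
```

It could be used as
`devlink port function show pci/0000:06:00.0/3 | attr_state rx_metadata`.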

Disabling is symmetrical:

$ devlink port function set pci/0000:06:00.0/3 rx_metadata disable
$ devlink port function set pci/0000:06:00.0/3 tx_metadata disable
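
Since the two attributes are normally toggled together, the commands
above could be wrapped roughly as follows. This is a sketch under
assumptions: `set_port_metadata` and the `DEVLINK` override variable
are hypothetical names, and `rx_metadata` / `tx_metadata` exist only as
proposed in this RFC.

```shell
#!/bin/sh
# Sketch only. rx_metadata/tx_metadata are the attributes proposed in
# this RFC; set_port_metadata and DEVLINK are hypothetical names.
# Set DEVLINK="echo devlink" for a dry run that only prints commands.
DEVLINK="${DEVLINK:-devlink}"

# set_port_metadata PORT STATE
#   PORT   devlink port handle, e.g. pci/0000:06:00.0/3
#   STATE  enable | disable
set_port_metadata() {
    port="$1"
    state="$2"
    case "$state" in
        enable|disable) ;;
        *)
            echo "usage: set_port_metadata PORT enable|disable" >&2
            return 2
            ;;
    esac
    # Apply the same state to both directions; stop at the first failure.
    for attr in rx_metadata tx_metadata; do
        $DEVLINK port function set "$port" "$attr" "$state" || return 1
    done
}
```

For example, `set_port_metadata pci/0000:06:00.0/3 enable` issues the
two enable commands shown earlier.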

Signed-off-by: Shay Drory <shayd@nvidia.com>


-- 
2.38.1



* Re: [RFC net-next] net: devlink: add port function attr for vport ↔ eswitch metadata forwarding
  2025-08-28  6:52 [RFC net-next] net: devlink: add port function attr for vport ↔ eswitch metadata forwarding Shay Drory
@ 2025-08-28  9:03 ` Jiri Pirko
  2025-08-30  2:09   ` Jakub Kicinski
  2025-09-01  6:30   ` Shay Drori
  0 siblings, 2 replies; 4+ messages in thread
From: Jiri Pirko @ 2025-08-28  9:03 UTC
  To: Shay Drory
  Cc: davem, edumazet, kuba, pabeni, horms, netdev, linux-kernel, ozsh,
	mbloch, tariqt, saeedm

Thu, Aug 28, 2025 at 08:52:29AM +0200, shayd@nvidia.com wrote:
>In some product architectures, the eswitch manager and the exception
>handler run as separate user space processes. The eswitch manager uses
>the physical uplink device, while the slow path handler uses a virtual
>device.
>
>In these architectures, the eswitch manager application programs the HW
>to send exception packets to a specific vport, and the exception
>application runs on top of this vport's virtual device and handles
>these packets.
>
>Currently, when packets are forwarded between the eswitch and a vport,
>no per-packet metadata is preserved. As a result, the slow path handler
>cannot implement features that require visibility into the packet's
>hardware context.

A vendor-specific slow path. Basically you provide a possibility for
user to pass a binary blob to hw along with every TX'ed packet and
vice versa. That looks quite odd tbh. I mean, isn't this horribly
breaking the socket abstraction? Also, isn't this horribly breaking the
forwarding offloading model when HW should just mimic the behaviour of
the kernel?





* Re: [RFC net-next] net: devlink: add port function attr for vport ↔ eswitch metadata forwarding
  2025-08-28  9:03 ` Jiri Pirko
@ 2025-08-30  2:09   ` Jakub Kicinski
  2025-09-01  6:30   ` Shay Drori
  1 sibling, 0 replies; 4+ messages in thread
From: Jakub Kicinski @ 2025-08-30  2:09 UTC
  To: Jiri Pirko
  Cc: Shay Drory, davem, edumazet, pabeni, horms, netdev, linux-kernel,
	ozsh, mbloch, tariqt, saeedm

On Thu, 28 Aug 2025 11:03:41 +0200 Jiri Pirko wrote:
> Thu, Aug 28, 2025 at 08:52:29AM +0200, shayd@nvidia.com wrote:
> >In some product architectures, the eswitch manager and the exception
> >handler run as separate user space processes. The eswitch manager uses
> >the physical uplink device, while the slow path handler uses a virtual
> >device.
> >
> >In these architectures, the eswitch manager application programs the HW
> >to send exception packets to a specific vport, and the exception
> >application runs on top of this vport's virtual device and handles
> >these packets.
> >
> >Currently, when packets are forwarded between the eswitch and a vport,
> >no per-packet metadata is preserved. As a result, the slow path handler
> >cannot implement features that require visibility into the packet's
> >hardware context.  
> 
> A vendor-specific slow path. Basically you provide a possibility for
> user to pass a binary blob to hw along with every TX'ed packet and
> vice versa. That looks quite odd tbh. I mean, isn't this horribly
> breaking the socket abstraction? Also, isn't this horribly breaking the
> forwarding offloading model when HW should just mimic the behaviour of
> the kernel?

I suppose we will be told at some point that it's for debug.


* Re: [RFC net-next] net: devlink: add port function attr for vport ↔ eswitch metadata forwarding
  2025-08-28  9:03 ` Jiri Pirko
  2025-08-30  2:09   ` Jakub Kicinski
@ 2025-09-01  6:30   ` Shay Drori
  1 sibling, 0 replies; 4+ messages in thread
From: Shay Drori @ 2025-09-01  6:30 UTC
  To: Jiri Pirko
  Cc: davem, edumazet, kuba, pabeni, horms, netdev, linux-kernel, ozsh,
	mbloch, tariqt, saeedm



On 28/08/2025 12:03, Jiri Pirko wrote:
> Thu, Aug 28, 2025 at 08:52:29AM +0200, shayd@nvidia.com wrote:
>> In some product architectures, the eswitch manager and the exception
>> handler run as separate user space processes. The eswitch manager uses
>> the physical uplink device, while the slow path handler uses a virtual
>> device.
>>
>> In these architectures, the eswitch manager application programs the HW
>> to send exception packets to a specific vport, and the exception
>> application runs on top of this vport's virtual device and handles
>> these packets.
>>
>> Currently, when packets are forwarded between the eswitch and a vport,
>> no per-packet metadata is preserved. As a result, the slow path handler
>> cannot implement features that require visibility into the packet's
>> hardware context.
> 
> A vendor-specific slow path. Basically you provide a possibility for
> user to pass a binary blob to hw along with every TX'ed packet and
> vice versa. That looks quite odd tbh. I mean, isn't this horribly
> breaking the socket abstraction? Also, isn't this horribly breaking the
> forwarding offloading model when HW should just mimic the behaviour of
> the kernel?

This feature is targeted at kernel-bypass applications, which already
operate outside the Linux kernel’s traditional networking stack.
These applications need access to hardware-specific metadata to make
forwarding decisions or offload acceleration, and they do not use
“representor” devices in the same way as kernel-driven virtual functions.

The devlink interface configures the hardware switch, and these knobs
adjust how metadata is preserved between the e-switch manager and the
exception handler.



