[RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors
@ 2022-08-15 14:22 ecree
  2022-08-18  9:56 ` Marcin Szycik
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: ecree @ 2022-08-15 14:22 UTC (permalink / raw)
  To: netdev, linux-net-drivers
  Cc: davem, kuba, pabeni, edumazet, corbet, linux-doc, jacob.e.keller,
	jesse.brandeburg, michael.chan, andy, saeed, jiri, snelson,
	simon.horman, alexander.duyck, rdunlap, Edward Cree

From: Edward Cree <ecree.xilinx@gmail.com>

There's no clear explanation of what VF Representors are for, their
 semantics, etc., outside of vendor docs and random conference slides.
Add a document explaining Representors and defining what drivers that
 implement them are expected to do.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
---
Changed in v2:
 - incorporated feedback from Jakub including rewrite of the Motivation section,
   representors for uplink/phys port, replace phys_port_name conventions with
   devlink port.
 - fixed archaic spelling (Randy)
 - painted the bike shed blue ("master PF") for now, we can always change it
   again later
 - added Definitions section

 Documentation/networking/index.rst        |   1 +
 Documentation/networking/representors.rst | 228 ++++++++++++++++++++++
 Documentation/networking/switchdev.rst    |   1 +
 3 files changed, 230 insertions(+)
 create mode 100644 Documentation/networking/representors.rst

diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 03b215bddde8..c37ea2b54c29 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -93,6 +93,7 @@ Contents:
    radiotap-headers
    rds
    regulatory
+   representors
    rxrpc
    sctp
    secid
diff --git a/Documentation/networking/representors.rst b/Documentation/networking/representors.rst
new file mode 100644
index 000000000000..be7cc4752d11
--- /dev/null
+++ b/Documentation/networking/representors.rst
@@ -0,0 +1,228 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+Network Function Representors
+=============================
+
+This document describes the semantics and usage of representor netdevices, as
+used to control internal switching on SmartNICs.  For the closely-related port
+representors on physical (multi-port) switches, see
+:ref:`Documentation/networking/switchdev.rst <switchdev>`.
+
+Motivation
+----------
+
+Since the mid-2010s, network cards have started offering more complex
+virtualisation capabilities than the legacy SR-IOV approach (with its simple
+MAC/VLAN-based switching model) can support.  This led to a desire to offload
+software-defined networks (such as OpenVSwitch) to these NICs to specify the
+network connectivity of each function.  The resulting designs are variously
+called SmartNICs or DPUs.
+
+Network function representors bring the standard Linux networking stack to
+virtual switches and IOV devices.  Just as each port of a Linux-controlled
+switch has a separate netdev, so each virtual function has one.  When the system
+boots, and before any offload is configured, all packets from the virtual
+functions appear in the networking stack of the PF via the representors.
+The PF can thus always communicate freely with the virtual functions.
+The PF can configure standard Linux forwarding between representors, the uplink
+or any other netdev (routing, bridging, TC classifiers).
+
+Thus, a representor is both a control plane object (representing the function in
+administrative commands) and a data plane object (one end of a virtual pipe).
+As a virtual link endpoint, the representor can be configured like any other
+netdevice; in some cases (e.g. link state) the representee will follow the
+representor's configuration, while in others there are separate APIs to
+configure the representee.
+
+Definitions
+-----------
+
+This document uses the term "master PF" to refer to the PCIe function which has
+administrative control over the virtual switch on the device.  Conceivably a NIC
+could be configured to grant these administrative privileges instead to a VF or
+SF (subfunction); the terminology is not meant to exclude this case.
+Depending on NIC design, a multi-port NIC might have a single master PF for the
+whole device or might have a separate virtual switch, and hence master PF, for
+each physical network port.
+If the NIC supports nested switching, there might be separate "master PFs" for
+each nested switch, in which case each "master PF" should only create
+representors for the ports on the (sub-)switch it directly administers.
+
+A "representee" is the object that a representor represents.  So for example in
+the case of a VF representor, the representee is the corresponding VF.
+
+What does a representor do?
+---------------------------
+
+A representor has three main roles.
+
+1. It is used to configure the network connection the representee sees, e.g.
+   link up/down, MTU, etc.  For instance, bringing the representor
+   administratively UP should cause the representee to see a link up / carrier
+   on event.
+2. It provides the slow path for traffic which does not hit any offloaded
+   fast-path rules in the virtual switch.  Packets transmitted on the
+   representor netdevice should be delivered to the representee; packets
+   transmitted to the representee which fail to match any switching rule should
+   be received on the representor netdevice.  (That is, there is a virtual pipe
+   connecting the representor to the representee, similar in concept to a veth
+   pair.)
+   This allows software switch implementations (such as OpenVSwitch or a Linux
+   bridge) to forward packets between representees and the rest of the network.
+3. It acts as a handle by which switching rules (such as TC filters) can refer
+   to the representee, allowing these rules to be offloaded.
+
+The combination of 2) and 3) means that the behaviour (apart from performance)
+should be the same whether a TC filter is offloaded or not.  E.g. a TC rule
+on a VF representor applies in software to packets received on that representor
+netdevice, while in hardware offload it would apply to packets transmitted by
+the representee VF.  Conversely, a mirred egress redirect to a VF representor
+corresponds in hardware to delivery directly to the representee VF.
+
+What functions should have a representor?
+-----------------------------------------
+
+Essentially, for each virtual port on the device's internal switch, there
+should be a representor.
+Some vendors have chosen to omit representors for the uplink and the physical
+network port, which can simplify usage (the uplink netdev becomes in effect the
+physical port's representor) but does not generalise to devices with multiple
+ports or uplinks.
+
+Thus, the following should all have representors:
+
+ - VFs belonging to the master PF.
+ - Other PFs on the local PCIe controller, and any VFs belonging to them.
+ - PFs and VFs on other PCIe controllers on the device (e.g. for any embedded
+   System-on-Chip within the SmartNIC).
+ - PFs and VFs with other personalities, including network block devices (such
+   as a vDPA virtio-blk PF backed by remote/distributed storage), if their
+   network access is implemented through a virtual switch port.
+   Note that such functions can require a representor despite the representee
+   not having a netdev.
+ - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
+   their own port on the switch (as opposed to using their parent PF's port).
+ - Any accelerators or plugins on the device whose interface to the network is
+   through a virtual switch port, even if they do not have a corresponding PCIe
+   PF or VF.
+
+This allows the entire switching behaviour of the NIC to be controlled through
+representor TC rules.
+
+A PCIe function which does not have network access through the internal switch
+(not even indirectly through the hardware implementation of whatever services
+the function provides) should *not* have a representor (even if it has a
+netdev).
+Such a function has no switch virtual port for the representor to configure or
+to be the other end of the virtual pipe.
+
+How are representors created?
+-----------------------------
+
+The driver instance attached to the master PF should enumerate the virtual ports
+on the switch, and for each representee, create a pure-software netdevice which
+has some form of in-kernel reference to the PF's own netdevice or driver private
+data (``netdev_priv()``).
+If switch ports can dynamically appear/disappear, the PF driver should create
+and destroy representors appropriately.
+The operations of the representor netdevice will generally involve acting
+through the master PF.  For example, ``ndo_start_xmit()`` might send the packet
+through a hardware TX queue attached to the master PF, with either packet
+metadata or queue configuration marking it for delivery to the representee.
+
+How are representors identified?
+--------------------------------
+
+The representor netdevice should *not* directly refer to a PCIe device (e.g.
+through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
+representee or of the master PF.
+Instead, it should implement the ``ndo_get_devlink_port()`` netdevice op, which
+the kernel uses to provide the ``phys_switch_id`` and ``phys_port_name`` sysfs
+nodes.  (Some legacy drivers implement ``ndo_get_port_parent_id()`` and
+``ndo_get_phys_port_name`` directly, but this is deprecated.)  See
+:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the
+details of this API.
+
+It is expected that userland will use this information (e.g. through udev rules)
+to construct an appropriately informative name or alias for the netdevice.  For
+instance if the master PF is ``eth4`` then a representor with a
+``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.
+
+There are as yet no established conventions for naming representors which do not
+correspond to PCIe functions (e.g. accelerators and plugins).
+
+How do representors interact with TC rules?
+-------------------------------------------
+
+Any TC rule on a representor applies (in software TC) to packets received by
+that representor netdevice.  Thus, if the delivery part of the rule corresponds
+to another port on the virtual switch, the driver may choose to offload it to
+hardware, applying it to packets transmitted by the representee.
+
+Similarly, since a TC mirred egress action targeting the representor would (in
+software) send the packet through the representor (and thus indirectly deliver
+it to the representee), hardware offload should interpret this as delivery to
+the representee.
+
+As a simple example, if ``eth0`` is the master PF's netdevice and ``eth1`` is a
+VF representor, the following rules::
+
+    tc filter add dev eth1 parent ffff: protocol ipv4 flower \
+        action mirred egress redirect dev eth0
+    tc filter add dev eth0 parent ffff: protocol ipv4 flower \
+        action mirred egress mirror dev eth1
+
+would mean that all IPv4 packets from the VF are sent out the physical port, and
+all IPv4 packets received on the physical port are delivered to the VF in
+addition to the master PF.
+
+Of course the rules can (if supported by the NIC) include packet-modifying
+actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
+
+Tunnel encapsulation and decapsulation are rather more complicated, as they
+involve a third netdevice (a tunnel netdev operating in metadata mode, such as
+a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
+require an IP address to be bound to the underlay device (e.g. master PF or port
+representor).  TC rules such as::
+
+    tc filter add dev eth1 parent ffff: flower \
+        action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
+                              dst_port 4789 \
+        action mirred egress redirect dev vxlan0
+    tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
+        enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
+        action tunnel_key unset action mirred egress redirect dev eth1
+
+where ``LOCAL_IP`` is an IP address bound to ``eth0``, and ``REMOTE_IP`` is
+another IP address on the same subnet, mean that packets sent by the VF should
+be VxLAN encapsulated and sent out the physical port (the driver has to deduce
+this by a route lookup of ``LOCAL_IP`` leading to ``eth0``, and also perform an
+ARP/neighbour table lookup to find the MAC addresses to use in the outer
+Ethernet frame), while UDP packets received on the physical port with UDP port
+4789 should be parsed as VxLAN and, if their VSID matches ``$VNI``, decapsulated
+and forwarded to the VF.
+
+If this all seems complicated, just remember the 'golden rule' of TC offload:
+the hardware should ensure the same final results as if the packets were
+processed through the slow path, traversed software TC and were transmitted or
+received through the representor netdevices.
+
+Configuring the representee's MAC
+---------------------------------
+
+The representee's link state is controlled through the representor.  Setting the
+representor administratively UP or DOWN should cause carrier ON or OFF at the
+representee.
+
+Setting an MTU on the representor should cause that same MTU to be reported to
+the representee.
+(On hardware that allows configuring separate and distinct MTU and MRU values,
+the representor MTU should correspond to the representee's MRU and vice-versa.)
+
+Currently there is no way to use the representor to set the station permanent
+MAC address of the representee; other methods available to do this include:
+
+ - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
+ - devlink port function (see **devlink-port(8)** and
+   :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
diff --git a/Documentation/networking/switchdev.rst b/Documentation/networking/switchdev.rst
index f1f4e6a85a29..21e80c8e661b 100644
--- a/Documentation/networking/switchdev.rst
+++ b/Documentation/networking/switchdev.rst
@@ -1,5 +1,6 @@
 .. SPDX-License-Identifier: GPL-2.0
 .. include:: <isonum.txt>
+.. _switchdev:
 
 ===============================================
 Ethernet switch device driver model (switchdev)

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-15 14:22 [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors ecree
@ 2022-08-18  9:56 ` Marcin Szycik
  2022-08-19 15:20   ` Edward Cree
  2022-08-18 11:20 ` Simon Horman
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Marcin Szycik @ 2022-08-18  9:56 UTC (permalink / raw)
  To: ecree, netdev, linux-net-drivers
  Cc: davem, kuba, pabeni, edumazet, corbet, linux-doc, jacob.e.keller,
	jesse.brandeburg, michael.chan, andy, saeed, jiri, snelson,
	simon.horman, alexander.duyck, rdunlap, Edward Cree

On 15-Aug-22 16:22, ecree@xilinx.com wrote:

> [...]
>
> +Motivation
> +----------
> +
> +Since the mid-2010s, network cards have started offering more complex
> +virtualisation capabilities than the legacy SR-IOV approach (with its simple
> +MAC/VLAN-based switching model) can support.  This led to a desire to offload
> +software-defined networks (such as OpenVSwitch) to these NICs to specify the
> +network connectivity of each function.  The resulting designs are variously
> +called SmartNICs or DPUs.
> +
> +Network function representors bring the standard Linux networking stack to
> +virtual switches and IOV devices.  Just as each port of a Linux-controlled
> +switch has a separate netdev, so each virtual function has one.  When the system

Maybe I'm misunderstanding something, but this sentence seems a bit confusing. Maybe:
"Just as each port of a Linux-controlled switch has a separate netdev, each virtual
function has one."?

> +boots, and before any offload is configured, all packets from the virtual
> +functions appear in the networking stack of the PF via the representors.
> +The PF can thus always communicate freely with the virtual functions.
> +The PF can configure standard Linux forwarding between representors, the uplink
> +or any other netdev (routing, bridging, TC classifiers).
>
> [...]
>
> +How do representors interact with TC rules?
> +-------------------------------------------
> +
> +Any TC rule on a representor applies (in software TC) to packets received by
> +that representor netdevice.  Thus, if the delivery part of the rule corresponds
> +to another port on the virtual switch, the driver may choose to offload it to
> +hardware, applying it to packets transmitted by the representee.
> +
> +Similarly, since a TC mirred egress action targeting the representor would (in
> +software) send the packet through the representor (and thus indirectly deliver
> +it to the representee), hardware offload should interpret this as delivery to
> +the representee.
> +
> +As a simple example, if ``eth0`` is the master PF's netdevice and ``eth1`` is a
> +VF representor, the following rules::
> +
> +    tc filter add dev eth1 parent ffff: protocol ipv4 flower \
> +        action mirred egress redirect dev eth0
> +    tc filter add dev eth0 parent ffff: protocol ipv4 flower \
> +        action mirred egress mirror dev eth1

Perhaps eth0/eth1 names could be replaced with more meaningful names, as it's easy
to confuse them now. How about examples from above (e.g. PF -> eth4, PR -> eth4pf1vf2rep)?
Or just $PF_NETDEV, $PR_NETDEV.

> +would mean that all IPv4 packets from the VF are sent out the physical port, and
> +all IPv4 packets received on the physical port are delivered to the VF in
> +addition to the master PF.
> +
> +Of course the rules can (if supported by the NIC) include packet-modifying
> +actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
> +
> +Tunnel encapsulation and decapsulation are rather more complicated, as they
> +involve a third netdevice (a tunnel netdev operating in metadata mode, such as
> +a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
> +require an IP address to be bound to the underlay device (e.g. master PF or port
> +representor).  TC rules such as::
> +
> +    tc filter add dev eth1 parent ffff: flower \
> +        action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
> +                              dst_port 4789 \
> +        action mirred egress redirect dev vxlan0
> +    tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
> +        enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
> +        action tunnel_key unset action mirred egress redirect dev eth1
> +

Same as above, eth1 name could be more intuitive.

--- 8< ---

LGTM, only those two small nitpicks.

Regards,
Marcin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-15 14:22 [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors ecree
  2022-08-18  9:56 ` Marcin Szycik
@ 2022-08-18 11:20 ` Simon Horman
  2022-08-18 15:07 ` Roi Dayan
  2022-08-18 16:44 ` Parav Pandit
  3 siblings, 0 replies; 9+ messages in thread
From: Simon Horman @ 2022-08-18 11:20 UTC (permalink / raw)
  To: ecree
  Cc: netdev, linux-net-drivers, davem, kuba, pabeni, edumazet, corbet,
	linux-doc, jacob.e.keller, jesse.brandeburg, michael.chan, andy,
	saeed, jiri, snelson, alexander.duyck, rdunlap, Edward Cree

On Mon, Aug 15, 2022 at 03:22:51PM +0100, ecree@xilinx.com wrote:
> From: Edward Cree <ecree.xilinx@gmail.com>
> 
> There's no clear explanation of what VF Representors are for, their
>  semantics, etc., outside of vendor docs and random conference slides.
> Add a document explaining Representors and defining what drivers that
>  implement them are expected to do.
> 
> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>

Thanks a lot for working on this, it looks very nice to me.

Reviewed-by: Simon Horman <simon.horman@corigine.com>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-15 14:22 [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors ecree
  2022-08-18  9:56 ` Marcin Szycik
  2022-08-18 11:20 ` Simon Horman
@ 2022-08-18 15:07 ` Roi Dayan
  2022-08-19 15:18   ` Edward Cree
  2022-08-18 16:44 ` Parav Pandit
  3 siblings, 1 reply; 9+ messages in thread
From: Roi Dayan @ 2022-08-18 15:07 UTC (permalink / raw)
  To: ecree, netdev, linux-net-drivers
  Cc: davem, kuba, pabeni, edumazet, corbet, linux-doc, jacob.e.keller,
	jesse.brandeburg, michael.chan, andy, saeed, jiri, snelson,
	simon.horman, alexander.duyck, rdunlap, Edward Cree



On 2022-08-15 5:22 PM, ecree@xilinx.com wrote:
> From: Edward Cree <ecree.xilinx@gmail.com>
> 
> There's no clear explanation of what VF Representors are for, their
>   semantics, etc., outside of vendor docs and random conference slides.
> Add a document explaining Representors and defining what drivers that
>   implement them are expected to do.
> 
> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
> ---
> Changed in v2:
>   - incorporated feedback from Jakub including rewrite of the Motivation section,
>     representors for uplink/phys port, replace phys_port_name conventions with
>     devlink port.
>   - fixed archaic spelling (Randy)
>   - painted the bike shed blue ("master PF") for now, we can always change it
>     again later
>   - added Definitions section
> 
>   Documentation/networking/index.rst        |   1 +
>   Documentation/networking/representors.rst | 228 ++++++++++++++++++++++
>   Documentation/networking/switchdev.rst    |   1 +
>   3 files changed, 230 insertions(+)
>   create mode 100644 Documentation/networking/representors.rst
> 
> diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
> index 03b215bddde8..c37ea2b54c29 100644
> --- a/Documentation/networking/index.rst
> +++ b/Documentation/networking/index.rst
> @@ -93,6 +93,7 @@ Contents:
>      radiotap-headers
>      rds
>      regulatory
> +   representors
>      rxrpc
>      sctp
>      secid
> diff --git a/Documentation/networking/representors.rst b/Documentation/networking/representors.rst
> new file mode 100644
> index 000000000000..be7cc4752d11
> --- /dev/null
> +++ b/Documentation/networking/representors.rst
> @@ -0,0 +1,228 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============================
> +Network Function Representors
> +=============================
> +
> +This document describes the semantics and usage of representor netdevices, as
> +used to control internal switching on SmartNICs.  For the closely-related port
> +representors on physical (multi-port) switches, see
> +:ref:`Documentation/networking/switchdev.rst <switchdev>`.
> +
> +Motivation
> +----------
> +
> +Since the mid-2010s, network cards have started offering more complex
> +virtualisation capabilities than the legacy SR-IOV approach (with its simple
> +MAC/VLAN-based switching model) can support.  This led to a desire to offload
> +software-defined networks (such as OpenVSwitch) to these NICs to specify the
> +network connectivity of each function.  The resulting designs are variously
> +called SmartNICs or DPUs.
> +
> +Network function representors bring the standard Linux networking stack to
> +virtual switches and IOV devices.  Just as each port of a Linux-controlled
> +switch has a separate netdev, so each virtual function has one.  When the system
> +boots, and before any offload is configured, all packets from the virtual
> +functions appear in the networking stack of the PF via the representors.
> +The PF can thus always communicate freely with the virtual functions.
> +The PF can configure standard Linux forwarding between representors, the uplink
> +or any other netdev (routing, bridging, TC classifiers).
> +
> +Thus, a representor is both a control plane object (representing the function in
> +administrative commands) and a data plane object (one end of a virtual pipe).
> +As a virtual link endpoint, the representor can be configured like any other
> +netdevice; in some cases (e.g. link state) the representee will follow the
> +representor's configuration, while in others there are separate APIs to
> +configure the representee.
> +
> +Definitions
> +-----------
> +
> +This document uses the term "master PF" to refer to the PCIe function which has
> +administrative control over the virtual switch on the device.  Conceivably a NIC
> +could be configured to grant these administrative privileges instead to a VF or
> +SF (subfunction); the terminology is not meant to exclude this case.
> +Depending on NIC design, a multi-port NIC might have a single master PF for the
> +whole device or might have a separate virtual switch, and hence master PF, for
> +each physical network port.
> +If the NIC supports nested switching, there might be separate "master PFs" for
> +each nested switch, in which case each "master PF" should only create
> +representors for the ports on the (sub-)switch it directly administers.
> +
> +A "representee" is the object that a representor represents.  So for example in
> +the case of a VF representor, the representee is the corresponding VF.
> +
> +What does a representor do?
> +---------------------------
> +
> +A representor has three main roles.
> +
> +1. It is used to configure the network connection the representee sees, e.g.
> +   link up/down, MTU, etc.  For instance, bringing the representor
> +   administratively UP should cause the representee to see a link up / carrier
> +   on event.
> +2. It provides the slow path for traffic which does not hit any offloaded
> +   fast-path rules in the virtual switch.  Packets transmitted on the
> +   representor netdevice should be delivered to the representee; packets
> +   transmitted to the representee which fail to match any switching rule should
> +   be received on the representor netdevice.  (That is, there is a virtual pipe
> +   connecting the representor to the representee, similar in concept to a veth
> +   pair.)
> +   This allows software switch implementations (such as OpenVSwitch or a Linux
> +   bridge) to forward packets between representees and the rest of the network.
> +3. It acts as a handle by which switching rules (such as TC filters) can refer
> +   to the representee, allowing these rules to be offloaded.
> +
> +The combination of 2) and 3) means that the behaviour (apart from performance)
> +should be the same whether a TC filter is offloaded or not.  E.g. a TC rule
> +on a VF representor applies in software to packets received on that representor
> +netdevice, while in hardware offload it would apply to packets transmitted by
> +the representee VF.  Conversely, a mirred egress redirect to a VF representor
> +corresponds in hardware to delivery directly to the representee VF.
> +
> +What functions should have a representor?
> +-----------------------------------------
> +
> +Essentially, for each virtual port on the device's internal switch, there
> +should be a representor.
> +Some vendors have chosen to omit representors for the uplink and the physical
> +network port, which can simplify usage (the uplink netdev becomes in effect the
> +physical port's representor) but does not generalise to devices with multiple
> +ports or uplinks.
> +
> +Thus, the following should all have representors:
> +
> + - VFs belonging to the master PF.
> + - Other PFs on the local PCIe controller, and any VFs belonging to them.
> + - PFs and VFs on other PCIe controllers on the device (e.g. for any embedded
> +   System-on-Chip within the SmartNIC).
> + - PFs and VFs with other personalities, including network block devices (such
> +   as a vDPA virtio-blk PF backed by remote/distributed storage), if their
> +   network access is implemented through a virtual switch port.
> +   Note that such functions can require a representor despite the representee
> +   not having a netdev.
> + - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
> +   their own port on the switch (as opposed to using their parent PF's port).
> + - Any accelerators or plugins on the device whose interface to the network is
> +   through a virtual switch port, even if they do not have a corresponding PCIe
> +   PF or VF.
> +
> +This allows the entire switching behaviour of the NIC to be controlled through
> +representor TC rules.
> +
> +A PCIe function which does not have network access through the internal switch
> +(not even indirectly through the hardware implementation of whatever services
> +the function provides) should *not* have a representor (even if it has a
> +netdev).
> +Such a function has no switch virtual port for the representor to configure or
> +to be the other end of the virtual pipe.
> +
> +How are representors created?
> +-----------------------------
> +
> +The driver instance attached to the master PF should enumerate the virtual ports
> +on the switch, and for each representee, create a pure-software netdevice which
> +has some form of in-kernel reference to the PF's own netdevice or driver private
> +data (``netdev_priv()``).
> +If switch ports can dynamically appear/disappear, the PF driver should create
> +and destroy representors appropriately.
> +The operations of the representor netdevice will generally involve acting
> +through the master PF.  For example, ``ndo_start_xmit()`` might send the packet
> +through a hardware TX queue attached to the master PF, with either packet
> +metadata or queue configuration marking it for delivery to the representee.
> +
> +How are representors identified?
> +--------------------------------
> +
> +The representor netdevice should *not* directly refer to a PCIe device (e.g.
> +through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
> +representee or of the master PF.

Hi,
maybe I'm confused here, but why representor should not refer to pci
device ? it does exists today for systemd renaming.
and this is used beside of implementing the other ndos you mention
below.

     udevadm output after linking to PCI device:
     $ udevadm test-builtin net_id /sys/class/net/eth6
     Load module index
     Network interface NamePolicy= disabled on kernel command line, 
ignoring.
     Parsed configuration file /usr/lib/systemd/network/99-default.link
     Created link configuration context.
     Using default interface naming scheme 'v243'.
     ID_NET_NAMING_SCHEME=v243
     ID_NET_NAME_PATH=enp0s8f0npf0vf0
     Unload module index
     Unloaded link configuration context.

$  git grep SET_NETDEV_DEV|grep rep
drivers/net/ethernet/intel/ice/ice_repr.c: 
SET_NETDEV_DEV(repr->netdev, ice_pf_to_dev(vf->pf));
drivers/net/ethernet/mellanox/mlx5/core/en_rep.c: 
SET_NETDEV_DEV(netdev, mdev->device);
drivers/net/ethernet/netronome/nfp/flower/main.c: 
SET_NETDEV_DEV(repr, &priv->nn->pdev->dev);


> +Instead, it should implement the ``ndo_get_devlink_port()`` netdevice op, which
> +the kernel uses to provide the ``phys_switch_id`` and ``phys_port_name`` sysfs
> +nodes.  (Some legacy drivers implement ``ndo_get_port_parent_id()`` and
> +``ndo_get_phys_port_name`` directly, but this is deprecated.)  See
> +:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the
> +details of this API.
> +
> +It is expected that userland will use this information (e.g. through udev rules)
> +to construct an appropriately informative name or alias for the netdevice.  For
> +instance if the master PF is ``eth4`` then a representor with a
> +``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.
> +
> +There are as yet no established conventions for naming representors which do not
> +correspond to PCIe functions (e.g. accelerators and plugins).
> +
> +How do representors interact with TC rules?
> +-------------------------------------------
> +
> +Any TC rule on a representor applies (in software TC) to packets received by
> +that representor netdevice.  Thus, if the delivery part of the rule corresponds
> +to another port on the virtual switch, the driver may choose to offload it to
> +hardware, applying it to packets transmitted by the representee.
> +
> +Similarly, since a TC mirred egress action targeting the representor would (in
> +software) send the packet through the representor (and thus indirectly deliver
> +it to the representee), hardware offload should interpret this as delivery to
> +the representee.
> +
> +As a simple example, if ``eth0`` is the master PF's netdevice and ``eth1`` is a
> +VF representor, the following rules::
> +
> +    tc filter add dev eth1 parent ffff: protocol ipv4 flower \
> +        action mirred egress redirect dev eth0
> +    tc filter add dev eth0 parent ffff: protocol ipv4 flower \
> +        action mirred egress mirror dev eth1
> +
> +would mean that all IPv4 packets from the VF are sent out the physical port, and
> +all IPv4 packets received on the physical port are delivered to the VF in
> +addition to the master PF.
> +
> +Of course the rules can (if supported by the NIC) include packet-modifying
> +actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
> +
> +Tunnel encapsulation and decapsulation are rather more complicated, as they
> +involve a third netdevice (a tunnel netdev operating in metadata mode, such as
> +a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
> +require an IP address to be bound to the underlay device (e.g. master PF or port
> +representor).  TC rules such as::
> +
> +    tc filter add dev eth1 parent ffff: flower \
> +        action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
> +                              dst_port 4789 \
> +        action mirred egress redirect dev vxlan0
> +    tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
> +        enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
> +        action tunnel_key unset action mirred egress redirect dev eth1
> +
> +where ``LOCAL_IP`` is an IP address bound to ``eth0``, and ``REMOTE_IP`` is
> +another IP address on the same subnet, mean that packets sent by the VF should
> +be VxLAN encapsulated and sent out the physical port (the driver has to deduce
> +this by a route lookup of ``LOCAL_IP`` leading to ``eth0``, and also perform an
> +ARP/neighbour table lookup to find the MAC addresses to use in the outer
> +Ethernet frame), while UDP packets received on the physical port with UDP port
> +4789 should be parsed as VxLAN and, if their VSID matches ``$VNI``, decapsulated
> +and forwarded to the VF.
> +
> +If this all seems complicated, just remember the 'golden rule' of TC offload:
> +the hardware should ensure the same final results as if the packets were
> +processed through the slow path, traversed software TC and were transmitted or
> +received through the representor netdevices.
> +
> +Configuring the representee's MAC
> +---------------------------------
> +
> +The representee's link state is controlled through the representor.  Setting the
> +representor administratively UP or DOWN should cause carrier ON or OFF at the
> +representee.
> +
> +Setting an MTU on the representor should cause that same MTU to be reported to
> +the representee.
> +(On hardware that allows configuring separate and distinct MTU and MRU values,
> +the representor MTU should correspond to the representee's MRU and vice-versa.)
> +
> +Currently there is no way to use the representor to set the station permanent
> +MAC address of the representee; other methods available to do this include:
> +
> + - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
> + - devlink port function (see **devlink-port(8)** and
> +   :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
> diff --git a/Documentation/networking/switchdev.rst b/Documentation/networking/switchdev.rst
> index f1f4e6a85a29..21e80c8e661b 100644
> --- a/Documentation/networking/switchdev.rst
> +++ b/Documentation/networking/switchdev.rst
> @@ -1,5 +1,6 @@
>   .. SPDX-License-Identifier: GPL-2.0
>   .. include:: <isonum.txt>
> +.. _switchdev:
>   
>   ===============================================
>   Ethernet switch device driver model (switchdev)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-15 14:22 [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors ecree
                   ` (2 preceding siblings ...)
  2022-08-18 15:07 ` Roi Dayan
@ 2022-08-18 16:44 ` Parav Pandit
  2022-08-19 16:36   ` Edward Cree
  3 siblings, 1 reply; 9+ messages in thread
From: Parav Pandit @ 2022-08-18 16:44 UTC (permalink / raw)
  To: ecree@xilinx.com, netdev@vger.kernel.org,
	linux-net-drivers@amd.com
  Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com,
	edumazet@google.com, corbet@lwn.net, linux-doc@vger.kernel.org,
	jacob.e.keller@intel.com, jesse.brandeburg@intel.com,
	michael.chan@broadcom.com, andy@greyhouse.net, saeed@kernel.org,
	jiri@resnulli.us, snelson@pensando.io, simon.horman@corigine.com,
	alexander.duyck@gmail.com, rdunlap@infradead.org, Edward Cree,
	Parav Pandit


> From: ecree@xilinx.com <ecree@xilinx.com>
> Sent: Monday, August 15, 2022 10:23 AM
> 
> +
> +=============================
> +Network Function Representors
Given current documentation of the patch:

s/Network Function Representors/Network Function Port Representors.

> +=============================
Most of the text below describes "netdev" device of the switchdev device.

A switchdev device' netdevice only represents the port of the representee.

Hence the document title is not truly justified what it wants to say.

A _whole_ network function is represented today using 
a. netdevice represents representee's network port

b. devlink port function for function management

This good documentation is probably best placed in switchdev.rst may be with a NIC specific section.
This will connect the two dots nicely.

[..]
> +Motivation
> +----------
> +Network function representors bring the standard Linux networking stack
Network functions port representors; here and at other places ...

> +to virtual switches and IOV devices.  Just as each port of a
> +Linux-controlled switch has a separate netdev, so each virtual function
> +has one.  When the system boots, and before any offload is configured,
> +all packets from the virtual functions appear in the networking stack of the
> PF via the representors.
> +The PF can thus always communicate freely with the virtual functions.
> +The PF can configure standard Linux forwarding between representors,
> +the uplink or any other netdev (routing, bridging, TC classifiers).
> +
> +Thus, a representor is both a control plane object (representing the
> +function in administrative commands) 
Certain part of the function administrative commands that doesn't directly fit to netdev.
They are through devlink port + function.

> and a data plane object (one end of a
> virtual pipe).
> +As a virtual link endpoint, the representor can be configured like any
> +other netdevice; in some cases (e.g. link state) the representee will
> +follow the representor's configuration, while in others there are
> +separate APIs to configure the representee.
> +
> +Definitions
> +-----------
> +
> +This document uses the term "master PF" to refer to the PCIe function
s/master PF/switchdev function
or
s/master PF/switchdev

Especially when you have below text of granting these privileges to a VF/SF...
It also aligns to devlink switchdev device notion.
Less new terms for readers and admins to learn. :)

[..]
> Packets transmitted on the
> +   representor netdevice should be delivered to the representee; packets
> +   transmitted to the representee which fail to match any switching rule
> should
> +   be received on the representor netdevice.  (That is, there is a virtual pipe
> +   connecting the representor to the representee, similar in concept to a
> veth
> +   pair.)
Please add text that,
Packets transmitted by the representee and when they are not offloaded, such packets are delivered to the port representor netdevice.

> +What functions should have a representor?
> +-----------------------------------------
> +
> +Essentially, for each virtual port on the device's internal switch,
                                                                            ^^^^
You probably wanted to say master PF internal switch here.

Better to word it as, each virtual port of a switchdev, there should be...

> +there should be a representor.
> +Some vendors have chosen to omit representors for the uplink and the
> +physical network port, which can simplify usage (the uplink netdev
> +becomes in effect the physical port's representor) but does not
> +generalise to devices with multiple ports or uplinks.
> +
> +Thus, the following should all have representors:
> +
> + - VFs belonging to the master PF.
> + - Other PFs on the local PCIe controller, and any VFs belonging to them.
Local and/or external PCIe controllers.

> + - PFs and VFs on other PCIe controllers on the device (e.g. for any
> embedded
> +   System-on-Chip within the SmartNIC).
> + - PFs and VFs with other personalities, including network block devices
> (such
> +   as a vDPA virtio-blk PF backed by remote/distributed storage), if their
> +   network access is implemented through a virtual switch port.
> +   Note that such functions can require a representor despite the
> representee
> +   not having a netdev.
This looks a big undertaking to represent them via "netdevice".
Mostly they cannot be well represented by the netdevice.

Even a network function was not fully fitting into the existing port representor netdevice.
In some cases, such vDPA devices are affiliated to the switchdev, but they use one or multiple of its ports.
In second scenario, today vdpa devices are used without switchdev device too.

Configuring vDPA device's block size, capacity, iops just doesn't make sense to control via netdevice representation.

vdpa has its own subsystem [1] to manage its backend.

So please remove above vdpa related documentation that doesn't reflect the current kernel.

[1] https://man7.org/linux/man-pages/man8/vdpa.8.html

> + - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
> +   their own port on the switch (as opposed to using their parent PF's port).
Not sure why the text has _if_ for SF and not for the VF.
Do you see a SF device in the kernel that doesn't have their own port, due to which there is _if_ added?

> + - Any accelerators or plugins on the device whose interface to the network
> is
> +   through a virtual switch port, even if they do not have a corresponding
> PCIe
> +   PF or VF.
> +
Above vdpa block is good example that cannot be represent by the netdev port representor.
I would imagine other accelerators won't be able to fit in the netdev representation.

> +How are representors created?
> +-----------------------------
> +
> +The driver instance attached to the master PF should enumerate the
> +virtual ports on the switch, and for each representee, create a
> +pure-software netdevice which has some form of in-kernel reference to
> +the PF's own netdevice or driver private data (``netdev_priv()``).
Today a user can create new virtual ports. Hence, these port represnetors and function representors are created dynamically without enumeration.
Please add text describing both ways.

For mlx5 case a representor netdevice has real queue from which tx/rx DMA happens from the device to/from network.
It is not entirely pure software per say.
Hence, "pure-software" is misleading. Please drop that word.

> +If switch ports can dynamically appear/disappear, the PF driver should
> +create and destroy representors appropriately.
> +The operations of the representor netdevice will generally involve
> +acting through the master PF.  For example, ``ndo_start_xmit()`` might
> +send the packet through a hardware TX queue attached to the master PF,
> +with either packet metadata or queue configuration marking it for delivery
> to the representee.
Sharing/not sharing TX and RX queue among representor netdevices is not yet well established.
Mostly my knowledge is ancient on sharing model.

In mlx5 driver queues are per representor netdevice. For others may be different.

Is there a way for the netdev to tell that it shares resources/queues with other netdevices?

So better to say:
The operations of the representor netdevice will generally involve using resources/queues attached to the netdevice.

> +
> +How are representors identified?
> +--------------------------------
> +
> +The representor netdevice should *not* directly refer to a PCIe device (e.g.
> +through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
> +representee or of the master PF.
This isn't true.
Representor netdevices are connected to the switchdev device PCI function.
Without linking to PCI device, udev scriptology needs to grep among thousands of netdevices and its very inefficient.

May be a better code in future to take the dev.parent directly from the devlink port a netdevice is attached to.

> +There are as yet no established conventions for naming representors
> +which do not correspond to PCIe functions (e.g. accelerators and plugins).
Netdevice represents the networking port of the function.
So, this is not applicable text here.

> +Configuring the representee's MAC
> +---------------------------------
> +
> +
> +Currently there is no way to use the representor to set the station
> +permanent MAC address of the representee; other methods available to
> do this include:
Just rewrite above line as:
Representee's MAC can be configured using ..

Overall looks good doc to add apart from above comments to fix.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-18 15:07 ` Roi Dayan
@ 2022-08-19 15:18   ` Edward Cree
  0 siblings, 0 replies; 9+ messages in thread
From: Edward Cree @ 2022-08-19 15:18 UTC (permalink / raw)
  To: Roi Dayan, ecree, netdev, linux-net-drivers
  Cc: davem, kuba, pabeni, edumazet, corbet, linux-doc, jacob.e.keller,
	jesse.brandeburg, michael.chan, andy, saeed, jiri, snelson,
	simon.horman, alexander.duyck, rdunlap

On 18/08/2022 16:07, Roi Dayan wrote:
> On 2022-08-15 5:22 PM, ecree@xilinx.com wrote:
>> +The representor netdevice should *not* directly refer to a PCIe device (e.g.
>> +through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
>> +representee or of the master PF.
> 
> Hi,
> maybe I'm confused here, but why representor should not refer to pci
> device ? it does exists today for systemd renaming.
> and this is used beside of implementing the other ndos you mention
> below.

The master PF is already identified via ``phys_switch_id``, another linkage
 is not needed and only means that userland looking up netdevices by PCIe
 address has to do another step to distinguish the PF's own netdev from all
 the representors.  Allegedly[1] nfp ran into issues where OpenStack would
 sometimes use the reprs for ops that logically should have been on the PF
 because they all had the same /sys/class/net/$INTF/device and it wasn't
 smart enough to tell the difference.

Semantically, the representor is a virtual device, that's backed by the PF
 netdevice rather than the PF's hardware directly — even if it has e.g.
 dedicated queues, it's still not in administrative control of the PCIe
 function in the way that the PF driver instance is.  And compare to e.g.
 a vlan netdev stacked on top of the PF netdev — we don't put the PF in
 /sys/class/net/vlan0/device...

> $  git grep SET_NETDEV_DEV|grep rep
> drivers/net/ethernet/intel/ice/ice_repr.c: SET_NETDEV_DEV(repr->netdev, ice_pf_to_dev(vf->pf));
> drivers/net/ethernet/mellanox/mlx5/core/en_rep.c: SET_NETDEV_DEV(netdev, mdev->device);
> drivers/net/ethernet/netronome/nfp/flower/main.c: SET_NETDEV_DEV(repr, &priv->nn->pdev->dev);

Yes, several existing drivers do this[2].  IMHO they're wrong.

-ed

[1]: https://lore.kernel.org/all/20220728113231.26fdfab0@kernel.org/
[2]: https://lore.kernel.org/netdev/71af8654-ca69-c492-7e12-ed7ff455a2f1@gmail.com/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-18  9:56 ` Marcin Szycik
@ 2022-08-19 15:20   ` Edward Cree
  0 siblings, 0 replies; 9+ messages in thread
From: Edward Cree @ 2022-08-19 15:20 UTC (permalink / raw)
  To: Marcin Szycik, ecree, netdev, linux-net-drivers
  Cc: davem, kuba, pabeni, edumazet, corbet, linux-doc, jacob.e.keller,
	jesse.brandeburg, michael.chan, andy, saeed, jiri, snelson,
	simon.horman, alexander.duyck, rdunlap

On 18/08/2022 10:56, Marcin Szycik wrote:
> On 15-Aug-22 16:22, ecree@xilinx.com wrote:
> 
>> Just as each port of a Linux-controlled
>> +switch has a separate netdev, so each virtual function has one.  When the system
> 
> Maybe I'm misunderstanding something, but this sentence seems a bit confusing. Maybe:
> "Just as each port of a Linux-controlled switch has a separate netdev, each virtual
> function has one."?

Kuba wrote this paragraph and tbh it makes sense to me.
But how about "Just as each port of a Linux-controlled switch has a
 separate netdev, so does each virtual function."?

>> +As a simple example, if ``eth0`` is the master PF's netdevice and ``eth1`` is a
>> +VF representor, the following rules::
>> +
>> +    tc filter add dev eth1 parent ffff: protocol ipv4 flower \
>> +        action mirred egress redirect dev eth0
>> +    tc filter add dev eth0 parent ffff: protocol ipv4 flower \
>> +        action mirred egress mirror dev eth1
> 
> Perhaps eth0/eth1 names could be replaced with more meaningful names, as it's easy
> to confuse them now. How about examples from above (e.g. PF -> eth4, PR -> eth4pf1vf2rep)?
> Or just $PF_NETDEV, $PR_NETDEV.

Yeah, I can replace them with $VARIABLES.

-ed

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-18 16:44 ` Parav Pandit
@ 2022-08-19 16:36   ` Edward Cree
  2022-08-19 18:59     ` Parav Pandit
  0 siblings, 1 reply; 9+ messages in thread
From: Edward Cree @ 2022-08-19 16:36 UTC (permalink / raw)
  To: Parav Pandit, ecree@xilinx.com, netdev@vger.kernel.org,
	linux-net-drivers@amd.com
  Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com,
	edumazet@google.com, corbet@lwn.net, linux-doc@vger.kernel.org,
	jacob.e.keller@intel.com, jesse.brandeburg@intel.com,
	michael.chan@broadcom.com, andy@greyhouse.net, saeed@kernel.org,
	jiri@resnulli.us, snelson@pensando.io, simon.horman@corigine.com,
	alexander.duyck@gmail.com, rdunlap@infradead.org

On 18/08/2022 17:44, Parav Pandit wrote:
> A _whole_ network function is represented today using 
> a. netdevice represents representee's network port
> 
> b. devlink port function for function management

So, at the moment I'm just trying to document the current consensus,
 but where I plan to go _after_ this doc is building the case that
 devlink port as it exists today mixes in too much networking
 configuration that really belongs in the representor.  The example
 that motivated this for me is that setting the MAC address of the
 representee is currently a devlink port function operation, but
 this has nothing to do with the PCIe function and everything to do
 with the network port, so logically it should be an operation on
 the representor.  (I intend to develop a patch making it such, once
 we're all on the same page.)

I think a general rule is — would this operation make sense on a non-
 networking SR-IOV device?  If not, then it shouldn't be in devlink
 port.  E.g. why is port splitting a devlink port operation and not
 an operation on the port representor netdev?

> s/master PF/switchdev function

switchdev function might actually be the best name suggestion yet.
I like it.

> Please add text that,
> Packets transmitted by the representee and when they are not offloaded, such packets are delivered to the port representor netdevice.

That's exactly what
>> packets
>> +   transmitted to the representee which fail to match any switching rule
>> should
>> +   be received on the representor netdevice.
says.  (Although my choice of preposition — 'to', rather than 'by'
 — was less than clear.)

>> +What functions should have a representor?
>> +-----------------------------------------
>> +
>> +Essentially, for each virtual port on the device's internal switch,
>                                                                             ^^^^
> You probably wanted to say master PF internal switch here.
> 
> Better to word it as, each virtual port of a switchdev, there should be...

Hmm idk, I feel like "switchdev" has the connotation of "the software
 object inside the kernel representing the switch" rather than "the
 switch itself".

>> + - Other PFs on the local PCIe controller, and any VFs belonging to them.
> Local and/or external PCIe controllers.
That's literally the next bullet point.

>> + - PFs and VFs on other PCIe controllers on the device (e.g. for any
>> embedded
>> +   System-on-Chip within the SmartNIC).
Do I need to use the word "external" to make it more obvious?

>> + - PFs and VFs with other personalities, including network block devices
>> (such
>> +   as a vDPA virtio-blk PF backed by remote/distributed storage), if their
>> +   network access is implemented through a virtual switch port.
>> +   Note that such functions can require a representor despite the
>> representee
>> +   not having a netdev.
> This looks a big undertaking to represent them via "netdevice".
> Mostly they cannot be well represented by the netdevice.

The netdevice isn't supposed to represent the vDPA block device.  Rather
 it represents the switch port that the block device is using.

> In some cases, such vDPA devices are affiliated to the switchdev, but they use one or multiple of its ports.

If the block device uses multiple switch ports, then it should have
 multiple representors, one for each port, so that each switch port can
 be configured in the standard way.

Configuration of the block device itself is of course through separate
 interfaces which are common to non-switchdev virtual block devices.

>> + - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
>> +   their own port on the switch (as opposed to using their parent PF's port).
> Not sure why the text has _if_ for SF and not for the VF.
> Do you see a SF device in the kernel that doesn't have their own port, due to which there is _if_ added?

This document is meant to cover situations that vendors are likely to
 find themselves in, not just those that have already been encountered.
It is plausible, at least to me, that a vendor might decide to implement
 subfunctions at a filtering rather than a switching level (i.e. it's
 just a bundle of queue pairs and you use something like ethtool NFC to
 direct traffic to it).  And if that happens, I don't want them to read
 my doc and (wrongly) think that they still need reprs for such SFs.
(The corresponding situation is far less likely to arise for VFs,
 because there's a clear understanding across the industry that VFs
 should look to their consumer like self-contained network devices,
 which implies transparent switching.)

>> +How are representors created?
>> +-----------------------------
>> +
>> +The driver instance attached to the master PF should enumerate the
>> +virtual ports on the switch, and for each representee, create a
>> +pure-software netdevice which has some form of in-kernel reference to
>> +the PF's own netdevice or driver private data (``netdev_priv()``).
> Today a user can create new virtual ports. Hence, these port represnetors and function representors are created dynamically without enumeration.
> Please add text describing both ways.

Again, this is addressed in the next sentence after you quoted:
>> +If switch ports can dynamically appear/disappear, the PF driver should
>> +create and destroy representors appropriately.

> For mlx5 case a representor netdevice has real queue from which tx/rx DMA happens from the device to/from network.
> It is not entirely pure software per say.
> Hence, "pure-software" is misleading. Please drop that word.

The rep dev doesn't own the BAR.  Everything it has it gets from
 the PF.  That's why it shouldn't SET_NETDEV_DEV, which is what I
 mean by "pure-software".

>> +The operations of the representor netdevice will generally involve
>> +acting through the master PF.  For example, ``ndo_start_xmit()`` might
>> +send the packet through a hardware TX queue attached to the master PF,
>> +with either packet metadata or queue configuration marking it for delivery
>> to the representee.
> Sharing/not sharing TX and RX queue among representor netdevices is not yet well established.

But in either case the hw TXQ will have been created out of the
 PF's BAR(s) (there's no other PCIe function/aperture to poke at
 the hardware from), that's what I mean by "attached to".  If you
 have a clearer way to word that I'm all ears.

>> +
>> +How are representors identified?
>> +--------------------------------
>> +
>> +The representor netdevice should *not* directly refer to a PCIe device (e.g.
>> +through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
>> +representee or of the master PF.
> This isn't true.
> Representor netdevices are connected to the switchdev device PCI function.

In some but not all existing drivers.
Note that I said "should not", not "does not".

> Without linking to PCI device, udev scriptology needs to grep among thousands of netdevices and its very inefficient.

It's a control plane operation, is efficiency really a prime
 concern?  If so, surely the right thing is to give
 /sys/class/net/$REP_DEV/ a suitably-named symlink to
 /sys/class/net/$SWITCH_DEV, performing the same role as the
 phys_switch_id matching without the global search, rather than
 a semantically-invalid PCIe device link.

>> +There are as yet no established conventions for naming representors
>> +which do not correspond to PCIe functions (e.g. accelerators and plugins).
> Netdevice represents the networking port of the function.

No, it represents any networking port on the switch.  Whether
 that has a PCIe function or not.
(The doc title, "Network Function Representors", is deliberately
 phrased to be interpretable as about "network functions" in the
 sense of NFV, rather than "networking PCIe functions".  An
 entire network function (in the NFV sense) could be implemented
 in hardware, in which case it would have a switch port and thus
 representor, but no PCIe function — it terminates traffic inside
 the device rather than sending it over the PCIe bus to a driver
 in the host or a VM.)

-ed

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-19 16:36   ` Edward Cree
@ 2022-08-19 18:59     ` Parav Pandit
  0 siblings, 0 replies; 9+ messages in thread
From: Parav Pandit @ 2022-08-19 18:59 UTC (permalink / raw)
  To: Edward Cree, ecree@xilinx.com, netdev@vger.kernel.org,
	linux-net-drivers@amd.com
  Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com,
	edumazet@google.com, corbet@lwn.net, linux-doc@vger.kernel.org,
	jacob.e.keller@intel.com, jesse.brandeburg@intel.com,
	michael.chan@broadcom.com, andy@greyhouse.net, saeed@kernel.org,
	jiri@resnulli.us, snelson@pensando.io, simon.horman@corigine.com,
	alexander.duyck@gmail.com, rdunlap@infradead.org


> From: Edward Cree <ecree.xilinx@gmail.com>
> Sent: Friday, August 19, 2022 12:36 PM
> 
> On 18/08/2022 17:44, Parav Pandit wrote:
> > A _whole_ network function is represented today using a. netdevice
> > represents representee's network port
> >
> > b. devlink port function for function management
> 
> So, at the moment I'm just trying to document the current consensus,  but
> where I plan to go _after_ this doc is building the case that  devlink port as it
> exists today mixes in too much networking  configuration that really belongs
> in the representor.  
Representor netdevice has its own mac address that can be used to build switching/routing services.

> The example  that motivated this for me is that setting
> the MAC address of the  representee is currently a devlink port function
> operation, but  this has nothing to do with the PCIe function and everything
> to do  with the network port, so logically it should be an operation on  the
> representor.  (I intend to develop a patch making it such, once  we're all on
> the same page.)

Devlink port function is the "function representor".
Life cycle and attributes of this function are controlled using this representor object.

We always wanted a "function representor" out of the devlink port but unfortunately that didn’t happen in past.

A mac address is set to the PCI function. It is a L2 address. Hence it is defined as hw_addr and not as mac.

I didn’t fully understand "network port" above. It is the network port of the function, right?

> 
> I think a general rule is — would this operation make sense on a non-
> networking SR-IOV device?  If not, then it shouldn't be in devlink  port.  E.g.
> why is port splitting a devlink port operation and not  an operation on the
> port representor netdev?
> 
It is applicable to networking non-SR-IOV device such as physical ports, PFs, SFs.

But I resonate with you, would a "devlink function object" as independent entity make sense?
Absolutely yes to me.
Should this devlink function object to be linked to a devlink port and netdevice representor? Yes.

In such case should we take devlink out of the net/devlink.c so that non networking functions can also enjoy?
May be yes.
Good discussion to have.

> > s/master PF/switchdev function
> 
> switchdev function might actually be the best name suggestion yet.
> I like it.
> 
> > Please add text that,
> > Packets transmitted by the representee and when they are not offloaded,
> such packets are delivered to the port representor netdevice.
> 
> That's exactly what
> >> packets
> >> +   transmitted to the representee which fail to match any switching
> >> + rule
> >> should
> >> +   be received on the representor netdevice.
> says.  (Although my choice of preposition — 'to', rather than 'by'
>  — was less than clear.)
> 
Yeah.
'to' is confusing. 
'by' is clear that it is sent by the representee.

> >> +What functions should have a representor?
> >> +-----------------------------------------
> >> +
> >> +Essentially, for each virtual port on the device's internal switch,
> >
> > ^^^^ You probably wanted to say master PF internal switch here.
> >
> > Better to word it as, each virtual port of a switchdev, there should be...
> 
> Hmm idk, I feel like "switchdev" has the connotation of "the software  object
> inside the kernel representing the switch" rather than "the  switch itself".
> 
True, but at kernel level, such device is exposed as switchdev object.
So referring to kernel object looks more aligned with the rest documentation.

> >> + - Other PFs on the local PCIe controller, and any VFs belonging to them.
> > Local and/or external PCIe controllers.
> That's literally the next bullet point.
> 
> >> + - PFs and VFs on other PCIe controllers on the device (e.g. for any
> >> embedded
> >> +   System-on-Chip within the SmartNIC).
> Do I need to use the word "external" to make it more obvious?
> 
'other' is vague.
Yes, external makes it obvious that reader can refer to in other part of the kernel documentation.

> >> + - PFs and VFs with other personalities, including network block
> >> + devices
> >> (such
> >> +   as a vDPA virtio-blk PF backed by remote/distributed storage), if their
> >> +   network access is implemented through a virtual switch port.
> >> +   Note that such functions can require a representor despite the
> >> representee
> >> +   not having a netdev.
> > This looks a big undertaking to represent them via "netdevice".
> > Mostly they cannot be well represented by the netdevice.
> 
> The netdevice isn't supposed to represent the vDPA block device.  Rather  it
> represents the switch port that the block device is using.
> 
Above text was written under a question " What functions should have a representor".
So should a vdpa function need a (switch) representor? Ans = no.
Should a vdpa function need a backend/control representor? Ans = yes.
This document is about (switch) representor. Hence it is better to avoid this confusion.

> > In some cases, such vDPA devices are affiliated to the switchdev, but they
> use one or multiple of its ports.
> 
> If the block device uses multiple switch ports, then it should have  multiple
> representors, one for each port, so that each switch port can  be configured
> in the standard way.
> 
Sure. There is a big difference between 'using representors' vs 'having representors'.
You wanted to say that they use switch representors.

If you want to say a 'function representor' than it doesn’t really belong to this .rst documentation.
It is the discussion that’s happening right now. :)

> Configuration of the block device itself is of course through separate
> interfaces which are common to non-switchdev virtual block devices.
> 
> >> + - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they
> have
> >> +   their own port on the switch (as opposed to using their parent PF's
> port).
> > Not sure why the text has _if_ for SF and not for the VF.
> > Do you see a SF device in the kernel that doesn't have their own port, due
> to which there is _if_ added?
> 
> This document is meant to cover situations that vendors are likely to  find
> themselves in, not just those that have already been encountered.
> It is plausible, at least to me, that a vendor might decide to implement
> subfunctions at a filtering rather than a switching level (i.e. it's  just a bundle
> of queue pairs and you use something like ethtool NFC to  direct traffic to it).
> And if that happens, I don't want them to read  my doc and (wrongly) think
> that they still need reprs for such SFs.
That would be a very different abstraction. It is not a function.
A user might be better of with netdev with bunch of queues.
There were similar macvlan acceleration in past, if not mistaken in the Intel driver.

> (The corresponding situation is far less likely to arise for VFs,  because there's
> a clear understanding across the industry that VFs  should look to their
> consumer like self-contained network devices,  which implies transparent
> switching.)
We don’t see SFs being any different in the kernel as some day when platform is ready SF will have to have self-contained too to map to the guest VM.
A subfunction may not be a network device at all, even though today in Linux kernel we see it only for network function.

> 
> >> +How are representors created?
> >> +-----------------------------
> >> +
> >> +The driver instance attached to the master PF should enumerate the
> >> +virtual ports on the switch, and for each representee, create a
> >> +pure-software netdevice which has some form of in-kernel reference
> >> +to the PF's own netdevice or driver private data (``netdev_priv()``).
> > Today a user can create new virtual ports. Hence, these port represnetors
> and function representors are created dynamically without enumeration.
> > Please add text describing both ways.
> 
> Again, this is addressed in the next sentence after you quoted:
> >> +If switch ports can dynamically appear/disappear, the PF driver
> >> +should create and destroy representors appropriately.
> 
For the reader 'appear/disappear' misses the details if it can be created or not.
Today all the three ways of port addition to switch is done in the kernel.
1. enumerate and add
2. appear through event -> add/delete
3. created by user -> add/delete

So I suggest we write it as,

The driver instance attached to the switchdev should create representor netdevices for the each function whose traffic flow through the switchdev.


> > For mlx5 case a representor netdevice has real queue from which tx/rx
> DMA happens from the device to/from network.
> > It is not entirely pure software per say.
> > Hence, "pure-software" is misleading. Please drop that word.
> 
> The rep dev doesn't own the BAR.  Everything it has it gets from  the PF.
> That's why it shouldn't SET_NETDEV_DEV, which is what I  mean by "pure-
> software".
> 
It accesses the switchdev PF, including the BAR and interrupts and hw queues.
It is no different than uplink or physical port representor.
Setting netdevice parent is not decided based on what type of wire they talk to.

> >> +The operations of the representor netdevice will generally involve
> >> +acting through the master PF.  For example, ``ndo_start_xmit()``
> >> +might send the packet through a hardware TX queue attached to the
> >> +master PF, with either packet metadata or queue configuration
> >> +marking it for delivery
> >> to the representee.
> > Sharing/not sharing TX and RX queue among representor netdevices is not
> yet well established.
> 
> But in either case the hw TXQ will have been created out of the  PF's BAR(s)
> (there's no other PCIe function/aperture to poke at  the hardware from),
> that's what I mean by "attached to".  If you  have a clearer way to word that
> I'm all ears.
> 
I think we should just way,

Representor netdevices access the switchdev device resources and queues they are attached to.

> >> +
> >> +How are representors identified?
> >> +--------------------------------
> >> +
> >> +The representor netdevice should *not* directly refer to a PCIe device
> (e.g.
> >> +through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of
> >> +the representee or of the master PF.
> > This isn't true.
> > Representor netdevices are connected to the switchdev device PCI
> function.
> 
> In some but not all existing drivers.
> Note that I said "should not", not "does not".
> 
'Should not' is a guidance to new drivers.
And for end users it will be nightmare to maintain per vendor udev plumbing in systemd.


> > Without linking to PCI device, udev scriptology needs to grep among
> thousands of netdevices and its very inefficient.
> 
> It's a control plane operation, is efficiency really a prime  concern?  
Yes. In production systems that we use, users create these functions at msec granularity and udev scripts find it hard to walk through.

>If so,
> surely the right thing is to give  /sys/class/net/$REP_DEV/ a suitably-named
> symlink to  /sys/class/net/$SWITCH_DEV, performing the same role as the
> phys_switch_id matching without the global search, rather than  a
> semantically-invalid PCIe device link.
> 
This plumbing is already done through /sys/class/net/REP_NETDEV/device pointing to the parent device uniformly regardless of whom they represent.
Be it a physical port or the virtual port.

> >> +There are as yet no established conventions for naming representors
> >> +which do not correspond to PCIe functions (e.g. accelerators and
> plugins).
> > Netdevice represents the networking port of the function.
> 
> No, it represents any networking port on the switch.  Whether  that has a
> PCIe function or not.
Sure, in context of the subject of this email I said function. :)
It can be physical port of other types of ports of switchev.

> (The doc title, "Network Function Representors", is deliberately  phrased to
> be interpretable as about "network functions" in the  sense of NFV, rather
> than "networking PCIe functions".  
Do we have non networking PCI functions through representors in the kernel today?
If not, better not to generalize them.
When they appear, a new devlink port flavour will show up and it will be perfect to extend this doc to capture that.

> An  entire network function (in the NFV
> sense) could be implemented  in hardware, in which case it would have a
> switch port and thus  representor, but no PCIe function — it terminates
> traffic inside  the device rather than sending it over the PCIe bus to a driver
> in the host or a VM.)

This is very interesting new type of network function.
Do you mean a MMIO kind of function instead of PCI function?
If so, yeah, it will need a switch representation and new devlink port flavour type.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-08-19 18:59 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-08-15 14:22 [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors ecree
2022-08-18  9:56 ` Marcin Szycik
2022-08-19 15:20   ` Edward Cree
2022-08-18 11:20 ` Simon Horman
2022-08-18 15:07 ` Roi Dayan
2022-08-19 15:18   ` Edward Cree
2022-08-18 16:44 ` Parav Pandit
2022-08-19 16:36   ` Edward Cree
2022-08-19 18:59     ` Parav Pandit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).