From: Jiri Pirko <jiri@resnulli.us>
To: Saeed Mahameed <saeed@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Eric Dumazet <edumazet@google.com>,
Saeed Mahameed <saeedm@nvidia.com>,
netdev@vger.kernel.org, Tariq Toukan <tariqt@nvidia.com>,
Gal Pressman <gal@nvidia.com>,
Leon Romanovsky <leonro@nvidia.com>
Subject: Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
Date: Mon, 19 Feb 2024 19:04:23 +0100 [thread overview]
Message-ID: <ZdOYJ5UBYXfJ52-e@nanopsycho> (raw)
In-Reply-To: <20240215030814.451812-16-saeed@kernel.org>
Thu, Feb 15, 2024 at 04:08:14AM CET, saeed@kernel.org wrote:
>From: Tariq Toukan <tariqt@nvidia.com>
>
>Add documentation for the multi-pf netdev feature.
>Describe the mlx5 implementation and design decisions.
>
>Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>---
> Documentation/networking/index.rst | 1 +
> Documentation/networking/multi-pf-netdev.rst | 157 +++++++++++++++++++
> 2 files changed, 158 insertions(+)
> create mode 100644 Documentation/networking/multi-pf-netdev.rst
>
>diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
>index 69f3d6dcd9fd..473d72c36d61 100644
>--- a/Documentation/networking/index.rst
>+++ b/Documentation/networking/index.rst
>@@ -74,6 +74,7 @@ Contents:
> mpls-sysctl
> mptcp-sysctl
> multiqueue
>+ multi-pf-netdev
> napi
> net_cachelines/index
> netconsole
>diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst
>new file mode 100644
>index 000000000000..6ef2ac448d1e
>--- /dev/null
>+++ b/Documentation/networking/multi-pf-netdev.rst
>@@ -0,0 +1,157 @@
>+.. SPDX-License-Identifier: GPL-2.0
>+.. include:: <isonum.txt>
>+
>+===============
>+Multi-PF Netdev
>+===============
>+
>+Contents
>+========
>+
>+- `Background`_
>+- `Overview`_
>+- `mlx5 implementation`_
>+- `Channels distribution`_
>+- `Topology`_
>+- `Steering`_
>+- `Mutually exclusive features`_
>+
>+Background
>+==========
>+
>+The advanced Multi-PF NIC technology enables several CPUs within a multi-socket server to
>+connect directly to the network, each through its own dedicated PCIe interface. Through either a
>+connection harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a
>+single card. This results in eliminating the network traffic traversing over the internal bus
>+between the sockets, significantly reducing overhead and latency, in addition to reducing CPU
>+utilization and increasing network throughput.
>+
>+Overview
>+========
>+
>+This feature adds support for combining multiple devices (PFs) of the same port in a Multi-PF
>+environment under one netdev instance. Passing traffic through different devices belonging to
>+different NUMA sockets saves cross-numa traffic and allows apps running on the same netdev from
>+different numas to still feel a sense of proximity to the device and achieve improved performance.
>+
>+mlx5 implementation
>+===================
>+
>+Multi-PF or Socket-direct in mlx5 is achieved by grouping PFs together which belong to the same
>+NIC and has the socket-direct property enabled, once all PFS are probed, we create a single netdev
How do you enable this property?
>+to represent all of them, symmetrically, we destroy the netdev whenever any of the PFs is removed.
>+
>+The netdev network channels are distributed between all devices, a proper configuration would utilize
>+the correct close numa node when working on a certain app/cpu.
>+
>+We pick one PF to be a primary (leader), and it fills a special role. The other devices
>+(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
>+mode, no south <-> north traffic flowing directly through a secondary PF. It needs the assistance of
>+the leader PF (east <-> west traffic) to function. All RX/TX traffic is steered through the primary
>+to/from the secondaries.
>+
>+Currently, we limit the support to PFs only, and up to two PFs (sockets).
For the record, could you please describe why exactly you didn't use
drivers/base/component.c infrastructure for this? I know you told me,
but I don't recall. Better to have this written down, I believe.
>+
>+Channels distribution
>+=====================
>+
>+We distribute the channels between the different PFs to achieve local NUMA node performance
>+on multiple NUMA nodes.
>+
>+Each combined channel works against one specific PF, creating all its datapath queues against it. We distribute
>+channels to PFs in a round-robin policy.
>+
>+::
>+
>+ Example for 2 PFs and 6 channels:
>+ +--------+--------+
>+ | ch idx | PF idx |
>+ +--------+--------+
>+ | 0 | 0 |
>+ | 1 | 1 |
>+ | 2 | 0 |
>+ | 3 | 1 |
>+ | 4 | 0 |
>+ | 5 | 1 |
>+ +--------+--------+
>+
>+
>+We prefer this round-robin distribution policy over another suggested intuitive distribution, in
>+which we first distribute one half of the channels to PF0 and then the second half to PF1.
>+
>+The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The
>+mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
>+As the channel stats are persistent across channel's closure, changing the mapping every single time
>+would turn the accumulative stats less representing of the channel's history.
>+
>+This is achieved by using the correct core device instance (mdev) in each channel, instead of them
>+all using the same instance under "priv->mdev".
>+
>+Topology
>+========
>+Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF.
>+Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately.
>+For now, debugfs is being used to reflect the topology:
>+
>+.. code-block:: bash
>+
>+ $ grep -H . /sys/kernel/debug/mlx5/0000\:08\:00.0/sd/*
>+ /sys/kernel/debug/mlx5/0000:08:00.0/sd/group_id:0x00000101
>+ /sys/kernel/debug/mlx5/0000:08:00.0/sd/primary:0000:08:00.0 vhca 0x0
>+ /sys/kernel/debug/mlx5/0000:08:00.0/sd/secondary_0:0000:09:00.0 vhca 0x2
Ugh :/
SD is something that is likely going to stay with us for some time.
Can't we have some proper UAPI instead of this? IDK.
>+
>+Steering
>+========
>+Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
>+
>+In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
>+traffic to other PFs, via cross-vhca steering capabilities. Nothing special about the RSS table
>+content, except that it needs a capable device to point to the receive queues of a different PF.
>+
>+In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
>+go out to the network through it.
>+
>+In addition, we set default XPS configuration that, based on the cpu, selects an SQ belonging to the
>+PF on the same node as the cpu.
>+
>+XPS default config example:
>+
>+NUMA node(s): 2
>+NUMA node0 CPU(s): 0-11
>+NUMA node1 CPU(s): 12-23
How can user know which queue is bound to which cpu?
>+
>+PF0 on node0, PF1 on node1.
>+
>+- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
>+- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
>+- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
>+- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
>+- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
>+- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
>+- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
>+- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
>+- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
>+- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
>+- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
>+- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
>+- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
>+- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
>+- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
>+- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
>+- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
>+- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
>+- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
>+- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
>+- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
>+- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
>+- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
>+- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
>+
>+Mutually exclusive features
>+===========================
>+
>+The nature of Multi-PF, where different channels work with different PFs, conflicts with
>+stateful features where the state is maintained in one of the PFs.
>+For example, in the TLS device-offload feature, special context objects are created per connection
>+and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence,
>+we disable this combination for now.
>--
>2.43.0
>
>
prev parent reply other threads:[~2024-02-19 18:04 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-15 3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 01/15] net/mlx5: Add MPIR bit in mcam_access_reg Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 02/15] net/mlx5: SD, Introduce SD lib Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 03/15] net/mlx5: SD, Implement basic query and instantiation Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 04/15] net/mlx5: SD, Implement devcom communication and primary election Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 05/15] net/mlx5: SD, Implement steering for primary and secondaries Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 06/15] net/mlx5: SD, Add informative prints in kernel log Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 07/15] net/mlx5: SD, Add debugfs Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 08/15] net/mlx5e: Create single netdev per SD group Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 09/15] net/mlx5e: Create EN core HW resources for all secondary devices Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 10/15] net/mlx5e: Let channels be SD-aware Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 11/15] net/mlx5e: Support cross-vhca RSS Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 12/15] net/mlx5e: Support per-mdev queue counter Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 13/15] net/mlx5e: Block TLS device offload on combined SD netdev Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 14/15] net/mlx5: Enable SD feature Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev Saeed Mahameed
2024-02-16 5:23 ` Jakub Kicinski
2024-02-19 15:26 ` Tariq Toukan
2024-02-21 1:33 ` Jakub Kicinski
2024-02-21 2:10 ` Saeed Mahameed
2024-02-22 7:51 ` Greg Kroah-Hartman
2024-02-22 23:00 ` Jakub Kicinski
2024-02-23 1:23 ` Samudrala, Sridhar
2024-02-23 2:05 ` Jay Vosburgh
2024-02-23 5:00 ` Samudrala, Sridhar
2024-02-23 9:40 ` Jiri Pirko
2024-02-23 23:56 ` Samudrala, Sridhar
2024-02-24 12:48 ` Jiri Pirko
2024-02-23 9:36 ` Jiri Pirko
2024-02-28 2:06 ` Jakub Kicinski
2024-02-28 8:13 ` Jiri Pirko
2024-02-28 17:06 ` Jakub Kicinski
2024-02-28 17:43 ` Jakub Kicinski
2024-03-02 7:31 ` Saeed Mahameed
2024-02-29 8:21 ` Jiri Pirko
2024-02-29 14:34 ` Jakub Kicinski
2024-02-19 18:04 ` Jiri Pirko [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZdOYJ5UBYXfJ52-e@nanopsycho \
--to=jiri@resnulli.us \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=gal@nvidia.com \
--cc=kuba@kernel.org \
--cc=leonro@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=saeed@kernel.org \
--cc=saeedm@nvidia.com \
--cc=tariqt@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.