public inbox for netdev@vger.kernel.org
From: Jiri Pirko <jiri@resnulli.us>
To: Saeed Mahameed <saeed@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Eric Dumazet <edumazet@google.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	netdev@vger.kernel.org, Tariq Toukan <tariqt@nvidia.com>,
	Gal Pressman <gal@nvidia.com>,
	Leon Romanovsky <leonro@nvidia.com>
Subject: Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
Date: Mon, 19 Feb 2024 19:04:23 +0100
Message-ID: <ZdOYJ5UBYXfJ52-e@nanopsycho>
In-Reply-To: <20240215030814.451812-16-saeed@kernel.org>

Thu, Feb 15, 2024 at 04:08:14AM CET, saeed@kernel.org wrote:
>From: Tariq Toukan <tariqt@nvidia.com>
>
>Add documentation for the multi-pf netdev feature.
>Describe the mlx5 implementation and design decisions.
>
>Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>---
> Documentation/networking/index.rst           |   1 +
> Documentation/networking/multi-pf-netdev.rst | 157 +++++++++++++++++++
> 2 files changed, 158 insertions(+)
> create mode 100644 Documentation/networking/multi-pf-netdev.rst
>
>diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
>index 69f3d6dcd9fd..473d72c36d61 100644
>--- a/Documentation/networking/index.rst
>+++ b/Documentation/networking/index.rst
>@@ -74,6 +74,7 @@ Contents:
>    mpls-sysctl
>    mptcp-sysctl
>    multiqueue
>+   multi-pf-netdev
>    napi
>    net_cachelines/index
>    netconsole
>diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst
>new file mode 100644
>index 000000000000..6ef2ac448d1e
>--- /dev/null
>+++ b/Documentation/networking/multi-pf-netdev.rst
>@@ -0,0 +1,157 @@
>+.. SPDX-License-Identifier: GPL-2.0
>+.. include:: <isonum.txt>
>+
>+===============
>+Multi-PF Netdev
>+===============
>+
>+Contents
>+========
>+
>+- `Background`_
>+- `Overview`_
>+- `mlx5 implementation`_
>+- `Channels distribution`_
>+- `Topology`_
>+- `Steering`_
>+- `Mutually exclusive features`_
>+
>+Background
>+==========
>+
>+The Multi-PF NIC technology enables several CPUs within a multi-socket server to connect
>+directly to the network, each through its own dedicated PCIe interface. This is done either with
>+a connection harness that splits the PCIe lanes between two cards, or by bifurcating a PCIe slot
>+for a single card. This eliminates network traffic traversing the internal bus between the
>+sockets, which reduces overhead, latency and CPU utilization, and increases network
>+throughput.
>+
>+Overview
>+========
>+
>+This feature adds support for combining multiple devices (PFs) of the same port in a Multi-PF
>+environment under one netdev instance. Passing traffic through different devices belonging to
>+different NUMA sockets saves cross-NUMA traffic and allows apps running on the same netdev from
>+different NUMA nodes to stay close to the device and achieve improved performance.
>+
>+mlx5 implementation
>+===================
>+
>+Multi-PF or Socket-direct in mlx5 is achieved by grouping PFs together which belong to the same
>+NIC and have the socket-direct property enabled. Once all PFs are probed, we create a single netdev

How do you enable this property?


>+to represent all of them. Symmetrically, we destroy the netdev whenever any of the PFs is removed.
>+
>+The netdev network channels are distributed between all devices. A proper configuration would
>+utilize the closest NUMA node when working on a certain app/CPU.
>+
>+We pick one PF to be a primary (leader), and it fills a special role. The other devices
>+(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
>+mode, no south <-> north traffic flows directly through a secondary PF. It needs the assistance of
>+the leader PF (east <-> west traffic) to function. All RX/TX traffic is steered through the primary
>+to/from the secondaries.
>+
>+Currently, we limit the support to PFs only, and up to two PFs (sockets).

For the record, could you please describe why exactly you didn't use
drivers/base/component.c infrastructure for this? I know you told me,
but I don't recall. Better to have this written down, I believe.


>+
>+Channels distribution
>+=====================
>+
>+We distribute the channels between the different PFs to achieve local NUMA node performance
>+on multiple NUMA nodes.
>+
>+Each combined channel works against one specific PF, creating all its datapath queues against it. We distribute
>+channels to PFs in a round-robin policy.
>+
>+::
>+
>+        Example for 2 PFs and 6 channels:
>+        +--------+--------+
>+        | ch idx | PF idx |
>+        +--------+--------+
>+        |    0   |    0   |
>+        |    1   |    1   |
>+        |    2   |    0   |
>+        |    3   |    1   |
>+        |    4   |    0   |
>+        |    5   |    1   |
>+        +--------+--------+
>+
>+
>+We prefer this round-robin distribution policy over another suggested intuitive distribution, in
>+which we first distribute one half of the channels to PF0 and then the second half to PF1.
>+
>+We prefer round-robin because it is less influenced by changes in the number of channels. The
>+mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
>+As the channel stats are persistent across a channel's closure, changing the mapping every time
>+would make the accumulated stats less representative of the channel's history.
>+
>+This is achieved by using the correct core device instance (mdev) in each channel, instead of them
>+all using the same instance under "priv->mdev".
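
For reference, the round-robin policy described above can be sketched as
follows. This is only an illustration in Python, not the mlx5 code: the
function names channel_to_pf() and half_split_pf() are hypothetical, and
the half-split variant is my reading of the "other suggested" policy.

```python
def channel_to_pf(ch_idx, num_pfs=2):
    """Round-robin: the PF depends only on the channel index."""
    return ch_idx % num_pfs


def half_split_pf(ch_idx, num_channels, num_pfs=2):
    """Contrasting policy (assumed): contiguous blocks of channels per PF."""
    per_pf = -(-num_channels // num_pfs)  # ceiling division
    return ch_idx // per_pf


if __name__ == "__main__":
    # Reproduces the "2 PFs and 6 channels" table from the patch.
    for ch in range(6):
        print(ch, channel_to_pf(ch))

    # Channel 2 keeps its PF under round-robin when the channel count
    # changes from 6 to 4, but moves to the other PF under the half-split.
    print(half_split_pf(2, 6), half_split_pf(2, 4))
```

With round-robin, channel 2 maps to PF 0 for any channel count, so its
accumulated stats always describe the same PF. With the half-split it
lands on PF 0 for 6 channels and PF 1 for 4 channels.
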
>+
>+Topology
>+========
>+Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF.
>+Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately.
>+For now, debugfs is being used to reflect the topology:
>+
>+.. code-block:: bash
>+
>+        $ grep -H . /sys/kernel/debug/mlx5/0000\:08\:00.0/sd/*
>+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/group_id:0x00000101
>+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/primary:0000:08:00.0 vhca 0x0
>+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/secondary_0:0000:09:00.0 vhca 0x2

Ugh :/

SD is something that is likely going to stay with us for some time.
Can't we have some proper UAPI instead of this? IDK.


>+
>+Steering
>+========
>+Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
>+
>+In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
>+traffic to other PFs, via cross-vhca steering capabilities. Nothing special about the RSS table
>+content, except that it needs a capable device to point to the receive queues of a different PF.
>+
>+In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
>+go out to the network through it.
>+
>+In addition, we set a default XPS configuration that, based on the CPU, selects an SQ belonging to
>+the PF on the same node as the CPU.
>+
>+XPS default config example::
>+
>+        NUMA node(s):          2
>+        NUMA node0 CPU(s):     0-11
>+        NUMA node1 CPU(s):     12-23

How can the user know which queue is bound to which CPU?


>+
>+PF0 on node0, PF1 on node1.
>+
>+- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
>+- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
>+- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
>+- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
>+- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
>+- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
>+- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
>+- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
>+- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
>+- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
>+- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
>+- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
>+- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
>+- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
>+- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
>+- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
>+- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
>+- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
>+- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
>+- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
>+- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
>+- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
>+- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
>+- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
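
One way to read the masks listed above: each xps_cpus value is a hex CPU
bitmap, so bit N set means CPU N may use that TX queue. The sketch below
decodes such a mask and regenerates the example list. It is an
illustration only. The helper names are hypothetical, and the rule in
default_xps_mask() (round-robin PF, then the next CPU of that PF's node)
is inferred from the example values, not taken from the driver.

```python
def xps_mask_to_cpus(mask):
    """Decode an xps_cpus hex bitmap (optionally comma-grouped) to CPU ids."""
    value = int(mask.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def default_xps_mask(txq, node_cpus, width=6):
    """Assumed default: queue i uses PF i % nodes and that node's next CPU."""
    pf = txq % len(node_cpus)
    cpu = node_cpus[pf][txq // len(node_cpus)]
    return format(1 << cpu, "0{}x".format(width))


if __name__ == "__main__":
    # Topology from the example: node0 has CPUs 0-11, node1 has CPUs 12-23.
    node_cpus = [list(range(0, 12)), list(range(12, 24))]
    for txq in range(24):
        mask = default_xps_mask(txq, node_cpus)
        print("tx-{}: {} -> CPUs {}".format(txq, mask, xps_mask_to_cpus(mask)))
```

On a live system the same decoding can be applied to the content of
/sys/class/net/<dev>/queues/tx-<n>/xps_cpus to see which CPUs a given
queue serves.
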
>+
>+Mutually exclusive features
>+===========================
>+
>+The nature of Multi-PF, where different channels work with different PFs, conflicts with
>+stateful features where the state is maintained in one of the PFs.
>+For example, in the TLS device-offload feature, special context objects are created per connection
>+and maintained in the PF.  Transitioning between different RQs/SQs would break the feature. Hence,
>+we disable this combination for now.
>-- 
>2.43.0
>
>
