From: Saeed Mahameed <saeed@kernel.org>
To: "David S. Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Eric Dumazet <edumazet@google.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>,
netdev@vger.kernel.org, Tariq Toukan <tariqt@nvidia.com>,
Gal Pressman <gal@nvidia.com>,
Leon Romanovsky <leonro@nvidia.com>
Subject: [net-next V2 15/15] Documentation: net/mlx5: Add description for Socket-Direct netdev combining
Date: Wed, 7 Feb 2024 19:53:52 -0800
Message-ID: <20240208035352.387423-16-saeed@kernel.org>
In-Reply-To: <20240208035352.387423-1-saeed@kernel.org>
From: Tariq Toukan <tariqt@nvidia.com>
Add documentation for the Socket-Direct netdev combining feature, including details on some of the design decisions.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
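Note for reviewers (not part of the commit): the channel-to-PF mapping and the
default XPS masks described in sd.rst can be illustrated with a small Python
sketch. This is not driver code. The helper names (channel_to_pf, half_split_pf,
default_xps_mask) are hypothetical, and the sketch assumes one TX queue per
channel and that each PF's node-local CPUs are used in ascending order, which
matches the 2-node, 24-CPU listing in the document.

```python
def channel_to_pf(ch_ix, num_pfs):
    """Round-robin policy: the PF index depends only on the channel index."""
    return ch_ix % num_pfs


def half_split_pf(ch_ix, num_channels, num_pfs):
    """Rejected alternative: first half of the channels to PF0, second to PF1."""
    return ch_ix * num_pfs // num_channels


def default_xps_mask(txq_ix, node_cpus):
    """Hex CPU mask for a TX queue, assuming one TX queue per channel.

    node_cpus[i] is the (assumed) list of CPUs local to PF i.
    """
    num_pfs = len(node_cpus)
    local_cpus = node_cpus[channel_to_pf(txq_ix, num_pfs)]
    # The k-th queue owned by a PF is paired with the k-th CPU on its node.
    cpu = local_cpus[(txq_ix // num_pfs) % len(local_cpus)]
    total_cpus = sum(len(cpus) for cpus in node_cpus)
    width = (total_cpus + 3) // 4  # hex digits needed to cover all CPUs
    return format(1 << cpu, "0{}x".format(width))


if __name__ == "__main__":
    # PF0 on node0 (CPUs 0-11), PF1 on node1 (CPUs 12-23), as in the document.
    nodes = [list(range(0, 12)), list(range(12, 24))]
    for txq in range(24):
        print("tx-{}: {}".format(txq, default_xps_mask(txq, nodes)))
```

With the half-split policy, channel 2 sits on PF0 when 6 channels are
configured but moves to PF1 when 4 are configured, while round-robin keeps it
on PF0 in both cases. That stability is what keeps per-channel stats
meaningful across channel count changes.
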
.../ethernet/mellanox/mlx5/sd.rst | 134 ++++++++++++++++++
1 file changed, 134 insertions(+)
create mode 100644 Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst
new file mode 100644
index 000000000000..c8b4d8025a81
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst
@@ -0,0 +1,134 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+==============================
+Socket-Direct Netdev Combining
+==============================
+
+:Copyright: |copy| 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+Contents
+========
+
+- `Background`_
+- `Overview`_
+- `Channels distribution`_
+- `Steering`_
+- `Mutually exclusive features`_
+
+Background
+==========
+
+NVIDIA Mellanox Socket Direct technology enables several CPUs within a multi-socket server to
+connect directly to the network, each through its own dedicated PCIe interface. This is done either
+through a connection harness that splits the PCIe lanes between two cards, or by bifurcating a PCIe
+slot for a single card. This eliminates network traffic traversing the internal bus between the
+sockets, significantly reducing overhead and latency, in addition to reducing CPU utilization and
+increasing network throughput.
+
+Overview
+========
+
+This feature adds support for combining multiple devices (PFs) of the same port in a Socket Direct
+environment under one netdev instance. Passing traffic through different devices belonging to
+different NUMA sockets saves cross-NUMA traffic and allows apps running on the same netdev from
+different NUMA nodes to still benefit from device proximity and achieve improved performance.
+
+We achieve this by grouping PFs together, and creating the netdev only once all group members are
+probed. Symmetrically, we destroy the netdev once any of the PFs is removed.
+
+The channels are distributed between all devices. A proper configuration would utilize the closest
+NUMA node when working on a certain app/CPU.
+
+We pick one device to be a primary (leader), and it fills a special role. The other devices
+(secondaries) are disconnected from the network at the chip level (set to silent mode). All RX/TX
+traffic is steered through the primary to/from the secondaries.
+
+Currently, we limit the support to PFs only, and up to two devices (sockets).
+
+Channels distribution
+=====================
+
+Distribute the channels between the different SD devices to achieve local NUMA node performance on
+multiple NUMA nodes.
+
+Each channel works against one specific mdev, creating all datapath queues against it. We distribute
+channels to mdevs in a round-robin policy.
+
+Example for 2 PFs and 6 channels:
++-------+-------+
+| ch ix | PF ix |
++-------+-------+
+|   0   |   0   |
+|   1   |   1   |
+|   2   |   0   |
+|   3   |   1   |
+|   4   |   0   |
+|   5   |   1   |
++-------+-------+
+
+This round-robin distribution policy is preferred over another suggested intuitive distribution, in
+which we first distribute one half of the channels to PF0 and then the second half to PF1.
+
+We prefer round-robin because it is less influenced by changes in the number of channels. The
+mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
+As the channel stats persist across channel closure, changing the mapping every single time would
+make the accumulated stats less representative of the channel's history.
+
+This is achieved by using the correct core device instance (mdev) in each channel, instead of them
+all using the same instance under "priv->mdev".
+
+Steering
+========
+Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
+
+In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
+traffic to other PFs, via advanced HW cross-vhca steering capabilities.
+
+In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
+go out to the network through it.
+
+In addition, we set a default XPS configuration that, based on the CPU, selects an SQ belonging to
+the PF on the same node as the CPU.
+
+XPS default config example:
+
+NUMA node(s): 2
+NUMA node0 CPU(s): 0-11
+NUMA node1 CPU(s): 12-23
+
+PF0 on node0, PF1 on node1.
+
+/sys/class/net/eth2/queues/tx-0/xps_cpus:000001
+/sys/class/net/eth2/queues/tx-1/xps_cpus:001000
+/sys/class/net/eth2/queues/tx-2/xps_cpus:000002
+/sys/class/net/eth2/queues/tx-3/xps_cpus:002000
+/sys/class/net/eth2/queues/tx-4/xps_cpus:000004
+/sys/class/net/eth2/queues/tx-5/xps_cpus:004000
+/sys/class/net/eth2/queues/tx-6/xps_cpus:000008
+/sys/class/net/eth2/queues/tx-7/xps_cpus:008000
+/sys/class/net/eth2/queues/tx-8/xps_cpus:000010
+/sys/class/net/eth2/queues/tx-9/xps_cpus:010000
+/sys/class/net/eth2/queues/tx-10/xps_cpus:000020
+/sys/class/net/eth2/queues/tx-11/xps_cpus:020000
+/sys/class/net/eth2/queues/tx-12/xps_cpus:000040
+/sys/class/net/eth2/queues/tx-13/xps_cpus:040000
+/sys/class/net/eth2/queues/tx-14/xps_cpus:000080
+/sys/class/net/eth2/queues/tx-15/xps_cpus:080000
+/sys/class/net/eth2/queues/tx-16/xps_cpus:000100
+/sys/class/net/eth2/queues/tx-17/xps_cpus:100000
+/sys/class/net/eth2/queues/tx-18/xps_cpus:000200
+/sys/class/net/eth2/queues/tx-19/xps_cpus:200000
+/sys/class/net/eth2/queues/tx-20/xps_cpus:000400
+/sys/class/net/eth2/queues/tx-21/xps_cpus:400000
+/sys/class/net/eth2/queues/tx-22/xps_cpus:000800
+/sys/class/net/eth2/queues/tx-23/xps_cpus:800000
+
+Mutually exclusive features
+===========================
+
+The nature of Socket Direct, where different channels work with different PFs, conflicts with
+stateful features where the state is maintained in one of the PFs.
+For example, in the TLS device-offload feature, special context objects are created per connection
+and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence,
+we disable this combination for now.
--
2.43.0