Re: [Intel-wired-lan] [net-next V2 15/15] Documentation: net/mlx5: Add description for Socket-Direct netdev combining

From: "Samudrala, Sridhar" <sridhar.samudrala@intel.com>
To: Jakub Kicinski <kuba@kernel.org>, Saeed Mahameed <saeed@kernel.org>
Cc: Amritha Nambiar <amritha.nambiar@intel.com>,
	netdev@vger.kernel.org, Gal Pressman <gal@nvidia.com>,
	Tariq Toukan <tariqt@nvidia.com>,
	Eric Dumazet <edumazet@google.com>,
	intel-wired-lan@lists.osuosl.org,
	Andy Gospodarek <andy@greyhouse.net>,
	Michael Chan <michael.chan@broadcom.com>,
	Paolo Abeni <pabeni@redhat.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	"David S. Miller" <davem@davemloft.net>,
	Leon Romanovsky <leonro@nvidia.com>
Subject: Re: [Intel-wired-lan] [net-next V2 15/15] Documentation: net/mlx5: Add description for Socket-Direct netdev combining
Date: Mon, 12 Feb 2024 19:11:55 -0600
Message-ID: <db5a1878-efbd-4fc7-bffe-acc8095bb44f@intel.com>
In-Reply-To: <20240209222738.4cf9f25b@kernel.org>



On 2/10/2024 12:27 AM, Jakub Kicinski wrote:
> On Wed,  7 Feb 2024 19:53:52 -0800 Saeed Mahameed wrote:
>> From: Tariq Toukan <tariqt@nvidia.com>
>>
>> Add documentation for the feature and some details on some design decisions.
> 
> Thanks.
> 
>> diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst
> 
> SD which is not same SD which Jiri and William are talking about?
> Please spell out the name.
> 
> Please make this a general networking/ documentation file.
> 
> If other vendors could take a look and make sure this behavior makes
> sense for their plans / future devices that'd be great.
> 
>> new file mode 100644
>> index 000000000000..c8b4d8025a81
>> --- /dev/null
>> +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst
>> @@ -0,0 +1,134 @@
>> +.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
>> +.. include:: <isonum.txt>
>> +
>> +==============================
>> +Socket-Direct Netdev Combining
>> +==============================
>> +
>> +:Copyright: |copy| 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
>> +
>> +Contents
>> +========
>> +
>> +- `Background`_
>> +- `Overview`_
>> +- `Channels distribution`_
>> +- `Steering`_
>> +- `Mutually exclusive features`_
>> +
>> +Background
>> +==========
>> +
>> +NVIDIA Mellanox Socket Direct technology enables several CPUs within a multi-socket server to
> 
> Please make it sound a little less like a marketing leaflet.
> Isn't multi-PF netdev a better name for the construct?
> We don't call aRFS "queue direct", also socket has BSD socket meaning.

Yes, "Socket Direct" is definitely misleading.
At Intel, we call this multi-homing: multiple PFs are associated with a
single uplink port. "Multi-PF netdev" sounds technically correct.


> 
>> +connect directly to the network, each through its own dedicated PCIe interface. Through either a
>> +connection harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a
>> +single card. This results in eliminating the network traffic traversing over the internal bus
>> +between the sockets, significantly reducing overhead and latency, in addition to reducing CPU
>> +utilization and increasing network throughput.
>> +
>> +Overview
>> +========
>> +
>> +This feature adds support for combining multiple devices (PFs) of the same port in a Socket Direct
>> +environment under one netdev instance. Passing traffic through different devices belonging to
>> +different NUMA sockets saves cross-numa traffic and allows apps running on the same netdev from
>> +different numas to still feel a sense of proximity to the device and achieve improved performance.
>> +
>> +We achieve this by grouping PFs together, and creating the netdev only once all group members are
>> +probed. Symmetrically, we destroy the netdev once any of the PFs is removed.
> 
> s/once/whenever/
> 
>> +The channels are distributed between all devices, a proper configuration would utilize the correct
>> +close numa when working on a certain app/cpu.
>> +
>> +We pick one device to be a primary (leader), and it fills a special role. The other devices
> 
> "device" is probably best avoided, users may think device == card,
> IIUC there's only one NIC ASIC here?
> 
>> +(secondaries) are disconnected from the network in the chip level (set to silent mode). All RX/TX
> 
> s/in/at/
> 
>> +traffic is steered through the primary to/from the secondaries.
> 
> I don't understand the "silent" part. I mean - you do pass traffic thru
> them, what's the silence referring to?
> 
>> +Currently, we limit the support to PFs only, and up to two devices (sockets).
>> +
>> +Channels distribution
>> +=====================
>> +
>> +Distribute the channels between the different SD-devices to achieve local numa node performance on
> 
> Something's missing in this sentence, subject "we"?
> 
>> +multiple numas.
> 
> NUMA nodes
> 
>> +Each channel works against one specific mdev, creating all datapath queues against it. We distribute
> 
> The mix of channel and queue does not compute in this sentence for me.
> 
> Also mdev -> PF?
> 
>> +channels to mdevs in a round-robin policy.
>> +
>> +Example for 2 PFs and 6 channels:
>> ++-------+-------+
>> +| ch ix | PF ix |
> 
> ix? id or idx or index.
> 
>> ++-------+-------+
>> +|   0   |   0   |
>> +|   1   |   1   |
>> +|   2   |   0   |
>> +|   3   |   1   |
>> +|   4   |   0   |
>> +|   5   |   1   |
>> ++-------+-------+
>> +
>> +This round-robin distribution policy is preferred over another suggested intuitive distribution, in
>> +which we first distribute one half of the channels to PF0 and then the second half to PF1.
> 
> Preferred.. by whom? Just say that's the most broadly useful and therefore default config.
> 
>> +The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The
>> +mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
>> +As the channel stats are persistent to channels closure, changing the mapping every single time
> 
> to -> across
> channels -> channel or channel's or channel closures
> 
>> +would turn the accumulative stats less representing of the channel's history.
>> +
>> +This is achieved by using the correct core device instance (mdev) in each channel, instead of them
>> +all using the same instance under "priv->mdev".
>> +
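
To illustrate the stability argument quoted above, here is a minimal
sketch comparing the two distribution policies when the channel count
drops from 6 to 4. It is not taken from the mlx5 driver; the function
names (round_robin_pf, block_pf, remapped) are hypothetical, and the
"half/half" split is modeled as a simple ceiling division.

```python
NUM_PFS = 2  # the patch limits support to two PFs


def round_robin_pf(ch: int, num_pfs: int) -> int:
    """PF index for a channel under round-robin distribution."""
    return ch % num_pfs


def block_pf(ch: int, num_channels: int, num_pfs: int) -> int:
    """PF index under the alternative 'first half, then second half' split."""
    per_pf = -(-num_channels // num_pfs)  # ceiling division
    return ch // per_pf


def remapped(mapping_a, mapping_b):
    """Channels present in both mappings whose PF assignment differs."""
    common = min(len(mapping_a), len(mapping_b))
    return [ch for ch in range(common) if mapping_a[ch] != mapping_b[ch]]


# Round-robin: the channel -> PF mapping does not depend on the channel count.
rr6 = [round_robin_pf(ch, NUM_PFS) for ch in range(6)]
rr4 = [round_robin_pf(ch, NUM_PFS) for ch in range(4)]

# Block split: the mapping shifts when the channel count changes.
blk6 = [block_pf(ch, 6, NUM_PFS) for ch in range(6)]
blk4 = [block_pf(ch, 4, NUM_PFS) for ch in range(4)]

print("round-robin remapped channels:", remapped(rr6, rr4))
print("block split remapped channels:", remapped(blk6, blk4))
```

With round-robin, no surviving channel changes its PF, so per-channel
stats keep describing the same PF. With the block split, channel 2
moves from PF0 to PF1 in this example.
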
>> +Steering
>> +========
>> +Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
>> +
>> +In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
>> +traffic to other PFs, via advanced HW cross-vhca steering capabilities.
> 
> s/advanced HW//
> 
> You should cover how RSS looks - single table which functions exactly as
> it would for a 1-PF device? Two-tier setup?
> 
>> +In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
>> +go out to the network through it.
>> +
>> +In addition, we set default XPS configuration that, based on the cpu, selects an SQ belonging to the
>> +PF on the same node as the cpu.
>> +
>> +XPS default config example:
>> +
>> +NUMA node(s):          2
>> +NUMA node0 CPU(s):     0-11
>> +NUMA node1 CPU(s):     12-23
>> +
>> +PF0 on node0, PF1 on node1.
> 
> You didn't cover how users are supposed to discover the topology.
> netdev is linked to a single device in sysfs, which is how we get
> netdev <> NUMA node mapping today. What's the expected way to get
> the NUMA nodes here?

In this configuration, there is a 1:N relation between the netdev and
NUMA nodes, and a 1:1 relation between each queue and a NUMA node.

It would help if the get-queue API exposed the NUMA node as a parameter.
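
As a rough sketch of what such a per-queue lookup would give users, the
snippet below models the example topology from the patch (PF0 on node0
with CPUs 0-11, PF1 on node1 with CPUs 12-23) and the round-robin
channel distribution. This is not the netdev genl API: every name here
is hypothetical, and it assumes the queue index equals the channel
index.

```python
NUM_PFS = 2

# Example topology from the quoted patch: PF0 on node0, PF1 on node1.
PF_NUMA_NODE = {0: 0, 1: 1}
NODE_CPUS = {0: range(0, 12), 1: range(12, 24)}


def queue_numa_node(queue_idx: int) -> int:
    """NUMA node of a queue (1:1), assuming round-robin queue -> PF mapping."""
    return PF_NUMA_NODE[queue_idx % NUM_PFS]


def cpu_numa_node(cpu: int) -> int:
    """NUMA node that a CPU belongs to in the example topology."""
    for node, cpus in NODE_CPUS.items():
        if cpu in cpus:
            return node
    raise ValueError(f"CPU {cpu} is not in the example topology")


def local_queues(cpu: int, num_queues: int) -> list:
    """Queues whose PF sits on the same NUMA node as the given CPU."""
    node = cpu_numa_node(cpu)
    return [q for q in range(num_queues) if queue_numa_node(q) == node]


# One netdev spans several NUMA nodes (1:N).
netdev_nodes = {queue_numa_node(q) for q in range(6)}
print("netdev NUMA nodes:", sorted(netdev_nodes))
print("queues local to CPU 3: ", local_queues(3, 6))
print("queues local to CPU 15:", local_queues(15, 6))
```

Without a per-queue NUMA attribute, userspace would have to reconstruct
this mapping from knowledge of the driver's distribution policy, which
is exactly the kind of coupling an API parameter would avoid.
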

> 
> And obviously this can't get merged until mlx5 exposes queue <> NAPI <>
> IRQ mapping via the netdev genl.
> 
> <snip>
> 
>> +Mutually exclusive features
>> +===========================
>> +
>> +The nature of socket direct, where different channels work with different PFs, conflicts with
>> +stateful features where the state is maintained in one of the PFs.
>> +For example, in the TLS device-offload feature, special context objects are created per connection
>> +and maintained in the PF.  Transitioning between different RQs/SQs would break the feature. Hence,
>> +we disable this combination for now.
> 
>
