From: Tariq Toukan <ttoukan.linux@gmail.com>
To: Przemek Kitszel <przemyslaw.kitszel@intel.com>,
Saeed Mahameed <saeed@kernel.org>,
"David S. Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Eric Dumazet <edumazet@google.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>,
netdev@vger.kernel.org, Tariq Toukan <tariqt@nvidia.com>,
Gal Pressman <gal@nvidia.com>,
Leon Romanovsky <leonro@nvidia.com>,
sridhar.samudrala@intel.com,
Jay Vosburgh <jay.vosburgh@canonical.com>,
Jiri Pirko <jiri@nvidia.com>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: Re: [net-next V4 15/15] Documentation: networking: Add description for multi-pf netdev
Date: Tue, 5 Mar 2024 22:12:23 +0200
Message-ID: <228ecfb6-d5cb-403b-aecf-7c1181aa45ce@gmail.com>
In-Reply-To: <7f749366-193f-480e-8302-fea7566ec57c@intel.com>
On 04/03/2024 14:03, Przemek Kitszel wrote:
> On 3/2/24 08:22, Saeed Mahameed wrote:
>> From: Tariq Toukan <tariqt@nvidia.com>
>>
>> Add documentation for the multi-pf netdev feature.
>> Describe the mlx5 implementation and design decisions.
>>
>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>> ---
>> Documentation/networking/index.rst | 1 +
>> Documentation/networking/multi-pf-netdev.rst | 177 +++++++++++++++++++
>> 2 files changed, 178 insertions(+)
>> create mode 100644 Documentation/networking/multi-pf-netdev.rst
>>
>> diff --git a/Documentation/networking/index.rst
>> b/Documentation/networking/index.rst
>> index 69f3d6dcd9fd..473d72c36d61 100644
>> --- a/Documentation/networking/index.rst
>> +++ b/Documentation/networking/index.rst
>> @@ -74,6 +74,7 @@ Contents:
>> mpls-sysctl
>> mptcp-sysctl
>> multiqueue
>> + multi-pf-netdev
>> napi
>> net_cachelines/index
>> netconsole
>> diff --git a/Documentation/networking/multi-pf-netdev.rst
>> b/Documentation/networking/multi-pf-netdev.rst
>> new file mode 100644
>> index 000000000000..f6f782374b71
>> --- /dev/null
>> +++ b/Documentation/networking/multi-pf-netdev.rst
>> @@ -0,0 +1,177 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +.. include:: <isonum.txt>
>> +
>> +===============
>> +Multi-PF Netdev
>> +===============
>> +
>> +Contents
>> +========
>> +
>> +- `Background`_
>> +- `Overview`_
>> +- `mlx5 implementation`_
>> +- `Channels distribution`_
>> +- `Observability`_
>> +- `Steering`_
>> +- `Mutually exclusive features`_
>
> this document describes mlx5 details mostly, and I would expect to find
> them in a mlx5.rst file instead of vendor-agnostic doc
>
It was originally under
Documentation/networking/device_drivers/ethernet/mellanox/mlx5/.
We moved it here, with the needed changes, per request. See:
https://lore.kernel.org/all/20240209222738.4cf9f25b@kernel.org/
>> +
>> +Background
>> +==========
>> +
>> +The advanced Multi-PF NIC technology enables several CPUs within a
>> multi-socket server to
>
> please remove the `advanced` word
>
>> +connect directly to the network, each through its own dedicated PCIe
>> interface. Through either a
>> +connection harness that splits the PCIe lanes between two cards or by
>> bifurcating a PCIe slot for a
>> +single card. This results in eliminating the network traffic
>> traversing over the internal bus
>> +between the sockets, significantly reducing overhead and latency, in
>> addition to reducing CPU
>> +utilization and increasing network throughput.
>> +
>> +Overview
>> +========
>> +
>> +The feature adds support for combining multiple PFs of the same port
>> in a Multi-PF environment under
>> +one netdev instance. It is implemented in the netdev layer.
>> Lower-layer instances like pci func,
>> +sysfs entry, devlink) are kept separate.
>> +Passing traffic through different devices belonging to different NUMA
>> sockets saves cross-numa
>
> please consider spelling out NUMA as always capitalized
>
>> +traffic and allows apps running on the same netdev from different
>> numas to still feel a sense of
>> +proximity to the device and achieve improved performance.
>> +
>> +mlx5 implementation
>> +===================
>> +
>> +Multi-PF or Socket-direct in mlx5 is achieved by grouping PFs
>> together which belong to the same
>> +NIC and has the socket-direct property enabled, once all PFS are
>> probed, we create a single netdev
>
> s/PFS/PFs/
>
>> +to represent all of them, symmetrically, we destroy the netdev
>> whenever any of the PFs is removed.
>> +
>> +The netdev network channels are distributed between all devices, a
>> proper configuration would utilize
>> +the correct close numa node when working on a certain app/cpu.
>
> CPU
>
>> +
>> +We pick one PF to be a primary (leader), and it fills a special role.
>> The other devices
>> +(secondaries) are disconnected from the network at the chip level
>> (set to silent mode). In silent
>> +mode, no south <-> north traffic flowing directly through a secondary
>> PF. It needs the assistance of
>> +the leader PF (east <-> west traffic) to function. All RX/TX traffic
>> is steered through the primary
>
> Rx, Tx (whole document)
>
>> +to/from the secondaries.
>> +
>> +Currently, we limit the support to PFs only, and up to two PFs
>> (sockets).
>> +
>> +Channels distribution
>> +=====================
>> +
>> +We distribute the channels between the different PFs to achieve local
>> NUMA node performance
>> +on multiple NUMA nodes.
>> +
>> +Each combined channel works against one specific PF, creating all its
>> datapath queues against it. We
>> +distribute channels to PFs in a round-robin policy.
>> +
>> +::
>> +
>> + Example for 2 PFs and 5 channels:
>> + +--------+--------+
>> + | ch idx | PF idx |
>> + +--------+--------+
>> + | 0 | 0 |
>> + | 1 | 1 |
>> + | 2 | 0 |
>> + | 3 | 1 |
>> + | 4 | 0 |
>> + +--------+--------+
>> +
>> +
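To make the table above concrete, here is a minimal sketch of the mapping it
shows. It is an illustration only, not the mlx5 driver code: the helper name
channel_to_pf() is hypothetical, and I am assuming the round-robin policy
reduces to "channel index modulo number of PFs", which is what the table is
consistent with.

```python
def channel_to_pf(ch_idx: int, num_pfs: int) -> int:
    """Hypothetical helper: round-robin channel-to-PF mapping (assumed modulo)."""
    if num_pfs <= 0:
        raise ValueError("num_pfs must be positive")
    if ch_idx < 0:
        raise ValueError("ch_idx must be non-negative")
    return ch_idx % num_pfs


def build_mapping(num_channels: int, num_pfs: int) -> list[tuple[int, int]]:
    """Return (channel index, PF index) pairs for all channels."""
    return [(ch, channel_to_pf(ch, num_pfs)) for ch in range(num_channels)]


# The 2-PF, 5-channel case from the table.
mapping_5 = build_mapping(5, 2)

# With more channels configured, the existing channels keep their PF.
mapping_8 = build_mapping(8, 2)

if __name__ == "__main__":
    for ch, pf in mapping_5:
        print(f"ch {ch} -> PF {pf}")
```

Note that the mapping for 5 channels is a prefix of the mapping for 8 channels,
so under this assumption a channel index never moves to a different PF when the
channel count changes.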
>> +We prefer this round-robin distribution policy over another suggested
>> intuitive distribution, in
>> +which we first distribute one half of the channels to PF0 and then
>> the second half to PF1.
>
> Please rephrase to describe current state (which makes sense over what
> was suggested), instead of addressing feedback (that could be kept in
> cover letter if you really want).
>
> And again, the wording "we" clearly indicates that this section, as
> future ones, is mlx specific.
>
>> +
>> +The reason we prefer round-robin is, it is less influenced by changes
>> in the number of channels. The
>> +mapping between a channel index and a PF is fixed, no matter how many
>> channels the user configures.
>> +As the channel stats are persistent across channel's closure,
>> changing the mapping every single time
>> +would turn the accumulative stats less representing of the channel's
>> history.
>> +
>> +This is achieved by using the correct core device instance (mdev) in
>> each channel, instead of them
>> +all using the same instance under "priv->mdev".
>> +
>> +Observability
>> +=============
>> +The relation between PF, irq, napi, and queue can be observed via
>> netlink spec:
>> +
>> +$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml
>> --dump queue-get --json='{"ifindex": 13}'
>> +[{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
>> + {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
>> + {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
>> + {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
>> + {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
>> + {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
>> + {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
>> + {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
>> + {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
>> + {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]
>> +
>> +$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml
>> --dump napi-get --json='{"ifindex": 13}'
>> +[{'id': 543, 'ifindex': 13, 'irq': 42},
>> + {'id': 542, 'ifindex': 13, 'irq': 41},
>> + {'id': 541, 'ifindex': 13, 'irq': 40},
>> + {'id': 540, 'ifindex': 13, 'irq': 39},
>> + {'id': 539, 'ifindex': 13, 'irq': 36}]
>> +
>> +Here you can clearly observe our channels distribution policy:
>> +
>> +$ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
>> +/proc/irq/36/mlx5_comp1@pci:0000:08:00.0
>> +/proc/irq/39/mlx5_comp1@pci:0000:09:00.0
>> +/proc/irq/40/mlx5_comp2@pci:0000:08:00.0
>> +/proc/irq/41/mlx5_comp2@pci:0000:09:00.0
>> +/proc/irq/42/mlx5_comp3@pci:0000:08:00.0
>> +
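The three listings above can be joined by hand, or with a few lines of Python.
The sketch below is only an illustration: the data is copied from the sample
output above (Rx queues only), the helper name rx_queue_to_pci() is
hypothetical, and splitting on "@pci:" assumes the IRQ naming seen in this
particular listing.

```python
# Sample data copied from the queue-get dump above (Rx entries only).
queues = [
    {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
    {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
    {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
    {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
    {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
]

# Sample data copied from the napi-get dump above.
napis = [
    {'id': 543, 'ifindex': 13, 'irq': 42},
    {'id': 542, 'ifindex': 13, 'irq': 41},
    {'id': 541, 'ifindex': 13, 'irq': 40},
    {'id': 540, 'ifindex': 13, 'irq': 39},
    {'id': 539, 'ifindex': 13, 'irq': 36},
]

# IRQ number -> directory name, copied from the /proc/irq listing above.
irq_names = {
    36: 'mlx5_comp1@pci:0000:08:00.0',
    39: 'mlx5_comp1@pci:0000:09:00.0',
    40: 'mlx5_comp2@pci:0000:08:00.0',
    41: 'mlx5_comp2@pci:0000:09:00.0',
    42: 'mlx5_comp3@pci:0000:08:00.0',
}


def rx_queue_to_pci(queues, napis, irq_names):
    """Hypothetical helper: map Rx queue id -> PCI function via NAPI id and IRQ."""
    irq_by_napi = {n['id']: n['irq'] for n in napis}
    result = {}
    for q in queues:
        if q['type'] != 'rx':
            continue
        irq = irq_by_napi[q['napi-id']]
        # Assumes names of the form "<name>@pci:<PCI address>".
        result[q['id']] = irq_names[irq].split('@pci:', 1)[1]
    return result


table = rx_queue_to_pci(queues, napis, irq_names)

if __name__ == "__main__":
    for qid, pci in sorted(table.items()):
        print(f"rx queue {qid} -> {pci}")
```

In this sample, even queue indices land on 0000:08:00.0 and odd ones on
0000:09:00.0, which matches the round-robin distribution.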
>> +Steering
>> +========
>> +Secondary PFs are set to "silent" mode, meaning they are disconnected
>> from the network.
>> +
>> +In RX, the steering tables belong to the primary PF only, and it is
>> its role to distribute incoming
>> +traffic to other PFs, via cross-vhca steering capabilities. Nothing
>> special about the RSS table
>> +content, except that it needs a capable device to point to the
>> receive queues of a different PF.
>
> I guess you cannot enable the multi-pf for incapable device, so there is
> anything noteworthy in last sentence?
>
I was asked in earlier patchsets to elaborate on this.
It describes what the RSS table looks like on a capable device.
Maybe I should rephrase to emphasize the point.
It is not obvious that we still maintain a single RSS table, as in
non-multi-PF netdevs. Preserving this (over other, more complex
alternatives) is what is noteworthy here.
>> +
>> +In TX, the primary PF creates a new TX flow table, which is aliased
>> by the secondaries, so they can
>> +go out to the network through it.
>> +
>> +In addition, we set default XPS configuration that, based on the cpu,
>> selects an SQ belonging to the
>> +PF on the same node as the cpu.
>> +
>> +XPS default config example:
>> +
>> +NUMA node(s): 2
>> +NUMA node0 CPU(s): 0-11
>> +NUMA node1 CPU(s): 12-23
>> +
>> +PF0 on node0, PF1 on node1.
>> +
>> +- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
>> +- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
>> +- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
>> +- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
>> +- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
>> +- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
>> +- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
>> +- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
>> +- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
>> +- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
>> +- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
>> +- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
>> +- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
>> +- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
>> +- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
>> +- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
>> +- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
>> +- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
>> +- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
>> +- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
>> +- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
>> +- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
>> +- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
>> +- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
>> +
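For reference, the masks in the listing above can be reproduced with a short
sketch. This is an illustration under stated assumptions, not the driver
logic: the helper name default_xps_masks() is hypothetical, I assume Tx queue i
belongs to PF (i mod number of PFs), and I assume each PF hands out the CPUs of
its own node in order (wrapping around if there are more queues than CPUs). The
6-digit hex formatting simply mirrors the listing.

```python
def default_xps_masks(node_cpus, num_queues):
    """Hypothetical helper: one single-CPU XPS mask per Tx queue.

    node_cpus[pf] is the list of CPUs on the NUMA node local to that PF.
    """
    num_pfs = len(node_cpus)
    masks = []
    for q in range(num_queues):
        pf = q % num_pfs                        # assumed round-robin queue -> PF
        cpus = node_cpus[pf]
        cpu = cpus[(q // num_pfs) % len(cpus)]  # next CPU on the PF's node
        masks.append(1 << cpu)                  # bitmask with only that CPU set
    return masks


# Topology from the example: node0 has CPUs 0-11, node1 has CPUs 12-23,
# PF0 is on node0 and PF1 is on node1.
node_cpus = [list(range(0, 12)), list(range(12, 24))]
masks = default_xps_masks(node_cpus, 24)
lines = [f"tx-{q}/xps_cpus:{m:06x}" for q, m in enumerate(masks)]

if __name__ == "__main__":
    print("\n".join(lines))
```

Each mask has a single bit set, so every Tx queue is preferred by exactly one
CPU, and that CPU sits on the same node as the PF owning the queue.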
>> +Mutually exclusive features
>> +===========================
>> +
>> +The nature of Multi-PF, where different channels work with different
>> PFs, conflicts with
>> +stateful features where the state is maintained in one of the PFs.
>> +For example, in the TLS device-offload feature, special context
>> objects are created per connection
>> +and maintained in the PF. Transitioning between different RQs/SQs
>> would break the feature. Hence,
>> +we disable this combination for now.
>
> From the reading I will know what the feature is at the user level.
>
> After splitting most of the doc out into mlx5 file, and fixing the minor
> typos, feel free to add my:
>
> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
>
Thanks.