From: Jiri Pirko <jiri@resnulli.us>
To: Jakub Kicinski <kuba@kernel.org>
Cc: "Samudrala, Sridhar" <sridhar.samudrala@intel.com>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Tariq Toukan <ttoukan.linux@gmail.com>,
Saeed Mahameed <saeed@kernel.org>,
"David S. Miller" <davem@davemloft.net>,
Paolo Abeni <pabeni@redhat.com>,
Eric Dumazet <edumazet@google.com>,
Saeed Mahameed <saeedm@nvidia.com>,
netdev@vger.kernel.org, Tariq Toukan <tariqt@nvidia.com>,
Gal Pressman <gal@nvidia.com>,
Leon Romanovsky <leonro@nvidia.com>,
jay.vosburgh@canonical.com
Subject: Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
Date: Wed, 28 Feb 2024 09:13:57 +0100
Message-ID: <Zd7rRTSSLO9-DM2t@nanopsycho>
In-Reply-To: <20240227180619.7e908ac4@kernel.org>
Wed, Feb 28, 2024 at 03:06:19AM CET, kuba@kernel.org wrote:
>On Fri, 23 Feb 2024 10:36:25 +0100 Jiri Pirko wrote:
>> >> It's really a special type of bonding of two netdevs. Like you'd bond
>> >> two ports to get twice the bandwidth. With the twist that the balancing
>> >> is done on NUMA proximity, rather than traffic hash.
>> >>
>> >> Well, plus, the major twist that it's all done magically "for you"
>> >> in the vendor driver, and the two "lower" devices are not visible.
>> >> You only see the resulting bond.
>> >>
>> >> I personally think that the magic hides as many problems as it
>> >> introduces and we'd be better off creating two separate netdevs.
>> >> And then a new type of "device bond" on top. Small win that
>> >> the "new device bond on top" can be shared code across vendors.
>> >
>> >Yes. We have been exploring a small extension to bonding driver to enable a
>> >single numa-aware multi-threaded application to efficiently utilize multiple
>> >NICs across numa nodes.
>>
>> Bonding was my immediate response when we discussed this internally for
>> the first time. But I had to eventually admit it is probably not that
>> suitable in this case, here's why:
>> 1) there are no 2 physical ports, only one.
>
>Right, sorry, number of PFs matches number of ports for each bus.
>But it's not necessarily a deal breaker - it's similar to a multi-host
>device. We also have multiple netdevs and PCIe links, they just go to
>different host rather than different NUMA nodes on one host.
That is a different scenario. You have multiple hosts and a switch
between them and the physical port. Yeah, it might be an invisible switch,
but there still is one. On a DPU/SmartNIC, it is visible and configurable.
>
>> 2) it is basically a matter of device layout/provisioning that this
>> feature should be enabled, not user configuration.
>
>We can still auto-instantiate it, not a deal breaker.
"Auto-instantiate" in the sense of a userspace orchestration daemon,
not the kernel, is that what you mean?
>
>I'm not sure you're right in that assumption, tho. At Meta, we support
>container sizes ranging from a few CPUs to multiple NUMA nodes. Each NUMA
>node may have its own NIC, and the orchestration needs to stitch and
>un-stitch NICs depending on whether the cores were allocated to small
>containers or a huge one.
Yeah, but still, there is one physical port per NIC-NUMA node pair.
Correct? Does the orchestration set up a bond on top of them, or some other
master device, or does it let the container use them independently?
>
>So it would be _easier_ to deal with multiple netdevs. Orchestration
>layer already understands netdev <> NUMA mapping, it does not understand
>multi-NUMA netdevs, and how to match up queues to nodes.
>
>> 3) other subsystems like RDMA would benefit from the same feature, so this
>> is not netdev specific in general.
>
>Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.
Not really. We just need to consider all use cases, not only netdev.
>
>Anyway, back to the initial question - from Greg's reply I'm guessing
>there's no precedent for doing such things in the device model either.
>So we're on our own.