Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev

public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed

From: Jiri Pirko <jiri@resnulli.us>
To: Jakub Kicinski <kuba@kernel.org>
Cc: "Samudrala, Sridhar" <sridhar.samudrala@intel.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Tariq Toukan <ttoukan.linux@gmail.com>,
	Saeed Mahameed <saeed@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Paolo Abeni <pabeni@redhat.com>,
	Eric Dumazet <edumazet@google.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	netdev@vger.kernel.org, Tariq Toukan <tariqt@nvidia.com>,
	Gal Pressman <gal@nvidia.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	jay.vosburgh@canonical.com
Subject: Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
Date: Wed, 28 Feb 2024 09:13:57 +0100	[thread overview]
Message-ID: <Zd7rRTSSLO9-DM2t@nanopsycho> (raw)
In-Reply-To: <20240227180619.7e908ac4@kernel.org>

Wed, Feb 28, 2024 at 03:06:19AM CET, kuba@kernel.org wrote:
>On Fri, 23 Feb 2024 10:36:25 +0100 Jiri Pirko wrote:
>> >> It's really a special type of bonding of two netdevs. Like you'd bond
>> >> two ports to get twice the bandwidth. With the twist that the balancing
>> >> is done on NUMA proximity, rather than traffic hash.
>> >> 
>> >> Well, plus, the major twist that it's all done magically "for you"
>> >> in the vendor driver, and the two "lower" devices are not visible.
>> >> You only see the resulting bond.
>> >> 
>> >> I personally think that the magic hides as many problems as it
>> >> introduces and we'd be better off creating two separate netdevs.
>> >> And then a new type of "device bond" on top. Small win that
>> >> the "new device bond on top" can be shared code across vendors.  
>> >
>> >Yes. We have been exploring a small extension to bonding driver to enable a
>> >single numa-aware multi-threaded application to efficiently utilize multiple
>> >NICs across numa nodes.  
>> 
>> Bonding was my immediate response when we discussed this internally for
>> the first time. But I had to eventually admit it is probably not that
>> suitable in this case, here's why:
>> 1) there are no 2 physical ports, only one.
>
>Right, sorry, number of PFs matches number of ports for each bus.
>But it's not necessarily a deal breaker - it's similar to a multi-host
>device. We also have multiple netdevs and PCIe links, they just go to
>different host rather than different NUMA nodes on one host.

That is a different scenario. You have multiple hosts and a switch
between them and the physical port. Yeah, it might be invisible switch,
but there still is one. On DPU/smartnic, it is visible and configurable.


>
>> 2) it is basically a matter of device layout/provisioning that this
>>    feature should be enabled, not user configuration.
>
>We can still auto-instantiate it, not a deal breaker.

"Auto-instantiate" in meating of userspace orchestration deamon,
not kernel, that's what you mean?


>
>I'm not sure you're right in that assumption, tho. At Meta, we support
>container sizes ranging from few CPUs to multiple NUMA nodes. Each NUMA
>node may have it's own NIC, and the orchestration needs to stitch and
>un-stitch NICs depending on whether the cores were allocated to small
>containers or a huge one.

Yeah, but still, there is one physical port for NIC-numanode pair.
Correct? Does the orchestration setup a bond on top of them or some other
master device or let the container use them independently?


>
>So it would be _easier_ to deal with multiple netdevs. Orchestration
>layer already understands netdev <> NUMA mapping, it does not understand
>multi-NUMA netdevs, and how to match up queues to nodes.
>
>> 3) other subsystems like RDMA would benefit the same feature, so this
>>    int not netdev specific in general.
>
>Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.

Not really. It's just needed to consider all usecases, not only netdev.


>
>Anyway, back to the initial question - from Greg's reply I'm guessing
>there's no precedent for doing such things in the device model either.
>So we're on our own.

next prev parent reply	other threads:[~2024-02-28  8:14 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 01/15] net/mlx5: Add MPIR bit in mcam_access_reg Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 02/15] net/mlx5: SD, Introduce SD lib Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 03/15] net/mlx5: SD, Implement basic query and instantiation Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 04/15] net/mlx5: SD, Implement devcom communication and primary election Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 05/15] net/mlx5: SD, Implement steering for primary and secondaries Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 06/15] net/mlx5: SD, Add informative prints in kernel log Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 07/15] net/mlx5: SD, Add debugfs Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 08/15] net/mlx5e: Create single netdev per SD group Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 09/15] net/mlx5e: Create EN core HW resources for all secondary devices Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 10/15] net/mlx5e: Let channels be SD-aware Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 11/15] net/mlx5e: Support cross-vhca RSS Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 12/15] net/mlx5e: Support per-mdev queue counter Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 13/15] net/mlx5e: Block TLS device offload on combined SD netdev Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 14/15] net/mlx5: Enable SD feature Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev Saeed Mahameed
2024-02-16  5:23   ` Jakub Kicinski
2024-02-19 15:26     ` Tariq Toukan
2024-02-21  1:33       ` Jakub Kicinski
2024-02-21  2:10         ` Saeed Mahameed
2024-02-22  7:51         ` Greg Kroah-Hartman
2024-02-22 23:00           ` Jakub Kicinski
2024-02-23  1:23             ` Samudrala, Sridhar
2024-02-23  2:05               ` Jay Vosburgh
2024-02-23  5:00                 ` Samudrala, Sridhar
2024-02-23  9:40                   ` Jiri Pirko
2024-02-23 23:56                     ` Samudrala, Sridhar
2024-02-24 12:48                       ` Jiri Pirko
2024-02-23  9:36               ` Jiri Pirko
2024-02-28  2:06                 ` Jakub Kicinski
2024-02-28  8:13                   ` Jiri Pirko [this message]
2024-02-28 17:06                     ` Jakub Kicinski
2024-02-28 17:43                       ` Jakub Kicinski
2024-03-02  7:31                         ` Saeed Mahameed
2024-02-29  8:21                       ` Jiri Pirko
2024-02-29 14:34                         ` Jakub Kicinski
2024-02-19 18:04   ` Jiri Pirko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zd7rRTSSLO9-DM2t@nanopsycho \
    --to=jiri@resnulli.us \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=gal@nvidia.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=jay.vosburgh@canonical.com \
    --cc=kuba@kernel.org \
    --cc=leonro@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=saeed@kernel.org \
    --cc=saeedm@nvidia.com \
    --cc=sridhar.samudrala@intel.com \
    --cc=tariqt@nvidia.com \
    --cc=ttoukan.linux@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox