All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jiri Pirko <jiri@resnulli.us>
To: "Samudrala, Sridhar" <sridhar.samudrala@intel.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>,
	Jakub Kicinski <kuba@kernel.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Tariq Toukan <ttoukan.linux@gmail.com>,
	Saeed Mahameed <saeed@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Paolo Abeni <pabeni@redhat.com>,
	Eric Dumazet <edumazet@google.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	netdev@vger.kernel.org, Tariq Toukan <tariqt@nvidia.com>,
	Gal Pressman <gal@nvidia.com>,
	Leon Romanovsky <leonro@nvidia.com>
Subject: Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
Date: Fri, 23 Feb 2024 10:40:20 +0100	[thread overview]
Message-ID: <ZdhoBOKc40DeVCfG@nanopsycho> (raw)
In-Reply-To: <b7b89300-8065-4421-9935-3adf70ac47bc@intel.com>

Fri, Feb 23, 2024 at 06:00:40AM CET, sridhar.samudrala@intel.com wrote:
>
>
>On 2/22/2024 8:05 PM, Jay Vosburgh wrote:
>> Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote:
>> > On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>> > > On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>> > > > On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>> > > > > Greg, we have a feature here where a single device of class net has
>> > > > > multiple "bus parents". We used to have one attr under class net
>> > > > > (device) which is a link to the bus parent. Now we either need to add
>> > > > > more or not bother with the linking of the whole device. Is there any
>> > > > > precedent / preference for solving this from the device model
>> > > > > perspective?
>> > > > 
>> > > > How, logically, can a netdevice be controlled properly from 2 parent
>> > > > devices on two different busses?  How is that even possible from a
>> > > > physical point-of-view?  What exact bus types are involved here?
>> > > Two PCIe buses, two endpoints, two networking ports. It's one piece
>> > 
>> > Isn't it only 1 networking port with multiple PFs?
>> > 
>> > > of silicon, tho, so the "slices" can talk to each other internally.
>> > > The NVRAM configuration tells both endpoints that the user wants
>> > > them "bonded", when the PCI drivers probe they "find each other"
>> > > using some cookie or DSN or whatnot. And once they did, they spawn
>> > > a single netdev.
>> > > 
>> > > > This "shouldn't" be possible as in the end, it's usually a PCI device
>> > > > handling this all, right?
>> > > It's really a special type of bonding of two netdevs. Like you'd bond
>> > > two ports to get twice the bandwidth. With the twist that the balancing
>> > > is done on NUMA proximity, rather than traffic hash.
>> > > Well, plus, the major twist that it's all done magically "for you"
>> > > in the vendor driver, and the two "lower" devices are not visible.
>> > > You only see the resulting bond.
>> > > I personally think that the magic hides as many problems as it
>> > > introduces and we'd be better off creating two separate netdevs.
>> > > And then a new type of "device bond" on top. Small win that
>> > > the "new device bond on top" can be shared code across vendors.
>> > 
>> > Yes. We have been exploring a small extension to bonding driver to enable
>> > a single numa-aware multi-threaded application to efficiently utilize
>> > multiple NICs across numa nodes.
>> 
>> 	Is this referring to something like the multi-pf under
>> discussion, or just generically with two arbitrary network devices
>> installed one each per NUMA node?
>
>Normal network devices one per NUMA node
>
>> 
>> > Here is an early version of a patch we have been trying and seems to be
>> > working well.
>> > 
>> > =========================================================================
>> > bonding: select tx device based on rx device of a flow
>> > 
>> > If napi_id is cached in the sk associated with skb, use the
>> > device associated with napi_id as the transmit device.
>> > 
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > 
>> > diff --git a/drivers/net/bonding/bond_main.c
>> > b/drivers/net/bonding/bond_main.c
>> > index 7a7d584f378a..77e3bf6c4502 100644
>> > --- a/drivers/net/bonding/bond_main.c
>> > +++ b/drivers/net/bonding/bond_main.c
>> > @@ -5146,6 +5146,30 @@ static struct slave
>> > *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
>> >         unsigned int count;
>> >         u32 hash;
>> > 
>> > +       if (skb->sk) {
>> > +               int napi_id = skb->sk->sk_napi_id;
>> > +               struct net_device *dev;
>> > +               int idx;
>> > +
>> > +               rcu_read_lock();
>> > +               dev = dev_get_by_napi_id(napi_id);
>> > +               rcu_read_unlock();
>> > +
>> > +               if (!dev)
>> > +                       goto hash;
>> > +
>> > +               count = slaves ? READ_ONCE(slaves->count) : 0;
>> > +               if (unlikely(!count))
>> > +                       return NULL;
>> > +
>> > +               for (idx = 0; idx < count; idx++) {
>> > +                       slave = slaves->arr[idx];
>> > +                       if (slave->dev->ifindex == dev->ifindex)
>> > +                               return slave;
>> > +               }
>> > +       }
>> > +
>> > +hash:
>> >         hash = bond_xmit_hash(bond, skb);
>> >         count = slaves ? READ_ONCE(slaves->count) : 0;
>> >         if (unlikely(!count))
>> > =========================================================================
>> > 
>> > If we make this as a configurable bonding option, would this be an
>> > acceptable solution to accelerate numa-aware apps?
>> 
>> 	Assuming for the moment this is for "regular" network devices
>> installed one per NUMA node, why do this in bonding instead of at a
>> higher layer (multiple subnets or ECMP, for example)?
>> 
>> 	Is the intent here that the bond would aggregate its interfaces
>> via LACP with the peer being some kind of cross-chassis link aggregation
>> (MLAG, et al)?

No.

>
>Yes. basic LACP bonding setup. There could be multiple peers connecting to
>the server via switch providing LACP based link aggregation. No cross-chassis
>MLAG.

LACP does not make any sense, when you have only a single physical port.
That applies to ECMP mentioned above too I believe.



  reply	other threads:[~2024-02-23  9:40 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 01/15] net/mlx5: Add MPIR bit in mcam_access_reg Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 02/15] net/mlx5: SD, Introduce SD lib Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 03/15] net/mlx5: SD, Implement basic query and instantiation Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 04/15] net/mlx5: SD, Implement devcom communication and primary election Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 05/15] net/mlx5: SD, Implement steering for primary and secondaries Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 06/15] net/mlx5: SD, Add informative prints in kernel log Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 07/15] net/mlx5: SD, Add debugfs Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 08/15] net/mlx5e: Create single netdev per SD group Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 09/15] net/mlx5e: Create EN core HW resources for all secondary devices Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 10/15] net/mlx5e: Let channels be SD-aware Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 11/15] net/mlx5e: Support cross-vhca RSS Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 12/15] net/mlx5e: Support per-mdev queue counter Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 13/15] net/mlx5e: Block TLS device offload on combined SD netdev Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 14/15] net/mlx5: Enable SD feature Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev Saeed Mahameed
2024-02-16  5:23   ` Jakub Kicinski
2024-02-19 15:26     ` Tariq Toukan
2024-02-21  1:33       ` Jakub Kicinski
2024-02-21  2:10         ` Saeed Mahameed
2024-02-22  7:51         ` Greg Kroah-Hartman
2024-02-22 23:00           ` Jakub Kicinski
2024-02-23  1:23             ` Samudrala, Sridhar
2024-02-23  2:05               ` Jay Vosburgh
2024-02-23  5:00                 ` Samudrala, Sridhar
2024-02-23  9:40                   ` Jiri Pirko [this message]
2024-02-23 23:56                     ` Samudrala, Sridhar
2024-02-24 12:48                       ` Jiri Pirko
2024-02-23  9:36               ` Jiri Pirko
2024-02-28  2:06                 ` Jakub Kicinski
2024-02-28  8:13                   ` Jiri Pirko
2024-02-28 17:06                     ` Jakub Kicinski
2024-02-28 17:43                       ` Jakub Kicinski
2024-03-02  7:31                         ` Saeed Mahameed
2024-02-29  8:21                       ` Jiri Pirko
2024-02-29 14:34                         ` Jakub Kicinski
2024-02-19 18:04   ` Jiri Pirko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZdhoBOKc40DeVCfG@nanopsycho \
    --to=jiri@resnulli.us \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=gal@nvidia.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=jay.vosburgh@canonical.com \
    --cc=kuba@kernel.org \
    --cc=leonro@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=saeed@kernel.org \
    --cc=saeedm@nvidia.com \
    --cc=sridhar.samudrala@intel.com \
    --cc=tariqt@nvidia.com \
    --cc=ttoukan.linux@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.