From: Jiri Pirko <jiri@resnulli.us>
To: "Samudrala, Sridhar" <sridhar.samudrala@intel.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>,
Jakub Kicinski <kuba@kernel.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Tariq Toukan <ttoukan.linux@gmail.com>,
Saeed Mahameed <saeed@kernel.org>,
"David S. Miller" <davem@davemloft.net>,
Paolo Abeni <pabeni@redhat.com>,
Eric Dumazet <edumazet@google.com>,
Saeed Mahameed <saeedm@nvidia.com>,
netdev@vger.kernel.org, Tariq Toukan <tariqt@nvidia.com>,
Gal Pressman <gal@nvidia.com>,
Leon Romanovsky <leonro@nvidia.com>
Subject: Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
Date: Sat, 24 Feb 2024 13:48:48 +0100 [thread overview]
Message-ID: <ZdnlsMgqsyOPaeyC@nanopsycho> (raw)
In-Reply-To: <ef0270c5-3128-40d8-933e-f9dbeddf5961@intel.com>
Sat, Feb 24, 2024 at 12:56:52AM CET, sridhar.samudrala@intel.com wrote:
>
>
>On 2/23/2024 3:40 AM, Jiri Pirko wrote:
>> Fri, Feb 23, 2024 at 06:00:40AM CET, sridhar.samudrala@intel.com wrote:
>> >
>> >
>> > On 2/22/2024 8:05 PM, Jay Vosburgh wrote:
>> > > Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote:
>> > > > On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>> > > > > On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>> > > > > > On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>> > > > > > > Greg, we have a feature here where a single device of class net has
>> > > > > > > multiple "bus parents". We used to have one attr under class net
>> > > > > > > (device) which is a link to the bus parent. Now we either need to add
>> > > > > > > more or not bother with the linking of the whole device. Is there any
>> > > > > > > precedent / preference for solving this from the device model
>> > > > > > > perspective?
>> > > > > >
>> > > > > > How, logically, can a netdevice be controlled properly from 2 parent
>> > > > > > devices on two different busses? How is that even possible from a
>> > > > > > physical point-of-view? What exact bus types are involved here?
>> > > > > Two PCIe buses, two endpoints, two networking ports. It's one piece
>> > > >
>> > > > Isn't it only 1 networking port with multiple PFs?
>> > > >
>> > > > > of silicon, tho, so the "slices" can talk to each other internally.
>> > > > > The NVRAM configuration tells both endpoints that the user wants
>> > > > > them "bonded", when the PCI drivers probe they "find each other"
>> > > > > using some cookie or DSN or whatnot. And once they did, they spawn
>> > > > > a single netdev.
>> > > > >
>> > > > > > This "shouldn't" be possible as in the end, it's usually a PCI device
>> > > > > > handling this all, right?
>> > > > > It's really a special type of bonding of two netdevs. Like you'd bond
>> > > > > two ports to get twice the bandwidth. With the twist that the balancing
>> > > > > is done on NUMA proximity, rather than traffic hash.
>> > > > > Well, plus, the major twist that it's all done magically "for you"
>> > > > > in the vendor driver, and the two "lower" devices are not visible.
>> > > > > You only see the resulting bond.
>> > > > > I personally think that the magic hides as many problems as it
>> > > > > introduces and we'd be better off creating two separate netdevs.
>> > > > > And then a new type of "device bond" on top. Small win that
>> > > > > the "new device bond on top" can be shared code across vendors.
>> > > >
>> > > > Yes. We have been exploring a small extension to bonding driver to enable
>> > > > a single numa-aware multi-threaded application to efficiently utilize
>> > > > multiple NICs across numa nodes.
>> > >
>> > > Is this referring to something like the multi-pf under
>> > > discussion, or just generically with two arbitrary network devices
>> > > installed one each per NUMA node?
>> >
>> > Normal network devices one per NUMA node
>> >
>> > >
>> > > > Here is an early version of a patch we have been trying and seems to be
>> > > > working well.
>> > > >
>> > > > =========================================================================
>> > > > bonding: select tx device based on rx device of a flow
>> > > >
>> > > > If napi_id is cached in the sk associated with skb, use the
>> > > > device associated with napi_id as the transmit device.
>> > > >
>> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > > >
>> > > > diff --git a/drivers/net/bonding/bond_main.c
>> > > > b/drivers/net/bonding/bond_main.c
>> > > > index 7a7d584f378a..77e3bf6c4502 100644
>> > > > --- a/drivers/net/bonding/bond_main.c
>> > > > +++ b/drivers/net/bonding/bond_main.c
>> > > > @@ -5146,6 +5146,30 @@ static struct slave
>> > > > *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
>> > > > unsigned int count;
>> > > > u32 hash;
>> > > >
>> > > > + if (skb->sk) {
>> > > > + int napi_id = skb->sk->sk_napi_id;
>> > > > + struct net_device *dev;
>> > > > + int idx;
>> > > > +
>> > > > + rcu_read_lock();
>> > > > + dev = dev_get_by_napi_id(napi_id);
>> > > > + rcu_read_unlock();
>> > > > +
>> > > > + if (!dev)
>> > > > + goto hash;
>> > > > +
>> > > > + count = slaves ? READ_ONCE(slaves->count) : 0;
>> > > > + if (unlikely(!count))
>> > > > + return NULL;
>> > > > +
>> > > > + for (idx = 0; idx < count; idx++) {
>> > > > + slave = slaves->arr[idx];
>> > > > + if (slave->dev->ifindex == dev->ifindex)
>> > > > + return slave;
>> > > > + }
>> > > > + }
>> > > > +
>> > > > +hash:
>> > > > hash = bond_xmit_hash(bond, skb);
>> > > > count = slaves ? READ_ONCE(slaves->count) : 0;
>> > > > if (unlikely(!count))
>> > > > =========================================================================
>> > > >
>> > > > If we make this as a configurable bonding option, would this be an
>> > > > acceptable solution to accelerate numa-aware apps?
>> > >
>> > > Assuming for the moment this is for "regular" network devices
>> > > installed one per NUMA node, why do this in bonding instead of at a
>> > > higher layer (multiple subnets or ECMP, for example)?
>> > >
>> > > Is the intent here that the bond would aggregate its interfaces
>> > > via LACP with the peer being some kind of cross-chassis link aggregation
>> > > (MLAG, et al)?
>>
>> No.
>>
>> >
>> > Yes. basic LACP bonding setup. There could be multiple peers connecting to
>> > the server via switch providing LACP based link aggregation. No cross-chassis
>> > MLAG.
>>
>> LACP does not make any sense, when you have only a single physical port.
>> That applies to ECMP mentioned above too I believe.
>
>I meant for the 2 regular NICs on 2 numa node setup, not for multi-PF 1 port
>setup.
Okay, not sure how it is related to this thread then :)
next prev parent reply other threads:[~2024-02-24 12:48 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-15 3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 01/15] net/mlx5: Add MPIR bit in mcam_access_reg Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 02/15] net/mlx5: SD, Introduce SD lib Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 03/15] net/mlx5: SD, Implement basic query and instantiation Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 04/15] net/mlx5: SD, Implement devcom communication and primary election Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 05/15] net/mlx5: SD, Implement steering for primary and secondaries Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 06/15] net/mlx5: SD, Add informative prints in kernel log Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 07/15] net/mlx5: SD, Add debugfs Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 08/15] net/mlx5e: Create single netdev per SD group Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 09/15] net/mlx5e: Create EN core HW resources for all secondary devices Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 10/15] net/mlx5e: Let channels be SD-aware Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 11/15] net/mlx5e: Support cross-vhca RSS Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 12/15] net/mlx5e: Support per-mdev queue counter Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 13/15] net/mlx5e: Block TLS device offload on combined SD netdev Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 14/15] net/mlx5: Enable SD feature Saeed Mahameed
2024-02-15 3:08 ` [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev Saeed Mahameed
2024-02-16 5:23 ` Jakub Kicinski
2024-02-19 15:26 ` Tariq Toukan
2024-02-21 1:33 ` Jakub Kicinski
2024-02-21 2:10 ` Saeed Mahameed
2024-02-22 7:51 ` Greg Kroah-Hartman
2024-02-22 23:00 ` Jakub Kicinski
2024-02-23 1:23 ` Samudrala, Sridhar
2024-02-23 2:05 ` Jay Vosburgh
2024-02-23 5:00 ` Samudrala, Sridhar
2024-02-23 9:40 ` Jiri Pirko
2024-02-23 23:56 ` Samudrala, Sridhar
2024-02-24 12:48 ` Jiri Pirko [this message]
2024-02-23 9:36 ` Jiri Pirko
2024-02-28 2:06 ` Jakub Kicinski
2024-02-28 8:13 ` Jiri Pirko
2024-02-28 17:06 ` Jakub Kicinski
2024-02-28 17:43 ` Jakub Kicinski
2024-03-02 7:31 ` Saeed Mahameed
2024-02-29 8:21 ` Jiri Pirko
2024-02-29 14:34 ` Jakub Kicinski
2024-02-19 18:04 ` Jiri Pirko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZdnlsMgqsyOPaeyC@nanopsycho \
--to=jiri@resnulli.us \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=gal@nvidia.com \
--cc=gregkh@linuxfoundation.org \
--cc=jay.vosburgh@canonical.com \
--cc=kuba@kernel.org \
--cc=leonro@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=saeed@kernel.org \
--cc=saeedm@nvidia.com \
--cc=sridhar.samudrala@intel.com \
--cc=tariqt@nvidia.com \
--cc=ttoukan.linux@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox