Re: [net-next v4 00/15] Add mlx5 subfunction support


From: David Ahern <dsahern@gmail.com>
To: Parav Pandit <parav@nvidia.com>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	Saeed Mahameed <saeed@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	Netdev <netdev@vger.kernel.org>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	David Ahern <dsahern@kernel.org>,
	Jacob Keller <jacob.e.keller@intel.com>,
	Sridhar Samudrala <sridhar.samudrala@intel.com>,
	"Ertman, David M" <david.m.ertman@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Kiran Patil <kiran.patil@intel.com>,
	Greg KH <gregkh@linuxfoundation.org>
Subject: Re: [net-next v4 00/15] Add mlx5 subfunction support
Date: Tue, 15 Dec 2020 13:59:22 -0700
Message-ID: <f2c1d4c6-2bca-8c9d-a347-e18f44181f7f@gmail.com>
In-Reply-To: <BY5PR12MB43221CE397D6310F2B04D9B4DCC60@BY5PR12MB4322.namprd12.prod.outlook.com>

On 12/14/20 10:48 PM, Parav Pandit wrote:
> 
>> From: Alexander Duyck <alexander.duyck@gmail.com>
>> Sent: Tuesday, December 15, 2020 7:24 AM
>>
>> On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
>> wrote:
>>>
>>> Hi Dave, Jakub, Jason,
>>>
>>
>> Just to clarify a few things for myself. You mention virtualization and SR-IOV
>> in your patch description but you cannot support direct assignment with this
>> correct? 
> Correct. It cannot be directly assigned.
> 
>> The idea here is simply logical partitioning of an existing network
>> interface, correct? 
> No. The idea is to spawn multiple functions from a single PCI device.
> These functions do not exist in the PCI device or in the OS until the user creates them.
> Jason and Saeed explained this in great detail a few weeks back in the v0 version of the patchset at [1], [2] and [3].
> I would rather not repeat all of it here. Please go through it.
> If you want to read the precursor to it, the RFC from Jiri at [4] also explains this in great detail.
> 
>> So this isn't so much a solution for virtualization, but may
>> work better for containers. I view this as an important distinction to make as
>> the first thing that came to mind when I read this was mediated devices
>> which is similar, but focused only on the virtualization case:
>> https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html
>>
> Managing subfunctions using mediated devices was already ruled out last year at [5], as it abuses the mdev bus for this purpose and has severe limitations for managing the subfunction device.
> We are not going back to it.
> It would duplicate a lot of the plumbing that exists in devlink, netlink, auxiliary bus and more.
>  
>> Rather than calling this a subfunction, would it make more sense to call it
>> something such as a queue set? 
> No, a queue is just one way to send and receive data/packets.
> Jason and Saeed explained and discussed this with you and others during v0 a few weeks back at [1], [2], [3].
> Please take a look.
> 
>> So in terms of ways to go I would argue this is likely better. However one
>> downside is that we are going to end up seeing each subfunction being
>> different from driver to driver and vendor to vendor which I would argue
>> was also one of the problems with SR-IOV as you end up with a bit of vendor
>> lock-in as a result of this feature since each vendor will be providing a
>> different interface.
>>
> Several vendors provide a unified interface for managing VFs, i.e.
> (a) enable/disable is via vendor-neutral sysfs
> (b) the sriov capability is exposed via the standard pci capability and sysfs
> (c) sriov vf config (mac, vlan, rss, tx rate, spoof check, trust) uses vendor-agnostic netlink
> This holds even though the drivers' internal implementations largely differ in how trust, spoof check, mac, vlan, rate etc. are enforced.
> 
> So subfunction features/attributes/functionality will likewise be implemented differently inside each driver to match the vendor's device, behind the reasonably abstract concept of a 'subfunction'.
> 
>>> A Subfunction supports eswitch representation through which it
>>> supports tc offloads. User must configure eswitch to send/receive
>>> packets from/to subfunction port.
>>>
>>> Subfunctions share PCI level resources such as PCI MSI-X IRQs with
>>> their other subfunctions and/or with its parent PCI function.
>>
>> This piece to the architecture for this has me somewhat concerned. If all your
>> resources are shared and 
> All resources are not shared.
> 
>> you are allowing devices to be created
>> incrementally you either have to pre-partition the entire function which
>> usually results in limited resources for your base setup, or free resources
>> from existing interfaces and redistribute them as things change. I would be
>> curious which approach you are taking here? So for example if you hit a
>> certain threshold will you need to reset the port and rebalance the IRQs
>> between the various functions?
> No. It works a bit differently for the mlx5 device.
> When the base function is started, it starts as if it doesn't have any subfunctions.
> When a subfunction is instantiated, it spawns new resources in the device (hw, fw, memory) depending on how much the function wants.
> 
> For example, the PCI PF uses BAR 0, while subfunctions use BAR 2.
> For IRQs, a subfunction instance shares the IRQs with its parent/hosting PCI PF.
> In the future, yes, dedicated IRQs per SF are likely desired.
> Sridhar also talked about limiting the number of queues of a subfunction.
> I believe there will be resources/attributes of the function to be controlled.
> devlink already provides a rich interface to achieve that using devlink resources [8].
> 
> [..]
> 
>>> $ ip link show
>>> 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>>>     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
>>>     altname enp6s0f0np0
>>> 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>>>     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff
>>
>> I assume that p0sf88 is supposed to be the newly created subfunction.
>> However I thought the naming was supposed to be the same as what you are
>> referring to in the devlink, or did I miss something?
>>
> I believe you are confusing the representor netdevice of the subfunction with the devices of the subfunction (netdev, rdma, vdpa etc.).
> Please refer to the diagram in patch 15 at [7] to see the stack, modules and objects.
> I hope the description below clarifies a bit.
> There are two netdevices.
> (a) representor netdevice, attached to the devlink port of the eswitch
> (b) netdevice of the SF used by the end application (in your example, this is assigned to a container).
>  
> The two netdevices obviously follow different naming schemes.
> The representor netdevice follows the naming scheme well defined in the kernel + systemd/udev v245 and higher.
> It is based on the phys_port_name sysfs attribute.
> This is the same scheme that existing PF and SF representors have used for a year+ now; it is further used by subfunctions.
> 
> For the subfunction netdevice (p0sf88), systemd/udev will be extended. I put an example based on my few lines of udev rule that read
> phys_port_name and the user-supplied sfnum, so that the user knows exactly which interface to assign to a container.
> 
>>> After use inactivate the function:
>>> $ devlink port function set ens2f0npf0sf88 state inactive
>>>
>>> Now delete the subfunction port:
>>> $ devlink port del ens2f0npf0sf88
>>
>> This seems wrong to me as it breaks the symmetry with the port add
>> command and
> The representor device example is only there to make life easier for the user.
> The devlink port del command works on the devlink port index, just like the existing devlink port commands (get, set, split, unsplit).
> I explained this in a thread with Sridhar at [6].
> In short, a devlink port del <bus/device_name/port_index> command is just fine.
> The port index is the unique handle within the devlink instance that the user refers to in order to delete, get or set the port and port function attributes after its creation.
> I chose the representor netdev example because it is more intuitive to relate to, but the port index is equally fine and supported.
> 
>> assumes you have ownership of the interface in the host. I
>> would much prefer to see the same arguments that were passed to the
>> add command being used to do the teardown as that would allow for the
>> parent function to create the object, assign it to a container namespace, and
>> not need to pull it back in order to destroy it.
> The parent function will not have the same netdevice name as the representor netdevice, because both devices exist in a single system in a large part of the use cases.
> So the port delete command works on the port index.
> The host doesn't need to pull it back to destroy it. It is destroyed via the port del command.
> 
> [1] https://lore.kernel.org/netdev/20201112192424.2742-1-parav@nvidia.com/
> [2] https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
> [3] https://lore.kernel.org/netdev/20201120161659.GE917484@nvidia.com/
> [4] https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/
> [5] https://lore.kernel.org/netdev/20191107160448.20962-1-parav@mellanox.com/
> [6] https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@BY5PR12MB4322.namprd12.prod.outlook.com/
> [7] https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@kernel.org/T/#u
> [8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html
> 

Seems to be a repeated line of questions. You might want to add these
FAQs, responses and references to the subfunction document once this set
gets merged.
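
For the netdev naming question in particular, such an FAQ entry could
carry a small example. A minimal sketch, assuming the SF representor's
phys_port_name has the form pf<N>sf<M>, as the ens2f0npf0sf88 name in the
quoted example suggests. The helper name sf_port_fields is hypothetical
and is not part of devlink, iproute2 or udev:

```shell
#!/bin/sh
# Hypothetical helper, for illustration only: split an SF representor
# phys_port_name of the assumed form "pf<N>sf<M>" into its PF number
# and SF number, so a user can tell which subfunction a port belongs to.
sf_port_fields() {
    name=$1
    rest=${name#pf}        # strip leading "pf":    "pf0sf88" -> "0sf88"
    pfnum=${rest%%sf*}     # drop "sf" and after:   "0sf88"   -> "0"
    sfnum=${rest#*sf}      # drop up to first "sf": "0sf88"   -> "88"

    # Reject anything that does not round-trip to pf<digits>sf<digits>.
    if [ "$name" != "pf${pfnum}sf${sfnum}" ]; then
        echo "not an SF phys_port_name: $name" >&2
        return 1
    fi
    case "$pfnum" in ''|*[!0-9]*) echo "bad pfnum in: $name" >&2; return 1 ;; esac
    case "$sfnum" in ''|*[!0-9]*) echo "bad sfnum in: $name" >&2; return 1 ;; esac

    echo "pfnum=$pfnum sfnum=$sfnum"
}

sf_port_fields pf0sf88    # prints: pfnum=0 sfnum=88
```

On a live system the input would come from the phys_port_name sysfs
attribute of the representor, e.g.
sf_port_fields "$(cat /sys/class/net/ens2f0npf0sf88/phys_port_name)".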


