netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jakub Kicinski <kuba@kernel.org>
To: Edward Cree <ecree.xilinx@gmail.com>
Cc: ecree@xilinx.com, netdev@vger.kernel.org, davem@davemloft.net,
	pabeni@redhat.com, edumazet@google.com, corbet@lwn.net,
	linux-doc@vger.kernel.org, linux-net-drivers@amd.com,
	Jacob Keller <jacob.e.keller@intel.com>,
	Jesse Brandeburg <jesse.brandeburg@intel.com>,
	Michael Chan <michael.chan@broadcom.com>,
	Andy Gospodarek <andy@greyhouse.net>,
	Saeed Mahameed <saeed@kernel.org>, Jiri Pirko <jiri@resnulli.us>,
	Shannon Nelson <snelson@pensando.io>,
	Simon Horman <simon.horman@corigine.com>,
	Alexander Duyck <alexander.duyck@gmail.com>
Subject: Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
Date: Mon, 8 Aug 2022 20:41:35 -0700	[thread overview]
Message-ID: <20220808204135.040a4516@kernel.org> (raw)
In-Reply-To: <71af8654-ca69-c492-7e12-ed7ff455a2f1@gmail.com>

On Mon, 8 Aug 2022 21:44:45 +0100 Edward Cree wrote:
> >> +What functions should have a representor?
> >> +-----------------------------------------
> >> +
> >> +Essentially, for each virtual port on the device's internal switch, there
> >> +should be a representor.
> >> +The only exceptions are the management PF (whose port is used for traffic to
> >> +and from all other representors)   
> > 
> > AFAIK there's no "management PF" in the Linux model.  
> 
> Maybe a bad word choice.  I'm referring to whichever PF (which likely
>  also has an ordinary netdevice) has administrative rights over the NIC /
>  internal switch at a firmware level.  Other names I've seen tossed
>  around include "primary PF", "admin PF".

I believe someone (mellanox?) used the term eswitch manager.
I'd use "host PF", somehow that makes most sense to me.

> >> and perhaps the physical network port (for
> >> +which the management PF may act as a kind of port representor.  Devices that
> >> +combine multiple physical ports and SR-IOV capability may need to have port
> >> +representors in addition to PF/VF representors).  
> > 
> > That doesn't generalize well. If we just say that all uplinks and PFs
> > should have a repr we don't have to make exceptions for all the cases
> > where that's the case.  
> 
> We could, but AFAIK that's not how existing drivers behave.  At least
>  when I experimented with a mlx NIC a couple of years ago I don't
>  recall it creating a repr for the primary PF or for the physical port,
>  only reprs for the VFs.

Mellanox is not the best example, I think they don't even support
uplink to uplink forwarding cleanly.

> >> + - Other PFs on the PCIe controller, and any VFs belonging to them.  
> > 
> > What is "the PCIe controller" here? I presume you've seen the
> > devlink-port doc.  
> 
> Yes, that's where I got this terminology from.
> "the" PCIe controller here is the one on which the mgmt PF lives.  For
>  instance you might have a NIC where you run OVS on a SoC inside the
>  chip, that has its own PCIe controller including a PF it uses to drive
>  the hardware v-switch (so it can offload OVS rules), in addition to
>  the PCIe controller that exposes PFs & VFs to the host you plug it
>  into through the physical PCIe socket / edge connector.
> In that case this bullet would refer to any additional PFs the SoC has
>  besides the management one...

IMO the model where there's a overall controller for the entire device
is also a mellanox limitation, due to lack of support for nested
switches.

Say I pay for a bare metal instance in my favorite public could. 
Why would the forwarding between VFs I spawn be controlled by the cloud
provider and not me?!

But perhaps Netronome was the only vendor capable of nested switching.

> >> + - PFs and VFs on other PCIe controllers on the device (e.g. for any embedded
> >> +   System-on-Chip within the SmartNIC).  
> 
> ... and this bullet to the PFs the host sees.
> 
> >> + - PFs and VFs with other personalities, including network block devices (such
> >> +   as a vDPA virtio-blk PF backed by remote/distributed storage).  
> > 
> > IDK how you can configure block forwarding (which is DMAs of command
> > + data blocks, not packets AFAIU) with the networking concepts..
> > I've not used the storage functions tho, so I could be wrong.  
> 
> Maybe I'm way off the beam here, but my understanding is that this
>  sort of thing involves a block interface between the host and the
>  NIC, but then something internal to the NIC converts those
>  operations into network operations (e.g. RDMA traffic or Ceph TCP
>  packets), which then go out on the network to access the actual
>  data.  In that case the back-end has to have network connectivity,
>  and the obvious™ way to do that is give it a v-port on the v-switch
>  just like anyone else.

I see. I don't think this covers all implementations. 

> >> +An example of a PCIe function that should *not* have a representor is, on an
> >> +FPGA-based NIC, a PF which is only used to deploy a new bitstream to the FPGA,
> >> +and which cannot create RX and TX queues.  
> > 
> > What's the thinking here? We're letting everyone add their own
> > exceptions to the doc?  
> 
> It was just the only example I could come up with of the more general
>  rule: if it doesn't have the ability to send and receive packets over
>  the network (directly or indirectly), then it won't have a virtual
>  port on the virtual switch, and so it doesn't make sense for it to
>  have a representor.
> No way to TX = nothing will ever be RXed on the rep; no way to RX = no
>  way to deliver anything you TX from the rep.  And nothing for TC
>  offload to act upon either for the same reasons.

No need to mention that, I'd think. Seems obvious.

> >> For example, ``ndo_start_xmit()`` might send the
> >> +packet, specially marked for delivery to the representee, through a TX queue
> >> +attached to the management PF.  
> > 
> > IDK how common that is, RDMA NICs will likely do the "dedicated queue
> > per repr" thing since they pretend to have infinite queues.  
> 
> Right.  But the queue is still created by the driver bound to the mgmt
>  PF, and using that PF for whatever BAR accesses it uses to create and
>  administer the queue, no?
> That's the important bit, and the details of how the NIC knows which
>  representee to deliver it to (dedicated queue, special TX descriptor,
>  whatever) are vendor-specific magic.  Better ways of phrasing that
>  are welcome :)

"TX queue attached to" made me think of a netdev Tx queue with a qdisc
rather than just a HW queue. No better ideas tho.

> >> +How are representors identified?
> >> +--------------------------------
> >> +
> >> +The representor netdevice should *not* directly refer to a PCIe device (e.g.
> >> +through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
> >> +representee or of the management PF.  
> > 
> > Do we know how many existing ones do?   
> 
> Idk.  From a quick look on lxr, mlx5 and ice do; as far as I can tell
>  nfp/flower does for "phy_reprs" but not "vnic_reprs".  nfp/abm does.
> 
> My reasoning for this "should not" here is that a repr is a pure
>  software device; compare e.g. if you build a vlan netdev on top of
>  eth0 it doesn't inherit eth0's device.
> Also, at least in theory this should avoid the problem with OpenStack
>  picking the wrong netdevice that you mentioned in [2], as this is
>  what controls the 'device' symlink in sysfs.

It makes sense. The thought I had was "what if a user reads this and
assumes it's never the case". But to be fair "should not" != "must not"
so we're probably good with your wording as is.

> >> + - ``pf<N>``, PCIe physical function index *N*.
> >> + - ``vf<N>``, PCIe virtual function index *N*.
> >> + - ``sf<N>``, Subfunction index *N*.  
> > 
> > Yeah, nah... implement devlink port, please. This is done by the core,
> > you shouldn't have to document this.  
> 
> Oh huh, I didn't know about __devlink_port_phys_port_name_get().
> Last time I looked, the drivers all had their own
>  .ndo_get_phys_port_name implementations (which is why I did one for
>  sfc), and any similarity between their string formats was purely an
>  (undocumented) convention.  TIL!
> (And it looks like the core uses `c<N>` for my `if<N>` that you were
>  so horrified by.  Devlink-port documentation doesn't make it super
>  clear whether controller 0 is "the controller that's in charge" or
>  "the controller from which we're viewing things", though I think in
>  practice it comes to the same thing.)

I think we had a bit. Perhaps @external? The controller which doesn't
have @external == true should be the local one IIRC. And by extension
presumably in charge.

  reply	other threads:[~2022-08-09  3:41 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-05 16:58 [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors ecree
2022-08-05 19:15 ` Randy Dunlap
2022-08-08 20:48   ` Edward Cree
2022-08-06  1:43 ` Jakub Kicinski
2022-08-08 16:50   ` Keller, Jacob E
2022-08-08 20:44   ` Edward Cree
2022-08-09  3:41     ` Jakub Kicinski [this message]
2022-08-10 16:02       ` Edward Cree
2022-08-10 17:58         ` Jakub Kicinski
2022-08-10 19:21           ` Edward Cree
2022-08-10 22:58             ` Alexander Duyck
2022-08-11 16:17               ` Jakub Kicinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220808204135.040a4516@kernel.org \
    --to=kuba@kernel.org \
    --cc=alexander.duyck@gmail.com \
    --cc=andy@greyhouse.net \
    --cc=corbet@lwn.net \
    --cc=davem@davemloft.net \
    --cc=ecree.xilinx@gmail.com \
    --cc=ecree@xilinx.com \
    --cc=edumazet@google.com \
    --cc=jacob.e.keller@intel.com \
    --cc=jesse.brandeburg@intel.com \
    --cc=jiri@resnulli.us \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-net-drivers@amd.com \
    --cc=michael.chan@broadcom.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=saeed@kernel.org \
    --cc=simon.horman@corigine.com \
    --cc=snelson@pensando.io \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).