All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jakub Kicinski <kuba@kernel.org>
To: Edward Cree <ecree.xilinx@gmail.com>
Cc: ecree@xilinx.com, netdev@vger.kernel.org, davem@davemloft.net,
	pabeni@redhat.com, edumazet@google.com, corbet@lwn.net,
	linux-doc@vger.kernel.org, linux-net-drivers@amd.com,
	Jacob Keller <jacob.e.keller@intel.com>,
	Jesse Brandeburg <jesse.brandeburg@intel.com>,
	Michael Chan <michael.chan@broadcom.com>,
	Andy Gospodarek <andy@greyhouse.net>,
	Saeed Mahameed <saeed@kernel.org>, Jiri Pirko <jiri@resnulli.us>,
	Shannon Nelson <snelson@pensando.io>,
	Simon Horman <simon.horman@corigine.com>,
	Alexander Duyck <alexander.duyck@gmail.com>
Subject: Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
Date: Mon, 8 Aug 2022 20:41:35 -0700	[thread overview]
Message-ID: <20220808204135.040a4516@kernel.org> (raw)
In-Reply-To: <71af8654-ca69-c492-7e12-ed7ff455a2f1@gmail.com>

On Mon, 8 Aug 2022 21:44:45 +0100 Edward Cree wrote:
> >> +What functions should have a representor?
> >> +-----------------------------------------
> >> +
> >> +Essentially, for each virtual port on the device's internal switch, there
> >> +should be a representor.
> >> +The only exceptions are the management PF (whose port is used for traffic to
> >> +and from all other representors)   
> > 
> > AFAIK there's no "management PF" in the Linux model.  
> 
> Maybe a bad word choice.  I'm referring to whichever PF (which likely
>  also has an ordinary netdevice) has administrative rights over the NIC /
>  internal switch at a firmware level.  Other names I've seen tossed
>  around include "primary PF", "admin PF".

I believe someone (mellanox?) used the term eswitch manager.
I'd use "host PF", somehow that makes most sense to me.

> >> and perhaps the physical network port (for
> >> +which the management PF may act as a kind of port representor.  Devices that
> >> +combine multiple physical ports and SR-IOV capability may need to have port
> >> +representors in addition to PF/VF representors).  
> > 
> > That doesn't generalize well. If we just say that all uplinks and PFs
> > should have a repr we don't have to make exceptions for all the cases
> > where that's the case.  
> 
> We could, but AFAIK that's not how existing drivers behave.  At least
>  when I experimented with a mlx NIC a couple of years ago I don't
>  recall it creating a repr for the primary PF or for the physical port,
>  only reprs for the VFs.

Mellanox is not the best example, I think they don't even support
uplink to uplink forwarding cleanly.

> >> + - Other PFs on the PCIe controller, and any VFs belonging to them.  
> > 
> > What is "the PCIe controller" here? I presume you've seen the
> > devlink-port doc.  
> 
> Yes, that's where I got this terminology from.
> "the" PCIe controller here is the one on which the mgmt PF lives.  For
>  instance you might have a NIC where you run OVS on a SoC inside the
>  chip, that has its own PCIe controller including a PF it uses to drive
>  the hardware v-switch (so it can offload OVS rules), in addition to
>  the PCIe controller that exposes PFs & VFs to the host you plug it
>  into through the physical PCIe socket / edge connector.
> In that case this bullet would refer to any additional PFs the SoC has
>  besides the management one...

IMO the model where there's a overall controller for the entire device
is also a mellanox limitation, due to lack of support for nested
switches.

Say I pay for a bare metal instance in my favorite public could. 
Why would the forwarding between VFs I spawn be controlled by the cloud
provider and not me?!

But perhaps Netronome was the only vendor capable of nested switching.

> >> + - PFs and VFs on other PCIe controllers on the device (e.g. for any embedded
> >> +   System-on-Chip within the SmartNIC).  
> 
> ... and this bullet to the PFs the host sees.
> 
> >> + - PFs and VFs with other personalities, including network block devices (such
> >> +   as a vDPA virtio-blk PF backed by remote/distributed storage).  
> > 
> > IDK how you can configure block forwarding (which is DMAs of command
> > + data blocks, not packets AFAIU) with the networking concepts..
> > I've not used the storage functions tho, so I could be wrong.  
> 
> Maybe I'm way off the beam here, but my understanding is that this
>  sort of thing involves a block interface between the host and the
>  NIC, but then something internal to the NIC converts those
>  operations into network operations (e.g. RDMA traffic or Ceph TCP
>  packets), which then go out on the network to access the actual
>  data.  In that case the back-end has to have network connectivity,
>  and the obvious™ way to do that is give it a v-port on the v-switch
>  just like anyone else.

I see. I don't think this covers all implementations. 

> >> +An example of a PCIe function that should *not* have a representor is, on an
> >> +FPGA-based NIC, a PF which is only used to deploy a new bitstream to the FPGA,
> >> +and which cannot create RX and TX queues.  
> > 
> > What's the thinking here? We're letting everyone add their own
> > exceptions to the doc?  
> 
> It was just the only example I could come up with of the more general
>  rule: if it doesn't have the ability to send and receive packets over
>  the network (directly or indirectly), then it won't have a virtual
>  port on the virtual switch, and so it doesn't make sense for it to
>  have a representor.
> No way to TX = nothing will ever be RXed on the rep; no way to RX = no
>  way to deliver anything you TX from the rep.  And nothing for TC
>  offload to act upon either for the same reasons.

No need to mention that, I'd think. Seems obvious.

> >> For example, ``ndo_start_xmit()`` might send the
> >> +packet, specially marked for delivery to the representee, through a TX queue
> >> +attached to the management PF.  
> > 
> > IDK how common that is, RDMA NICs will likely do the "dedicated queue
> > per repr" thing since they pretend to have infinite queues.  
> 
> Right.  But the queue is still created by the driver bound to the mgmt
>  PF, and using that PF for whatever BAR accesses it uses to create and
>  administer the queue, no?
> That's the important bit, and the details of how the NIC knows which
>  representee to deliver it to (dedicated queue, special TX descriptor,
>  whatever) are vendor-specific magic.  Better ways of phrasing that
>  are welcome :)

"TX queue attached to" made me think of a netdev Tx queue with a qdisc
rather than just a HW queue. No better ideas tho.

> >> +How are representors identified?
> >> +--------------------------------
> >> +
> >> +The representor netdevice should *not* directly refer to a PCIe device (e.g.
> >> +through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
> >> +representee or of the management PF.  
> > 
> > Do we know how many existing ones do?   
> 
> Idk.  From a quick look on lxr, mlx5 and ice do; as far as I can tell
>  nfp/flower does for "phy_reprs" but not "vnic_reprs".  nfp/abm does.
> 
> My reasoning for this "should not" here is that a repr is a pure
>  software device; compare e.g. if you build a vlan netdev on top of
>  eth0 it doesn't inherit eth0's device.
> Also, at least in theory this should avoid the problem with OpenStack
>  picking the wrong netdevice that you mentioned in [2], as this is
>  what controls the 'device' symlink in sysfs.

It makes sense. The thought I had was "what if a user reads this and
assumes it's never the case". But to be fair "should not" != "must not"
so we're probably good with your wording as is.

> >> + - ``pf<N>``, PCIe physical function index *N*.
> >> + - ``vf<N>``, PCIe virtual function index *N*.
> >> + - ``sf<N>``, Subfunction index *N*.  
> > 
> > Yeah, nah... implement devlink port, please. This is done by the core,
> > you shouldn't have to document this.  
> 
> Oh huh, I didn't know about __devlink_port_phys_port_name_get().
> Last time I looked, the drivers all had their own
>  .ndo_get_phys_port_name implementations (which is why I did one for
>  sfc), and any similarity between their string formats was purely an
>  (undocumented) convention.  TIL!
> (And it looks like the core uses `c<N>` for my `if<N>` that you were
>  so horrified by.  Devlink-port documentation doesn't make it super
>  clear whether controller 0 is "the controller that's in charge" or
>  "the controller from which we're viewing things", though I think in
>  practice it comes to the same thing.)

I think we had a bit. Perhaps @external? The controller which doesn't
have @external == true should be the local one IIRC. And by extension
presumably in charge.

  reply	other threads:[~2022-08-09  3:41 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-05 16:58 [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors ecree
2022-08-05 19:15 ` Randy Dunlap
2022-08-08 20:48   ` Edward Cree
2022-08-06  1:43 ` Jakub Kicinski
2022-08-08 16:50   ` Keller, Jacob E
2022-08-08 20:44   ` Edward Cree
2022-08-09  3:41     ` Jakub Kicinski [this message]
2022-08-10 16:02       ` Edward Cree
2022-08-10 17:58         ` Jakub Kicinski
2022-08-10 19:21           ` Edward Cree
2022-08-10 22:58             ` Alexander Duyck
2022-08-11 16:17               ` Jakub Kicinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220808204135.040a4516@kernel.org \
    --to=kuba@kernel.org \
    --cc=alexander.duyck@gmail.com \
    --cc=andy@greyhouse.net \
    --cc=corbet@lwn.net \
    --cc=davem@davemloft.net \
    --cc=ecree.xilinx@gmail.com \
    --cc=ecree@xilinx.com \
    --cc=edumazet@google.com \
    --cc=jacob.e.keller@intel.com \
    --cc=jesse.brandeburg@intel.com \
    --cc=jiri@resnulli.us \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-net-drivers@amd.com \
    --cc=michael.chan@broadcom.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=saeed@kernel.org \
    --cc=simon.horman@corigine.com \
    --cc=snelson@pensando.io \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.