From: Jakub Kicinski <jakub.kicinski@netronome.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: Jiri Benc <jbenc@redhat.com>, Jiri Pirko <jiri@resnulli.us>,
	netdev@vger.kernel.org, Thomas Graf <tgraf@suug.ch>,
	Roopa Prabhu <roopa@cumulusnetworks.com>,
	ogerlitz@mellanox.com, sridhar.samudrala@intel.com,
	ast@kernel.org, daniel@iogearbox.net, simon.horman@netronome.com,
	Paolo Abeni <pabeni@redhat.com>,
	Pravin B Shelar <pshelar@nicira.com>,
	hannes@stressinduktion.org, kubakici@wp.pl
Subject: Re: [RFC] net: store port/representative id in metadata_dst
Date: Fri, 23 Sep 2016 16:29:07 +0100
Message-ID: <20160923162907.7fd5314e@jkicinski-Precision-T1700>
In-Reply-To: <57E53ADE.10806@gmail.com>

On Fri, 23 Sep 2016 07:23:26 -0700, John Fastabend wrote:
> On 16-09-23 05:55 AM, Jakub Kicinski wrote:
> > On Fri, 23 Sep 2016 11:06:09 +0200, Jiri Benc wrote:  
> >> On Fri, 23 Sep 2016 08:34:29 +0200, Jiri Pirko wrote:  
> >>> So if I understand that correctly, this would need some "shared netdev"
> >>> which would effectively serve only as a sink for all port netdevices to
> >>> tx packets to. On RX, this would be completely avoided. This lower
> >>> device looks like a half-zombie to me.
> >>
> >> Looks more like a quarter zombie. Even tx would not be allowed unless
> >> going through one of the ports, as all skbs without a
> >> METADATA_HW_PORT_MUX metadata_dst would be dropped. But it would be
> >> possible to attach a qdisc to the "lower" netdevice and it would
> >> actually have an effect. On rx this netdevice would be ignored completely. This
> >> is very weird behavior.
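
For concreteness, the drop rule described above would boil down to
something like this in the lower netdev's xmit. The typed metadata_dst,
METADATA_HW_PORT_MUX and u.port_info are what the RFC proposes (not
existing mainline API), and pf_lower_xmit()/pf_hw_xmit() are made-up
names:

#include <linux/netdevice.h>
#include <net/dst_metadata.h>

static netdev_tx_t pf_lower_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct metadata_dst *md_dst = skb_metadata_dst(skb);

	/* Only skbs muxed through a port/representor netdev may go out. */
	if (!md_dst || md_dst->type != METADATA_HW_PORT_MUX) {
		dev->stats.tx_dropped++;
		dev_kfree_skb_any(skb);
		return NETDEV_TX_OK;
	}

	/* Otherwise pick the HW destination from the muxed-in port id
	 * (pf_hw_xmit() is a stand-in for the driver's real TX path).
	 */
	return pf_hw_xmit(skb, dev, md_dst->u.port_info.port_id);
}
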
> >>  
> >>> I don't like it :( I wonder whether the
> >>> solution would be possible without this lower netdev.
> >>
> >> I agree. This approach doesn't sound correct. The skbs should not be
> >> requeued.  
> > 
> > Thanks for the responses!  
> 
> Nice timing, we were just thinking about this.
> 
> > 
> > I think SR-IOV NICs are coming at this problem from a different angle:
> > we already have big, feature-full per-port netdevs, and now we want to
> > spawn representors for VFs to handle fallback traffic.  This patch
> > would help us mux VFR traffic onto all the queues of the physical port
> > netdevs (the ones which were already present in legacy mode; that's the
> > lower device).
> 
> Yep, I like the idea in general. I had a slightly different approach in
> mind though. If you look at __dev_queue_xmit() there is a void
> accel_priv pointer (I gather you found this based on your commit note).
> My take was we could extend this a bit so it can be used by the VFR
> devices and they could do a dev_queue_xmit_accel(). In this way there is
> no need to touch /net/core/{filter, dst, ip_tunnel}.c etc. Maybe the
> accel logic needs to be extended to push the priv pointer all the way
> through the xmit routine of the target netdev though. This should look
> a lot like the macvlan accelerated xmit device path without the
> switching logic.
> 
> Of course maybe the name would be extended to dev_queue_xmit_extended()
> or something.
> 
> So the flow on ingress would be,
> 
>   1. pkt_received_by_PF_netdev
>   2. PF_netdev reads some tag off the packet/descriptor and sets the
>      correct skb->dev field. This is needed so the stack "sees" packets
>      from the correct VF ports.
>   3. packet passed up to stack.
> 
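
A rough sketch of step 2, assuming the descriptor carries some kind of
source VF/port id; all the pf_* names below are hypothetical
driver-private ones:

#include <linux/etherdevice.h>
#include <linux/netdevice.h>

struct pf_rx_desc {			/* hypothetical HW descriptor */
	u16 src_vf;
};

struct pf_priv {			/* hypothetical driver private data */
	struct net_device *pf_netdev;
	struct net_device *vf_reps[64];	/* representor netdev per VF */
	unsigned int num_reps;
	struct napi_struct napi;
};

static void pf_rx_demux(struct pf_priv *pf, struct sk_buff *skb,
			const struct pf_rx_desc *desc)
{
	struct net_device *rx_dev = pf->pf_netdev;

	/* If the frame came from a VF, make the stack see it on that
	 * VF's representor instead of the PF netdev.
	 */
	if (desc->src_vf < pf->num_reps && pf->vf_reps[desc->src_vf])
		rx_dev = pf->vf_reps[desc->src_vf];

	/* eth_type_trans() also sets skb->dev for us. */
	skb->protocol = eth_type_trans(skb, rx_dev);
	napi_gro_receive(&pf->napi, skb);
}
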
> I guess it is a bit "zombie"-like on the receive path because the packet
> is never actually handled by the VF netdev code per se, and on egress it
> can traverse both the VFR and PF netdevs' qdiscs. But on the other hand
> the VFR netdevs and PF netdevs are all in the same driver. Plus using a
> queue per VFR is a bit of a waste as it's not needed, and also hardware
> may not have any mechanism to push VF traffic onto an rx queue.
> 
> On egress,
> 
>   1. VFR xmit is called
>   2. VFR xmit calls dev_queue_xmit_accel() with some meta-data if needed
>      for the lower netdev
>   3. lower netdev sends out the packet.
> 
> Again we don't need to waste any queues for each VFR, and the VFR can be
> an LLTX device. In this scheme I think you avoid most of the changes in
> your patch and keep it all contained in the driver. Any thoughts?
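
If I read the suggestion right, the VFR side would look roughly like the
macvlan-style sketch below; dev_queue_xmit_accel() is the existing
helper, while vfr_priv, lower_dev and whatever accel_priv points at are
made-up here:

#include <linux/netdevice.h>

struct vfr_priv {			/* hypothetical representor private data */
	struct net_device *lower_dev;	/* the PF netdev we mux onto */
	void *accel_priv;		/* identifies this port to the lower dev */
};

static netdev_tx_t vfr_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct vfr_priv *repr = netdev_priv(dev);
	unsigned int len = skb->len;
	int ret;

	/* Redirect onto the PF/lower netdev and let its xmit learn the
	 * originating port from the accel_priv pointer.
	 */
	skb->dev = repr->lower_dev;
	ret = dev_queue_xmit_accel(skb, repr->accel_priv);

	if (likely(ret == NET_XMIT_SUCCESS || ret == NET_XMIT_CN)) {
		dev->stats.tx_packets++;
		dev->stats.tx_bytes += len;
	} else {
		dev->stats.tx_dropped++;
	}
	return NETDEV_TX_OK;
}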

Goes without saying that you have a much better understanding of packet
scheduling, so please bear with me :)  My target model is that I have
n_cpus x "n_tc/prio" queues on the PF and I want to transmit the
fallback traffic over those same queues.  So no new HW queues are used
for VFRs at all.  This is the reverse of macvlan offload, which AFAICT
has "bastard hw queues" which actually TX for a separate software device.

My understanding was that I could rework this model to have software
queues for VFRs (#sw queues == #PF queues + #VFRs) but no extra HW
queues (#hw queues == #PF queues), but then when the driver sees a
packet on a sw-only VFR queue it has to pick one of the PF queues (which
one?), lock the PF software queue to own it, and only then can it
transmit.  With the dst_metadata there is no need for extra locking or
queue selection.
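
FWIW under the RFC the representor xmit reduces to roughly the sketch
below; repr_priv, port_md and lower_dev are illustrative names, and the
typed METADATA_HW_PORT_MUX metadata_dst is what the RFC adds:

#include <linux/netdevice.h>
#include <net/dst_metadata.h>

struct repr_priv {			/* hypothetical representor private data */
	struct net_device *lower_dev;	/* PF netdev whose queues we reuse */
	struct metadata_dst *port_md;	/* pre-allocated, carries our port id */
};

static netdev_tx_t repr_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct repr_priv *repr = netdev_priv(dev);

	/* Tag the skb with the egress port via the metadata dst... */
	skb_dst_drop(skb);
	dst_hold(&repr->port_md->dst);
	skb_dst_set(skb, &repr->port_md->dst);

	/* ...and requeue it on the PF netdev, reusing its existing
	 * queues, qdiscs and locking.
	 */
	skb->dev = repr->lower_dev;
	dev_queue_xmit(skb);

	return NETDEV_TX_OK;
}

No per-VFR queues and no extra locking; the PF's xmit just reads the port
id back out of the metadata_dst.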

> To address 'I wonder if the solution can be done without this lower
> netdev' I think it can be, but it creates two issues which I'm not sure
> have a good solution.
> 
> Without a lowerdev we either (a) give each VFR its own queue, which I
> don't like because it complicates mgmt and uses resources, or (b) we
> implicitly share queues. The latter could be fine; it just looks a bit
> cleaner IMO to make it explicit.
> 
> With regard to VF-PF flow rules, if we allow matching on ingress port,
> can all your flow rules be pushed through the PF netdevices, or do you
> want any of the VFR netdevs? After all I expect the flow rule table is
> actually a shared resource between all attached ports.

With the VF-PF forwarding rules I was just inching towards re-opening
the discussion on whether there should be a CPU port netdev.  I guess
there are good reasons why there isn't, so maybe let's not go there :)
The meaning of PF netdevs in SR-IOV switchdev mode is "external ports"
AFAICT, which could make it cumbersome to reach the host.

