From: Jiri Pirko <jiri@resnulli.us>
To: John Fastabend <john.r.fastabend@intel.com>
Cc: Neil Horman <nhorman@tuxdriver.com>, Thomas Graf <tgraf@suug.ch>,
	Simon Horman <simon.horman@netronome.com>,
	netdev@vger.kernel.org, davem@davemloft.net, andy@greyhouse.net,
	dborkman@redhat.com, ogerlitz@mellanox.com, jesse@nicira.com,
	jpettit@nicira.com, joestringer@nicira.com, jhs@mojatatu.com,
	sfeldma@gmail.com, f.fainelli@gmail.com,
	roopa@cumulusnetworks.com, linville@tuxdriver.com,
	shrijeet@gmail.com, gospo@cumulusnetworks.com, bcrl@kvack.org
Subject: Re: Flows! Offload them.
Date: Fri, 27 Feb 2015 09:53:29 +0100
Message-ID: <20150227085329.GC2057@nanopsycho.orion>
In-Reply-To: <54EF8BFB.5050608@intel.com>

Thu, Feb 26, 2015 at 10:11:23PM CET, john.r.fastabend@intel.com wrote:
>On 02/26/2015 12:16 PM, Neil Horman wrote:
>> On Thu, Feb 26, 2015 at 07:23:36AM -0800, John Fastabend wrote:
>>> On 02/26/2015 05:33 AM, Thomas Graf wrote:
>>>> On 02/26/15 at 10:16am, Jiri Pirko wrote:
>>>>> Well, on netdev01, I believe that a consensus was reached that for every
>>>>> switch offloaded functionality there has to be an implementation in
>>>>> kernel.
>>>>
>>>> Agreed. This should not prevent the policy being driven from user
>>>> space though.
>>>>
>>>>> What John's Flow API originally did was to provide a way to
>>>>> configure hardware independently of kernel. So the right way is to
>>>>> configure kernel and, if hw allows it, to offload the configuration to hw.
>>>>>
>>>>> In this case, seems to me logical to offload from one place, that being
>>>>> TC. The reason is, as I stated above, the possible conversion from OVS
>>>>> datapath to TC.
>>>>
>>>> Offloading of TC definitely makes a lot of sense. I think that even in
>>>> that case you will already encounter independent configuration of
>>>> hardware and kernel. Example: The hardware provides a fixed, generic
>>>> function to push up to n bytes onto a packet. This hardware function
>>>> could be used to implement TC actions "push_vlan", "push_vxlan",
>>>> "push_mpls". You would likely agree that TC should make use
>>>> of such a function even if the hardware version is different from the
>>>> software version. So I don't think we'll have a 1:1 mapping for all
>>>> configurations, regardless of whether the how is decided in kernel or
>>>> user space.
>>>
>>> Just to expand slightly on this. I don't think you can get to a 1:1
>>> mapping here. One reason is hardware typically has a TCAM and limited
>>> size. So you need a _policy_ to determine when to push rules into the
>>> hardware. The kernel doesn't know when to do this and I don't believe
>>> it's the kernel's place to start enforcing policy like this. One thing I likely
>>> need to do is get some more "worlds" in rocker so we aren't stuck only
>>> thinking about the infinite size OF_DPA world. The OF_DPA world is only
>>> one world and not a terribly flexible one at that when compared with the
>>> NPU folk. So minimally you need a flag to indicate rules go into hardware
>>> vs software.
>>>
>>> That said I think the bigger mismatch between software and hardware is
>>> you program it differently because the data structures are different. Maybe
>>> a u32 example would help. For parsing with u32 you might build a parse
>>> graph with a root and some leaf nodes. In hardware you want to collapse
>>> this down onto the hardware. I argue this is not a kernel task because
>>> there are lots of ways to do this and there are trade-offs made with
>>> respect to space and performance and which table to use when it could be
>>> handled by a set of tables. Another example is a virtual switch possibly
>>> OVS but we have others. The software does some "unmasking" (their term)
>>> before sending the rules into the software dataplane cache. Basically this
>>> means we can ignore priority in the hash lookup. However this is not how you
>>> would optimally use hardware. Maybe I should do another write up with
>>> some more concrete examples.
>>>
>>> There are also lots of use cases to _not_ have hardware and software in
>>> sync. A flag allows this.
>>>
>>> My only point is I think we need to allow users to optimally use their
>>> hardware either via 'tc' or my previous 'flow' tool. Actually in my
>>> opinion I still think it's best to have both interfaces.
>>>
>>> I'll go get some coffee now and hopefully that is somewhat clear.
>> 
>> 
>> I've been thinking about the policy aspect of this, and the more I think about
>> it, the more I wonder if not allowing some sort of common policy in the kernel
>> is really the right thing to do here.  I know that's somewhat blasphemous, but
>> this isn't really administrative policy that we're talking about, at least not
>> 100%.  It's more of a behavioral profile that we're trying to enforce.  That may
>> be splitting hairs, but I think there's precedent for the latter.  That is to
>> say, we configure qdiscs to limit traffic flow to certain rates, and configure
>> policies which drop traffic that violates it (which includes random discard,
>> which is the antithesis of deterministic policy).  I'm not sure I see this as
>> any different, especially if we limit its scope.  That is to say, why couldn't we
>> allow the kernel to program a predetermined set of policies that the admin can
>> set (i.e. offload routing to a hardware cache of X size with an lru
>> victimization).  If other well defined policies make sense, we can add them and
>> expose options via iproute2 or some such to set them.  For the use case where
>> such pre-packaged policies don't make sense, we have things like the flow api to
>> offer users who want to be able to control their hardware in a more fine grained
>> approach.
>> 
>> Neil
>> 
>
>Hi Neil,
>
>I actually like this idea a lot. I might tweak a bit in that we could have some
>feature bits or something like feature bits that expose how to split up the
>hardware cache and give sizes.
>
>So the hypervisor (see I think of end hosts) or administrators could come in and
>say I want a route table and a nft table. This creates a "flavor" over how the
>hardware is going to be used. Another use case may not be doing routing at all
>but have an application that wants to manage the hardware at a more fine grained
>level with the exception of some nft commands so it could have a "nft"+"flow"
>flavor. Insert your favorite use case here.

I'm not sure I understand. You said that the admin could say: "I want a
route table and a nft table". But how does he say it? Isn't it enough
just to insert some rules into these 2 things and that would give hw a
clue what the admin is doing and what he wants? I believe that this
offload should happen transparently.

Of course, you may want to balance resources since, as you said, the hw
capacity is limited. But I would leave that optional. API unknown so far...
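
To make the idea concrete, here is a minimal toy sketch in Python of what
"transparent offload with an optional policy" could look like. It is only an
illustration under stated assumptions, not a proposed kernel API: the class
name OffloadTable, the hw_capacity parameter and the skip_hw flag are all
hypothetical. Every rule is installed in the software table; it is also
pushed to a bounded "hardware" table unless the caller opts out, and when
that table is full the least recently used entry is evicted, along the lines
of the lru victimization Neil mentioned.

```python
from collections import OrderedDict


class OffloadTable:
    """Toy model: the software table is authoritative, hardware is a bounded cache.

    All names here are hypothetical and exist only for illustration.
    """

    def __init__(self, hw_capacity):
        self.hw_capacity = hw_capacity  # assumed fixed size of the hw table
        self.sw_rules = {}              # every rule always lives in software
        self.hw_rules = OrderedDict()   # offloaded subset, least recently used first

    def insert(self, key, action, skip_hw=False):
        """Install a rule in software and, unless skip_hw is set, in hardware too."""
        self.sw_rules[key] = action
        if skip_hw:
            # The caller asked for a software-only rule; drop any stale hw copy.
            self.hw_rules.pop(key, None)
        else:
            self._offload(key, action)

    def _offload(self, key, action):
        if self.hw_capacity <= 0:
            return  # no hardware table available, software handles everything
        self.hw_rules[key] = action
        self.hw_rules.move_to_end(key)  # mark as most recently used
        while len(self.hw_rules) > self.hw_capacity:
            self.hw_rules.popitem(last=False)  # evict the least recently used rule

    def lookup(self, key):
        """Return (action, where), with where being 'hw', 'sw' or None."""
        if key in self.hw_rules:
            self.hw_rules.move_to_end(key)  # a hit refreshes the LRU position
            return self.hw_rules[key], "hw"
        if key in self.sw_rules:
            return self.sw_rules[key], "sw"
        return None, None
```

With hw_capacity=2, inserting a third rule evicts whichever offloaded rule
was used least recently, and that rule keeps working from the software table.
A rule inserted with skip_hw=True never lands in hardware. A real
implementation would have to deal with priorities, masks and per-table
resource splits, none of which this sketch attempts.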
