* Open vSwitch Design
@ 2011-11-24 20:10 Jesse Gross
       [not found] ` <CAEP_g=_2L1xFWtDXh_6YyXz1Mt9TR3zvjLzix+SpO6yzeOLsSQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Jesse Gross @ 2011-11-24 20:10 UTC (permalink / raw)
  To: netdev, dev
  Cc: David Miller, Stephen Hemminger, Chris Wright, Herbert Xu,
	Eric Dumazet, John Fastabend, Justin Pettit, jhs

I realized that since Open vSwitch is so userspace-centric, some of the
design considerations might not be apparent from the kernel code
alone.  I did a poor job of explaining the larger picture, which has
led to some misconceptions, so I thought it would be helpful if I
gave a short overview.

One of the driving goals was to push as much logic as possible to
userspace, so the kernel portion is less than 6000 lines and has four
components:

 * Switching infrastructure:  As the name implies, Open vSwitch is
intended to be a network switch, focused on
virtualization/OpenFlow/software defined networking.  This means that
what we are modeling is not actually a collection of flows but a
switch which contains a group of related ports, a software virtual
device, etc.  The switch model is used in a variety of places, such as
to measure traffic that actually flows through it in order to
implement monitoring and sampling protocols.

 * Flow lookup:  Although used to implement OpenFlow, the kernel flow
table does not actually directly contain OpenFlow flows.  This is
because OpenFlow tables can contain wildcards, multiple pipeline
stages, etc. and we did not want to push that complexity into the
kernel fast path (nor tie it to a specific version of OpenFlow).
Instead an exact match flow table is populated on-demand from
userspace based on the more complex rules stored there.  Although it
might seem limiting, this design has allowed significant new
functionality to be added without modifications to the kernel or
performance impact.
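(Rough sketches of this lookup discipline, the miss handling, and the
action execution follow this list of components.)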

 * Packet execution:  Once a flow is matched it can be output,
enqueued to a particular qdisc, etc.  Some of these operations are
specific to Open vSwitch, such as sampling, whereas for others we
leverage existing infrastructure (including tc for QoS) by simply
marking the packet for further processing.

 * Userspace interfaces:  One of the difficulties of having a
specialized, exact match flow lookup engine is maintaining
compatibility across differing kernel/userspace versions.  This
compatibility shows up heavily in the userspace interfaces and is
achieved by passing the kernel's version of the flow along with packet
information.  This allows userspace to install appropriate flows, even
if its interpretation of a packet differs from the kernel's, without
version checks or maintaining multiple implementations of the flow
extraction code in the kernel.
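
To make the flow lookup and userspace interface points concrete, here
is a rough standalone sketch of the exact-match discipline and the miss
handling.  All names, types, and the hash function are illustrative,
not the actual code:

  /* Sketch of the exact-match discipline - illustrative only.  Keys are
   * fully specified (no wildcards), so a lookup is one hash computation
   * plus a short bucket walk. */
  #include <stdint.h>
  #include <string.h>

  struct flow_key {                  /* every field is significant */
      uint32_t ipv4_src, ipv4_dst;
      uint16_t tp_src, tp_dst;
      uint8_t ip_proto;              /* zero the struct before filling it */
  };                                 /* so padding hashes/compares equal  */

  struct flow {
      struct flow_key key;
      void *actions;                 /* stand-in for a real action list */
      struct flow *next;             /* hash bucket chaining */
  };

  #define N_BUCKETS 1024
  static struct flow *buckets[N_BUCKETS];

  static uint32_t hash_key(const struct flow_key *k)
  {
      const uint8_t *p = (const uint8_t *)k;
      uint32_t h = 2166136261u;      /* FNV-1a here; the kernel uses jhash */
      size_t i;

      for (i = 0; i < sizeof(*k); i++)
          h = (h ^ p[i]) * 16777619u;
      return h;
  }

  static struct flow *flow_lookup(const struct flow_key *k)
  {
      struct flow *f;

      for (f = buckets[hash_key(k) % N_BUCKETS]; f; f = f->next)
          if (!memcmp(&f->key, k, sizeof(*k)))
              return f;
      /* Miss: the packet is punted to userspace TOGETHER WITH this
       * kernel-extracted key.  Userspace chooses actions from its own
       * wildcarded, multi-stage rules but echoes the key back verbatim
       * when installing the flow, so no version check and no duplicate
       * extraction code are ever needed. */
      return NULL;
  }

And a sketch of the action execution loop; again the names are made up,
and the tc hand-off really is just a mark left on the packet:

  /* Illustrative action loop - not the actual code. */
  #include <stdint.h>

  struct packet {
      uint32_t priority;             /* the mark tc later keys on */
      /* ... packet data ... */
  };

  enum action_type { ACT_OUTPUT, ACT_SAMPLE, ACT_SET_PRIORITY };

  struct action {
      enum action_type type;
      uint32_t arg;                  /* port, probability, or priority */
  };

  static void output_packet(struct packet *p, uint32_t port) { /* ... */ }
  static void maybe_sample(struct packet *p, uint32_t prob) { /* ... */ }

  static void execute_actions(struct packet *pkt,
                              const struct action *acts, int n)
  {
      int i;

      for (i = 0; i < n; i++) {
          switch (acts[i].type) {
          case ACT_OUTPUT:           /* switch-specific: send out a port */
              output_packet(pkt, acts[i].arg);
              break;
          case ACT_SAMPLE:           /* switch-specific: maybe copy the
                                      * packet to a monitoring consumer */
              maybe_sample(pkt, acts[i].arg);
              break;
          case ACT_SET_PRIORITY:     /* leverage tc: mark and move on */
              pkt->priority = acts[i].arg;
              break;
          }
      }
  }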

It's obviously possible to put this code anywhere, whether it is an
independent module, in the bridge, or tc.  Regardless, however, it's
largely new code that is geared towards this particular model so it
seems better not to add to the complexity of existing components if at
all possible.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
       [not found] ` <CAEP_g=_2L1xFWtDXh_6YyXz1Mt9TR3zvjLzix+SpO6yzeOLsSQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-11-24 22:30   ` jamal
  2011-11-25  5:20     ` Stephen Hemminger
  0 siblings, 1 reply; 21+ messages in thread
From: jamal @ 2011-11-24 22:30 UTC (permalink / raw)
  To: Jesse Gross
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu,
	Eric Dumazet, netdev, John Fastabend, Stephen Hemminger,
	David Miller

Jesse,

I am going to try and respond to your comments below.

On Thu, 2011-11-24 at 12:10 -0800, Jesse Gross wrote:

> 
>  * Switching infrastructure:  As the name implies, Open vSwitch is
> intended to be a network switch, focused on
> virtualization/OpenFlow/software defined networking.  This means that
> what we are modeling is not actually a collection of flows but a
> switch which contains a group of related ports, a software virtual
> device, etc.  The switch model is used in a variety of places, such as
> to measure traffic that actually flows through it in order to
> implement monitoring and sampling protocols.

Can you explain why you couldn't use the current bridge code (likely with
some mods)? I can see you want to isolate the VMs via the virtual ports;
maybe even VLANs on the virtual ports - the current bridge code should
be able to handle that.

>  * Flow lookup:  Although used to implement OpenFlow, the kernel flow
> table does not actually directly contain OpenFlow flows.  This is
> because OpenFlow tables can contain wildcards, multiple pipeline
> stages, etc. and we did not want to push that complexity into the
> kernel fast path (nor tie it to a specific version of OpenFlow).
> Instead an exact match flow table is populated on-demand from
> userspace based on the more complex rules stored there.  Although it
> might seem limiting, this design has allowed significant new
> functionality to be added without modifications to the kernel or
> performance impact.

This can be achieved easily with zero changes to the kernel code.
You need to have default filters that redirect flows to user space
when you fail to match.

>  * Packet execution:  Once a flow is matched it can be output,
> enqueued to a particular qdisc, etc.  Some of these operations are
> specific to Open vSwitch, such as sampling, whereas for others we
> leverage existing infrastructure (including tc for QoS) by simply
> marking the packet for further processing.

The tc classifier-action-qdisc infrastructure handles this.
The sampler needs a new action defined.

>  * Userspace interfaces:  One of the difficulties of having a
> specialized, exact match flow lookup engine is maintaining
> compatibility across differing kernel/userspace versions.  This
> compatibility shows up heavily in the userspace interfaces and is
> achieved by passing the kernel's version of the flow along with packet
> information.  This allows userspace to install appropriate flows, even
> if its interpretation of a packet differs from the kernel's, without
> version checks or maintaining multiple implementations of the flow
> extraction code in the kernel.

I didn't quite follow - are we talking about backward/forward
compatibility?


> It's obviously possible to put this code anywhere, whether it is an
> independent module, in the bridge, or tc.  Regardless, however, it's
> largely new code that is geared towards this particular model so it
> seems better not to add to the complexity of existing components if at
> all possible.

I am still not seeing why this could not be done with the
infrastructure that already exists.  Granted, the brains are in user
space - that's where everything else resides - but you are not pushing
that, I think.

cheers,
jamal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
  2011-11-24 22:30   ` jamal
@ 2011-11-25  5:20     ` Stephen Hemminger
       [not found]       ` <20111124212021.2ae2fb7f-QE31Isp8l5DVJhW05BI4jyWSNWFUUkiGXqFh9Ls21Oc@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Stephen Hemminger @ 2011-11-25  5:20 UTC (permalink / raw)
  To: jhs-jkUAjuhPggJWk0Htik3J/w
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu,
	Eric Dumazet, netdev, hadi-fAAogVwAN2Kw5LPnMra/2Q, Fastabend,
	John-/PVsmBQoxgPKo9QCiBeYKEEOCMrvLtNR, David Miller

On Thu, 24 Nov 2011 17:30:33 -0500
jamal <hadi-fAAogVwAN2Kw5LPnMra/2Q@public.gmane.org> wrote:

> Jesse,
> 
> I am going to try and respond to your comments below.
> 
> On Thu, 2011-11-24 at 12:10 -0800, Jesse Gross wrote:
> 
> > 
> >  * Switching infrastructure:  As the name implies, Open vSwitch is
> > intended to be a network switch, focused on
> > virtualization/OpenFlow/software defined networking.  This means that
> > what we are modeling is not actually a collection of flows but a
> > switch which contains a group of related ports, a software virtual
> > device, etc.  The switch model is used in a variety of places, such as
> > to measure traffic that actually flows through it in order to
> > implement monitoring and sampling protocols.
> 
> Can you explain why you couldn't use the current bridge code (likely with
> some mods)? I can see you want to isolate the VMs via the virtual ports;
> maybe even VLANs on the virtual ports - the current bridge code should
> be able to handle that.

The way Open vSwitch works is that the flow table is populated
by user space.  The kernel bridge works completely differently (it
learns MAC addresses).

> >  * Flow lookup:  Although used to implement OpenFlow, the kernel flow
> > table does not actually directly contain OpenFlow flows.  This is
> > because OpenFlow tables can contain wildcards, multiple pipeline
> > stages, etc. and we did not want to push that complexity into the
> > kernel fast path (nor tie it to a specific version of OpenFlow).
> > Instead an exact match flow table is populated on-demand from
> > userspace based on the more complex rules stored there.  Although it
> > might seem limiting, this design has allowed significant new
> > functionality to be added without modifications to the kernel or
> > performance impact.
> 
> This can be achieved easily with zero changes to the kernel code.
> You need to have default filters that redirect flows to user space
> when you fail to match.

Actually, this is what puts me off about the current implementation.
I would prefer that the kernel implementation were just a software
implementation of a hardware OpenFlow switch.  That way it would be
transparent whether the control plane in user space was talking to the
kernel or to hardware.

> >  * Packet execution:  Once a flow is matched it can be output,
> > enqueued to a particular qdisc, etc.  Some of these operations are
> > specific to Open vSwitch, such as sampling, whereas for others we
> > leverage existing infrastructure (including tc for QoS) by simply
> > marking the packet for further processing.
> 
> The tc classifier-action-qdisc infrastructure handles this.
> The sampler needs a new action defined.

There are too many damn layers in the software path already.

> >  * Userspace interfaces:  One of the difficulties of having a
> > specialized, exact match flow lookup engine is maintaining
> > compatibility across differing kernel/userspace versions.  This
> > compatibility shows up heavily in the userspace interfaces and is
> > achieved by passing the kernel's version of the flow along with packet
> > information.  This allows userspace to install appropriate flows, even
> > if its interpretation of a packet differs from the kernel's, without
> > version checks or maintaining multiple implementations of the flow
> > extraction code in the kernel.
> 
> I didn't quite follow - are we talking about backward/forward
> compatibility?

The problem is that there are two flow classifiers, one in Open vSwitch
in the kernel, and the other in the user space flow manager. I think the
issue is that the two have different code.

Is the kernel/userspace API for Open vSwitch nailed down and documented
well enough that alternative control plane software could be built?


> > It's obviously possible to put this code anywhere, whether it is an
> > independent module, in the bridge, or tc.  Regardless, however, it's
> > largely new code that is geared towards this particular model so it
> > seems better not to add to the complexity of existing components if at
> > all possible.
> 
> I am still not seeing why this could not be done with the
> infrastructure that already exists.  Granted, the brains are in user
> space - that's where everything else resides - but you are not pushing
> that, I think.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
       [not found]       ` <20111124212021.2ae2fb7f-QE31Isp8l5DVJhW05BI4jyWSNWFUUkiGXqFh9Ls21Oc@public.gmane.org>
@ 2011-11-25  6:18         ` Eric Dumazet
  2011-11-25  6:25           ` David Miller
  2011-11-25 20:14           ` Jesse Gross
  2011-11-25 11:24         ` jamal
                           ` (2 subsequent siblings)
  3 siblings, 2 replies; 21+ messages in thread
From: Eric Dumazet @ 2011-11-25  6:18 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu, netdev,
	hadi-fAAogVwAN2Kw5LPnMra/2Q, jhs-jkUAjuhPggJWk0Htik3J/w,
	John Fastabend, David Miller

On Thursday, 24 November 2011 at 21:20 -0800, Stephen Hemminger wrote:

> The problem is that there are two flow classifiers, one in Open vSwitch
> in the kernel, and the other in the user space flow manager. I think the
> issue is that the two have different code.

We have kind of the same duplication in the kernel already :)

__skb_get_rxhash() and net/sched/cls_flow.c contain roughly the same
logic...

Maybe it's time to factorize the thing, and eventually use it in a third
component (Open vSwitch...)




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
  2011-11-25  6:18         ` Eric Dumazet
@ 2011-11-25  6:25           ` David Miller
       [not found]             ` <20111125.012517.2221372383643417980.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
  2011-11-25 20:14           ` Jesse Gross
  1 sibling, 1 reply; 21+ messages in thread
From: David Miller @ 2011-11-25  6:25 UTC (permalink / raw)
  To: eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, chrisw-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, hadi-fAAogVwAN2Kw5LPnMra/2Q,
	jhs-jkUAjuhPggJWk0Htik3J/w,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	herbert-F6s6mLieUQo7FNHlEwC/lvQIK84fMopw,
	shemminger-ZtmgI6mnKB3QT0dZR+AlfA

From: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Fri, 25 Nov 2011 07:18:03 +0100

> On Thursday, 24 November 2011 at 21:20 -0800, Stephen Hemminger wrote:
> 
>> The problem is that there are two flow classifiers, one in Open vSwitch
>> in the kernel, and the other in the user space flow manager. I think the
>> issue is that the two have different code.
> 
> We have kind of the same duplication in the kernel already :)
> 
> __skb_get_rxhash() and net/sched/cls_flow.c contain roughly the same
> logic...
> 
> Maybe it's time to factorize the thing, and eventually use it in a third
> component (Open vSwitch...)

Yes.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
       [not found]             ` <20111125.012517.2221372383643417980.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
@ 2011-11-25  6:36               ` Eric Dumazet
  2011-11-25 11:34                 ` jamal
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2011-11-25  6:36 UTC (permalink / raw)
  To: David Miller
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, chrisw-H+wXaHxf7aLQT0dZR+AlfA,
	Florian Westphal, netdev-u79uwXL29TY76Z2rM5mHXA,
	hadi-fAAogVwAN2Kw5LPnMra/2Q, jhs-jkUAjuhPggJWk0Htik3J/w,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	herbert-F6s6mLieUQo7FNHlEwC/lvQIK84fMopw,
	shemminger-ZtmgI6mnKB3QT0dZR+AlfA

On Friday, 25 November 2011 at 01:25 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 25 Nov 2011 07:18:03 +0100
> 
> > On Thursday, 24 November 2011 at 21:20 -0800, Stephen Hemminger wrote:
> > 
> >> The problem is that there are two flow classifiers, one in Open vSwitch
> >> in the kernel, and the other in the user space flow manager. I think the
> >> issue is that the two have different code.
> > 
> > We have kind of the same duplication in the kernel already :)
> > 
> > __skb_get_rxhash() and net/sched/cls_flow.c contain roughly the same
> > logic...
> > 
> > Maybe it's time to factorize the thing, and eventually use it in a third
> > component (Open vSwitch...)
> 
> Yes.

A third reason to do that anyway is that net/sched/sch_sfb.c should use
__skb_get_rxhash() providing the perturbation itself, and not use the
standard (hashrnd) one.

Right now, if two flows share the same rxhash, the double SFB hash will
also share the same final hash.

(This point was mentioned by Florian Westphal)
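
In toy form (made-up mixing function, just to show the collision
propagating):

  /* Both SFB hashes are derived from the single stack-wide rxhash, so
   * if rxhash(A) == rxhash(B) the two flows collide in BOTH Bloom
   * filter dimensions at once. */
  #include <stdint.h>

  static uint32_t mix(uint32_t v, uint32_t rnd)
  {
      return (v ^ rnd) * 0x9e3779b1u;
  }

  static void sfb_hashes(uint32_t rxhash, uint32_t rnd1, uint32_t rnd2,
                         uint32_t *h1, uint32_t *h2)
  {
      *h1 = mix(rxhash, rnd1);       /* equal rxhash -> equal h1 ...    */
      *h2 = mix(rxhash, rnd2);       /* ... AND equal h2: no separation */
  }

  /* Dissecting the flow again inside SFB, with SFB's own perturbation,
   * would give two genuinely independent hashes. */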




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
       [not found]       ` <20111124212021.2ae2fb7f-QE31Isp8l5DVJhW05BI4jyWSNWFUUkiGXqFh9Ls21Oc@public.gmane.org>
  2011-11-25  6:18         ` Eric Dumazet
@ 2011-11-25 11:24         ` jamal
  2011-11-25 17:28           ` Stephen Hemminger
  2011-11-25 17:55         ` Jesse Gross
  2011-11-25 19:52         ` Justin Pettit
  3 siblings, 1 reply; 21+ messages in thread
From: jamal @ 2011-11-25 11:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu,
	Eric Dumazet, netdev, John Fastabend, David Miller

On Thu, 2011-11-24 at 21:20 -0800, Stephen Hemminger wrote:
> On Thu, 24 Nov 2011 17:30:33 -0500
> jamal <hadi-fAAogVwAN2Kw5LPnMra/2Q@public.gmane.org> wrote:
> 
 
> > Can you explain why you couldn't use the current bridge code (likely with
> > some mods)? I can see you want to isolate the VMs via the virtual ports;
> > maybe even VLANs on the virtual ports - the current bridge code should
> > be able to handle that.
> 
> The way Open vSwitch works is that the flow table is populated
> by user space.  The kernel bridge works completely differently (it
> learns MAC addresses).
> 

Most hardware bridges out there support all the different modes:
you can have learning in the hardware or defer it to the user/control
plane by setting some flags.  You can have broadcasting done in
hardware or defer it to user space.
The mods I was thinking of would bring the Linux bridge to the same
behavior.  You would then need to allow netlink updates of the bridge
MAC table from user space.  There may be weaknesses in the current
bridging code in relation to VLANs that would need to be addressed.

[But my concern was not so much the bridge - because changes are needed
in that case; it is the "match, action list" infrastructure that is
already in place that got to me.]

> Actually, this is what puts me off about the current implementation.
> I would prefer that the kernel implementation were just a software
> implementation of a hardware OpenFlow switch.  That way it would be
> transparent whether the control plane in user space was talking to the
> kernel or to hardware.

Or, alternatively, allow the bridge code to support the different modes.
Learning as well as broadcasting modes need to be settable.
Then you have an interesting capability in the kernel that meets the
requirements of an OpenFlow switch (plus anyone who wants to do policy
control in user space with their favorite standard).

> > The tc classifier-action-qdisc infrastructure handles this.
> > The sampler needs a new action defined.
> 
> There are too many damn layers in the software path already.

I think what they are doing in the separation of control and data
is reasonable.  The policy and control are in user space.  The fastpath
is in the kernel, and it may be in a variety of spots (some ARP entry
here, some L3 entry there, a couple of match-action items, etc.).
The brains, which understand what the different things mean in
aggregation in terms of a service, are in user space.

> 
> The problem is that there are two flow classifiers, one in Open vSwitch
> in the kernel, and the other in the user space flow manager. I think the
> issue is that the two have different code.

I see.  I can understand having a simple classifier in the kernel and
more complex "consulting" sitting in user space, which updates the
kernel on how to deal with subsequent packets of a flow.

> Is the kernel/userspace API for Open vSwitch nailed down and documented
> well enough that alternative control plane software could be built?

They do have a generic netlink interface.  I would have preferred the
netlink interfaces already in place (which would have worked had they
used the existing infrastructure).


cheers,
jamal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
  2011-11-25  6:36               ` Eric Dumazet
@ 2011-11-25 11:34                 ` jamal
  2011-11-25 13:02                   ` Eric Dumazet
  2011-11-25 20:20                   ` Open vSwitch Design Jesse Gross
  0 siblings, 2 replies; 21+ messages in thread
From: jamal @ 2011-11-25 11:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, chrisw-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Florian Westphal,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	herbert-F6s6mLieUQo7FNHlEwC/lvQIK84fMopw,
	shemminger-ZtmgI6mnKB3QT0dZR+AlfA, David Miller


Hrm. I forgot about the flow classifier - it may be what the OpenFlow
folks need. It is more friendly for well-defined tuples than u32.

But what do you mean "refactor"? I can already use this classifier
and attach actions to set policy in the kernel.

cheers,
jamal

On Fri, 2011-11-25 at 07:36 +0100, Eric Dumazet wrote:

> > > 
> > > Maybe it's time to factorize the thing, and eventually use it in a third
> > > component (Open vSwitch...)
> > 
> > Yes.
> 
> A third reason to do that anyway is that net/sched/sch_sfb.c should use
> __skb_get_rxhash() providing the perturbation itself, and not use the
> standard (hashrnd) one.
> 
> Right now, if two flows share the same rxhash, the double SFB hash will
> also share the same final hash.
> 
> (This point was mentioned by Florian Westphal)
> 
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
  2011-11-25 11:34                 ` jamal
@ 2011-11-25 13:02                   ` Eric Dumazet
  2011-11-28 15:20                     ` [PATCH net-next 0/4] net: factorize flow dissector Eric Dumazet
  2011-11-25 20:20                   ` Open vSwitch Design Jesse Gross
  1 sibling, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2011-11-25 13:02 UTC (permalink / raw)
  To: jhs-jkUAjuhPggJWk0Htik3J/w
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, chrisw-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Florian Westphal,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	herbert-F6s6mLieUQo7FNHlEwC/lvQIK84fMopw,
	shemminger-ZtmgI6mnKB3QT0dZR+AlfA, David Miller

On Friday, 25 November 2011 at 06:34 -0500, jamal wrote:
> Hrm. I forgot about the flow classifier - it may be what the OpenFlow
> folks need. It is more friendly for well-defined tuples than u32.
> 
> But what do you mean "refactor"? I can already use this classifier
> and attach actions to set policy in the kernel.


cls_flow is not complete, since it doesn't handle tunnels, for example.

It calls a 'partial flow classifier' to find each needed element, one by
one.
(adding tunnel decap would need to perform this several times for each
packet)

__skb_get_rxhash() is more tunnel-aware, yet some protocols are still
missing, for example IPPROTO_IPV6.

Instead of adding logic to both dissectors, we could have a central flow
dissector, filling a temporary pivot structure with the found elements
(src addr, dst addr, ports, ...), going through tunnel encaps if found.

Then net/sched/cls_flow.c could pick the needed elements from this
structure to compute the hash as specified in the tc command:
(for example: tc filter ... flow hash keys proto-dst,dst ...)

(One dissector call per packet for any number of keys in the filter)

Same for net/sched/sch_sfb.c: use the pivot structure and compute the
two hashes (using two hashrnd values)

And __skb_get_rxhash() could use the same flow dissector, and pick (src
addr, dst addr, ports) to compute skb->rxhash, and set skb->l4_rxhash
if "ports" is not null.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
  2011-11-25 11:24         ` jamal
@ 2011-11-25 17:28           ` Stephen Hemminger
  0 siblings, 0 replies; 21+ messages in thread
From: Stephen Hemminger @ 2011-11-25 17:28 UTC (permalink / raw)
  To: jhs-jkUAjuhPggJWk0Htik3J/w
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu,
	Eric Dumazet, netdev, hadi-fAAogVwAN2Kw5LPnMra/2Q, Fastabend,
	John-/PVsmBQoxgPKo9QCiBeYKEEOCMrvLtNR, David Miller

On Fri, 25 Nov 2011 06:24:36 -0500
jamal <hadi-fAAogVwAN2Kw5LPnMra/2Q@public.gmane.org> wrote:

> Most hardware bridges out there support all the different modes:
> you can have learning in the hardware or defer it to the user/control
> plane by setting some flags.  You can have broadcasting done in
> hardware or defer it to user space.
> The mods I was thinking of would bring the Linux bridge to the same
> behavior.  You would then need to allow netlink updates of the bridge
> MAC table from user space.  There may be weaknesses in the current
> bridging code in relation to VLANs that would need to be addressed.
> 
> [But my concern was not so much the bridge - because changes are needed
> in that case; it is the "match, action list" infrastructure that is
> already in place that got to me.]

The bridge module is already overly complex. Rather than adding more
modes, it should be split into separate modules. If you look at macvlan,
you will see it is already a subset of the bridge. Another example of
this is the team driver, which is really just a subset of the bonding
code.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
       [not found]       ` <20111124212021.2ae2fb7f-QE31Isp8l5DVJhW05BI4jyWSNWFUUkiGXqFh9Ls21Oc@public.gmane.org>
  2011-11-25  6:18         ` Eric Dumazet
  2011-11-25 11:24         ` jamal
@ 2011-11-25 17:55         ` Jesse Gross
  2011-11-25 19:52         ` Justin Pettit
  3 siblings, 0 replies; 21+ messages in thread
From: Jesse Gross @ 2011-11-25 17:55 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu,
	Eric Dumazet, netdev, hadi-fAAogVwAN2Kw5LPnMra/2Q,
	jhs-jkUAjuhPggJWk0Htik3J/w, John Fastabend, David Miller

On Thu, Nov 24, 2011 at 9:20 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> On Thu, 24 Nov 2011 17:30:33 -0500
> jamal <hadi@cyberus.ca> wrote:
>> On Thu, 2011-11-24 at 12:10 -0800, Jesse Gross wrote:
>> >  * Userspace interfaces:  One of the difficulties of having a
>> > specialized, exact match flow lookup engine is maintaining
>> > compatibility across differing kernel/userspace versions.  This
>> > compatibility shows up heavily in the userspace interfaces and is
>> > achieved by passing the kernel's version of the flow along with packet
>> > information.  This allows userspace to install appropriate flows, even
>> > if its interpretation of a packet differs from the kernel's, without
>> > version checks or maintaining multiple implementations of the flow
>> > extraction code in the kernel.
>>
>> I didn't quite follow - are we talking about backward/forward
>> compatibility?
>
> The problem is that there are two flow classifiers, one in Open vSwitch
> in the kernel, and the other in the user space flow manager. I think the
> issue is that the two have different code.

Yes, since userspace is installing exact match entries, these flows
obviously need to be of the same form that the kernel would extract
from the packet.  Over time, I'm sure that additional packet formats
will be added, so it's important to handle the case where there is a
mismatch.

> Is the kernel/userspace API for Open vSwitch nailed down and documented
> well enough that alternative control plane software could be built?

Yes, that's actually the reason it took so long to submit the code
upstream - we spent a lot of time cleaning up and stripping down the
interfaces so they could be locked down (or cleanly extended).

There's a fair amount of documentation in the patch that I submitted on
how to maintain compatibility for flows, as mentioned above, and we're
certainly happy to write more if other things are unclear.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
       [not found]       ` <20111124212021.2ae2fb7f-QE31Isp8l5DVJhW05BI4jyWSNWFUUkiGXqFh9Ls21Oc@public.gmane.org>
                           ` (2 preceding siblings ...)
  2011-11-25 17:55         ` Jesse Gross
@ 2011-11-25 19:52         ` Justin Pettit
       [not found]           ` <2DB44B16-598F-4414-8B35-8E322D705A9A-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
  3 siblings, 1 reply; 21+ messages in thread
From: Justin Pettit @ 2011-11-25 19:52 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu,
	Eric Dumazet, netdev, hadi-fAAogVwAN2Kw5LPnMra/2Q,
	jhs-jkUAjuhPggJWk0Htik3J/w, John Fastabend, David Miller

On Nov 24, 2011, at 9:20 PM, Stephen Hemminger wrote:

>> This can be achieved easily with zero changes to the kernel code.
>> You need to have default filters that redirect flows to user space
>> when you fail to match.
> 
> Actually, this is what puts me off about the current implementation.
> I would prefer that the kernel implementation were just a software
> implementation of a hardware OpenFlow switch.  That way it would be
> transparent whether the control plane in user space was talking to the
> kernel or to hardware.

A big difficulty is finding an appropriate hardware abstraction.  I've worked on porting Open vSwitch to a few different vendors' switching ASICs, and they've all looked quite different from each other.  Even within a vendor, there can be fairly substantial differences.  Packet processing is broken up into stages (e.g., VLAN preprocessing, ingress ACL processing, L2 lookup, L3 lookup, packet modification, packet queuing, packet replication, egress ACL processing, etc.) and these can be done in different orders and have quite different behaviors.  Also, the size of the various tables varies widely between ASICs--even within the same family.

Hardware typically makes use of TCAMs, which support fast lookups of wildcarded flows.  They're expensive, though, so they're typically limited to entries in the very low thousands.  In software, we can trivially store 100,000s of entries, but supporting wildcarded lookups is very slow.  If we only use exact-match flows in the kernel (and leave the wildcarding in userspace for kernel misses), we can do extremely fast lookups with hashing on what becomes the fastpath.

Using exact-match entries has another big advantage: we can innovate the userspace portion without requiring changes to the kernel.  For example, we recently went from supporting a single OpenFlow table to 255 without any kernel changes.  This has an added benefit that a flow requiring multiple table lookups becomes a single hash lookup in the kernel, which is a huge performance gain in the fastpath.  Another example is our introduction of a number of metadata "registers" between tables that are never seen in the kernel, but open up a lot of interesting applications for OpenFlow controller writers.
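
Schematically, the userspace side looks something like this (all names
hypothetical, just to show the shape of the flattening):

  /* N wildcarded table lookups (plus the metadata registers) collapse
   * into ONE exact-match kernel entry, so later packets of the flow
   * cost a single hash lookup in the kernel. */
  #include <stdint.h>

  #define N_REGS 8

  struct flow_key;
  struct action_list;

  struct rule {
      const struct action_list *actions;
      int goto_table;                /* next pipeline stage, or -1 if done */
  };

  struct action_list *new_action_list(void);
  void actions_append(struct action_list *out, const struct action_list *in);
  struct rule *classifier_lookup(int table, const struct flow_key *key,
                                 uint32_t regs[N_REGS]);  /* wildcard match */
  void kernel_flow_install(const struct flow_key *key,    /* exact match   */
                           const struct action_list *acts);

  static void handle_miss(const struct flow_key *key)
  {
      uint32_t regs[N_REGS] = { 0 }; /* registers: never seen by the kernel */
      struct action_list *acts = new_action_list();
      int t = 0;

      while (t != -1) {
          struct rule *r = classifier_lookup(t, key, regs);

          if (!r)
              break;
          actions_append(acts, r->actions);
          t = r->goto_table;
      }
      kernel_flow_install(key, acts);
  }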

If you're interested, we include a porting guide in the distribution that describes how one would go about bringing Open vSwitch to a new hardware or software platform:

	http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING

Obviously, it's not that relevant here, since there's already a port to Linux.  :-)  But we've iterated over a few different designs and worked on other ports, and we've found this hardware/software abstraction layer to work pretty well.  In fact, multiple ports of Open vSwitch have been done by name-brand third party vendors (this is the avenue most vendors use to get their OpenFlow support) and are now shipping.

We're always open to discussing ways that we can improve these interfaces, too, of course!

--Justin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
  2011-11-25  6:18         ` Eric Dumazet
  2011-11-25  6:25           ` David Miller
@ 2011-11-25 20:14           ` Jesse Gross
  1 sibling, 0 replies; 21+ messages in thread
From: Jesse Gross @ 2011-11-25 20:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu, netdev,
	hadi-fAAogVwAN2Kw5LPnMra/2Q, jhs-jkUAjuhPggJWk0Htik3J/w,
	John Fastabend, Stephen Hemminger, David Miller

On Thu, Nov 24, 2011 at 10:18 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thursday, 24 November 2011 at 21:20 -0800, Stephen Hemminger wrote:
>
>> The problem is that there are two flow classifiers, one in Open vSwitch
>> in the kernel, and the other in the user space flow manager. I think the
>> issue is that the two have different code.
>
> We have kind of the same duplication in the kernel already :)
>
> __skb_get_rxhash() and net/sched/cls_flow.c contain roughly the same
> logic...
>
> Maybe it's time to factorize the thing, and eventually use it in a third
> component (Open vSwitch...)

I agree, there's no need to have three copies of packet header parsing
code and that's certainly something that we would be willing to work
on improving.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
  2011-11-25 11:34                 ` jamal
  2011-11-25 13:02                   ` Eric Dumazet
@ 2011-11-25 20:20                   ` Jesse Gross
       [not found]                     ` <CAEP_g=9tcH9kJrVsHc26kXWZEUS8G-U=U7y6k8xaZG5MD0OTyg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 21+ messages in thread
From: Jesse Gross @ 2011-11-25 20:20 UTC (permalink / raw)
  To: jhs-jkUAjuhPggJWk0Htik3J/w
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, chrisw-H+wXaHxf7aLQT0dZR+AlfA,
	Eric Dumazet, netdev-u79uwXL29TY76Z2rM5mHXA, Florian Westphal,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	herbert-F6s6mLieUQo7FNHlEwC/lvQIK84fMopw,
	shemminger-ZtmgI6mnKB3QT0dZR+AlfA, David Miller

On Fri, Nov 25, 2011 at 3:34 AM, jamal <hadi-fAAogVwAN2Kw5LPnMra/2Q@public.gmane.org> wrote:
>
> Hrm. I forgot about the flow classifier - it may be what the OpenFlow
> folks need. It is more friendly for well-defined tuples than u32.

The flow classifier isn't really designed to do rule lookup in the way
that OpenFlow/Open vSwitch does, since it's more about choosing which
fields are considered significant to the flow.  I'm sure that it could
be extended in some way, but it seems that the better approach would be
to factor out the common pieces (such as the header extraction
mentioned before) rather than try to cram both models into one
component.

I understand that you see some commonalities with various parts of the
system but often there are enough conceptual differences that you end
up trying to shove a square peg into a round hole.  As Stephen
mentioned about the bridge, many of these components are already
fairly complex and combining more functionality into them isn't always
a win.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
       [not found]           ` <2DB44B16-598F-4414-8B35-8E322D705A9A-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
@ 2011-11-26  1:11             ` Jamal Hadi Salim
  2011-11-26  4:38               ` Stephen Hemminger
  2011-11-28 18:34               ` Justin Pettit
  0 siblings, 2 replies; 21+ messages in thread
From: Jamal Hadi Salim @ 2011-11-26  1:11 UTC (permalink / raw)
  To: Justin Pettit
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu,
	Eric Dumazet, netdev, John Fastabend, Stephen Hemminger,
	David Miller

On Fri, 2011-11-25 at 11:52 -0800, Justin Pettit wrote:
> On Nov 24, 2011, at 9:20 PM, Stephen Hemminger wrote:
> 

> A big difficulty is finding an appropriate hardware abstraction.  I've worked on porting 
> Open vSwitch to a few different vendors' switching ASICs, and they've all looked quite 
> different from each other.  Even within a vendor, there can be fairly substantial differences.  
> Packet processing is broken up into stages (e.g., VLAN preprocessing, ingress ACL processing, 
> L2 lookup, L3 lookup, packet modification, packet queuing, packet replication, egress ACL 
> processing, etc.)
> and these can be done in different orders and have quite different behaviors.

There's some discussion going on about how to get ASIC support for the
variety of chips with different offloads (QoS, L2, etc.); you may want
to share your experiences.

Having said that - in the kernel we have all the mechanisms you describe
above, with quite a good fit.  I am speaking from the experience of
working on some vendors' ASICs (at least one of which I am sure you are
working on).
As an example, the ACL can be applied before or after L2 or L3.  We can
support wildcard matching in user space and exact matches in the kernel.

> Also, the size of the various tables varies widely between ASICs--even within the same 
> family.
> 
> Hardware typically makes use of TCAMs, which support fast lookups of wildcarded flows.
> They're expensive, though, so they're typically limited to entries in the very low thousands.

Those are problems with most merchant silicon - small tables; but there
are some which are easily expandable via DRAM to support a full BGP
table, for example.
 
> In software, we can trivially store 100,000s of entries, but supporting wildcarded lookups 
> is very slow.  If we only use exact-match flows in the kernel (and leave the wildcarding 
> in userspace for kernel misses), we can do extremely fast lookups with hashing on what 
> becomes the fastpath.

Justin - there's nothing new you need in the kernel to have that feature.
Let me rephrase that: it has not been a new feature for at least a
decade in Linux.
Add exact-match filters at a higher priority.  Have the lowest-priority
filter redirect to user space.  Let user space look up some service
rule; have it download one or more exact matches to the kernel.
Let the packet proceed on its way down the kernel to its destination if
that's what is defined.
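
In toy form (not real tc code, just the discipline):

  /* Filters are kept sorted by priority; exact matches sit in front and
   * the catch-all at the lowest priority redirects to user space. */
  struct packet;

  struct filter {
      int prio;                            /* lower value = tried first  */
      int (*match)(const struct packet *); /* exact match or always-true */
      void (*action)(struct packet *);     /* forward, or punt to user   */
      struct filter *next;                 /* kept sorted by prio        */
  };

  static void classify(struct filter *chain, struct packet *p)
  {
      struct filter *f;

      for (f = chain; f; f = f->next)
          if (f->match(p)) {
              /* The lowest-priority catch-all punts to user space;
               * user space then installs a new exact-match filter at
               * higher priority for the rest of the flow. */
              f->action(p);
              return;
          }
  }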

> Using exact-match entries has another big advantage: we can innovate the userspace portion 
> without requiring changes to the kernel.  For example, we recently went from supporting a 
> single OpenFlow table to 255 without any kernel changes.  This has an added benefit that 
> a flow requiring multiple table lookups becomes a single hash lookup in the kernel, which
> is a huge performance gain in the fastpath.  Another example is our introduction of a number
> of metadata "registers" between tables that are never seen in the kernel, but open up a lot 
> of interesting applications for OpenFlow controller writers.

That bit sounds interesting - I will look at your spec.

> If you're interested, we include a porting guide in the distribution that describes how one 
> would go about bringing Open vSwitch to a new hardware or software platform:
> 
> 	http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING
> 
> Obviously, it's not that relevant here, since there's already a port to Linux.  :-)  

Does this mean I can have a 24x10G switch sitting in hardware with Linux
hardware support if I use your kernel switch?
Do the vendors agree to some common interface?

> But we've 
> iterated over a few different designs and worked on other ports, and we've found this 
> hardware/software abstraction layer to work pretty well.  In fact, multiple ports of 
> Open vSwitch have been done by name-brand third party vendors (this is the avenue most
> vendors use to get their OpenFlow support) and are now shipping.
> 
> We're always open to discussing ways that we can improve these interfaces, too, of course!

Make these vendor switches work with plain Linux.  The Intel folks are
producing interfaces with L2, ACLs, VIs and are putting some effort into
integrating them into plain Linux.  I should be able to set the QoS
rules with tc on an Intel chip.
You guys can still take advantage of all that and still have your nice
control plane.

cheers,
jamal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
       [not found]                     ` <CAEP_g=9tcH9kJrVsHc26kXWZEUS8G-U=U7y6k8xaZG5MD0OTyg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-11-26  1:23                       ` Jamal Hadi Salim
  0 siblings, 0 replies; 21+ messages in thread
From: Jamal Hadi Salim @ 2011-11-26  1:23 UTC (permalink / raw)
  To: Jesse Gross
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, chrisw-H+wXaHxf7aLQT0dZR+AlfA,
	Eric Dumazet, netdev-u79uwXL29TY76Z2rM5mHXA, Florian Westphal,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	herbert-F6s6mLieUQo7FNHlEwC/lvQIK84fMopw,
	shemminger-ZtmgI6mnKB3QT0dZR+AlfA, David Miller

On Fri, 2011-11-25 at 12:20 -0800, Jesse Gross wrote:

> The flow classifier isn't really designed to do rule lookup in the way
> that OpenFlow/Open vSwitch does, since it's more about choosing which
> fields are considered significant to the flow.  I'm sure that it could
> be extended in some way, but it seems that the better approach would be
> to factor out the common pieces (such as the header extraction
> mentioned before) rather than try to cram both models into one
> component.

Yes, it would need a tweak or two.
But u32 would work.  And the action subsystem already does.

> I understand that you see some commonalities with various parts of the
> system but often there are enough conceptual differences that you end
> up trying to shove a square peg into a round hole. 

I have done this for years.  I have very good knowledge of merchant
silicon and I have programmed these chips on Linux; I know this space a
lot more than you are assuming.
If you can point me to _one_, just _one_ thing that you do in the
classifier-action piece that cannot be done in Linux today and is
more flexible in your setup than it is on Linux, we can have a
useful discussion.

>  As Stephen
> mentioned about the bridge, many of these components are already
> fairly complex and combining more functionality into them isn't always
> a win.

I think the bridge got off on the wrong foot by not properly integrating
with VLANs and by tightly integrating STP control in the kernel.

cheers,
jamal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
  2011-11-26  1:11             ` Jamal Hadi Salim
@ 2011-11-26  4:38               ` Stephen Hemminger
       [not found]                 ` <ec23d63d-27c9-4761-bdd3-e3f54bdb5e77-bX68f012229Xuxj3zoTs5AC/G2K4zDHf@public.gmane.org>
  2011-11-28 18:34               ` Justin Pettit
  1 sibling, 1 reply; 21+ messages in thread
From: Stephen Hemminger @ 2011-11-26  4:38 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu,
	Eric Dumazet, netdev, John Fastabend, David Miller

Not sure how the Open vSwitch implementation relates to the OpenFlow
specification.

There are a few switches supporting OpenFlow already:
  http://www.openflow.org/wp/switch-nec/
  http://www-03.ibm.com/systems/x/options/networking/bnt8264/index.html

The standard(s) are here:
  http://www.openflow.org/wp/documents/

Good info from a recent symposium:
 http://opennetsummit.org/past_conferences.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
       [not found]                 ` <ec23d63d-27c9-4761-bdd3-e3f54bdb5e77-bX68f012229Xuxj3zoTs5AC/G2K4zDHf@public.gmane.org>
@ 2011-11-26  8:05                   ` Martin Casado
  0 siblings, 0 replies; 21+ messages in thread
From: Martin Casado @ 2011-11-26  8:05 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu,
	Eric Dumazet, netdev, Jamal Hadi Salim, John Fastabend,
	David Miller


> Not sure how the Open vSwitch implementation relates to the OpenFlow
> specification.
The short answer is that Open vSwitch serves as one of the standard 
reference implementations for OpenFlow (in fact, the primary developers 
of Open vSwitch were some of the original designers of OpenFlow).  
Multiple hardware switches on the market use Open vSwitch as the basis 
for their OpenFlow support.
>
> There are a few switches supporting OpenFlow already:
>    http://www.openflow.org/wp/switch-nec/
>    http://www-03.ibm.com/systems/x/options/networking/bnt8264/index.html

There are many other ports announced or available from vendors such as 
HP, Brocade, Pica8, Extreme, Juniper, and NetGear.  Cisco has even 
announced support for OpenFlow on the Nexus 3k
(http://www.lightreading.com/document.asp?doc_id=213545).

.martin

>
> The standard(s) are here:
>    http://www.openflow.org/wp/documents/
>
> Good info from a recent symposium:
>   http://opennetsummit.org/past_conferences.html


-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Martin Casado
Nicira Networks, Inc.
www.nicira.com
cell: 650-776-1457
~~~~~~~~~~~~~~~~~~~~~~~~~~~

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH net-next 0/4] net: factorize flow dissector
  2011-11-25 13:02                   ` Eric Dumazet
@ 2011-11-28 15:20                     ` Eric Dumazet
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2011-11-28 15:20 UTC (permalink / raw)
  To: jhs-jkUAjuhPggJWk0Htik3J/w
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, chrisw-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Florian Westphal,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	herbert-F6s6mLieUQo7FNHlEwC/lvQIK84fMopw,
	shemminger-ZtmgI6mnKB3QT0dZR+AlfA, Dan Siemon, David Miller

On Friday, 25 November 2011 at 14:02 +0100, Eric Dumazet wrote:

> cls_flow is not complete, since it doesn't handle tunnels, for example.
> 
> It calls a 'partial flow classifier' to find each needed element, one by
> one.
> (adding tunnel decap would need to perform this several times for each
> packet)
> 
> __skb_get_rxhash() is more tunnel-aware, yet some protocols are still
> missing, for example IPPROTO_IPV6.
> 
> Instead of adding logic to both dissectors, we could have a central flow
> dissector, filling a temporary pivot structure with the found elements
> (src addr, dst addr, ports, ...), going through tunnel encaps if found.
> 
> Then net/sched/cls_flow.c could pick the needed elements from this
> structure to compute the hash as specified in the tc command:
> (for example: tc filter ... flow hash keys proto-dst,dst ...)
> 
> (One dissector call per packet for any number of keys in the filter)
> 
> Same for net/sched/sch_sfb.c: use the pivot structure and compute the
> two hashes (using two hashrnd values)
> 
> And __skb_get_rxhash() could use the same flow dissector, and pick (src
> addr, dst addr, ports) to compute skb->rxhash, and set skb->l4_rxhash
> if "ports" is not null.
> 

Here is a patch series doing this factorization / cleanup.

[PATCH net-next 1/4] net: introduce skb_flow_dissect()
[PATCH net-next 2/4] net: use skb_flow_dissect() in __skb_get_rxhash()
[PATCH net-next 3/4] cls_flow: use skb_flow_dissect()
[PATCH net-next 4/4] sch_sfb: use skb_flow_dissect()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

cumulative diffstat :
 include/net/flow_keys.h   |   15 +++
 net/core/Makefile         |    2 
 net/core/dev.c            |  124 ++----------------------
 net/core/flow_dissector.c |  130 ++++++++++++++++++++++++++
 net/ipv4/tcp.c            |    8 -
 net/sched/cls_flow.c      |  180 +++++++++---------------------------
 net/sched/sch_sfb.c       |   17 ++-
 7 files changed, 225 insertions(+), 251 deletions(-)




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
  2011-11-26  1:11             ` Jamal Hadi Salim
  2011-11-26  4:38               ` Stephen Hemminger
@ 2011-11-28 18:34               ` Justin Pettit
  2011-11-28 22:42                 ` Jamal Hadi Salim
  1 sibling, 1 reply; 21+ messages in thread
From: Justin Pettit @ 2011-11-28 18:34 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu,
	Eric Dumazet, netdev, John Fastabend, Stephen Hemminger,
	David Miller

On Nov 25, 2011, at 5:11 PM, Jamal Hadi Salim wrote:

>> A big difficulty is finding an appropriate hardware abstraction.  I've worked on porting 
>> Open vSwitch to a few different vendors' switching ASICs, and they've all looked quite 
>> different from each other.  Even within a vendor, there can be fairly substantial differences.  
>> Packet processing is broken up into stages (e.g., VLAN preprocessing, ingress ACL processing, 
>> L2 lookup, L3 lookup, packet modification, packet queuing, packet replication, egress ACL 
>> processing, etc.)
>> and these can be done in different orders and have quite different behaviors.
> 
> There's some discussion going on about how to get ASIC support for the
> variety of chips with different offloads (QoS, L2, etc.); you may want
> to share your experiences.

Are you talking about ASICs on NICs?  I was referring to integrating Open vSwitch into top-of-rack switches.  These typically have a 48x1G or 48x10G switching ASIC and a relatively slow (~800MHz PPC-class) management CPU running an operating system like Linux.  There's no way that these systems can have a standard CPU on the fastpath.

> Having said that - in the kernel we have all the mechanisms you describe
> above, with quite a good fit.  I am speaking from the experience of
> working on some vendors' ASICs (at least one of which I am sure you are
> working on).
> As an example, the ACL can be applied before or after L2 or L3.  We can
> support wildcard matching in user space and exact matches in the kernel.

I understood the original question to be: Can we make the interface to the kernel look like a hardware switch?  My answer had two main parts.  First, I don't think we could define a "standard" hardware interface, since they're all very different.  Second, even if we could, I think a software fastpath's strengths and weaknesses are such that the hardware model wouldn't be ideal.

>> Also, the size of the various tables varies widely between ASICs--even within the same 
>> family.
>> 
>> Hardware typically makes use of TCAMs, which support fast lookups of wildcarded flows.
>> They're expensive, though, so they're typically limited to entries in the very low thousands.
> 
> Those are problems with most merchant silicon - small tables; but there
> are some which are easily expandable via DRAM to support a full BGP
> table, for example.

The problem is that DRAM isn't going to cut it on the ACL tables--which are typically used for flow-based matching--on a 48x10G (or even 48x1G) switch.  I've seen a couple of switching ASICs that support many 10s of thousands of ACL entries, but they require expensive external TCAMs for lookup and SRAM for counters.  Most of the white box vendors that I've seen that use those ASICs don't bother adding the external TCAM and SRAM to their designs.  Even when they are added, their matching capabilities are typically limited in order to keep up with traffic.

>> In software, we can trivially store 100,000s of entries, but supporting wildcarded lookups 
>> is very slow.  If we only use exact-match flows in the kernel (and leave the wildcarding 
>> in userspace for kernel misses), we can do extremely fast lookups with hashing on what 
>> becomes the fastpath.
> 
> Justin - there's nothing new you need in the kernel to have that feature.
> Let me rephrase that: it has not been a new feature for at least a
> decade in Linux.
> Add exact-match filters at a higher priority.  Have the lowest-priority
> filter redirect to user space.  Let user space look up some service
> rule; have it download one or more exact matches to the kernel.
> Let the packet proceed on its way down the kernel to its destination if
> that's what is defined.

My point was that a software fastpath should look different than a hardware-based one.

>> Using exact-match entries has another big advantage: we can innovate the userspace portion 
>> without requiring changes to the kernel.  For example, we recently went from supporting a 
>> single OpenFlow table to 255 without any kernel changes.  This has an added benefit that 
>> a flow requiring multiple table lookups becomes a single hash lookup in the kernel, which
>> is a huge performance gain in the fastpath.  Another example is our introduction of a number
>> of metadata "registers" between tables that are never seen in the kernel, but open up a lot 
>> of interesting applications for OpenFlow controller writers.
> 
> That bit sounds interesting - I will look at your spec.

Great!

>> If you're interested, we include a porting guide in the distribution that describes how one 
>> would go about bringing Open vSwitch to a new hardware or software platform:
>> 
>> 	http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING
>> 
>> Obviously, it's not that relevant here, since there's already a port to Linux.  :-)  
> 
> Does this mean I can have a 24x10G switch sitting in hardware with Linux
> hardware support if I use your kernel switch?

Yes, Open vSwitch has been ported to 24x10G ASICs running Linux on their management CPUs.  However, in these cases the datapath is handled by hardware and not the software forwarding plane, obviously.

> Do the vendors agree to some common interface?

Yes, if you view ofproto (as described in the porting guide) as that interface.  Every merchant silicon vendor I've seen views the interfaces to their ASICs as proprietary.  Someone (with the appropriate SDK and licenses) needs to write providers for those different hardware ports.  We've helped multiple vendors do this and know a few others that have done it on their own.

This really seems beside the point for this discussion, though.  We've written an ofproto provider for software switches called "dpif" (this is also described in the porting guide). What we're proposing be included in Linux is the kernel module that speaks to that dpif provider over a well-defined, stable, netlink-based protocol.

Here's just a quick (somewhat simplified) summary of the different layers.  At the top, there are controllers and switches that communicate using OpenFlow.  OpenFlow gives controller writers the ability to inspect and modify the switches' flow tables and interfaces.  If a flow entry doesn't match an existing entry, the packet is forwarded to the controller for further processing.  OpenFlow 1.0 was pretty basic and exposed a single flow table.  OpenFlow 1.1 introduced a number of new features including multiple table support.  The forthcoming OpenFlow 1.2 will include support for extensible matches, which means that new fields may be added without requiring a full revision of the specification.  OpenFlow is defined by the Open Networking Foundation and is not directly related to Open vSwitch.

The userspace in Open vSwitch has an OpenFlow library that interacts with the controllers.  Userspace has its own classifier that supports wildcard entries and multiple tables.  Many of the changes to the OpenFlow protocol only require modifying that library and perhaps some of the glue code with the classifier.  (In theory, other software-defined networking protocols could be plugged in as well.)  The classifier interacts with the ofproto layer below it, which implements a fastpath.  On a hardware switch, since it supports wildcarding, it essentially becomes a passthrough that just calls the appropriate APIs for the ASIC.  On software, as we've discussed, exact-match flows work better.

For that reason, we've defined the dpif layer, which is an ofproto provider.  Its primary purpose is to take high-level concepts like "treat this group of interfaces as a LACP bond" or "support this set of wildcard flow entries" and explode them into exact-match entries on demand.  We've then implemented a Linux dpif provider that takes the exact-match entries created by the dpif layer and converts them into netlink messages that the kernel module understands.  These messages are well-defined and not specific to Open vSwitch or OpenFlow.
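
For flavor, here is a libnl-style sketch of what one flow install looks
like on the wire.  FLOW_CMD_NEW and the FLOW_ATTR_* names are
placeholders; the real command and attribute names are in the header
that ships with the patch:

  #include <netlink/netlink.h>
  #include <netlink/msg.h>
  #include <netlink/attr.h>
  #include <netlink/genl/genl.h>

  enum { FLOW_CMD_NEW = 1 };                          /* placeholder */
  enum { FLOW_ATTR_KEY = 1, FLOW_ATTR_ACTIONS = 2 };  /* placeholder */

  int install_flow(struct nl_sock *sk, int family,
                   const void *key, int key_len,
                   const void *actions, int actions_len)
  {
      struct nl_msg *msg = nlmsg_alloc();
      int err;

      if (!msg)
          return -1;
      genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family, 0,
                  NLM_F_REQUEST | NLM_F_CREATE, FLOW_CMD_NEW, 1);
      nla_put(msg, FLOW_ATTR_KEY, key_len, key);          /* exact match */
      nla_put(msg, FLOW_ATTR_ACTIONS, actions_len, actions);
      err = nl_send_auto(sk, msg);
      nlmsg_free(msg);
      return err < 0 ? -1 : 0;
  }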

This layering has allowed us to introduce new OpenFlow-like features such as multiple tables and non-OpenFlow features such as port mirroring, STP, CCM, and new bonding modes without changes to the kernel module.  In fact, the only changes that should necessitate a kernel interface change are new matches or actions, such as would be required for handling MPLS.

>> But we've 
>> iterated over a few different designs and worked on other ports, and we've found this 
>> hardware/software abstraction layer to work pretty well.  In fact, multiple ports of 
>> Open vSwitch have been done by name-brand third party vendors (this is the avenue most
>> vendors use to get their OpenFlow support) and are now shipping.
>> 
>> We're always open to discussing ways that we can improve these interfaces, too, of course!
> 
> Make these vendor switches work with plain Linux. The Intel folks are
> producing interfaces with L2, ACLs, VIs and are putting some effort into
> integrating them into plain Linux. I should be able to set the QoS rules
> with tc on an Intel chip.
> You guys can still take advantage of all that and still have your nice
> control plane.

Once again, I think we are talking about different things.  I believe you are discussing interfacing with NICs, which is quite different from a high fanout switching ASIC.  As I previously mentioned, the point of my original post was that I think it would be best not to model a high fanout switch in the interface to the kernel.

--Justin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Open vSwitch Design
  2011-11-28 18:34               ` Justin Pettit
@ 2011-11-28 22:42                 ` Jamal Hadi Salim
  0 siblings, 0 replies; 21+ messages in thread
From: Jamal Hadi Salim @ 2011-11-28 22:42 UTC (permalink / raw)
  To: Justin Pettit
  Cc: Stephen Hemminger, Jesse Gross, netdev, dev, David Miller,
	Chris Wright, Herbert Xu, Eric Dumazet, John Fastabend

On Mon, 2011-11-28 at 10:34 -0800, Justin Pettit wrote:
> On Nov 25, 2011, at 5:11 PM, Jamal Hadi Salim wrote:

> 
> Are you talking about ASICs on NICs?  

I am indifferent - I am looking at it entirely from a control
perspective, i.e. if I do "ip link blah down" on a port
I want that to work with zero changes to iproute2; the only
way you can achieve that is if you expose those ports as
netdevs.
This is what I said was a good thing the Intel folks were trying to
achieve (and what Lennert has done for the small Marvell switch chips).
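Roughly, per front-panel port, I mean something like the sketch below,
where the vendor_sdk_*() calls are made-up placeholders for whatever
the silicon SDK actually provides:

#include <linux/etherdevice.h>
#include <linux/netdevice.h>

/* Hypothetical vendor SDK entry points. */
extern void vendor_sdk_port_enable(int port_id);
extern void vendor_sdk_port_disable(int port_id);

struct asic_port {
    int port_id;
};

static int asic_port_open(struct net_device *dev)
{
    struct asic_port *p = netdev_priv(dev);

    vendor_sdk_port_enable(p->port_id);
    netif_start_queue(dev);
    return 0;
}

static int asic_port_stop(struct net_device *dev)
{
    struct asic_port *p = netdev_priv(dev);

    netif_stop_queue(dev);
    vendor_sdk_port_disable(p->port_id);
    return 0;
}

static netdev_tx_t asic_port_xmit(struct sk_buff *skb, struct net_device *dev)
{
    /* CPU-originated frames would be injected through the SDK; the
     * ASIC forwards transit traffic on its own, so just drop here. */
    dev_kfree_skb(skb);
    return NETDEV_TX_OK;
}

static const struct net_device_ops asic_port_ops = {
    .ndo_open       = asic_port_open,
    .ndo_stop       = asic_port_stop,
    .ndo_start_xmit = asic_port_xmit,
};

struct net_device *asic_port_create(int port_id)
{
    struct net_device *dev = alloc_etherdev(sizeof(struct asic_port));

    if (!dev)
        return NULL;
    ((struct asic_port *)netdev_priv(dev))->port_id = port_id;
    random_ether_addr(dev->dev_addr); /* real driver reads MAC from ASIC */
    dev->netdev_ops = &asic_port_ops;
    if (register_netdev(dev)) {
        free_netdev(dev);
        return NULL;
    }
    return dev; /* "ip link set <port> up" now just works */
}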

> I was referring to integrating Open vSwitch into top-of-rack switches.  
> These typically have a 48x1G or 48x10G switching ASIC and a relatively 
> slow (~800MHz PPC-class) management CPU running an operating system like 
> Linux.  There's no way that these systems can have a standard CPU on the fastpath.

No, not the datapath; just control of the hardware.  If I run
"ip route add .." I want that to work on the ASIC.
Same with tc actions/classification.  I want to run those tools and
configure an ACL in the ASIC with no new learning curve.
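And one way to get there without touching iproute2 at all: a small
daemon that mirrors the kernel's own rtnetlink notifications into the
ASIC.  A sketch (printing where a hypothetical vendor SDK call would
program the hardware FIB, error handling trimmed):

#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

int main(void)
{
    struct sockaddr_nl sa = {
        .nl_family = AF_NETLINK,
        .nl_groups = RTMGRP_IPV4_ROUTE, /* IPv4 route add/del events */
    };
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    char buf[8192];

    bind(fd, (struct sockaddr *)&sa, sizeof(sa));

    for (;;) {
        int len = recv(fd, buf, sizeof(buf), 0);
        struct nlmsghdr *nh;

        if (len <= 0)
            break;
        for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
             nh = NLMSG_NEXT(nh, len)) {
            struct rtmsg *rtm = NLMSG_DATA(nh);
            struct rtattr *rta;
            int alen;

            if (nh->nlmsg_type != RTM_NEWROUTE)
                continue;
            alen = RTM_PAYLOAD(nh);
            for (rta = RTM_RTA(rtm); RTA_OK(rta, alen);
                 rta = RTA_NEXT(rta, alen)) {
                char dst[INET_ADDRSTRLEN];

                if (rta->rta_type != RTA_DST)
                    continue;
                inet_ntop(AF_INET, RTA_DATA(rta), dst, sizeof(dst));
                /* A hypothetical vendor_sdk_route_add() would
                 * program the ASIC's FIB right here. */
                printf("new route %s/%d\n", dst, rtm->rtm_dst_len);
            }
        }
    }
    return 0;
}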

> 
> I understood the original question to be: Can we make the interface to the 
> kernel look like a hardware switch?  My answer had two main parts.  First, 
> I don't think we could define a "standard" hardware interface, since they're
> all very different.  Second, even if we could, I think a software fastpath's
> strengths and weaknesses are such that the hardware model wouldn't be ideal.

Not talking about the datapath - but the control interface to those
devices.  We can't define what the low levels look like.  But if you
expose things using standard Linux interfaces, then user space tools
and APIs stay unchanged.

Then I shouldn't care where the feature runs (hardware NIC, ASIC, pure
kernel-level software, etc.).

> 
> The problem is that DRAM isn't going to cut it on the ACL tables--which are 
> typically used for flow-based matching--on a 48x10G (or even 48x1G) switch.

There are vendors who use DRAMs with specialized interfaces that
interleave requests behind the scenes.  Maybe I can point you to one
offline. 

> I've seen a couple of switching ASICs that support many 10s of thousands of
> ACL entries, but they require expensive external TCAMs for lookup and SRAM 
> for counters.  Most of the white box vendors that I've seen that use those 
> ASICs don't bother adding the external TCAM and SRAM to their designs.  
> Even when they are added, their matching capabilities are typically limited 
> in order to keep up with traffic.

I thought the SRAM market had dried up these days.  Anyway, what you are
referring to above is generally true.

> > Justin - theres nothing new you need in the kernel to have that feature.
> > Let me rephrase that, that has not been a new feature for at least a
> > decade in Linux.
> > Add exact match filters with higher priority. Have the lowest priority
> > filter to redirect to user space. Let user space lookup some service
> > rule; have it download to the kernel one or more exact matches.
> > Let the packet proceed on its way down the kernel to its destination if
> > thats what is defined.
> 
> My point was that a software fastpath should look different than a hardware-based one.

And I was pointing out that this is exactly what your datapath patches
do, in conjunction with your user space code.

> > 
> > That bit sounds interesting - I will look at your spec.
> 
> Great!

I am sorry - I have been overloaded elsewhere and haven't looked yet.  But
I think I pretty much spelt out my desires above.


> Yes, Open vSwitch has been ported to 24x10G ASICs running Linux on their management CPUs.  
> However, in these cases the datapath is handled by hardware and not the software forwarding 
> plane, obviously.

Of course.

> > Do the vendors agree to some common interface?
> 
> Yes, if you view ofproto (as described in the porting guide) as that interface.  Every merchant silicon vendor 
> I've seen views the interfaces to their ASICs as proprietary.  

Yes, the XAL agony (HALs and PALs that run on 26 other OSes).

> Someone (with the appropriate SDK and licenses) needs to write providers for those different hardware ports.  
> We've helped multiple vendors do this and know a few others that have done it on their own.

You know, what would be really nice is if you achieved what I described
above.
Can I ifconfig an Ethernet switch port?

> This really seems beside the point for this discussion, though.  
> We've written an ofproto provider for software switches called "dpif" 
> (this is also described in the porting guide). What we're proposing be 
> included in Linux is the kernel module that speaks to that dpif provider 
> over a well-defined, stable, netlink-based protocol.
> 
> Here's just a quick (somewhat simplified) summary of the different layers. 
> At the top, there are controllers and switches that communicate using OpenFlow.
> OpenFlow gives controller writers the ability to inspect and modify the switches' 
> flow tables and interfaces.  If a packet doesn't match an existing flow entry, it 
> is forwarded to the controller for further processing.  OpenFlow 1.0 was 
> pretty basic and exposed a single flow table.  OpenFlow 1.1 introduced a number 
> of new features including multiple table support.  The forthcoming OpenFlow 1.2 
> will include support for extensible matches, which means that new fields may be 
> added without requiring a full revision of the specification.  OpenFlow is defined 
> by the Open Networking Foundation and is not directly related to Open vSwitch.
> 
> The userspace in Open vSwitch has an OpenFlow library that interacts with the 
> controllers.  Userspace has its own classifier that supports wildcard entries 
> and multiple tables.  Many of the changes to the OpenFlow protocol only require 
> modifying that library and perhaps some of the glue code with the classifier.  
> (In theory, other software-defined networking protocols could be plugged in as well.)  
> The classifier interacts with the ofproto layer below it, which implements a fastpath.

Yes, when I looked at your code I could see that you have gone past
OpenFlow.

> On a hardware switch, since the ASIC supports wildcarding, the ofproto layer 
> essentially becomes a passthrough that just calls the appropriate APIs for the ASIC.  

Are these APIs documented as well?  Maybe that's all we need if you don't
have the standard Linux tools working.

> In software, 
> as we've discussed, exact-match flows work better.
> 
> For that reason, we've defined the dpif layer, which is an ofproto provider.  
> Its primary purpose is to take high-level concepts like "treat this group of 
> interfaces as a LACP bond" or "support this set of wildcard flow entries" and 
> explode them into exact-match entries on-demand.  We've then implemented a 
> Linux dpif provider that takes the exact match entries created by the dpif 
> layer and converts them into netlink messages that the kernel module understands.  
> These messages are well-defined and not specific to Open vSwitch or OpenFlow.

Useful, but that seems more like a service layer - as a basic need, I
just want to be able to ifconfig a port.
In any case, I should look at your doc to get some clarity.

> This layering has allowed us to introduce new OpenFlow-like features such as multiple tables 
> and non-OpenFlow features such as port mirroring, STP, CCM, and new bonding modes without 
> changes to the kernel module.  In fact, the only changes that should necessitate a kernel 
> interface change are new matches or actions, such as would be required for handling MPLS.

I just need the basic building blocks.
If you conform to what Linux already does and I can run the standard
tools, a lot of creative things could be done.


> > Make these vendor switches work with plain Linux. The Intel folks are
> > producing interfaces with L2, ACLs, VIs and are putting some effort into
> > integrating them into plain Linux. I should be able to set the QoS rules
> > with tc on an Intel chip.
> > You guys can still take advantage of all that and still have your nice
> > control plane.
> 
> Once again, I think we are talking about different things.  I believe you are 
> discussing interfacing with NICs, which is quite different from a high fanout 
> switching ASIC.  As I previously mentioned, the point of my original 
> post was that I think it would be best not to model a high fanout switch in the interface to the kernel.
> 

I hope my clarification above makes things clearer.

cheers,
jamal

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread

Thread overview: 21+ messages
2011-11-24 20:10 Open vSwitch Design Jesse Gross
     [not found] ` <CAEP_g=_2L1xFWtDXh_6YyXz1Mt9TR3zvjLzix+SpO6yzeOLsSQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-11-24 22:30   ` jamal
2011-11-25  5:20     ` Stephen Hemminger
     [not found]       ` <20111124212021.2ae2fb7f-QE31Isp8l5DVJhW05BI4jyWSNWFUUkiGXqFh9Ls21Oc@public.gmane.org>
2011-11-25  6:18         ` Eric Dumazet
2011-11-25  6:25           ` David Miller
     [not found]             ` <20111125.012517.2221372383643417980.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
2011-11-25  6:36               ` Eric Dumazet
2011-11-25 11:34                 ` jamal
2011-11-25 13:02                   ` Eric Dumazet
2011-11-28 15:20                     ` [PATCH net-next 0/4] net: factorize flow dissector Eric Dumazet
2011-11-25 20:20                   ` Open vSwitch Design Jesse Gross
     [not found]                     ` <CAEP_g=9tcH9kJrVsHc26kXWZEUS8G-U=U7y6k8xaZG5MD0OTyg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-11-26  1:23                       ` Jamal Hadi Salim
2011-11-25 20:14           ` Jesse Gross
2011-11-25 11:24         ` jamal
2011-11-25 17:28           ` Stephen Hemminger
2011-11-25 17:55         ` Jesse Gross
2011-11-25 19:52         ` Justin Pettit
     [not found]           ` <2DB44B16-598F-4414-8B35-8E322D705A9A-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
2011-11-26  1:11             ` Jamal Hadi Salim
2011-11-26  4:38               ` Stephen Hemminger
     [not found]                 ` <ec23d63d-27c9-4761-bdd3-e3f54bdb5e77-bX68f012229Xuxj3zoTs5AC/G2K4zDHf@public.gmane.org>
2011-11-26  8:05                   ` Martin Casado
2011-11-28 18:34               ` Justin Pettit
2011-11-28 22:42                 ` Jamal Hadi Salim
