* Open vSwitch Design
@ 2011-11-24 20:10 Jesse Gross
0 siblings, 1 reply; 21+ messages in thread
From: Jesse Gross @ 2011-11-24 20:10 UTC (permalink / raw)
To: netdev, dev
Cc: David Miller, Stephen Hemminger, Chris Wright, Herbert Xu,
Eric Dumazet, John Fastabend, Justin Pettit, jhs
I realized that since Open vSwitch is so userspace-centric some of the
design considerations might not be apparent from the kernel code
alone. I did a poor job of explaining the larger picture, which has
led to some misconceptions, so I thought it would be helpful if I
gave a short overview.
One of the driving goals was to push as much logic as possible to
userspace, so the kernel portion is less than 6000 lines and has four
components:
* Switching infrastructure: As the name implies, Open vSwitch is
intended to be a network switch, focused on
virtualization/OpenFlow/software defined networking. This means that
what we are modeling is not actually a collection of flows but a
switch which contains a group of related ports, a software virtual
device, etc. The switch model is used in a variety of places, such as
to measure traffic that actually flows through it in order to
implement monitoring and sampling protocols.
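The "switch, not just a flow table" model above can be sketched in a few lines. This is purely illustrative (not OVS code, and all names are made up): a datapath owns a set of ports, and per-port counters plus a 1-in-N sampling decision show why monitoring/sampling protocols fall out of the switch model naturally.

```python
# Illustrative sketch of the datapath-as-switch model: a named set of
# ports with per-port statistics, plus simple 1-in-N packet sampling.
class Vport:
    def __init__(self, name):
        self.name = name
        self.rx_packets = 0
        self.rx_bytes = 0

class Datapath:
    def __init__(self, sample_n=0):
        self.ports = {}            # port number -> Vport
        self.sample_n = sample_n   # sample 1 in N packets (0 = off)
        self._since_sample = 0

    def add_port(self, port_no, name):
        self.ports[port_no] = Vport(name)

    def receive(self, in_port, pkt_len):
        """Account the packet; return True if it should be sampled."""
        port = self.ports[in_port]
        port.rx_packets += 1
        port.rx_bytes += pkt_len
        if self.sample_n:
            self._since_sample += 1
            if self._since_sample >= self.sample_n:
                self._since_sample = 0
                return True
        return False

dp = Datapath(sample_n=4)
dp.add_port(1, "vnet0")
sampled = sum(dp.receive(1, 100) for _ in range(8))
```

Because the stats live on the switch's ports rather than on individual flows, they measure the traffic that actually passes through the switch, which is what sampling protocols need.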
* Flow lookup: Although used to implement OpenFlow, the kernel flow
table does not actually directly contain OpenFlow flows. This is
because OpenFlow tables can contain wildcards, multiple pipeline
stages, etc. and we did not want to push that complexity into the
kernel fast path (nor tie it to a specific version of OpenFlow).
Instead an exact match flow table is populated on-demand from
userspace based on the more complex rules stored there. Although it
might seem limiting, this design has allowed significant new
functionality to be added without modifications to the kernel or
performance impact.
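The exact-match-plus-upcall split described above can be sketched as follows. This is a toy model, not the actual OVS datapath: the dictionary stands in for the kernel's exact-match table, and the upcall function stands in for userspace, which holds the general (wildcard-capable) rules and installs an exact-match entry covering just the flow that missed.

```python
class ExactMatchDatapath:
    """Kernel fast path: an exact-match table keyed on the full
    extracted header tuple; anything that misses goes to userspace."""
    def __init__(self, upcall):
        self.flows = {}        # exact header tuple -> action list
        self.upcall = upcall   # stands in for the upcall to userspace

    def install_flow(self, key, actions):
        self.flows[key] = actions

    def process(self, key):
        actions = self.flows.get(key)
        if actions is None:            # slow path: first packet of flow
            return self.upcall(key)
        return actions                 # fast path: exact match hit

misses = []

def userspace_upcall(key):
    """Userspace consults its wildcard-capable rules, then installs an
    exact-match entry covering just this flow back into the kernel."""
    misses.append(key)
    src = key[0]
    actions = ["output:2"] if src.startswith("10.") else ["drop"]
    dp.install_flow(key, actions)
    return actions

dp = ExactMatchDatapath(userspace_upcall)
key = ("10.0.0.1", "10.0.0.2", 6, 80)   # (src, dst, ip proto, dst port)
a1 = dp.process(key)   # miss: handled via upcall, flow installed
a2 = dp.process(key)   # hit: served from the fast-path table
```

Only the first packet of each flow pays the slow-path cost; the wildcard logic (and any OpenFlow version specifics) stays entirely in userspace.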
* Packet execution: Once a flow is matched it can be output,
enqueued to a particular qdisc, etc. Some of these operations are
specific to Open vSwitch, such as sampling, whereas for others we
leverage existing infrastructure (including tc for QoS) by simply
marking the packet for further processing.
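A flow's action list can be pictured as a small program run over each matching packet. The sketch below is illustrative only: "output" and "sample" mimic OVS-style actions, while "set_priority" stands in for marking the packet so that existing infrastructure (e.g. a tc qdisc) handles it later; the action names are invented for the example.

```python
def execute_actions(packet, actions, sent, sampled):
    """Walk an action list for one packet, in order."""
    for act in actions:
        if act[0] == "output":
            sent.append((act[1], dict(packet)))   # copy reflects edits so far
        elif act[0] == "set_vlan":
            packet["vlan"] = act[1]
        elif act[0] == "sample":
            sampled.append(dict(packet))          # hand a copy to monitoring
        elif act[0] == "set_priority":
            packet["priority"] = act[1]           # analogous to marking for tc

pkt = {"vlan": None, "priority": 0}
sent, sampled = [], []
execute_actions(pkt,
                [("sample",), ("set_vlan", 100),
                 ("set_priority", 7), ("output", 2)],
                sent, sampled)
```

Note that ordering matters: sampling before the vlan rewrite observes the packet as it arrived, while the output action sees all modifications made so far.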
* Userspace interfaces: One of the difficulties of having a
specialized, exact match flow lookup engine is maintaining
compatibility across differing kernel/userspace versions. This
compatibility shows up heavily in the userspace interfaces and is
achieved by passing the kernel's version of the flow along with packet
information. This allows userspace to install appropriate flows even
if its interpretation of a packet differs from the kernel's without
version checks or maintaining multiple implementations of the flow
extraction code in the kernel.
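The flow-echo idea above can be made concrete with a small sketch. The scenario and field names are hypothetical: an older kernel extracts only src/dst, while a newer userspace also understands a vlan field. Because the kernel's own flow key travels with the packet in the upcall, userspace can decide actions using its richer view but install the entry under the kernel's key, so the entry matches the kernel's future lookups without any version negotiation.

```python
def kernel_extract(pkt):
    """An older kernel's flow key: it only understands src/dst."""
    return (("src", pkt["src"]), ("dst", pkt["dst"]))

def userspace_extract(pkt):
    """A newer userspace additionally parses the vlan tag."""
    return (("src", pkt["src"]), ("dst", pkt["dst"]),
            ("vlan", pkt.get("vlan")))

def handle_upcall(pkt, kernel_key, kernel_flows, decide_actions):
    """Choose actions from userspace's richer view of the packet, but
    install the flow under the key the *kernel* reported, so the entry
    matches exactly what the kernel will look up."""
    actions = decide_actions(userspace_extract(pkt))
    kernel_flows[kernel_key] = actions

kernel_flows = {}
pkt = {"src": "10.0.0.1", "dst": "10.0.0.2", "vlan": 100}
handle_upcall(pkt, kernel_extract(pkt), kernel_flows,
              lambda key: ["output:2"])
hit = kernel_flows.get(kernel_extract(pkt))   # the kernel's next lookup
```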
It's obviously possible to put this code anywhere, whether it is an
independent module, in the bridge, or tc. Regardless, however, it's
largely new code that is geared towards this particular model so it
seems better not to add to the complexity of existing components if at
all possible.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Open vSwitch Design
From: jamal @ 2011-11-24 22:30 UTC (permalink / raw)
To: Jesse Gross
Cc: dev, Chris Wright, Herbert Xu, Eric Dumazet, netdev, John Fastabend,
    Stephen Hemminger, David Miller

Jesse,

I am going to try and respond to your comments below.

On Thu, 2011-11-24 at 12:10 -0800, Jesse Gross wrote:

> * Switching infrastructure: As the name implies, Open vSwitch is
> intended to be a network switch, focused on
> virtualization/OpenFlow/software defined networking. [...]

Can you explain why you couldn't use the current bridge code (likely
with some mods)? I can see you want to isolate the VMs via the virtual
ports; maybe even vlans on the virtual ports - the current bridge code
should be able to handle that.

> * Flow lookup: Although used to implement OpenFlow, the kernel flow
> table does not actually directly contain OpenFlow flows. [...]
> Instead an exact match flow table is populated on-demand from
> userspace based on the more complex rules stored there. [...]

This can be achieved easily with zero changes to the kernel code.
You need to have default filters that redirect flows to user space
when you fail to match.

> * Packet execution: Once a flow is matched it can be output,
> enqueued to a particular qdisc, etc. [...]

The tc classifier-action-qdisc infrastructure handles this.
The sampler needs a new action defined.

> * Userspace interfaces: One of the difficulties of having a
> specialized, exact match flow lookup engine is maintaining
> compatibility across differing kernel/userspace versions. [...]

I didn't quite follow - are we talking about backward/forward
compatibility?

> It's obviously possible to put this code anywhere, whether it is an
> independent module, in the bridge, or tc. [...]

I am still not seeing why this could not be done with the
infrastructure that already exists. Granted, the user space brains -
that's where everything else resides - but you are not pushing that
i think.

cheers,
jamal
* Re: Open vSwitch Design
From: Stephen Hemminger @ 2011-11-25 5:20 UTC (permalink / raw)
To: jamal
Cc: dev, Chris Wright, Herbert Xu, Eric Dumazet, netdev, John Fastabend,
    David Miller

On Thu, 24 Nov 2011 17:30:33 -0500, jamal wrote:

> Can you explain why you couldn't use the current bridge code (likely
> with some mods)? I can see you want to isolate the VMs via the
> virtual ports; maybe even vlans on the virtual ports - the current
> bridge code should be able to handle that.

The way openvswitch works is that the flow table is populated by user
space. The kernel bridge works completely differently (it learns about
MAC addresses).

> This can be achieved easily with zero changes to the kernel code.
> You need to have default filters that redirect flows to user space
> when you fail to match.

Actually, this is what puts me off on the current implementation. I
would prefer that the kernel implementation was just a software
implementation of a hardware OpenFlow switch. That way it would be
transparent that the control plane in user space was talking to kernel
or hardware.

> The tc classifier-action-qdisc infrastructure handles this.
> The sampler needs a new action defined.

There are too many damn layers in the software path already.

> I didn't quite follow - are we talking about backward/forward
> compatibility?

The problem is that there are two flow classifiers, one in OpenVswitch
in the kernel, and the other in the user space flow manager. I think
the issue is that the two have different code.

Is the kernel/userspace API for OpenVswitch nailed down and documented
well enough that alternative control plane software could be built?

> I am still not seeing why this could not be done with the
> infrastructure that already exists. Granted, the user space brains -
> that's where everything else resides - but you are not pushing that
> i think.
* Re: Open vSwitch Design
From: Eric Dumazet @ 2011-11-25 6:18 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev, Chris Wright, Herbert Xu, netdev, jamal, John Fastabend,
    David Miller

On Thursday, 24 November 2011 at 21:20 -0800, Stephen Hemminger wrote:

> The problem is that there are two flow classifiers, one in OpenVswitch
> in the kernel, and the other in the user space flow manager. I think
> the issue is that the two have different code.

We have kind of the same duplication in kernel already :)

__skb_get_rxhash() and net/sched/cls_flow.c contain roughly the same
logic...

Maybe it's time to factorize the thing, and eventually use it in a
third component (Open vSwitch...)

_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev
* Re: Open vSwitch Design
From: David Miller @ 2011-11-25 6:25 UTC (permalink / raw)
To: Eric Dumazet
Cc: dev, Chris Wright, netdev, jamal, John Fastabend, Herbert Xu,
    Stephen Hemminger

From: Eric Dumazet
Date: Fri, 25 Nov 2011 07:18:03 +0100

> __skb_get_rxhash() and net/sched/cls_flow.c contain roughly the same
> logic...
>
> Maybe it's time to factorize the thing, and eventually use it in a
> third component (Open vSwitch...)

Yes.
* Re: Open vSwitch Design
From: Eric Dumazet @ 2011-11-25 6:36 UTC (permalink / raw)
To: David Miller
Cc: dev, Chris Wright, Florian Westphal, netdev, jamal, John Fastabend,
    Herbert Xu, Stephen Hemminger

On Friday, 25 November 2011 at 01:25 -0500, David Miller wrote:

> > Maybe it's time to factorize the thing, and eventually use it in a
> > third component (Open vSwitch...)
>
> Yes.

A third reason to do that anyway is that net/sched/sch_sfb.c should use
__skb_get_rxhash() providing the perturbation itself, and not use the
standard (hashrnd) one.

Right now, if two flows share the same rxhash, the double SFB hash will
also share the same final hash.

(This point was mentioned by Florian Westphal)
* Re: Open vSwitch Design
From: jamal @ 2011-11-25 11:34 UTC (permalink / raw)
To: Eric Dumazet
Cc: dev, Chris Wright, netdev, Florian Westphal, John Fastabend,
    Herbert Xu, Stephen Hemminger, David Miller

Hrm. I forgot about the flow classifier - it may be what the openflow
folks need. It is more friendly for the well defined tuples than u32.

But what do you mean "refactor"? I can already use this classifier
and attach actions to set policy in the kernel.

cheers,
jamal

On Fri, 2011-11-25 at 07:36 +0100, Eric Dumazet wrote:

> A third reason to do that anyway is that net/sched/sch_sfb.c should
> use __skb_get_rxhash() providing the perturbation itself, and not use
> the standard (hashrnd) one.
>
> Right now, if two flows share the same rxhash, the double SFB hash
> will also share the same final hash.
>
> (This point was mentioned by Florian Westphal)
* Re: Open vSwitch Design
From: Eric Dumazet @ 2011-11-25 13:02 UTC (permalink / raw)
To: jamal
Cc: dev, Chris Wright, netdev, Florian Westphal, John Fastabend,
    Herbert Xu, Stephen Hemminger, David Miller

On Friday, 25 November 2011 at 06:34 -0500, jamal wrote:

> Hrm. I forgot about the flow classifier - it may be what the openflow
> folks need. It is more friendly for the well defined tuples than u32.
>
> But what do you mean "refactor"? I can already use this classifier
> and attach actions to set policy in the kernel.

cls_flow is not complete, since it doesn't handle tunnels for example.

It calls a 'partial flow classifier' to find each needed element, one
by one. (Adding tunnel decap would need to perform this several times
for each packet.)

__skb_get_rxhash() is more tunnel aware, yet some protocols are still
missing, for example IPPROTO_IPV6.

Instead of adding logic to both dissectors, we could have a central
flow dissector, filling a temporary pivot structure with the found
elements (src addr, dst addr, ports, ...), going through tunnel encaps
if found.

Then net/sched/cls_flow.c could pick the needed elems from this
structure to compute the hash as specified in the tc command
(for example: tc filter ... flow hash keys proto-dst,dst ...).

(One dissector call per packet, for any number of keys in the filter.)

Same for net/sched/sch_sfb.c: use the pivot structure and compute the
two hashes (using two hashrnd values).

And __skb_get_rxhash() could use the same flow dissector, and pick
(src addr, dst addr, ports) to compute skb->rxhash, and set
skb->l4_rxhash if "ports" is not null.
* [PATCH net-next 0/4] net: factorize flow dissector
From: Eric Dumazet @ 2011-11-28 15:20 UTC (permalink / raw)
To: jamal
Cc: dev, Chris Wright, netdev, Florian Westphal, John Fastabend,
    Herbert Xu, Stephen Hemminger, Dan Siemon, David Miller

On Friday, 25 November 2011 at 14:02 +0100, Eric Dumazet wrote:

> Instead of adding logic to both dissectors, we could have a central
> flow dissector, filling a temporary pivot structure with the found
> elements (src addr, dst addr, ports, ...), going through tunnel
> encaps if found. [...]

Here is a patch series doing this factorization / cleanup.

[PATCH net-next 1/4] net: introduce skb_flow_dissect()
[PATCH net-next 2/4] net: use skb_flow_dissect() in __skb_get_rxhash()
[PATCH net-next 3/4] cls_flow: use skb_flow_dissect()
[PATCH net-next 4/4] sch_sfb: use skb_flow_dissect()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Cumulative diffstat:

 include/net/flow_keys.h   |   15 +++
 net/core/Makefile         |    2
 net/core/dev.c            |  124 ++----------------------
 net/core/flow_dissector.c |  130 ++++++++++++++++++++++++++
 net/ipv4/tcp.c            |    8 -
 net/sched/cls_flow.c      |  180 +++++++++---------------------------
 net/sched/sch_sfb.c       |   17 ++-
 7 files changed, 225 insertions(+), 251 deletions(-)
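The idea behind the series above - one central routine filling a small "flow keys" record that rxhash, cls_flow, and sch_sfb can all consume - can be sketched as a rough userspace analogue. This is a deliberately simplified assumption-laden version (IPv4 only, no options beyond IHL, no tunnels), not the kernel code from the patches:

```python
# Simplified userspace analogue of a central flow dissector: parse an
# IPv4 packet once into a flow-keys record that multiple consumers
# (hashing, classification) can then pick fields from.
import struct
from collections import namedtuple

FlowKeys = namedtuple("FlowKeys", "src dst ip_proto ports")

def flow_dissect(ip_packet):
    """Extract (src, dst, proto, ports) from a raw IPv4 packet; for
    TCP/UDP, 'ports' is the 4-byte source/dest port pair."""
    ver_ihl, = struct.unpack_from("!B", ip_packet, 0)
    ihl = (ver_ihl & 0x0F) * 4
    proto, = struct.unpack_from("!B", ip_packet, 9)
    src, dst = struct.unpack_from("!II", ip_packet, 12)
    ports = 0
    if proto in (6, 17):  # TCP or UDP
        ports, = struct.unpack_from("!I", ip_packet, ihl)
    return FlowKeys(src, dst, proto, ports)

# Minimal IPv4+TCP header: 10.0.0.1 -> 10.0.0.2, sport 1234, dport 80
hdr = struct.pack("!BBHHHBBHII", 0x45, 0, 40, 0, 0, 64, 6, 0,
                  0x0A000001, 0x0A000002)
tcp = struct.pack("!HH", 1234, 80) + b"\x00" * 16
keys = flow_dissect(hdr + tcp)
```

The point of the factorization is that the packet is walked once per packet, and each consumer then selects the fields it cares about from the filled record.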
* Re: Open vSwitch Design
From: Jesse Gross @ 2011-11-25 20:20 UTC (permalink / raw)
To: jamal
Cc: dev, Chris Wright, Eric Dumazet, netdev, Florian Westphal,
    John Fastabend, Herbert Xu, Stephen Hemminger, David Miller

On Fri, Nov 25, 2011 at 3:34 AM, jamal wrote:

> Hrm. I forgot about the flow classifier - it may be what the openflow
> folks need. It is more friendly for the well defined tuples than u32.

The flow classifier isn't really designed to do rule lookup in the way
that OpenFlow/Open vSwitch does, since it's more about choosing which
fields are considered significant to the flow. I'm sure that it could
be extended in some way, but it seems that the better approach would be
to factor out the common pieces (such as the header extraction
mentioned before) than to try to cram both models into one component.

I understand that you see some commonalities with various parts of the
system, but often there are enough conceptual differences that you end
up trying to shove a square peg into a round hole. As Stephen
mentioned about the bridge, many of these components are already
fairly complex and combining more functionality into them isn't always
a win.
* Re: Open vSwitch Design
From: Jamal Hadi Salim @ 2011-11-26 1:23 UTC (permalink / raw)
To: Jesse Gross
Cc: dev, Chris Wright, Eric Dumazet, netdev, Florian Westphal,
    John Fastabend, Herbert Xu, Stephen Hemminger, David Miller

On Fri, 2011-11-25 at 12:20 -0800, Jesse Gross wrote:

> The flow classifier isn't really designed to do rule lookup in the
> way that OpenFlow/Open vSwitch does, since it's more about choosing
> which fields are considered significant to the flow. [...]

Yes, it would need a tweak or two. But u32 would. And the action
subsystem does.

> I understand that you see some commonalities with various parts of
> the system, but often there are enough conceptual differences that
> you end up trying to shove a square peg into a round hole.

I have done this for years. I have very good knowledge of merchant
silicon and i have programmed them on Linux; i know this space a lot
more than you are assuming. If you can point me to _one_, just _one_,
thing that you do in the classifier-action piece that cannot be done
in Linux today and is more flexible in your setup than it is on Linux,
we can have a useful discussion.

> As Stephen mentioned about the bridge, many of these components are
> already fairly complex and combining more functionality into them
> isn't always a win.

I think the bridge started on a bad foot by not properly integrating
with Vlans and tightly integrating STP control in the kernel.

cheers,
jamal
* Re: Open vSwitch Design
From: Jesse Gross @ 2011-11-25 20:14 UTC (permalink / raw)
To: Eric Dumazet
Cc: dev, Chris Wright, Herbert Xu, netdev, jamal, John Fastabend,
    Stephen Hemminger, David Miller

On Thu, Nov 24, 2011 at 10:18 PM, Eric Dumazet wrote:

> We have kind of the same duplication in kernel already :)
>
> __skb_get_rxhash() and net/sched/cls_flow.c contain roughly the same
> logic...
>
> Maybe it's time to factorize the thing, and eventually use it in a
> third component (Open vSwitch...)

I agree, there's no need to have three copies of packet header parsing
code, and that's certainly something that we would be willing to work
on improving.
* Re: Open vSwitch Design
From: jamal @ 2011-11-25 11:24 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev, Chris Wright, Herbert Xu, Eric Dumazet, netdev, John Fastabend,
    David Miller

On Thu, 2011-11-24 at 21:20 -0800, Stephen Hemminger wrote:

> The way openvswitch works is that the flow table is populated by user
> space. The kernel bridge works completely differently (it learns
> about MAC addresses).

Most hardware bridges out there support all the different modes: you
can have learning in the hardware or defer it to the user/control
plane by setting some flags. You can have broadcasting done in
hardware or defer it to user space. The mods i was thinking of were to
bring the Linux bridge to the same behavior. You then need to allow
netlink updates of the bridge MAC table from user space. There may be
weaknesses in the current bridging code in relation to Vlans that need
to be addressed.

[But my concern was not so much the bridge - because changes are
needed in that case; it is the "match, actionlist" that is already in
place that got to me.]

> Actually, this is what puts me off on the current implementation. I
> would prefer that the kernel implementation was just a software
> implementation of a hardware OpenFlow switch. That way it would be
> transparent that the control plane in user space was talking to
> kernel or hardware.

Or alternatively, allow the bridge code to support the different
modes. Learning as well as broadcasting mode needs to be settable.
Then you have an interesting capability in the kernel that meets the
requirements of an OpenFlow switch (+ anyone who wants to do policy
control in user space with their favorite standard).

> The tc classifier-action-qdisc infrastructure handles this.
> The sampler needs a new action defined.
>
> There are too many damn layers in the software path already.

I think what they are doing in the separation of control and data is
reasonable. The policy and control are in user space. The fastpath is
in the kernel, and it may be in a variety of spots (some arp entry
here, some L3 entry there, a couple of match-action items, etc.); the
brains which understand what the different things mean in aggregation,
in terms of a service, are in user space.

> The problem is that there are two flow classifiers, one in OpenVswitch
> in the kernel, and the other in the user space flow manager. I think
> the issue is that the two have different code.

I see. I can understand having a simple classifier in the kernel and
more complex "consulting" sitting in user space which updates the
kernel on how to deal with subsequent flow packets.

> Is the kernel/userspace API for OpenVswitch nailed down and documented
> well enough that alternative control plane software could be built?

They do have a generic netlink interface. I would prefer the netlink
interfaces already in place (which would have worked if they had used
the stuff already there).

cheers,
jamal
* Re: Open vSwitch Design
From: Stephen Hemminger @ 2011-11-25 17:28 UTC (permalink / raw)
To: jamal
Cc: dev, Chris Wright, Herbert Xu, Eric Dumazet, netdev, John Fastabend,
    David Miller

On Fri, 25 Nov 2011 06:24:36 -0500, jamal wrote:

> Most hardware bridges out there support all the different modes: you
> can have learning in the hardware or defer it to the user/control
> plane by setting some flags. [...] The mods i was thinking of were to
> bring the Linux bridge to the same behavior. You then need to allow
> netlink updates of the bridge MAC table from user space. [...]

The bridge module is already overly complex. Rather than adding more
modes, it should be split into separate modules. If you look at
macvlan, you will see it is already a subset of the bridge. Another
example of this is the team driver, which is really just a subset of
the bonding code.
* Re: Open vSwitch Design
  [not found] ` <20111124212021.2ae2fb7f-QE31Isp8l5DVJhW05BI4jyWSNWFUUkiGXqFh9Ls21Oc@public.gmane.org>
  2011-11-25  6:18   ` Eric Dumazet
  2011-11-25 11:24   ` jamal
@ 2011-11-25 17:55   ` Jesse Gross
  2011-11-25 19:52   ` Justin Pettit
  3 siblings, 0 replies; 21+ messages in thread
From: Jesse Gross @ 2011-11-25 17:55 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu, Eric Dumazet,
	netdev, hadi-fAAogVwAN2Kw5LPnMra/2Q, jhs-jkUAjuhPggJWk0Htik3J/w,
	John Fastabend, David Miller

On Thu, Nov 24, 2011 at 9:20 PM, Stephen Hemminger <shemminger@vyatta.com> wrote:
> On Thu, 24 Nov 2011 17:30:33 -0500
> jamal <hadi@cyberus.ca> wrote:
>> On Thu, 2011-11-24 at 12:10 -0800, Jesse Gross wrote:
>> > * Userspace interfaces: One of the difficulties of having a
>> > specialized, exact match flow lookup engine is maintaining
>> > compatibility across differing kernel/userspace versions. This
>> > compatibility shows up heavily in the userspace interfaces and is
>> > achieved by passing the kernel's version of the flow along with packet
>> > information. This allows userspace to install appropriate flows even
>> > if its interpretation of a packet differs from the kernel's without
>> > version checks or maintaining multiple implementations of the flow
>> > extraction code in the kernel.
>>
>> I didn't quite follow - are we talking about backward/forward
>> compatibility?
>
> The problem is that there are two flow classifiers, one in OpenVswitch
> in the kernel, and the other in the user space flow manager. I think the
> issue is that the two have different code.

Yes, since userspace is installing exact match entries, these flows
obviously need to be of the same form that the kernel would extract from
the packet. Over time, I'm sure that additional packet formats will be
added, so it's important to handle the case where there is a mismatch.

> Is the kernel/userspace API for OpenVswitch nailed down and documented
> well enough that alternative control plane software could be built?

Yes, that's actually the reason why it took so long to submit the code
for upstream - we spent a lot of time cleaning up and stripping down the
interfaces so they could be locked down (or cleanly extended). There's a
fair amount of documentation on how to maintain compatibility for flows
as mentioned above in the patch that I submitted, and we're certainly
happy to write more if other things are unclear.

_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 21+ messages in thread
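[Editor's note: the compatibility mechanism Jesse describes - userspace echoing the kernel's own flow key back rather than re-extracting it - can be illustrated with a toy model. Field names and the sample policy are invented for illustration; the real exchange is a netlink upcall carrying the kernel's serialized flow attributes.]

```python
# Toy illustration: on a miss the "kernel" sends its own flow key along
# with the packet; "userspace" picks an action however it likes, but
# installs the entry using the kernel's key verbatim. Because userspace
# never re-extracts the key, a kernel with a richer (or older) flow
# format still receives entries it will actually match on.
def kernel_extract(packet):
    # A different kernel version might include more or fewer fields here.
    return (packet["src"], packet["dst"], packet.get("vlan", 0))

def upcall(packet):
    # kernel -> userspace: (kernel's flow key, packet data)
    return kernel_extract(packet), packet

def userspace_handle(upcall_msg, flow_table):
    kernel_key, packet = upcall_msg
    action = "output:1" if packet["dst"] == "10.0.0.2" else "drop"
    flow_table[kernel_key] = action    # echo kernel's key back unchanged
    return action

table = {}
pkt = {"src": "10.0.0.1", "dst": "10.0.0.2", "vlan": 100}
act = userspace_handle(upcall(pkt), table)
```

The key point is that `userspace_handle` treats `kernel_key` as opaque: no version checks and no duplicate flow-extraction code are needed on the userspace side.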
* Re: Open vSwitch Design
  [not found] ` <20111124212021.2ae2fb7f-QE31Isp8l5DVJhW05BI4jyWSNWFUUkiGXqFh9Ls21Oc@public.gmane.org>
  ` (2 preceding siblings ...)
  2011-11-25 17:55   ` Jesse Gross
@ 2011-11-25 19:52   ` Justin Pettit
  [not found]   ` <2DB44B16-598F-4414-8B35-8E322D705A9A-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
  3 siblings, 1 reply; 21+ messages in thread
From: Justin Pettit @ 2011-11-25 19:52 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu, Eric Dumazet,
	netdev, hadi-fAAogVwAN2Kw5LPnMra/2Q, jhs-jkUAjuhPggJWk0Htik3J/w,
	John Fastabend, David Miller

On Nov 24, 2011, at 9:20 PM, Stephen Hemminger wrote:

>> This can be achieved easily with zero changes to the kernel code.
>> You need to have default filters that redirect flows to user space
>> when you fail to match.
>
> Actually, this is what puts me off on the current implementation.
> I would prefer that the kernel implementation was just a software
> implementation of a hardware OpenFlow switch. That way it would
> be transparent that the control plane in user space was talking to kernel
> or hardware.

A big difficulty is finding an appropriate hardware abstraction. I've
worked on porting Open vSwitch to a few different vendors' switching
ASICs, and they've all looked quite different from each other. Even
within a vendor, there can be fairly substantial differences. Packet
processing is broken up into stages (e.g., VLAN preprocessing, ingress
ACL processing, L2 lookup, L3 lookup, packet modification, packet
queuing, packet replication, egress ACL processing, etc.), and these can
be done in different orders and have quite different behaviors. Also,
the size of the various tables varies widely between ASICs--even within
the same family.

Hardware typically makes use of TCAMs, which support fast lookups of
wildcarded flows. They're expensive, though, so they're typically
limited to entries in the very low thousands. In software, we can
trivially store 100,000s of entries, but supporting wildcarded lookups
is very slow. If we only use exact-match flows in the kernel (and leave
the wildcarding in userspace for kernel misses), we can do extremely
fast lookups with hashing on what becomes the fastpath.

Using exact-match entries has another big advantage: we can innovate the
userspace portion without requiring changes to the kernel. For example,
we recently went from supporting a single OpenFlow table to 255 without
any kernel changes. This has an added benefit that a flow requiring
multiple table lookups becomes a single hash lookup in the kernel, which
is a huge performance gain in the fastpath. Another example is our
introduction of a number of metadata "registers" between tables that are
never seen in the kernel, but open up a lot of interesting applications
for OpenFlow controller writers.

If you're interested, we include a porting guide in the distribution
that describes how one would go about bringing Open vSwitch to a new
hardware or software platform:

http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING

Obviously, it's not that relevant here, since there's already a port to
Linux. :-) But we've iterated over a few different designs and worked on
other ports, and we've found this hardware/software abstraction layer to
work pretty well. In fact, multiple ports of Open vSwitch have been done
by name-brand third party vendors (this is the avenue most vendors use
to get their OpenFlow support) and are now shipping.

We're always open to discussing ways that we can improve these
interfaces, too, of course!

--Justin

^ permalink raw reply	[flat|nested] 21+ messages in thread
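[Editor's note: Justin's point that "a flow requiring multiple table lookups becomes a single hash lookup in the kernel" can be sketched as a toy model. The three pipeline stages and their rules below are invented for illustration; Open vSwitch's real userspace classifier is far more capable.]

```python
# Toy model: a multi-stage wildcard pipeline runs once in "userspace",
# and its end-to-end result is installed as a single exact-match entry,
# so later packets of the flow cost exactly one hash lookup.
def wildcard_pipeline(key, tables):
    # Each "table" is a list of (predicate, action); first match wins.
    actions = []
    for table in tables:
        for pred, action in table:
            if pred(key):
                actions.append(action)
                break
    return tuple(actions)

tables = [
    [(lambda k: k[0].startswith("10."), "allow")],          # ACL-ish stage
    [(lambda k: k[1] == 80, "set_queue:1"),                 # QoS stage
     (lambda k: True, "set_queue:0")],
    [(lambda k: True, "output:3")],                         # forwarding stage
]

kernel_table = {}
key = ("10.0.0.5", 80)
kernel_table[key] = wildcard_pipeline(key, tables)  # 3 stages -> 1 entry
```

However many wildcard tables userspace grows (OpenFlow went from 1 to 255), `kernel_table` stays a flat exact-match hash, which is why the change needed no kernel modifications.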
[parent not found: <2DB44B16-598F-4414-8B35-8E322D705A9A-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>]
* Re: Open vSwitch Design
  [not found] ` <2DB44B16-598F-4414-8B35-8E322D705A9A-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
@ 2011-11-26  1:11   ` Jamal Hadi Salim
  2011-11-26  4:38   ` Stephen Hemminger
  2011-11-28 18:34   ` Justin Pettit
  0 siblings, 2 replies; 21+ messages in thread
From: Jamal Hadi Salim @ 2011-11-26 1:11 UTC (permalink / raw)
  To: Justin Pettit
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu, Eric Dumazet,
	netdev, John Fastabend, Stephen Hemminger, David Miller

On Fri, 2011-11-25 at 11:52 -0800, Justin Pettit wrote:
> On Nov 24, 2011, at 9:20 PM, Stephen Hemminger wrote:
>
> A big difficulty is finding an appropriate hardware abstraction. I've
> worked on porting Open vSwitch to a few different vendors' switching
> ASICs, and they've all looked quite different from each other. Even
> within a vendor, there can be fairly substantial differences. Packet
> processing is broken up into stages (e.g., VLAN preprocessing, ingress
> ACL processing, L2 lookup, L3 lookup, packet modification, packet
> queuing, packet replication, egress ACL processing, etc.) and these can
> be done in different orders and have quite different behaviors.

There's some discussion going on on how to get ASIC support on the
variety of chips with different offloads (qos, L2 etc); you may wanna
share your experiences.
Having said that - in the kernel we have all the mechanisms you describe
above with quite a good fit. Speaking from experience of working on some
vendors' ASICs (of which i am sure at least one you are working on).
As an example, the ACL can be applied before or after L2 or L3. We can
support wildcard matching to user space and exact matches in the kernel.

> Also, the size of the various tables varies widely between ASICs--even
> within the same family.
>
> Hardware typically makes use of TCAMs, which support fast lookups of
> wildcarded flows. They're expensive, though, so they're typically
> limited to entries in the very low thousands.

Those are problems with most merchant silicon - small tables; but there
are some which are easily expandable via DRAM to support a full BGP
table for example.

> In software, we can trivially store 100,000s of entries, but supporting
> wildcarded lookups is very slow. If we only use exact-match flows in
> the kernel (and leave the wildcarding in userspace for kernel misses),
> we can do extremely fast lookups with hashing on what becomes the
> fastpath.

Justin - there's nothing new you need in the kernel to have that
feature. Let me rephrase that: that has not been a new feature for at
least a decade in Linux.
Add exact match filters with higher priority. Have the lowest priority
filter redirect to user space. Let user space look up some service rule;
have it download to the kernel one or more exact matches. Let the packet
proceed on its way down the kernel to its destination if that's what is
defined.

> Using exact-match entries has another big advantage: we can innovate
> the userspace portion without requiring changes to the kernel. For
> example, we recently went from supporting a single OpenFlow table to
> 255 without any kernel changes. This has an added benefit that a flow
> requiring multiple table lookups becomes a single hash lookup in the
> kernel, which is a huge performance gain in the fastpath. Another
> example is our introduction of a number of metadata "registers" between
> tables that are never seen in the kernel, but open up a lot of
> interesting applications for OpenFlow controller writers.

That bit sounds interesting - I will look at your spec.

> If you're interested, we include a porting guide in the distribution
> that describes how one would go about bringing Open vSwitch to a new
> hardware or software platform:
>
> http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING
>
> Obviously, it's not that relevant here, since there's already a port to
> Linux. :-)

Does this mean i can have a 24x10G switch sitting in hardware with Linux
hardware support if i use your kernel switch? Do the vendors agree to
some common interface?

> But we've iterated over a few different designs and worked on other
> ports, and we've found this hardware/software abstraction layer to work
> pretty well. In fact, multiple ports of Open vSwitch have been done by
> name-brand third party vendors (this is the avenue most vendors use to
> get their OpenFlow support) and are now shipping.
>
> We're always open to discussing ways that we can improve these
> interfaces, too, of course!

Make these vendor switches work with plain Linux. The Intel folks are
producing interfaces with L2, ACLs, VIs and are putting some effort to
integrate them into plain Linux. I should be able to set the qos rules
with tc on an intel chip.
You guys can still take advantage of all that and still have your nice
control plane.

cheers,
jamal

^ permalink raw reply	[flat|nested] 21+ messages in thread
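[Editor's note: Jamal's recipe - high-priority exact-match filters with a lowest-priority catch-all toward user space - might look roughly like the tc sketch below. Device names (`eth0`, `eth1`, `tap0`) are hypothetical, and since tc has no native "send to userspace" action, the catch-all here redirects misses to a tap device that a userspace daemon would read; that daemon is the piece that installs further exact-match filters.]

```shell
# Attach an ingress qdisc to classify incoming traffic.
tc qdisc add dev eth0 handle ffff: ingress

# High-priority exact-match entry, installed after consulting userspace
# policy: this specific flow is forwarded in the kernel fastpath.
tc filter add dev eth0 parent ffff: prio 1 protocol ip u32 \
    match ip src 10.0.0.1/32 match ip dport 80 0xffff \
    action mirred egress redirect dev eth1

# Lowest-priority default: hand all unmatched traffic to the userspace
# agent (reading tap0), which looks up its service rules and downloads
# more prio-1 exact matches like the one above.
tc filter add dev eth0 parent ffff: prio 99 protocol ip u32 \
    match u32 0 0 \
    action mirred egress redirect dev tap0
```

This uses only long-standing tc building blocks (ingress qdisc, u32 classifier, mirred action), which is Jamal's point that no new kernel feature is required for the exact-match-plus-miss pattern itself.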
* Re: Open vSwitch Design
  2011-11-26  1:11   ` Jamal Hadi Salim
@ 2011-11-26  4:38   ` Stephen Hemminger
  [not found]   ` <ec23d63d-27c9-4761-bdd3-e3f54bdb5e77-bX68f012229Xuxj3zoTs5AC/G2K4zDHf@public.gmane.org>
  2011-11-28 18:34   ` Justin Pettit
  1 sibling, 1 reply; 21+ messages in thread
From: Stephen Hemminger @ 2011-11-26 4:38 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu, Eric Dumazet,
	netdev, John Fastabend, David Miller

Not sure how Openvswitch implementation relates to Openflow
specification.

There are a few switches supporting Openflow already:
  http://www.openflow.org/wp/switch-nec/
  http://www-03.ibm.com/systems/x/options/networking/bnt8264/index.html

The standard(s) are here:
  http://www.openflow.org/wp/documents/

Good info from recent symposium:
  http://opennetsummit.org/past_conferences.html

^ permalink raw reply	[flat|nested] 21+ messages in thread
[parent not found: <ec23d63d-27c9-4761-bdd3-e3f54bdb5e77-bX68f012229Xuxj3zoTs5AC/G2K4zDHf@public.gmane.org>]
* Re: Open vSwitch Design
  [not found] ` <ec23d63d-27c9-4761-bdd3-e3f54bdb5e77-bX68f012229Xuxj3zoTs5AC/G2K4zDHf@public.gmane.org>
@ 2011-11-26  8:05   ` Martin Casado
  0 siblings, 0 replies; 21+ messages in thread
From: Martin Casado @ 2011-11-26 8:05 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu, Eric Dumazet,
	netdev, Jamal Hadi Salim, John Fastabend, David Miller

> Not sure how Openvswitch implementation relates to Openflow specification.

The short answer is that Open vSwitch serves as one of the standard
reference implementations for OpenFlow (in fact, the primary developers
of Open vSwitch were some of the original designers of OpenFlow).
Multiple hardware switches on the market use Open vSwitch as the basis
for their OpenFlow support.

> There are a few switches supporting Openflow already:
> http://www.openflow.org/wp/switch-nec/
> http://www-03.ibm.com/systems/x/options/networking/bnt8264/index.html

There are many other ports announced or available from vendors such as
HP, Brocade, Pica8, Extreme, Juniper, and NetGear. Cisco has even
announced support for OpenFlow on the Nexus 3k
(http://www.lightreading.com/document.asp?doc_id=213545).

.martin

> The standard(s) are here:
> http://www.openflow.org/wp/documents/
>
> Good info from recent symposium:
> http://opennetsummit.org/past_conferences.html

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Martin Casado
Nicira Networks, Inc.
www.nicira.com
cell: 650-776-1457
~~~~~~~~~~~~~~~~~~~~~~~~~~~

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Open vSwitch Design
  2011-11-26  1:11   ` Jamal Hadi Salim
  2011-11-26  4:38   ` Stephen Hemminger
@ 2011-11-28 18:34   ` Justin Pettit
  2011-11-28 22:42   ` Jamal Hadi Salim
  1 sibling, 1 reply; 21+ messages in thread
From: Justin Pettit @ 2011-11-28 18:34 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu, Eric Dumazet,
	netdev, John Fastabend, Stephen Hemminger, David Miller

On Nov 25, 2011, at 5:11 PM, Jamal Hadi Salim wrote:

>> A big difficulty is finding an appropriate hardware abstraction. I've
>> worked on porting Open vSwitch to a few different vendors' switching
>> ASICs, and they've all looked quite different from each other. Even
>> within a vendor, there can be fairly substantial differences. Packet
>> processing is broken up into stages (e.g., VLAN preprocessing, ingress
>> ACL processing, L2 lookup, L3 lookup, packet modification, packet
>> queuing, packet replication, egress ACL processing, etc.) and these
>> can be done in different orders and have quite different behaviors.
>
> There's some discussion going on on how to get ASIC support on the
> variety of chips with different offloads (qos, L2 etc); you may wanna
> share your experiences.

Are you talking about ASICs on NICs? I was referring to integrating Open
vSwitch into top-of-rack switches. These typically have a 48x1G or
48x10G switching ASIC and a relatively slow (~800MHz PPC-class)
management CPU running an operating system like Linux. There's no way
that these systems can have a standard CPU on the fastpath.

> Having said that - in the kernel we have all the mechanisms you
> describe above with quite a good fit. Speaking from experience of
> working on some vendors' ASICs (of which i am sure at least one you are
> working on). As an example, the ACL can be applied before or after L2
> or L3. We can support wildcard matching to user space and exact matches
> in the kernel.

I understood the original question to be: Can we make the interface to
the kernel look like a hardware switch? My answer had two main parts.
First, I don't think we could define a "standard" hardware interface,
since they're all very different. Second, even if we could, I think a
software fastpath's strengths and weaknesses are such that the hardware
model wouldn't be ideal.

>> Also, the size of the various tables varies widely between ASICs--even
>> within the same family.
>>
>> Hardware typically makes use of TCAMs, which support fast lookups of
>> wildcarded flows. They're expensive, though, so they're typically
>> limited to entries in the very low thousands.
>
> Those are problems with most merchant silicon - small tables; but there
> are some which are easily expandable via DRAM to support a full BGP
> table for example.

The problem is that DRAM isn't going to cut it on the ACL tables--which
are typically used for flow-based matching--on a 48x10G (or even 48x1G)
switch. I've seen a couple of switching ASICs that support many 10s of
thousands of ACL entries, but they require expensive external TCAMs for
lookup and SRAM for counters. Most of the white box vendors that I've
seen that use those ASICs don't bother adding the external TCAM and SRAM
to their designs. Even when they are added, their matching capabilities
are typically limited in order to keep up with traffic.

>> In software, we can trivially store 100,000s of entries, but
>> supporting wildcarded lookups is very slow. If we only use exact-match
>> flows in the kernel (and leave the wildcarding in userspace for kernel
>> misses), we can do extremely fast lookups with hashing on what becomes
>> the fastpath.
>
> Justin - there's nothing new you need in the kernel to have that
> feature. Let me rephrase that: that has not been a new feature for at
> least a decade in Linux.
> Add exact match filters with higher priority. Have the lowest priority
> filter redirect to user space. Let user space look up some service
> rule; have it download to the kernel one or more exact matches.
> Let the packet proceed on its way down the kernel to its destination if
> that's what is defined.

My point was that a software fastpath should look different than a
hardware-based one.

>> Using exact-match entries has another big advantage: we can innovate
>> the userspace portion without requiring changes to the kernel. For
>> example, we recently went from supporting a single OpenFlow table to
>> 255 without any kernel changes. This has an added benefit that a flow
>> requiring multiple table lookups becomes a single hash lookup in the
>> kernel, which is a huge performance gain in the fastpath. Another
>> example is our introduction of a number of metadata "registers"
>> between tables that are never seen in the kernel, but open up a lot of
>> interesting applications for OpenFlow controller writers.
>
> That bit sounds interesting - I will look at your spec.

Great!

>> If you're interested, we include a porting guide in the distribution
>> that describes how one would go about bringing Open vSwitch to a new
>> hardware or software platform:
>>
>> http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING
>>
>> Obviously, it's not that relevant here, since there's already a port
>> to Linux. :-)
>
> Does this mean i can have a 24x10G switch sitting in hardware with
> Linux hardware support if i use your kernel switch?

Yes, Open vSwitch has been ported to 24x10G ASICs running Linux on their
management CPUs. However, in these cases the datapath is handled by
hardware and not the software forwarding plane, obviously.

> Do the vendors agree to some common interface?

Yes, if you view ofproto (as described in the porting guide) as that
interface. Every merchant silicon vendor I've seen views the interfaces
to their ASICs as proprietary. Someone (with the appropriate SDK and
licenses) needs to write providers for those different hardware ports.
We've helped multiple vendors do this and know a few others that have
done it on their own.

This really seems beside the point for this discussion, though. We've
written an ofproto provider for software switches called "dpif" (this is
also described in the porting guide). What we're proposing be included
in Linux is the kernel module that speaks to that dpif provider over a
well-defined, stable, netlink-based protocol.

Here's just a quick (somewhat simplified) summary of the different
layers. At the top, there are controllers and switches that communicate
using OpenFlow. OpenFlow gives controller writers the ability to inspect
and modify the switches' flow tables and interfaces. If a packet doesn't
match an existing flow entry, it is forwarded to the controller for
further processing. OpenFlow 1.0 was pretty basic and exposed a single
flow table. OpenFlow 1.1 introduced a number of new features including
multiple table support. The forthcoming OpenFlow 1.2 will include
support for extensible matches, which means that new fields may be added
without requiring a full revision of the specification. OpenFlow is
defined by the Open Networking Foundation and is not directly related to
Open vSwitch.

The userspace in Open vSwitch has an OpenFlow library that interacts
with the controllers. Userspace has its own classifier that supports
wildcard entries and multiple tables. Many of the changes to the
OpenFlow protocol only require modifying that library and perhaps some
of the glue code with the classifier. (In theory, other software-defined
networking protocols could be plugged in as well.) The classifier
interacts with the ofproto layer below it, which implements a fastpath.
On a hardware switch, since it supports wildcarding, it essentially
becomes a passthrough that just calls the appropriate APIs for the ASIC.
In software, as we've discussed, exact-match flows work better.

For that reason, we've defined the dpif layer, which is an ofproto
provider. Its primary purpose is to take high-level concepts like "treat
this group of interfaces as a LACP bond" or "support this set of
wildcard flow entries" and explode them into exact-match entries
on-demand. We've then implemented a Linux dpif provider that takes the
exact-match entries created by the dpif layer and converts them into
netlink messages that the kernel module understands. These messages are
well-defined and not specific to Open vSwitch or OpenFlow.

This layering has allowed us to introduce new OpenFlow-like features
such as multiple tables and non-OpenFlow features such as port
mirroring, STP, CCM, and new bonding modes without changes to the kernel
module. In fact, the only changes that should necessitate a kernel
interface change are new matches or actions, such as would be required
for handling MPLS.

>> But we've iterated over a few different designs and worked on other
>> ports, and we've found this hardware/software abstraction layer to
>> work pretty well. In fact, multiple ports of Open vSwitch have been
>> done by name-brand third party vendors (this is the avenue most
>> vendors use to get their OpenFlow support) and are now shipping.
>>
>> We're always open to discussing ways that we can improve these
>> interfaces, too, of course!
>
> Make these vendor switches work with plain Linux. The Intel folks are
> producing interfaces with L2, ACLs, VIs and are putting some effort to
> integrate them into plain Linux. I should be able to set the qos rules
> with tc on an intel chip.
> You guys can still take advantage of all that and still have your nice
> control plane.

Once again, I think we are talking about different things. I believe you
are discussing interfacing with NICs, which is quite different from a
high fanout switching ASIC. As I previously mentioned, the point of my
original post was that I think it would be best not to model a high
fanout switch in the interface to the kernel.

--Justin

^ permalink raw reply	[flat|nested] 21+ messages in thread
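[Editor's note: the dpif idea above - folding high-level concepts into the exact-match entries generated on demand, so the kernel never grows per-feature code - can be sketched with port mirroring as the example. Port numbers and helper names below are invented; real dpif entries are netlink attribute lists, not tuples.]

```python
# Toy version of the dpif layering: a high-level concept ("mirror all
# traffic to port 9") is folded into each exact-match entry as an extra
# plain output action, so the kernel module needs no "mirroring" feature.
MIRROR_PORT = 9          # hypothetical config, known only to userspace

def make_kernel_flow(key, forward_port, mirror_port=None):
    actions = [f"output:{forward_port}"]
    if mirror_port is not None:
        actions.append(f"output:{mirror_port}")  # mirroring as plain output
    return key, tuple(actions)

# Entries as the Linux dpif provider would hand them to the kernel:
kernel_table = dict([
    make_kernel_flow(("10.0.0.1", 80), 2, MIRROR_PORT),
    make_kernel_flow(("10.0.0.2", 22), 3, MIRROR_PORT),
])
```

Turning mirroring off, changing the mirror port, or swapping in an entirely different feature only changes how userspace generates entries; the kernel-side action vocabulary (match, output, etc.) stays fixed.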
* Re: Open vSwitch Design
  2011-11-28 18:34   ` Justin Pettit
@ 2011-11-28 22:42   ` Jamal Hadi Salim
  0 siblings, 0 replies; 21+ messages in thread
From: Jamal Hadi Salim @ 2011-11-28 22:42 UTC (permalink / raw)
  To: Justin Pettit
  Cc: Stephen Hemminger, Jesse Gross, netdev, dev, David Miller,
	Chris Wright, Herbert Xu, Eric Dumazet, John Fastabend

On Mon, 2011-11-28 at 10:34 -0800, Justin Pettit wrote:
> On Nov 25, 2011, at 5:11 PM, Jamal Hadi Salim wrote:
>
> Are you talking about ASICs on NICs?

I am indifferent - looking at it entirely from a control perspective.
i.e. if i do "ip link blah down" on a port i want that to work with zero
changes to iproute2; the only way you can achieve that is if you expose
those ports as netdevs. This is what I said was a good thing the Intel
folks were trying to achieve (and what Lennert has done for the small
Marvell switch chips).

> I was referring to integrating Open vSwitch into top-of-rack switches.
> These typically have a 48x1G or 48x10G switching ASIC and a relatively
> slow (~800MHz PPC-class) management CPU running an operating system
> like Linux. There's no way that these systems can have a standard CPU
> on the fastpath.

No, not the datapath; just control of the hardware. If i run "ip route
add .." i want that to work on the ASIC. Same with tc
action/classification. I want to run those tools and configure an ACL in
the ASIC with no new learning curve.

> I understood the original question to be: Can we make the interface to
> the kernel look like a hardware switch? My answer had two main parts.
> First, I don't think we could define a "standard" hardware interface,
> since they're all very different. Second, even if we could, I think a
> software fastpath's strengths and weaknesses are such that the hardware
> model wouldn't be ideal.

Not talking about the datapath - but the control interface to those
devices. We can't define how the low levels look. But if you expose
things using standard Linux interfaces, then user space tools and APIs
stay unchanged. Then i shouldn't care where the feature runs (hardware
NIC, ASIC, pure kernel-level software, etc).

> The problem is that DRAM isn't going to cut it on the ACL tables--which
> are typically used for flow-based matching--on a 48x10G (or even 48x1G)
> switch.

There are vendors who use DRAMs with specialized interfaces that
interleave requests behind the scenes. Maybe i can point you to one
offline.

> I've seen a couple of switching ASICs that support many 10s of
> thousands of ACL entries, but they require expensive external TCAMs for
> lookup and SRAM for counters. Most of the white box vendors that I've
> seen that use those ASICs don't bother adding the external TCAM and
> SRAM to their designs. Even when they are added, their matching
> capabilities are typically limited in order to keep up with traffic.

I thought SRAM markets have dried up these days. Anyways, what you are
referring to above is generally true.

> > Justin - there's nothing new you need in the kernel to have that
> > feature. Let me rephrase that: that has not been a new feature for at
> > least a decade in Linux.
> > Add exact match filters with higher priority. Have the lowest
> > priority filter redirect to user space. Let user space look up some
> > service rule; have it download to the kernel one or more exact
> > matches. Let the packet proceed on its way down the kernel to its
> > destination if that's what is defined.
>
> My point was that a software fastpath should look different than a
> hardware-based one.

And i was pointing to what your datapath patches already provide in
conjunction with your user space code.

> > That bit sounds interesting - I will look at your spec.
>
> Great!

I am sorry - been overloaded elsewhere, haven't looked. But i think
above I pretty much spelt out what my desires are.

> Yes, Open vSwitch has been ported to 24x10G ASICs running Linux on
> their management CPUs. However, in these cases the datapath is handled
> by hardware and not the software forwarding plane, obviously.

Of course.

> > Do the vendors agree to some common interface?
>
> Yes, if you view ofproto (as described in the porting guide) as that
> interface. Every merchant silicon vendor I've seen views the interfaces
> to their ASICs as proprietary.

Yes, the XAL agony (HALs and PALs that run on 26 other OSes).

> Someone (with the appropriate SDK and licenses) needs to write
> providers for those different hardware ports. We've helped multiple
> vendors do this and know a few others that have done it on their own.

You know what would be really nice is if you achieved what i described
above. Can I ifconfig an ethernet switch port?

> This really seems beside the point for this discussion, though. We've
> written an ofproto provider for software switches called "dpif" (this
> is also described in the porting guide). What we're proposing be
> included in Linux is the kernel module that speaks to that dpif
> provider over a well-defined, stable, netlink-based protocol.
>
> Here's just a quick (somewhat simplified) summary of the different
> layers. At the top, there are controllers and switches that communicate
> using OpenFlow. OpenFlow gives controller writers the ability to
> inspect and modify the switches' flow tables and interfaces. If a
> packet doesn't match an existing flow entry, it is forwarded to the
> controller for further processing. OpenFlow 1.0 was pretty basic and
> exposed a single flow table. OpenFlow 1.1 introduced a number of new
> features including multiple table support. The forthcoming OpenFlow 1.2
> will include support for extensible matches, which means that new
> fields may be added without requiring a full revision of the
> specification. OpenFlow is defined by the Open Networking Foundation
> and is not directly related to Open vSwitch.
>
> The userspace in Open vSwitch has an OpenFlow library that interacts
> with the controllers. Userspace has its own classifier that supports
> wildcard entries and multiple tables. Many of the changes to the
> OpenFlow protocol only require modifying that library and perhaps some
> of the glue code with the classifier. (In theory, other
> software-defined networking protocols could be plugged in as well.)
> The classifier interacts with the ofproto layer below it, which
> implements a fastpath.

Yes, when i looked at your code i can see that you have gone past
OpenFlow.

> On a hardware switch, since it supports wildcarding, it essentially
> becomes a passthrough that just calls the appropriate APIs for the
> ASIC.

Are these APIs documented as well? Maybe that's all we need if you don't
have the standard Linux tools working.

> In software, as we've discussed, exact-match flows work better.
>
> For that reason, we've defined the dpif layer, which is an ofproto
> provider. Its primary purpose is to take high-level concepts like
> "treat this group of interfaces as a LACP bond" or "support this set of
> wildcard flow entries" and explode them into exact-match entries
> on-demand. We've then implemented a Linux dpif provider that takes the
> exact-match entries created by the dpif layer and converts them into
> netlink messages that the kernel module understands. These messages are
> well-defined and not specific to Open vSwitch or OpenFlow.

Useful, but that seems more like a service layer - I just want to be
able to ifconfig a port as a basic need. In any case, I should look at
your doc to get some clarity.

> This layering has allowed us to introduce new OpenFlow-like features
> such as multiple tables and non-OpenFlow features such as port
> mirroring, STP, CCM, and new bonding modes without changes to the
> kernel module. In fact, the only changes that should necessitate a
> kernel interface change are new matches or actions, such as would be
> required for handling MPLS.

I just need the basic building blocks. If you conform to what Linux
already does and i can run standard tools, there are a lot of creative
things that could be done.

> > Make these vendor switches work with plain Linux. The Intel folks are
> > producing interfaces with L2, ACLs, VIs and are putting some effort
> > to integrate them into plain Linux. I should be able to set the qos
> > rules with tc on an intel chip.
> > You guys can still take advantage of all that and still have your
> > nice control plane.
>
> Once again, I think we are talking about different things. I believe
> you are discussing interfacing with NICs, which is quite different from
> a high fanout switching ASIC. As I previously mentioned, the point of
> my original post was that I think it would be best not to model a high
> fanout switch in the interface to the kernel.

I hope my clarification above makes more sense.

cheers,
jamal

^ permalink raw reply	[flat|nested] 21+ messages in thread
end of thread, other threads:[~2011-11-28 22:42 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-24 20:10 Open vSwitch Design Jesse Gross
[not found] ` <CAEP_g=_2L1xFWtDXh_6YyXz1Mt9TR3zvjLzix+SpO6yzeOLsSQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-11-24 22:30 ` jamal
2011-11-25 5:20 ` Stephen Hemminger
[not found] ` <20111124212021.2ae2fb7f-QE31Isp8l5DVJhW05BI4jyWSNWFUUkiGXqFh9Ls21Oc@public.gmane.org>
2011-11-25 6:18 ` Eric Dumazet
2011-11-25 6:25 ` David Miller
[not found] ` <20111125.012517.2221372383643417980.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
2011-11-25 6:36 ` Eric Dumazet
2011-11-25 11:34 ` jamal
2011-11-25 13:02 ` Eric Dumazet
2011-11-28 15:20 ` [PATCH net-next 0/4] net: factorize flow dissector Eric Dumazet
2011-11-25 20:20 ` Open vSwitch Design Jesse Gross
[not found] ` <CAEP_g=9tcH9kJrVsHc26kXWZEUS8G-U=U7y6k8xaZG5MD0OTyg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-11-26 1:23 ` Jamal Hadi Salim
2011-11-25 20:14 ` Jesse Gross
2011-11-25 11:24 ` jamal
2011-11-25 17:28 ` Stephen Hemminger
2011-11-25 17:55 ` Jesse Gross
2011-11-25 19:52 ` Justin Pettit
[not found] ` <2DB44B16-598F-4414-8B35-8E322D705A9A-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
2011-11-26 1:11 ` Jamal Hadi Salim
2011-11-26 4:38 ` Stephen Hemminger
[not found] ` <ec23d63d-27c9-4761-bdd3-e3f54bdb5e77-bX68f012229Xuxj3zoTs5AC/G2K4zDHf@public.gmane.org>
2011-11-26 8:05 ` Martin Casado
2011-11-28 18:34 ` Justin Pettit
2011-11-28 22:42 ` Jamal Hadi Salim
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).