From: "Mark D. Gray"
Reply-To: mark.d.gray@intel.com
Subject: Re: [ovs-dev] Status of Open vSwitch with DPDK
Date: Mon, 17 Aug 2015 15:53:01 +0100
Message-ID: <55D1F54D.9070205@intel.com>
References: <738D45BC1F695740A983F43CFE1B7EA9437C8255@IRSMSX108.ger.corp.intel.com> <20150815071630.GB2600@x240.home>
In-Reply-To: <20150815071630.GB2600@x240.home>
To: Daniele Di Proietto, "dev@openvswitch.org", dev
List-Id: patches and discussions about DPDK

On 08/15/15 08:16, Flavio Leitner wrote:
> On Fri, Aug 14, 2015 at 04:04:40PM +0000, Gray, Mark D wrote:
>> Hi Daniele,
>>
>> Thanks for starting this conversation. It is a good list :) I have
>> cross-posted this to dpdk.org as I feel that some of the points could be
>> interesting to that community as they are related to how DPDK is used.
>>
>> How do "users" of OVS with DPDK feel about this list? Does anyone
>> disagree or does anyone have any additions? What are your experiences?
>>
>>>
>>> There has been some discussion lately about the status of the Open vSwitch
>>> port to DPDK. While part of the code has been tested for quite some time,
>>> I think we can agree that there are a few rough spots that prevent it from
>>> being easily deployed and used.
>>>
>>> I was hoping to get some feedback from the community about those rough
>>> spots, i.e. areas where OVS+DPDK can/needs to improve to become more
>>> "production ready" and user-friendly.
>>>
>>> - PMD threads and queues management: the code has shown several bugs and
>>>   the netdev interfaces don't seem up to the job anymore.
>>
>> You had a few ideas about how to refactor this before but I was concerned
>> about the effect it would have on throughput. I can't find the thread.
>>
>> Do you have some further ideas about how to achieve this?
>
> I miss the fact that we can't tell which queue can go to each PMD and
> also that all devices must have the same number of rx queues. I agree
> that there are other issues, but it seems the kind of configuration
> knobs I am looking for might not be the end goal, since what has been
> said is to look for a more automated way. Having said so, I would also
> like to hear if you have further ideas about how to achieve that.
>
>
>>> There's a lot of margin for improvement: we could factor out the code from
>>> dpif-netdev, add configuration parameters for advanced users, and figure
>>> out a way to add unit tests.
>>>
>>
>> I think this is a general issue with both the kernel datapath (and netdevs)
>> and the userspace datapath. There isn't much unit testing (or testing)
>> outside of the slow path.
>
> Maybe we could exercise the interfaces using the pcap pmd.
>

We had a similar idea. Using this, it would be possible to test the
entire datapath or netdev for functionality! I don't think there is an
equivalent for the kernel datapath?

>>> Related to this, the system should be as fast as possible out-of-the-box,
>>> without requiring too much tuning.
>>
>> This is a good point. I think the kernel datapath has a similar issue. You
>> can get a certain level of performance without compiling with -Ofast or
>> pinning threads, but you will (even with the kernel datapath) get better
>> performance if you pin threads (and possibly compile differently). I guess
>> it is more visible with the dpdk datapath as performance is one of the key
>> values.
>> It is also more detrimental to the performance if you don't set it
>> up correctly.
>
> Not only that, you need to consider how the resources will be
> distributed upfront so that you don't run out of hugepages, perhaps
> isolate PMD CPUs from the Linux scheduler, etc. So, I think a more
> realistic goal would be: the system should require minimal/no tuning
> to run with acceptable performance.
>

How do you define "acceptable" performance :)?

>
>> Perhaps we could provide scripts to help do this?
>
> Or profiles (if that isn't included in your scripts definition).
> Maybe we should define profiles like "performance", "minimum cores", etc.
>
>> I think this is also interesting to the DPDK community. There is
>> knowledge required when running DPDK-enabled apps to
>> get good performance: core pinning is one thing that comes to mind.
>>
>>>
>>> - Userspace tunneling: while the code has been there for quite some time,
>>>   it hasn't received the level of testing that the Linux kernel datapath
>>>   tunneling has.
>>>
>>
>> Again, there is a lack of test infrastructure in general for OVS. vsperf
>> is a good start, and it would be great to see more people use and
>> contribute to it!
>
> Yes.
>
>
>>> - Documentation: other than a step-by-step tutorial, it cannot be said
>>>   that DPDK is a first-class citizen in the OVS documentation. Manpages
>>>   could be improved.
>>
>> Easily done. The INSTALL guide is pretty good but the structure could be
>> better. There is also a lack of manpages. Good point.
>
> Yup.
>
>
>>> - Vhost: the code has not received the level of testing of the kernel
>>>   vhost. Another doubt shared by some developers is whether we should keep
>>>   vhost-cuse, given its relatively low ease of use and the overlap with
>>>   the far more standard vhost-user.
>>
>> vhost-cuse is required for older versions of qemu. I'm aware of some
>> companies using it as they are restricted to an older version of qemu.
>> I think it is deprecated
>> at the moment? Is there a notice to that effect? We just need a plan for
>> when to remove it and make sure that plan is clear?
>
> Apparently having two solutions to address the same issue causes more
> harm than good, so removing vhost-cuse would be helpful. I agree that
> we need a clear plan with a soak time so users can either upgrade to
> vhost-user or tell us why they can't.
>
>
>>> - Interface management and naming: interfaces must be manually removed
>>>   from the kernel drivers.
>>>
>>>   We still don't have an easy way to identify them. Ideas are welcome:
>>>   how can we make this user friendly? Is there a better solution on the
>>>   DPDK side?
>>
>> This is a tough one and is interesting to the DPDK community. The basic
>> issue here is that users are more familiar with linux interfaces and linux
>> naming conventions.
>>
>> "ovs-vsctl add-port br0 eth0" makes a lot more sense than
>>
>> "dpdk_nic_bind -b igb_uio", then check the order that the ports
>> are enumerated, and then run "ovs-vsctl add-port br0 dpdkN".
>>
>> I can think of ways to do this with physical NICs. For example, you could
>> reference the port by the linux name and, when you try to add it, OVS
>> could unbind it from the kernel module and bind it to igb_uio?
>>
>> However, I am not sure how you would do it with virtual nics as there is
>> not even a real device.
>>
>> I think a general solution from the dpdk community would be really helpful
>> here.
>
>
> It doesn't look like openvswitch is the right place to fix this.
> Openvswitch should deal with the port and the system should provide
> the port somehow. That's what happens with the kernel datapath, for
> instance: openvswitch doesn't load any NIC driver.
>
> So, it seems to be more related to udev/systemd configuration in which
> the sysadmin would tell the interfaces and the appropriate driver
> (UIO/VFIO/Bifurcated...).
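To make the contrast concrete, the manual workflow described above looks roughly like this. This is a sketch, not a recipe: the hugepage count, the PCI address (0000:01:00.0), the DPDK build path, and the resulting dpdkN number are placeholders that vary per system and DPDK version.

```shell
# Reserve 2MB hugepages and make them available (sizes are examples).
sysctl -w vm.nr_hugepages=1024
mount -t hugetlbfs none /dev/hugepages

# Load the UIO infrastructure and DPDK's igb_uio module, then detach
# the NIC from its kernel driver. Find the PCI address behind "eth0"
# with "dpdk_nic_bind --status" (tools/dpdk_nic_bind.py in the DPDK tree).
modprobe uio
insmod "$DPDK_BUILD"/kmod/igb_uio.ko
dpdk_nic_bind -b igb_uio 0000:01:00.0

# DPDK ports are then named dpdk0, dpdk1, ... in enumeration order,
# which is exactly why mapping back to "eth0" is non-obvious.
ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
```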
>
> Even if the system delivers the DPDK port ready, it would be great to
> have some friendly mapping so that users can refer to ports with known
> names.
>

Agreed.

>
>>> How are DPDK interfaces handled by linux distributions? I've heard about
>>> ongoing work for RHEL and Ubuntu; it would be interesting to coordinate.
>
> We have implemented dpdk/vhost support in initscripts so you could
> configure the ports in the same way as for the kernel devices, but
> how to properly bind to the driver is unclear yet.
>
>
>>> - Insight into the system and debuggability: nothing beats tcpdump for
>>>   the kernel datapath. Can something similar be done for the userspace
>>>   datapath?
>>
>> Yeah, this would be useful. I have my own way of dealing with this. For
>> example, you could dump from the LOCAL port on a NORMAL bridge or add a
>> rule to mirror a flow to another port, but I feel there could be a better
>> way to do this in DPDK. I have recently heard that the DPDK team do
>> something with a pcap pmd to help with debugging. A more general approach
>> from dpdk would help a lot.
>
> One idea maybe is that openvswitch could provide a mode to clone TX/RX
> packets to a pcap pmd. Or write the packets using pcap format directly
> to a file (avoiding another pmd which might not be available). Or even
> push them out using a tap device. Either way, tcpdump or wireshark would
> work.
>
>
>>> - Consistency of the tools: some commands are slightly different for the
>>>   userspace/kernel datapath. Ideally there shouldn't be any difference.
>
> Could you give some examples?
>
>
>> Yeah, there are some things that could be changed. DPDK just works
>> differently, but the benefits are significant :)
>>
>> We need to mount hugepages, bind nics to igb_uio, etc.
>>
>> With a lot of this stuff, maybe the DPDK community's tools don't need
>> to emulate the linux networking tools exactly.
>> Maybe over time, as the DPDK community
>> and user-base expand, people will become more familiar with the tools,
>> processes, etc., and this will be less of an issue?
>>
>>
>>>
>>> - Packaging: how should the distributions package DPDK and OVS? Should
>>>   there only be a single build to handle both the kernel and the
>>>   userspace datapath, eventually dynamically linked to DPDK?
>>
>> Yeah. Do we need to start with dpdk if we have compiled with DPDK support?
>
> Well, certainly not everybody wants to have DPDK dependencies, neither
> shared nor statically linked. Maybe the path is a plug-in architecture?
>
>
>>> - Benchmarks: we often rely on extremely simple flow tables with
>>>   single-flow traffic to evaluate the effect of a change. That may be ok
>>>   during development, but OVS with the kernel datapath has been tested in
>>>   different scenarios with more complicated flow tables and even with
>>>   hostile traffic patterns.
>>>
>>>   Efforts in this sense are being made, like the vsperf project, or even
>>>   the simple ovs-pipeline.py.
>>
>> vsperf will really help this.
>
> Indeed, but how is the OVS kernel datapath being tested? Is there a
> script? Maybe we can use the same tests for DPDK.
>
>
>>> I would appreciate feedback on the above points, not (only) in terms of
>>> solutions, but in terms of requirements that you feel are important for
>>> our system to be considered ready.
>
> The list covers technical issues, documentation issues and usability
> issues, which is great; thanks for doing it. However, as said, one
> important use-case is extreme performance, and that requires configuration
> or tuning flexibility, which adds usability/supportability issues. Will
> those knobs be a valid option provided that the defaults work well enough?
>

I feel that we need to expose knobs up through Open vSwitch in order to
tune for extreme performance; otherwise, how do we highlight the value in
what we are doing?
I think we need some way to allow a user to do this
type of configuration when they know what they are doing (without having
to recompile the code).

> Thanks,
> fbl
>
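To illustrate the kind of knob under discussion: PMD placement is already exposed through the OVS database (documented in INSTALL.DPDK.md around the time of this thread). The mask value and core numbers below are illustrative, not tuning advice.

```shell
# Pin PMD threads to cores 1 and 2 (0x6 is a CPU bitmask); this stops
# the scheduler from moving the packet-processing threads around.
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6

# Complementary host-side tuning: keep those cores away from the Linux
# scheduler entirely by booting with "isolcpus=1,2" on the kernel
# command line, so only the PMD threads run there.
```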