Re: vhost compliant virtio based networking interface in container

From: Tetsuya Mukawa <mukawa@igel.co.jp>
To: "Xie, Huawei" <huawei.xie@intel.com>, "dev@dpdk.org" <dev@dpdk.org>
Cc: "nakajima.yoshihiro@lab.ntt.co.jp"
	<nakajima.yoshihiro@lab.ntt.co.jp>,
	"zhbzg@huawei.com" <zhbzg@huawei.com>,
	"gaoxiaoqiu@huawei.com" <gaoxiaoqiu@huawei.com>,
	"oscar.zhangbo@huawei.com" <oscar.zhangbo@huawei.com>,
	Zhuangyanying <ann.zhuangyanying@huawei.com>,
	"zhoujingbin@huawei.com" <zhoujingbin@huawei.com>,
	"guohongzhen@huawei.com" <guohongzhen@huawei.com>
Subject: Re: vhost compliant virtio based networking interface in container
Date: Tue, 08 Sep 2015 13:44:50 +0900
Message-ID: <55EE67C2.5040301@igel.co.jp>
In-Reply-To: <C37D651A908B024F974696C65296B57B2BDBDDCD@SHSMSX101.ccr.corp.intel.com>

On 2015/09/07 14:54, Xie, Huawei wrote:
> On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
>> On 2015/08/25 18:56, Xie, Huawei wrote:
>>> On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
>>>> Hi Xie and Yanping,
>>>>
>>>>
>>>> May I ask you some questions?
>>>> It seems we are also developing almost the same thing.
>>> Good to know that we are tackling the same problem and have the similar
>>> idea.
>>> What is your status now? We had the POC running, and compliant with
>>> dpdkvhost.
>>> Interrupt like notification isn't supported.
>> We implemented the vhost PMD first, so we have only just started
>> implementing it.
>>
>>>> On 2015/08/20 19:14, Xie, Huawei wrote:
>>>>> Added dev@dpdk.org
>>>>>
>>>>> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>>>>>> Yanping:
>>>>>> I read your mail, seems what we did are quite similar. Here i wrote a
>>>>>> quick mail to describe our design. Let me know if it is the same thing.
>>>>>>
>>>>>> Problem Statement:
>>>>>> We don't have a high performance networking interface in container for
>>>>>> NFV. Current veth pair based interface couldn't be easily accelerated.
>>>>>>
>>>>>> The key components involved:
>>>>>>     1.    DPDK based virtio PMD driver in container.
>>>>>>     2.    device simulation framework in container.
>>>>>>     3.    dpdk(or kernel) vhost running in host.
>>>>>>
>>>>>> How virtio is created?
>>>>>> A:  There is no "real" virtio-pci device in container environment.
>>>>>> 1) The host maintains pools of memory and shares memory with the
>>>>>> container. This could be accomplished by the host sharing a huge
>>>>>> page file with the container.
>>>>>> 2) The container creates virtio rings on the shared memory.
>>>>>> 3) The container creates mbuf memory pools on the shared memory.
>>>>>> 4) The container sends the memory and vring information to vhost
>>>>>> through a vhost message. This could be done either through an ioctl
>>>>>> call or a vhost-user message.
>>>>>>
>>>>>> How vhost message is sent?
>>>>>> A: There are two alternative ways to do this.
>>>>>> 1) The customized virtio PMD is responsible for all the vring creation,
>>>>>> and vhost message sending.
>>>> Above is our approach so far.
>>>> It seems Yanping also takes this kind of approach.
>>>> We are using vhost-user functionality instead of using the vhost-net
>>>> kernel module.
>>>> Probably this is the difference between Yanping and us.
>>> In my current implementation, the device simulation layer talks to "user
>>> space" vhost through cuse interface. It could also be done through vhost
>>> user socket. This isn't the key point.
>>> Here vhost-user is kind of confusing, maybe user space vhost is more
>>> accurate, either cuse or unix domain socket. :).
>>>
>>> As for yanping, they are now connecting to vhost-net kernel module, but
>>> they are also trying to connect to "user space" vhost.  Correct me if wrong.
>>> Yes, there is some difference between these two. The vhost-net kernel
>>> module can directly access another process's memory, while with
>>> vhost-user (cuse/user) we need to do the memory mapping.
>>>> BTW, we are going to submit a vhost PMD for DPDK-2.2.
>>>> This PMD is implemented on librte_vhost.
>>>> It allows DPDK application to handle a vhost-user(cuse) backend as a
>>>> normal NIC port.
>>>> This PMD should work with both Xie and Yanping approach.
>>>> (In the case of Yanping approach, we may need vhost-cuse)
>>>>
>>>>>> 2) We could do this through a lightweight device simulation framework.
>>>>>>     The device simulation creates simple PCI bus. On the PCI bus,
>>>>>> virtio-net PCI devices are created. The device simulations provides
>>>>>> IOAPI for MMIO/IO access.
>>>> Does it mean you implemented a kernel module?
>>>> If so, do you still need vhost-cuse functionality to handle vhost
>>>> messages in userspace?
>>> The device simulation is a library running in user space in the container.
>>> It is linked with DPDK app. It creates pseudo buses and virtio-net PCI
>>> devices.
>>> The virtio-container-PMD configures the virtio-net pseudo devices
>>> through IOAPI provided by the device simulation rather than IO
>>> instructions as in KVM.
>>> Why do we use device simulation?
>>> We could create other virtio devices in the container, and provide a
>>> common way to talk to the vhost-xx module.
>> Thanks for explanation.
>> At first reading, I thought the difference between approach1 and
>> approach2 is whether we need to implement a new kernel module, or not.
>> But now I understand how you implemented it.
>>
>> Please let me explain our design more.
>> We might use a kind of similar approach to handle a pseudo virtio-net
>> device in DPDK.
>> (Anyway, we haven't finished implementing yet, this overview might have
>> some technical problems)
>>
>> Step1. Separate the virtio-net and vhost-user socket related code from
>> QEMU, then implement it as a separate program.
>> The program also has the features below.
>>  - Create a directory that contains almost the same files as
>> /sys/bus/pci/devices/<pci address>/*
>>    (To scan these files located outside sysfs, we need to fix EAL)
>>  - This dummy device is driven by dummy-virtio-net-driver. This name is
>> specified by the '<pci addr>/driver' file.
>>  - Create a shared file that represents the PCI configuration space, then
>> mmap it, and specify the path in '<pci addr>/resource_path'
>> The program will be GPL, but it will act as a bridge on the shared
>> memory between the virtio-net PMD and the DPDK vhost backend.
>> Actually, it will work underneath the virtio-net PMD, but we don't need
>> to link against it.
>> So I guess we don't have a GPL license issue.
>>
>> Step2. Fix pci scan code of EAL to scan dummy devices.
>>  - To scan above files, extend pci_scan() of EAL.
>>
>> Step3. Add a new kdrv type to EAL.
>>  - To handle the 'dummy-virtio-net-driver', add a new kdrv type to EAL.
>>
>> Step4. Implement pci_dummy_virtio_net_map/unmap().
>>  - It will have almost the same functionality as pci_uio_map(), but for
>> the dummy virtio-net device.
>>  - The dummy device will be mmaped using a path specified in '<pci
>> addr>/resource_path'.
>>
>> Step5. Add a new compile option for the virtio-net device to replace IO
>> functions.
>>  - The IO functions of the virtio-net PMD will be replaced by read() and
>> write() to access the shared memory.
>>  - Add a notification mechanism to the IO functions. This will be used
>> when a write() to the shared memory is done.
>>  (Not sure exactly, but probably we need it)
>>
>> Does it make sense?
>> I guess Step1&2 is different from your approach, but the rest might be
>> similar.
>>
>> Actually, we just need sysfs entries for a virtio-net dummy device, but
>> so far, I don't have a fine way to register them from user space without
>> loading a kernel module.
> Tetsuya:
> I don't quite get the details. Who will create those sysfs entries? A
> kernel module right?

Hi Xie,

I don't create sysfs entries. I just create a directory containing
files that look like sysfs entries, and initialize EAL with not only
sysfs but also that directory.

In the last quoted sentence, I meant to say that we just need files that
look like sysfs entries.
But I don't know a good way to create files under sysfs without loading
a kernel module.
That is why I am trying to create the additional directory instead.
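
A rough sketch of the directory idea, in Python only for brevity (the
real change would live in the EAL's C scan code). The root path, the set
of files, and the function names are hypothetical; only the vendor and
device IDs (0x1af4 / 0x1000, legacy virtio-net) are real values. Unlike
real sysfs, where 'driver' is a symlink, this sketch uses a plain file
holding the driver name.

```python
import os

VIRTIO_VENDOR_ID = 0x1AF4      # PCI vendor ID used by virtio devices
VIRTIO_NET_DEVICE_ID = 0x1000  # legacy (transitional) virtio-net device ID


def create_dummy_device(root, pci_addr, resource_path):
    """Create <root>/<pci_addr>/ with plain files that mimic sysfs entries."""
    dev_dir = os.path.join(root, pci_addr)
    os.makedirs(dev_dir, exist_ok=True)
    entries = {
        "vendor": f"0x{VIRTIO_VENDOR_ID:04x}\n",
        "device": f"0x{VIRTIO_NET_DEVICE_ID:04x}\n",
        # Assumption: the driver name is stored as file content.
        "driver": "dummy-virtio-net-driver\n",
        # Assumption: path of the shared file backing the config space.
        "resource_path": resource_path + "\n",
    }
    for name, value in entries.items():
        with open(os.path.join(dev_dir, name), "w") as f:
            f.write(value)
    return dev_dir


def scan_roots(roots):
    """Return {pci_addr: {entry: value}} for devices found under all roots."""
    found = {}
    for root in roots:
        if not os.path.isdir(root):
            continue  # a missing root is skipped, not treated as an error
        for pci_addr in sorted(os.listdir(root)):
            dev_dir = os.path.join(root, pci_addr)
            info = {}
            for name in ("vendor", "device", "driver", "resource_path"):
                path = os.path.join(dev_dir, name)
                if os.path.isfile(path):
                    with open(path) as f:
                        info[name] = f.read().strip()
            # Only directories that identify a device are reported.
            if "vendor" in info and "device" in info:
                found[pci_addr] = info
    return found
```

Passing both the real sysfs root and the extra directory to scan_roots()
corresponds to initializing EAL with "not only sysfs but also the above
directory".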

> The virtio-net is configured through read/write to shared
> memory (between host and guest), right?

Yes, I agree.
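
As a rough illustration only (Python for brevity, not the planned PMD
code), register access through a shared file could look like the sketch
below. The register offsets and status bits follow the legacy virtio PCI
header layout; the class name, the one-page size, and little-endian
encoding are assumptions made for this example.

```python
import mmap
import os
import struct

# Legacy virtio PCI header offsets.
VIRTIO_PCI_QUEUE_PFN = 8    # 32-bit: page frame number of the selected queue
VIRTIO_PCI_QUEUE_SEL = 14   # 16-bit: queue selector
VIRTIO_PCI_STATUS = 18      # 8-bit: device status

STATUS_ACKNOWLEDGE = 0x01
STATUS_DRIVER = 0x02
STATUS_DRIVER_OK = 0x04

CONFIG_SIZE = 4096          # assumption: one page holds the register file


class SharedConfig:
    """Register access backed by a MAP_SHARED file instead of port I/O."""

    def __init__(self, path):
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            os.ftruncate(fd, CONFIG_SIZE)
            self.mem = mmap.mmap(fd, CONFIG_SIZE, flags=mmap.MAP_SHARED)
        finally:
            os.close(fd)  # the mapping keeps its own reference to the file

    def read8(self, off):
        return struct.unpack_from("<B", self.mem, off)[0]

    def write8(self, off, val):
        struct.pack_into("<B", self.mem, off, val)

    def read16(self, off):
        return struct.unpack_from("<H", self.mem, off)[0]

    def write16(self, off, val):
        struct.pack_into("<H", self.mem, off, val)

    def read32(self, off):
        return struct.unpack_from("<I", self.mem, off)[0]

    def write32(self, off, val):
        struct.pack_into("<I", self.mem, off, val)

    def close(self):
        self.mem.close()
```

Two SharedConfig objects opened on the same path (one on the PMD side,
one on the pseudo device side) see each other's writes. The sketch has no
notification, so the device side would have to poll or be signalled
separately.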

> Where is shared vring created and shared memory created, on shared huge
> page between host and guest?

The virtqueues (vrings) are on guest hugepages.

Let me explain.
The guest container should have read/write access to a part of the
hugepage directory on the host.
(For example, /mnt/huge/container1/ is shared between host and guest.)
The host and guest also need to communicate through a unix domain socket.
(For example, host and guest can communicate using
"/tmp/container1/sock")

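A minimal sketch of these two requirements, assuming a file-backed region
and a connected unix domain socket (Python for brevity). The message
format here is invented for illustration and is not the vhost-user wire
format, although vhost-user likewise passes memory region file
descriptors as SCM_RIGHTS ancillary data. The function names and the 2 MB
region size are hypothetical.

```python
import array
import mmap
import os
import socket
import struct

REGION_SIZE = 2 * 1024 * 1024  # assumption: one 2 MB region stands in for a hugepage


def send_region(sock, fd, size):
    """Send the region size as payload and the backing fd via SCM_RIGHTS."""
    sock.sendmsg(
        [struct.pack("<Q", size)],
        [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))],
    )


def recv_region(sock):
    """Receive one region announcement and map it into this process."""
    fds = array.array("i")
    data, ancdata, _flags, _addr = sock.recvmsg(8, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole fd.
            fds.frombytes(cdata[: len(cdata) - (len(cdata) % fds.itemsize)])
    if len(data) != 8 or len(fds) != 1:
        raise RuntimeError("expected one size field and one file descriptor")
    (size,) = struct.unpack("<Q", data)
    try:
        return mmap.mmap(fds[0], size, flags=mmap.MAP_SHARED)
    finally:
        os.close(fds[0])  # the mapping stays valid after the fd is closed
```

With a file under the shared hugepage directory, the receiver could also
open the path itself; passing the descriptor just avoids agreeing on file
names.
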
If we can do the above, a virtio-net PMD on the guest can create
virtqueues (vrings) on its hugepages and write this information to a
pseudo virtio-net device, which is a process created in the guest
container.
The pseudo virtio-net device then sends it to the vhost-user backend
(the host DPDK application) through a unix domain socket.
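
For reference, the vring information handed over in this flow is mostly
three addresses per queue. The sketch below (Python for brevity) computes
them with the standard split-vring layout, where the used ring starts at
the next aligned boundary after the avail ring. The function names and
the returned dictionary are hypothetical, not an existing API.

```python
VRING_DESC_SIZE = 16       # struct vring_desc: u64 addr, u32 len, u16 flags, u16 next
VRING_USED_ELEM_SIZE = 8   # struct vring_used_elem: u32 id, u32 len


def vring_layout(num, align=4096):
    """Return (avail_off, used_off, total_size) for a vring with num descriptors."""
    if num <= 0 or num & (num - 1):
        raise ValueError("queue size must be a power of two")
    avail_off = VRING_DESC_SIZE * num
    # avail ring: flags, idx, ring[num], used_event -> (3 + num) 16-bit fields
    avail_end = avail_off + 2 * (3 + num)
    # The used ring starts at the next 'align' boundary.
    used_off = (avail_end + align - 1) & ~(align - 1)
    # used ring: flags, idx, avail_event (3 x u16) plus ring[num]
    total = used_off + 2 * 3 + VRING_USED_ELEM_SIZE * num
    return avail_off, used_off, total


def describe_vring(base, num, align=4096):
    """Addresses a PMD would report for a vring placed at 'base' (hypothetical format)."""
    avail_off, used_off, total = vring_layout(num, align)
    return {
        "num": num,
        "desc": base,
        "avail": base + avail_off,
        "used": base + used_off,
        "size": total,
    }
```

For example, vring_layout(256) returns (4096, 8192, 10246), so a
256-entry queue needs three 4 KB pages of the shared hugepage.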

So with my plan, there are 3 processes:
the DPDK applications on host and guest, plus a process that works like
a virtio-net device.

> Who will talk to dpdkvhost?

If we need to talk to a cuse device or the vhost-net kernel module, the
pseudo virtio-net device described above could do that.
(But so far, my target is only vhost-user.)

>> This is because I need to change pci_scan() also.
>>
>> It seems you have implemented a virtio-net pseudo device under a BSD
>> license. If so, it would be nice for this kind of PMD to use it.
> Currently it is based on native linux kvm tool.

Great, I hadn't noticed this option.

>> In the case that it takes much time to implement some missing
>> functionality like interrupt mode, using QEMU code might be one
>> option.
> For interrupt mode, i plan to use eventfd for sleep/wake, have not tried
> yet.
>> Anyway, we just need a fine virtual NIC between containers and host.
>> So we don't hold to our approach and implementation.
> Do you have comments to my implementation?
> We could publish the version without the device framework first for
> reference.

No, I don't have any. Could you please share it?
I am looking forward to seeing it.

Tetsuya
