Linux virtualization list
* Re: [PATCH 0/5] Add vhost-blk support
From: Jeff Moyer @ 2012-07-12 16:06 UTC (permalink / raw)
  To: Asias He
  Cc: linux-aio, kvm, Michael S. Tsirkin, linux-kernel, virtualization,
	Benjamin LaHaise, Alexander Viro, linux-fsdevel
In-Reply-To: <1342107302-28116-1-git-send-email-asias@redhat.com>

Asias He <asias@redhat.com> writes:

> Hi folks,
>
> This patchset adds vhost-blk support. vhost-blk is an in-kernel virtio-blk
> device accelerator. Compared to the userspace virtio-blk implementation, vhost-blk
> gives about a 5% to 15% performance improvement.
>
> Asias He (5):
>   aio: Export symbols and struct kiocb_batch for in kernel aio usage
>   eventfd: Export symbol eventfd_file_create()
>   vhost: Make vhost a separate module
>   vhost-net: Use VHOST_NET_FEATURES for vhost-net
>   vhost-blk: Add vhost-blk support

I only saw patches 0 and 1.  Where are the other 4?  If the answer is,
"not on lkml," then please resend them, CC'ing lkml.  I'd like to be
able to see the usage of the aio routines.

Cheers,
Jeff


* Re: [patch 1/3 -next] tcm_vhost: unlock on error in tcm_vhost_drop_nexus()
From: Nicholas A. Bellinger @ 2012-07-12 21:47 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: target-devel, virtualization, kernel-janitors, kvm,
	Michael S. Tsirkin
In-Reply-To: <20120712144752.GD24202@elgon.mountain>

On Thu, 2012-07-12 at 17:47 +0300, Dan Carpenter wrote:
> We need to unlock here before returning.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> 
> diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
> index da0b8ac..d217bed 100644
> --- a/drivers/vhost/tcm_vhost.c
> +++ b/drivers/vhost/tcm_vhost.c
> @@ -1189,6 +1189,7 @@ static int tcm_vhost_drop_nexus(
>  	}
>  
>  	if (atomic_read(&tpg->tv_tpg_vhost_count)) {
> +		mutex_unlock(&tpg->tv_tpg_mutex);
>  		pr_err("Unable to remove TCM_vHost I_T Nexus with"
>  			" active TPG vhost count: %d\n",
>  			atomic_read(&tpg->tv_tpg_vhost_count));

Applied.  Thanks Dan!
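
The locking pattern this patch restores can be sketched in userspace. This
is only an illustration under stated assumptions, not the tcm_vhost code:
demo_tpg, demo_lock(), demo_unlock() and the -ENODEV/-EBUSY return values
are hypothetical stand-ins, and a plain flag replaces the kernel mutex so
the sketch builds without extra libraries.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the TPG state; not the real tcm_vhost struct. */
struct demo_tpg {
	int locked;      /* toy replacement for tv_tpg_mutex */
	int vhost_count; /* nonzero means the TPG is still in use */
	int has_nexus;
};

static void demo_lock(struct demo_tpg *tpg)
{
	assert(!tpg->locked); /* a leaked lock would trip here on the next call */
	tpg->locked = 1;
}

static void demo_unlock(struct demo_tpg *tpg)
{
	assert(tpg->locked);
	tpg->locked = 0;
}

/* Every return path below releases the lock it took. */
static int demo_drop_nexus(struct demo_tpg *tpg)
{
	demo_lock(tpg);
	if (!tpg->has_nexus) {
		demo_unlock(tpg);
		return -ENODEV; /* assumed error code */
	}
	if (tpg->vhost_count) {
		demo_unlock(tpg); /* the unlock the patch adds */
		return -EBUSY;    /* assumed error code */
	}
	tpg->has_nexus = 0;
	demo_unlock(tpg);
	return 0;
}
```

Calling demo_drop_nexus() on a TPG with a nonzero vhost_count returns the
error and leaves the lock released, so a later call does not deadlock.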


* Re: [patch 2/3 -next] tcm_vhost: strlen() doesn't count the terminator
From: Nicholas A. Bellinger @ 2012-07-12 21:48 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: target-devel, virtualization, kernel-janitors, kvm,
	Michael S. Tsirkin
In-Reply-To: <20120712144831.GE24202@elgon.mountain>

On Thu, 2012-07-12 at 17:48 +0300, Dan Carpenter wrote:
> We do snprintf() from "page" to a buffer with TCM_VHOST_NAMELEN
> characters so the current code will silently truncate the last
> character.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> 
> diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
> index d217bed..57d39c5 100644
> --- a/drivers/vhost/tcm_vhost.c
> +++ b/drivers/vhost/tcm_vhost.c
> @@ -1254,7 +1254,7 @@ static ssize_t tcm_vhost_tpg_store_nexus(
>  	 * the fabric protocol_id set in tcm_vhost_make_tport(), and call
>  	 * tcm_vhost_make_nexus().
>  	 */
> -	if (strlen(page) > TCM_VHOST_NAMELEN) {
> +	if (strlen(page) >= TCM_VHOST_NAMELEN) {
>  		pr_err("Emulated NAA Sas Address: %s, exceeds"
>  				" max: %d\n", page, TCM_VHOST_NAMELEN);
>  		return -EINVAL;

Also applied.  Thanks!
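
The off-by-one is easy to reproduce in userspace. The sketch below is an
illustration only: DEMO_NAMELEN and both demo_store_name*() helpers are
hypothetical, with DEMO_NAMELEN standing in for TCM_VHOST_NAMELEN. It shows
that a string of exactly DEMO_NAMELEN characters passes the old ">" check
and then loses its last character in snprintf(), while the ">=" check
rejects it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_NAMELEN 8 /* hypothetical; stands in for TCM_VHOST_NAMELEN */

/* Old check: lets a DEMO_NAMELEN-character string through. */
static int demo_store_name_old(char dst[DEMO_NAMELEN], const char *src)
{
	assert(dst && src);
	if (strlen(src) > DEMO_NAMELEN)
		return -1;
	/* snprintf() reserves one byte for the NUL, so this may truncate. */
	snprintf(dst, DEMO_NAMELEN, "%s", src);
	return 0;
}

/* Fixed check: the string plus its terminator must fit in the buffer. */
static int demo_store_name(char dst[DEMO_NAMELEN], const char *src)
{
	assert(dst && src);
	if (strlen(src) >= DEMO_NAMELEN)
		return -1;
	snprintf(dst, DEMO_NAMELEN, "%s", src);
	return 0;
}
```

With an 8-byte buffer, "12345678" is accepted by the old helper and stored
as "1234567"; the fixed helper refuses it and accepts "1234567" intact.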


* Re: [patch 3/3 -next] tcm_vhost: call kfree() on an error path
From: Nicholas A. Bellinger @ 2012-07-12 21:49 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: target-devel, virtualization, kernel-janitors, kvm,
	Michael S. Tsirkin
In-Reply-To: <20120712144852.GF24202@elgon.mountain>

On Thu, 2012-07-12 at 17:48 +0300, Dan Carpenter wrote:
> There is a memory leak here.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> 
> diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
> index 57d39c5..29850cb 100644
> --- a/drivers/vhost/tcm_vhost.c
> +++ b/drivers/vhost/tcm_vhost.c
> @@ -1420,6 +1420,7 @@ static struct se_wwn *tcm_vhost_make_tport(
>  
>  	pr_err("Unable to locate prefix for emulated Target Port:"
>  			" %s\n", name);
> +	kfree(tport);
>  	return ERR_PTR(-EINVAL);
>  
>  check_len:

Applied to for-next-merge; I'm folding the series into the initial merge
commit now.

Thank you!

--nab


* Re: [PATCH] hw/virtio-scsi: Set max_target=0 during vhost-scsi operation
From: Nicholas A. Bellinger @ 2012-07-12 22:08 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: Stefan Hajnoczi, kvm-devel, Michael S. Tsirkin, qemu-devel,
	lf-virt, target-devel, Paolo Bonzini, Zhi Yong Wu,
	linux-iscsi-target-dev
In-Reply-To: <CAEH94LgNkuomE3-8auuEWhe1hGoF3qpMs45G8fsY1xjvQe=2QQ@mail.gmail.com>

Hi Zhi,

On Thu, 2012-07-12 at 14:59 +0800, Zhi Yong Wu wrote:
> thanks, it is applied to my vhost_scsi git tree
> git://github.com/wuzhy/qemu.git vhost-scsi
> 

Thanks for picking up this patch in your vhost-scsi tree.

As mentioned off-list, I'd like to rebase to a more recent qemu.git to
include megasas 8708EM2 HBA emulation from Dr. Hannes so we can
experiment with a few more types of target setups.  ;)

I'll likely do this on my local branch for now, but if you have the
extra cycles please feel free to update vhost-scsi to the latest
qemu.git HEAD so we can have both vhost-scsi + megasas HBA emulation in
the same working tree.

Depending upon how long we'll need to hold vhost-scsi patches
out-of-tree (hopefully it's less than infinity ;) a qemu/vhost-scsi
working tree on kernel.org might also be helpful.

Thanks!

--nab


* virtio config access races
From: Michael S. Tsirkin @ 2012-07-12 22:59 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm, virtualization

It looks like there's a problem in the way virtio config currently
works: if a driver reads config in its probe routine, the config
can subsequently change before the core sets DRIVER_OK.
This change will not cause an interrupt, so the event is lost.
Maybe we should document that devices should delay such
events until after DRIVER_OK?

-- 
MST
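
The race can be modelled with a toy device. This is a sketch under stated
assumptions, not virtio code: demo_dev, its fields and all demo_*()
functions are hypothetical, and the model simply encodes the behaviour
described in this message (no interrupt for a config change before
DRIVER_OK) next to the proposed alternative (the device delays the event
until DRIVER_OK is set).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy device; field names are illustrative only. */
struct demo_dev {
	int config;
	bool driver_ok;
	bool delay_events; /* true: hold config events until DRIVER_OK */
	bool pending;      /* a delayed event is waiting */
	int irqs;          /* config-change interrupts delivered */
};

/* Device side: the config changes at an arbitrary time. */
static void demo_config_change(struct demo_dev *d, int value)
{
	d->config = value;
	if (d->driver_ok)
		d->irqs++;
	else if (d->delay_events)
		d->pending = true;
	/* otherwise the event is silently dropped */
}

/* Core side: runs after the driver's probe routine returns. */
static void demo_set_driver_ok(struct demo_dev *d)
{
	assert(!d->driver_ok); /* set once per probe in this model */
	d->driver_ok = true;
	if (d->pending) {
		d->pending = false;
		d->irqs++;
	}
}

/* Returns 1 if the driver ends up with stale config and no interrupt. */
static int demo_probe_race(bool delay_events)
{
	struct demo_dev d = { 1, false, delay_events, false, 0 };
	int cached = d.config;     /* driver reads config in probe */

	demo_config_change(&d, 2); /* change lands before DRIVER_OK */
	demo_set_driver_ok(&d);
	return cached != d.config && d.irqs == 0;
}
```

In this model demo_probe_race(false) reports a lost event, while
demo_probe_race(true) does not, because the delayed event is delivered as
soon as DRIVER_OK is set.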


* Re: [PATCH 0/5] Add vhost-blk support
From: Asias He @ 2012-07-13  1:19 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-aio, kvm, Michael S. Tsirkin, linux-kernel, virtualization,
	Benjamin LaHaise, Alexander Viro, linux-fsdevel
In-Reply-To: <x498vepklqu.fsf@segfault.boston.devel.redhat.com>

Hello Jeff,

On 07/13/2012 12:06 AM, Jeff Moyer wrote:
> Asias He <asias@redhat.com> writes:
>
>> Hi folks,
>>
>> This patchset adds vhost-blk support. vhost-blk is an in-kernel virtio-blk
>> device accelerator. Compared to the userspace virtio-blk implementation, vhost-blk
>> gives about a 5% to 15% performance improvement.
>>
>> Asias He (5):
>>    aio: Export symbols and struct kiocb_batch for in kernel aio usage
>>    eventfd: Export symbol eventfd_file_create()
>>    vhost: Make vhost a separate module
>>    vhost-net: Use VHOST_NET_FEATURES for vhost-net
>>    vhost-blk: Add vhost-blk support
>
> I only saw patches 0 and 1.  Where are the other 4?  If the answer is,
> "not on lkml," then please resend them, CC'ing lkml.

I did send all of patches 0-5 to lkml, but I somehow messed up the
threading. I will CC you next time.

> I'd like to be able to see the usage of the aio routines.

OK. It'd be nice if you could review. Thanks.

-- 
Asias


* Re: [PATCH 1/5] aio: Export symbols and struct kiocb_batch for in kernel aio usage
From: Asias He @ 2012-07-13  1:40 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-aio, kvm, Michael S. Tsirkin, linux-kernel, virtualization,
	Benjamin LaHaise, Alexander Viro, linux-fsdevel
In-Reply-To: <1342115416.3021.60.camel@dabdike.int.hansenpartnership.com>

Hi James,

On 07/13/2012 01:50 AM, James Bottomley wrote:
> On Thu, 2012-07-12 at 23:35 +0800, Asias He wrote:
>> This is useful for people who want to use aio in kernel, e.g. vhost-blk.
>>
>> Signed-off-by: Asias He <asias@redhat.com>
>> ---
>>   fs/aio.c            |   37 ++++++++++++++++++-------------------
>>   include/linux/aio.h |   21 +++++++++++++++++++++
>>   2 files changed, 39 insertions(+), 19 deletions(-)
>
> Um, I think you don't quite understand how aio in the kernel would work;
> it's not as simple as just exporting the interfaces.  There's already a
> (very long) patch set from oracle to do this so loop can use aio:
>
> http://marc.info/?l=linux-fsdevel&m=133312234313122

Oh, I had not seen this patch set. Thanks for pointing it out! It
hasn't been merged yet, right? I'd love to use the aio_kernel_() interface
once it is merged; it would simplify vhost-blk. For lack of a better
in-kernel aio interface, we currently do io_setup, io_submit, etc. on our
own in vhost-blk.

-- 
Asias


* Re: virtio config access races
From: Rusty Russell @ 2012-07-13  3:55 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization
In-Reply-To: <20120712225941.GA9317@redhat.com>

On Fri, 13 Jul 2012 01:59:41 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> It looks like there's a problem in the way virtio config currently
> works: if a driver reads config in its probe routine, the config
> can subsequently change before the core sets DRIVER_OK.
> This change will not cause an interrupt, so the event is lost.
> Maybe we should document that devices should delay such
> events until after DRIVER_OK?

The device is currently defined to be active from the time we
acknowledge the features (which means we may get a spurious interrupt
before we probe, I think).  We abuse this for virtio_blk for example,
where we add_disk() inside the probe function.

Hmm, the changed interrupt is live from find_vqs, right?  Perhaps we
should leave it to drivers to set that up in the right order.

Thoughts?
Rusty.


* Re: [Xen-devel] [PATCH] xen: populate correct number of pages when across mem boundary
From: zhenzhong.duan @ 2012-07-13  5:37 UTC (permalink / raw)
  To: David Vrabel
  Cc: jeremy, xen-devel, Konrad Rzeszutek Wilk, x86, Feng Jin,
	linux-kernel, virtualization, mingo, hpa, tglx
In-Reply-To: <4FFEE552.4070201@citrix.com>



On 2012-07-12 22:55, David Vrabel wrote:
> On 04/07/12 07:49, zhenzhong.duan wrote:
>> When populating pages across a memory boundary at boot, the populated
>> page count isn't correct. This is because memory populated into a
>> non-memory region is ignored.
>>
>> The pfn range is also wrongly aligned when the memory boundary isn't page aligned.
>>
>> We also need to consider the rare case where xen_do_chunk() fails (populate).
>>
>> For a dom0 booted with dom0_mem=3368952K (0xcd9ff000-4k), the dmesg diff is:
>>   [    0.000000] Freeing 9e-100 pfn range: 98 pages freed
>>   [    0.000000] 1-1 mapping on 9e->100
>>   [    0.000000] 1-1 mapping on cd9ff->100000
>>   [    0.000000] Released 98 pages of unused memory
>>   [    0.000000] Set 206435 page(s) to 1-1 mapping
>> -[    0.000000] Populating cd9fe-cda00 pfn range: 1 pages added
>> +[    0.000000] Populating cd9fe-cd9ff pfn range: 1 pages added
>> +[    0.000000] Populating 100000-100061 pfn range: 97 pages added
>>   [    0.000000] BIOS-provided physical RAM map:
>>   [    0.000000] Xen: 0000000000000000 - 000000000009e000 (usable)
>>   [    0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
>>   [    0.000000] Xen: 0000000000100000 - 00000000cd9ff000 (usable)
>>   [    0.000000] Xen: 00000000cd9ffc00 - 00000000cda53c00 (ACPI NVS)
>> ...
>>   [    0.000000] Xen: 0000000100000000 - 0000000100061000 (usable)
>>   [    0.000000] Xen: 0000000100061000 - 000000012c000000 (unusable)
>> ...
>>   [    0.000000] MEMBLOCK configuration:
>> ...
>> -[    0.000000]  reserved[0x4]       [0x000000cd9ff000-0x000000cd9ffbff], 0xc00 bytes
>> -[    0.000000]  reserved[0x5]       [0x00000100000000-0x00000100060fff], 0x61000 bytes
>>
>> Related xen memory layout:
>> (XEN) Xen-e820 RAM map:
>> (XEN)  0000000000000000 - 000000000009ec00 (usable)
>> (XEN)  00000000000f0000 - 0000000000100000 (reserved)
>> (XEN)  0000000000100000 - 00000000cd9ffc00 (usable)
>>
>> Signed-off-by: Zhenzhong Duan<zhenzhong.duan@oracle.com>
>> ---
>>   arch/x86/xen/setup.c |   24 +++++++++++-------------
>>   1 files changed, 11 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
>> index a4790bf..bd78773 100644
>> --- a/arch/x86/xen/setup.c
>> +++ b/arch/x86/xen/setup.c
>> @@ -157,50 +157,48 @@ static unsigned long __init xen_populate_chunk(
>>   	unsigned long dest_pfn;
>>
>>   	for (i = 0, entry = list; i<  map_size; i++, entry++) {
>> -		unsigned long credits = credits_left;
>>   		unsigned long s_pfn;
>>   		unsigned long e_pfn;
>>   		unsigned long pfns;
>>   		long capacity;
>>
>> -		if (credits<= 0)
>> +		if (credits_left<= 0)
>>   			break;
>>
>>   		if (entry->type != E820_RAM)
>>   			continue;
>>
>> -		e_pfn = PFN_UP(entry->addr + entry->size);
>> +		e_pfn = PFN_DOWN(entry->addr + entry->size);
> Ok.
>
>>
>>   		/* We only care about E820 after the xen_start_info->nr_pages */
>>   		if (e_pfn<= max_pfn)
>>   			continue;
>>
>> -		s_pfn = PFN_DOWN(entry->addr);
>> +		s_pfn = PFN_UP(entry->addr);
> Ok.
>
>>   		/* If the E820 falls within the nr_pages, we want to start
>>   		 * at the nr_pages PFN.
>>   		 * If that would mean going past the E820 entry, skip it
>>   		 */
>> +again:
>>   		if (s_pfn<= max_pfn) {
>>   			capacity = e_pfn - max_pfn;
>>   			dest_pfn = max_pfn;
>>   		} else {
>> -			/* last_pfn MUST be within E820_RAM regions */
>> -			if (*last_pfn&&  e_pfn>= *last_pfn)
>> -				s_pfn = *last_pfn;
>>   			capacity = e_pfn - s_pfn;
>>   			dest_pfn = s_pfn;
>>   		}
>> -		/* If we had filled this E820_RAM entry, go to the next one. */
>> -		if (capacity<= 0)
>> -			continue;
>>
>> -		if (credits>  capacity)
>> -			credits = capacity;
>> +		if (credits_left<  capacity)
>> +			capacity = credits_left;
>>
>> -		pfns = xen_do_chunk(dest_pfn, dest_pfn + credits, false);
>> +		pfns = xen_do_chunk(dest_pfn, dest_pfn + capacity, false);
>>   		done += pfns;
>>   		credits_left -= pfns;
>>   		*last_pfn = (dest_pfn + pfns);
>> +		if (credits_left>  0&&  *last_pfn<  e_pfn) {
>> +			s_pfn = *last_pfn;
>> +			goto again;
>> +		}
> This looks like it will loop forever if xen_do_chunk() repeatedly fails
> because Xen is out of pages.  I think if xen_do_chunk() cannot get a
> page from Xen the repopulation process should stop -- aborting this
> chunk and any others.  This will allow the guest to continue to boot
> just with less memory than expected.
>
> David
OK, I'll update the patch; looping forever isn't a good idea.
Originally, I was considering the case where the system has dynamic
memory control functionality.
Thanks for the comments.
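
The alignment change discussed above is easiest to see with numbers. The
sketch below is illustrative only: the DEMO_* macros mirror the usual
PFN_UP()/PFN_DOWN() definitions for 4 KiB pages, demo_whole_pages() is a
hypothetical helper, and 0xcd9ffc00 is the end of the usable region in the
Xen e820 map quoted in the patch description.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1ULL << DEMO_PAGE_SHIFT)
/* Round an address up / down to a page frame number. */
#define DEMO_PFN_UP(x)   (((x) + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT)
#define DEMO_PFN_DOWN(x) ((x) >> DEMO_PAGE_SHIFT)

/*
 * Number of whole pages inside [start, end): round the start up and the
 * end down, as the patch does for s_pfn and e_pfn.
 */
static unsigned long long demo_whole_pages(unsigned long long start,
					   unsigned long long end)
{
	unsigned long long s_pfn, e_pfn;

	assert(start <= end);
	s_pfn = DEMO_PFN_UP(start);
	e_pfn = DEMO_PFN_DOWN(end);
	return e_pfn > s_pfn ? e_pfn - s_pfn : 0;
}
```

For the end address 0xcd9ffc00, DEMO_PFN_UP() gives 0xcda00 and
DEMO_PFN_DOWN() gives 0xcd9ff. Rounding the end up would treat pfn 0xcd9ff
as RAM even though only 0xc00 bytes of that page are usable, which matches
the "cd9fe-cda00" versus "cd9fe-cd9ff" lines in the dmesg diff.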


* [Xen-devel] [PATCH-v2] xen: populate correct number of pages when across mem boundary
From: zhenzhong.duan @ 2012-07-13  8:31 UTC (permalink / raw)
  To: David Vrabel
  Cc: jeremy, xen-devel, Konrad Rzeszutek Wilk, x86, Feng Jin,
	linux-kernel, virtualization, mingo, hpa, tglx
In-Reply-To: <4FFEE552.4070201@citrix.com>

When populating pages across a memory boundary at boot, the populated
page count isn't correct. This is because memory populated into a
non-memory region is ignored.

The pfn range is also wrongly aligned when the memory boundary isn't page aligned.

v2: If xen_do_chunk() fails to populate, abort this chunk and any others.
Suggested by David, thanks.

For a dom0 booted with dom0_mem=3368952K (0xcd9ff000-4k), the dmesg diff is:
  [    0.000000] Freeing 9e-100 pfn range: 98 pages freed
  [    0.000000] 1-1 mapping on 9e->100
  [    0.000000] 1-1 mapping on cd9ff->100000
  [    0.000000] Released 98 pages of unused memory
  [    0.000000] Set 206435 page(s) to 1-1 mapping
-[    0.000000] Populating cd9fe-cda00 pfn range: 1 pages added
+[    0.000000] Populating cd9fe-cd9ff pfn range: 1 pages added
+[    0.000000] Populating 100000-100061 pfn range: 97 pages added
  [    0.000000] BIOS-provided physical RAM map:
  [    0.000000] Xen: 0000000000000000 - 000000000009e000 (usable)
  [    0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
  [    0.000000] Xen: 0000000000100000 - 00000000cd9ff000 (usable)
  [    0.000000] Xen: 00000000cd9ffc00 - 00000000cda53c00 (ACPI NVS)
...
  [    0.000000] Xen: 0000000100000000 - 0000000100061000 (usable)
  [    0.000000] Xen: 0000000100061000 - 000000012c000000 (unusable)
...
  [    0.000000] MEMBLOCK configuration:
...
-[    0.000000]  reserved[0x4]       [0x000000cd9ff000-0x000000cd9ffbff], 0xc00 bytes
-[    0.000000]  reserved[0x5]       [0x00000100000000-0x00000100060fff], 0x61000 bytes

Related xen memory layout:
(XEN) Xen-e820 RAM map:
(XEN)  0000000000000000 - 000000000009ec00 (usable)
(XEN)  00000000000f0000 - 0000000000100000 (reserved)
(XEN)  0000000000100000 - 00000000cd9ffc00 (usable)

Signed-off-by: Zhenzhong Duan<zhenzhong.duan@oracle.com>
---
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index a4790bf..ead8557 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -157,25 +157,24 @@ static unsigned long __init xen_populate_chunk(
  	unsigned long dest_pfn;

  	for (i = 0, entry = list; i<  map_size; i++, entry++) {
-		unsigned long credits = credits_left;
  		unsigned long s_pfn;
  		unsigned long e_pfn;
  		unsigned long pfns;
  		long capacity;

-		if (credits<= 0)
+		if (credits_left<= 0)
  			break;

  		if (entry->type != E820_RAM)
  			continue;

-		e_pfn = PFN_UP(entry->addr + entry->size);
+		e_pfn = PFN_DOWN(entry->addr + entry->size);

  		/* We only care about E820 after the xen_start_info->nr_pages */
  		if (e_pfn<= max_pfn)
  			continue;

-		s_pfn = PFN_DOWN(entry->addr);
+		s_pfn = PFN_UP(entry->addr);
  		/* If the E820 falls within the nr_pages, we want to start
  		 * at the nr_pages PFN.
  		 * If that would mean going past the E820 entry, skip it
@@ -184,23 +183,19 @@ static unsigned long __init xen_populate_chunk(
  			capacity = e_pfn - max_pfn;
  			dest_pfn = max_pfn;
  		} else {
-			/* last_pfn MUST be within E820_RAM regions */
-			if (*last_pfn&&  e_pfn>= *last_pfn)
-				s_pfn = *last_pfn;
  			capacity = e_pfn - s_pfn;
  			dest_pfn = s_pfn;
  		}
-		/* If we had filled this E820_RAM entry, go to the next one. */
-		if (capacity<= 0)
-			continue;

-		if (credits>  capacity)
-			credits = capacity;
+		if (credits_left<  capacity)
+			capacity = credits_left;

-		pfns = xen_do_chunk(dest_pfn, dest_pfn + credits, false);
+		pfns = xen_do_chunk(dest_pfn, dest_pfn + capacity, false);
  		done += pfns;
-		credits_left -= pfns;
  		*last_pfn = (dest_pfn + pfns);
+		if (pfns<  capacity)
+			break;
+		credits_left -= pfns;
  	}
  	return done;
  }
-- 
1.7.3
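
The control flow of the v2 loop can be sketched in userspace. This is a
simplified model under stated assumptions, not the kernel function:
demo_range, demo_hyp_pages, demo_do_chunk() and demo_populate() are
hypothetical, the ranges are assumed to be already page aligned and above
max_pfn, and demo_do_chunk() pretends to be a hypervisor with a limited
number of pages left.

```c
#include <assert.h>

struct demo_range {
	unsigned long s_pfn, e_pfn; /* whole-page RAM range [s_pfn, e_pfn) */
};

/* Pages the pretend hypervisor can still hand out. */
static unsigned long demo_hyp_pages;

/* Stand-in for xen_do_chunk(): returns how many pages were populated. */
static unsigned long demo_do_chunk(unsigned long start, unsigned long end)
{
	unsigned long want = end - start;
	unsigned long got = want < demo_hyp_pages ? want : demo_hyp_pages;

	demo_hyp_pages -= got;
	return got;
}

static unsigned long demo_populate(const struct demo_range *r, int n,
				   unsigned long credits_left)
{
	unsigned long done = 0;
	int i;

	for (i = 0; i < n && credits_left > 0; i++) {
		unsigned long capacity, pfns;

		assert(r[i].e_pfn >= r[i].s_pfn);
		capacity = r[i].e_pfn - r[i].s_pfn;
		if (credits_left < capacity)
			capacity = credits_left;

		pfns = demo_do_chunk(r[i].s_pfn, r[i].s_pfn + capacity);
		done += pfns;
		/* Short populate: stop here instead of retrying forever. */
		if (pfns < capacity)
			break;
		credits_left -= pfns;
	}
	return done;
}
```

With the two ranges from the dmesg diff (cd9fe-cd9ff and 100000-100061) and
98 credits, the model populates 1 + 97 pages when the hypervisor has enough
memory, and stops early with a smaller total when it runs out.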


* [PATCH V3 0/3] Improve virtio-blk performance
From: Asias He @ 2012-07-13  8:38 UTC (permalink / raw)
  To: kvm, linux-kernel, virtualization
  Cc: Jens Axboe, Tejun Heo, Shaohua Li, Michael S. Tsirkin

This patchset implements a bio-based IO path for virtio-blk to improve
performance.

Fio tests show that the bio-based IO path gives the following performance improvements:

1) Ramdisk device
     With bio-based IO path, sequential read/write, random read/write
     IOPS boost         : 28%, 24%, 21%, 16%
     Latency improvement: 32%, 17%, 21%, 16%
2) Fusion IO device
     With bio-based IO path, sequential read/write, random read/write
     IOPS boost         : 11%, 11%, 13%, 10%
     Latency improvement: 10%, 10%, 12%, 10%

Asias He (3):
  block: Introduce __blk_segment_map_sg() helper
  block: Add blk_bio_map_sg() helper
  virtio-blk: Add bio-based IO path for virtio-blk

 block/blk-merge.c          |  117 +++++++++++++++++--------
 drivers/block/virtio_blk.c |  203 +++++++++++++++++++++++++++++++++++---------
 include/linux/blkdev.h     |    2 +
 3 files changed, 247 insertions(+), 75 deletions(-)

-- 
1.7.10.4


* [PATCH V3 1/3] block: Introduce __blk_segment_map_sg() helper
From: Asias He @ 2012-07-13  8:38 UTC (permalink / raw)
  To: kvm, linux-kernel, virtualization; +Cc: Jens Axboe, Tejun Heo, Shaohua Li
In-Reply-To: <1342168731-11797-1-git-send-email-asias@redhat.com>

Split the mapping code in blk_rq_map_sg() out into a helper,
__blk_segment_map_sg(), so that other mapping functions, e.g.
blk_bio_map_sg(), can share the code.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: linux-kernel@vger.kernel.org
Suggested-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Asias He <asias@redhat.com>
---
 block/blk-merge.c |   80 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 45 insertions(+), 35 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 160035f..576b68e 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -110,6 +110,49 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
 	return 0;
 }
 
+static void
+__blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
+		     struct scatterlist *sglist, struct bio_vec **bvprv,
+		     struct scatterlist **sg, int *nsegs, int *cluster)
+{
+
+	int nbytes = bvec->bv_len;
+
+	if (*bvprv && *cluster) {
+		if ((*sg)->length + nbytes > queue_max_segment_size(q))
+			goto new_segment;
+
+		if (!BIOVEC_PHYS_MERGEABLE(*bvprv, bvec))
+			goto new_segment;
+		if (!BIOVEC_SEG_BOUNDARY(q, *bvprv, bvec))
+			goto new_segment;
+
+		(*sg)->length += nbytes;
+	} else {
+new_segment:
+		if (!*sg)
+			*sg = sglist;
+		else {
+			/*
+			 * If the driver previously mapped a shorter
+			 * list, we could see a termination bit
+			 * prematurely unless it fully inits the sg
+			 * table on each mapping. We KNOW that there
+			 * must be more entries here or the driver
+			 * would be buggy, so force clear the
+			 * termination bit to avoid doing a full
+			 * sg_init_table() in drivers for each command.
+			 */
+			(*sg)->page_link &= ~0x02;
+			*sg = sg_next(*sg);
+		}
+
+		sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
+		(*nsegs)++;
+	}
+	*bvprv = bvec;
+}
+
 /*
  * map a request to scatterlist, return number of sg entries setup. Caller
  * must make sure sg can hold rq->nr_phys_segments entries
@@ -131,41 +174,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 	bvprv = NULL;
 	sg = NULL;
 	rq_for_each_segment(bvec, rq, iter) {
-		int nbytes = bvec->bv_len;
-
-		if (bvprv && cluster) {
-			if (sg->length + nbytes > queue_max_segment_size(q))
-				goto new_segment;
-
-			if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
-				goto new_segment;
-			if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
-				goto new_segment;
-
-			sg->length += nbytes;
-		} else {
-new_segment:
-			if (!sg)
-				sg = sglist;
-			else {
-				/*
-				 * If the driver previously mapped a shorter
-				 * list, we could see a termination bit
-				 * prematurely unless it fully inits the sg
-				 * table on each mapping. We KNOW that there
-				 * must be more entries here or the driver
-				 * would be buggy, so force clear the
-				 * termination bit to avoid doing a full
-				 * sg_init_table() in drivers for each command.
-				 */
-				sg->page_link &= ~0x02;
-				sg = sg_next(sg);
-			}
-
-			sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
-			nsegs++;
-		}
-		bvprv = bvec;
+		__blk_segment_map_sg(q, bvec, sglist, &bvprv, &sg,
+				     &nsegs, &cluster);
 	} /* segments in rq */
 
 
-- 
1.7.10.4
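
The merge decision made by the helper can be modelled in userspace. This
is a simplified sketch, not the block-layer code: demo_vec, demo_seg and
demo_map_segments() are hypothetical, physical addresses are plain
integers, and the segment-boundary check and the scatterlist termination
bit handling are left out.

```c
#include <assert.h>
#include <stddef.h>

struct demo_vec { unsigned long addr; unsigned int len; }; /* like a bio_vec */
struct demo_seg { unsigned long addr; unsigned int len; }; /* like an sg entry */

/*
 * Fold vectors into segments. A vector is merged into the current segment
 * only if clustering is on, the result stays within max_seg_size, and the
 * vector starts exactly where the segment ends. Returns the segment count.
 * The caller must provide room for nvec segments.
 */
static int demo_map_segments(const struct demo_vec *vec, int nvec,
			     struct demo_seg *segs, unsigned int max_seg_size,
			     int cluster)
{
	struct demo_seg *sg = NULL;
	int nsegs = 0;
	int i;

	assert(max_seg_size > 0);
	for (i = 0; i < nvec; i++) {
		int mergeable = sg && cluster &&
			sg->len + vec[i].len <= max_seg_size &&
			sg->addr + sg->len == vec[i].addr;

		if (mergeable) {
			sg->len += vec[i].len;
		} else {
			/* start a new segment */
			sg = &segs[nsegs++];
			sg->addr = vec[i].addr;
			sg->len = vec[i].len;
		}
	}
	return nsegs;
}
```

Three 4096-byte vectors at 0x1000, 0x2000 and 0x5000 collapse into two
segments when clustering is enabled and the size limit allows it, and stay
as three segments otherwise.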


* [PATCH V3 2/3] block: Add blk_bio_map_sg() helper
From: Asias He @ 2012-07-13  8:38 UTC (permalink / raw)
  To: kvm, linux-kernel, virtualization
  Cc: Jens Axboe, Tejun Heo, Shaohua Li, Christoph Hellwig
In-Reply-To: <1342168731-11797-1-git-send-email-asias@redhat.com>

Add a helper to map a bio to a scatterlist, modelled after
blk_rq_map_sg.

This helper is useful for any driver that wants to create
a scatterlist from its ->make_request_fn method.

Changes in v2:
 - Use __blk_segment_map_sg to avoid duplicated code
 - Add docbook style function comment

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Asias He <asias@redhat.com>
---
 block/blk-merge.c      |   37 +++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h |    2 ++
 2 files changed, 39 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 576b68e..e76279e 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -209,6 +209,43 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 }
 EXPORT_SYMBOL(blk_rq_map_sg);
 
+/**
+ * blk_bio_map_sg - map a bio to a scatterlist
+ * @q: request_queue in question
+ * @bio: bio being mapped
+ * @sglist: scatterlist being mapped
+ *
+ * Note:
+ *    Caller must make sure sg can hold bio->bi_phys_segments entries
+ *
+ * Will return the number of sg entries setup
+ */
+int blk_bio_map_sg(struct request_queue *q, struct bio *bio,
+		   struct scatterlist *sglist)
+{
+	struct bio_vec *bvec, *bvprv;
+	struct scatterlist *sg;
+	int nsegs, cluster;
+	unsigned long i;
+
+	nsegs = 0;
+	cluster = blk_queue_cluster(q);
+
+	bvprv = NULL;
+	sg = NULL;
+	bio_for_each_segment(bvec, bio, i) {
+		__blk_segment_map_sg(q, bvec, sglist, &bvprv, &sg,
+				     &nsegs, &cluster);
+	} /* segments in bio */
+
+	if (sg)
+		sg_mark_end(sg);
+
+	BUG_ON(bio->bi_phys_segments && nsegs > bio->bi_phys_segments);
+	return nsegs;
+}
+EXPORT_SYMBOL(blk_bio_map_sg);
+
 static inline int ll_new_hw_segment(struct request_queue *q,
 				    struct request *req,
 				    struct bio *bio)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 07954b0..87fb56c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -883,6 +883,8 @@ extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
 
 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
+extern int blk_bio_map_sg(struct request_queue *q, struct bio *bio,
+			  struct scatterlist *sglist);
 extern void blk_dump_rq_flags(struct request *, char *);
 extern long nr_blockdev_pages(void);
 
-- 
1.7.10.4


* [PATCH V3 3/3] virtio-blk: Add bio-based IO path for virtio-blk
From: Asias He @ 2012-07-13  8:38 UTC (permalink / raw)
  To: kvm, linux-kernel, virtualization; +Cc: Christoph Hellwig, Michael S. Tsirkin
In-Reply-To: <1342168731-11797-1-git-send-email-asias@redhat.com>

This patch introduces a bio-based IO path for virtio-blk.

Compared to the request-based IO path, the bio-based IO path uses a
driver-provided ->make_request_fn() method to bypass the IO scheduler. It
hands the bio to the device directly without allocating a request in the
block layer. This shortens the IO path in the guest kernel to achieve higher
IOPS and lower latency. The downside is that the guest cannot use the IO
scheduler to merge and sort requests. However, this is not a big problem
if the backend disk on the host side is a fast device.

When the bio-based IO path is not enabled, virtio-blk still uses the
original request-based IO path, and no performance difference is observed.

Performance evaluation:
-----------------------------
1) Fio test is performed in an 8-vcpu guest with a ramdisk-based disk, using
kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost         : 28%, 24%, 21%, 16%
 Latency improvement: 32%, 17%, 21%, 16%

Long version:
 With bio-based IO path:
  seq-read  : io=2048.0MB, bw=116996KB/s, iops=233991 , runt= 17925msec
  seq-write : io=2048.0MB, bw=100829KB/s, iops=201658 , runt= 20799msec
  rand-read : io=3095.7MB, bw=112134KB/s, iops=224268 , runt= 28269msec
  rand-write: io=3095.7MB, bw=96198KB/s,  iops=192396 , runt= 32952msec
    clat (usec): min=0 , max=2631.6K, avg=58716.99, stdev=191377.30
    clat (usec): min=0 , max=1753.2K, avg=66423.25, stdev=81774.35
    clat (usec): min=0 , max=2915.5K, avg=61685.70, stdev=120598.39
    clat (usec): min=0 , max=1933.4K, avg=76935.12, stdev=96603.45
  cpu : usr=74.08%, sys=703.84%, ctx=29661403, majf=21354, minf=22460954
  cpu : usr=70.92%, sys=702.81%, ctx=77219828, majf=13980, minf=27713137
  cpu : usr=72.23%, sys=695.37%, ctx=88081059, majf=18475, minf=28177648
  cpu : usr=69.69%, sys=654.13%, ctx=145476035, majf=15867, minf=26176375
 With request-based IO path:
  seq-read  : io=2048.0MB, bw=91074KB/s, iops=182147 , runt= 23027msec
  seq-write : io=2048.0MB, bw=80725KB/s, iops=161449 , runt= 25979msec
  rand-read : io=3095.7MB, bw=92106KB/s, iops=184211 , runt= 34416msec
  rand-write: io=3095.7MB, bw=82815KB/s, iops=165630 , runt= 38277msec
    clat (usec): min=0 , max=1932.4K, avg=77824.17, stdev=170339.49
    clat (usec): min=0 , max=2510.2K, avg=78023.96, stdev=146949.15
    clat (usec): min=0 , max=3037.2K, avg=74746.53, stdev=128498.27
    clat (usec): min=0 , max=1363.4K, avg=89830.75, stdev=114279.68
  cpu : usr=53.28%, sys=724.19%, ctx=37988895, majf=17531, minf=23577622
  cpu : usr=49.03%, sys=633.20%, ctx=205935380, majf=18197, minf=27288959
  cpu : usr=55.78%, sys=722.40%, ctx=101525058, majf=19273, minf=28067082
  cpu : usr=56.55%, sys=690.83%, ctx=228205022, majf=18039, minf=26551985

2) Fio test is performed in an 8-vcpu guest with a Fusion-IO-based disk, using
kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost         : 11%, 11%, 13%, 10%
 Latency improvement: 10%, 10%, 12%, 10%
Long Version:
 With bio-based IO path:
  read : io=2048.0MB, bw=58920KB/s, iops=117840 , runt= 35593msec
  write: io=2048.0MB, bw=64308KB/s, iops=128616 , runt= 32611msec
  read : io=3095.7MB, bw=59633KB/s, iops=119266 , runt= 53157msec
  write: io=3095.7MB, bw=62993KB/s, iops=125985 , runt= 50322msec
    clat (usec): min=0 , max=1284.3K, avg=128109.01, stdev=71513.29
    clat (usec): min=94 , max=962339 , avg=116832.95, stdev=65836.80
    clat (usec): min=0 , max=1846.6K, avg=128509.99, stdev=89575.07
    clat (usec): min=0 , max=2256.4K, avg=121361.84, stdev=82747.25
  cpu : usr=56.79%, sys=421.70%, ctx=147335118, majf=21080, minf=19852517
  cpu : usr=61.81%, sys=455.53%, ctx=143269950, majf=16027, minf=24800604
  cpu : usr=63.10%, sys=455.38%, ctx=178373538, majf=16958, minf=24822612
  cpu : usr=62.04%, sys=453.58%, ctx=226902362, majf=16089, minf=23278105
 With request-based IO path:
  read : io=2048.0MB, bw=52896KB/s, iops=105791 , runt= 39647msec
  write: io=2048.0MB, bw=57856KB/s, iops=115711 , runt= 36248msec
  read : io=3095.7MB, bw=52387KB/s, iops=104773 , runt= 60510msec
  write: io=3095.7MB, bw=57310KB/s, iops=114619 , runt= 55312msec
    clat (usec): min=0 , max=1532.6K, avg=142085.62, stdev=109196.84
    clat (usec): min=0 , max=1487.4K, avg=129110.71, stdev=114973.64
    clat (usec): min=0 , max=1388.6K, avg=145049.22, stdev=107232.55
    clat (usec): min=0 , max=1465.9K, avg=133585.67, stdev=110322.95
  cpu : usr=44.08%, sys=590.71%, ctx=451812322, majf=14841, minf=17648641
  cpu : usr=48.73%, sys=610.78%, ctx=418953997, majf=22164, minf=26850689
  cpu : usr=45.58%, sys=581.16%, ctx=714079216, majf=21497, minf=22558223
  cpu : usr=48.40%, sys=599.65%, ctx=656089423, majf=16393, minf=23824409
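
The short-version percentages can be cross-checked against the long-version
numbers. The message does not state the formula, so the following is an
assumption: one reading that reproduces the quoted ramdisk figures is the
IOPS ratio (bio over request) and the average clat ratio (request over
bio), each minus one and truncated to a whole percent. Both helper names
are hypothetical.

```c
#include <assert.h>

/* IOPS gain of the bio path over the request path, truncated percent. */
static int demo_iops_boost_pct(unsigned long bio_iops, unsigned long req_iops)
{
	assert(req_iops > 0);
	return (int)(bio_iops * 100 / req_iops) - 100;
}

/* Average completion latency gain, relative to the bio path, truncated. */
static int demo_latency_gain_pct(double bio_clat, double req_clat)
{
	assert(bio_clat > 0.0);
	return (int)((req_clat / bio_clat - 1.0) * 100.0);
}
```

For the ramdisk seq-read case, demo_iops_boost_pct(233991, 182147) gives 28
and demo_latency_gain_pct(58716.99, 77824.17) gives 32, in line with the
first column of the short version.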

How to use:
-----------------------------
Add 'virtio_blk.use_bio=1' to kernel cmdline or 'modprobe virtio_blk
use_bio=1' to enable ->make_request_fn() based I/O path.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Asias He <asias@redhat.com>
---
 drivers/block/virtio_blk.c |  203 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 163 insertions(+), 40 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 774c31d..e137190 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -14,6 +14,9 @@
 
 #define PART_BITS 4
 
+static bool use_bio;
+module_param(use_bio, bool, S_IRUGO);
+
 static int major;
 static DEFINE_IDA(vd_index_ida);
 
@@ -23,6 +26,7 @@ struct virtio_blk
 {
 	struct virtio_device *vdev;
 	struct virtqueue *vq;
+	wait_queue_head_t queue_wait;
 
 	/* The disk structure for the kernel. */
 	struct gendisk *disk;
@@ -51,53 +55,87 @@ struct virtio_blk
 struct virtblk_req
 {
 	struct request *req;
+	struct bio *bio;
 	struct virtio_blk_outhdr out_hdr;
 	struct virtio_scsi_inhdr in_hdr;
 	u8 status;
+	struct scatterlist sg[];
 };
 
-static void blk_done(struct virtqueue *vq)
+static inline int virtblk_result(struct virtblk_req *vbr)
+{
+	switch (vbr->status) {
+	case VIRTIO_BLK_S_OK:
+		return 0;
+	case VIRTIO_BLK_S_UNSUPP:
+		return -ENOTTY;
+	default:
+		return -EIO;
+	}
+}
+
+static inline void virtblk_request_done(struct virtio_blk *vblk,
+					struct virtblk_req *vbr)
+{
+	struct request *req = vbr->req;
+	int error = virtblk_result(vbr);
+
+	if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
+		req->resid_len = vbr->in_hdr.residual;
+		req->sense_len = vbr->in_hdr.sense_len;
+		req->errors = vbr->in_hdr.errors;
+	} else if (req->cmd_type == REQ_TYPE_SPECIAL) {
+		req->errors = (error != 0);
+	}
+
+	__blk_end_request_all(req, error);
+	mempool_free(vbr, vblk->pool);
+}
+
+static inline void virtblk_bio_done(struct virtio_blk *vblk,
+				    struct virtblk_req *vbr)
+{
+	bio_endio(vbr->bio, virtblk_result(vbr));
+	mempool_free(vbr, vblk->pool);
+}
+
+static void virtblk_done(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
+	unsigned long bio_done = 0, req_done = 0;
 	struct virtblk_req *vbr;
-	unsigned int len;
 	unsigned long flags;
+	unsigned int len;
 
 	spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
 	while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
-		int error;
-
-		switch (vbr->status) {
-		case VIRTIO_BLK_S_OK:
-			error = 0;
-			break;
-		case VIRTIO_BLK_S_UNSUPP:
-			error = -ENOTTY;
-			break;
-		default:
-			error = -EIO;
-			break;
-		}
-
-		switch (vbr->req->cmd_type) {
-		case REQ_TYPE_BLOCK_PC:
-			vbr->req->resid_len = vbr->in_hdr.residual;
-			vbr->req->sense_len = vbr->in_hdr.sense_len;
-			vbr->req->errors = vbr->in_hdr.errors;
-			break;
-		case REQ_TYPE_SPECIAL:
-			vbr->req->errors = (error != 0);
-			break;
-		default:
-			break;
+		if (vbr->bio) {
+			virtblk_bio_done(vblk, vbr);
+			bio_done++;
+		} else {
+			virtblk_request_done(vblk, vbr);
+			req_done++;
 		}
-
-		__blk_end_request_all(vbr->req, error);
-		mempool_free(vbr, vblk->pool);
 	}
 	/* In case queue is stopped waiting for more buffers. */
-	blk_start_queue(vblk->disk->queue);
+	if (req_done)
+		blk_start_queue(vblk->disk->queue);
 	spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);
+
+	if (bio_done)
+		wake_up(&vblk->queue_wait);
+}
+
+static inline struct virtblk_req *virtblk_alloc_req(struct virtio_blk *vblk,
+						    gfp_t gfp_mask)
+{
+	struct virtblk_req *vbr;
+
+	vbr = mempool_alloc(vblk->pool, gfp_mask);
+	if (vbr && use_bio)
+		sg_init_table(vbr->sg, vblk->sg_elems);
+
+	return vbr;
 }
 
 static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
@@ -106,13 +144,13 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 	unsigned long num, out = 0, in = 0;
 	struct virtblk_req *vbr;
 
-	vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
+	vbr = virtblk_alloc_req(vblk, GFP_ATOMIC);
 	if (!vbr)
 		/* When another request finishes we'll try again. */
 		return false;
 
 	vbr->req = req;
-
+	vbr->bio = NULL;
 	if (req->cmd_flags & REQ_FLUSH) {
 		vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
 		vbr->out_hdr.sector = 0;
@@ -172,7 +210,8 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 		}
 	}
 
-	if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr, GFP_ATOMIC)<0) {
+	if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr,
+			      GFP_ATOMIC) < 0) {
 		mempool_free(vbr, vblk->pool);
 		return false;
 	}
@@ -180,7 +219,7 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 	return true;
 }
 
-static void do_virtblk_request(struct request_queue *q)
+static void virtblk_request(struct request_queue *q)
 {
 	struct virtio_blk *vblk = q->queuedata;
 	struct request *req;
@@ -203,6 +242,82 @@ static void do_virtblk_request(struct request_queue *q)
 		virtqueue_kick(vblk->vq);
 }
 
+static void virtblk_add_buf_wait(struct virtio_blk *vblk,
+				 struct virtblk_req *vbr,
+				 unsigned long out,
+				 unsigned long in)
+{
+	DEFINE_WAIT(wait);
+
+	for (;;) {
+		prepare_to_wait_exclusive(&vblk->queue_wait, &wait,
+					  TASK_UNINTERRUPTIBLE);
+
+		spin_lock_irq(vblk->disk->queue->queue_lock);
+		if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
+				      GFP_ATOMIC) < 0) {
+			spin_unlock_irq(vblk->disk->queue->queue_lock);
+			io_schedule();
+		} else {
+			virtqueue_kick(vblk->vq);
+			spin_unlock_irq(vblk->disk->queue->queue_lock);
+			break;
+		}
+
+	}
+
+	finish_wait(&vblk->queue_wait, &wait);
+}
+
+static void virtblk_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct virtio_blk *vblk = q->queuedata;
+	unsigned int num, out = 0, in = 0;
+	struct virtblk_req *vbr;
+
+	BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
+	BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
+
+	vbr = virtblk_alloc_req(vblk, GFP_NOIO);
+	if (!vbr) {
+		bio_endio(bio, -ENOMEM);
+		return;
+	}
+
+	vbr->bio = bio;
+	vbr->req = NULL;
+	vbr->out_hdr.type = 0;
+	vbr->out_hdr.sector = bio->bi_sector;
+	vbr->out_hdr.ioprio = bio_prio(bio);
+
+	sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
+
+	num = blk_bio_map_sg(q, bio, vbr->sg + out);
+
+	sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
+		   sizeof(vbr->status));
+
+	if (num) {
+		if (bio->bi_rw & REQ_WRITE) {
+			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
+			out += num;
+		} else {
+			vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
+			in += num;
+		}
+	}
+
+	spin_lock_irq(vblk->disk->queue->queue_lock);
+	if (unlikely(virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
+				       GFP_ATOMIC) < 0)) {
+		spin_unlock_irq(vblk->disk->queue->queue_lock);
+		virtblk_add_buf_wait(vblk, vbr, out, in);
+		return;
+	}
+	virtqueue_kick(vblk->vq);
+	spin_unlock_irq(vblk->disk->queue->queue_lock);
+}
+
 /* return id (s/n) string for *disk to *id_str
  */
 static int virtblk_get_id(struct gendisk *disk, char *id_str)
@@ -360,7 +475,7 @@ static int init_vq(struct virtio_blk *vblk)
 	int err = 0;
 
 	/* We expect one virtqueue, for output. */
-	vblk->vq = virtio_find_single_vq(vblk->vdev, blk_done, "requests");
+	vblk->vq = virtio_find_single_vq(vblk->vdev, virtblk_done, "requests");
 	if (IS_ERR(vblk->vq))
 		err = PTR_ERR(vblk->vq);
 
@@ -400,6 +515,8 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	struct virtio_blk *vblk;
 	struct request_queue *q;
 	int err, index;
+	int pool_size;
+
 	u64 cap;
 	u32 v, blk_size, sg_elems, opt_io_size;
 	u16 min_io_size;
@@ -429,10 +546,12 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 		goto out_free_index;
 	}
 
+	init_waitqueue_head(&vblk->queue_wait);
 	vblk->vdev = vdev;
 	vblk->sg_elems = sg_elems;
 	sg_init_table(vblk->sg, vblk->sg_elems);
 	mutex_init(&vblk->config_lock);
+
 	INIT_WORK(&vblk->config_work, virtblk_config_changed_work);
 	vblk->config_enable = true;
 
@@ -440,7 +559,10 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vblk;
 
-	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
+	pool_size = sizeof(struct virtblk_req);
+	if (use_bio)
+		pool_size += sizeof(struct scatterlist) * sg_elems;
+	vblk->pool = mempool_create_kmalloc_pool(1, pool_size);
 	if (!vblk->pool) {
 		err = -ENOMEM;
 		goto out_free_vq;
@@ -453,12 +575,14 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 		goto out_mempool;
 	}
 
-	q = vblk->disk->queue = blk_init_queue(do_virtblk_request, NULL);
+	q = vblk->disk->queue = blk_init_queue(virtblk_request, NULL);
 	if (!q) {
 		err = -ENOMEM;
 		goto out_put_disk;
 	}
 
+	if (use_bio)
+		blk_queue_make_request(q, virtblk_make_request);
 	q->queuedata = vblk;
 
 	virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
@@ -471,7 +595,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	vblk->index = index;
 
 	/* configure queue flush support */
-	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
+	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH) && !use_bio)
 		blk_queue_flush(q, REQ_FLUSH);
 
 	/* If disk is read-only in the host, the guest should obey */
@@ -544,7 +668,6 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	if (!err && opt_io_size)
 		blk_queue_io_opt(q, blk_size * opt_io_size);
 
-
 	add_disk(vblk->disk);
 	err = device_create_file(disk_to_dev(vblk->disk), &dev_attr_serial);
 	if (err)
-- 
1.7.10.4

* [PATCH RESEND 0/5] Add vhost-blk support
From: Asias He @ 2012-07-13  8:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-aio, kvm, Michael S. Tsirkin, virtualization,
	James Bottomley, Jeff Moyer, Benjamin LaHaise, Alexander Viro,
	linux-fsdevel


Hi folks,

[I am resending to fix the broken thread in the previous one.]

This patchset adds vhost-blk support. vhost-blk is an in-kernel virtio-blk
device accelerator. Compared to the userspace virtio-blk implementation,
vhost-blk gives about a 5% to 15% performance improvement.

Asias He (5):
  aio: Export symbols and struct kiocb_batch for in kernel aio usage
  eventfd: Export symbol eventfd_file_create()
  vhost: Make vhost a separate module
  vhost-net: Use VHOST_NET_FEATURES for vhost-net
  vhost-blk: Add vhost-blk support

 drivers/vhost/Kconfig  |   20 +-
 drivers/vhost/Makefile |    6 +-
 drivers/vhost/blk.c    |  600 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/net.c    |    4 +-
 drivers/vhost/test.c   |    4 +-
 drivers/vhost/vhost.c  |   48 ++++
 drivers/vhost/vhost.h  |   18 +-
 fs/aio.c               |   37 ++-
 fs/eventfd.c           |    1 +
 include/linux/aio.h    |   21 ++
 include/linux/vhost.h  |    3 +
 11 files changed, 731 insertions(+), 31 deletions(-)
 create mode 100644 drivers/vhost/blk.c

-- 
1.7.10.4

* [PATCH RESEND 1/5] aio: Export symbols and struct kiocb_batch for in kernel aio usage
From: Asias He @ 2012-07-13  8:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-aio, kvm, Michael S. Tsirkin, virtualization,
	James Bottomley, Jeff Moyer, Benjamin LaHaise, Alexander Viro,
	linux-fsdevel
In-Reply-To: <1342169711-12386-1-git-send-email-asias@redhat.com>

This is useful for people who want to use aio in kernel, e.g. vhost-blk.
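
Two of the exported helpers, try_get_ioctx() and put_ioctx(), follow the
usual "increment unless zero" / "decrement and test" reference counting
pattern. The sketch below reproduces that pattern in userspace with C11
atomics. It is not the kernel code: struct ctx_ref and the ctx_*()
functions are hypothetical stand-ins, and the released flag stands in
for the real teardown done by __put_ioctx().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ctx_ref {
	atomic_int users;	/* number of live references */
	bool released;		/* set when the last reference is dropped */
};

/* Start with one reference held by the creator. */
static void ctx_init(struct ctx_ref *c)
{
	atomic_init(&c->users, 1);
	c->released = false;
}

/* Take a reference only if the object is still alive (count != 0). */
static bool ctx_try_get(struct ctx_ref *c)
{
	int old = atomic_load(&c->users);

	while (old != 0) {
		/* On failure the CAS reloads 'old' and we retry. */
		if (atomic_compare_exchange_weak(&c->users, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the caller that drops the last one releases. */
static void ctx_put(struct ctx_ref *c)
{
	int old = atomic_fetch_sub(&c->users, 1);

	assert(old > 0);
	if (old == 1)
		c->released = true;
}
```

An in-kernel user would pair every successful try-get with exactly one
put, so that the context cannot be freed while a request still uses it.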

Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: linux-aio@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Asias He <asias@redhat.com>
---
 fs/aio.c            |   37 ++++++++++++++++++-------------------
 include/linux/aio.h |   21 +++++++++++++++++++++
 2 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 55c4c76..93dfbdd 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -224,22 +224,24 @@ static void __put_ioctx(struct kioctx *ctx)
 	call_rcu(&ctx->rcu_head, ctx_rcu_free);
 }
 
-static inline int try_get_ioctx(struct kioctx *kioctx)
+inline int try_get_ioctx(struct kioctx *kioctx)
 {
 	return atomic_inc_not_zero(&kioctx->users);
 }
+EXPORT_SYMBOL(try_get_ioctx);
 
-static inline void put_ioctx(struct kioctx *kioctx)
+inline void put_ioctx(struct kioctx *kioctx)
 {
 	BUG_ON(atomic_read(&kioctx->users) <= 0);
 	if (unlikely(atomic_dec_and_test(&kioctx->users)))
 		__put_ioctx(kioctx);
 }
+EXPORT_SYMBOL(put_ioctx);
 
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
-static struct kioctx *ioctx_alloc(unsigned nr_events)
+struct kioctx *ioctx_alloc(unsigned nr_events)
 {
 	struct mm_struct *mm;
 	struct kioctx *ctx;
@@ -303,6 +305,7 @@ out_freectx:
 	dprintk("aio: error allocating ioctx %d\n", err);
 	return ERR_PTR(err);
 }
+EXPORT_SYMBOL(ioctx_alloc);
 
 /* kill_ctx
  *	Cancels all outstanding aio requests on an aio context.  Used 
@@ -436,23 +439,14 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
 	return req;
 }
 
-/*
- * struct kiocb's are allocated in batches to reduce the number of
- * times the ctx lock is acquired and released.
- */
-#define KIOCB_BATCH_SIZE	32L
-struct kiocb_batch {
-	struct list_head head;
-	long count; /* number of requests left to allocate */
-};
-
-static void kiocb_batch_init(struct kiocb_batch *batch, long total)
+void kiocb_batch_init(struct kiocb_batch *batch, long total)
 {
 	INIT_LIST_HEAD(&batch->head);
 	batch->count = total;
 }
+EXPORT_SYMBOL(kiocb_batch_init);
 
-static void kiocb_batch_free(struct kioctx *ctx, struct kiocb_batch *batch)
+void kiocb_batch_free(struct kioctx *ctx, struct kiocb_batch *batch)
 {
 	struct kiocb *req, *n;
 
@@ -470,6 +464,7 @@ static void kiocb_batch_free(struct kioctx *ctx, struct kiocb_batch *batch)
 		wake_up_all(&ctx->wait);
 	spin_unlock_irq(&ctx->ctx_lock);
 }
+EXPORT_SYMBOL(kiocb_batch_free);
 
 /*
  * Allocate a batch of kiocbs.  This avoids taking and dropping the
@@ -540,7 +535,7 @@ out:
 	return allocated;
 }
 
-static inline struct kiocb *aio_get_req(struct kioctx *ctx,
+inline struct kiocb *aio_get_req(struct kioctx *ctx,
 					struct kiocb_batch *batch)
 {
 	struct kiocb *req;
@@ -552,6 +547,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx,
 	list_del(&req->ki_batch);
 	return req;
 }
+EXPORT_SYMBOL(aio_get_req);
 
 static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
 {
@@ -721,7 +717,7 @@ static inline int __queue_kicked_iocb(struct kiocb *iocb)
  * simplifies the coding of individual aio operations as
  * it avoids various potential races.
  */
-static ssize_t aio_run_iocb(struct kiocb *iocb)
+ssize_t aio_run_iocb(struct kiocb *iocb)
 {
 	struct kioctx	*ctx = iocb->ki_ctx;
 	ssize_t (*retry)(struct kiocb *);
@@ -815,6 +811,7 @@ out:
 	}
 	return ret;
 }
+EXPORT_SYMBOL(aio_run_iocb);
 
 /*
  * __aio_run_iocbs:
@@ -1136,7 +1133,7 @@ static inline void clear_timeout(struct aio_timeout *to)
 	del_singleshot_timer_sync(&to->timer);
 }
 
-static int read_events(struct kioctx *ctx,
+int read_events(struct kioctx *ctx,
 			long min_nr, long nr,
 			struct io_event __user *event,
 			struct timespec __user *timeout)
@@ -1252,6 +1249,7 @@ out:
 	destroy_timer_on_stack(&to.timer);
 	return i ? i : ret;
 }
+EXPORT_SYMBOL(read_events);
 
 /* Take an ioctx and remove it from the list of ioctx's.  Protects 
  * against races with itself via ->dead.
@@ -1492,7 +1490,7 @@ static ssize_t aio_setup_single_vector(int type, struct file * file, struct kioc
  *	Performs the initial checks and aio retry method
  *	setup for the kiocb at the time of io submission.
  */
-static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
+ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
 {
 	struct file *file = kiocb->ki_filp;
 	ssize_t ret = 0;
@@ -1570,6 +1568,7 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
 
 	return 0;
 }
+EXPORT_SYMBOL(aio_setup_iocb);
 
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 			 struct iocb *iocb, struct kiocb_batch *batch,
diff --git a/include/linux/aio.h b/include/linux/aio.h
index b1a520e..4731da5 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -126,6 +126,16 @@ struct kiocb {
 	struct eventfd_ctx	*ki_eventfd;
 };
 
+/*
+ * struct kiocb's are allocated in batches to reduce the number of
+ * times the ctx lock is acquired and released.
+ */
+#define KIOCB_BATCH_SIZE	32L
+struct kiocb_batch {
+	struct list_head head;
+	long count; /* number of requests left to allocate */
+};
+
 #define is_sync_kiocb(iocb)	((iocb)->ki_key == KIOCB_SYNC_KEY)
 #define init_sync_kiocb(x, filp)			\
 	do {						\
@@ -216,6 +226,17 @@ struct mm_struct;
 extern void exit_aio(struct mm_struct *mm);
 extern long do_io_submit(aio_context_t ctx_id, long nr,
 			 struct iocb __user *__user *iocbpp, bool compat);
+extern struct kioctx *ioctx_alloc(unsigned nr_events);
+extern ssize_t aio_run_iocb(struct kiocb *iocb);
+extern int read_events(struct kioctx *ctx, long min_nr, long nr,
+		       struct io_event __user *event,
+		       struct timespec __user *timeout);
+extern ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat);
+extern void kiocb_batch_init(struct kiocb_batch *batch, long total);
+extern void kiocb_batch_free(struct kioctx *ctx, struct kiocb_batch *batch);
+extern struct kiocb *aio_get_req(struct kioctx *ctx, struct kiocb_batch *batch);
+extern int try_get_ioctx(struct kioctx *kioctx);
+extern void put_ioctx(struct kioctx *kioctx);
 #else
 static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
 static inline int aio_put_req(struct kiocb *iocb) { return 0; }
-- 
1.7.10.4

* [PATCH RESEND 2/5] eventfd: Export symbol eventfd_file_create()
From: Asias He @ 2012-07-13  8:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, Michael S. Tsirkin, virtualization, Jeff Moyer,
	Alexander Viro, linux-fsdevel
In-Reply-To: <1342169711-12386-1-git-send-email-asias@redhat.com>

This is useful for people who want to create an eventfd in kernel,
e.g. vhost-blk.
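
For readers less familiar with eventfd, the counter semantics can be
tried from userspace with eventfd(2). The sketch below only illustrates
those semantics; it does not call eventfd_file_create(), which is the
in-kernel entry point this patch exports, and the counter_*() names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create a non-blocking eventfd whose 64-bit counter starts at initval. */
static int counter_create(unsigned int initval)
{
	return eventfd(initval, EFD_NONBLOCK);
}

/* Add n to the counter; returns 0 on success, -1 on error. */
static int counter_signal(int fd, uint64_t n)
{
	return write(fd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/* Read and reset the counter; returns 0 when nothing is pending. */
static uint64_t counter_drain(int fd)
{
	uint64_t val = 0;

	assert(fd >= 0);
	if (read(fd, &val, sizeof(val)) != (ssize_t)sizeof(val))
		return 0;	/* EAGAIN: the counter was zero */
	return val;
}
```

Several signals issued before a read are coalesced into one sum, which
is why a single wakeup can cover a batch of completions.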

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Asias He <asias@redhat.com>
---
 fs/eventfd.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index d81b9f6..b288963 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -402,6 +402,7 @@ struct file *eventfd_file_create(unsigned int count, int flags)
 
 	return file;
 }
+EXPORT_SYMBOL_GPL(eventfd_file_create);
 
 SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
 {
-- 
1.7.10.4

* [PATCH RESEND 3/5] vhost: Make vhost a separate module
From: Asias He @ 2012-07-13  8:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: virtualization, kvm, Michael S. Tsirkin
In-Reply-To: <1342169711-12386-1-git-send-email-asias@redhat.com>

Currently, vhost-net is the only consumer of the vhost infrastructure, so
the vhost infrastructure and the vhost-net driver live in a single module.

Separating them into a vhost.ko module and a vhost-net.ko module makes it
easier to share code with other vhost drivers, e.g. vhost-blk.ko and
tcm-vhost.ko.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Asias He <asias@redhat.com>
---
 drivers/vhost/Kconfig  |   10 +++++++++-
 drivers/vhost/Makefile |    4 +++-
 drivers/vhost/vhost.c  |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.h  |    1 +
 4 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..c387067 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -1,6 +1,14 @@
+config VHOST
+	tristate "Host kernel accelerator for virtio (EXPERIMENTAL)"
+	---help---
+	  This kernel module can be loaded in host kernel to accelerate
+	  guest networking and block.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called vhost.
 config VHOST_NET
 	tristate "Host kernel accelerator for virtio net (EXPERIMENTAL)"
-	depends on NET && EVENTFD && (TUN || !TUN) && (MACVTAP || !MACVTAP) && EXPERIMENTAL
+	depends on VHOST && NET && EVENTFD && (TUN || !TUN) && (MACVTAP || !MACVTAP) && EXPERIMENTAL
 	---help---
 	  This kernel module can be loaded in host kernel to accelerate
 	  guest networking with virtio_net. Not to be confused with virtio_net
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..cd36885 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
+obj-$(CONFIG_VHOST)	+= vhost.o
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
-vhost_net-y := vhost.o net.o
+
+vhost_net-y		:= net.o
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 112156f..6e9f586 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -25,6 +25,7 @@
 #include <linux/slab.h>
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
+#include <linux/module.h>
 
 #include <linux/net.h>
 #include <linux/if_packet.h>
@@ -84,6 +85,7 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 
 	vhost_work_init(&poll->work, fn);
 }
+EXPORT_SYMBOL_GPL(vhost_poll_init);
 
 /* Start polling a file. We add ourselves to file's wait queue. The caller must
  * keep a reference to a file until after vhost_poll_stop is called. */
@@ -95,6 +97,7 @@ void vhost_poll_start(struct vhost_poll *poll, struct file *file)
 	if (mask)
 		vhost_poll_wakeup(&poll->wait, 0, 0, (void *)mask);
 }
+EXPORT_SYMBOL_GPL(vhost_poll_start);
 
 /* Stop polling a file. After this function returns, it becomes safe to drop the
  * file reference. You must also flush afterwards. */
@@ -102,6 +105,7 @@ void vhost_poll_stop(struct vhost_poll *poll)
 {
 	remove_wait_queue(poll->wqh, &poll->wait);
 }
+EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
 static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work *work,
 				unsigned seq)
@@ -136,6 +140,7 @@ void vhost_poll_flush(struct vhost_poll *poll)
 {
 	vhost_work_flush(poll->dev, &poll->work);
 }
+EXPORT_SYMBOL_GPL(vhost_poll_flush);
 
 static inline void vhost_work_queue(struct vhost_dev *dev,
 				    struct vhost_work *work)
@@ -155,6 +160,7 @@ void vhost_poll_queue(struct vhost_poll *poll)
 {
 	vhost_work_queue(poll->dev, &poll->work);
 }
+EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
 static void vhost_vq_reset(struct vhost_dev *dev,
 			   struct vhost_virtqueue *vq)
@@ -251,6 +257,7 @@ void vhost_enable_zcopy(int vq)
 {
 	vhost_zcopy_mask |= 0x1 << vq;
 }
+EXPORT_SYMBOL_GPL(vhost_enable_zcopy);
 
 /* Helper to allocate iovec buffers for all vqs. */
 static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
@@ -322,6 +329,7 @@ long vhost_dev_init(struct vhost_dev *dev,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(vhost_dev_init);
 
 /* Caller should have device mutex */
 long vhost_dev_check_owner(struct vhost_dev *dev)
@@ -329,6 +337,7 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
 	/* Are you the owner? If not, I don't think you mean to do that */
 	return dev->mm == current->mm ? 0 : -EPERM;
 }
+EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
 
 struct vhost_attach_cgroups_struct {
 	struct vhost_work work;
@@ -414,6 +423,7 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
 	RCU_INIT_POINTER(dev->memory, memory);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(vhost_dev_reset_owner);
 
 /* In case of DMA done not in order in lower device driver for some reason.
  * upend_idx is used to track end of used idx, done_idx is used to track head
@@ -438,6 +448,7 @@ int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
 		vq->done_idx = i;
 	return j;
 }
+EXPORT_SYMBOL_GPL(vhost_zerocopy_signal_used);
 
 /* Caller should have device mutex if and only if locked is set */
 void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
@@ -489,6 +500,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 		mmput(dev->mm);
 	dev->mm = NULL;
 }
+EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
 
 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
 {
@@ -574,6 +586,7 @@ int vhost_log_access_ok(struct vhost_dev *dev)
 				       lockdep_is_held(&dev->mutex));
 	return memory_access_ok(dev, mp, 1);
 }
+EXPORT_SYMBOL_GPL(vhost_log_access_ok);
 
 /* Verify access for write logging. */
 /* Caller should have vq mutex and device mutex */
@@ -599,6 +612,7 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 	return vq_access_ok(vq->dev, vq->num, vq->desc, vq->avail, vq->used) &&
 		vq_log_access_ok(vq->dev, vq, vq->log_base);
 }
+EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
 
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
@@ -909,6 +923,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
 done:
 	return r;
 }
+EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
 
 static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
 						     __u64 addr, __u32 len)
@@ -1000,6 +1015,7 @@ int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 	BUG();
 	return 0;
 }
+EXPORT_SYMBOL_GPL(vhost_log_write);
 
 static int vhost_update_used_flags(struct vhost_virtqueue *vq)
 {
@@ -1051,6 +1067,7 @@ int vhost_init_used(struct vhost_virtqueue *vq)
 	vq->signalled_used_valid = false;
 	return get_user(vq->last_used_idx, &vq->used->idx);
 }
+EXPORT_SYMBOL_GPL(vhost_init_used);
 
 static int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
 			  struct iovec iov[], int iov_size)
@@ -1327,12 +1344,14 @@ int vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 	BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
 	return head;
 }
+EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
 
 /* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
 void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
 {
 	vq->last_avail_idx -= n;
 }
+EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
 
 /* After we've used one of their buffers, we tell them about it.  We'll then
  * want to notify the guest, using eventfd. */
@@ -1381,6 +1400,7 @@ int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
 		vq->signalled_used_valid = false;
 	return 0;
 }
+EXPORT_SYMBOL_GPL(vhost_add_used);
 
 static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 			    struct vring_used_elem *heads,
@@ -1450,6 +1470,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 	}
 	return r;
 }
+EXPORT_SYMBOL_GPL(vhost_add_used_n);
 
 static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
@@ -1494,6 +1515,7 @@ void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	if (vq->call_ctx && vhost_notify(dev, vq))
 		eventfd_signal(vq->call_ctx, 1);
 }
+EXPORT_SYMBOL_GPL(vhost_signal);
 
 /* And here's the combo meal deal.  Supersize me! */
 void vhost_add_used_and_signal(struct vhost_dev *dev,
@@ -1503,6 +1525,7 @@ void vhost_add_used_and_signal(struct vhost_dev *dev,
 	vhost_add_used(vq, head, len);
 	vhost_signal(dev, vq);
 }
+EXPORT_SYMBOL_GPL(vhost_add_used_and_signal);
 
 /* multi-buffer version of vhost_add_used_and_signal */
 void vhost_add_used_and_signal_n(struct vhost_dev *dev,
@@ -1512,6 +1535,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *dev,
 	vhost_add_used_n(vq, heads, count);
 	vhost_signal(dev, vq);
 }
+EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
 
 /* OK, now we need to know about added descriptors. */
 bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
@@ -1549,6 +1573,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 
 	return avail_idx != vq->avail_idx;
 }
+EXPORT_SYMBOL_GPL(vhost_enable_notify);
 
 /* We don't need to be notified again. */
 void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
@@ -1565,6 +1590,7 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 			       &vq->used->flags, r);
 	}
 }
+EXPORT_SYMBOL_GPL(vhost_disable_notify);
 
 static void vhost_zerocopy_done_signal(struct kref *kref)
 {
@@ -1588,11 +1614,13 @@ struct vhost_ubuf_ref *vhost_ubuf_alloc(struct vhost_virtqueue *vq,
 	ubufs->vq = vq;
 	return ubufs;
 }
+EXPORT_SYMBOL_GPL(vhost_ubuf_alloc);
 
 void vhost_ubuf_put(struct vhost_ubuf_ref *ubufs)
 {
 	kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
 }
+EXPORT_SYMBOL_GPL(vhost_ubuf_put);
 
 void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref *ubufs)
 {
@@ -1600,6 +1628,7 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref *ubufs)
 	wait_event(ubufs->wait, !atomic_read(&ubufs->kref.refcount));
 	kfree(ubufs);
 }
+EXPORT_SYMBOL_GPL(vhost_ubuf_put_and_wait);
 
 void vhost_zerocopy_callback(struct ubuf_info *ubuf)
 {
@@ -1611,3 +1640,22 @@ void vhost_zerocopy_callback(struct ubuf_info *ubuf)
 	vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
 	kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
 }
+EXPORT_SYMBOL_GPL(vhost_zerocopy_callback);
+
+static int __init vhost_init(void)
+{
+	return 0;
+}
+
+static void __exit vhost_exit(void)
+{
+	return;
+}
+
+module_init(vhost_init);
+module_exit(vhost_exit);
+
+MODULE_VERSION("0.0.1");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Michael S. Tsirkin");
+MODULE_DESCRIPTION("Host kernel accelerator for virtio");
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8de1fd5..c5c7fb0 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -12,6 +12,7 @@
 #include <linux/virtio_config.h>
 #include <linux/virtio_ring.h>
 #include <linux/atomic.h>
+#include <linux/virtio_net.h>
 
 /* This is for zerocopy, used buffer len is set to 1 when lower device DMA
  * done */
-- 
1.7.10.4

* [PATCH RESEND 4/5] vhost-net: Use VHOST_NET_FEATURES for vhost-net
From: Asias He @ 2012-07-13  8:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: virtualization, kvm, Michael S. Tsirkin
In-Reply-To: <1342169711-12386-1-git-send-email-asias@redhat.com>

The vhost-net feature set does not deserve the generic name VHOST_FEATURES.
Use VHOST_NET_FEATURES instead.
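
The mask is used in the VHOST_SET_FEATURES handlers to reject any bit the
driver does not know about. A userspace sketch of that check follows.
The bit numbers are mirrored from the kernel headers so the snippet
builds on its own and should be treated as illustrative; check_features()
and has_feature() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers, mirrored from the kernel headers for illustration. */
#define VIRTIO_NET_F_MRG_RXBUF		15
#define VIRTIO_F_NOTIFY_ON_EMPTY	24
#define VHOST_F_LOG_ALL			26
#define VHOST_NET_F_VIRTIO_NET_HDR	27
#define VIRTIO_RING_F_INDIRECT_DESC	28
#define VIRTIO_RING_F_EVENT_IDX		29

#define VHOST_NET_FEATURES \
	((1ULL << VIRTIO_F_NOTIFY_ON_EMPTY) | \
	 (1ULL << VIRTIO_RING_F_INDIRECT_DESC) | \
	 (1ULL << VIRTIO_RING_F_EVENT_IDX) | \
	 (1ULL << VHOST_F_LOG_ALL) | \
	 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) | \
	 (1ULL << VIRTIO_NET_F_MRG_RXBUF))

/* Reject a request that sets any bit outside the supported mask. */
static int check_features(uint64_t requested)
{
	if (requested & ~VHOST_NET_FEATURES)
		return -EOPNOTSUPP;
	return 0;
}

/* Test a single feature bit in an already accepted feature word. */
static bool has_feature(uint64_t features, int bit)
{
	assert(bit >= 0 && bit < 64);
	return (features & (1ULL << bit)) != 0;
}
```

With a per-device mask like this, a future vhost-blk could presumably
advertise its own set without inheriting net-only bits such as
VIRTIO_NET_F_MRG_RXBUF.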

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Asias He <asias@redhat.com>
---
 drivers/vhost/net.c   |    4 ++--
 drivers/vhost/test.c  |    4 ++--
 drivers/vhost/vhost.h |   12 ++++++------
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f82a739..072cbba 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -823,14 +823,14 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 			return -EFAULT;
 		return vhost_net_set_backend(n, backend.index, backend.fd);
 	case VHOST_GET_FEATURES:
-		features = VHOST_FEATURES;
+		features = VHOST_NET_FEATURES;
 		if (copy_to_user(featurep, &features, sizeof features))
 			return -EFAULT;
 		return 0;
 	case VHOST_SET_FEATURES:
 		if (copy_from_user(&features, featurep, sizeof features))
 			return -EFAULT;
-		if (features & ~VHOST_FEATURES)
+		if (features & ~VHOST_NET_FEATURES)
 			return -EOPNOTSUPP;
 		return vhost_net_set_features(n, features);
 	case VHOST_RESET_OWNER:
diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 3de00d9..91d6f06 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -261,14 +261,14 @@ static long vhost_test_ioctl(struct file *f, unsigned int ioctl,
 			return -EFAULT;
 		return vhost_test_run(n, test);
 	case VHOST_GET_FEATURES:
-		features = VHOST_FEATURES;
+		features = VHOST_NET_FEATURES;
 		if (copy_to_user(featurep, &features, sizeof features))
 			return -EFAULT;
 		return 0;
 	case VHOST_SET_FEATURES:
 		if (copy_from_user(&features, featurep, sizeof features))
 			return -EFAULT;
-		if (features & ~VHOST_FEATURES)
+		if (features & ~VHOST_NET_FEATURES)
 			return -EOPNOTSUPP;
 		return vhost_test_set_features(n, features);
 	case VHOST_RESET_OWNER:
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index c5c7fb0..cc046a9 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -199,12 +199,12 @@ int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq);
 	} while (0)
 
 enum {
-	VHOST_FEATURES = (1ULL << VIRTIO_F_NOTIFY_ON_EMPTY) |
-			 (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
-			 (1ULL << VIRTIO_RING_F_EVENT_IDX) |
-			 (1ULL << VHOST_F_LOG_ALL) |
-			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-			 (1ULL << VIRTIO_NET_F_MRG_RXBUF),
+	VHOST_NET_FEATURES =	(1ULL << VIRTIO_F_NOTIFY_ON_EMPTY) |
+				(1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+				(1ULL << VIRTIO_RING_F_EVENT_IDX) |
+				(1ULL << VHOST_F_LOG_ALL) |
+				(1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
+				(1ULL << VIRTIO_NET_F_MRG_RXBUF),
 };
 
 static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
-- 
1.7.10.4


* [PATCH RESEND 5/5] vhost-blk: Add vhost-blk support
From: Asias He @ 2012-07-13  8:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: virtualization, kvm, Michael S. Tsirkin
In-Reply-To: <1342169711-12386-1-git-send-email-asias@redhat.com>

vhost-blk is an in-kernel virtio-blk device accelerator.

This patch is based on Liu Yuan's implementation with various
improvements and bug fixes. Notably, this patch processes guest
notifications and host completions in parallel, which gives about a 60%
performance improvement over Liu Yuan's implementation.
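
For readers unfamiliar with the completion side: the host kick thread in
blk.c learns how many AIO completions are pending by reading an eventfd
counter, so it can reap completions while the vhost worker keeps handling
guest kicks. Below is a minimal userspace sketch of that counter behaviour,
assuming Linux eventfd(2) in its default (non-semaphore) mode. The helper
names are hypothetical and are not part of this patch.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal n completions, one eventfd write per completion. */
static int signal_completions(int efd, unsigned int n)
{
	uint64_t one = 1;
	unsigned int i;

	for (i = 0; i < n; i++)
		if (write(efd, &one, sizeof(one)) != sizeof(one))
			return -1;
	return 0;
}

/* One read returns the sum of all pending signals and resets the counter. */
static int64_t drain_completions(int efd)
{
	uint64_t count;

	if (read(efd, &count, sizeof(count)) != sizeof(count))
		return -1;
	return (int64_t)count;
}

/* Self-check: three separate signals are seen as one count of 3. */
static void demo_eventfd(void)
{
	int efd = eventfd(0, 0);

	assert(efd >= 0);
	assert(signal_completions(efd, 3) == 0);
	assert(drain_completions(efd) == 3);
	close(efd);
}
```

Because signals coalesce into a single count, the reader has to loop until
it has consumed that many events, which is what the inner do/while loop in
vhost_blk_host_kick_thread() appears to do.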

Performance evaluation:
-----------------------------
The comparison is between kvm tool with the userspace implementation and
kvm tool with vhost-blk.

1) Fio with libaio ioengine on Fusion IO device
With bio-based IO path; results are listed in the order sequential read,
sequential write, random read, random write:
IOPS boost         : 8.4%, 15.3%, 10.4%, 14.6%
Latency improvement: 8.5%, 15.4%, 10.4%, 15.1%

2) Fio with vsync ioengine on Fusion IO device
With bio-based IO path; results are listed in the order sequential read,
sequential write, random read, random write:
IOPS boost         : 10.5%, 4.8%, 5.2%, 5.6%
Latency improvement: 11.4%, 5.0%, 5.2%, 5.8%

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Asias He <asias@redhat.com>
---
 drivers/vhost/Kconfig  |   10 +
 drivers/vhost/Makefile |    2 +
 drivers/vhost/blk.c    |  600 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.h  |    5 +
 include/linux/vhost.h  |    3 +
 5 files changed, 620 insertions(+)
 create mode 100644 drivers/vhost/blk.c

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index c387067..fa071a8 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -16,4 +16,14 @@ config VHOST_NET
 
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
+config VHOST_BLK
+	tristate "Host kernel accelerator for virtio blk (EXPERIMENTAL)"
+	depends on VHOST && BLOCK && AIO && EVENTFD && EXPERIMENTAL
+	---help---
+	  This kernel module can be loaded in host kernel to accelerate
+	  guest block with virtio_blk. Not to be confused with virtio_blk
+	  module itself which needs to be loaded in guest kernel.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called vhost_blk.
 
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index cd36885..aa461d5 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,4 +1,6 @@
 obj-$(CONFIG_VHOST)	+= vhost.o
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
+obj-$(CONFIG_VHOST_BLK) += vhost_blk.o
 
 vhost_net-y		:= net.o
+vhost_blk-y		:= blk.o
diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
new file mode 100644
index 0000000..6a94894
--- /dev/null
+++ b/drivers/vhost/blk.c
@@ -0,0 +1,600 @@
+/*
+ * Copyright (C) 2011 Taobao, Inc.
+ * Author: Liu Yuan <tailai.ly@taobao.com>
+ *
+ * Copyright (C) 2012 Red Hat, Inc.
+ * Author: Asias He <asias@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * virtio-blk server in host kernel.
+ */
+
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/vhost.h>
+#include <linux/virtio_blk.h>
+#include <linux/eventfd.h>
+#include <linux/mutex.h>
+#include <linux/file.h>
+#include <linux/mmu_context.h>
+#include <linux/anon_inodes.h>
+#include <linux/kthread.h>
+#include <linux/blkdev.h>
+
+#include "vhost.h"
+
+#define BLK_HDR	0
+
+enum {
+	VHOST_BLK_VQ_REQ = 0,
+	VHOST_BLK_VQ_MAX = 1,
+};
+
+struct vhost_blk_req {
+	u16 head;
+	u8 *status;
+};
+
+struct vhost_blk {
+	struct task_struct *worker_host_kick;
+	struct task_struct *worker;
+	struct vhost_blk_req *reqs;
+	struct vhost_virtqueue vq;
+	struct eventfd_ctx *ectx;
+	struct io_event *ioevent;
+	struct kioctx *ioctx;
+	struct vhost_dev dev;
+	struct file *efile;
+	u64 ioevent_nr;
+	bool stop;
+};
+
+static inline int vhost_blk_read_events(struct vhost_blk *blk, long nr)
+{
+	mm_segment_t old_fs = get_fs();
+	int ret;
+
+	set_fs(KERNEL_DS);
+	ret = read_events(blk->ioctx, nr, nr, blk->ioevent, NULL);
+	set_fs(old_fs);
+
+	return ret;
+}
+
+static int vhost_blk_setup(struct vhost_blk *blk)
+{
+	struct kioctx *ctx;
+
+	if (blk->ioctx)
+		return 0;
+
+	blk->ioevent_nr = blk->vq.num;
+	ctx = ioctx_alloc(blk->ioevent_nr);
+	if (IS_ERR(ctx)) {
+		pr_err("Failed to ioctx_alloc");
+		return PTR_ERR(ctx);
+	}
+	put_ioctx(ctx);
+	blk->ioctx = ctx;
+
+	blk->ioevent = kmalloc(sizeof(struct io_event) * blk->ioevent_nr,
+			       GFP_KERNEL);
+	if (!blk->ioevent) {
+		pr_err("Failed to allocate memory for io_events");
+		return -ENOMEM;
+	}
+
+	blk->reqs = kmalloc(sizeof(struct vhost_blk_req) * blk->ioevent_nr,
+			    GFP_KERNEL);
+	if (!blk->reqs) {
+		pr_err("Failed to allocate memory for vhost_blk_req");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static inline int vhost_blk_set_status(struct vhost_blk *blk, u8 *statusp,
+				       u8 status)
+{
+	if (copy_to_user(statusp, &status, sizeof(status))) {
+		vq_err(&blk->vq, "Failed to write status\n");
+		vhost_discard_vq_desc(&blk->vq, 1);
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+static void vhost_blk_enable_vq(struct vhost_blk *blk,
+				struct vhost_virtqueue *vq)
+{
+	wake_up_process(blk->worker_host_kick);
+}
+
+static int vhost_blk_io_submit(struct vhost_blk *blk, struct file *file,
+			       struct vhost_blk_req *req,
+			       struct iovec *iov, u64 nr_vecs, loff_t offset,
+			       int opcode)
+{
+	struct kioctx *ioctx = blk->ioctx;
+	mm_segment_t oldfs = get_fs();
+	struct kiocb_batch batch;
+	struct blk_plug plug;
+	struct kiocb *iocb;
+	int ret;
+
+	if (!try_get_ioctx(ioctx)) {
+		pr_info("Failed to get ioctx");
+		return -EAGAIN;
+	}
+
+	atomic_long_inc_not_zero(&file->f_count);
+	eventfd_ctx_get(blk->ectx);
+
+	/* TODO: batch to 1 is not good! */
+	kiocb_batch_init(&batch, 1);
+	blk_start_plug(&plug);
+
+	iocb = aio_get_req(ioctx, &batch);
+	if (unlikely(!iocb)) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	iocb->ki_filp	= file;
+	iocb->ki_pos	= offset;
+	iocb->ki_buf	= (void *)iov;
+	iocb->ki_left	= nr_vecs;
+	iocb->ki_nbytes	= nr_vecs;
+	iocb->ki_opcode	= opcode;
+	iocb->ki_obj.user = req;
+	iocb->ki_eventfd  = blk->ectx;
+
+	set_fs(KERNEL_DS);
+	ret = aio_setup_iocb(iocb, false);
+	set_fs(oldfs);
+	if (unlikely(ret))
+		goto out_put_iocb;
+
+	spin_lock_irq(&ioctx->ctx_lock);
+	if (unlikely(ioctx->dead)) {
+		spin_unlock_irq(&ioctx->ctx_lock);
+		ret = -EINVAL;
+		goto out_put_iocb;
+	}
+	aio_run_iocb(iocb);
+	spin_unlock_irq(&ioctx->ctx_lock);
+
+	aio_put_req(iocb);
+
+	blk_finish_plug(&plug);
+	kiocb_batch_free(ioctx, &batch);
+	put_ioctx(ioctx);
+
+	return ret;
+out_put_iocb:
+	aio_put_req(iocb); /* Drop extra ref to req */
+	aio_put_req(iocb); /* Drop I/O ref to req */
+out:
+	put_ioctx(ioctx);
+	return ret;
+}
+
+static void vhost_blk_flush(struct vhost_blk *blk)
+{
+	vhost_poll_flush(&blk->vq.poll);
+}
+
+static struct file *vhost_blk_stop_vq(struct vhost_blk *blk,
+				      struct vhost_virtqueue *vq)
+{
+	struct file *file;
+
+	mutex_lock(&vq->mutex);
+	file = rcu_dereference_protected(vq->private_data,
+			lockdep_is_held(&vq->mutex));
+	rcu_assign_pointer(vq->private_data, NULL);
+	mutex_unlock(&vq->mutex);
+
+	return file;
+
+}
+
+static inline void vhost_blk_stop(struct vhost_blk *blk, struct file **file)
+{
+
+	*file = vhost_blk_stop_vq(blk, &blk->vq);
+}
+
+/* Handle guest request */
+static int vhost_blk_do_req(struct vhost_virtqueue *vq,
+			    struct virtio_blk_outhdr *hdr,
+			    u16 head, u16 out, u16 in,
+			    struct file *file)
+{
+	struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev);
+	struct iovec *iov = &vq->iov[BLK_HDR + 1];
+	loff_t offset = hdr->sector << 9;
+	struct vhost_blk_req *req;
+	u64 nr_vecs;
+	int ret = 0;
+	u8 status;
+
+	if (hdr->type == VIRTIO_BLK_T_IN || hdr->type == VIRTIO_BLK_T_GET_ID)
+		nr_vecs = in - 1;
+	else
+		nr_vecs = out - 1;
+
+	req		= &blk->reqs[head];
+	req->head	= head;
+	req->status	= blk->vq.iov[nr_vecs + 1].iov_base;
+
+	switch (hdr->type) {
+	case VIRTIO_BLK_T_OUT:
+		ret = vhost_blk_io_submit(blk, file, req, iov, nr_vecs, offset,
+					  IOCB_CMD_PWRITEV);
+		break;
+	case VIRTIO_BLK_T_IN:
+		ret = vhost_blk_io_submit(blk, file, req, iov, nr_vecs, offset,
+					  IOCB_CMD_PREADV);
+		break;
+	case VIRTIO_BLK_T_FLUSH:
+		ret = vfs_fsync(file, 1);
+		status = ret < 0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
+		ret = vhost_blk_set_status(blk, req->status, status);
+		if (!ret)
+			vhost_add_used_and_signal(&blk->dev, vq, head, ret);
+		break;
+	case VIRTIO_BLK_T_GET_ID:
+		/* TODO: need a real ID string */
+		ret = snprintf(vq->iov[BLK_HDR + 1].iov_base,
+			       VIRTIO_BLK_ID_BYTES, "VHOST-BLK-DISK");
+		status = ret < 0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
+		ret = vhost_blk_set_status(blk, req->status, status);
+		if (!ret)
+			vhost_add_used_and_signal(&blk->dev, vq, head,
+						  VIRTIO_BLK_ID_BYTES);
+		break;
+	default:
+		pr_warn("Unsupported request type %d\n", hdr->type);
+		vhost_discard_vq_desc(vq, 1);
+		ret = -EFAULT;
+		break;
+	}
+
+	return ret;
+}
+
+/* Guest kick us for IO request */
+static void vhost_blk_handle_guest_kick(struct vhost_work *work)
+{
+	struct virtio_blk_outhdr hdr;
+	struct vhost_virtqueue *vq;
+	struct vhost_blk *blk;
+	struct file *f;
+	int in, out;
+	u16 head;
+
+	vq = container_of(work, struct vhost_virtqueue, poll.work);
+	blk = container_of(vq->dev, struct vhost_blk, dev);
+
+	/* TODO: check that we are running from vhost_worker? */
+	f = rcu_dereference_check(vq->private_data, 1);
+	if (!f)
+		return;
+
+	vhost_disable_notify(&blk->dev, vq);
+	for (;;) {
+		head = vhost_get_vq_desc(&blk->dev, vq, vq->iov,
+					 ARRAY_SIZE(vq->iov),
+					 &out, &in, NULL, NULL);
+		if (unlikely(head < 0))
+			break;
+
+		if (unlikely(head == vq->num)) {
+			if (unlikely(vhost_enable_notify(&blk->dev, vq))) {
+				vhost_disable_notify(&blk->dev, vq);
+				continue;
+			}
+			break;
+		}
+
+		if (unlikely(vq->iov[BLK_HDR].iov_len != sizeof(hdr))) {
+			vq_err(vq, "Bad block header length!\n");
+			vhost_discard_vq_desc(vq, 1);
+			break;
+		}
+
+		if (unlikely(copy_from_user(&hdr, vq->iov[BLK_HDR].iov_base,
+					    sizeof(hdr)))) {
+			vq_err(vq, "Failed to get block header!\n");
+			vhost_discard_vq_desc(vq, 1);
+			break;
+		}
+
+
+		if (unlikely(vhost_blk_do_req(vq, &hdr, head, out, in, f) < 0))
+			break;
+	}
+}
+
+/* Complete the IO request */
+static int vhost_blk_host_kick_thread(void *data)
+{
+	mm_segment_t oldfs = get_fs();
+	struct vhost_blk *blk = data;
+	struct vhost_virtqueue *vq;
+	struct vhost_blk_req *req;
+	struct io_event *e;
+	int ret, i, len;
+	u64 count, nr;
+	u8 status;
+
+	vq = &blk->vq;
+	set_fs(USER_DS);
+	use_mm(blk->dev.mm);
+	for (;;) {
+		do {
+			ret = eventfd_ctx_read(blk->ectx, 0, &count);
+			if (unlikely(kthread_should_stop() || blk->stop))
+				goto out;
+		} while (ret != 0);
+
+		do {
+			nr = vhost_blk_read_events(blk,
+						   min(count, blk->ioevent_nr));
+			if (unlikely(nr <= 0))
+				continue;
+			count -= nr;
+
+			for (i = 0; i < nr; i++) {
+				e = &blk->ioevent[i];
+				req = (void *)e->obj;
+				len = e->res;
+				status = len > 0 ? VIRTIO_BLK_S_OK :
+						   VIRTIO_BLK_S_IOERR;
+				ret = copy_to_user(req->status, &status,
+						   sizeof(status));
+				if (unlikely(ret)) {
+					vq_err(&blk->vq,
+					       "Failed to write status\n");
+					continue;
+				}
+				vhost_add_used(&blk->vq, req->head, len);
+			}
+			vhost_signal(&blk->dev, &blk->vq);
+		} while (count > 0);
+	}
+
+out:
+	unuse_mm(blk->dev.mm);
+	set_fs(oldfs);
+	return 0;
+}
+
+static int vhost_blk_open(struct inode *inode, struct file *file)
+{
+	struct vhost_blk *blk;
+	int ret;
+
+	blk = kzalloc(sizeof(*blk), GFP_KERNEL);
+	if (!blk) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	blk->vq.handle_kick = vhost_blk_handle_guest_kick;
+
+	ret = vhost_dev_init(&blk->dev, &blk->vq, VHOST_BLK_VQ_MAX);
+	if (ret < 0)
+		goto out_dev;
+	/*
+	 * Create an eventfd which is used by aio code to
+	 * notify guest when request is completed.
+	 */
+	blk->efile = eventfd_file_create(0, 0);
+	if (IS_ERR(blk->efile))
+		goto out_dev;
+	blk->ectx = eventfd_ctx_fileget(blk->efile);
+	if (IS_ERR(blk->ectx))
+		goto out_dev;
+
+	file->private_data = blk;
+
+	blk->worker_host_kick = kthread_create(vhost_blk_host_kick_thread,
+			blk, "vhost-blk-%d", current->pid);
+	if (IS_ERR(blk->worker_host_kick)) {
+		ret = PTR_ERR(blk->worker_host_kick);
+		goto out_dev;
+	}
+
+	return ret;
+out_dev:
+	kfree(blk);
+out:
+	return ret;
+}
+
+static int vhost_blk_release(struct inode *inode, struct file *f)
+{
+	struct vhost_blk *blk = f->private_data;
+	struct file *file;
+
+	vhost_blk_stop(blk, &file);
+	vhost_blk_flush(blk);
+	vhost_dev_cleanup(&blk->dev, false);
+	if (file)
+		fput(file);
+
+	blk->stop = true;
+	eventfd_signal(blk->ectx, 1);
+	kthread_stop(blk->worker_host_kick);
+
+	eventfd_ctx_put(blk->ectx);
+	if (blk->efile)
+		fput(blk->efile);
+
+	kfree(blk->ioevent);
+	kfree(blk->reqs);
+	kfree(blk);
+
+	return 0;
+}
+
+static int vhost_blk_set_features(struct vhost_blk *blk, u64 features)
+{
+	mutex_lock(&blk->dev.mutex);
+	blk->dev.acked_features = features;
+	mutex_unlock(&blk->dev.mutex);
+
+	return 0;
+}
+
+static long vhost_blk_set_backend(struct vhost_blk *blk, unsigned index, int fd)
+{
+	struct vhost_virtqueue *vq = &blk->vq;
+	struct file *file, *oldfile;
+	int ret;
+
+	mutex_lock(&blk->dev.mutex);
+	ret = vhost_dev_check_owner(&blk->dev);
+	if (ret)
+		goto out_dev;
+
+	if (index >= VHOST_BLK_VQ_MAX) {
+		ret = -ENOBUFS;
+		goto out_dev;
+	}
+
+	mutex_lock(&vq->mutex);
+
+	if (!vhost_vq_access_ok(vq)) {
+		ret = -EFAULT;
+		goto out_vq;
+	}
+
+	file = fget(fd);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out_vq;
+	}
+
+	oldfile = rcu_dereference_protected(vq->private_data,
+			lockdep_is_held(&vq->mutex));
+	if (file != oldfile) {
+		rcu_assign_pointer(vq->private_data, file);
+		vhost_blk_enable_vq(blk, vq);
+
+		ret = vhost_init_used(vq);
+		if (ret)
+			goto out_vq;
+	}
+
+	mutex_unlock(&vq->mutex);
+
+	if (oldfile) {
+		vhost_blk_flush(blk);
+		fput(oldfile);
+	}
+
+	mutex_unlock(&blk->dev.mutex);
+	return 0;
+
+out_vq:
+	mutex_unlock(&vq->mutex);
+out_dev:
+	mutex_unlock(&blk->dev.mutex);
+	return ret;
+}
+
+static long vhost_blk_reset_owner(struct vhost_blk *blk)
+{
+	struct file *file = NULL;
+	int err;
+
+	mutex_lock(&blk->dev.mutex);
+	err = vhost_dev_check_owner(&blk->dev);
+	if (err)
+		goto done;
+	vhost_blk_stop(blk, &file);
+	vhost_blk_flush(blk);
+	err = vhost_dev_reset_owner(&blk->dev);
+done:
+	mutex_unlock(&blk->dev.mutex);
+	if (file)
+		fput(file);
+	return err;
+}
+
+static long vhost_blk_ioctl(struct file *f, unsigned int ioctl,
+			    unsigned long arg)
+{
+	struct vhost_blk *blk = f->private_data;
+	void __user *argp = (void __user *)arg;
+	struct vhost_vring_file backend;
+	u64 __user *featurep = argp;
+	u64 features;
+	int ret;
+
+	switch (ioctl) {
+	case VHOST_BLK_SET_BACKEND:
+		if (copy_from_user(&backend, argp, sizeof backend))
+			return -EFAULT;
+		return vhost_blk_set_backend(blk, backend.index, backend.fd);
+	case VHOST_GET_FEATURES:
+		features = VHOST_BLK_FEATURES;
+		if (copy_to_user(featurep, &features, sizeof features))
+			return -EFAULT;
+		return 0;
+	case VHOST_SET_FEATURES:
+		if (copy_from_user(&features, featurep, sizeof features))
+			return -EFAULT;
+		if (features & ~VHOST_BLK_FEATURES)
+			return -EOPNOTSUPP;
+		return vhost_blk_set_features(blk, features);
+	case VHOST_RESET_OWNER:
+		return vhost_blk_reset_owner(blk);
+	default:
+		mutex_lock(&blk->dev.mutex);
+		ret = vhost_dev_ioctl(&blk->dev, ioctl, arg);
+		if (!ret && ioctl == VHOST_SET_VRING_NUM)
+			ret = vhost_blk_setup(blk);
+		vhost_blk_flush(blk);
+		mutex_unlock(&blk->dev.mutex);
+		return ret;
+	}
+}
+
+static const struct file_operations vhost_blk_fops = {
+	.owner          = THIS_MODULE,
+	.open           = vhost_blk_open,
+	.release        = vhost_blk_release,
+	.llseek		= noop_llseek,
+	.unlocked_ioctl = vhost_blk_ioctl,
+};
+
+static struct miscdevice vhost_blk_misc = {
+	MISC_DYNAMIC_MINOR,
+	"vhost-blk",
+	&vhost_blk_fops,
+};
+
+int vhost_blk_init(void)
+{
+	return misc_register(&vhost_blk_misc);
+}
+
+void vhost_blk_exit(void)
+{
+	misc_deregister(&vhost_blk_misc);
+}
+
+module_init(vhost_blk_init);
+module_exit(vhost_blk_exit);
+
+MODULE_VERSION("0.0.2");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Asias He");
+MODULE_DESCRIPTION("Host kernel accelerator for virtio_blk");
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index cc046a9..1d4db7b 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -205,6 +205,11 @@ enum {
 				(1ULL << VHOST_F_LOG_ALL) |
 				(1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
 				(1ULL << VIRTIO_NET_F_MRG_RXBUF),
+
+	VHOST_BLK_FEATURES =	(1ULL << VIRTIO_F_NOTIFY_ON_EMPTY) |
+				(1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+				(1ULL << VIRTIO_RING_F_EVENT_IDX) |
+				(1ULL << VHOST_F_LOG_ALL),
 };
 
 static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
diff --git a/include/linux/vhost.h b/include/linux/vhost.h
index e847f1e..5869728 100644
--- a/include/linux/vhost.h
+++ b/include/linux/vhost.h
@@ -121,6 +121,9 @@ struct vhost_memory {
  * device.  This can be used to stop the ring (e.g. for migration). */
 #define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
 
+/* VHOST_BLK specific defines */
+#define VHOST_BLK_SET_BACKEND _IOW(VHOST_VIRTIO, 0x40, struct vhost_vring_file)
+
 /* Feature bits */
 /* Log all write descriptors. Can be changed while device is active. */
 #define VHOST_F_LOG_ALL 26
-- 
1.7.10.4


* 0xB16B00B5? Really? (was Re: Move hyperv out of the drivers/staging/ directory)
From: Paolo Bonzini @ 2012-07-13 10:23 UTC (permalink / raw)
  To: KY Srinivasan
  Cc: Greg KH, devel@linuxdriverproject.org,
	linux-kernel@vger.kernel.org, virtualization@lists.osdl.org
In-Reply-To: <20111004193414.GA15672@suse.de>

Il 04/10/2011 21:34, Greg KH ha scritto:
> diff --git a/drivers/staging/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> similarity index 99%
> rename from drivers/staging/hv/hyperv_vmbus.h
> rename to drivers/hv/hyperv_vmbus.h
> index 3d2d836..8261cb6 100644
> --- a/drivers/staging/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -28,8 +28,7 @@
>  #include <linux/list.h>
>  #include <asm/sync_bitops.h>
>  #include <linux/atomic.h>
> -
> -#include "hyperv.h"
> +#include <linux/hyperv.h>
>  
>  /*
>   * The below CPUID leaves are present if VersionAndFeatures.HypervisorPresent

git's rename detection snips away this gem:

+#define HV_LINUX_GUEST_ID_LO		0x00000000
+#define HV_LINUX_GUEST_ID_HI		0xB16B00B5
+#define HV_LINUX_GUEST_ID		(((u64)HV_LINUX_GUEST_ID_HI << 32) | \
+					   HV_LINUX_GUEST_ID_LO)
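
As a side note on how these macros combine: the (u64) cast has to happen
before the shift, because shifting a 32-bit value by 32 bits is undefined
in C. A minimal standalone sketch, using stdint types instead of the
kernel's u64; the GUEST_ID_* names and make_guest_id() are hypothetical
stand-ins for the macros above:

```c
#include <assert.h>
#include <stdint.h>

#define GUEST_ID_LO	0x00000000u
#define GUEST_ID_HI	0xB16B00B5u

/* Widen hi to 64 bits first, then shift it into the upper half. */
static uint64_t make_guest_id(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Self-check: the high word lands in bits 63..32, the low word in 31..0. */
static void demo_guest_id(void)
{
	uint64_t id = make_guest_id(GUEST_ID_HI, GUEST_ID_LO);

	assert(id == 0xB16B00B500000000ULL);
	assert((uint32_t)(id >> 32) == GUEST_ID_HI);
	assert((uint32_t)id == GUEST_ID_LO);
}
```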

Someone was trying to be funny, I guess.

KY, I suppose you have access to Hyper-V code or can ask someone who does.
Is this signature actually used in the Hyper-V host code?

Paolo


* [patch -next] tcm_vhost: another strlen() off by one
From: Dan Carpenter @ 2012-07-13 10:45 UTC (permalink / raw)
  To: Michael S. Tsirkin, Nicholas Bellinger
  Cc: kernel-janitors, kvm, virtualization

strlen() doesn't count the NUL terminator.  I missed this one in the
patches I sent yesterday.
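
To illustrate the off-by-one, here is a small standalone sketch. It
assumes the name ends up in a buffer of exactly TCM_VHOST_NAMELEN bytes;
NAMELEN and name_fits() below are hypothetical stand-ins, not code from
tcm_vhost.c:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NAMELEN 8	/* hypothetical size, stands in for TCM_VHOST_NAMELEN */

/* A name fits only if there is room for its characters plus the NUL. */
static bool name_fits(const char *name)
{
	return strlen(name) < NAMELEN;
}

/* Self-check: a 7-character name fills the buffer, 8 characters overflow. */
static void demo_namelen(void)
{
	char buf[NAMELEN];

	assert(name_fits("1234567"));	/* 7 chars + NUL = 8 bytes */
	assert(!name_fits("12345678"));	/* 8 chars + NUL = 9 bytes */

	strcpy(buf, "1234567");
	assert(buf[NAMELEN - 1] == '\0');
}
```

With the old "> TCM_VHOST_NAMELEN" check, the 8-character name in this
sketch would have been accepted even though it needs 9 bytes.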

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
index 29850cb..ea72198 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -1424,7 +1424,7 @@ static struct se_wwn *tcm_vhost_make_tport(
 	return ERR_PTR(-EINVAL);
 
 check_len:
-	if (strlen(name) > TCM_VHOST_NAMELEN) {
+	if (strlen(name) >= TCM_VHOST_NAMELEN) {
 		pr_err("Emulated %s Address: %s, exceeds"
 			" max: %d\n", name, tcm_vhost_dump_proto_id(tport),
 			TCM_VHOST_NAMELEN);


* RE: 0xB16B00B5? Really? (was Re: Move hyperv out of the drivers/staging/ directory)
From: KY Srinivasan @ 2012-07-13 13:13 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Greg KH, devel@linuxdriverproject.org,
	linux-kernel@vger.kernel.org, virtualization@lists.osdl.org
In-Reply-To: <4FFFF711.8040003@redhat.com>



> -----Original Message-----
> From: Paolo Bonzini [mailto:paolo.bonzini@gmail.com] On Behalf Of Paolo
> Bonzini
> Sent: Friday, July 13, 2012 6:23 AM
> To: KY Srinivasan
> Cc: Greg KH; devel@linuxdriverproject.org; linux-kernel@vger.kernel.org;
> virtualization@lists.osdl.org
> Subject: 0xB16B00B5? Really? (was Re: Move hyperv out of the drivers/staging/
> directory)
> 
> Il 04/10/2011 21:34, Greg KH ha scritto:
> > diff --git a/drivers/staging/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> > similarity index 99%
> > rename from drivers/staging/hv/hyperv_vmbus.h
> > rename to drivers/hv/hyperv_vmbus.h
> > index 3d2d836..8261cb6 100644
> > --- a/drivers/staging/hv/hyperv_vmbus.h
> > +++ b/drivers/hv/hyperv_vmbus.h
> > @@ -28,8 +28,7 @@
> >  #include <linux/list.h>
> >  #include <asm/sync_bitops.h>
> >  #include <linux/atomic.h>
> > -
> > -#include "hyperv.h"
> > +#include <linux/hyperv.h>
> >
> >  /*
> >   * The below CPUID leaves are present if VersionAndFeatures.HypervisorPresent
> 
> git's rename detection snips away this gem:
> 
> +#define HV_LINUX_GUEST_ID_LO		0x00000000
> +#define HV_LINUX_GUEST_ID_HI		0xB16B00B5
> +#define HV_LINUX_GUEST_ID		(((u64)HV_LINUX_GUEST_ID_HI << 32) | \
> +					   HV_LINUX_GUEST_ID_LO)
> 
> Someone was trying to be funny, I guess.
> 
> KY, I suppose you have access to Hyper-V code or can ask someone who does.
> Is this signature actually used in the Hyper-V host code?

It is still early in the morning here, so pardon me if I am not seeing the issue.
Could you elaborate on what you want changed and why? This is a guest
signature that is stashed away in the hypervisor and perhaps can be retrieved
by the host. Other than that, it is not used anywhere else. MSFT has defined a
namespace for guest IDs, and while some ranges are reserved for MSFT operating
systems, there really is nothing special about the guest ID.

Regards,

K. Y


* Re: 0xB16B00B5? Really? (was Re: Move hyperv out of the drivers/staging/ directory)
From: Paolo Bonzini @ 2012-07-13 13:15 UTC (permalink / raw)
  To: KY Srinivasan
  Cc: devel@linuxdriverproject.org, Greg KH,
	linux-kernel@vger.kernel.org, virtualization@lists.osdl.org
In-Reply-To: <426367E2313C2449837CD2DE46E7EAF9223B57EC@SN2PRD0310MB382.namprd03.prod.outlook.com>

Il 13/07/2012 15:13, KY Srinivasan ha scritto:
>> > 
>> > Someone was trying to be funny, I guess.
>> > 
>> > KY, I suppose you have access to Hyper-V code or can ask someone who does.
>> > Is this signature actually used in the Hyper-V host code?
> It is still early in the morning here and pardon me if I am not seeing the issue.

[offlist]

0xB16B00B5 = big boobs

Paolo


