kvm.vger.kernel.org archive mirror
* [PATCH 0/5] RFC: ->make_request support for virtio-blk
@ 2011-10-05 19:54 Christoph Hellwig
  2011-10-05 19:54 ` [PATCH 1/5] block: add bio_map_sg Christoph Hellwig
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Christoph Hellwig @ 2011-10-05 19:54 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel

This patchset allows the virtio-blk driver to support the much higher
IOPS rates that can be driven out of modern PCI-e flash devices.  At
this point it really is just an RFC due to various issues.

The first four patches are infrastructure that could go in fairly
soon as far as I'm concerned.  Patch 5 implements the actual ->make_request
support and still has a few issues, see there for more details.  With
it I can drive my PCI-e test devices to 85-90% of the native IOPS
and bandwidth, but be warned that this is still a fairly low-end setup
as far as expensive flash storage is concerned.

One big downside it has is that it currently exposes a nasty race
in the qemu virtqueue code - just running xfstests inside a guest
using the new virtio-blk driver (even on a slow device) will trigger
it and lead to a filesystem shutdown.  I've tracked it down to
data I/O segments getting overwritten with status s/g list entries,
but got lost at that point.  I can start a separate thread on it.

Besides that it is missing a few features, and we have to decide
how to select which mode to use in virtio-blk - either a module option,
a sysfs attribute, or something that the host communicates.  Or maybe
decide that just going with ->make_request alone is fine; even on
my cheap laptop SSD it actually is just as fast, if not slightly
faster, than the request-based variant.

There are a few other bottlenecks in virtio that this exposes.  The
first one is the low queue length of just 128 entries in the virtio-blk
queue - to drive higher IOPS with a deep queue we absolutely need
to increase that.

Comments welcome!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/5] block: add bio_map_sg
  2011-10-05 19:54 [PATCH 0/5] RFC: ->make_request support for virtio-blk Christoph Hellwig
@ 2011-10-05 19:54 ` Christoph Hellwig
  2011-10-05 22:51   ` Boaz Harrosh
  2011-10-05 19:54 ` [PATCH 2/5] virtio: support unlocked queue kick Christoph Hellwig
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2011-10-05 19:54 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel

[-- Attachment #1: block-add-bio_map_sg --]
[-- Type: text/plain, Size: 2915 bytes --]

Add a helper to map a bio to a scatterlist, modelled after blk_rq_map_sg.
This helper is useful for any driver that wants to create a scatterlist
from its ->make_request method.
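The clustering rule the helper applies can be modelled in plain user-space C.  This is only a sketch of the merge logic, not kernel code: it checks physical contiguity and the maximum segment size, and omits the segment-boundary check; all names and sizes are illustrative.

```c
#include <assert.h>

/* A bio vector, reduced to the two things the merge decision needs:
 * the physical address of the data and its length. */
struct vec {
	unsigned long paddr;
	unsigned int len;
};

/* Walk the vectors and merge each one into the previous segment when
 * it is physically contiguous with it and the merged segment stays
 * within max_seg_size; otherwise start a new segment.  Returns the
 * number of segments, with per-segment lengths in seg_len[]. */
static int map_vecs(const struct vec *v, int nvec,
		    unsigned int max_seg_size, unsigned int *seg_len)
{
	int nsegs = 0;
	int i;

	for (i = 0; i < nvec; i++) {
		if (nsegs &&
		    v[i - 1].paddr + v[i - 1].len == v[i].paddr &&
		    seg_len[nsegs - 1] + v[i].len <= max_seg_size) {
			/* mergeable: extend the current segment */
			seg_len[nsegs - 1] += v[i].len;
		} else {
			/* not mergeable: start a new segment */
			seg_len[nsegs++] = v[i].len;
		}
	}
	return nsegs;
}
```

With three 4k vectors of which only the first two are physically adjacent, the model produces two segments, which is the same shape of result bio_map_sg would hand to the driver.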

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/block/blk-merge.c
===================================================================
--- linux-2.6.orig/block/blk-merge.c	2011-10-04 11:49:32.857014742 -0400
+++ linux-2.6/block/blk-merge.c	2011-10-04 13:37:51.305630951 -0400
@@ -199,6 +199,69 @@ new_segment:
 }
 EXPORT_SYMBOL(blk_rq_map_sg);
 
+/*
+ * map a bio to a scatterlist, return number of sg entries setup. Caller
+ * must make sure sg can hold bio->bi_phys_segments entries
+ */
+int bio_map_sg(struct request_queue *q, struct bio *bio,
+		  struct scatterlist *sglist)
+{
+	struct bio_vec *bvec, *bvprv;
+	struct scatterlist *sg;
+	int nsegs, cluster;
+	unsigned long i;
+
+	nsegs = 0;
+	cluster = blk_queue_cluster(q);
+
+	bvprv = NULL;
+	sg = NULL;
+	bio_for_each_segment(bvec, bio, i) {
+		int nbytes = bvec->bv_len;
+
+		if (bvprv && cluster) {
+			if (sg->length + nbytes > queue_max_segment_size(q))
+				goto new_segment;
+
+			if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
+				goto new_segment;
+			if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
+				goto new_segment;
+
+			sg->length += nbytes;
+		} else {
+new_segment:
+			if (!sg)
+				sg = sglist;
+			else {
+				/*
+				 * If the driver previously mapped a shorter
+				 * list, we could see a termination bit
+				 * prematurely unless it fully inits the sg
+				 * table on each mapping. We KNOW that there
+				 * must be more entries here or the driver
+				 * would be buggy, so force clear the
+				 * termination bit to avoid doing a full
+				 * sg_init_table() in drivers for each command.
+				 */
+				sg->page_link &= ~0x02;
+				sg = sg_next(sg);
+			}
+
+			sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
+			nsegs++;
+		}
+		bvprv = bvec;
+	} /* segments in bio */
+
+	if (sg)
+		sg_mark_end(sg);
+
+	BUG_ON(bio->bi_phys_segments && nsegs > bio->bi_phys_segments);
+	return nsegs;
+}
+EXPORT_SYMBOL(bio_map_sg);
+
 static inline int ll_new_hw_segment(struct request_queue *q,
 				    struct request *req,
 				    struct bio *bio)
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2011-10-04 13:37:13.216148915 -0400
+++ linux-2.6/include/linux/blkdev.h	2011-10-04 13:37:51.317613617 -0400
@@ -854,6 +854,8 @@ extern void blk_queue_flush_queueable(st
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
 
 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
+extern int bio_map_sg(struct request_queue *q, struct bio *bio,
+		struct scatterlist *sglist);
 extern void blk_dump_rq_flags(struct request *, char *);
 extern long nr_blockdev_pages(void);
 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/5] virtio: support unlocked queue kick
  2011-10-05 19:54 [PATCH 0/5] RFC: ->make_request support for virtio-blk Christoph Hellwig
  2011-10-05 19:54 ` [PATCH 1/5] block: add bio_map_sg Christoph Hellwig
@ 2011-10-05 19:54 ` Christoph Hellwig
  2011-10-06  8:42   ` Stefan Hajnoczi
                     ` (2 more replies)
  2011-10-05 19:54 ` [PATCH 3/5] virtio-blk: remove the unused list of pending requests Christoph Hellwig
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 21+ messages in thread
From: Christoph Hellwig @ 2011-10-05 19:54 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel

[-- Attachment #1: unlocked-virtqueue-kick --]
[-- Type: text/plain, Size: 2565 bytes --]

Split virtqueue_kick to be able to do the actual notification outside the
lock protecting the virtqueue.  This patch was originally done by
Stefan Hajnoczi, but I can't find the original anymore and had to
recreate it from memory.  Pointers to the original or corrections for
the commit message are welcome.
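The shape of the split can be modelled in user-space C: the decision whether the other side needs a notification is made while the (conceptual) virtqueue lock is held, and the expensive notification itself - a vmexit for virtio - happens after the lock is dropped.  The event-index check mirrors vring_need_event(); the struct and counters below are purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* vring_need_event(): notify iff the index the other side asked to be
 * woken at falls inside the window of entries just published. */
static bool need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
	return (uint16_t)(new_idx - event_idx - 1) <
	       (uint16_t)(new_idx - old_idx);
}

struct model_vq {
	uint16_t avail_idx;	/* what the guest has published */
	uint16_t last_kicked;	/* avail_idx at the previous kick */
	uint16_t used_event;	/* index the host asked to be woken at */
	int notifications;	/* counts (models of) vmexits */
};

/* Under the lock: record the published state and decide if a kick is
 * needed - the cheap half of the old virtqueue_kick(). */
static bool kick_prepare(struct model_vq *vq)
{
	uint16_t old_idx = vq->last_kicked;
	uint16_t new_idx = vq->avail_idx;

	vq->last_kicked = new_idx;
	return need_event(vq->used_event, new_idx, old_idx);
}

/* Outside the lock: the expensive notification itself. */
static void notify(struct model_vq *vq)
{
	vq->notifications++;
}
```

A caller then does `if (kick_prepare(vq)) notify(vq);` with the lock dropped in between, which is exactly the pairing the patch exports.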

Index: linux-2.6/drivers/virtio/virtio_ring.c
===================================================================
--- linux-2.6.orig/drivers/virtio/virtio_ring.c	2011-09-15 15:28:55.891347016 +0200
+++ linux-2.6/drivers/virtio/virtio_ring.c	2011-10-03 18:45:32.492738431 +0200
@@ -237,9 +237,11 @@ add_head:
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_buf_gfp);
 
-void virtqueue_kick(struct virtqueue *_vq)
+bool virtqueue_kick_prepare(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
+	bool need_kick = false;
+
 	u16 new, old;
 	START_USE(vq);
 	/* Descriptors and available array need to be set before we expose the
@@ -253,15 +255,32 @@ void virtqueue_kick(struct virtqueue *_v
 	/* Need to update avail index before checking if we should notify */
 	virtio_mb();
 
-	if (vq->event ?
-	    vring_need_event(vring_avail_event(&vq->vring), new, old) :
-	    !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
-		/* Prod other side to tell it about changes. */
-		vq->notify(&vq->vq);
-
+	if (vq->event) {
+		if (vring_need_event(vring_avail_event(&vq->vring), new, old))
+			need_kick = true;
+	} else {
+		if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
+			need_kick = true;
+	}
 	END_USE(vq);
+	return need_kick;
+}
+EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
+
+void virtqueue_notify(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	/* Prod other side to tell it about changes. */
+	vq->notify(_vq);
+}
+EXPORT_SYMBOL_GPL(virtqueue_notify);
+
+void virtqueue_kick(struct virtqueue *vq)
+{
+	if (virtqueue_kick_prepare(vq))
+		virtqueue_notify(vq);
 }
-EXPORT_SYMBOL_GPL(virtqueue_kick);
 
 static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
 {
Index: linux-2.6/include/linux/virtio.h
===================================================================
--- linux-2.6.orig/include/linux/virtio.h	2011-09-15 15:28:57.078857804 +0200
+++ linux-2.6/include/linux/virtio.h	2011-10-03 18:41:07.309766531 +0200
@@ -86,6 +86,8 @@ static inline int virtqueue_add_buf(stru
 }
 
 void virtqueue_kick(struct virtqueue *vq);
+bool virtqueue_kick_prepare(struct virtqueue *vq);
+void virtqueue_notify(struct virtqueue *vq);
 
 void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 3/5] virtio-blk: remove the unused list of pending requests
  2011-10-05 19:54 [PATCH 0/5] RFC: ->make_request support for virtio-blk Christoph Hellwig
  2011-10-05 19:54 ` [PATCH 1/5] block: add bio_map_sg Christoph Hellwig
  2011-10-05 19:54 ` [PATCH 2/5] virtio: support unlocked queue kick Christoph Hellwig
@ 2011-10-05 19:54 ` Christoph Hellwig
  2011-10-05 19:54 ` [PATCH 4/5] virtio-blk: reimplement the serial attribute without using requests Christoph Hellwig
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2011-10-05 19:54 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel

[-- Attachment #1: virtio-blk-remove-list --]
[-- Type: text/plain, Size: 1530 bytes --]

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/drivers/block/virtio_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/virtio_blk.c	2011-10-03 19:55:29.061215040 +0200
+++ linux-2.6/drivers/block/virtio_blk.c	2011-10-03 19:55:54.196412176 +0200
@@ -24,9 +24,6 @@ struct virtio_blk
 	/* The disk structure for the kernel. */
 	struct gendisk *disk;
 
-	/* Request tracking. */
-	struct list_head reqs;
-
 	mempool_t *pool;
 
 	/* Process context for config space updates */
@@ -41,7 +38,6 @@ struct virtio_blk
 
 struct virtblk_req
 {
-	struct list_head list;
 	struct request *req;
 	struct virtio_blk_outhdr out_hdr;
 	struct virtio_scsi_inhdr in_hdr;
@@ -85,7 +81,6 @@ static void blk_done(struct virtqueue *v
 		}
 
 		__blk_end_request_all(vbr->req, error);
-		list_del(&vbr->list);
 		mempool_free(vbr, vblk->pool);
 	}
 	/* In case queue is stopped waiting for more buffers. */
@@ -170,7 +165,6 @@ static bool do_req(struct request_queue
 		return false;
 	}
 
-	list_add_tail(&vbr->list, &vblk->reqs);
 	return true;
 }
 
@@ -368,7 +362,6 @@ static int __devinit virtblk_probe(struc
 		goto out;
 	}
 
-	INIT_LIST_HEAD(&vblk->reqs);
 	spin_lock_init(&vblk->lock);
 	vblk->vdev = vdev;
 	vblk->sg_elems = sg_elems;
@@ -526,9 +519,6 @@ static void __devexit virtblk_remove(str
 
 	flush_work(&vblk->config_work);
 
-	/* Nothing should be pending. */
-	BUG_ON(!list_empty(&vblk->reqs));
-
 	/* Stop all the virtqueues. */
 	vdev->config->reset(vdev);
 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 4/5] virtio-blk: reimplement the serial attribute without using requests
  2011-10-05 19:54 [PATCH 0/5] RFC: ->make_request support for virtio-blk Christoph Hellwig
                   ` (2 preceding siblings ...)
  2011-10-05 19:54 ` [PATCH 3/5] virtio-blk: remove the unused list of pending requests Christoph Hellwig
@ 2011-10-05 19:54 ` Christoph Hellwig
  2011-10-05 19:54 ` [PATCH 5/5] virtio-blk: implement ->make_request Christoph Hellwig
  2011-10-05 20:31 ` [PATCH 0/5] RFC: ->make_request support for virtio-blk Vivek Goyal
  5 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2011-10-05 19:54 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel

[-- Attachment #1: virtio-blk-reimplement-serial --]
[-- Type: text/plain, Size: 6366 bytes --]

If we want to do bio-based I/O in virtio-blk we have to implement reading
the serial attribute ourselves.  Do that and also prepare struct virtblk_req
for dealing with different types of requests.
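The status-to-errno translation this patch factors out as virtblk_result() is small enough to model directly.  The sketch below is plain user-space C; the VIRTIO_BLK_S_* values are the real one-byte status codes the host writes back, while the function name is just illustrative.

```c
#include <assert.h>
#include <errno.h>

/* Status codes written by the host into virtblk_req.status. */
#define VIRTIO_BLK_S_OK		0
#define VIRTIO_BLK_S_IOERR	1
#define VIRTIO_BLK_S_UNSUPP	2

/* Map the device status byte to a kernel-style error code, as the
 * factored-out virtblk_result() helper does. */
static int virtblk_result_model(unsigned char status)
{
	switch (status) {
	case VIRTIO_BLK_S_OK:
		return 0;
	case VIRTIO_BLK_S_UNSUPP:
		return -ENOTTY;
	default:
		return -EIO;
	}
}
```

Any status the driver does not recognize collapses to -EIO, so a misbehaving host cannot make the guest report success.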

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/drivers/block/virtio_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/virtio_blk.c	2011-10-03 20:32:12.997713070 +0200
+++ linux-2.6/drivers/block/virtio_blk.c	2011-10-03 20:37:28.836714193 +0200
@@ -38,12 +38,42 @@ struct virtio_blk
 
 struct virtblk_req
 {
-	struct request *req;
+	void *private;
 	struct virtio_blk_outhdr out_hdr;
 	struct virtio_scsi_inhdr in_hdr;
+	u8 kind;
+#define VIRTIO_BLK_REQUEST	0x00
+#define VIRTIO_BLK_INTERNAL	0x01
 	u8 status;
 };
 
+static inline int virtblk_result(struct virtblk_req *vbr)
+{
+	switch (vbr->status) {
+	case VIRTIO_BLK_S_OK:
+		return 0;
+	case VIRTIO_BLK_S_UNSUPP:
+		return -ENOTTY;
+	default:
+		return -EIO;
+	}
+}
+
+static void virtblk_request_done(struct virtio_blk *vblk,
+		struct virtblk_req *vbr)
+{
+	struct request *req = vbr->private;
+
+	if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
+		req->resid_len = vbr->in_hdr.residual;
+		req->sense_len = vbr->in_hdr.sense_len;
+		req->errors = vbr->in_hdr.errors;
+	}
+
+	__blk_end_request_all(req, virtblk_result(vbr));
+	mempool_free(vbr, vblk->pool);
+}
+
 static void blk_done(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
@@ -53,35 +83,16 @@ static void blk_done(struct virtqueue *v
 
 	spin_lock_irqsave(&vblk->lock, flags);
 	while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
-		int error;
-
-		switch (vbr->status) {
-		case VIRTIO_BLK_S_OK:
-			error = 0;
-			break;
-		case VIRTIO_BLK_S_UNSUPP:
-			error = -ENOTTY;
-			break;
-		default:
-			error = -EIO;
-			break;
-		}
-
-		switch (vbr->req->cmd_type) {
-		case REQ_TYPE_BLOCK_PC:
-			vbr->req->resid_len = vbr->in_hdr.residual;
-			vbr->req->sense_len = vbr->in_hdr.sense_len;
-			vbr->req->errors = vbr->in_hdr.errors;
+		switch (vbr->kind) {
+		case VIRTIO_BLK_REQUEST:
+			virtblk_request_done(vblk, vbr);
 			break;
-		case REQ_TYPE_SPECIAL:
-			vbr->req->errors = (error != 0);
+		case VIRTIO_BLK_INTERNAL:
+			complete(vbr->private);
 			break;
 		default:
-			break;
+			BUG();
 		}
-
-		__blk_end_request_all(vbr->req, error);
-		mempool_free(vbr, vblk->pool);
 	}
 	/* In case queue is stopped waiting for more buffers. */
 	blk_start_queue(vblk->disk->queue);
@@ -99,28 +110,24 @@ static bool do_req(struct request_queue
 		/* When another request finishes we'll try again. */
 		return false;
 
-	vbr->req = req;
+	vbr->private = req;
+	vbr->kind = VIRTIO_BLK_REQUEST;
 
 	if (req->cmd_flags & REQ_FLUSH) {
 		vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
 		vbr->out_hdr.sector = 0;
-		vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+		vbr->out_hdr.ioprio = req_get_ioprio(req);
 	} else {
 		switch (req->cmd_type) {
 		case REQ_TYPE_FS:
 			vbr->out_hdr.type = 0;
-			vbr->out_hdr.sector = blk_rq_pos(vbr->req);
-			vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+			vbr->out_hdr.sector = blk_rq_pos(req);
+			vbr->out_hdr.ioprio = req_get_ioprio(req);
 			break;
 		case REQ_TYPE_BLOCK_PC:
 			vbr->out_hdr.type = VIRTIO_BLK_T_SCSI_CMD;
 			vbr->out_hdr.sector = 0;
-			vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
-			break;
-		case REQ_TYPE_SPECIAL:
-			vbr->out_hdr.type = VIRTIO_BLK_T_GET_ID;
-			vbr->out_hdr.sector = 0;
-			vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+			vbr->out_hdr.ioprio = req_get_ioprio(req);
 			break;
 		default:
 			/* We don't put anything else in the queue. */
@@ -136,13 +143,14 @@ static bool do_req(struct request_queue
 	 * block, and before the normal inhdr we put the sense data and the
 	 * inhdr with additional status information before the normal inhdr.
 	 */
-	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC)
-		sg_set_buf(&vblk->sg[out++], vbr->req->cmd, vbr->req->cmd_len);
+	if (req->cmd_type == REQ_TYPE_BLOCK_PC)
+		sg_set_buf(&vblk->sg[out++], req->cmd, req->cmd_len);
 
-	num = blk_rq_map_sg(q, vbr->req, vblk->sg + out);
+	num = blk_rq_map_sg(q, req, vblk->sg + out);
 
-	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
-		sg_set_buf(&vblk->sg[num + out + in++], vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
+	if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
+		sg_set_buf(&vblk->sg[num + out + in++], req->sense,
+			   SCSI_SENSE_BUFFERSIZE);
 		sg_set_buf(&vblk->sg[num + out + in++], &vbr->in_hdr,
 			   sizeof(vbr->in_hdr));
 	}
@@ -151,7 +159,7 @@ static bool do_req(struct request_queue
 		   sizeof(vbr->status));
 
 	if (num) {
-		if (rq_data_dir(vbr->req) == WRITE) {
+		if (rq_data_dir(req) == WRITE) {
 			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
 			out += num;
 		} else {
@@ -196,26 +204,39 @@ static void do_virtblk_request(struct re
 static int virtblk_get_id(struct gendisk *disk, char *id_str)
 {
 	struct virtio_blk *vblk = disk->private_data;
-	struct request *req;
-	struct bio *bio;
-	int err;
-
-	bio = bio_map_kern(vblk->disk->queue, id_str, VIRTIO_BLK_ID_BYTES,
-			   GFP_KERNEL);
-	if (IS_ERR(bio))
-		return PTR_ERR(bio);
-
-	req = blk_make_request(vblk->disk->queue, bio, GFP_KERNEL);
-	if (IS_ERR(req)) {
-		bio_put(bio);
-		return PTR_ERR(req);
-	}
-
-	req->cmd_type = REQ_TYPE_SPECIAL;
-	err = blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
-	blk_put_request(req);
+	struct virtblk_req *vbr;
+	DECLARE_COMPLETION_ONSTACK(done);
+	int error;
 
-	return err;
+	vbr = kmalloc(sizeof(*vbr), GFP_KERNEL);
+	if (!vbr)
+		return -ENOMEM;
+	vbr->private = &done;
+	vbr->kind = VIRTIO_BLK_INTERNAL;
+
+	vbr->out_hdr.type = VIRTIO_BLK_T_GET_ID | VIRTIO_BLK_T_IN;
+	vbr->out_hdr.sector = 0;
+	vbr->out_hdr.ioprio = 0;
+
+	sg_set_buf(&vblk->sg[0], &vbr->out_hdr, sizeof(vbr->out_hdr));
+	sg_set_buf(&vblk->sg[1], id_str, VIRTIO_BLK_ID_BYTES);
+	sg_set_buf(&vblk->sg[2], &vbr->status, sizeof(vbr->status));
+
+	spin_lock_irq(&vblk->lock);
+	if (virtqueue_add_buf(vblk->vq, vblk->sg, 1, 2, vbr) < 0) {
+		spin_unlock_irq(&vblk->lock);
+		/* XXX: eventually wait for free space */
+		error = -EBUSY;
+		goto out_free;
+	}
+	virtqueue_kick(vblk->vq);
+	spin_unlock_irq(&vblk->lock);
+
+	wait_for_completion(&done);
+	error = virtblk_result(vbr);
+out_free:
+	kfree(vbr);
+	return error;
 }
 
 static int virtblk_ioctl(struct block_device *bdev, fmode_t mode,

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 5/5] virtio-blk: implement ->make_request
  2011-10-05 19:54 [PATCH 0/5] RFC: ->make_request support for virtio-blk Christoph Hellwig
                   ` (3 preceding siblings ...)
  2011-10-05 19:54 ` [PATCH 4/5] virtio-blk: reimplement the serial attribute without using requests Christoph Hellwig
@ 2011-10-05 19:54 ` Christoph Hellwig
  2011-10-06  1:52   ` Rusty Russell
  2011-10-06 13:53   ` Jens Axboe
  2011-10-05 20:31 ` [PATCH 0/5] RFC: ->make_request support for virtio-blk Vivek Goyal
  5 siblings, 2 replies; 21+ messages in thread
From: Christoph Hellwig @ 2011-10-05 19:54 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel

[-- Attachment #1: virtio-blk-add-make_request-support --]
[-- Type: text/plain, Size: 7864 bytes --]

Add an alternate I/O path that implements ->make_request for virtio-blk.
This is required for high-IOPS devices, which get slowed down to 1/5th of
the native speed by all the locking, memory allocation and other overhead
in the request-based I/O path.

This patch is not quite merge-ready due to two issues:

 - it doesn't implement FUA and FLUSH requests yet
 - it hardcodes which I/O path to choose
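One pattern this patch relies on in its completion handler is draining finished requests from the (locked) virtqueue into a local singly-linked list, then running the per-request completion work only after the lock has been dropped.  A user-space model of that two-phase drain, with purely illustrative names:

```c
#include <assert.h>
#include <stddef.h>

/* A completed request, as it would sit on the local drain list. */
struct creq {
	struct creq *next;
	int done;
};

/* Phase 1 - under the lock: unlink everything from the queue and hand
 * it back as a local list, so the lock can be dropped immediately. */
static struct creq *drain_locked(struct creq **queue)
{
	struct creq *head = *queue;

	*queue = NULL;
	return head;
}

/* Phase 2 - outside the lock: complete each request.  Setting ->done
 * stands in for bio_endio()/complete().  Returns the number drained. */
static int complete_all(struct creq *head)
{
	int n = 0;

	while (head) {
		struct creq *vbr = head;

		head = head->next;
		vbr->done = 1;
		n++;
	}
	return n;
}
```

Keeping the completion callbacks off the lock matters here because bio_endio() can re-enter ->make_request (stacking drivers), which would otherwise deadlock on the virtqueue lock.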

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/drivers/block/virtio_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/virtio_blk.c	2011-10-05 10:36:42.883913334 -0400
+++ linux-2.6/drivers/block/virtio_blk.c	2011-10-05 15:29:35.591405323 -0400
@@ -11,6 +11,8 @@
 
 #define PART_BITS 4
 
+static int use_make_request = 1;
+
 static int major, index;
 struct workqueue_struct *virtblk_wq;
 
@@ -20,6 +22,7 @@ struct virtio_blk
 
 	struct virtio_device *vdev;
 	struct virtqueue *vq;
+	wait_queue_head_t queue_wait;
 
 	/* The disk structure for the kernel. */
 	struct gendisk *disk;
@@ -39,11 +42,13 @@ struct virtio_blk
 struct virtblk_req
 {
 	void *private;
+	struct virtblk_req *next;
 	struct virtio_blk_outhdr out_hdr;
 	struct virtio_scsi_inhdr in_hdr;
 	u8 kind;
 #define VIRTIO_BLK_REQUEST	0x00
-#define VIRTIO_BLK_INTERNAL	0x01
+#define VIRTIO_BLK_BIO		0x01
+#define VIRTIO_BLK_INTERNAL	0x02
 	u8 status;
 };
 
@@ -74,10 +79,17 @@ static void virtblk_request_done(struct
 	mempool_free(vbr, vblk->pool);
 }
 
+static void virtblk_bio_done(struct virtio_blk *vblk,
+		struct virtblk_req *vbr)
+{
+	bio_endio(vbr->private, virtblk_result(vbr));
+	mempool_free(vbr, vblk->pool);
+}
+
 static void blk_done(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
-	struct virtblk_req *vbr;
+	struct virtblk_req *vbr, *head = NULL, *tail = NULL;
 	unsigned int len;
 	unsigned long flags;
 
@@ -88,15 +100,47 @@ static void blk_done(struct virtqueue *v
 			virtblk_request_done(vblk, vbr);
 			break;
 		case VIRTIO_BLK_INTERNAL:
-			complete(vbr->private);
+		case VIRTIO_BLK_BIO:
+			if (head) {
+				tail->next = vbr;
+				tail = vbr;
+			} else {
+				tail = head = vbr;
+			}
 			break;
 		default:
 			BUG();
 		}
 	}
-	/* In case queue is stopped waiting for more buffers. */
-	blk_start_queue(vblk->disk->queue);
+
+	if (!use_make_request) {
+		/* In case queue is stopped waiting for more buffers. */
+		blk_start_queue(vblk->disk->queue);
+	}
 	spin_unlock_irqrestore(&vblk->lock, flags);
+
+	wake_up(&vblk->queue_wait);
+
+	/*
+	 * Process completions after freeing up space in the virtqueue and
+	 * dropping the lock.
+	 */
+	while (head) {
+		vbr = head;
+		head = head->next;
+
+		switch (vbr->kind) {
+		case VIRTIO_BLK_BIO:
+			virtblk_bio_done(vblk, vbr);
+			break;
+		case VIRTIO_BLK_INTERNAL:
+			complete(vbr->private);
+			break;
+		default:
+			BUG();
+		}
+
+	}
 }
 
 static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
@@ -111,6 +155,7 @@ static bool do_req(struct request_queue
 		return false;
 
 	vbr->private = req;
+	vbr->next = NULL;
 	vbr->kind = VIRTIO_BLK_REQUEST;
 
 	if (req->cmd_flags & REQ_FLUSH) {
@@ -199,6 +244,128 @@ static void do_virtblk_request(struct re
 		virtqueue_kick(vblk->vq);
 }
 
+struct virtblk_plug_cb {
+	struct blk_plug_cb cb;
+	struct virtio_blk *vblk;
+};
+
+static void virtblk_unplug(struct blk_plug_cb *bcb)
+{
+	struct virtblk_plug_cb *cb =
+		container_of(bcb, struct virtblk_plug_cb, cb);
+
+	virtqueue_notify(cb->vblk->vq);
+	kfree(cb);
+}
+
+static bool virtblk_plugged(struct virtio_blk *vblk)
+{
+	struct blk_plug *plug = current->plug;
+	struct virtblk_plug_cb *cb;
+
+	if (!plug)
+		return false;
+
+	list_for_each_entry(cb, &plug->cb_list, cb.list) {
+		if (cb->cb.callback == virtblk_unplug && cb->vblk == vblk)
+			return true;
+	}
+
+	/* Not currently on the callback list */
+	cb = kmalloc(sizeof(*cb), GFP_ATOMIC);
+	if (!cb)
+		return false;
+
+	cb->vblk = vblk;
+	cb->cb.callback = virtblk_unplug;
+	list_add(&cb->cb.list, &plug->cb_list);
+	return true;
+}
+
+static void virtblk_add_buf_wait(struct virtio_blk *vblk,
+	struct virtblk_req *vbr, unsigned long out, unsigned long in)
+{
+	DEFINE_WAIT(wait);
+	bool retry, notify;
+
+	for (;;) {
+		prepare_to_wait(&vblk->queue_wait, &wait,
+				TASK_UNINTERRUPTIBLE);
+
+		spin_lock_irq(&vblk->lock);
+		if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr) < 0) {
+			retry = true;
+		} else {
+			retry = false;
+		}
+		notify = virtqueue_kick_prepare(vblk->vq);
+		spin_unlock_irq(&vblk->lock);
+
+		if (notify)
+			virtqueue_notify(vblk->vq);
+
+		if (!retry)
+			break;
+		schedule();
+	}
+	finish_wait(&vblk->queue_wait, &wait);
+}
+
+static int virtblk_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct virtio_blk *vblk = q->queuedata;
+	unsigned long num, out = 0, in = 0;
+	struct virtblk_req *vbr;
+	bool retry, notify;
+
+	BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
+	BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
+
+	vbr = mempool_alloc(vblk->pool, GFP_NOIO);
+
+	vbr->private = bio;
+	vbr->next = NULL;
+	vbr->kind = VIRTIO_BLK_BIO;
+
+	vbr->out_hdr.type = 0;
+	vbr->out_hdr.sector = bio->bi_sector;
+	vbr->out_hdr.ioprio = bio_prio(bio);
+
+	sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
+
+	num = bio_map_sg(q, bio, vblk->sg + out);
+
+	sg_set_buf(&vblk->sg[num + out + in++], &vbr->status,
+		   sizeof(vbr->status));
+
+	if (num) {
+		if (bio->bi_rw & REQ_WRITE) {
+			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
+			out += num;
+		} else {
+			vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
+			in += num;
+		}
+	}
+
+	spin_lock_irq(&vblk->lock);
+	if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr) < 0) {
+		retry = true;
+	} else {
+		retry = false;
+	}
+
+	notify = virtqueue_kick_prepare(vblk->vq);
+	spin_unlock_irq(&vblk->lock);
+
+	if (notify && !virtblk_plugged(vblk))
+		virtqueue_notify(vblk->vq);
+
+	if (retry)
+		virtblk_add_buf_wait(vblk, vbr, out, in);
+	return 0;
+}
+
 /* return id (s/n) string for *disk to *id_str
  */
 static int virtblk_get_id(struct gendisk *disk, char *id_str)
@@ -212,6 +379,7 @@ static int virtblk_get_id(struct gendisk
 	if (!vbr)
 		return -ENOMEM;
 	vbr->private = &done;
+	vbr->next = NULL;
 	vbr->kind = VIRTIO_BLK_INTERNAL;
 
 	vbr->out_hdr.type = VIRTIO_BLK_T_GET_ID | VIRTIO_BLK_T_IN;
@@ -248,7 +416,8 @@ static int virtblk_ioctl(struct block_de
 	/*
 	 * Only allow the generic SCSI ioctls if the host can support it.
 	 */
-	if (!virtio_has_feature(vblk->vdev, VIRTIO_BLK_F_SCSI))
+	if (!virtio_has_feature(vblk->vdev, VIRTIO_BLK_F_SCSI) &&
+	    !use_make_request)
 		return -ENOTTY;
 
 	return scsi_cmd_ioctl(disk->queue, disk, mode, cmd,
@@ -383,6 +552,7 @@ static int __devinit virtblk_probe(struc
 		goto out;
 	}
 
+	init_waitqueue_head(&vblk->queue_wait);
 	spin_lock_init(&vblk->lock);
 	vblk->vdev = vdev;
 	vblk->sg_elems = sg_elems;
@@ -409,10 +579,20 @@ static int __devinit virtblk_probe(struc
 		goto out_mempool;
 	}
 
-	q = vblk->disk->queue = blk_init_queue(do_virtblk_request, &vblk->lock);
-	if (!q) {
-		err = -ENOMEM;
-		goto out_put_disk;
+	if (use_make_request) {
+		q = vblk->disk->queue = blk_alloc_queue(GFP_KERNEL);
+		if (!q) {
+			err = -ENOMEM;
+			goto out_put_disk;
+		}
+		blk_queue_make_request(q, virtblk_make_request);
+		printk(KERN_INFO "virtio-blk: using bios directly\n");
+	} else {
+		q = vblk->disk->queue = blk_init_queue(do_virtblk_request, &vblk->lock);
+		if (!q) {
+			err = -ENOMEM;
+			goto out_put_disk;
+		}
 	}
 
 	q->queuedata = vblk;
@@ -438,7 +618,7 @@ static int __devinit virtblk_probe(struc
 	index++;
 
 	/* configure queue flush support */
-	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
+	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH) && !use_make_request)
 		blk_queue_flush(q, REQ_FLUSH);
 
 	/* If disk is read-only in the host, the guest should obey */


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] RFC: ->make_request support for virtio-blk
  2011-10-05 19:54 [PATCH 0/5] RFC: ->make_request support for virtio-blk Christoph Hellwig
                   ` (4 preceding siblings ...)
  2011-10-05 19:54 ` [PATCH 5/5] virtio-blk: implement ->make_request Christoph Hellwig
@ 2011-10-05 20:31 ` Vivek Goyal
  2011-10-05 21:53   ` Christoph Hellwig
  5 siblings, 1 reply; 21+ messages in thread
From: Vivek Goyal @ 2011-10-05 20:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Rusty Russell, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel

On Wed, Oct 05, 2011 at 03:54:03PM -0400, Christoph Hellwig wrote:
> This patchset allows the virtio-blk driver to support much higher IOP
> rates which can be driven out of modern PCI-e flash devices.  At this
> point it really is just a RFC due to various issues.
> 

So you no longer believe that request queue overhead can be brought
down to manageable levels for these fast devices, and instead go for
bio-based drivers and give up on merging and implement your own FLUSH/FUA
machinery.

Not that I am advocating for continuing with the request-based driver.  Just
curious...

Thanks
Vivek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] RFC: ->make_request support for virtio-blk
  2011-10-05 20:31 ` [PATCH 0/5] RFC: ->make_request support for virtio-blk Vivek Goyal
@ 2011-10-05 21:53   ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2011-10-05 21:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Christoph Hellwig, Rusty Russell, Chris Wright, Jens Axboe,
	Stefan Hajnoczi, kvm, linux-kernel

On Wed, Oct 05, 2011 at 04:31:16PM -0400, Vivek Goyal wrote:
> So you no longer believe that request queue overhead can be brought
> down to manageable levels for these fast devices, and instead go for
> bio-based drivers and give up on merging and implement your own
> FLUSH/FUA machinery.

Not in a reasonable time frame. Having a common interface is still the
grand plan, but the virtio performance issues are hurting people so
badly that we need a fix soon.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/5] block: add bio_map_sg
  2011-10-05 19:54 ` [PATCH 1/5] block: add bio_map_sg Christoph Hellwig
@ 2011-10-05 22:51   ` Boaz Harrosh
  2011-10-06 13:40     ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Boaz Harrosh @ 2011-10-05 22:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Rusty Russell, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel

On 10/05/2011 09:54 PM, Christoph Hellwig wrote:
> Add a helper to map a bio to a scatterlist, modelled after blk_rq_map_sg.
> This helper is useful for any driver that wants to create a scatterlist
> from its ->make_request method.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 

I have some questions.

- Could we later use this bio_map_sg() to implement blk_rq_map_sg() and
  remove some duplicated code?

- Don't you need to support a chained bio (bio->next != NULL)?  Because
  I did not see any looping in the last patch,
	[PATCH 5/5] virtio-blk: implement ->make_request
  Or is it that ->make_request() takes a single bio at a time?
  If so, could we benefit from both bio-chaining and sg-chaining to
  build bigger I/Os?

Thanks
Boaz

> Index: linux-2.6/block/blk-merge.c
> ===================================================================
> --- linux-2.6.orig/block/blk-merge.c	2011-10-04 11:49:32.857014742 -0400
> +++ linux-2.6/block/blk-merge.c	2011-10-04 13:37:51.305630951 -0400
> @@ -199,6 +199,69 @@ new_segment:
>  }
>  EXPORT_SYMBOL(blk_rq_map_sg);
>  
> +/*
> + * map a bio to a scatterlist, return number of sg entries setup. Caller
> + * must make sure sg can hold bio->bi_phys_segments entries
> + */
> +int bio_map_sg(struct request_queue *q, struct bio *bio,
> +		  struct scatterlist *sglist)
> +{
> +	struct bio_vec *bvec, *bvprv;
> +	struct scatterlist *sg;
> +	int nsegs, cluster;
> +	unsigned long i;
> +
> +	nsegs = 0;
> +	cluster = blk_queue_cluster(q);
> +
> +	bvprv = NULL;
> +	sg = NULL;
> +	bio_for_each_segment(bvec, bio, i) {
> +		int nbytes = bvec->bv_len;
> +
> +		if (bvprv && cluster) {
> +			if (sg->length + nbytes > queue_max_segment_size(q))
> +				goto new_segment;
> +
> +			if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
> +				goto new_segment;
> +			if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
> +				goto new_segment;
> +
> +			sg->length += nbytes;
> +		} else {
> +new_segment:
> +			if (!sg)
> +				sg = sglist;
> +			else {
> +				/*
> +				 * If the driver previously mapped a shorter
> +				 * list, we could see a termination bit
> +				 * prematurely unless it fully inits the sg
> +				 * table on each mapping. We KNOW that there
> +				 * must be more entries here or the driver
> +				 * would be buggy, so force clear the
> +				 * termination bit to avoid doing a full
> +				 * sg_init_table() in drivers for each command.
> +				 */
> +				sg->page_link &= ~0x02;
> +				sg = sg_next(sg);
> +			}
> +
> +			sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
> +			nsegs++;
> +		}
> +		bvprv = bvec;
> +	} /* segments in bio */
> +
> +	if (sg)
> +		sg_mark_end(sg);
> +
> +	BUG_ON(bio->bi_phys_segments && nsegs > bio->bi_phys_segments);
> +	return nsegs;
> +}
> +EXPORT_SYMBOL(bio_map_sg);
> +
>  static inline int ll_new_hw_segment(struct request_queue *q,
>  				    struct request *req,
>  				    struct bio *bio)
> Index: linux-2.6/include/linux/blkdev.h
> ===================================================================
> --- linux-2.6.orig/include/linux/blkdev.h	2011-10-04 13:37:13.216148915 -0400
> +++ linux-2.6/include/linux/blkdev.h	2011-10-04 13:37:51.317613617 -0400
> @@ -854,6 +854,8 @@ extern void blk_queue_flush_queueable(st
>  extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
>  
>  extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
> +extern int bio_map_sg(struct request_queue *q, struct bio *bio,
> +		struct scatterlist *sglist);
>  extern void blk_dump_rq_flags(struct request *, char *);
>  extern long nr_blockdev_pages(void);
>  
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] virtio-blk: implement ->make_request
  2011-10-05 19:54 ` [PATCH 5/5] virtio-blk: implement ->make_request Christoph Hellwig
@ 2011-10-06  1:52   ` Rusty Russell
  2011-10-06 13:42     ` Christoph Hellwig
  2011-10-06 13:53   ` Jens Axboe
  1 sibling, 1 reply; 21+ messages in thread
From: Rusty Russell @ 2011-10-06  1:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel

On Wed, 05 Oct 2011 15:54:08 -0400, Christoph Hellwig <hch@infradead.org> wrote:
> Add an alternate I/O path that implements ->make_request for virtio-blk.
> This is required for high IOPs devices which get slowed down to 1/5th of
> the native speed by all the locking, memory allocation and other overhead
> in the request based I/O path.

Ouch.

I'd be tempted to just switch across to this, though I'd be interested
to see if the simple add_buf change I referred to before has some effect
by itself (I doubt it).

Also, though it's overkill I'd use standard list primitives rather than
open-coding a single linked list.

Thanks!
Rusty.


* Re: [PATCH 2/5] virtio: support unlocked queue kick
  2011-10-05 19:54 ` [PATCH 2/5] virtio: support unlocked queue kick Christoph Hellwig
@ 2011-10-06  8:42   ` Stefan Hajnoczi
       [not found]   ` <87r52qgaf3.fsf@rustcorp.com.au>
  2011-11-03  6:45   ` Minchan Kim
  2 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2011-10-06  8:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Rusty Russell, Chris Wright, Jens Axboe, kvm, linux-kernel

On Wed, Oct 05, 2011 at 03:54:05PM -0400, Christoph Hellwig wrote:
> Split virtqueue_kick to be able to do the actual notification outside the
> lock protecting the virtqueue.  This patch was originally done by
> Stefan Hajnoczi, but I can't find the original one anymore and had to
> recreated it from memory.  Pointers to the original or corrections for
> the commit message are welcome.

Here is the patch:
https://github.com/stefanha/linux/commit/a6d06644e3a58e57a774e77d7dc34c4a5a2e7496

Or as an email if you want to track it down in your inbox:
http://www.spinics.net/lists/linux-virtualization/msg14616.html

The code you posted is equivalent; the only additional thing my patch
did was to update doc comments in include/linux/virtio.h.  Feel free to
grab my patch or just the hunk.

Stefan


* Re: [PATCH 2/5] virtio: support unlocked queue kick
       [not found]   ` <87r52qgaf3.fsf@rustcorp.com.au>
@ 2011-10-06 13:19     ` Michael S. Tsirkin
  2011-11-01 14:40       ` Michael S. Tsirkin
  0 siblings, 1 reply; 21+ messages in thread
From: Michael S. Tsirkin @ 2011-10-06 13:19 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Hellwig, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel

On Thu, Oct 06, 2011 at 12:15:36PM +1030, Rusty Russell wrote:
> On Wed, 05 Oct 2011 15:54:05 -0400, Christoph Hellwig <hch@infradead.org> wrote:
> > Split virtqueue_kick to be able to do the actual notification outside the
> > lock protecting the virtqueue.  This patch was originally done by
> > Stefan Hajnoczi, but I can't find the original one anymore and had to
> > recreate it from memory.  Pointers to the original or corrections for
> > the commit message are welcome.
> 
> An alternative to this is to update the ring on every virtqueue_add_buf.
> MST discussed this for virtio_net, and I think it's better in general.
> 
> The only reason that it wasn't written that way originally is that the
> barriers make add_buf slightly more expensive.
> 
> Cheers,
> Rusty.

With event index, I'm not sure that's enough to make the kick lockless
anymore.


-- 
MST


* Re: [PATCH 1/5] block: add bio_map_sg
  2011-10-05 22:51   ` Boaz Harrosh
@ 2011-10-06 13:40     ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2011-10-06 13:40 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, Rusty Russell, Chris Wright, Jens Axboe,
	Stefan Hajnoczi, kvm, linux-kernel

On Thu, Oct 06, 2011 at 12:51:39AM +0200, Boaz Harrosh wrote:
> I have some questions.
> 
> - Could we later use this bio_map_sg() to implement blk_rq_map_sg() and
>   remove some duplicated code?

I didn't even think about that, but it actually looks very possible
to factor the "meat" of the for-each-bvec loop into a common helper
used by both.

> - Don't you need to support a chained bio (bio->next != NULL)? Because
>   I did not see any looping in the last patch 
> 	[PATCH 5/5] virtio-blk: implement ->make_request
>   Or is it that ->make_request() is a single bio at a time?
>   If so could we benefit from both bio-chaining and sg-chaning to
>   make bigger IOs?

At this point ->make_request is a single bio interface.  I have some
WIP patches to do the onstack plugging per bio, at which point it would
change to take a list.  For this to work we'd need major changes to
all ->make_request drivers.


* Re: [PATCH 5/5] virtio-blk: implement ->make_request
  2011-10-06  1:52   ` Rusty Russell
@ 2011-10-06 13:42     ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2011-10-06 13:42 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Hellwig, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel

On Thu, Oct 06, 2011 at 12:22:14PM +1030, Rusty Russell wrote:
> On Wed, 05 Oct 2011 15:54:08 -0400, Christoph Hellwig <hch@infradead.org> wrote:
> > Add an alternate I/O path that implements ->make_request for virtio-blk.
> > This is required for high IOPs devices which get slowed down to 1/5th of
> > the native speed by all the locking, memory allocation and other overhead
> > in the request based I/O path.
> 
> Ouch.
> 
> I'd be tempted to just switch across to this, though I'd be interested
> to see if the simple add_buf change I referred to before has some effect
> by itself (I doubt it).

Benchmarking this more extensively even on low-end devices is number
one on my todo list after sorting out the virtqueue race and implementing
flush/fua support.  I'd really prefer to switch over to it
unconditionally if the performance numbers allow it.

> Also, though it's overkill I'd use standard list primitives rather than
> open-coding a single linked list.

I really prefer using standard helpers, but using a doubly linked list
and increasing memory usage seems like such a waste.  Maybe I should
annoy Linus by proposing another iteration of a common single linked
list implementation :)


* Re: [PATCH 5/5] virtio-blk: implement ->make_request
  2011-10-05 19:54 ` [PATCH 5/5] virtio-blk: implement ->make_request Christoph Hellwig
  2011-10-06  1:52   ` Rusty Russell
@ 2011-10-06 13:53   ` Jens Axboe
  1 sibling, 0 replies; 21+ messages in thread
From: Jens Axboe @ 2011-10-06 13:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Rusty Russell, Chris Wright, Stefan Hajnoczi, kvm, linux-kernel

On Wed, Oct 05 2011, Christoph Hellwig wrote:
> Add an alternate I/O path that implements ->make_request for virtio-blk.
> This is required for high IOPs devices which get slowed down to 1/5th of
> the native speed by all the locking, memory allocation and other overhead
> in the request based I/O path.

We definitely have some performance fruit hanging rather low in that
path, but a factor 5 performance difference sounds insanely excessive. I
haven't looked at virtio_blk in detail, but I've done 500K+ IOPS on
request based drivers before (yes, on real hardware). So it could be
that virtio_blk is just doing things rather suboptimally in some places,
and that it would be possible to reclaim most of that speedup there too.

-- 
Jens Axboe


* Re: [PATCH 2/5] virtio: support unlocked queue kick
  2011-10-06 13:19     ` Michael S. Tsirkin
@ 2011-11-01 14:40       ` Michael S. Tsirkin
  2011-11-02  3:19         ` Rusty Russell
  0 siblings, 1 reply; 21+ messages in thread
From: Michael S. Tsirkin @ 2011-11-01 14:40 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Hellwig, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel

On Thu, Oct 06, 2011 at 11:18:28AM -0200, Michael S. Tsirkin wrote:
> On Thu, Oct 06, 2011 at 12:15:36PM +1030, Rusty Russell wrote:
> > On Wed, 05 Oct 2011 15:54:05 -0400, Christoph Hellwig <hch@infradead.org> wrote:
> > > Split virtqueue_kick to be able to do the actual notification outside the
> > > lock protecting the virtqueue.  This patch was originally done by
> > > Stefan Hajnoczi, but I can't find the original one anymore and had to
> > > recreate it from memory.  Pointers to the original or corrections for
> > > the commit message are welcome.
> > 
> > An alternative to this is to update the ring on every virtqueue_add_buf.
> > MST discussed this for virtio_net, and I think it's better in general.
> > 
> > The only reason that it wasn't written that way originally is that the
> > barriers make add_buf slightly more expensive.
> > 
> > Cheers,
> > Rusty.
> 
> With event index, I'm not sure that's enough to make the kick lockless
> anymore.

Hmm, any comment on this?  These have been benchmarked
to give a sizeable speedup, so I'm thinking it's better to take
the patches as is; if someone has the inclination to redo
the work with an atomic virtqueue_add_buf, that can
be applied on top.

> 
> -- 
> MST


* Re: [PATCH 2/5] virtio: support unlocked queue kick
  2011-11-01 14:40       ` Michael S. Tsirkin
@ 2011-11-02  3:19         ` Rusty Russell
  2011-11-02  7:25           ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Rusty Russell @ 2011-11-02  3:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christoph Hellwig, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel

On Tue, 1 Nov 2011 16:40:45 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Oct 06, 2011 at 11:18:28AM -0200, Michael S. Tsirkin wrote:
> > On Thu, Oct 06, 2011 at 12:15:36PM +1030, Rusty Russell wrote:
> > > On Wed, 05 Oct 2011 15:54:05 -0400, Christoph Hellwig <hch@infradead.org> wrote:
> > > > Split virtqueue_kick to be able to do the actual notification outside the
> > > > lock protecting the virtqueue.  This patch was originally done by
> > > An alternative to this is to update the ring on every virtqueue_add_buf.
> > > MST discussed this for virtio_net, and I think it's better in general.
> > > 
> > > The only reason that it wasn't written that way originally is that the
> > > barriers make add_buf slightly more expensive.
> > 
> > With event index, I'm not sure that's enough to make the kick lockless
> > anymore.
> 
> Hmm, any comment on this? These have been benchmarked
> to give a sizeable speedup, so I'm thinking it's better to take
> the patches as is, if someone has the inclination to redo
> the work with an atomic virtqueue_add_buf, that can
> be applied on top.

I thought it was still a WIP?

Since the problem is contention on the lock inside the block layer, the
simplest solution is to have a separate lock to protect the virtqueue.

A bit more work for virtio_blk, but probably in the noise.  And it
eliminates the gratuitous wakeups a race would cause in the
lockless patch.

Something like this (untested):

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -19,8 +19,12 @@ struct workqueue_struct *virtblk_wq;
 
 struct virtio_blk
 {
+	/* Lock for block layer. */
 	spinlock_t lock;
 
+	/* Lock for virtqueue (nests inside vblk->lock). */
+	spinlock_t vq_lock;
+
 	struct virtio_device *vdev;
 	struct virtqueue *vq;
 
@@ -62,6 +66,7 @@ static void blk_done(struct virtqueue *v
 	unsigned long flags;
 
 	spin_lock_irqsave(&vblk->lock, flags);
+	spin_lock(&vblk->vq_lock);
 	while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
 		int error;
 
@@ -94,6 +99,7 @@ static void blk_done(struct virtqueue *v
 		list_del(&vbr->list);
 		mempool_free(vbr, vblk->pool);
 	}
+	spin_unlock(&vblk->vq_lock);
 	/* In case queue is stopped waiting for more buffers. */
 	blk_start_queue(vblk->disk->queue);
 	spin_unlock_irqrestore(&vblk->lock, flags);
@@ -171,10 +177,13 @@ static bool do_req(struct request_queue 
 		}
 	}
 
+	spin_lock(&vblk->vq_lock);
 	if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr) < 0) {
+		spin_unlock(&vblk->vq_lock);
 		mempool_free(vbr, vblk->pool);
 		return false;
 	}
+	spin_unlock(&vblk->vq_lock);
 
 	list_add_tail(&vbr->list, &vblk->reqs);
 	return true;
@@ -199,8 +208,11 @@ static void do_virtblk_request(struct re
 		issued++;
 	}
 
-	if (issued)
+	if (issued) {
+		spin_lock(&vblk->vq_lock);
 		virtqueue_kick(vblk->vq);
+		spin_unlock(&vblk->vq_lock);
+	}
 }
 
 /* return id (s/n) string for *disk to *id_str
@@ -384,6 +396,7 @@ static int __devinit virtblk_probe(struc
 
 	INIT_LIST_HEAD(&vblk->reqs);
 	spin_lock_init(&vblk->lock);
+	spin_lock_init(&vblk->vq_lock);
 	vblk->vdev = vdev;
 	vblk->sg_elems = sg_elems;
 	sg_init_table(vblk->sg, vblk->sg_elems);


* Re: [PATCH 2/5] virtio: support unlocked queue kick
  2011-11-02  3:19         ` Rusty Russell
@ 2011-11-02  7:25           ` Christoph Hellwig
  2011-11-03  4:01             ` Rusty Russell
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2011-11-02  7:25 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Michael S. Tsirkin, Christoph Hellwig, Chris Wright, Jens Axboe,
	Stefan Hajnoczi, kvm, linux-kernel

On Wed, Nov 02, 2011 at 01:49:36PM +1030, Rusty Russell wrote:
> I thought it was still a WIP?

The whole series - yes.  This patch (and the serial number rewrite): no
- these are pretty much rock solid.

> Since the problem is contention on the lock inside the block layer, the
> simplest solution is to have a separate lock to protect the virtqueue.

As long as we still use a ->request_fn based driver that is not going
to buy us anything, in fact it's going to make things worse.
->request_fn based drivers always have the queue lock held over the
invocation of ->request_fn anyway, and then need it around the call
to __blk_end_request_all.  So you might minimally reduce contention
time, but skyrocket the number of lock acquisitions when separating
them without changes to the block layer.

With the ->make_request_fn based driver vblk->lock doesn't protect
anything but the virtqueue anyway, and not having to take it
over the wakeup there is a) done easily and b) neatly fits the model.


* Re: [PATCH 2/5] virtio: support unlocked queue kick
  2011-11-02  7:25           ` Christoph Hellwig
@ 2011-11-03  4:01             ` Rusty Russell
  2011-11-03  5:15               ` Rusty Russell
  0 siblings, 1 reply; 21+ messages in thread
From: Rusty Russell @ 2011-11-03  4:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michael S. Tsirkin, Christoph Hellwig, Chris Wright, Jens Axboe,
	Stefan Hajnoczi, kvm, linux-kernel

On Wed, 2 Nov 2011 03:25:44 -0400, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Nov 02, 2011 at 01:49:36PM +1030, Rusty Russell wrote:
> > I thought it was still a WIP?
> 
> The whole series - yes.  This patch (and the serial number rewrite): no
> - these are pretty much rock solid.

OK, thanks.

> > Since the problem is contention on the lock inside the block layer, the
> > simplest solution is to have a separate lock to protect the virtqueue.
> 
> As long as we still use a ->request_fn based driver that is not going
> to buy us anything, in fact it's going to make things worse.

Of course...

> With the ->make_request_fn based driver vblk->lock doesn't protect
> anything but the virtqueue anyway, and not having to take it
> over the wakeup there is a) done easily and b) neatly fits the model.

It adds YA API though.  But I can't better it.  Doing the "should we
kick" check outside the lock is problematic, and doing it inside every
add() is inefficient.

So let's change the API for everyone, into:

        bool virtqueue_should_kick(struct virtqueue *vq);
        void virtqueue_kick(struct virtqueue *vq);

Patch series coming...

Thanks,
Rusty.


* Re: [PATCH 2/5] virtio: support unlocked queue kick
  2011-11-03  4:01             ` Rusty Russell
@ 2011-11-03  5:15               ` Rusty Russell
  0 siblings, 0 replies; 21+ messages in thread
From: Rusty Russell @ 2011-11-03  5:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michael S. Tsirkin, Christoph Hellwig, Chris Wright, Jens Axboe,
	Stefan Hajnoczi, kvm, linux-kernel

On Thu, 03 Nov 2011 14:31:48 +1030, Rusty Russell <rusty@rustcorp.com.au> wrote:
> So let's change the API for everyone, into:
> 
>         bool virtqueue_should_kick(struct virtqueue *vq);
>         void virtqueue_kick(struct virtqueue *vq);
> 
> Patch series coming...

Nope, that sucked.  virtqueue_should_kick() has side effects (it has to,
if you want the actual kick to be outside the lock).

I stole the names from your patch instead, just added some documentation
and restored the "EXPORT_SYMBOL_GPL(virtqueue_kick);".  No Signed-off-by
on yours.

Also, one trivial nit.  You did:

	bool need_kick = false;
...
	if (vq->event) {
		if (vring_need_event(vring_avail_event(&vq->vring), new, old))
			need_kick = true;
	} else {
		if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
			need_kick = true;
	}
...
        return need_kick;

I prefer to use uninitialized vars where possible:

        bool kick;

	if (vq->event) {
		kick = vring_need_event(vring_avail_event(&vq->vring), new, old);
	} else {
		kick = (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY));
	}
...
        return kick;

This way, GCC will give an uninitialized var warning if someone changes
the code and forgets to set kick.  This is especially noticeable with
err values and complex functions using goto.

Thanks,
Rusty.


* Re: [PATCH 2/5] virtio: support unlocked queue kick
  2011-10-05 19:54 ` [PATCH 2/5] virtio: support unlocked queue kick Christoph Hellwig
  2011-10-06  8:42   ` Stefan Hajnoczi
       [not found]   ` <87r52qgaf3.fsf@rustcorp.com.au>
@ 2011-11-03  6:45   ` Minchan Kim
  2 siblings, 0 replies; 21+ messages in thread
From: Minchan Kim @ 2011-11-03  6:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Rusty Russell, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel

Hi Christoph,

On Wed, Oct 05, 2011 at 03:54:05PM -0400, Christoph Hellwig wrote:
> Split virtqueue_kick to be able to do the actual notification outside the
> lock protecting the virtqueue.  This patch was originally done by

Nitpick:

Until now virtqueue_kick has always been called under the lock, so the
description confused me.  It might fit better as [5/5].

> Stefan Hajnoczi, but I can't find the original one anymore and had to
> recreate it from memory.  Pointers to the original or corrections for
> the commit message are welcome.
> 
> Index: linux-2.6/drivers/virtio/virtio_ring.c
> ===================================================================
> --- linux-2.6.orig/drivers/virtio/virtio_ring.c	2011-09-15 15:28:55.891347016 +0200
> +++ linux-2.6/drivers/virtio/virtio_ring.c	2011-10-03 18:45:32.492738431 +0200
> @@ -237,9 +237,11 @@ add_head:
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_buf_gfp);
>  
> -void virtqueue_kick(struct virtqueue *_vq)
> +bool virtqueue_kick_prepare(struct virtqueue *_vq)
>  {
>  	struct vring_virtqueue *vq = to_vvq(_vq);
> +	bool need_kick = false;
> +
>  	u16 new, old;
>  	START_USE(vq);
>  	/* Descriptors and available array need to be set before we expose the
> @@ -253,15 +255,32 @@ void virtqueue_kick(struct virtqueue *_v
>  	/* Need to update avail index before checking if we should notify */
>  	virtio_mb();
>  
> -	if (vq->event ?
> -	    vring_need_event(vring_avail_event(&vq->vring), new, old) :
> -	    !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
> -		/* Prod other side to tell it about changes. */
> -		vq->notify(&vq->vq);
> -
> +	if (vq->event) {
> +		if (vring_need_event(vring_avail_event(&vq->vring), new, old))
> +			need_kick = true;
> +	} else {
> +		if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
> +			need_kick = true;
> +	}
>  	END_USE(vq);
> +	return need_kick;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
> +
> +void virtqueue_notify(struct virtqueue *_vq)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +
> +	/* Prod other side to tell it about changes. */
> +	vq->notify(_vq);
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_notify);
> +
> +void virtqueue_kick(struct virtqueue *vq)
> +{
> +	if (virtqueue_kick_prepare(vq))
> +		virtqueue_notify(vq);
>  }
> -EXPORT_SYMBOL_GPL(virtqueue_kick);

It should still be exported for use by other modules.

-- 
Kind regards,
Minchan Kim


end of thread, other threads:[~2011-11-03  6:45 UTC | newest]

Thread overview: 21+ messages
2011-10-05 19:54 [PATCH 0/5] RFC: ->make_request support for virtio-blk Christoph Hellwig
2011-10-05 19:54 ` [PATCH 1/5] block: add bio_map_sg Christoph Hellwig
2011-10-05 22:51   ` Boaz Harrosh
2011-10-06 13:40     ` Christoph Hellwig
2011-10-05 19:54 ` [PATCH 2/5] virtio: support unlocked queue kick Christoph Hellwig
2011-10-06  8:42   ` Stefan Hajnoczi
     [not found]   ` <87r52qgaf3.fsf@rustcorp.com.au>
2011-10-06 13:19     ` Michael S. Tsirkin
2011-11-01 14:40       ` Michael S. Tsirkin
2011-11-02  3:19         ` Rusty Russell
2011-11-02  7:25           ` Christoph Hellwig
2011-11-03  4:01             ` Rusty Russell
2011-11-03  5:15               ` Rusty Russell
2011-11-03  6:45   ` Minchan Kim
2011-10-05 19:54 ` [PATCH 3/5] virtio-blk: remove the unused list of pending requests Christoph Hellwig
2011-10-05 19:54 ` [PATCH 4/5] virtio-blk: reimplement the serial attribute without using requests Christoph Hellwig
2011-10-05 19:54 ` [PATCH 5/5] virtio-blk: implement ->make_request Christoph Hellwig
2011-10-06  1:52   ` Rusty Russell
2011-10-06 13:42     ` Christoph Hellwig
2011-10-06 13:53   ` Jens Axboe
2011-10-05 20:31 ` [PATCH 0/5] RFC: ->make_request support for virtio-blk Vivek Goyal
2011-10-05 21:53   ` Christoph Hellwig
