Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Rusty Russell <rusty@rustcorp.com.au>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>,
	Wang Sen <senwang@linux.vnet.ibm.com>,
	linux-scsi@vger.kernel.org, JBottomley@parallels.com,
	stefanha@linux.vnet.ibm.com, mc@linux.vnet.ibm.com,
	linux-kernel@vger.kernel.org,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>
Subject: Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
Date: Mon, 30 Jul 2012 09:20:12 +0930	[thread overview]
Message-ID: <87obmym8jv.fsf@rustcorp.com.au> (raw)
In-Reply-To: <50124D2E.7050407@redhat.com>

On Fri, 27 Jul 2012 10:11:26 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 27/07/2012 08:27, Rusty Russell ha scritto:
> >> > +int virtqueue_add_buf_sg(struct virtqueue *_vq,
> >> > +			 struct scatterlist *sg_out,
> >> > +			 unsigned int out,
> >> > +			 struct scatterlist *sg_in,
> >> > +			 unsigned int in,
> >> > +			 void *data,
> >> > +			 gfp_t gfp)
> > The point of chained scatterlists is they're self-terminated, so the
> > in & out counts should be calculated.
> > 
> > Counting them is not *that* bad, since we're about to read them all
> > anyway.
> > 
> > (Yes, the chained scatterlist stuff is complete crack, but I lost that
> > debate years ago.)
> > 
> > Here's my variant.  Networking, console and block seem OK, at least
> > (ie. it booted!).
> 
> I hate the for loops, even though we're about indeed to read all the
> scatterlists anyway... all they do is lengthen critical sections.

You're preaching to the choir: I agree.  But even more, I hate the
passing of a number and a scatterlist: it makes it doubly ambigious
whether we should use the number or the terminator.  And I really hate
bad APIs, even more than a bit of icache loss.

> Also, being the first user of chained scatterlist doesn't exactly give
> me warm fuzzies.

We're far from the first user: they've been in the kernel for well over
7 years.  They were introduced for the block layer, but they tended to
ignore private uses of scatterlists like this one.

> I think it's simpler if we provide an API to add individual buffers to
> the virtqueue, so that you can do multiple virtqueue_add_buf_more
> (whatever) before kicking the virtqueue.  The idea is that I can still
> use indirect buffers for the scatterlists that come from the block layer
> or from an skb, but I will use direct buffers for the request/response
> descriptors.  The direct buffers are always a small number (usually 2),
> so you can balance the effect by making the virtqueue bigger.  And for
> small reads and writes, you save a kmalloc on a very hot path.

This is why I hate chained scatterlists: there's no sane way to tell the
difference between passing a simple array and passing a chain.  We're
mugging the type system.

I think the right way of doing this is a flag.  We could abuse gfp_t,
but that's super-ugly.  Perhaps we should create our own
VQ_ATOMIC/VQ_KERNEL/VQ_DIRECT enum?

Or we could do what we previously suggested, and try to do some
intelligent size heuristic.  I've posted a patch from 3 years ago below.
 
> (BTW, scatterlists will have separate entries for each page; we do not
> need this in virtio buffers.  Collapsing physically-adjacent entries
> will speed up QEMU and will also help avoiding indirect buffers).

Yes, we should do this.  But note that this means an iteration, so we
might as well combine the loops :)

Cheers,
Rusty.

FIXME: remove printk
virtio: use indirect buffers based on demand (heuristic)

virtio_ring uses a ring buffer of descriptors: indirect support allows
a single descriptor to refer to a table of descriptors.  This saves
space in the ring, but requires a kmalloc/kfree.

Rather than try to figure out what the right threshold at which to use
indirect buffers, we drop the threshold dynamically when the ring is
under stress.

Note: to stress this, I reduced the ring size to 32 in lguest, and a
1G send reduced the threshold to 9.

Note2: I moved the BUG_ON()s above the indirect test, where they belong
(indirect falls thru on OOM, so the constraints still apply).
---
 drivers/virtio/virtio_ring.c |   61 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 52 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -63,6 +63,8 @@ struct vring_virtqueue
 
 	/* Host supports indirect buffers */
 	bool indirect;
+	/* Threshold before we go indirect. */
+	unsigned int indirect_threshold;
 
 	/* Number of free buffers */
 	unsigned int num_free;
@@ -139,6 +141,32 @@ static int vring_add_indirect(struct vri
 	return head;
 }
 
+static void adjust_threshold(struct vring_virtqueue *vq,
+			     unsigned int out, unsigned int in)
+{
+	/* There are really two species of virtqueue, and it matters here.
+	 * If there are no output parts, it's a "normally full" receive queue,
+	 * otherwise it's a "normally empty" send queue. */
+	if (out) {
+		/* Leave threshold unless we're full. */
+		if (out + in < vq->num_free)
+			return;
+	} else {
+		/* Leave threshold unless we're empty. */
+		if (vq->num_free != vq->vring.num)
+			return;
+	}
+
+	/* Never drop threshold below 1 */
+	vq->indirect_threshold /= 2;
+	vq->indirect_threshold |= 1;
+
+	printk("%s %s: indirect threshold %u (%u+%u vs %u)\n",
+	       dev_name(&vq->vq.vdev->dev),
+	       vq->vq.name, vq->indirect_threshold,
+	       out, in, vq->num_free);
+}
+
 static int vring_add_buf(struct virtqueue *_vq,
 			 struct scatterlist sg[],
 			 unsigned int out,
@@ -151,19 +179,33 @@ static int vring_add_buf(struct virtqueu
 	START_USE(vq);
 
 	BUG_ON(data == NULL);
-
-	/* If the host supports indirect descriptor tables, and we have multiple
-	 * buffers, then go indirect. FIXME: tune this threshold */
-	if (vq->indirect && (out + in) > 1 && vq->num_free) {
-		head = vring_add_indirect(vq, sg, out, in);
-		if (head != vq->vring.num)
-			goto add_head;
-	}
-
 	BUG_ON(out + in > vq->vring.num);
 	BUG_ON(out + in == 0);
 
 	vq->addbuf_total++;
+
+	/* If the host supports indirect descriptor tables, consider it. */
+	if (vq->indirect) {
+		bool try_indirect;
+
+		/* We tweak the threshold automatically. */
+		adjust_threshold(vq, out, in);
+
+		/* If we can't fit any at all, fall through. */
+		if (vq->num_free == 0)
+			try_indirect = false;
+		else if (out + in > vq->num_free)
+			try_indirect = true;
+		else
+			try_indirect = (out + in > vq->indirect_threshold);
+
+		if (try_indirect) {
+			head = vring_add_indirect(vq, sg, out, in);
+			if (head != vq->vring.num)
+				goto add_head;
+		}
+	}
+
 	if (vq->num_free < out + in) {
 		pr_debug("Can't add buf len %i - avail = %i\n",
 			 out + in, vq->num_free);
@@ -401,6 +443,7 @@ struct virtqueue *vring_new_virtqueue(un
 		= vq->other_notify = 0;
 
 	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
+	vq->indirect_threshold = num;
 
 	/* No callback?  Tell other side not to bother us. */
 	if (!callback)

WARNING: multiple messages have this Message-ID (diff)

From: Rusty Russell <rusty@rustcorp.com.au>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>,
	Wang Sen <senwang@linux.vnet.ibm.com>,
	linux-scsi@vger.kernel.org, JBottomley@parallels.com,
	stefanha@linux.vnet.ibm.com, mc@linux.vnet.ibm.com,
	linux-kernel@vger.kernel.org,
	"kvm\@vger.kernel.org" <kvm@vger.kernel.org>
Subject: Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
Date: Mon, 30 Jul 2012 09:20:12 +0930	[thread overview]
Message-ID: <87obmym8jv.fsf@rustcorp.com.au> (raw)
In-Reply-To: <50124D2E.7050407@redhat.com>

On Fri, 27 Jul 2012 10:11:26 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 27/07/2012 08:27, Rusty Russell ha scritto:
> >> > +int virtqueue_add_buf_sg(struct virtqueue *_vq,
> >> > +			 struct scatterlist *sg_out,
> >> > +			 unsigned int out,
> >> > +			 struct scatterlist *sg_in,
> >> > +			 unsigned int in,
> >> > +			 void *data,
> >> > +			 gfp_t gfp)
> > The point of chained scatterlists is they're self-terminated, so the
> > in & out counts should be calculated.
> > 
> > Counting them is not *that* bad, since we're about to read them all
> > anyway.
> > 
> > (Yes, the chained scatterlist stuff is complete crack, but I lost that
> > debate years ago.)
> > 
> > Here's my variant.  Networking, console and block seem OK, at least
> > (ie. it booted!).
> 
> I hate the for loops, even though we're about indeed to read all the
> scatterlists anyway... all they do is lengthen critical sections.

You're preaching to the choir: I agree.  But even more, I hate the
passing of a number and a scatterlist: it makes it doubly ambigious
whether we should use the number or the terminator.  And I really hate
bad APIs, even more than a bit of icache loss.

> Also, being the first user of chained scatterlist doesn't exactly give
> me warm fuzzies.

We're far from the first user: they've been in the kernel for well over
7 years.  They were introduced for the block layer, but they tended to
ignore private uses of scatterlists like this one.

> I think it's simpler if we provide an API to add individual buffers to
> the virtqueue, so that you can do multiple virtqueue_add_buf_more
> (whatever) before kicking the virtqueue.  The idea is that I can still
> use indirect buffers for the scatterlists that come from the block layer
> or from an skb, but I will use direct buffers for the request/response
> descriptors.  The direct buffers are always a small number (usually 2),
> so you can balance the effect by making the virtqueue bigger.  And for
> small reads and writes, you save a kmalloc on a very hot path.

This is why I hate chained scatterlists: there's no sane way to tell the
difference between passing a simple array and passing a chain.  We're
mugging the type system.

I think the right way of doing this is a flag.  We could abuse gfp_t,
but that's super-ugly.  Perhaps we should create our own
VQ_ATOMIC/VQ_KERNEL/VQ_DIRECT enum?

Or we could do what we previously suggested, and try to do some
intelligent size heuristic.  I've posted a patch from 3 years ago below.
 
> (BTW, scatterlists will have separate entries for each page; we do not
> need this in virtio buffers.  Collapsing physically-adjacent entries
> will speed up QEMU and will also help avoiding indirect buffers).

Yes, we should do this.  But note that this means an iteration, so we
might as well combine the loops :)

Cheers,
Rusty.

FIXME: remove printk
virtio: use indirect buffers based on demand (heuristic)

virtio_ring uses a ring buffer of descriptors: indirect support allows
a single descriptor to refer to a table of descriptors.  This saves
space in the ring, but requires a kmalloc/kfree.

Rather than try to figure out what the right threshold at which to use
indirect buffers, we drop the threshold dynamically when the ring is
under stress.

Note: to stress this, I reduced the ring size to 32 in lguest, and a
1G send reduced the threshold to 9.

Note2: I moved the BUG_ON()s above the indirect test, where they belong
(indirect falls thru on OOM, so the constraints still apply).
---
 drivers/virtio/virtio_ring.c |   61 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 52 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -63,6 +63,8 @@ struct vring_virtqueue
 
 	/* Host supports indirect buffers */
 	bool indirect;
+	/* Threshold before we go indirect. */
+	unsigned int indirect_threshold;
 
 	/* Number of free buffers */
 	unsigned int num_free;
@@ -139,6 +141,32 @@ static int vring_add_indirect(struct vri
 	return head;
 }
 
+static void adjust_threshold(struct vring_virtqueue *vq,
+			     unsigned int out, unsigned int in)
+{
+	/* There are really two species of virtqueue, and it matters here.
+	 * If there are no output parts, it's a "normally full" receive queue,
+	 * otherwise it's a "normally empty" send queue. */
+	if (out) {
+		/* Leave threshold unless we're full. */
+		if (out + in < vq->num_free)
+			return;
+	} else {
+		/* Leave threshold unless we're empty. */
+		if (vq->num_free != vq->vring.num)
+			return;
+	}
+
+	/* Never drop threshold below 1 */
+	vq->indirect_threshold /= 2;
+	vq->indirect_threshold |= 1;
+
+	printk("%s %s: indirect threshold %u (%u+%u vs %u)\n",
+	       dev_name(&vq->vq.vdev->dev),
+	       vq->vq.name, vq->indirect_threshold,
+	       out, in, vq->num_free);
+}
+
 static int vring_add_buf(struct virtqueue *_vq,
 			 struct scatterlist sg[],
 			 unsigned int out,
@@ -151,19 +179,33 @@ static int vring_add_buf(struct virtqueu
 	START_USE(vq);
 
 	BUG_ON(data == NULL);
-
-	/* If the host supports indirect descriptor tables, and we have multiple
-	 * buffers, then go indirect. FIXME: tune this threshold */
-	if (vq->indirect && (out + in) > 1 && vq->num_free) {
-		head = vring_add_indirect(vq, sg, out, in);
-		if (head != vq->vring.num)
-			goto add_head;
-	}
-
 	BUG_ON(out + in > vq->vring.num);
 	BUG_ON(out + in == 0);
 
 	vq->addbuf_total++;
+
+	/* If the host supports indirect descriptor tables, consider it. */
+	if (vq->indirect) {
+		bool try_indirect;
+
+		/* We tweak the threshold automatically. */
+		adjust_threshold(vq, out, in);
+
+		/* If we can't fit any at all, fall through. */
+		if (vq->num_free == 0)
+			try_indirect = false;
+		else if (out + in > vq->num_free)
+			try_indirect = true;
+		else
+			try_indirect = (out + in > vq->indirect_threshold);
+
+		if (try_indirect) {
+			head = vring_add_indirect(vq, sg, out, in);
+			if (head != vq->vring.num)
+				goto add_head;
+		}
+	}
+
 	if (vq->num_free < out + in) {
 		pr_debug("Can't add buf len %i - avail = %i\n",
 			 out + in, vq->num_free);
@@ -401,6 +443,7 @@ struct virtqueue *vring_new_virtqueue(un
 		= vq->other_notify = 0;
 
 	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
+	vq->indirect_threshold = num;
 
 	/* No callback?  Tell other side not to bother us. */
 	if (!callback)

next prev parent reply	other threads:[~2012-07-29 23:50 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-07-25  8:29 [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list Wang Sen
2012-07-25  8:44 ` Paolo Bonzini
2012-07-25  9:22   ` Boaz Harrosh
2012-07-25  9:22     ` Boaz Harrosh
2012-07-25  9:41     ` Paolo Bonzini
2012-07-25 12:34       ` Boaz Harrosh
2012-07-25 12:34         ` Boaz Harrosh
2012-07-25 12:49         ` Paolo Bonzini
2012-07-25 13:26           ` Boaz Harrosh
2012-07-25 13:26             ` Boaz Harrosh
2012-07-25 13:36             ` Paolo Bonzini
2012-07-25 14:36               ` Boaz Harrosh
2012-07-25 14:36                 ` Boaz Harrosh
2012-07-25 15:09                 ` performance improvements for the sglist API (Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list) Paolo Bonzini
2012-07-25 15:16                   ` Paolo Bonzini
2012-07-25 14:17             ` virtio(-scsi) vs. chained sg_lists (was " Paolo Bonzini
2012-07-25 15:28               ` Boaz Harrosh
2012-07-25 15:28                 ` Boaz Harrosh
2012-07-25 17:43                 ` Paolo Bonzini
2012-07-25 19:16                   ` Boaz Harrosh
2012-07-25 19:16                     ` Boaz Harrosh
2012-07-25 20:06                     ` Paolo Bonzini
2012-07-25 21:04                       ` Boaz Harrosh
2012-07-25 21:04                         ` Boaz Harrosh
2012-07-26  7:23                         ` Paolo Bonzini
2012-07-26  7:56                           ` Boaz Harrosh
2012-07-26  7:56                             ` Boaz Harrosh
2012-07-26  7:58                             ` Paolo Bonzini
2012-07-26 13:05                               ` Paolo Bonzini
2012-07-27  6:27                                 ` Rusty Russell
2012-07-27  6:27                                   ` Rusty Russell
2012-07-27  8:11                                   ` Paolo Bonzini
2012-07-29 23:50                                     ` Rusty Russell [this message]
2012-07-29 23:50                                       ` Rusty Russell
2012-07-30  7:12                                       ` Paolo Bonzini
2012-07-30  8:56                                         ` Boaz Harrosh
2012-07-30  8:56                                           ` Boaz Harrosh
2012-07-25 10:41   ` [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list Stefan Hajnoczi
2012-07-25 11:48     ` Sen Wang
2012-07-25 11:44   ` Sen Wang
2012-07-25 12:40     ` Boaz Harrosh
2012-07-25 12:40       ` Boaz Harrosh
2012-07-27  3:12       ` Wang Sen
2012-07-27  6:50         ` Paolo Bonzini
2012-07-25 10:04 ` Rolf Eike Beer
2012-07-25 10:04   ` Rolf Eike Beer
2012-07-25 11:46   ` Sen Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87obmym8jv.fsf@rustcorp.com.au \
    --to=rusty@rustcorp.com.au \
    --cc=JBottomley@parallels.com \
    --cc=bharrosh@panasas.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=mc@linux.vnet.ibm.com \
    --cc=pbonzini@redhat.com \
    --cc=senwang@linux.vnet.ibm.com \
    --cc=stefanha@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.