From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2] Persistent grant maps for xen blk drivers
Date: Tue, 30 Oct 2012 16:38:27 -0400
Message-ID: <20121030203827.GB16465@phenom.dumpdata.com>
In-Reply-To: <50901D6C.6020500@citrix.com>

On Tue, Oct 30, 2012 at 07:33:16PM +0100, Roger Pau Monné wrote:
> On 30/10/12 18:01, Konrad Rzeszutek Wilk wrote:
> > On Wed, Oct 24, 2012 at 06:58:45PM +0200, Roger Pau Monne wrote:
> >> This patch implements persistent grants for the xen-blk{front,back}
> >> mechanism. The effect of this change is to reduce the number of unmap
> >> operations performed, since they cause a (costly) TLB shootdown. This
> >> allows the I/O performance to scale better when a large number of VMs
> >> are performing I/O.
> >>
> >> Previously, the blkfront driver was supplied a bvec[] from the request
> >> queue. This was granted to dom0; dom0 performed the I/O and wrote
> >> directly into the grant-mapped memory and unmapped it; blkfront then
> >> removed foreign access for that grant. The cost of unmapping scales
> >> badly with the number of CPUs in Dom0. An experiment showed that when
> >> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
> >> ramdisk, the IPIs from performing unmaps are a bottleneck at 5 guests
> >> (at which point 650,000 IOPS are being performed in total). If more
> >> than 5 guests are used, the performance declines. By 10 guests, only
> >> 400,000 IOPS are being performed.
> >>
> >> This patch improves performance by only unmapping when the connection
> >> between blkfront and back is broken.
> >>
> >> On startup blkfront notifies blkback that it is using persistent
> >> grants, and blkback will do the same. If blkback is not capable of
> >> persistent mapping, blkfront will still use the same grants, since it
> >> is compatible with the previous protocol, and simplifies the code
> >> complexity in blkfront.
> >>
> >> To perform a read, in persistent mode, blkfront uses a separate pool
> >> of pages that it maps to dom0. When a request comes in, blkfront
> >> transmutes the request so that blkback will write into one of these
> >> free pages. Blkback keeps note of which grefs it has already
> >> mapped. When a new ring request comes to blkback, it looks to see if
> >> it has already mapped that page. If so, it will not map it again. If
> >> the page hasn't been previously mapped, it is mapped now, and a record
> >> is kept of this mapping. Blkback proceeds as usual. When blkfront is
> >> notified that blkback has completed a request, it memcpys from the
> >> shared memory into the supplied bvec. A record is kept that the
> >> {gref, page} tuple is mapped and not in flight.
> >>
> >> Writes are similar, except that the memcpy is performed from the
> >> supplied bvecs, into the shared pages, before the request is put onto
> >> the ring.
> >>
> >> Blkback stores a mapping of grefs=>{page mapped to by gref} in
> >> a red-black tree. As the grefs are not known a priori, and provide no
> >> guarantees on their ordering, we have to perform a search
> >> through this tree to find the page, for every gref we receive. This
> >> operation takes O(log n) time in the worst case. In blkfront, grants
> >> are stored in a singly linked list.
> >>
> >> The maximum number of grants that blkback will persistently map is
> >> currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
> >> prevent a malicious guest from attempting a DoS, by supplying fresh
> >> grefs, causing the Dom0 kernel to map excessively. If a guest
> >> is using persistent grants and exceeds the maximum number of grants to
> >> map persistently, the newly passed grefs will be mapped and unmapped.
> >> Using this approach, we can have requests that mix persistent and
> >> non-persistent grants, and we need to handle them correctly.
> >> This allows us to set the maximum number of persistent grants to a
> >> lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
> >> setting it lower will lead to unpredictable performance.
> >>
> >> In writing this patch, the question arises as to whether the additional
> >> cost of performing memcpys in the guest (to/from the pool of granted
> >> pages) outweighs the gains of not performing TLB shootdowns. The answer
> >> to that question is `no'. There appears to be very little, if any,
> >> additional cost to the guest of using persistent grants. There is
> >> perhaps a small saving, from the reduced number of hypercalls
> >> performed in granting, and ending foreign access.
> >>
> >> Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
> >> Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
> >> Cc: <konrad.wilk@oracle.com>
> >> Cc: <linux-kernel@vger.kernel.org>
> >> ---
> >> Changes since v1:
> >>  * Changed the unmap_seg array to a bitmap.
> >>  * Only report using persistent grants in blkfront if blkback supports
> >>    it.
> >>  * Reword some comments.
> >>  * Fix a bug when setting the handler, index j was not incremented
> >>    correctly.
> >>  * Check that the tree of grants in blkback is not empty before
> >>    iterating over it when doing the cleanup.
> >>  * Rebase on top of linux-net.
> > 
> > I fixed the 'new_map = [1|0]' you had in and altered it to use 'true'
> > or 'false', but when running some tests (with a 64-bit PV guest) I got it
> > to bug.
> 
> Thanks for the testing. I'm going to rebase on top of your linux-next
> branch and see if I can reproduce it. Did you run any kind of specific
> test/benchmark? I've been running with this patch for a long time (on

None. Just booted a guest with a phy:/dev/vg_guest/blah.

> top of your previous linux-next branch), and I haven't been able to get
> it to bug.
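
For readers following the description quoted above, here is a minimal
userspace sketch of the blkback-side lookup: search a tree for the gref,
map it only on a miss, and stop caching once a cap is reached so that
extra grefs are mapped and unmapped per request. This is not code from
the patch. All names (pgrant, pgrant_cache, pgrant_get, pgrant_put) are
hypothetical, a plain unbalanced binary search tree stands in for the
kernel red-black tree, and malloc/free stand in for the grant map and
unmap operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096   /* assumed page size for the stand-in "mapping" */

typedef uint32_t grant_ref_t;   /* hypothetical userspace stand-in */

/* One cached {gref, page} pair; a BST node stands in for an rb-tree node. */
struct pgrant {
	grant_ref_t gref;
	void *page;
	struct pgrant *left, *right;
};

struct pgrant_cache {
	struct pgrant *root;
	unsigned int count;      /* grants currently cached */
	unsigned int max;        /* cap on cached grants */
	unsigned int map_ops;    /* simulated map operations */
	unsigned int unmap_ops;  /* simulated unmap operations */
};

/*
 * Return the page backing gref. *persistent is true when the mapping is
 * cached and must stay mapped; false when the caller must release it
 * with pgrant_put() once the request completes.
 */
static void *pgrant_get(struct pgrant_cache *c, grant_ref_t gref,
			bool *persistent)
{
	struct pgrant **link = &c->root;
	struct pgrant *node;
	void *page;

	assert(c && persistent);

	/* Search the tree; on a hit no new mapping is needed. */
	while (*link) {
		if (gref == (*link)->gref) {
			*persistent = true;
			return (*link)->page;
		}
		link = gref < (*link)->gref ? &(*link)->left : &(*link)->right;
	}

	/* Miss: simulate mapping the grant. */
	*persistent = false;
	page = malloc(SKETCH_PAGE_SIZE);
	if (!page)
		return NULL;
	c->map_ops++;

	/* Over the cap: hand back a one-shot mapping, do not cache it. */
	if (c->count >= c->max)
		return page;

	node = calloc(1, sizeof(*node));
	if (!node)
		return page;     /* fall back to a one-shot mapping */

	node->gref = gref;
	node->page = page;
	*link = node;            /* insert at the position the search ended */
	c->count++;
	*persistent = true;
	return page;
}

/* Release a page after a request; only one-shot mappings are unmapped. */
static void pgrant_put(struct pgrant_cache *c, void *page, bool persistent)
{
	if (persistent)
		return;
	free(page);
	c->unmap_ops++;
}
```

With max set to 2, requests for grefs 7, 3 and 9 cause three map
operations, a repeated request for gref 7 causes none, and only gref 9
is unmapped when its request completes.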
