From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Roger Pau Monne <roger.pau@citrix.com>
Cc: xen-devel@lists.xen.org, Oliver Chick <oliver.chick@citrix.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] Persistent grant maps for xen blk drivers
Date: Mon, 29 Oct 2012 09:57:33 -0400 [thread overview]
Message-ID: <20121029135733.GZ2708@phenom.dumpdata.com> (raw)
In-Reply-To: <1351097925-26221-1-git-send-email-roger.pau@citrix.com>
On Wed, Oct 24, 2012 at 06:58:45PM +0200, Roger Pau Monne wrote:
> This patch implements persistent grants for the xen-blk{front,back}
> mechanism. The effect of this change is to reduce the number of unmap
> operations performed, since they cause a (costly) TLB shootdown. This
> allows the I/O performance to scale better when a large number of VMs
> are performing I/O.
>
> Previously, the blkfront driver was supplied a bvec[] from the request
> queue. This was granted to dom0; dom0 performed the I/O and wrote
> directly into the grant-mapped memory and unmapped it; blkfront then
> removed foreign access for that grant. The cost of unmapping scales
> badly with the number of CPUs in Dom0. An experiment showed that when
> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
> ramdisk, the IPIs from performing unmap's is a bottleneck at 5 guests
> (at which point 650,000 IOPS are being performed in total). If more
> than 5 guests are used, the performance declines. By 10 guests, only
> 400,000 IOPS are being performed.
>
> This patch improves performance by only unmapping when the connection
> between blkfront and back is broken.
>
> On startup blkfront notifies blkback that it is using persistent
> grants, and blkback will do the same. If blkback is not capable of
> persistent mapping, blkfront will still use the same grants, since it
> is compatible with the previous protocol, and simplifies the code
> complexity in blkfront.
>
> To perform a read, in persistent mode, blkfront uses a separate pool
> of pages that it maps to dom0. When a request comes in, blkfront
> transmutes the request so that blkback will write into one of these
> free pages. Blkback keeps note of which grefs it has already
> mapped. When a new ring request comes to blkback, it looks to see if
> it has already mapped that page. If so, it will not map it again. If
> the page hasn't been previously mapped, it is mapped now, and a record
> is kept of this mapping. Blkback proceeds as usual. When blkfront is
> notified that blkback has completed a request, it memcpy's from the
> shared memory, into the bvec supplied. A record that the {gref, page}
> tuple is mapped, and not inflight is kept.
>
> Writes are similar, except that the memcpy is peformed from the
> supplied bvecs, into the shared pages, before the request is put onto
> the ring.
>
> Blkback stores a mapping of grefs=>{page mapped to by gref} in
> a red-black tree. As the grefs are not known apriori, and provide no
> guarantees on their ordering, we have to perform a search
> through this tree to find the page, for every gref we receive. This
> operation takes O(log n) time in the worst case. In blkfront grants
> are stored using a single linked list.
>
> The maximum number of grants that blkback will persistenly map is
> currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
> prevent a malicios guest from attempting a DoS, by supplying fresh
> grefs, causing the Dom0 kernel to map excessively. If a guest
> is using persistent grants and exceeds the maximum number of grants to
> map persistenly the newly passed grefs will be mapped and unmaped.
> Using this approach, we can have requests that mix persistent and
> non-persistent grants, and we need to handle them correctly.
> This allows us to set the maximum number of persistent grants to a
> lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
> setting it will lead to unpredictable performance.
>
> In writing this patch, the question arrises as to if the additional
> cost of performing memcpys in the guest (to/from the pool of granted
> pages) outweigh the gains of not performing TLB shootdowns. The answer
> to that question is `no'. There appears to be very little, if any
> additional cost to the guest of using persistent grants. There is
> perhaps a small saving, from the reduced number of hypercalls
> performed in granting, and ending foreign access.
>
> Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
> Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
> Cc: <konrad.wilk@oracle.com>
> Cc: <linux-kernel@vger.kernel.org>
> ---
> Changes since v1:
> * Changed the unmap_seg array to a bitmap.
> * Only report using persistent grants in blkfront if blkback supports
> it.
> * Reword some comments.
> * Fix a bug when setting the handler, index j was not incremented
> correctly.
> * Check that the tree of grants in blkback is not empty before
> iterating over it when doing the cleanup.
> * Rebase on top of linux-net.
It looks good to me. Going to stick it on my for-jens-3.8 branch.
> ---
> Benchmarks showing the impact of this patch in blk performance can be
> found at:
>
> http://xenbits.xensource.com/people/royger/persistent_grants/
> ---
> drivers/block/xen-blkback/blkback.c | 292 ++++++++++++++++++++++++++++++++---
> drivers/block/xen-blkback/common.h | 17 ++
> drivers/block/xen-blkback/xenbus.c | 23 +++-
> drivers/block/xen-blkfront.c | 197 ++++++++++++++++++++----
> 4 files changed, 474 insertions(+), 55 deletions(-)
>
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index 280a138..663d42d 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -39,6 +39,7 @@
> #include <linux/list.h>
> #include <linux/delay.h>
> #include <linux/freezer.h>
> +#include <linux/bitmap.h>
>
> #include <xen/events.h>
> #include <xen/page.h>
> @@ -79,6 +80,7 @@ struct pending_req {
> unsigned short operation;
> int status;
> struct list_head free_list;
> + DECLARE_BITMAP(unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
> };
>
> #define BLKBACK_INVALID_HANDLE (~0)
> @@ -99,6 +101,36 @@ struct xen_blkbk {
> static struct xen_blkbk *blkbk;
>
> /*
> + * Maximum number of grant pages that can be mapped in blkback.
> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
> + * pages that blkback will persistently map.
> + * Currently, this is:
> + * RING_SIZE = 32 (for all known ring types)
> + * BLKIF_MAX_SEGMENTS_PER_REQUEST = 11
> + * sizeof(struct persistent_gnt) = 48
> + * So the maximum memory used to store the grants is:
> + * 32 * 11 * 48 = 16896 bytes
> + */
> +static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol)
> +{
> + switch (protocol) {
> + case BLKIF_PROTOCOL_NATIVE:
> + return __CONST_RING_SIZE(blkif, PAGE_SIZE) *
> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
> + case BLKIF_PROTOCOL_X86_32:
> + return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) *
> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
> + case BLKIF_PROTOCOL_X86_64:
> + return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
> + default:
> + BUG();
> + }
> + return 0;
> +}
> +
> +
> +/*
> * Little helpful macro to figure out the index and virtual address of the
> * pending_pages[..]. For each 'pending_req' we have have up to
> * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
> @@ -129,6 +161,57 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> static void make_response(struct xen_blkif *blkif, u64 id,
> unsigned short op, int st);
>
> +#define foreach_grant(pos, rbtree, node) \
> + for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node); \
> + &(pos)->node != NULL; \
> + (pos) = container_of(rb_next(&(pos)->node), typeof(*(pos)), node))
> +
> +
> +static void add_persistent_gnt(struct rb_root *root,
> + struct persistent_gnt *persistent_gnt)
> +{
> + struct rb_node **new = &(root->rb_node), *parent = NULL;
> + struct persistent_gnt *this;
> +
> + /* Figure out where to put new node */
> + while (*new) {
> + this = container_of(*new, struct persistent_gnt, node);
> +
> + parent = *new;
> + if (persistent_gnt->gnt < this->gnt)
> + new = &((*new)->rb_left);
> + else if (persistent_gnt->gnt > this->gnt)
> + new = &((*new)->rb_right);
> + else {
> + pr_alert(DRV_PFX " trying to add a gref that's already in the tree\n");
> + BUG();
> + }
> + }
> +
> + /* Add new node and rebalance tree. */
> + rb_link_node(&(persistent_gnt->node), parent, new);
> + rb_insert_color(&(persistent_gnt->node), root);
> +}
> +
> +static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
> + grant_ref_t gref)
> +{
> + struct persistent_gnt *data;
> + struct rb_node *node = root->rb_node;
> +
> + while (node) {
> + data = container_of(node, struct persistent_gnt, node);
> +
> + if (gref < data->gnt)
> + node = node->rb_left;
> + else if (gref > data->gnt)
> + node = node->rb_right;
> + else
> + return data;
> + }
> + return NULL;
> +}
> +
> /*
> * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
> */
> @@ -275,6 +358,11 @@ int xen_blkif_schedule(void *arg)
> {
> struct xen_blkif *blkif = arg;
> struct xen_vbd *vbd = &blkif->vbd;
> + struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> + struct persistent_gnt *persistent_gnt;
> + int ret = 0;
> + int segs_to_unmap = 0;
>
> xen_blkif_get(blkif);
>
> @@ -302,6 +390,36 @@ int xen_blkif_schedule(void *arg)
> print_stats(blkif);
> }
>
> + /* Free all persistent grant pages */
> + if (!RB_EMPTY_ROOT(&blkif->persistent_gnts)) {
> + foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) {
> + BUG_ON(persistent_gnt->handle ==
> + BLKBACK_INVALID_HANDLE);
> + gnttab_set_unmap_op(&unmap[segs_to_unmap],
> + (unsigned long) pfn_to_kaddr(page_to_pfn(
> + persistent_gnt->page)),
> + GNTMAP_host_map,
> + persistent_gnt->handle);
> +
> + pages[segs_to_unmap] = persistent_gnt->page;
> + rb_erase(&persistent_gnt->node,
> + &blkif->persistent_gnts);
> + kfree(persistent_gnt);
> + blkif->persistent_gnt_c--;
> +
> + if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST ||
> + !rb_next(&persistent_gnt->node)) {
> + ret = gnttab_unmap_refs(unmap, NULL, pages,
> + segs_to_unmap);
> + BUG_ON(ret);
> + segs_to_unmap = 0;
> + }
> + }
> + }
> +
> + BUG_ON(blkif->persistent_gnt_c != 0);
> + BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
> +
> if (log_stats)
> print_stats(blkif);
>
> @@ -328,6 +446,8 @@ static void xen_blkbk_unmap(struct pending_req *req)
> int ret;
>
> for (i = 0; i < req->nr_pages; i++) {
> + if (!test_bit(i, req->unmap_seg))
> + continue;
> handle = pending_handle(req, i);
> if (handle == BLKBACK_INVALID_HANDLE)
> continue;
> @@ -344,12 +464,26 @@ static void xen_blkbk_unmap(struct pending_req *req)
>
> static int xen_blkbk_map(struct blkif_request *req,
> struct pending_req *pending_req,
> - struct seg_buf seg[])
> + struct seg_buf seg[],
> + struct page *pages[])
> {
> struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> - int i;
> + struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> + struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> + struct persistent_gnt *persistent_gnt = NULL;
> + struct xen_blkif *blkif = pending_req->blkif;
> + phys_addr_t addr = 0;
> + int i, j;
> + bool new_map;
> int nseg = req->u.rw.nr_segments;
> + int segs_to_map = 0;
> int ret = 0;
> + int use_persistent_gnts;
> +
> + use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
> +
> + BUG_ON(blkif->persistent_gnt_c >
> + max_mapped_grant_pages(pending_req->blkif->blk_protocol));
>
> /*
> * Fill out preq.nr_sects with proper amount of sectors, and setup
> @@ -359,36 +493,143 @@ static int xen_blkbk_map(struct blkif_request *req,
> for (i = 0; i < nseg; i++) {
> uint32_t flags;
>
> - flags = GNTMAP_host_map;
> - if (pending_req->operation != BLKIF_OP_READ)
> - flags |= GNTMAP_readonly;
> - gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags,
> - req->u.rw.seg[i].gref,
> - pending_req->blkif->domid);
> + if (use_persistent_gnts)
> + persistent_gnt = get_persistent_gnt(
> + &blkif->persistent_gnts,
> + req->u.rw.seg[i].gref);
> +
> + if (persistent_gnt) {
> + /*
> + * We are using persistent grants and
> + * the grant is already mapped
> + */
> + new_map = 0;
> + } else if (use_persistent_gnts &&
> + blkif->persistent_gnt_c <
> + max_mapped_grant_pages(blkif->blk_protocol)) {
> + /*
> + * We are using persistent grants, the grant is
> + * not mapped but we have room for it
> + */
> + new_map = 1;
> + persistent_gnt = kzalloc(
> + sizeof(struct persistent_gnt),
> + GFP_KERNEL);
> + if (!persistent_gnt)
> + return -ENOMEM;
> + persistent_gnt->page = alloc_page(GFP_KERNEL);
> + if (!persistent_gnt->page) {
> + kfree(persistent_gnt);
> + return -ENOMEM;
> + }
> + persistent_gnt->gnt = req->u.rw.seg[i].gref;
> +
> + pages_to_gnt[segs_to_map] =
> + persistent_gnt->page;
> + addr = (unsigned long) pfn_to_kaddr(
> + page_to_pfn(persistent_gnt->page));
> +
> + add_persistent_gnt(&blkif->persistent_gnts,
> + persistent_gnt);
> + blkif->persistent_gnt_c++;
> + pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
> + persistent_gnt->gnt, blkif->persistent_gnt_c,
> + max_mapped_grant_pages(blkif->blk_protocol));
> + } else {
> + /*
> + * We are either using persistent grants and
> + * hit the maximum limit of grants mapped,
> + * or we are not using persistent grants.
> + */
> + if (use_persistent_gnts &&
> + !blkif->vbd.overflow_max_grants) {
> + blkif->vbd.overflow_max_grants = 1;
> + pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
> + blkif->domid, blkif->vbd.handle);
> + }
> + new_map = 1;
> + pages[i] = blkbk->pending_page(pending_req, i);
> + addr = vaddr(pending_req, i);
> + pages_to_gnt[segs_to_map] =
> + blkbk->pending_page(pending_req, i);
> + }
> +
> + if (persistent_gnt) {
> + pages[i] = persistent_gnt->page;
> + persistent_gnts[i] = persistent_gnt;
> + } else {
> + persistent_gnts[i] = NULL;
> + }
> +
> + if (new_map) {
> + flags = GNTMAP_host_map;
> + if (!persistent_gnt &&
> + (pending_req->operation != BLKIF_OP_READ))
> + flags |= GNTMAP_readonly;
> + gnttab_set_map_op(&map[segs_to_map++], addr,
> + flags, req->u.rw.seg[i].gref,
> + blkif->domid);
> + }
> }
>
> - ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg);
> - BUG_ON(ret);
> + if (segs_to_map) {
> + ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map);
> + BUG_ON(ret);
> + }
>
> /*
> * Now swizzle the MFN in our domain with the MFN from the other domain
> * so that when we access vaddr(pending_req,i) it has the contents of
> * the page from the other domain.
> */
> - for (i = 0; i < nseg; i++) {
> - if (unlikely(map[i].status != 0)) {
> - pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
> - map[i].handle = BLKBACK_INVALID_HANDLE;
> - ret |= 1;
> + bitmap_zero(pending_req->unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
> + for (i = 0, j = 0; i < nseg; i++) {
> + if (!persistent_gnts[i] || !persistent_gnts[i]->handle) {
> + /* This is a newly mapped grant */
> + BUG_ON(j >= segs_to_map);
> + if (unlikely(map[j].status != 0)) {
> + pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
> + map[j].handle = BLKBACK_INVALID_HANDLE;
> + ret |= 1;
> + if (persistent_gnts[i]) {
> + rb_erase(&persistent_gnts[i]->node,
> + &blkif->persistent_gnts);
> + blkif->persistent_gnt_c--;
> + kfree(persistent_gnts[i]);
> + persistent_gnts[i] = NULL;
> + }
> + }
> + }
> + if (persistent_gnts[i]) {
> + if (!persistent_gnts[i]->handle) {
> + /*
> + * If this is a new persistent grant
> + * save the handler
> + */
> + persistent_gnts[i]->handle = map[j].handle;
> + persistent_gnts[i]->dev_bus_addr =
> + map[j++].dev_bus_addr;
> + }
> + pending_handle(pending_req, i) =
> + persistent_gnts[i]->handle;
> +
> + if (ret)
> + continue;
> +
> + seg[i].buf = persistent_gnts[i]->dev_bus_addr |
> + (req->u.rw.seg[i].first_sect << 9);
> + } else {
> + pending_handle(pending_req, i) = map[j].handle;
> + bitmap_set(pending_req->unmap_seg, i, 1);
> +
> + if (ret) {
> + j++;
> + continue;
> + }
> +
> + seg[i].buf = map[j++].dev_bus_addr |
> + (req->u.rw.seg[i].first_sect << 9);
> }
> -
> - pending_handle(pending_req, i) = map[i].handle;
> -
> - if (ret)
> - continue;
> -
> - seg[i].buf = map[i].dev_bus_addr |
> - (req->u.rw.seg[i].first_sect << 9);
> }
> return ret;
> }
> @@ -591,6 +832,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> int operation;
> struct blk_plug plug;
> bool drain = false;
> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>
> switch (req->operation) {
> case BLKIF_OP_READ:
> @@ -677,7 +919,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> * the hypercall to unmap the grants - that is all done in
> * xen_blkbk_unmap.
> */
> - if (xen_blkbk_map(req, pending_req, seg))
> + if (xen_blkbk_map(req, pending_req, seg, pages))
> goto fail_flush;
>
> /*
> @@ -689,7 +931,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> for (i = 0; i < nseg; i++) {
> while ((bio == NULL) ||
> (bio_add_page(bio,
> - blkbk->pending_page(pending_req, i),
> + pages[i],
> seg[i].nsec << 9,
> seg[i].buf & ~PAGE_MASK) == 0)) {
>
> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
> index 9ad3b5e..ae7951f 100644
> --- a/drivers/block/xen-blkback/common.h
> +++ b/drivers/block/xen-blkback/common.h
> @@ -34,6 +34,7 @@
> #include <linux/vmalloc.h>
> #include <linux/wait.h>
> #include <linux/io.h>
> +#include <linux/rbtree.h>
> #include <asm/setup.h>
> #include <asm/pgalloc.h>
> #include <asm/hypervisor.h>
> @@ -160,10 +161,22 @@ struct xen_vbd {
> sector_t size;
> bool flush_support;
> bool discard_secure;
> +
> + unsigned int feature_gnt_persistent:1;
> + unsigned int overflow_max_grants:1;
> };
>
> struct backend_info;
>
> +
> +struct persistent_gnt {
> + struct page *page;
> + grant_ref_t gnt;
> + grant_handle_t handle;
> + uint64_t dev_bus_addr;
> + struct rb_node node;
> +};
> +
> struct xen_blkif {
> /* Unique identifier for this interface. */
> domid_t domid;
> @@ -190,6 +203,10 @@ struct xen_blkif {
> struct task_struct *xenblkd;
> unsigned int waiting_reqs;
>
> + /* tree to store persistent grants */
> + struct rb_root persistent_gnts;
> + unsigned int persistent_gnt_c;
> +
> /* statistics */
> unsigned long st_print;
> int st_rd_req;
> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> index 4f66171..b225026 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -118,6 +118,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
> atomic_set(&blkif->drain, 0);
> blkif->st_print = jiffies;
> init_waitqueue_head(&blkif->waiting_to_free);
> + blkif->persistent_gnts.rb_node = NULL;
>
> return blkif;
> }
> @@ -673,6 +674,13 @@ again:
>
> xen_blkbk_barrier(xbt, be, be->blkif->vbd.flush_support);
>
> + err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u", 1);
> + if (err) {
> + xenbus_dev_fatal(dev, err, "writing %s/feature-persistent",
> + dev->nodename);
> + goto abort;
> + }
> +
> err = xenbus_printf(xbt, dev->nodename, "sectors", "%llu",
> (unsigned long long)vbd_sz(&be->blkif->vbd));
> if (err) {
> @@ -721,6 +729,7 @@ static int connect_ring(struct backend_info *be)
> struct xenbus_device *dev = be->dev;
> unsigned long ring_ref;
> unsigned int evtchn;
> + unsigned int pers_grants;
> char protocol[64] = "";
> int err;
>
> @@ -750,8 +759,18 @@ static int connect_ring(struct backend_info *be)
> xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
> return -1;
> }
> - pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n",
> - ring_ref, evtchn, be->blkif->blk_protocol, protocol);
> + err = xenbus_gather(XBT_NIL, dev->otherend,
> + "feature-persistent-grants", "%u",
> + &pers_grants, NULL);
> + if (err)
> + pers_grants = 0;
> +
> + be->blkif->vbd.feature_gnt_persistent = pers_grants;
> + be->blkif->vbd.overflow_max_grants = 0;
> +
> + pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
> + ring_ref, evtchn, be->blkif->blk_protocol, protocol,
> + pers_grants ? "persistent grants" : "");
>
> /* Map the shared frame, irq etc. */
> err = xen_blkif_map(be->blkif, ring_ref, evtchn);
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 007db89..911d733 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -44,6 +44,7 @@
> #include <linux/mutex.h>
> #include <linux/scatterlist.h>
> #include <linux/bitmap.h>
> +#include <linux/llist.h>
>
> #include <xen/xen.h>
> #include <xen/xenbus.h>
> @@ -64,10 +65,17 @@ enum blkif_state {
> BLKIF_STATE_SUSPENDED,
> };
>
> +struct grant {
> + grant_ref_t gref;
> + unsigned long pfn;
> + struct llist_node node;
> +};
> +
> struct blk_shadow {
> struct blkif_request req;
> struct request *request;
> unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> + struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> };
>
> static DEFINE_MUTEX(blkfront_mutex);
> @@ -97,6 +105,8 @@ struct blkfront_info
> struct work_struct work;
> struct gnttab_free_callback callback;
> struct blk_shadow shadow[BLK_RING_SIZE];
> + struct llist_head persistent_gnts;
> + unsigned int persistent_gnts_c;
> unsigned long shadow_free;
> unsigned int feature_flush;
> unsigned int flush_op;
> @@ -104,6 +114,7 @@ struct blkfront_info
> unsigned int feature_secdiscard:1;
> unsigned int discard_granularity;
> unsigned int discard_alignment;
> + unsigned int feature_persistent:1;
> int is_ready;
> };
>
> @@ -287,21 +298,36 @@ static int blkif_queue_request(struct request *req)
> unsigned long id;
> unsigned int fsect, lsect;
> int i, ref;
> +
> + /*
> + * Used to store if we are able to queue the request by just using
> + * existing persistent grants, or if we have to get new grants,
> + * as there are not sufficiently many free.
> + */
> + bool new_persistent_gnts;
> grant_ref_t gref_head;
> + struct page *granted_page;
> + struct grant *gnt_list_entry = NULL;
> struct scatterlist *sg;
>
> if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
> return 1;
>
> - if (gnttab_alloc_grant_references(
> - BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
> - gnttab_request_free_callback(
> - &info->callback,
> - blkif_restart_queue_callback,
> - info,
> - BLKIF_MAX_SEGMENTS_PER_REQUEST);
> - return 1;
> - }
> + /* Check if we have enought grants to allocate a requests */
> + if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
> + new_persistent_gnts = 1;
> + if (gnttab_alloc_grant_references(
> + BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c,
> + &gref_head) < 0) {
> + gnttab_request_free_callback(
> + &info->callback,
> + blkif_restart_queue_callback,
> + info,
> + BLKIF_MAX_SEGMENTS_PER_REQUEST);
> + return 1;
> + }
> + } else
> + new_persistent_gnts = 0;
>
> /* Fill out a communications ring structure. */
> ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
> @@ -341,18 +367,73 @@ static int blkif_queue_request(struct request *req)
> BLKIF_MAX_SEGMENTS_PER_REQUEST);
>
> for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
> - buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg)));
> fsect = sg->offset >> 9;
> lsect = fsect + (sg->length >> 9) - 1;
> - /* install a grant reference. */
> - ref = gnttab_claim_grant_reference(&gref_head);
> - BUG_ON(ref == -ENOSPC);
>
> - gnttab_grant_foreign_access_ref(
> - ref,
> + if (info->persistent_gnts_c) {
> + BUG_ON(llist_empty(&info->persistent_gnts));
> + gnt_list_entry = llist_entry(
> + llist_del_first(&info->persistent_gnts),
> + struct grant, node);
> +
> + ref = gnt_list_entry->gref;
> + buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
> + info->persistent_gnts_c--;
> + } else {
> + ref = gnttab_claim_grant_reference(&gref_head);
> + BUG_ON(ref == -ENOSPC);
> +
> + gnt_list_entry =
> + kmalloc(sizeof(struct grant),
> + GFP_ATOMIC);
> + if (!gnt_list_entry)
> + return -ENOMEM;
> +
> + granted_page = alloc_page(GFP_ATOMIC);
> + if (!granted_page) {
> + kfree(gnt_list_entry);
> + return -ENOMEM;
> + }
> +
> + gnt_list_entry->pfn =
> + page_to_pfn(granted_page);
> + gnt_list_entry->gref = ref;
> +
> + buffer_mfn = pfn_to_mfn(page_to_pfn(
> + granted_page));
> + gnttab_grant_foreign_access_ref(ref,
> info->xbdev->otherend_id,
> - buffer_mfn,
> - rq_data_dir(req));
> + buffer_mfn, 0);
> + }
> +
> + info->shadow[id].grants_used[i] = gnt_list_entry;
> +
> + if (rq_data_dir(req)) {
> + char *bvec_data;
> + void *shared_data;
> +
> + BUG_ON(sg->offset + sg->length > PAGE_SIZE);
> +
> + shared_data = kmap_atomic(
> + pfn_to_page(gnt_list_entry->pfn));
> + bvec_data = kmap_atomic(sg_page(sg));
> +
> + /*
> + * this does not wipe data stored outside the
> + * range sg->offset..sg->offset+sg->length.
> + * Therefore, blkback *could* see data from
> + * previous requests. This is OK as long as
> + * persistent grants are shared with just one
> + * domain. It may need refactoring if this
> + * changes
> + */
> + memcpy(shared_data + sg->offset,
> + bvec_data + sg->offset,
> + sg->length);
> +
> + kunmap_atomic(bvec_data);
> + kunmap_atomic(shared_data);
> + }
>
> info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn);
> ring_req->u.rw.seg[i] =
> @@ -368,7 +449,8 @@ static int blkif_queue_request(struct request *req)
> /* Keep a private copy so we can reissue requests when recovering. */
> info->shadow[id].req = *ring_req;
>
> - gnttab_free_grant_references(gref_head);
> + if (new_persistent_gnts)
> + gnttab_free_grant_references(gref_head);
>
> return 0;
> }
> @@ -480,12 +562,13 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
> static void xlvbd_flush(struct blkfront_info *info)
> {
> blk_queue_flush(info->rq, info->feature_flush);
> - printk(KERN_INFO "blkfront: %s: %s: %s\n",
> + printk(KERN_INFO "blkfront: %s: %s: %s %s\n",
> info->gd->disk_name,
> info->flush_op == BLKIF_OP_WRITE_BARRIER ?
> "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
> "flush diskcache" : "barrier or flush"),
> - info->feature_flush ? "enabled" : "disabled");
> + info->feature_flush ? "enabled" : "disabled",
> + info->feature_persistent ? "using persistent grants" : "");
> }
>
> static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
> @@ -707,6 +790,9 @@ static void blkif_restart_queue(struct work_struct *work)
>
> static void blkif_free(struct blkfront_info *info, int suspend)
> {
> + struct llist_node *all_gnts;
> + struct grant *persistent_gnt;
> +
> /* Prevent new requests being issued until we fix things up. */
> spin_lock_irq(&info->io_lock);
> info->connected = suspend ?
> @@ -714,6 +800,17 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> /* No more blkif_request(). */
> if (info->rq)
> blk_stop_queue(info->rq);
> +
> + /* Remove all persistent grants */
> + if (info->persistent_gnts_c) {
> + all_gnts = llist_del_all(&info->persistent_gnts);
> + llist_for_each_entry(persistent_gnt, all_gnts, node) {
> + gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
> + kfree(persistent_gnt);
> + }
> + info->persistent_gnts_c = 0;
> + }
> +
> /* No more gnttab callback work. */
> gnttab_cancel_free_callback(&info->callback);
> spin_unlock_irq(&info->io_lock);
> @@ -734,13 +831,42 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>
> }
>
> -static void blkif_completion(struct blk_shadow *s)
> +static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
> + struct blkif_response *bret)
> {
> int i;
> - /* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place
> - * flag. */
> - for (i = 0; i < s->req.u.rw.nr_segments; i++)
> - gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL);
> + struct bio_vec *bvec;
> + struct req_iterator iter;
> + unsigned long flags;
> + char *bvec_data;
> + void *shared_data;
> + unsigned int offset = 0;
> +
> + if (bret->operation == BLKIF_OP_READ) {
> + /*
> + * Copy the data received from the backend into the bvec.
> + * Since bv_offset can be different than 0, and bv_len different
> + * than PAGE_SIZE, we have to keep track of the current offset,
> + * to be sure we are copying the data from the right shared page.
> + */
> + rq_for_each_segment(bvec, s->request, iter) {
> + BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE);
> + i = offset >> PAGE_SHIFT;
> + shared_data = kmap_atomic(
> + pfn_to_page(s->grants_used[i]->pfn));
> + bvec_data = bvec_kmap_irq(bvec, &flags);
> + memcpy(bvec_data, shared_data + bvec->bv_offset,
> + bvec->bv_len);
> + bvec_kunmap_irq(bvec_data, &flags);
> + kunmap_atomic(shared_data);
> + offset += bvec->bv_len;
> + }
> + }
> + /* Add the persistent grant into the list of free grants */
> + for (i = 0; i < s->req.u.rw.nr_segments; i++) {
> + llist_add(&s->grants_used[i]->node, &info->persistent_gnts);
> + info->persistent_gnts_c++;
> + }
> }
>
> static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> @@ -783,7 +909,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> req = info->shadow[id].request;
>
> if (bret->operation != BLKIF_OP_DISCARD)
> - blkif_completion(&info->shadow[id]);
> + blkif_completion(&info->shadow[id], info, bret);
>
> if (add_id_to_freelist(info, id)) {
> WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
> @@ -942,6 +1068,11 @@ again:
> message = "writing protocol";
> goto abort_transaction;
> }
> + err = xenbus_printf(xbt, dev->nodename,
> + "feature-persistent-grants", "%u", 1);
> + if (err)
> + dev_warn(&dev->dev,
> + "writing persistent grants feature to xenbus");
>
> err = xenbus_transaction_end(xbt, 0);
> if (err) {
> @@ -1029,6 +1160,8 @@ static int blkfront_probe(struct xenbus_device *dev,
> spin_lock_init(&info->io_lock);
> info->xbdev = dev;
> info->vdevice = vdevice;
> + init_llist_head(&info->persistent_gnts);
> + info->persistent_gnts_c = 0;
> info->connected = BLKIF_STATE_DISCONNECTED;
> INIT_WORK(&info->work, blkif_restart_queue);
>
> @@ -1093,7 +1226,7 @@ static int blkif_recover(struct blkfront_info *info)
> req->u.rw.seg[j].gref,
> info->xbdev->otherend_id,
> pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
> - rq_data_dir(info->shadow[req->u.rw.id].request));
> + 0);
> }
> info->shadow[req->u.rw.id].req = *req;
>
> @@ -1225,7 +1358,7 @@ static void blkfront_connect(struct blkfront_info *info)
> unsigned long sector_size;
> unsigned int binfo;
> int err;
> - int barrier, flush, discard;
> + int barrier, flush, discard, persistent;
>
> switch (info->connected) {
> case BLKIF_STATE_CONNECTED:
> @@ -1303,6 +1436,14 @@ static void blkfront_connect(struct blkfront_info *info)
> if (!err && discard)
> blkfront_setup_discard(info);
>
> + err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> + "feature-persistent", "%u", &persistent,
> + NULL);
> + if (err)
> + info->feature_persistent = 0;
> + else
> + info->feature_persistent = persistent;
> +
> err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
> if (err) {
> xenbus_dev_fatal(info->xbdev, err, "xlvbd_add at %s",
> --
> 1.7.7.5 (Apple Git-26)
next prev parent reply other threads:[~2012-10-29 14:10 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-10-24 16:58 [PATCH v2] Persistent grant maps for xen blk drivers Roger Pau Monne
2012-10-29 13:57 ` Konrad Rzeszutek Wilk [this message]
2012-10-30 17:01 ` Konrad Rzeszutek Wilk
2012-10-30 18:33 ` Roger Pau Monné
2012-10-30 20:38 ` Konrad Rzeszutek Wilk
-- strict thread matches above, loose matches on Subject: below --
2012-09-21 15:52 Oliver Chick
2012-09-21 18:41 ` Konrad Rzeszutek Wilk
2012-09-21 18:56 ` Konrad Rzeszutek Wilk
2012-09-21 20:46 ` Konrad Rzeszutek Wilk
2012-09-24 14:38 ` Andres Lagar-Cavilla
2012-09-24 15:06 ` Konrad Rzeszutek Wilk
2012-09-24 15:21 ` Andres Lagar-Cavilla
2012-09-27 15:49 ` Oliver Chick
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20121029135733.GZ2708@phenom.dumpdata.com \
--to=konrad.wilk@oracle.com \
--cc=linux-kernel@vger.kernel.org \
--cc=oliver.chick@citrix.com \
--cc=roger.pau@citrix.com \
--cc=xen-devel@lists.xen.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.