* [PATCH v2 0/2] block/xen-blkfront: Support non-indirect grant with 64KB page granularity
@ 2015-10-19 14:19 Julien Grall
2015-10-19 14:19 ` [PATCH v2 1/2] block/xen-blkfront: Introduce blkif_ring_get_request Julien Grall
From: Julien Grall @ 2015-10-19 14:19 UTC (permalink / raw)
To: xen-devel, linux-kernel
Cc: Julien Grall, Konrad Rzeszutek Wilk, Roger Pau Monné,
Boris Ostrovsky, David Vrabel
Hi all,
This is a follow-up to the previous discussion [1] about guests using 64KB
page granularity, which fail to boot when the backend doesn't support
indirect descriptors.
This has been successfully tested on ARM64 with both 64KB and 4KB page
granularity guests, using QEMU as the backend. QEMU is a useful test case
here precisely because it doesn't support indirect descriptors.
This series is based on xentip/for-linus-4.4, which includes the support
for 64KB Linux guests.
For the detailed list of changes, see each patch.
Sincerely yours,
[1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg01659.html
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Julien Grall (2):
block/xen-blkfront: Introduce blkif_ring_get_request
block/xen-blkfront: Handle non-indirect grant with 64KB pages
drivers/block/xen-blkfront.c | 230 ++++++++++++++++++++++++++++++++++++++-----
1 file changed, 203 insertions(+), 27 deletions(-)
--
2.1.4
^ permalink raw reply [flat|nested] 11+ messages in thread* [PATCH v2 1/2] block/xen-blkfront: Introduce blkif_ring_get_request 2015-10-19 14:19 [PATCH v2 0/2] block/xen-blkfront: Support non-indirect grant with 64KB page granularity Julien Grall @ 2015-10-19 14:19 ` Julien Grall 2015-10-19 14:19 ` [PATCH v2 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages Julien Grall [not found] ` <5640C114.5080104@citrix.com> 2 siblings, 0 replies; 11+ messages in thread From: Julien Grall @ 2015-10-19 14:19 UTC (permalink / raw) To: xen-devel, linux-kernel Cc: Julien Grall, Roger Pau Monné, Konrad Rzeszutek Wilk, Boris Ostrovsky, David Vrabel The code to get a request is always the same. Therefore we can factorize it in a single function. Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> --- Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Changes in v2: - Add Royger's acked-by --- drivers/block/xen-blkfront.c | 30 +++++++++++++++++++----------- 1 file changed, 19 insertions(+), 11 deletions(-) diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c index 3ea948c..5982768 100644 --- a/drivers/block/xen-blkfront.c +++ b/drivers/block/xen-blkfront.c @@ -456,6 +456,23 @@ static int blkif_ioctl(struct block_device *bdev, fmode_t mode, return 0; } +static unsigned long blkif_ring_get_request(struct blkfront_info *info, + struct request *req, + struct blkif_request **ring_req) +{ + unsigned long id; + + *ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt); + info->ring.req_prod_pvt++; + + id = get_id_from_freelist(info); + info->shadow[id].request = req; + + (*ring_req)->u.rw.id = id; + + return id; +} + static int blkif_queue_discard_req(struct request *req) { struct blkfront_info *info = req->rq_disk->private_data; @@ -463,9 +480,7 @@ static int blkif_queue_discard_req(struct request *req) 
unsigned long id; /* Fill out a communications ring structure. */ - ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt); - id = get_id_from_freelist(info); - info->shadow[id].request = req; + id = blkif_ring_get_request(info, req, &ring_req); ring_req->operation = BLKIF_OP_DISCARD; ring_req->u.discard.nr_sectors = blk_rq_sectors(req); @@ -476,8 +491,6 @@ static int blkif_queue_discard_req(struct request *req) else ring_req->u.discard.flag = 0; - info->ring.req_prod_pvt++; - /* Keep a private copy so we can reissue requests when recovering. */ info->shadow[id].req = *ring_req; @@ -613,9 +626,7 @@ static int blkif_queue_rw_req(struct request *req) new_persistent_gnts = 0; /* Fill out a communications ring structure. */ - ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt); - id = get_id_from_freelist(info); - info->shadow[id].request = req; + id = blkif_ring_get_request(info, req, &ring_req); BUG_ON(info->max_indirect_segments == 0 && GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST); @@ -628,7 +639,6 @@ static int blkif_queue_rw_req(struct request *req) for_each_sg(info->shadow[id].sg, sg, num_sg, i) num_grant += gnttab_count_grant(sg->offset, sg->length); - ring_req->u.rw.id = id; info->shadow[id].num_sg = num_sg; if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) { /* @@ -694,8 +704,6 @@ static int blkif_queue_rw_req(struct request *req) if (setup.segments) kunmap_atomic(setup.segments); - info->ring.req_prod_pvt++; - /* Keep a private copy so we can reissue requests when recovering. */ info->shadow[id].req = *ring_req; -- 2.1.4 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v2 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages 2015-10-19 14:19 [PATCH v2 0/2] block/xen-blkfront: Support non-indirect grant with 64KB page granularity Julien Grall 2015-10-19 14:19 ` [PATCH v2 1/2] block/xen-blkfront: Introduce blkif_ring_get_request Julien Grall @ 2015-10-19 14:19 ` Julien Grall 2015-11-12 16:40 ` Roger Pau Monné 2015-11-12 16:45 ` Roger Pau Monné [not found] ` <5640C114.5080104@citrix.com> 2 siblings, 2 replies; 11+ messages in thread From: Julien Grall @ 2015-10-19 14:19 UTC (permalink / raw) To: xen-devel, linux-kernel Cc: Julien Grall, Konrad Rzeszutek Wilk, Roger Pau Monné, Boris Ostrovsky, David Vrabel The minimal size of request in the block framework is always PAGE_SIZE. It means that when 64KB guest is support, the request will at least be 64KB. Although, if the backend doesn't support indirect descriptor (such as QDISK in QEMU), a ring request is only able to accommodate 11 segments of 4KB (i.e 44KB). The current frontend is assuming that an I/O request will always fit in a ring request. This is not true any more when using 64KB page granularity and will therefore crash during boot. On ARM64, the ABI is completely neutral to the page granularity used by the domU. The guest has the choice between different page granularity supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB). This can't be enforced by the hypervisor and therefore it's possible to run guests using different page granularity. So we can't mandate the block backend to support indirect descriptor when the frontend is using 64KB page granularity and have to fix it properly in the frontend. The solution exposed below is based on modifying directly the frontend guest rather than asking the block framework to support smaller size (i.e < PAGE_SIZE). This is because the change is the block framework are not trivial as everything seems to relying on a struct *page (see [1]). 
Although, it may be possible that someone succeed to do it in the future and we would therefore be able to use it. Given that a block request may not fit in a single ring request, a second request is introduced for the data that cannot fit in the first one. This means that the second ring request should never be used on Linux if the page size is smaller than 44KB. To achieve the support of the extra ring request, the block queue size is divided by two. Therefore, the ring will always contain enough space to accommodate 2 ring requests. While this will reduce the overall performance, it will make the implementation more contained. The way forward to get better performance is to implement in the backend either indirect descriptor or multiple grants ring. Note that the parameters blk_queue_max_* helpers haven't been updated. The block code will set the mimimum size supported and we may be able to support directly any change in the block framework that lower down the minimal size of a request. [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html Signed-off-by: Julien Grall <julien.grall@citrix.com> --- Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Changes in v2: - Update the commit message - Typoes - Rename ring_req2/id2 to extra_ring_req/extra_id --- drivers/block/xen-blkfront.c | 200 +++++++++++++++++++++++++++++++++++++++---- 1 file changed, 184 insertions(+), 16 deletions(-) diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c index 5982768..8ea2c97 100644 --- a/drivers/block/xen-blkfront.c +++ b/drivers/block/xen-blkfront.c @@ -60,6 +60,20 @@ #include <asm/xen/hypervisor.h> +/* + * The minimal size of segment supportd by the block framework is PAGE_SIZE. + * When Linux is using a different page size than xen, it may not be possible + * to put all the data in a single segment. 
+ * This can happen when the backend doesn't support indirect descriptor and + * therefore the maximum amount of data that a request can carry is + * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE = 44KB + * + * Note that we only support one extra request. So the Linux page size + * should be <= ( 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) = + * 88KB. + */ +#define HAS_EXTRA_REQ (BLKIF_MAX_SEGMENTS_PER_REQUEST < XEN_PFN_PER_PAGE) + enum blkif_state { BLKIF_STATE_DISCONNECTED, BLKIF_STATE_CONNECTED, @@ -79,6 +93,18 @@ struct blk_shadow { struct grant **indirect_grants; struct scatterlist *sg; unsigned int num_sg; + enum { + REQ_WAITING, + REQ_DONE, + REQ_FAIL + } status; + + #define NO_ASSOCIATED_ID ~0UL + /* + * Id of the sibling if we ever need 2 requests when handling a + * block I/O request + */ + unsigned long associated_id; }; struct split_bio { @@ -467,6 +493,8 @@ static unsigned long blkif_ring_get_request(struct blkfront_info *info, id = get_id_from_freelist(info); info->shadow[id].request = req; + info->shadow[id].status = REQ_WAITING; + info->shadow[id].associated_id = NO_ASSOCIATED_ID; (*ring_req)->u.rw.id = id; @@ -508,6 +536,9 @@ struct setup_rw_req { bool need_copy; unsigned int bvec_off; char *bvec_data; + + bool require_extra_req; + struct blkif_request *extra_ring_req; }; static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, @@ -521,8 +552,24 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, unsigned int grant_idx = setup->grant_idx; struct blkif_request *ring_req = setup->ring_req; struct blkfront_info *info = setup->info; + /* + * We always use the shadow of the first request to store the list + * of grant associated to the block I/O request. This made the + * completion more easy to handle even if the block I/O request is + * split. 
+ */ struct blk_shadow *shadow = &info->shadow[setup->id]; + if (unlikely(setup->require_extra_req && + grant_idx >= BLKIF_MAX_SEGMENTS_PER_REQUEST)) { + /* + * We are using the second request, setup grant_idx + * to be the index of the segment array + */ + grant_idx -= BLKIF_MAX_SEGMENTS_PER_REQUEST; + ring_req = setup->extra_ring_req; + } + if ((ring_req->operation == BLKIF_OP_INDIRECT) && (grant_idx % GRANTS_PER_INDIRECT_FRAME == 0)) { if (setup->segments) @@ -537,7 +584,11 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, gnt_list_entry = get_grant(&setup->gref_head, gfn, info); ref = gnt_list_entry->gref; - shadow->grants_used[grant_idx] = gnt_list_entry; + /* + * All the grants are stored in the shadow of the first + * request. Therefore we have to use the global index + */ + shadow->grants_used[setup->grant_idx] = gnt_list_entry; if (setup->need_copy) { void *shared_data; @@ -579,11 +630,31 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, (setup->grant_idx)++; } +static void blkif_setup_extra_req(struct blkif_request *first, + struct blkif_request *second) +{ + uint16_t nr_segments = first->u.rw.nr_segments; + + /* + * The second request is only present when the first request uses + * all its segments. It's always the continuity of the first one. 
+ */ + first->u.rw.nr_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST; + + second->u.rw.nr_segments = nr_segments - BLKIF_MAX_SEGMENTS_PER_REQUEST; + second->u.rw.sector_number = first->u.rw.sector_number + + (BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) / 512; + + second->u.rw.handle = first->u.rw.handle; + second->operation = first->operation; +} + static int blkif_queue_rw_req(struct request *req) { struct blkfront_info *info = req->rq_disk->private_data; - struct blkif_request *ring_req; - unsigned long id; + struct blkif_request *ring_req, *extra_ring_req = NULL; + unsigned long id, extra_id = NO_ASSOCIATED_ID; + bool require_extra_req = false; int i; struct setup_rw_req setup = { .grant_idx = 0, @@ -628,19 +699,19 @@ static int blkif_queue_rw_req(struct request *req) /* Fill out a communications ring structure. */ id = blkif_ring_get_request(info, req, &ring_req); - BUG_ON(info->max_indirect_segments == 0 && - GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST); - BUG_ON(info->max_indirect_segments && - GREFS(req->nr_phys_segments) > info->max_indirect_segments); - num_sg = blk_rq_map_sg(req->q, req, info->shadow[id].sg); num_grant = 0; /* Calculate the number of grant used */ for_each_sg(info->shadow[id].sg, sg, num_sg, i) num_grant += gnttab_count_grant(sg->offset, sg->length); + require_extra_req = info->max_indirect_segments == 0 && + num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST; + BUG_ON(!HAS_EXTRA_REQ && require_extra_req); + info->shadow[id].num_sg = num_sg; - if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) { + if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST && + likely(!require_extra_req)) { /* * The indirect operation can only be a BLKIF_OP_READ or * BLKIF_OP_WRITE @@ -680,10 +751,30 @@ static int blkif_queue_rw_req(struct request *req) } } ring_req->u.rw.nr_segments = num_grant; + if (unlikely(require_extra_req)) { + extra_id = blkif_ring_get_request(info, req, + &extra_ring_req); + /* + * Only the first request contains the scatter-gather + * 
list. + */ + info->shadow[extra_id].num_sg = 0; + + blkif_setup_extra_req(ring_req, extra_ring_req); + + /* Link the 2 requests together */ + info->shadow[extra_id].associated_id = id; + info->shadow[id].associated_id = extra_id; + } } setup.ring_req = ring_req; setup.id = id; + + setup.require_extra_req = require_extra_req; + if (unlikely(require_extra_req)) + setup.extra_ring_req = extra_ring_req; + for_each_sg(info->shadow[id].sg, sg, num_sg, i) { BUG_ON(sg->offset + sg->length > PAGE_SIZE); @@ -706,6 +797,8 @@ static int blkif_queue_rw_req(struct request *req) /* Keep a private copy so we can reissue requests when recovering. */ info->shadow[id].req = *ring_req; + if (unlikely(require_extra_req)) + info->shadow[extra_id].req = *extra_ring_req; if (new_persistent_gnts) gnttab_free_grant_references(setup.gref_head); @@ -797,7 +890,16 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size, memset(&info->tag_set, 0, sizeof(info->tag_set)); info->tag_set.ops = &blkfront_mq_ops; info->tag_set.nr_hw_queues = 1; - info->tag_set.queue_depth = BLK_RING_SIZE(info); + if (HAS_EXTRA_REQ && info->max_indirect_segments == 0) { + /* + * When indirect descriptior is not supported, the I/O request + * will be split between multiple request in the ring. + * To avoid problems when sending the request, divide by + * 2 the depth of the queue. 
+ */ + info->tag_set.queue_depth = BLK_RING_SIZE(info) / 2; + } else + info->tag_set.queue_depth = BLK_RING_SIZE(info); info->tag_set.numa_node = NUMA_NO_NODE; info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE; info->tag_set.cmd_size = 0; @@ -1217,19 +1319,67 @@ static void blkif_copy_from_grant(unsigned long gfn, unsigned int offset, kunmap_atomic(shared_data); } -static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info, +static bool blkif_completion(unsigned long *id, struct blkfront_info *info, struct blkif_response *bret) { int i = 0; struct scatterlist *sg; int num_sg, num_grant; + struct blk_shadow *s = &info->shadow[*id]; struct copy_from_grant data = { - .s = s, .grant_idx = 0, }; num_grant = s->req.operation == BLKIF_OP_INDIRECT ? s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments; + + /* The I/O request may be split in two */ + if (unlikely(s->associated_id != NO_ASSOCIATED_ID)) { + struct blk_shadow *s2 = &info->shadow[s->associated_id]; + + /* Keep the status of the current response in shadow */ + s->status = (bret->status == BLKIF_RSP_OKAY) ? + REQ_DONE : REQ_FAIL; + + /* Wait the second response if not yet here */ + if (s2->status == REQ_WAITING) + return 0; + + /* + * The status of the current response will be used in + * order to know if the request has failed. + * Update the current response status only if has not + * failed. + */ + if (bret->status == BLKIF_RSP_OKAY && s2->status == REQ_FAIL) + bret->status = BLKIF_RSP_ERROR; + + /* + * All the grants is stored in the first shadow in order + * to make the completion code simpler. + */ + num_grant += s2->req.u.rw.nr_segments; + + /* + * The two responses may not come in order. Only the + * first request will store the scatter-gather list + */ + if (s2->num_sg != 0) { + /* Update "id" with the ID of the first response */ + *id = s->associated_id; + s = s2; + } + + /* + * We don't need anymore the second request, so recycling + * it now. 
+ */ + if (add_id_to_freelist(info, s->associated_id)) + WARN(1, "%s: can't recycle the second part (id = %ld) of the request\n", + info->gd->disk_name, s->associated_id); + } + + data.s = s; num_sg = s->num_sg; if (bret->operation == BLKIF_OP_READ && info->feature_persistent) { @@ -1299,6 +1449,8 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info, } } } + + return 1; } static irqreturn_t blkif_interrupt(int irq, void *dev_id) @@ -1339,8 +1491,14 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id) } req = info->shadow[id].request; - if (bret->operation != BLKIF_OP_DISCARD) - blkif_completion(&info->shadow[id], info, bret); + if (bret->operation != BLKIF_OP_DISCARD) { + /* + * We may need to wait for an extra response if the + * I/O request is split in 2 + */ + if (!blkif_completion(&id, info, bret)) + continue; + } if (add_id_to_freelist(info, id)) { WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n", @@ -1863,8 +2021,18 @@ static int blkfront_setup_indirect(struct blkfront_info *info) unsigned int psegs, grants; int err, i; - if (info->max_indirect_segments == 0) - grants = BLKIF_MAX_SEGMENTS_PER_REQUEST; + if (info->max_indirect_segments == 0) { + if (!HAS_EXTRA_REQ) + grants = BLKIF_MAX_SEGMENTS_PER_REQUEST; + else { + /* + * When an extra req is required, the maximum + * grants supported is related to the size of the + * Linux block segment + */ + grants = GRANTS_PER_PSEG; + } + } else grants = info->max_indirect_segments; psegs = grants / GRANTS_PER_PSEG; -- 2.1.4 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v2 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages 2015-10-19 14:19 ` [PATCH v2 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages Julien Grall @ 2015-11-12 16:40 ` Roger Pau Monné 2015-11-12 17:30 ` [Xen-devel] " Julien Grall 2015-11-12 16:45 ` Roger Pau Monné 1 sibling, 1 reply; 11+ messages in thread From: Roger Pau Monné @ 2015-11-12 16:40 UTC (permalink / raw) To: Julien Grall, xen-devel, linux-kernel Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, David Vrabel El 19/10/15 a les 16.19, Julien Grall ha escrit: > The minimal size of request in the block framework is always PAGE_SIZE. > It means that when 64KB guest is support, the request will at least be > 64KB. > > Although, if the backend doesn't support indirect descriptor (such as QDISK > in QEMU), a ring request is only able to accommodate 11 segments of 4KB > (i.e 44KB). > > The current frontend is assuming that an I/O request will always fit in > a ring request. This is not true any more when using 64KB page > granularity and will therefore crash during boot. > > On ARM64, the ABI is completely neutral to the page granularity used by > the domU. The guest has the choice between different page granularity > supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB). > This can't be enforced by the hypervisor and therefore it's possible to > run guests using different page granularity. > > So we can't mandate the block backend to support indirect descriptor > when the frontend is using 64KB page granularity and have to fix it > properly in the frontend. > > The solution exposed below is based on modifying directly the frontend > guest rather than asking the block framework to support smaller size > (i.e < PAGE_SIZE). This is because the change is the block framework are > not trivial as everything seems to relying on a struct *page (see [1]). 
> Although, it may be possible that someone succeed to do it in the future > and we would therefore be able to use it. > > Given that a block request may not fit in a single ring request, a > second request is introduced for the data that cannot fit in the first > one. This means that the second ring request should never be used on > Linux if the page size is smaller than 44KB. > > To achieve the support of the extra ring request, the block queue size > is divided by two. Therefore, the ring will always contain enough space > to accommodate 2 ring requests. While this will reduce the overall > performance, it will make the implementation more contained. The way > forward to get better performance is to implement in the backend either > indirect descriptor or multiple grants ring. > > Note that the parameters blk_queue_max_* helpers haven't been updated. > The block code will set the mimimum size supported and we may be able > to support directly any change in the block framework that lower down > the minimal size of a request. 
> > [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html > > Signed-off-by: Julien Grall <julien.grall@citrix.com> LGTM, only a couple of typos and a simplification: Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> > --- > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > Cc: "Roger Pau Monné" <roger.pau@citrix.com> > Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> > Cc: David Vrabel <david.vrabel@citrix.com> > > Changes in v2: > - Update the commit message > - Typoes > - Rename ring_req2/id2 to extra_ring_req/extra_id > --- > drivers/block/xen-blkfront.c | 200 +++++++++++++++++++++++++++++++++++++++---- > 1 file changed, 184 insertions(+), 16 deletions(-) > > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c > index 5982768..8ea2c97 100644 > --- a/drivers/block/xen-blkfront.c > +++ b/drivers/block/xen-blkfront.c > @@ -60,6 +60,20 @@ > > #include <asm/xen/hypervisor.h> > > +/* > + * The minimal size of segment supportd by the block framework is PAGE_SIZE. ^supported > + * When Linux is using a different page size than xen, it may not be possible > + * to put all the data in a single segment. > + * This can happen when the backend doesn't support indirect descriptor and > + * therefore the maximum amount of data that a request can carry is > + * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE = 44KB > + * > + * Note that we only support one extra request. So the Linux page size > + * should be <= ( 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) = > + * 88KB. 
> + */ > +#define HAS_EXTRA_REQ (BLKIF_MAX_SEGMENTS_PER_REQUEST < XEN_PFN_PER_PAGE) > + > enum blkif_state { > BLKIF_STATE_DISCONNECTED, > BLKIF_STATE_CONNECTED, > @@ -79,6 +93,18 @@ struct blk_shadow { > struct grant **indirect_grants; > struct scatterlist *sg; > unsigned int num_sg; > + enum { > + REQ_WAITING, > + REQ_DONE, > + REQ_FAIL > + } status; > + > + #define NO_ASSOCIATED_ID ~0UL > + /* > + * Id of the sibling if we ever need 2 requests when handling a > + * block I/O request > + */ > + unsigned long associated_id; > }; > > struct split_bio { > @@ -467,6 +493,8 @@ static unsigned long blkif_ring_get_request(struct blkfront_info *info, > > id = get_id_from_freelist(info); > info->shadow[id].request = req; > + info->shadow[id].status = REQ_WAITING; > + info->shadow[id].associated_id = NO_ASSOCIATED_ID; > > (*ring_req)->u.rw.id = id; > > @@ -508,6 +536,9 @@ struct setup_rw_req { > bool need_copy; > unsigned int bvec_off; > char *bvec_data; > + > + bool require_extra_req; > + struct blkif_request *extra_ring_req; > }; > > static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, > @@ -521,8 +552,24 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, > unsigned int grant_idx = setup->grant_idx; > struct blkif_request *ring_req = setup->ring_req; > struct blkfront_info *info = setup->info; > + /* > + * We always use the shadow of the first request to store the list > + * of grant associated to the block I/O request. This made the > + * completion more easy to handle even if the block I/O request is > + * split. 
> + */ > struct blk_shadow *shadow = &info->shadow[setup->id]; > > + if (unlikely(setup->require_extra_req && > + grant_idx >= BLKIF_MAX_SEGMENTS_PER_REQUEST)) { > + /* > + * We are using the second request, setup grant_idx > + * to be the index of the segment array > + */ > + grant_idx -= BLKIF_MAX_SEGMENTS_PER_REQUEST; > + ring_req = setup->extra_ring_req; > + } > + > if ((ring_req->operation == BLKIF_OP_INDIRECT) && > (grant_idx % GRANTS_PER_INDIRECT_FRAME == 0)) { > if (setup->segments) > @@ -537,7 +584,11 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, > > gnt_list_entry = get_grant(&setup->gref_head, gfn, info); > ref = gnt_list_entry->gref; > - shadow->grants_used[grant_idx] = gnt_list_entry; > + /* > + * All the grants are stored in the shadow of the first > + * request. Therefore we have to use the global index > + */ > + shadow->grants_used[setup->grant_idx] = gnt_list_entry; > > if (setup->need_copy) { > void *shared_data; > @@ -579,11 +630,31 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, > (setup->grant_idx)++; > } > > +static void blkif_setup_extra_req(struct blkif_request *first, > + struct blkif_request *second) > +{ > + uint16_t nr_segments = first->u.rw.nr_segments; > + > + /* > + * The second request is only present when the first request uses > + * all its segments. It's always the continuity of the first one. 
> + */ > + first->u.rw.nr_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST; > + > + second->u.rw.nr_segments = nr_segments - BLKIF_MAX_SEGMENTS_PER_REQUEST; > + second->u.rw.sector_number = first->u.rw.sector_number + > + (BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) / 512; > + > + second->u.rw.handle = first->u.rw.handle; > + second->operation = first->operation; > +} > + > static int blkif_queue_rw_req(struct request *req) > { > struct blkfront_info *info = req->rq_disk->private_data; > - struct blkif_request *ring_req; > - unsigned long id; > + struct blkif_request *ring_req, *extra_ring_req = NULL; > + unsigned long id, extra_id = NO_ASSOCIATED_ID; > + bool require_extra_req = false; > int i; > struct setup_rw_req setup = { > .grant_idx = 0, > @@ -628,19 +699,19 @@ static int blkif_queue_rw_req(struct request *req) > /* Fill out a communications ring structure. */ > id = blkif_ring_get_request(info, req, &ring_req); > > - BUG_ON(info->max_indirect_segments == 0 && > - GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST); > - BUG_ON(info->max_indirect_segments && > - GREFS(req->nr_phys_segments) > info->max_indirect_segments); > - > num_sg = blk_rq_map_sg(req->q, req, info->shadow[id].sg); > num_grant = 0; > /* Calculate the number of grant used */ > for_each_sg(info->shadow[id].sg, sg, num_sg, i) > num_grant += gnttab_count_grant(sg->offset, sg->length); > > + require_extra_req = info->max_indirect_segments == 0 && > + num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST; > + BUG_ON(!HAS_EXTRA_REQ && require_extra_req); > + > info->shadow[id].num_sg = num_sg; > - if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) { > + if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST && > + likely(!require_extra_req)) { > /* > * The indirect operation can only be a BLKIF_OP_READ or > * BLKIF_OP_WRITE > @@ -680,10 +751,30 @@ static int blkif_queue_rw_req(struct request *req) > } > } > ring_req->u.rw.nr_segments = num_grant; > + if (unlikely(require_extra_req)) { > + extra_id = 
blkif_ring_get_request(info, req, > + &extra_ring_req); > + /* > + * Only the first request contains the scatter-gather > + * list. > + */ > + info->shadow[extra_id].num_sg = 0; > + > + blkif_setup_extra_req(ring_req, extra_ring_req); > + > + /* Link the 2 requests together */ > + info->shadow[extra_id].associated_id = id; > + info->shadow[id].associated_id = extra_id; > + } > } > > setup.ring_req = ring_req; > setup.id = id; > + > + setup.require_extra_req = require_extra_req; > + if (unlikely(require_extra_req)) > + setup.extra_ring_req = extra_ring_req; > + > for_each_sg(info->shadow[id].sg, sg, num_sg, i) { > BUG_ON(sg->offset + sg->length > PAGE_SIZE); > > @@ -706,6 +797,8 @@ static int blkif_queue_rw_req(struct request *req) > > /* Keep a private copy so we can reissue requests when recovering. */ > info->shadow[id].req = *ring_req; > + if (unlikely(require_extra_req)) > + info->shadow[extra_id].req = *extra_ring_req; > > if (new_persistent_gnts) > gnttab_free_grant_references(setup.gref_head); > @@ -797,7 +890,16 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size, > memset(&info->tag_set, 0, sizeof(info->tag_set)); > info->tag_set.ops = &blkfront_mq_ops; > info->tag_set.nr_hw_queues = 1; > - info->tag_set.queue_depth = BLK_RING_SIZE(info); > + if (HAS_EXTRA_REQ && info->max_indirect_segments == 0) { > + /* > + * When indirect descriptior is not supported, the I/O request ^descriptors are > + * will be split between multiple request in the ring. > + * To avoid problems when sending the request, divide by > + * 2 the depth of the queue. 
> + */ > + info->tag_set.queue_depth = BLK_RING_SIZE(info) / 2; > + } else > + info->tag_set.queue_depth = BLK_RING_SIZE(info); > info->tag_set.numa_node = NUMA_NO_NODE; > info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE; > info->tag_set.cmd_size = 0; > @@ -1217,19 +1319,67 @@ static void blkif_copy_from_grant(unsigned long gfn, unsigned int offset, > kunmap_atomic(shared_data); > } > > -static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info, > +static bool blkif_completion(unsigned long *id, struct blkfront_info *info, > struct blkif_response *bret) > { > int i = 0; > struct scatterlist *sg; > int num_sg, num_grant; > + struct blk_shadow *s = &info->shadow[*id]; > struct copy_from_grant data = { > - .s = s, > .grant_idx = 0, > }; > > num_grant = s->req.operation == BLKIF_OP_INDIRECT ? > s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments; > + > + /* The I/O request may be split in two */ > + if (unlikely(s->associated_id != NO_ASSOCIATED_ID)) { > + struct blk_shadow *s2 = &info->shadow[s->associated_id]; > + > + /* Keep the status of the current response in shadow */ > + s->status = (bret->status == BLKIF_RSP_OKAY) ? > + REQ_DONE : REQ_FAIL; > + > + /* Wait the second response if not yet here */ > + if (s2->status == REQ_WAITING) > + return 0; > + > + /* > + * The status of the current response will be used in > + * order to know if the request has failed. > + * Update the current response status only if has not > + * failed. > + */ > + if (bret->status == BLKIF_RSP_OKAY && s2->status == REQ_FAIL) This could be simplified by only checking if s2->status == REQ_FAIL. > + bret->status = BLKIF_RSP_ERROR; > + > + /* > + * All the grants is stored in the first shadow in order > + * to make the completion code simpler. > + */ > + num_grant += s2->req.u.rw.nr_segments; > + > + /* > + * The two responses may not come in order. 
Only the > + * first request will store the scatter-gather list > + */ > + if (s2->num_sg != 0) { > + /* Update "id" with the ID of the first response */ > + *id = s->associated_id; > + s = s2; > + } > + > + /* > + * We don't need anymore the second request, so recycling > + * it now. > + */ > + if (add_id_to_freelist(info, s->associated_id)) > + WARN(1, "%s: can't recycle the second part (id = %ld) of the request\n", > + info->gd->disk_name, s->associated_id); > + } > + > + data.s = s; > num_sg = s->num_sg; > > if (bret->operation == BLKIF_OP_READ && info->feature_persistent) { > @@ -1299,6 +1449,8 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info, > } > } > } > + > + return 1; > } > > static irqreturn_t blkif_interrupt(int irq, void *dev_id) > @@ -1339,8 +1491,14 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id) > } > req = info->shadow[id].request; > > - if (bret->operation != BLKIF_OP_DISCARD) > - blkif_completion(&info->shadow[id], info, bret); > + if (bret->operation != BLKIF_OP_DISCARD) { > + /* > + * We may need to wait for an extra response if the > + * I/O request is split in 2 > + */ > + if (!blkif_completion(&id, info, bret)) > + continue; > + } > > if (add_id_to_freelist(info, id)) { > WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n", > @@ -1863,8 +2021,18 @@ static int blkfront_setup_indirect(struct blkfront_info *info) > unsigned int psegs, grants; > int err, i; > > - if (info->max_indirect_segments == 0) > - grants = BLKIF_MAX_SEGMENTS_PER_REQUEST; > + if (info->max_indirect_segments == 0) { > + if (!HAS_EXTRA_REQ) > + grants = BLKIF_MAX_SEGMENTS_PER_REQUEST; > + else { > + /* > + * When an extra req is required, the maximum > + * grants supported is related to the size of the > + * Linux block segment > + */ > + grants = GRANTS_PER_PSEG; > + } > + } > else > grants = info->max_indirect_segments; > psegs = grants / GRANTS_PER_PSEG; > ^ permalink raw reply [flat|nested] 11+ messages in thread
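[Editorial aside — not part of the thread.] The queue-depth hunk reviewed above follows from simple arithmetic: with 4KB Xen grants and at most 11 segments per non-indirect ring request, one 64KB guest page needs 16 grants and therefore two ring requests. A small standalone sketch (the constant values are assumed from the blkif ABI as described in the series, not taken from this patch):

```c
#include <assert.h>

/*
 * Constants assumed from the Xen blkif ABI as described in this series:
 * grants are always 4KB (XEN_PAGE_SIZE) and a non-indirect ring request
 * carries at most 11 segments.
 */
#define XEN_PAGE_SIZE                  4096u
#define BLKIF_MAX_SEGMENTS_PER_REQUEST   11u

/* Number of 4KB grants needed to map one guest page. */
static unsigned grants_per_page(unsigned page_size)
{
	return page_size / XEN_PAGE_SIZE;
}

/*
 * Ring requests needed for one page-sized I/O when the backend does not
 * support indirect descriptors (rounded up to whole requests).
 */
static unsigned ring_reqs_per_page(unsigned page_size)
{
	unsigned grants = grants_per_page(page_size);

	return (grants + BLKIF_MAX_SEGMENTS_PER_REQUEST - 1) /
	       BLKIF_MAX_SEGMENTS_PER_REQUEST;
}
```

A 4KB-page guest fits in one ring request (1 grant ≤ 11 segments), while a 64KB-page guest needs 16 grants, i.e. two ring requests — hence the halved queue depth, and the 2 × 11 × 4KB = 88KB ceiling mentioned in the patch.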
* Re: [Xen-devel] [PATCH v2 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages
  2015-11-12 16:40 ` Roger Pau Monné
@ 2015-11-12 17:30 ` Julien Grall
  2015-11-12 17:51   ` Roger Pau Monné
  0 siblings, 1 reply; 11+ messages in thread
From: Julien Grall @ 2015-11-12 17:30 UTC (permalink / raw)
To: Roger Pau Monné, xen-devel, linux-kernel
Cc: Boris Ostrovsky, David Vrabel

Hi,

On 12/11/15 16:40, Roger Pau Monné wrote:
>> [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html
>>
>> Signed-off-by: Julien Grall <julien.grall@citrix.com>
>
> LGTM, only a couple of typos and a simplification:
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Do you mean Acked-by? ;)

>> +
>> +	/* The I/O request may be split in two */
>> +	if (unlikely(s->associated_id != NO_ASSOCIATED_ID)) {
>> +		struct blk_shadow *s2 = &info->shadow[s->associated_id];
>> +
>> +		/* Keep the status of the current response in shadow */
>> +		s->status = (bret->status == BLKIF_RSP_OKAY) ?
>> +			    REQ_DONE : REQ_FAIL;
>> +
>> +		/* Wait the second response if not yet here */
>> +		if (s2->status == REQ_WAITING)
>> +			return 0;
>> +
>> +		/*
>> +		 * The status of the current response will be used in
>> +		 * order to know if the request has failed.
>> +		 * Update the current response status only if has not
>> +		 * failed.
>> +		 */
>> +		if (bret->status == BLKIF_RSP_OKAY && s2->status == REQ_FAIL)
>
> This could be simplified by only checking if s2->status == REQ_FAIL.

I didn't do it because bret->status may be different than
BLKIF_RSP_ERROR (for instance BLKIF_RSP_EOPNOTSUPP).

Regards,

-- 
Julien Grall

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [Xen-devel] [PATCH v2 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages
  2015-11-12 17:30 ` [Xen-devel] " Julien Grall
@ 2015-11-12 17:51   ` Roger Pau Monné
  2015-11-12 18:04     ` Julien Grall
  0 siblings, 1 reply; 11+ messages in thread
From: Roger Pau Monné @ 2015-11-12 17:51 UTC (permalink / raw)
To: Julien Grall, xen-devel, linux-kernel; +Cc: Boris Ostrovsky, David Vrabel

El 12/11/15 a les 18.30, Julien Grall ha escrit:
> Hi,
>
> On 12/11/15 16:40, Roger Pau Monné wrote:
>>> [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html
>>>
>>> Signed-off-by: Julien Grall <julien.grall@citrix.com>
>>
>> LGTM, only a couple of typos and a simplification:
>>
>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>
> Do you mean Acked-by? ;)

Yes, I also had problems with SMTP, so I thought this one was actually
not sent. You have another one with a proper Ack :).

>>> +
>>> +	/* The I/O request may be split in two */
>>> +	if (unlikely(s->associated_id != NO_ASSOCIATED_ID)) {
>>> +		struct blk_shadow *s2 = &info->shadow[s->associated_id];
>>> +
>>> +		/* Keep the status of the current response in shadow */
>>> +		s->status = (bret->status == BLKIF_RSP_OKAY) ?
>>> +			    REQ_DONE : REQ_FAIL;
>>> +
>>> +		/* Wait the second response if not yet here */
>>> +		if (s2->status == REQ_WAITING)
>>> +			return 0;
>>> +
>>> +		/*
>>> +		 * The status of the current response will be used in
>>> +		 * order to know if the request has failed.
>>> +		 * Update the current response status only if has not
>>> +		 * failed.
>>> +		 */
>>> +		if (bret->status == BLKIF_RSP_OKAY && s2->status == REQ_FAIL)
>>
>> This could be simplified by only checking if s2->status == REQ_FAIL.
>
> I didn't do it because bret->status may be different than
> BLKIF_RSP_ERROR (for instance BLKIF_RSP_EOPNOTSUPP).

I think this is not actually possible in practice, but what if
bret->status == BLKIF_RSP_OKAY and the bret from s2 actually had
BLKIF_RSP_EOPNOTSUPP, wouldn't we lose the EOPNOTSUPP by
unconditionally setting BLKIF_RSP_ERROR?

Should s->status be able to store all the possible return codes from the
response (OK/ERROR/NOTSUPP)?

Roger.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [Xen-devel] [PATCH v2 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages
  2015-11-12 17:51 ` Roger Pau Monné
@ 2015-11-12 18:04   ` Julien Grall
  2015-11-12 18:24     ` Roger Pau Monné
  0 siblings, 1 reply; 11+ messages in thread
From: Julien Grall @ 2015-11-12 18:04 UTC (permalink / raw)
To: Roger Pau Monné, xen-devel, linux-kernel
Cc: Boris Ostrovsky, David Vrabel

On 12/11/15 17:51, Roger Pau Monné wrote:
> El 12/11/15 a les 18.30, Julien Grall ha escrit:
>> Hi,
>>
>> On 12/11/15 16:40, Roger Pau Monné wrote:
>>>> [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html
>>>>
>>>> Signed-off-by: Julien Grall <julien.grall@citrix.com>
>>>
>>> LGTM, only a couple of typos and a simplification:
>>>
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>
>> Do you mean Acked-by? ;)
>
> Yes, I also had problems with SMTP, so I thought this one was actually
> not sent. You have another one with a proper Ack :).
>
>>>> +
>>>> +	/* The I/O request may be split in two */
>>>> +	if (unlikely(s->associated_id != NO_ASSOCIATED_ID)) {
>>>> +		struct blk_shadow *s2 = &info->shadow[s->associated_id];
>>>> +
>>>> +		/* Keep the status of the current response in shadow */
>>>> +		s->status = (bret->status == BLKIF_RSP_OKAY) ?
>>>> +			    REQ_DONE : REQ_FAIL;
>>>> +
>>>> +		/* Wait the second response if not yet here */
>>>> +		if (s2->status == REQ_WAITING)
>>>> +			return 0;
>>>> +
>>>> +		/*
>>>> +		 * The status of the current response will be used in
>>>> +		 * order to know if the request has failed.
>>>> +		 * Update the current response status only if has not
>>>> +		 * failed.
>>>> +		 */
>>>> +		if (bret->status == BLKIF_RSP_OKAY && s2->status == REQ_FAIL)
>>>
>>> This could be simplified by only checking if s2->status == REQ_FAIL.
>>
>> I didn't do it because bret->status may be different than
>> BLKIF_RSP_ERROR (for instance BLKIF_RSP_EOPNOTSUPP).
>
> I think this is not actually possible in practice, but what if
> bret->status == BLKIF_RSP_OKAY and the bret from s2 actually had
> BLKIF_RSP_EOPNOTSUPP, wouldn't we lose the EOPNOTSUPP by
> unconditionally setting BLKIF_RSP_ERROR?

No, because EOPNOTSUPP is used when an operation is not supported. As
the 2 ring requests come from the same I/O request, they will always
have the same operation.

So if one gets EOPNOTSUPP, the other will too.

>
> Should s->status be able to store all the possible return codes from the
> response (OK/ERROR/NOTSUPP)?

That would work. However, how do you decide which will be the
final status?

Regards,

-- 
Julien Grall

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [Xen-devel] [PATCH v2 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages
  2015-11-12 18:04 ` Julien Grall
@ 2015-11-12 18:24   ` Roger Pau Monné
  2015-11-12 18:32     ` Julien Grall
  0 siblings, 1 reply; 11+ messages in thread
From: Roger Pau Monné @ 2015-11-12 18:24 UTC (permalink / raw)
To: Julien Grall, xen-devel, linux-kernel; +Cc: Boris Ostrovsky, David Vrabel

El 12/11/15 a les 19.04, Julien Grall ha escrit:
> On 12/11/15 17:51, Roger Pau Monné wrote:
>> El 12/11/15 a les 18.30, Julien Grall ha escrit:
>>> Hi,
>>>
>>> On 12/11/15 16:40, Roger Pau Monné wrote:
>>>>> [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html
>>>>>
>>>>> Signed-off-by: Julien Grall <julien.grall@citrix.com>
>>>>
>>>> LGTM, only a couple of typos and a simplification:
>>>>
>>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>>
>>> Do you mean Acked-by? ;)
>>
>> Yes, I also had problems with SMTP, so I thought this one was actually
>> not sent. You have another one with a proper Ack :).
>>
>>>>> +
>>>>> +	/* The I/O request may be split in two */
>>>>> +	if (unlikely(s->associated_id != NO_ASSOCIATED_ID)) {
>>>>> +		struct blk_shadow *s2 = &info->shadow[s->associated_id];
>>>>> +
>>>>> +		/* Keep the status of the current response in shadow */
>>>>> +		s->status = (bret->status == BLKIF_RSP_OKAY) ?
>>>>> +			    REQ_DONE : REQ_FAIL;
>>>>> +
>>>>> +		/* Wait the second response if not yet here */
>>>>> +		if (s2->status == REQ_WAITING)
>>>>> +			return 0;
>>>>> +
>>>>> +		/*
>>>>> +		 * The status of the current response will be used in
>>>>> +		 * order to know if the request has failed.
>>>>> +		 * Update the current response status only if has not
>>>>> +		 * failed.
>>>>> +		 */
>>>>> +		if (bret->status == BLKIF_RSP_OKAY && s2->status == REQ_FAIL)
>>>>
>>>> This could be simplified by only checking if s2->status == REQ_FAIL.
>>>
>>> I didn't do it because bret->status may be different than
>>> BLKIF_RSP_ERROR (for instance BLKIF_RSP_EOPNOTSUPP).
>>
>> I think this is not actually possible in practice, but what if
>> bret->status == BLKIF_RSP_OKAY and the bret from s2 actually had
>> BLKIF_RSP_EOPNOTSUPP, wouldn't we lose the EOPNOTSUPP by
>> unconditionally setting BLKIF_RSP_ERROR?
>
> No, because EOPNOTSUPP is used when an operation is not supported. As
> the 2 ring requests come from the same I/O request, they will always
> have the same operation.
>
> So if one gets EOPNOTSUPP, the other will too.

That's why I said that I think it's not currently possible. IMHO, it's
fine as it is now.

The only scenario I can think of that can lead to that combination is
that we migrate the guest and one request gets processed by one backend
that supports the operation, while the other request gets processed by a
backend that doesn't support it.

With your current implementation we would return an error code anyway,
which is not that bad I guess.

>>
>> Should s->status be able to store all the possible return codes from the
>> response (OK/ERROR/NOTSUPP)?
>
> That would work. However, how do you decide which will be the
> final status?

It should be the most restrictive one, for example if we have ERROR and
NOTSUPP we should return ERROR, while if we have OK and NOTSUPP we
should return NOTSUPP.

Roger.

^ permalink raw reply	[flat|nested] 11+ messages in thread
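[Editorial aside — not part of the thread.] The "most restrictive status" rule suggested above could be sketched as a hypothetical helper — this is not code from the patch, and the BLKIF_RSP_* values are assumed from the Xen blkif ABI:

```c
#include <assert.h>

/* blkif response status values, assumed from the Xen blkif ABI. */
#define BLKIF_RSP_EOPNOTSUPP  (-2)
#define BLKIF_RSP_ERROR       (-1)
#define BLKIF_RSP_OKAY          0

/*
 * Hypothetical helper: merge the statuses of the two halves of a split
 * request, keeping the most restrictive one — ERROR beats EOPNOTSUPP,
 * which beats OKAY.
 */
static int blkif_merge_status(int s1, int s2)
{
	if (s1 == BLKIF_RSP_ERROR || s2 == BLKIF_RSP_ERROR)
		return BLKIF_RSP_ERROR;
	if (s1 == BLKIF_RSP_EOPNOTSUPP || s2 == BLKIF_RSP_EOPNOTSUPP)
		return BLKIF_RSP_EOPNOTSUPP;
	return BLKIF_RSP_OKAY;
}
```

With this shape, OKAY + NOTSUPP yields NOTSUPP and ERROR + NOTSUPP yields ERROR, matching the ordering described in the reply above.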
* Re: [Xen-devel] [PATCH v2 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages
  2015-11-12 18:24 ` Roger Pau Monné
@ 2015-11-12 18:32   ` Julien Grall
  0 siblings, 0 replies; 11+ messages in thread
From: Julien Grall @ 2015-11-12 18:32 UTC (permalink / raw)
To: Roger Pau Monné, xen-devel, linux-kernel
Cc: Boris Ostrovsky, David Vrabel, Ian Campbell

On 12/11/15 18:24, Roger Pau Monné wrote:
>> So if one gets EOPNOTSUPP, the other will too.
>
> That's why I said that I think it's not currently possible. IMHO, it's
> fine as it is now.
>
> The only scenario I can think of that can lead to that combination is
> that we migrate the guest and one request gets processed by one backend
> that supports the operation, while the other request gets processed by a
> backend that doesn't support it.
>
> With your current implementation we would return an error code anyway,
> which is not that bad I guess.

hmmm... We would return an error to the block layer rather than 0
because the operation is not supported.

That reminds me that blkif_recover needs to be fixed to support request
splitting. I haven't done it because ARM doesn't yet support
suspend/resume (CCing Ian who is working on it).

>>>
>>> Should s->status be able to store all the possible return codes from the
>>> response (OK/ERROR/NOTSUPP)?
>>
>> That would work. However, how do you decide which will be the
>> final status?
>
> It should be the most restrictive one, for example if we have ERROR and
> NOTSUPP we should return ERROR, while if we have OK and NOTSUPP we
> should return NOTSUPP.

I will take a look.

-- 
Julien Grall

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH v2 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages 2015-10-19 14:19 ` [PATCH v2 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages Julien Grall 2015-11-12 16:40 ` Roger Pau Monné @ 2015-11-12 16:45 ` Roger Pau Monné 1 sibling, 0 replies; 11+ messages in thread From: Roger Pau Monné @ 2015-11-12 16:45 UTC (permalink / raw) To: Julien Grall, xen-devel, linux-kernel Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, David Vrabel El 19/10/15 a les 16.19, Julien Grall ha escrit: > The minimal size of request in the block framework is always PAGE_SIZE. > It means that when 64KB guest is support, the request will at least be > 64KB. > > Although, if the backend doesn't support indirect descriptor (such as QDISK > in QEMU), a ring request is only able to accommodate 11 segments of 4KB > (i.e 44KB). > > The current frontend is assuming that an I/O request will always fit in > a ring request. This is not true any more when using 64KB page > granularity and will therefore crash during boot. > > On ARM64, the ABI is completely neutral to the page granularity used by > the domU. The guest has the choice between different page granularity > supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB). > This can't be enforced by the hypervisor and therefore it's possible to > run guests using different page granularity. > > So we can't mandate the block backend to support indirect descriptor > when the frontend is using 64KB page granularity and have to fix it > properly in the frontend. > > The solution exposed below is based on modifying directly the frontend > guest rather than asking the block framework to support smaller size > (i.e < PAGE_SIZE). This is because the change is the block framework are > not trivial as everything seems to relying on a struct *page (see [1]). > Although, it may be possible that someone succeed to do it in the future > and we would therefore be able to use it. 
> > Given that a block request may not fit in a single ring request, a > second request is introduced for the data that cannot fit in the first > one. This means that the second ring request should never be used on > Linux if the page size is smaller than 44KB. > > To achieve the support of the extra ring request, the block queue size > is divided by two. Therefore, the ring will always contain enough space > to accommodate 2 ring requests. While this will reduce the overall > performance, it will make the implementation more contained. The way > forward to get better performance is to implement in the backend either > indirect descriptor or multiple grants ring. > > Note that the parameters blk_queue_max_* helpers haven't been updated. > The block code will set the mimimum size supported and we may be able > to support directly any change in the block framework that lower down > the minimal size of a request. > > [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html > > Signed-off-by: Julien Grall <julien.grall@citrix.com> LGTM, only a couple of typos and a simplification: Acked-by: Roger Pau Monné <roger.pau@citrix.com> > --- > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > Cc: "Roger Pau Monné" <roger.pau@citrix.com> > Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> > Cc: David Vrabel <david.vrabel@citrix.com> > > Changes in v2: > - Update the commit message > - Typoes > - Rename ring_req2/id2 to extra_ring_req/extra_id > --- > drivers/block/xen-blkfront.c | 200 +++++++++++++++++++++++++++++++++++++++---- > 1 file changed, 184 insertions(+), 16 deletions(-) > > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c > index 5982768..8ea2c97 100644 > --- a/drivers/block/xen-blkfront.c > +++ b/drivers/block/xen-blkfront.c > @@ -60,6 +60,20 @@ > > #include <asm/xen/hypervisor.h> > > +/* > + * The minimal size of segment supportd by the block framework is PAGE_SIZE. 
^supported > + * When Linux is using a different page size than xen, it may not be possible > + * to put all the data in a single segment. > + * This can happen when the backend doesn't support indirect descriptor and > + * therefore the maximum amount of data that a request can carry is > + * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE = 44KB > + * > + * Note that we only support one extra request. So the Linux page size > + * should be <= ( 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) = > + * 88KB. > + */ > +#define HAS_EXTRA_REQ (BLKIF_MAX_SEGMENTS_PER_REQUEST < XEN_PFN_PER_PAGE) > + > enum blkif_state { > BLKIF_STATE_DISCONNECTED, > BLKIF_STATE_CONNECTED, > @@ -79,6 +93,18 @@ struct blk_shadow { > struct grant **indirect_grants; > struct scatterlist *sg; > unsigned int num_sg; > + enum { > + REQ_WAITING, > + REQ_DONE, > + REQ_FAIL > + } status; > + > + #define NO_ASSOCIATED_ID ~0UL > + /* > + * Id of the sibling if we ever need 2 requests when handling a > + * block I/O request > + */ > + unsigned long associated_id; > }; > > struct split_bio { > @@ -467,6 +493,8 @@ static unsigned long blkif_ring_get_request(struct blkfront_info *info, > > id = get_id_from_freelist(info); > info->shadow[id].request = req; > + info->shadow[id].status = REQ_WAITING; > + info->shadow[id].associated_id = NO_ASSOCIATED_ID; > > (*ring_req)->u.rw.id = id; > > @@ -508,6 +536,9 @@ struct setup_rw_req { > bool need_copy; > unsigned int bvec_off; > char *bvec_data; > + > + bool require_extra_req; > + struct blkif_request *extra_ring_req; > }; > > static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, > @@ -521,8 +552,24 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, > unsigned int grant_idx = setup->grant_idx; > struct blkif_request *ring_req = setup->ring_req; > struct blkfront_info *info = setup->info; > + /* > + * We always use the shadow of the first request to store the list > + * of grant associated to the block I/O 
request. This made the > + * completion more easy to handle even if the block I/O request is > + * split. > + */ > struct blk_shadow *shadow = &info->shadow[setup->id]; > > + if (unlikely(setup->require_extra_req && > + grant_idx >= BLKIF_MAX_SEGMENTS_PER_REQUEST)) { > + /* > + * We are using the second request, setup grant_idx > + * to be the index of the segment array > + */ > + grant_idx -= BLKIF_MAX_SEGMENTS_PER_REQUEST; > + ring_req = setup->extra_ring_req; > + } > + > if ((ring_req->operation == BLKIF_OP_INDIRECT) && > (grant_idx % GRANTS_PER_INDIRECT_FRAME == 0)) { > if (setup->segments) > @@ -537,7 +584,11 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, > > gnt_list_entry = get_grant(&setup->gref_head, gfn, info); > ref = gnt_list_entry->gref; > - shadow->grants_used[grant_idx] = gnt_list_entry; > + /* > + * All the grants are stored in the shadow of the first > + * request. Therefore we have to use the global index > + */ > + shadow->grants_used[setup->grant_idx] = gnt_list_entry; > > if (setup->need_copy) { > void *shared_data; > @@ -579,11 +630,31 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, > (setup->grant_idx)++; > } > > +static void blkif_setup_extra_req(struct blkif_request *first, > + struct blkif_request *second) > +{ > + uint16_t nr_segments = first->u.rw.nr_segments; > + > + /* > + * The second request is only present when the first request uses > + * all its segments. It's always the continuity of the first one. 
> + */ > + first->u.rw.nr_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST; > + > + second->u.rw.nr_segments = nr_segments - BLKIF_MAX_SEGMENTS_PER_REQUEST; > + second->u.rw.sector_number = first->u.rw.sector_number + > + (BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) / 512; > + > + second->u.rw.handle = first->u.rw.handle; > + second->operation = first->operation; > +} > + > static int blkif_queue_rw_req(struct request *req) > { > struct blkfront_info *info = req->rq_disk->private_data; > - struct blkif_request *ring_req; > - unsigned long id; > + struct blkif_request *ring_req, *extra_ring_req = NULL; > + unsigned long id, extra_id = NO_ASSOCIATED_ID; > + bool require_extra_req = false; > int i; > struct setup_rw_req setup = { > .grant_idx = 0, > @@ -628,19 +699,19 @@ static int blkif_queue_rw_req(struct request *req) > /* Fill out a communications ring structure. */ > id = blkif_ring_get_request(info, req, &ring_req); > > - BUG_ON(info->max_indirect_segments == 0 && > - GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST); > - BUG_ON(info->max_indirect_segments && > - GREFS(req->nr_phys_segments) > info->max_indirect_segments); > - > num_sg = blk_rq_map_sg(req->q, req, info->shadow[id].sg); > num_grant = 0; > /* Calculate the number of grant used */ > for_each_sg(info->shadow[id].sg, sg, num_sg, i) > num_grant += gnttab_count_grant(sg->offset, sg->length); > > + require_extra_req = info->max_indirect_segments == 0 && > + num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST; > + BUG_ON(!HAS_EXTRA_REQ && require_extra_req); > + > info->shadow[id].num_sg = num_sg; > - if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) { > + if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST && > + likely(!require_extra_req)) { > /* > * The indirect operation can only be a BLKIF_OP_READ or > * BLKIF_OP_WRITE > @@ -680,10 +751,30 @@ static int blkif_queue_rw_req(struct request *req) > } > } > ring_req->u.rw.nr_segments = num_grant; > + if (unlikely(require_extra_req)) { > + extra_id = 
blkif_ring_get_request(info, req, > + &extra_ring_req); > + /* > + * Only the first request contains the scatter-gather > + * list. > + */ > + info->shadow[extra_id].num_sg = 0; > + > + blkif_setup_extra_req(ring_req, extra_ring_req); > + > + /* Link the 2 requests together */ > + info->shadow[extra_id].associated_id = id; > + info->shadow[id].associated_id = extra_id; > + } > } > > setup.ring_req = ring_req; > setup.id = id; > + > + setup.require_extra_req = require_extra_req; > + if (unlikely(require_extra_req)) > + setup.extra_ring_req = extra_ring_req; > + > for_each_sg(info->shadow[id].sg, sg, num_sg, i) { > BUG_ON(sg->offset + sg->length > PAGE_SIZE); > > @@ -706,6 +797,8 @@ static int blkif_queue_rw_req(struct request *req) > > /* Keep a private copy so we can reissue requests when recovering. */ > info->shadow[id].req = *ring_req; > + if (unlikely(require_extra_req)) > + info->shadow[extra_id].req = *extra_ring_req; > > if (new_persistent_gnts) > gnttab_free_grant_references(setup.gref_head); > @@ -797,7 +890,16 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size, > memset(&info->tag_set, 0, sizeof(info->tag_set)); > info->tag_set.ops = &blkfront_mq_ops; > info->tag_set.nr_hw_queues = 1; > - info->tag_set.queue_depth = BLK_RING_SIZE(info); > + if (HAS_EXTRA_REQ && info->max_indirect_segments == 0) { > + /* > + * When indirect descriptior is not supported, the I/O request ^descriptors are > + * will be split between multiple request in the ring. > + * To avoid problems when sending the request, divide by > + * 2 the depth of the queue. 
> + */ > + info->tag_set.queue_depth = BLK_RING_SIZE(info) / 2; > + } else > + info->tag_set.queue_depth = BLK_RING_SIZE(info); > info->tag_set.numa_node = NUMA_NO_NODE; > info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE; > info->tag_set.cmd_size = 0; > @@ -1217,19 +1319,67 @@ static void blkif_copy_from_grant(unsigned long gfn, unsigned int offset, > kunmap_atomic(shared_data); > } > > -static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info, > +static bool blkif_completion(unsigned long *id, struct blkfront_info *info, > struct blkif_response *bret) > { > int i = 0; > struct scatterlist *sg; > int num_sg, num_grant; > + struct blk_shadow *s = &info->shadow[*id]; > struct copy_from_grant data = { > - .s = s, > .grant_idx = 0, > }; > > num_grant = s->req.operation == BLKIF_OP_INDIRECT ? > s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments; > + > + /* The I/O request may be split in two */ > + if (unlikely(s->associated_id != NO_ASSOCIATED_ID)) { > + struct blk_shadow *s2 = &info->shadow[s->associated_id]; > + > + /* Keep the status of the current response in shadow */ > + s->status = (bret->status == BLKIF_RSP_OKAY) ? > + REQ_DONE : REQ_FAIL; > + > + /* Wait the second response if not yet here */ > + if (s2->status == REQ_WAITING) > + return 0; > + > + /* > + * The status of the current response will be used in > + * order to know if the request has failed. > + * Update the current response status only if has not > + * failed. > + */ > + if (bret->status == BLKIF_RSP_OKAY && s2->status == REQ_FAIL) This condition could be made simpler by just checking s2->status == REQ_FAIL and then setting bret->status unconditionally? > + bret->status = BLKIF_RSP_ERROR; Roger. ^ permalink raw reply [flat|nested] 11+ messages in thread
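[Editorial aside — not part of the thread.] The two-response completion dance in blkif_completion() reviewed above can be modeled in miniature. This is a toy sketch, not the kernel code: it shows why the first response to arrive only records its status and returns early, and why only the arrival of the sibling response completes the split I/O.

```c
#include <assert.h>
#include <stdbool.h>

enum req_status { REQ_WAITING, REQ_DONE, REQ_FAIL };

/* Toy stand-in for struct blk_shadow: status plus a link to the sibling. */
struct shadow {
	enum req_status status;
	int associated_id;	/* index of the sibling, or -1 if not split */
};

/*
 * Toy model of the completion logic: returns true when the whole
 * (possibly split) I/O can be completed, false when we must wait for
 * the sibling response. 'ok' is whether this response succeeded;
 * '*io_failed' reports the merged outcome once both halves arrived.
 */
static bool toy_completion(struct shadow *shadow, int id, bool ok,
			   bool *io_failed)
{
	struct shadow *s = &shadow[id];

	s->status = ok ? REQ_DONE : REQ_FAIL;

	if (s->associated_id >= 0) {
		struct shadow *s2 = &shadow[s->associated_id];

		if (s2->status == REQ_WAITING)
			return false;	/* first half: wait for the other */

		/* second half: merge the two statuses and complete */
		*io_failed = (s->status == REQ_FAIL ||
			      s2->status == REQ_FAIL);
		return true;
	}

	*io_failed = !ok;
	return true;
}
```

If either half fails, the merged I/O fails — the same "one bad response poisons the whole request" behaviour the patch implements by rewriting bret->status.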
[parent not found: <5640C114.5080104@citrix.com>]
[parent not found: <20151109160523.GE8121@char.us.oracle.com>]
* Re: [Xen-devel] [PATCH v2 0/2] block/xen-blkfront: Support non-indirect grant with 64KB page granularity
  [not found]   ` <20151109160523.GE8121@char.us.oracle.com>
@ 2015-11-18 17:34     ` Julien Grall
  0 siblings, 0 replies; 11+ messages in thread
From: Julien Grall @ 2015-11-18 17:34 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk, bob.liu
Cc: xen-devel, Boris Ostrovsky, Roger Pau Monné, linux-kernel,
	David Vrabel

Hi Konrad,

On 09/11/15 16:05, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 09, 2015 at 03:51:48PM +0000, Julien Grall wrote:
>> Hi,
>>
>> Any comments on this new version?
>
> I completely ignored it, thinking that the:
>
> c004a6f block/xen-blkfront: Make it running on 64KB page granularity
> 4f503fb block/xen-blkfront: split get_grant in 2
> a7a6df2 block/xen-blkfront: Store a page rather a pfn in the grant structure
> 3320466 block/xen-blkfront: Split blkif_queue_request in 2
>
> were the only ones needed. Sorry! Could you repost them please? And
> CC bob.liu@oracle.com please?

I think this version can still be applied cleanly on top of linux-4.4.

Anyway, I'm planning to resend a new version with Roger's suggestions.

Regards,

-- 
Julien Grall

^ permalink raw reply	[flat|nested] 11+ messages in thread
Thread overview: 11+ messages
2015-10-19 14:19 [PATCH v2 0/2] block/xen-blkfront: Support non-indirect grant with 64KB page granularity Julien Grall
2015-10-19 14:19 ` [PATCH v2 1/2] block/xen-blkfront: Introduce blkif_ring_get_request Julien Grall
2015-10-19 14:19 ` [PATCH v2 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages Julien Grall
2015-11-12 16:40 ` Roger Pau Monné
2015-11-12 17:30 ` [Xen-devel] " Julien Grall
2015-11-12 17:51 ` Roger Pau Monné
2015-11-12 18:04 ` Julien Grall
2015-11-12 18:24 ` Roger Pau Monné
2015-11-12 18:32 ` Julien Grall
2015-11-12 16:45 ` Roger Pau Monné
[not found] ` <5640C114.5080104@citrix.com>
[not found] ` <20151109160523.GE8121@char.us.oracle.com>
2015-11-18 17:34 ` [Xen-devel] [PATCH v2 0/2] block/xen-blkfront: Support non-indirect grant with 64KB page granularity Julien Grall