* [RFC] Support of non-indirect grant backend on 64KB guest
@ 2015-08-18 6:29 Julien Grall
2015-08-18 7:09 ` Roger Pau Monné
2015-08-20 17:42 ` David Vrabel
0 siblings, 2 replies; 29+ messages in thread
From: Julien Grall @ 2015-08-18 6:29 UTC (permalink / raw)
To: Roger Pau Monné, Stefano Stabellini, Konrad Rzeszutek Wilk,
Ian Campbell, xen-devel@lists.xen.org
Hi,
Firstly, this patch is not ready at all and is mostly here to collect comments about the way to do it. It's not clean, so there is no need to complain about the coding style.
The qdisk backend in QEMU does not support indirect grants, which means that a request can only carry 11 * 4KB = 44KB of data.
When using 64KB pages, a Linux block request (struct request) may contain up to 64KB of data. This is because the block segment size must be at least the size of a Linux page.
So when indirect grants are not supported by the backend, we are not able to fit all the data in a single request. We therefore need to create a second request for the rest of the data.
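As a rough sketch of the arithmetic (reusing the constants that appear in the patch below), a single 64KB Linux page spans 16 Xen grants, so the decision to use a second ring slot boils down to:

	/* A 64KB Linux page is backed by 64KB / 4KB = 16 grants, while a
	 * non-indirect request carries at most BLKIF_MAX_SEGMENTS_PER_REQUEST
	 * (11) of them, i.e. 44KB. When the backend has no indirect descriptor
	 * support and the request needs more grants than that, the remaining
	 * (at most 5) grants go into a second ring slot. */
	require_extra_req = info->max_indirect_segments == 0 &&
			    num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST;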
I wrote a patch last week which makes a 64KB guest boot with qdisk. However, I'm not sure this is the right way to do it. I would appreciate it if one of the block maintainers could give me some insight about it.
The patch can be found below.
Regards,
commit 62922ae04af371bcb6e4467eb2e470d83dac2a81
Author: Julien Grall <julien.grall@citrix.com>
Date: Thu Aug 13 13:13:35 2015 +0100
blkfront: Start to handle non-indirect grant
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 68ca4e5..76247ab 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -79,6 +79,13 @@ struct blk_shadow {
struct blk_grant **indirect_grants;
struct scatterlist *sg;
unsigned int num_sg;
+ enum
+ {
+ REQ_WAITING,
+ REQ_DONE,
+ REQ_FAIL
+ } status;
+ unsigned long associated_id;
};
struct split_bio {
@@ -467,6 +474,7 @@ static unsigned long blkif_ring_get_request(struct blkfront_info *info,
id = get_id_from_freelist(info);
info->shadow[id].request = req;
+ info->shadow[id].status = REQ_WAITING;
(*ring_req)->u.rw.id = id;
@@ -508,6 +516,9 @@ struct setup_rw_req {
bool need_copy;
unsigned int bvec_off;
char *bvec_data;
+
+ bool require_extra_req;
+ struct blkif_request *ring_req2;
};
static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
@@ -517,12 +528,20 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
int n, ref;
struct blk_grant *gnt_list_entry;
unsigned int fsect, lsect;
+ struct blkif_request *ring_req;
/* Convenient aliases */
unsigned int grant_idx = setup->grant_idx;
- struct blkif_request *ring_req = setup->ring_req;
struct blkfront_info *info = setup->info;
struct blk_shadow *shadow = &info->shadow[setup->id];
+ if (likely(!setup->require_extra_req ||
+ grant_idx < BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
+ ring_req = setup->ring_req;
+ } else {
+ grant_idx -= BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ ring_req = setup->ring_req2;
+ }
+
if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
(grant_idx % GRANTS_PER_INDIRECT_FRAME == 0)) {
if (setup->segments)
@@ -537,7 +556,7 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
gnt_list_entry = get_grant(&setup->gref_head, gfn, info);
ref = gnt_list_entry->gref;
- shadow->grants_used[grant_idx] = gnt_list_entry;
+ shadow->grants_used[setup->grant_idx] = gnt_list_entry;
if (setup->need_copy) {
void *shared_data;
@@ -579,11 +598,31 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
(setup->grant_idx)++;
}
+static void blkif_setup_extra_req(struct blkif_request *first,
+ struct blkif_request *second)
+{
+ uint16_t nr_segments = first->u.rw.nr_segments;
+
+
+ /* The second request is only present when the first request uses
+ * all its segments. It is always a continuation of the first one
+ */
+ first->u.rw.nr_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST;
+
+ second->u.rw.nr_segments = nr_segments - BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ second->u.rw.sector_number = first->u.rw.sector_number +
+ (BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) / 512;
+
+ second->u.rw.handle = first->u.rw.handle;
+ second->operation = first->operation;
+}
+
static int blkif_queue_rw_req(struct request *req)
{
struct blkfront_info *info = req->rq_disk->private_data;
- struct blkif_request *ring_req;
- unsigned long id;
+ struct blkif_request *ring_req, *ring_req2 = NULL;
+ unsigned long id, id2 = ~0;
+ bool require_extra_req = false;
int i;
struct setup_rw_req setup = {
.grant_idx = 0,
@@ -628,19 +667,28 @@ static int blkif_queue_rw_req(struct request *req)
/* Fill out a communications ring structure. */
id = blkif_ring_get_request(info, req, &ring_req);
- BUG_ON(info->max_indirect_segments == 0 &&
- GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST);
- BUG_ON(info->max_indirect_segments &&
- GREFS(req->nr_phys_segments) > info->max_indirect_segments);
-
num_sg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
num_grant = 0;
/* Calculate the number of grant used */
for_each_sg(info->shadow[id].sg, sg, num_sg, i)
num_grant += gnttab_count_grant(sg->offset, sg->length);
+ require_extra_req = info->max_indirect_segments == 0 &&
+ num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ BUG_ON((XEN_PAGE_SIZE == PAGE_SIZE) && require_extra_req);
+
+ if (unlikely(require_extra_req))
+ {
+ id2 = blkif_ring_get_request(info, req, &ring_req2);
+ info->shadow[id2].num_sg = 0;
+ info->shadow[id2].associated_id = id;
+ }
+
+ info->shadow[id].associated_id = id2;
+
info->shadow[id].num_sg = num_sg;
- if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+ if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST &&
+ likely(!require_extra_req)) {
/*
* The indirect operation can only be a BLKIF_OP_READ or
* BLKIF_OP_WRITE
@@ -680,10 +728,17 @@ static int blkif_queue_rw_req(struct request *req)
}
}
ring_req->u.rw.nr_segments = num_grant;
+ if (unlikely(require_extra_req))
+ blkif_setup_extra_req(ring_req, ring_req2);
}
setup.ring_req = ring_req;
setup.id = id;
+
+ setup.require_extra_req = require_extra_req;
+ if (unlikely(require_extra_req))
+ setup.ring_req2 = ring_req2;
+
for_each_sg(info->shadow[id].sg, sg, num_sg, i) {
BUG_ON(sg->offset + sg->length > PAGE_SIZE);
@@ -706,6 +761,8 @@ static int blkif_queue_rw_req(struct request *req)
/* Keep a private copy so we can reissue requests when recovering. */
info->shadow[id].req = *ring_req;
+ if (unlikely(require_extra_req))
+ info->shadow[id2].req = *ring_req2;
if (new_persistent_gnts)
gnttab_free_grant_references(setup.gref_head);
@@ -797,7 +854,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
memset(&info->tag_set, 0, sizeof(info->tag_set));
info->tag_set.ops = &blkfront_mq_ops;
info->tag_set.nr_hw_queues = 1;
- info->tag_set.queue_depth = BLK_RING_SIZE(info);
+ info->tag_set.queue_depth = BLK_RING_SIZE(info) / 2;
info->tag_set.numa_node = NUMA_NO_NODE;
info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
info->tag_set.cmd_size = 0;
@@ -822,6 +879,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
}
+
/* Hard sector size and max sectors impersonate the equiv. hardware. */
blk_queue_logical_block_size(rq, sector_size);
blk_queue_physical_block_size(rq, physical_sector_size);
@@ -1229,7 +1287,21 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
};
num_grant = s->req.operation == BLKIF_OP_INDIRECT ?
- s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments;
+ s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments;
+
+ if (unlikely(s->associated_id != ~0)) {
+ struct blk_shadow *s2 = &info->shadow[s->associated_id];
+ BUG_ON(s->req.operation == BLKIF_OP_INDIRECT);
+
+ num_grant += s2->req.u.rw.nr_segments;
+
+ /* Only the first request can have sg != 0 */
+ if (s2->num_sg != 0) {
+ data.s = s2;
+ s = s2;
+ }
+ }
+
num_sg = s->num_sg;
if (bret->operation == BLKIF_OP_READ && info->feature_persistent) {
@@ -1248,6 +1320,7 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
kunmap_atomic(data.bvec_data);
}
}
+
/* Add the persistent grant into the list of free grants */
for (i = 0; i < num_grant; i++) {
if (gnttab_query_foreign_access(s->grants_used[i]->gref)) {
@@ -1337,9 +1410,22 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
}
req = info->shadow[id].request;
- if (bret->operation != BLKIF_OP_DISCARD)
+ if (bret->operation != BLKIF_OP_DISCARD) {
+ unsigned long id2 = info->shadow[id].associated_id;
+
+ if (unlikely(id2 != ~0)) {
+ info->shadow[id].status = (bret->status == BLKIF_RSP_OKAY) ? REQ_DONE : REQ_FAIL;
+
+ if (info->shadow[id2].status == REQ_WAITING)
+ continue;
+ }
+
blkif_completion(&info->shadow[id], info, bret);
+ if (unlikely(id2 != ~0))
+ BUG_ON(add_id_to_freelist(info, id2));
+ }
+
if (add_id_to_freelist(info, id)) {
WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
info->gd->disk_name, op_name(bret->operation), id);
@@ -1874,7 +1960,13 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
xen_blkif_max_segments);
grants = info->max_indirect_segments;
}
+
psegs = grants / GRANTS_PER_PSEG;
+ if (!psegs)
+ {
+ psegs = 1;
+ grants = GRANTS_PER_PSEG;
+ }
err = fill_grant_buffer(info,
(grants + INDIRECT_GREFS(grants)) * BLK_RING_SIZE(info));
--
Julien Grall
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-18 6:29 [RFC] Support of non-indirect grant backend on 64KB guest Julien Grall
@ 2015-08-18 7:09 ` Roger Pau Monné
2015-08-18 7:26 ` Jan Beulich
2015-08-18 18:45 ` Julien Grall
2015-08-20 17:42 ` David Vrabel
1 sibling, 2 replies; 29+ messages in thread
From: Roger Pau Monné @ 2015-08-18 7:09 UTC (permalink / raw)
To: Julien Grall, Stefano Stabellini, Konrad Rzeszutek Wilk,
Ian Campbell, xen-devel@lists.xen.org
Hello,
On 18/08/15 at 8:29, Julien Grall wrote:
> Hi,
>
> Firstly, this patch is not ready at all and mostly here for collecting comment about the way to do it. It's not clean so no need to complain about the coding style.
>
> The qdisk backend in QEMU is not supporting indirect grant, this is means that a request can only support 11 * 4KB = 44KB.
>
> When using 64KB page, a Linux block request (struct *request) may contain up to 64KB of data. This is because the block segment size must at least be the size of a Linux page.
>
> So when indirect is not supported by the backend, we are not able to fit all the data in a single request. We therefore need to create a second request to copy the rest of the data.
>
> I've wrote a patch last week which make 64KB guest booting with qdisk. Although, I'm not sure this is the right way to do it. I would appreciate if one of the block maintainers give me insight about it.
Maybe I'm missing some key data, but I see two ways to solve this. The first one is the one you describe above: allowing blkfront to split a request into multiple ring slots. The other solution would be to add indirect descriptor support to Qdisk; has this been looked into?
AFAICT it looks more interesting, and x86 can also benefit from it. Since I would like to avoid adding more cruft to blkfront, I would rather require 64KB guests to use indirect descriptors in order to run.
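(For scale, a rough sketch of what indirect descriptors buy, reusing blkfront's GRANTS_PER_INDIRECT_FRAME constant and assuming the usual 8-byte segment descriptor; the figures are an assumption, not something stated in this thread:)

	/* Data a single ring slot can describe, assuming 4KB grants. */
	unsigned int direct_bytes   = BLKIF_MAX_SEGMENTS_PER_REQUEST * 4096; /* 11 * 4KB = 44KB */
	unsigned int indirect_bytes = GRANTS_PER_INDIRECT_FRAME * 4096;      /* ~512 * 4KB = 2MB per indirect page */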
Roger.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-18 7:09 ` Roger Pau Monné
@ 2015-08-18 7:26 ` Jan Beulich
2015-08-18 18:45 ` Julien Grall
1 sibling, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2015-08-18 7:26 UTC (permalink / raw)
To: Julien Grall, Roger Pau Monné
Cc: xen-devel@lists.xen.org, Ian Campbell, Stefano Stabellini
>>> On 18.08.15 at 09:09, <roger.pau@citrix.com> wrote:
> Hello,
>
> On 18/08/15 at 8:29, Julien Grall wrote:
>> Hi,
>>
>> Firstly, this patch is not ready at all and mostly here for collecting
> comment about the way to do it. It's not clean so no need to complain about
> the coding style.
>>
>> The qdisk backend in QEMU is not supporting indirect grant, this is means
> that a request can only support 11 * 4KB = 44KB.
>>
>> When using 64KB page, a Linux block request (struct *request) may contain up
> to 64KB of data. This is because the block segment size must at least be the
> size of a Linux page.
>>
>> So when indirect is not supported by the backend, we are not able to fit all
> the data in a single request. We therefore need to create a second request to
> copy the rest of the data.
>>
>> I've wrote a patch last week which make 64KB guest booting with qdisk.
> Although, I'm not sure this is the right way to do it. I would appreciate if
> one of the block maintainers give me insight about it.
>
> Maybe I'm missing some key data, but I see two ways to solve this, the
> first one is the one you describe above, and consists in allowing
> blkfront to split a request into multiple ring slots. The other solution
> would be to add indirect descriptors support to Qdisk, has this been
> looked into?
>
> AFAICT it looks more interesting, and x86 can also benefit from it.
> Since I would like to prevent adding more cruft to blkfront, I rather
> prefer 64KB guests to require indirect descriptors in order to run.
I was about to reply (almost) the same...
Jan
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-18 7:09 ` Roger Pau Monné
2015-08-18 7:26 ` Jan Beulich
@ 2015-08-18 18:45 ` Julien Grall
2015-08-19 8:50 ` Roger Pau Monné
2015-08-19 8:58 ` Jan Beulich
1 sibling, 2 replies; 29+ messages in thread
From: Julien Grall @ 2015-08-18 18:45 UTC (permalink / raw)
To: Roger Pau Monné, Stefano Stabellini, Konrad Rzeszutek Wilk,
Ian Campbell, xen-devel@lists.xen.org
Hi Roger,
On 18/08/2015 00:09, Roger Pau Monné wrote:
> Hello,
>
> On 18/08/15 at 8:29, Julien Grall wrote:
>> Hi,
>>
>> Firstly, this patch is not ready at all and mostly here for collecting comment about the way to do it. It's not clean so no need to complain about the coding style.
>>
>> The qdisk backend in QEMU is not supporting indirect grant, this is means that a request can only support 11 * 4KB = 44KB.
>>
>> When using 64KB page, a Linux block request (struct *request) may contain up to 64KB of data. This is because the block segment size must at least be the size of a Linux page.
>>
>> So when indirect is not supported by the backend, we are not able to fit all the data in a single request. We therefore need to create a second request to copy the rest of the data.
>>
>> I've wrote a patch last week which make 64KB guest booting with qdisk. Although, I'm not sure this is the right way to do it. I would appreciate if one of the block maintainers give me insight about it.
>
> Maybe I'm missing some key data, but I see two ways to solve this, the
> first one is the one you describe above, and consists in allowing
> blkfront to split a request into multiple ring slots. The other solution
> would be to add indirect descriptors support to Qdisk, has this been
> looked into?
>
> AFAICT it looks more interesting, and x86 can also benefit from it.
> Since I would like to prevent adding more cruft to blkfront, I rather
> prefer 64KB guests to require indirect descriptors in order to run.
Actually, supporting indirect in Qdisk was one of our ideas. While I agree this is a good improvement in general, we put this idea aside for various reasons.
The first one is that OpenStack uses the Qdisk backend by default, so a Linux 64KB guest wouldn't be able to boot on current versions of Xen. This is the only blocker to using 64KB guests; everything else is working. Getting indirect grant support into QEMU for Xen 4.6 is not realistic: there is only a month left and we are already in feature freeze.
That would mean that any new distribution using 64KB pages in Linux would not work out of the box on Xen.
Furthermore, not supporting non-indirect grants in the frontend means that no userspace backend will work for Linux 64KB guests.
Overall, I think we have to support non-indirect grants with Linux 64KB guests. Many (but not all) distributions will only support 64KB pages, so we can't wait until Xen 4.7 to get something running. That's not to say I rule out requiring the user to upgrade the QEMU version in order to run 64KB guests.
Regards,
--
Julien Grall
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-18 18:45 ` Julien Grall
@ 2015-08-19 8:50 ` Roger Pau Monné
2015-08-19 14:54 ` Julien Grall
2015-08-19 8:58 ` Jan Beulich
1 sibling, 1 reply; 29+ messages in thread
From: Roger Pau Monné @ 2015-08-19 8:50 UTC (permalink / raw)
To: Julien Grall, Stefano Stabellini, Konrad Rzeszutek Wilk,
Ian Campbell, xen-devel@lists.xen.org
On 18/08/15 at 20:45, Julien Grall wrote:
> Hi Roger,
>
> On 18/08/2015 00:09, Roger Pau Monné wrote:
>> Hello,
>>
>> On 18/08/15 at 8:29, Julien Grall wrote:
>>> Hi,
>>>
>>> Firstly, this patch is not ready at all and mostly here for
>>> collecting comment about the way to do it. It's not clean so no need
>>> to complain about the coding style.
>>>
>>> The qdisk backend in QEMU is not supporting indirect grant, this is
>>> means that a request can only support 11 * 4KB = 44KB.
>>>
>>> When using 64KB page, a Linux block request (struct *request) may
>>> contain up to 64KB of data. This is because the block segment size
>>> must at least be the size of a Linux page.
>>>
>>> So when indirect is not supported by the backend, we are not able to
>>>> fit all the data in a single request. We therefore need to create a
>>> second request to copy the rest of the data.
>>>
>>> I've wrote a patch last week which make 64KB guest booting with
>>> qdisk. Although, I'm not sure this is the right way to do it. I would
>>>> appreciate if one of the block maintainers give me insight about it.
>>
>> Maybe I'm missing some key data, but I see two ways to solve this, the
>> first one is the one you describe above, and consists in allowing
>> blkfront to split a request into multiple ring slots. The other solution
>> would be to add indirect descriptors support to Qdisk, has this been
>> looked into?
>>
>> AFAICT it looks more interesting, and x86 can also benefit from it.
>> Since I would like to prevent adding more cruft to blkfront, I rather
>> prefer 64KB guests to require indirect descriptors in order to run.
>
> Actually supporting indirect in Qdisk was one of our idea. While I agree
> this is a good improvement in general we put aside this idea for various
> reasons.
>
> The first one is openStack is using by default Qdisk backend, so Linux
> 64KB guest wouldn't be able to boot on current version of Xen. This is
> the only blocker in order use 64KB guests, everything else is working.
> Having the indirect grant support in QEMU for Xen 4.6 is not realistic,
> there is only a month left and we are already in feature.
From what you said, Linux with 64KB pages is already not able to boot on OpenStack; users will need to update either their kernels or their QEMU (depending on what we end up patching).
> That would mean that any new distribution using Linux 64KB would not
> work out-of-box on Xen.
>
> Furthermore, not supporting non-indirect grant in the frontend means
> that any userspace backend won't be supported for Linux 64KB guests.
>
> Overall, I think we have to support non-indirect with Linux 64KB guests.
> Many (but not all) distribution will only support 64KB pages, so we
> can't wait until Xen 4.7 to get something running. Not that I rule out
> the requirement for the user to upgrade the QEMU version in order to run
> 64KB guests.
Why can't this be fixed on the QEMU side, with the fix backported to 4.6.1?
If you want to fix it in Linux you will also have to wait for the next release anyway, and then ask users to use a specific kernel version (because distros won't pick up the change that fast).
Overall I still think this should be fixed in QEMU; as said above, users will have to update something anyway, either their kernels or their QEMU version.
Roger.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-18 18:45 ` Julien Grall
2015-08-19 8:50 ` Roger Pau Monné
@ 2015-08-19 8:58 ` Jan Beulich
2015-08-19 15:25 ` Julien Grall
1 sibling, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2015-08-19 8:58 UTC (permalink / raw)
To: Julien Grall
Cc: Stefano Stabellini, xen-devel@lists.xen.org, Ian Campbell,
roger.pau
>>> On 18.08.15 at 20:45, <julien.grall@citrix.com> wrote:
> Hi Roger,
>
> On 18/08/2015 00:09, Roger Pau Monné wrote:
>> Hello,
>>
>> On 18/08/15 at 8:29, Julien Grall wrote:
>>> Hi,
>>>
>>> Firstly, this patch is not ready at all and mostly here for collecting
> comment about the way to do it. It's not clean so no need to complain about
> the coding style.
>>>
>>> The qdisk backend in QEMU is not supporting indirect grant, this is means
> that a request can only support 11 * 4KB = 44KB.
>>>
>>> When using 64KB page, a Linux block request (struct *request) may contain up
> to 64KB of data. This is because the block segment size must at least be the
> size of a Linux page.
>>>
>>>> So when indirect is not supported by the backend, we are not able to fit all
> the data in a single request. We therefore need to create a second request to
> copy the rest of the data.
>>>
>>> I've wrote a patch last week which make 64KB guest booting with qdisk.
> Although, I'm not sure this is the right way to do it. I would appreciate
> if one of the block maintainers give me insight about it.
>>
>> Maybe I'm missing some key data, but I see two ways to solve this, the
>> first one is the one you describe above, and consists in allowing
>> blkfront to split a request into multiple ring slots. The other solution
>> would be to add indirect descriptors support to Qdisk, has this been
>> looked into?
>>
>> AFAICT it looks more interesting, and x86 can also benefit from it.
>> Since I would like to prevent adding more cruft to blkfront, I rather
>> prefer 64KB guests to require indirect descriptors in order to run.
>
> Actually supporting indirect in Qdisk was one of our idea. While I agree
> this is a good improvement in general we put aside this idea for various
> reasons.
>
> The first one is openStack is using by default Qdisk backend, so Linux
> 64KB guest wouldn't be able to boot on current version of Xen. This is
> the only blocker in order use 64KB guests, everything else is working.
> Having the indirect grant support in QEMU for Xen 4.6 is not realistic,
> there is only a month left and we are already in feature.
>
> That would mean that any new distribution using Linux 64KB would not
> work out-of-box on Xen.
>
> Furthermore, not supporting non-indirect grant in the frontend means
> that any userspace backend won't be supported for Linux 64KB guests.
>
> Overall, I think we have to support non-indirect with Linux 64KB guests.
> Many (but not all) distribution will only support 64KB pages, so we
> can't wait until Xen 4.7 to get something running. Not that I rule out
> the requirement for the user to upgrade the QEMU version in order to run
> 64KB guests.
To be honest, none of this really reads like a good reason for not following Roger's suggestion. All it points out is that there is a new feature that's not fully cooked yet. Distros wanting to support such guests should be willing to either backport the necessary qemu patch(es) or update qemu. Uglifying blkfront (via anything other than an experimental, out-of-tree patch) doesn't look like a good idea to me.
And then, if blkfront were to be made capable, it would seem to me that the better route would be to have it use its native page size for the blkif instantiation, and to extend the blkif protocol so the frontend can communicate its page size to the backend. The current backend-page-size == frontend-page-size restriction would then become <= (or maybe could even go away altogether).
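(Purely as an illustration of that negotiation, a hypothetical frontend-side sketch; the "page-size" key is invented here and is not part of the blkif protocol:)

	/* Hypothetical: advertise the frontend's page granularity during
	 * the xenbus handshake so the backend could accept grants covering
	 * more than 4KB. */
	err = xenbus_printf(xbt, dev->nodename, "page-size", "%u",
			    (unsigned int)PAGE_SIZE);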
Jan
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-19 8:50 ` Roger Pau Monné
@ 2015-08-19 14:54 ` Julien Grall
2015-08-19 15:17 ` Roger Pau Monné
0 siblings, 1 reply; 29+ messages in thread
From: Julien Grall @ 2015-08-19 14:54 UTC (permalink / raw)
To: Roger Pau Monné, Stefano Stabellini, Konrad Rzeszutek Wilk,
Ian Campbell, xen-devel@lists.xen.org
On 19/08/2015 01:50, Roger Pau Monné wrote:
> On 18/08/15 at 20:45, Julien Grall wrote:
>> Hi Roger,
>>
>> On 18/08/2015 00:09, Roger Pau Monné wrote:
>>> Hello,
>>>
>>> On 18/08/15 at 8:29, Julien Grall wrote:
>>>> Hi,
>>>>
>>>> Firstly, this patch is not ready at all and mostly here for
>>>> collecting comment about the way to do it. It's not clean so no need
>>>> to complain about the coding style.
>>>>
>>>> The qdisk backend in QEMU is not supporting indirect grant, this is
>>>> means that a request can only support 11 * 4KB = 44KB.
>>>>
>>>> When using 64KB page, a Linux block request (struct *request) may
>>>> contain up to 64KB of data. This is because the block segment size
>>>> must at least be the size of a Linux page.
>>>>
>>>> So when indirect is not supported by the backend, we are not able to
>>>> fit all the data in a single request. We therefore need to create a
>>>> second request to copy the rest of the data.
>>>>
>>>> I've wrote a patch last week which make 64KB guest booting with
>>>> qdisk. Although, I'm not sure this is the right way to do it. I would
>>>> appreciate if one of the block maintainers give me insight about it.
>>>
>>> Maybe I'm missing some key data, but I see two ways to solve this, the
>>> first one is the one you describe above, and consists in allowing
>>> blkfront to split a request into multiple ring slots. The other solution
>>> would be to add indirect descriptors support to Qdisk, has this been
>>> looked into?
>>>
>>> AFAICT it looks more interesting, and x86 can also benefit from it.
>>> Since I would like to prevent adding more cruft to blkfront, I rather
>>> prefer 64KB guests to require indirect descriptors in order to run.
>>
>> Actually supporting indirect in Qdisk was one of our idea. While I agree
>> this is a good improvement in general we put aside this idea for various
>> reasons.
>>
>> The first one is openStack is using by default Qdisk backend, so Linux
>> 64KB guest wouldn't be able to boot on current version of Xen. This is
>> the only blocker in order use 64KB guests, everything else is working.
>> Having the indirect grant support in QEMU for Xen 4.6 is not realistic,
>> there is only a month left and we are already in feature.
>
> Linux 64KB is already not able to boot on OpenStack from what you said,
> users will need to either update their kernels or their Qemu (depending
> on what we end up patching).
A Linux 64KB guest is not able to boot at all as a Xen guest today. The 64KB page support is aiming to be merged in Linux 4.3.
So with your suggestion users would have to update both QEMU and Linux rather than only one component. That makes things not work out of the box (i.e. without any changes on the Xen side).
>> That would mean that any new distribution using Linux 64KB would not
>> work out-of-box on Xen.
>>
>> Furthermore, not supporting non-indirect grant in the frontend means
>> that any userspace backend won't be supported for Linux 64KB guests.
>>
>> Overall, I think we have to support non-indirect with Linux 64KB guests.
>> Many (but not all) distribution will only support 64KB pages, so we
>> can't wait until Xen 4.7 to get something running. Not that I rule out
>> the requirement for the user to upgrade the QEMU version in order to run
>> 64KB guests.
>
> Why can this be fixed in the Qemu side and the fix backported to 4.6.1?
And what about any other backend that does not support indirect grants?
> If you want to fix it in Linux you will also have to wait for the next
> release anyway, and then asks users to use a specific kernel version
> (because distros won't pick the change that fast).
AArch64 distros are not officially out yet. There is still work going on and they are planning to target Linux 4.2/4.3. We have distributions willing to take patches in their tree in order to support Xen guests.
Although, if we have to patch QEMU, we also need to push distributions that may not even have 64KB enabled, in order to use them as Dom0...
> Overall I still think this should be fixed in Qemu, as said above users
> will have to update anyway, either their kernels or their Qemu version.
Let's look at the problem another way. You are a big cloud provider using AArch64 hardware. You decided to use Xen 4.5 (and not Xen 4.6) as the base version, with a Dom0 using 4KB page granularity. Now one of your small customers decides to use a distribution which has 64KB pages enabled; if booting using Qdisk, it won't work at all. Why on earth would the cloud provider update their QEMU to support this kind of guest?
I think this is a really bad idea given that 64KB guests should work out of the box and not depend on the backend side.
Regards,
--
Julien Grall
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-19 14:54 ` Julien Grall
@ 2015-08-19 15:17 ` Roger Pau Monné
2015-08-19 15:52 ` Julien Grall
2015-08-19 23:44 ` Stefano Stabellini
0 siblings, 2 replies; 29+ messages in thread
From: Roger Pau Monné @ 2015-08-19 15:17 UTC (permalink / raw)
To: Julien Grall, Stefano Stabellini, Konrad Rzeszutek Wilk,
Ian Campbell, xen-devel@lists.xen.org
Hello,
On 19/08/15 at 16:54, Julien Grall wrote:
> On 19/08/2015 01:50, Roger Pau Monné wrote:
>> Why can this be fixed in the Qemu side and the fix backported to 4.6.1?
>
> And what about any other backend not supporting indirect grant?
The only other backend I know of is tapdisk/blktap, and I'm not even
sure if that's an option on ARM. IMHO at this point we should not worry
about out-of-tree backends, there's only one, and I don't think it's so
outrageous to force indirect-descriptors support for ARM guests.
>> If you want to fix it in Linux you will also have to wait for the next
>> release anyway, and then asks users to use a specific kernel version
>> (because distros won't pick the change that fast).
>
> Aarch64 distros are not yet out officially. There is still work going
> out and they are planning to target to use Linux 4.2/4.3. We have
> distributions willing to take patches in their tree in order to support
> Xen guests.
So they are willing to take Linux kernel patches for allowing 64KB guests but not QEMU patches? This seems quite weird IMHO; patching the kernel is much more risky than patching QEMU.
> Although, if we have to patch QEMU, we also need to push on distribution
> that may not need 4KB enabled in order to use them as DOM0...
>
>> Overall I still think this should be fixed in Qemu, as said above users
>> will have to update anyway, either their kernels or their Qemu version.
>
> Let's take the problem in another way. Your are a big cloud provider
> using Aarch64 hardware. You decided to use Xen 4.5 (and not Xen 4.6) as
> the base version and a DOM0 using 4KB page granularity. Now one of the
> small customer decides to use a distribution which have 64KB pages
> enabled, if booting using Qdisk it won't work at all. Why on the earth
> the cloud provider will update his QEMU to support this kind of guest?
Well, you are a cloud provider, you make money out of this; why on earth wouldn't you patch/update QEMU if it's needed by your customers?
> I think this is a really bad idea given that 64KB should work out-of-box
> and not depending on the backend side.
My opinion is that we have already merged quite a lot of this mess in order to support guests with different page sizes. And in this case, the addition of code can be done in a userspace component, which is much less risky than adding it to blkfront, also taking into account that it's a general improvement for Qdisk that other arches can also leverage.
So on one hand you are adding code to a kernel component, which makes the code much more messy and can only be leveraged by ARM. On the other hand, you can add code to a user-space backend, and that code is also beneficial for other arches. IMHO, the decision is quite clear.
Roger.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-19 8:58 ` Jan Beulich
@ 2015-08-19 15:25 ` Julien Grall
0 siblings, 0 replies; 29+ messages in thread
From: Julien Grall @ 2015-08-19 15:25 UTC (permalink / raw)
To: Jan Beulich
Cc: roger.pau, xen-devel@lists.xen.org, Ian Campbell,
Stefano Stabellini
On 19/08/2015 01:58, Jan Beulich wrote:
>>>> On 18.08.15 at 20:45, <julien.grall@citrix.com> wrote:
>> Hi Roger,
>>
>> On 18/08/2015 00:09, Roger Pau Monné wrote:
>>> Hello,
>>>
>>> On 18/08/15 at 8:29, Julien Grall wrote:
>>>> Hi,
>>>>
>>>> Firstly, this patch is not ready at all and mostly here for collecting
>> comment about the way to do it. It's not clean so no need to complain about
>> the coding style.
>>>>
>>>> The qdisk backend in QEMU is not supporting indirect grant, this is means
>> that a request can only support 11 * 4KB = 44KB.
>>>>
>>>> When using 64KB page, a Linux block request (struct *request) may contain up
>> to 64KB of data. This is because the block segment size must at least be the
>> size of a Linux page.
>>>>
>>>> So when indirect is not supported by the backend, we are not able to fit all
>> the data in a single request. We therefore need to create a second request to
>> copy the rest of the data.
>>>>
>>>> I've wrote a patch last week which make 64KB guest booting with qdisk.
>> Although, I'm not sure this is the right way to do it. I would appreciate
>> if one of the block maintainers give me insight about it.
>>>
>>> Maybe I'm missing some key data, but I see two ways to solve this, the
>>> first one is the one you describe above, and consists in allowing
>>> blkfront to split a request into multiple ring slots. The other solution
>>> would be to add indirect descriptors support to Qdisk, has this been
>>> looked into?
>>>
>>> AFAICT it looks more interesting, and x86 can also benefit from it.
>>> Since I would like to prevent adding more cruft to blkfront, I rather
>>> prefer 64KB guests to require indirect descriptors in order to run.
>>
>> Actually supporting indirect in Qdisk was one of our idea. While I agree
>> this is a good improvement in general we put aside this idea for various
>> reasons.
>>
>> The first one is openStack is using by default Qdisk backend, so Linux
>> 64KB guest wouldn't be able to boot on current version of Xen. This is
>> the only blocker in order use 64KB guests, everything else is working.
>> Having the indirect grant support in QEMU for Xen 4.6 is not realistic,
>> there is only a month left and we are already in feature.
>>
>> That would mean that any new distribution using Linux 64KB would not
>> work out-of-box on Xen.
>>
>> Furthermore, not supporting non-indirect grant in the frontend means
>> that any userspace backend won't be supported for Linux 64KB guests.
>>
>> Overall, I think we have to support non-indirect with Linux 64KB guests.
>> Many (but not all) distribution will only support 64KB pages, so we
>> can't wait until Xen 4.7 to get something running. Not that I rule out
>> the requirement for the user to upgrade the QEMU version in order to run
>> 64KB guests.
>
> To be honest, none of this really reads like a good reason for not
> following Roger's suggestion. All it points out that there is a new
> feature that's not fully cooked yet. Distros wanting to support such
> guests should be willing to either backport the necessary qemu
> patch(es) or update qemu. Uglifying blkfront (via other than an
> experimental, out of tree patch) doesn't look like a good idea to me.
With your suggested approach, modifying QEMU, you would have to push a patch into Dom0 distributions (which may be different from the guest ones) in order to boot such guests. On the other hand, patching Linux will make such a guest boot out of the box, without requiring cloud providers using AArch64 platforms to fix their Dom0 in order to boot such a guest.
> And then, if blkfront was to be made capable, it would seem to me
> that the better route would be to have it use its native page size for
> the blkif instantiation, and extend the blkif protocol so the frontend
> can communicate its page size to the backend. The current
> backend-page-size == frontend-page-size restriction would then
> become <= (or maybe could even go away altogether).
We could indeed also improve memory performance right away rather than getting a first version upstream... but if we take this approach, it will take another year to upstream the 64KB guest support in Linux.
Having fully working, high-performing 64KB guest support is a huge amount of work and we can't do everything at once.
Having 64KB grants is my next plan, but it requires changes in both Linux and Xen. This can't be done on a short timeline and we need to have a first approach working today.
Major distributions will ship with 64KB page support only (including the one from your company, SUSE); they are targeting Linux 4.2/4.3 and are willing to take patches in their kernel.
Now, if we decide not to support non-indirect grants for 64KB guests, we also have to reach out to distributions using 4KB pages in order to get their QEMU updated.
Regards,
--
Julien Grall
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-19 15:17 ` Roger Pau Monné
@ 2015-08-19 15:52 ` Julien Grall
2015-08-19 23:44 ` Stefano Stabellini
1 sibling, 0 replies; 29+ messages in thread
From: Julien Grall @ 2015-08-19 15:52 UTC (permalink / raw)
To: Roger Pau Monné, Stefano Stabellini, Konrad Rzeszutek Wilk,
Ian Campbell, xen-devel@lists.xen.org
On 19/08/2015 08:17, Roger Pau Monné wrote:
> Hello,
>
On 19/08/15 at 16:54, Julien Grall wrote:
>> On 19/08/2015 01:50, Roger Pau Monné wrote:
>>> Why can this be fixed in the Qemu side and the fix backported to 4.6.1?
>>
>> And what about any other backend not supporting indirect grant?
>
> The only other backend I know of is tapdisk/blktap, and I'm not even
> sure if that's an option on ARM. IMHO at this point we should not worry
> about out-of-tree backends, there's only one, and I don't think it's so
> outrageous to force indirect-descriptors support for ARM guests.
>
>>> If you want to fix it in Linux you will also have to wait for the next
>>> release anyway, and then asks users to use a specific kernel version
>>> (because distros won't pick the change that fast).
>>
>> Aarch64 distros are not yet out officially. There is still work going
>> out and they are planning to target to use Linux 4.2/4.3. We have
>> distributions willing to take patches in their tree in order to support
>> Xen guests.
>
> So they are willing to take Linux kernel patches for allowing 64KB
> guests but not Qemu patches? This seems quite weird IMHO, patching the
> kernel is much more risky than patching Qemu.
I'm just saying that the Dom0 distribution may not be the same as the guest distribution. So we would have to reach out to every distribution to have QEMU patched, rather than only having the guest distributions fixed.
>> Although, if we have to patch QEMU, we also need to push on distribution
>> that may not need 4KB enabled in order to use them as DOM0...
>>
>>> Overall I still think this should be fixed in Qemu, as said above users
>>> will have to update anyway, either their kernels or their Qemu version.
>>
>> Let's take the problem in another way. Your are a big cloud provider
>> using Aarch64 hardware. You decided to use Xen 4.5 (and not Xen 4.6) as
>> the base version and a DOM0 using 4KB page granularity. Now one of the
>> small customer decides to use a distribution which have 64KB pages
>> enabled, if booting using Qdisk it won't work at all. Why on the earth
>> the cloud provider will update his QEMU to support this kind of guest?
>
> Well, you are a could provider, you make money out of this, why on earth
> won't you patch/update Qemu if it's needed by your customers?
Because you would have to update your platform with a new QEMU in order
to support indirect grants. Unless we backport the patch to Xen 4.5 too.
Although, IIRC, backporting a feature is not common.
>> I think this is a really bad idea given that 64KB should work out-of-box
>> and not depending on the backend side.
>
> My opinion is that we have already merged quite a lot of this mess in
> order to support guests with different page sizes. And in this case, the
> addition of code can be done to a userspace component, which is much
> less risky than adding it to blkfront, also taking into account that
> it's a general improvement for Qdisk that other arches can also leverage.
I need to correct you on one point here: the mess is Linux assuming that it is using the same page size as Xen, not adding support for different page sizes. It was a mistake when the PV drivers were created and it had to be fixed in any case.
Regards,
--
Julien Grall
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-19 15:17 ` Roger Pau Monné
2015-08-19 15:52 ` Julien Grall
@ 2015-08-19 23:44 ` Stefano Stabellini
2015-08-20 8:31 ` Roger Pau Monné
2015-08-20 9:37 ` Jan Beulich
1 sibling, 2 replies; 29+ messages in thread
From: Stefano Stabellini @ 2015-08-19 23:44 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Julien Grall, xen-devel@lists.xen.org, Ian Campbell,
Stefano Stabellini
On Wed, 19 Aug 2015, Roger Pau Monné wrote:
> My opinion is that we have already merged quite a lot of this mess in
> order to support guests with different page sizes. And in this case, the
> addition of code can be done to a userspace component, which is much
> less risky than adding it to blkfront, also taking into account that
> it's a general improvement for Qdisk that other arches can also leverage.
>
> So in one hand you are adding code to a kernel component, that makes the
> code much more messy and can only be leveraged by ARM. On the other
> hand, you can add code a user-space backend, and that code is also
> beneficial for other arches. IMHO, the decision is quite clear.
64K pages not working is entirely a Linux problem, not a Xen problem.
Xen uses 4K pages as usual and exports the same 4K based hypercall
interface as usual. That needs to work, no matter what the guest decides
to put in its own pagetables.
I remind everybody that Xen interfaces on ARM and ARM64 are fully
maintained for backward compatibility. Xen is not forcing Linux to use
64K pages, that's entirely a Linux decision. The issue has nothing to do
with Xen.
The bug here is that Linux has broken 64K page support and that should
be fixed. I don't think it is reasonable to make changes to the Xen ABIs
just to accommodate the brokenness of one guest kernel in a particular
configuration.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-19 23:44 ` Stefano Stabellini
@ 2015-08-20 8:31 ` Roger Pau Monné
2015-08-20 9:43 ` David Vrabel
2015-08-20 9:37 ` Jan Beulich
1 sibling, 1 reply; 29+ messages in thread
From: Roger Pau Monné @ 2015-08-20 8:31 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: Julien Grall, xen-devel@lists.xen.org, Ian Campbell
On 20/08/15 at 1:44, Stefano Stabellini wrote:
> On Wed, 19 Aug 2015, Roger Pau Monné wrote:
>> My opinion is that we have already merged quite a lot of this mess in
>> order to support guests with different page sizes. And in this case, the
>> addition of code can be done to a userspace component, which is much
>> less risky than adding it to blkfront, also taking into account that
>> it's a general improvement for Qdisk that other arches can also leverage.
>>
>> So in one hand you are adding code to a kernel component, that makes the
>> code much more messy and can only be leveraged by ARM. On the other
>> hand, you can add code a user-space backend, and that code is also
>> beneficial for other arches. IMHO, the decision is quite clear.
>
> 64K pages not working is entirely a Linux problem, not a Xen problem.
> Xen uses 4K pages as usual and exports the same 4K based hypercall
> interface as usual. That needs to work, no matter what the guest decides
> to put in its own pagetables.
>
> I remind everybody that Xen interfaces on ARM and ARM64 are fully
> maintained for backward compatibility. Xen is not forcing Linux to use
> 64K pages, that's entirely a Linux decision. The issue has nothing to do
> with Xen.
>
> The bug here is that Linux has broken 64K pages support and that should
> be fixed. I don't think is reasonable to make changes to the Xen ABIs
> just to accommodate the brokenness of one guest kernel in a particular
> configuration.
Is it a change to the ABI to mandate indirect-descriptor support in order to run arm64 guests with 64KB pages?
IMHO, it is not, and none of the proposed solutions (changing either blkfront or Qdisk) includes any change to the Xen ABI. In this case my preference would be to perform the change in the backend, for the reasons detailed above.
Anyway, I'm not going to block such a change, I just think there are
technically better ways to solve this issue.
Roger.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-19 23:44 ` Stefano Stabellini
2015-08-20 8:31 ` Roger Pau Monné
@ 2015-08-20 9:37 ` Jan Beulich
1 sibling, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2015-08-20 9:37 UTC (permalink / raw)
To: Stefano Stabellini
Cc: Julien Grall, xen-devel@lists.xen.org, Ian Campbell, roger.pau
>>> On 20.08.15 at 01:44, <stefano.stabellini@eu.citrix.com> wrote:
> On Wed, 19 Aug 2015, Roger Pau Monné wrote:
>> My opinion is that we have already merged quite a lot of this mess in
>> order to support guests with different page sizes. And in this case, the
>> addition of code can be done to a userspace component, which is much
>> less risky than adding it to blkfront, also taking into account that
>> it's a general improvement for Qdisk that other arches can also leverage.
>>
>> So in one hand you are adding code to a kernel component, that makes the
>> code much more messy and can only be leveraged by ARM. On the other
>> hand, you can add code a user-space backend, and that code is also
>> beneficial for other arches. IMHO, the decision is quite clear.
>
> 64K pages not working is entirely a Linux problem, not a Xen problem.
> Xen uses 4K pages as usual and exports the same 4K based hypercall
> interface as usual. That needs to work, no matter what the guest decides
> to put in its own pagetables.
>
> I remind everybody that Xen interfaces on ARM and ARM64 are fully
> maintained for backward compatibility. Xen is not forcing Linux to use
> 64K pages, that's entirely a Linux decision. The issue has nothing to do
> with Xen.
>
> The bug here is that Linux has broken 64K pages support and that should
> be fixed. I don't think is reasonable to make changes to the Xen ABIs
> just to accommodate the brokenness of one guest kernel in a particular
> configuration.
I would agree if the proposal were to add hacks or workarounds to the ABI, but afaics neither what Roger suggested nor what I said goes in that kind of direction.
Jan
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-20 8:31 ` Roger Pau Monné
@ 2015-08-20 9:43 ` David Vrabel
2015-08-20 16:16 ` Julien Grall
2015-08-20 17:23 ` Stefano Stabellini
0 siblings, 2 replies; 29+ messages in thread
From: David Vrabel @ 2015-08-20 9:43 UTC (permalink / raw)
To: Roger Pau Monné, Stefano Stabellini
Cc: Julien Grall, Ian Campbell, xen-devel@lists.xen.org
On 20/08/15 09:31, Roger Pau Monné wrote:
> On 20/08/15 at 1:44, Stefano Stabellini wrote:
>> On Wed, 19 Aug 2015, Roger Pau Monné wrote:
>>> My opinion is that we have already merged quite a lot of this mess in
>>> order to support guests with different page sizes. And in this case, the
>>> addition of code can be done to a userspace component, which is much
>>> less risky than adding it to blkfront, also taking into account that
>>> it's a general improvement for Qdisk that other arches can also leverage.
>>>
>>> So in one hand you are adding code to a kernel component, that makes the
>>> code much more messy and can only be leveraged by ARM. On the other
>>> hand, you can add code a user-space backend, and that code is also
>>> beneficial for other arches. IMHO, the decision is quite clear.
>>
>> 64K pages not working is entirely a Linux problem, not a Xen problem.
>> Xen uses 4K pages as usual and exports the same 4K based hypercall
>> interface as usual. That needs to work, no matter what the guest decides
>> to put in its own pagetables.
>>
>> I remind everybody that Xen interfaces on ARM and ARM64 are fully
>> maintained for backward compatibility. Xen is not forcing Linux to use
>> 64K pages, that's entirely a Linux decision. The issue has nothing to do
>> with Xen.
>>
>> The bug here is that Linux has broken 64K pages support and that should
>> be fixed. I don't think is reasonable to make changes to the Xen ABIs
>> just to accommodate the brokenness of one guest kernel in a particular
>> configuration.
>
> Is it a change to the ABI to mandate indirect-descriptors support in
> order to run arm64 with 64KB guests?
No (given that there is no guest using 64 KiB guest pages that works yet).
> IMHO, it is not, and none of the proposed solutions (either change
> blkfront or Qdisk) include any change to the Xen ABI. In this case my
> preference would be to perform the change in the backend for the reasons
> detailed above.
>
> Anyway, I'm not going to block such a change, I just think there are
> technically better ways to solve this issue.
I'm already unhappy about the complexity of the stop-gap support for 64
KiB guest pages in Linux and if there's a way to get it working by
adding a useful missing feature to qemu's disk backend then that is what
should be done.
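(For what it's worth, the xenstore half of that feature is small; a hypothetical sketch for qemu's xen_disk backend, where the MAX_INDIRECT_SEGMENTS constant is invented and the real work is parsing BLKIF_OP_INDIRECT requests:)

	/* Hypothetical: advertise indirect descriptor support to the
	 * frontend; handling BLKIF_OP_INDIRECT on the ring is the larger
	 * part of the change. */
	xenstore_write_be_int(&blkdev->xendev, "feature-max-indirect-segments",
			      MAX_INDIRECT_SEGMENTS);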
David
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-20 9:43 ` David Vrabel
@ 2015-08-20 16:16 ` Julien Grall
2015-08-20 17:23 ` Stefano Stabellini
1 sibling, 0 replies; 29+ messages in thread
From: Julien Grall @ 2015-08-20 16:16 UTC (permalink / raw)
To: David Vrabel, Roger Pau Monné, Stefano Stabellini
Cc: Ian Campbell, xen-devel@lists.xen.org
On 20/08/2015 02:43, David Vrabel wrote:
> On 20/08/15 09:31, Roger Pau Monné wrote:
>> On 20/08/15 at 1:44, Stefano Stabellini wrote:
>>> On Wed, 19 Aug 2015, Roger Pau Monné wrote:
>>>> My opinion is that we have already merged quite a lot of this mess in
>>>> order to support guests with different page sizes. And in this case, the
>>>> addition of code can be done to a userspace component, which is much
>>>> less risky than adding it to blkfront, also taking into account that
>>>> it's a general improvement for Qdisk that other arches can also leverage.
>>>>
>>>> So in one hand you are adding code to a kernel component, that makes the
>>>> code much more messy and can only be leveraged by ARM. On the other
>>>> hand, you can add code a user-space backend, and that code is also
>>>> beneficial for other arches. IMHO, the decision is quite clear.
>>>
>>> 64K pages not working is entirely a Linux problem, not a Xen problem.
>>> Xen uses 4K pages as usual and exports the same 4K based hypercall
>>> interface as usual. That needs to work, no matter what the guest decides
>>> to put in its own pagetables.
>>>
>>> I remind everybody that Xen interfaces on ARM and ARM64 are fully
>>> maintained for backward compatibility. Xen is not forcing Linux to use
>>> 64K pages, that's entirely a Linux decision. The issue has nothing to do
>>> with Xen.
>>>
>>> The bug here is that Linux has broken 64K pages support and that should
>>> be fixed. I don't think is reasonable to make changes to the Xen ABIs
>>> just to accommodate the brokenness of one guest kernel in a particular
>>> configuration.
>>
>> Is it a change to the ABI to mandate indirect-descriptors support in
>> order to run arm64 with 64KB guests?
>
> No (given that there is no guest using 64 KiB guest pages that works yet).
I need to correct you here: there are no Linux guests using 64KB page granularity yet. It is possible to come up with a guest that uses 64KB pages and works with non-indirect grants.
It's just not easily possible to do so in Linux, because the block framework always passes segments of at least a Linux page (i.e. 4KB or 64KB). So we are exposing a broken Linux PV driver/interface to the backend.
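(This is the constraint in question: roughly what the block layer's blk_queue_max_segment_size() does today, quoted from memory, so treat the details as an assumption; a driver simply cannot ask for segments smaller than one Linux page:)

	void blk_queue_max_segment_size(struct request_queue *q, unsigned int max_size)
	{
		/* The block layer silently raises any limit below one page,
		 * so a 64KB-page blkfront cannot just cap segments at 44KB. */
		if (max_size < PAGE_CACHE_SIZE) {
			max_size = PAGE_CACHE_SIZE;
			printk(KERN_INFO "%s: set to minimum %d\n",
			       __func__, max_size);
		}
		q->limits.max_segment_size = max_size;
	}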
>> IMHO, it is not, and none of the proposed solutions (either change
>> blkfront or Qdisk) include any change to the Xen ABI. In this case my
>> preference would be to perform the change in the backend for the reasons
>> detailed above.
>>
>> Anyway, I'm not going to block such a change, I just think there are
>> technically better ways to solve this issue.
>
> I'm already unhappy about the complexity of the stop-gap support for 64
> KiB guest pages in Linux and if there's a way to get it working by
> adding a useful missing feature to qemu's disk backend then that is what
> should be done.
I'd like to remind you that assuming that Linux and Xen use the same page size was wrong. There is nothing in the Xen interface requiring such a thing. So we ought to fix this at some point anyway, given that ARM is not the only architecture supporting two different page sizes.
Regards,
--
Julien Grall
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-20 9:43 ` David Vrabel
2015-08-20 16:16 ` Julien Grall
@ 2015-08-20 17:23 ` Stefano Stabellini
2015-08-21 16:05 ` Konrad Rzeszutek Wilk
1 sibling, 1 reply; 29+ messages in thread
From: Stefano Stabellini @ 2015-08-20 17:23 UTC (permalink / raw)
To: David Vrabel
Cc: Julien Grall, Stefano Stabellini, xen-devel@lists.xen.org,
Ian Campbell, Roger Pau Monné
On Thu, 20 Aug 2015, David Vrabel wrote:
> On 20/08/15 09:31, Roger Pau Monné wrote:
> > On 20/08/15 at 1:44, Stefano Stabellini wrote:
> >> On Wed, 19 Aug 2015, Roger Pau Monné wrote:
> >>> My opinion is that we have already merged quite a lot of this mess in
> >>> order to support guests with different page sizes. And in this case, the
> >>> addition of code can be done to a userspace component, which is much
> >>> less risky than adding it to blkfront, also taking into account that
> >>> it's a general improvement for Qdisk that other arches can also leverage.
> >>>
> >>> So in one hand you are adding code to a kernel component, that makes the
> >>> code much more messy and can only be leveraged by ARM. On the other
> >>> hand, you can add code a user-space backend, and that code is also
> >>> beneficial for other arches. IMHO, the decision is quite clear.
> >>
> >> 64K pages not working is entirely a Linux problem, not a Xen problem.
> >> Xen uses 4K pages as usual and exports the same 4K based hypercall
> >> interface as usual. That needs to work, no matter what the guest decides
> >> to put in its own pagetables.
> >>
> >> I remind everybody that Xen interfaces on ARM and ARM64 are fully
> >> maintained for backward compatibility. Xen is not forcing Linux to use
> >> 64K pages, that's entirely a Linux decision. The issue has nothing to do
> >> with Xen.
> >>
> >> The bug here is that Linux has broken 64K pages support and that should
> >> be fixed. I don't think is reasonable to make changes to the Xen ABIs
> >> just to accommodate the brokenness of one guest kernel in a particular
> >> configuration.
> >
> > Is it a change to the ABI to mandate indirect-descriptors support in
> > order to run arm64 with 64KB guests?
>
> No (given that there is no guest using 64 KiB guest pages that works yet).
I think that this is wrong: given that the ABI is completely neutral to
the page granularity used by DomU, this is an ABI change. DomU and Dom0
can safely mix and match granularities, and using 64K pages is purely a
single-guest choice, not a platform-wide choice. One should be able to
run any DomU with your good old Dom0 kernel.
If we wanted to mandate indirect descriptors (which I think would be a
bad decision), we would have to bump a version number somewhere.
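For reference, a simplified sketch (hypothetical helper name; the xenstore key is the real, optional one) of how blkfront discovers indirect-descriptor support today, which is why mandating the feature would amount to a new demand on existing backends:

static unsigned int blkfront_max_indirect_segments(struct xenbus_device *dev)
{
	unsigned int indirect_segments;

	/* The backend advertises indirect descriptors via an optional
	 * xenstore key; if it is absent the frontend simply falls back
	 * to plain (non-indirect) requests. */
	if (xenbus_scanf(XBT_NIL, dev->otherend,
			 "feature-max-indirect-segments", "%u",
			 &indirect_segments) <= 0)
		return 0;

	return indirect_segments;
}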
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-18 6:29 [RFC] Support of non-indirect grant backend on 64KB guest Julien Grall
2015-08-18 7:09 ` Roger Pau Monné
@ 2015-08-20 17:42 ` David Vrabel
2015-08-21 1:30 ` Julien Grall
1 sibling, 1 reply; 29+ messages in thread
From: David Vrabel @ 2015-08-20 17:42 UTC (permalink / raw)
To: Julien Grall, Roger Pau Monné, Stefano Stabellini,
Konrad Rzeszutek Wilk, Ian Campbell, xen-devel@lists.xen.org
On 18/08/15 07:29, Julien Grall wrote:
> Hi,
>
> Firstly, this patch is not ready at all and mostly here for
> collecting comment about the way to do it. It's not clean so no need
> to complain about the coding style.
>
> The qdisk backend in QEMU is not supporting indirect grant, this is
> means that a request can only support 11 * 4KB = 44KB.
>
> When using 64KB page, a Linux block request (struct *request) may
> contain up to 64KB of data. This is because the block segment size
> must at least be the size of a Linux page.
You should ensure you configure the request queue with the limits that
are currently supported. In particular, this:

	/* Each segment in a request is up to an aligned page in
	   size. */
	blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
	blk_queue_max_segment_size(rq, PAGE_SIZE);

is obviously wrong with PAGE_SIZE > 44 KiB.
Get the block layer to split requests and don't do it in blkfront.
David
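Purely as an illustration (not code from this thread), "the limits that are currently supported" by a non-indirect backend could be expressed along these lines, reusing the XEN_PAGE_SIZE and BLKIF_MAX_SEGMENTS_PER_REQUEST names from blkfront; as the rest of the thread shows, blk-settings.c currently raises several of these values back up to PAGE_SIZE, which is the crux of the problem:

static void blkif_set_nonindirect_queue_limits(struct request_queue *rq)
{
	/* One non-indirect ring request carries at most 11 grants of 4KB,
	 * i.e. 44KB. */
	unsigned int max_bytes = BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE;

	blk_queue_max_segments(rq, BLKIF_MAX_SEGMENTS_PER_REQUEST);
	blk_queue_max_hw_sectors(rq, max_bytes >> 9);	/* 88 sectors */
	blk_queue_segment_boundary(rq, XEN_PAGE_SIZE - 1);
	blk_queue_max_segment_size(rq, XEN_PAGE_SIZE);
}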
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-20 17:42 ` David Vrabel
@ 2015-08-21 1:30 ` Julien Grall
2015-08-21 16:07 ` Konrad Rzeszutek Wilk
0 siblings, 1 reply; 29+ messages in thread
From: Julien Grall @ 2015-08-21 1:30 UTC (permalink / raw)
To: David Vrabel, Roger Pau Monné, Stefano Stabellini,
Konrad Rzeszutek Wilk, Ian Campbell, xen-devel@lists.xen.org
On 20/08/2015 10:42, David Vrabel wrote:
>> When using 64KB page, a Linux block request (struct *request) may
>> contain up to 64KB of data. This is because the block segment size
>> must at least be the size of a Linux page.
>
> You should ensure you configure the request queue with the limits that
> are currently supported. In particular:
>
> /* Each segment in a request is up to an aligned page in
> size. */
> blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
> blk_queue_max_segment_size(rq, PAGE_SIZE);
>
> Is obviously wrong with PAGE_SIZE > 44 KiB.
>
> Get the block later to split requests and don't do it in blkfront.
I may not have been clear enough in the paragraph you quoted. I said
that the minimum segment size supported by the block framework is a Linux
page size: if you pass a smaller value, it will be replaced with the
Linux page granularity.
It would have been handy if the block framework supported smaller sizes,
but that's not the case; have a look at the implementation of both
functions.
Cheers,
--
Julien Grall
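For reference, the clamp referred to above looks roughly like this in block/blk-settings.c of that era; blk_queue_segment_boundary() and blk_limits_max_hw_sectors() follow the same pattern (see also the patch quoted later in the thread):

void blk_queue_max_segment_size(struct request_queue *q, unsigned int max_size)
{
	/* Values below the page cache size are silently raised, so a driver
	 * cannot advertise 4KB segments on a 64KB-page kernel. */
	if (max_size < PAGE_CACHE_SIZE) {
		max_size = PAGE_CACHE_SIZE;
		printk(KERN_INFO "%s: set to minimum %d\n",
		       __func__, max_size);
	}
	q->limits.max_segment_size = max_size;
}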
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-20 17:23 ` Stefano Stabellini
@ 2015-08-21 16:05 ` Konrad Rzeszutek Wilk
2015-08-21 16:08 ` David Vrabel
0 siblings, 1 reply; 29+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-08-21 16:05 UTC (permalink / raw)
To: Stefano Stabellini
Cc: Julien Grall, Ian Campbell, Roger Pau Monné, David Vrabel,
xen-devel@lists.xen.org
On Thu, Aug 20, 2015 at 06:23:16PM +0100, Stefano Stabellini wrote:
> On Thu, 20 Aug 2015, David Vrabel wrote:
> > On 20/08/15 09:31, Roger Pau Monné wrote:
> > > El 20/08/15 a les 1.44, Stefano Stabellini ha escrit:
> > >> On Wed, 19 Aug 2015, Roger Pau Monné wrote:
> > >>> My opinion is that we have already merged quite a lot of this mess in
> > >>> order to support guests with different page sizes. And in this case, the
> > >>> addition of code can be done to a userspace component, which is much
> > >>> less risky than adding it to blkfront, also taking into account that
> > >>> it's a general improvement for Qdisk that other arches can also leverage.
> > >>>
> > >>> So in one hand you are adding code to a kernel component, that makes the
> > >>> code much more messy and can only be leveraged by ARM. On the other
> > >>> hand, you can add code a user-space backend, and that code is also
> > >>> beneficial for other arches. IMHO, the decision is quite clear.
> > >>
> > >> 64K pages not working is entirely a Linux problem, not a Xen problem.
> > >> Xen uses 4K pages as usual and exports the same 4K based hypercall
> > >> interface as usual. That needs to work, no matter what the guest decides
> > >> to put in its own pagetables.
> > >>
> > >> I remind everybody that Xen interfaces on ARM and ARM64 are fully
> > >> maintained for backward compatibility. Xen is not forcing Linux to use
> > >> 64K pages, that's entirely a Linux decision. The issue has nothing to do
> > >> with Xen.
> > >>
> > >> The bug here is that Linux has broken 64K pages support and that should
> > >> be fixed. I don't think is reasonable to make changes to the Xen ABIs
> > >> just to accommodate the brokenness of one guest kernel in a particular
> > >> configuration.
> > >
> > > Is it a change to the ABI to mandate indirect-descriptors support in
> > > order to run arm64 with 64KB guests?
> >
> > No (given that there is no guest using 64 KiB guest pages that works yet).
>
> I think that this is wrong: given that the ABI is completely neutral to
> the page granularity used by DomU, this as an ABI change. DomU and Dom0
> can safely mix and match granularity and using 64K pages is purely a
> single guest choice, not a platform wide choice. One should be able to
> run any DomU with your good old Dom0 kernel.
>
> If we wanted to mandate indirect-descriptors (which I think would be a
> bad decision), we would have to bump a version number somewhere.
I have to concur with that. We can't mandate that ARM 64k-page guests MUST
use indirect descriptors.
<sigh> I am not particularly happy about this, but the xen-blkfront changes
are the best way to make this work.
At least we don't have this problem with xen-netfront (right?!) or other
drivers?
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-21 1:30 ` Julien Grall
@ 2015-08-21 16:07 ` Konrad Rzeszutek Wilk
0 siblings, 0 replies; 29+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-08-21 16:07 UTC (permalink / raw)
To: Julien Grall
Cc: Ian Campbell, Stefano Stabellini, xen-devel@lists.xen.org,
David Vrabel, Roger Pau Monné
On Thu, Aug 20, 2015 at 06:30:17PM -0700, Julien Grall wrote:
>
>
> On 20/08/2015 10:42, David Vrabel wrote:
> >>When using 64KB page, a Linux block request (struct *request) may
> >>contain up to 64KB of data. This is because the block segment size
> >>must at least be the size of a Linux page.
> >
> >You should ensure you configure the request queue with the limits that
> >are currently supported. In particular:
> >
> > /* Each segment in a request is up to an aligned page in
> > size. */
> > blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
> > blk_queue_max_segment_size(rq, PAGE_SIZE);
> >
> >Is obviously wrong with PAGE_SIZE > 44 KiB.
> >
> >Get the block later to split requests and don't do it in blkfront.
>
> I may not have been enough clear in the paragraph you quoted. I said that
> the minimum size supported by the block framework is a linux page size. It
> means that if you pass a value smaller than that it will replace with the
> page linux granularity.
>
> It would have been handy that the block framework supports smaller size but
> it's not the case, give a look to the implementation of both function.
From a block API side one can argue that the driver is doing something
wrong by requiring < PAGE_SIZE requests. It is kind of implied that all
drivers are able to manage this and break up a 'struct request'
into multiple I/O commands if they need to.
>
> Cheers,
>
> --
> Julien Grall
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-21 16:05 ` Konrad Rzeszutek Wilk
@ 2015-08-21 16:08 ` David Vrabel
2015-08-21 16:49 ` Stefano Stabellini
2015-08-21 17:10 ` PAGE_SIZE (64KB), while block driver 'struct request' deals with < PAGE_SIZE (up to 44Kb). Was:Re: " Konrad Rzeszutek Wilk
0 siblings, 2 replies; 29+ messages in thread
From: David Vrabel @ 2015-08-21 16:08 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk, Stefano Stabellini
Cc: Julien Grall, Roger Pau Monné, Ian Campbell,
xen-devel@lists.xen.org
On 21/08/15 17:05, Konrad Rzeszutek Wilk wrote:
>
> I have to concur with that. We can't mandate that ARM 64k page MUST use
> indirect descriptors.
Then it has to be fixed in the block layer to allow < PAGE_SIZE segments
and to get the block layer to split requests for blkfront.
David
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-21 16:08 ` David Vrabel
@ 2015-08-21 16:49 ` Stefano Stabellini
2015-08-21 17:10 ` PAGE_SIZE (64KB), while block driver 'struct request' deals with < PAGE_SIZE (up to 44Kb). Was:Re: " Konrad Rzeszutek Wilk
1 sibling, 0 replies; 29+ messages in thread
From: Stefano Stabellini @ 2015-08-21 16:49 UTC (permalink / raw)
To: David Vrabel
Cc: Ian Campbell, Stefano Stabellini, xen-devel@lists.xen.org,
Julien Grall, Roger Pau Monné
On Fri, 21 Aug 2015, David Vrabel wrote:
> On 21/08/15 17:05, Konrad Rzeszutek Wilk wrote:
> >
> > I have to concur with that. We can't mandate that ARM 64k page MUST use
> > indirect descriptors.
>
> Then it has to be fixed in the block layer to allow < PAGE_SIZE segments
> and to get the block layer to split requests for blkfront.
We could ask Jens how feasible that would be. I hope this won't turn
into a ping-pong among subsystem maintainers.
^ permalink raw reply [flat|nested] 29+ messages in thread
* PAGE_SIZE (64KB), while block driver 'struct request' deals with < PAGE_SIZE (up to 44Kb). Was:Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-21 16:08 ` David Vrabel
2015-08-21 16:49 ` Stefano Stabellini
@ 2015-08-21 17:10 ` Konrad Rzeszutek Wilk
2015-08-27 17:51 ` Julien Grall
1 sibling, 1 reply; 29+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-08-21 17:10 UTC (permalink / raw)
To: David Vrabel, konrad.wilk, axboe
Cc: Julien Grall, Roger Pau Monné, xen-devel@lists.xen.org,
Ian Campbell, Stefano Stabellini
On Fri, Aug 21, 2015 at 05:08:35PM +0100, David Vrabel wrote:
> On 21/08/15 17:05, Konrad Rzeszutek Wilk wrote:
> >
> > I have to concur with that. We can't mandate that ARM 64k page MUST use
> > indirect descriptors.
>
> Then it has to be fixed in the block layer to allow < PAGE_SIZE segments
> and to get the block layer to split requests for blkfront.
Hey Jens,
I am hoping you can help us figure this problem out.
Linux on ARM is capable of using either 4KB or 64KB pages. Our block
driver (xen-blkfront) was built with 4KB pages in mind, and without using
any fancy flags (which some backends lack) the maximum amount of I/O it can
fit in a single ring request is 44KB.
This has the unfortunate effect that when xen-blkfront
gets a 'struct request' it can contain one page (64KB) and it can't actually
fit it in a ring request! And the lowest segment size it advertises is
PAGE_SIZE (64KB). I believe Julien (who found this) initially tried
advertising a smaller segment size than PAGE_SIZE (32KB). However, looking
at __blk_segment_map_sg it seems to assume the smallest size is PAGE_SIZE,
so that would explain why it did not work.
One way to make this work is for the driver (xen-blkfront) to split
the 'struct request' I/O into two internal requests.
But this has to be a common problem. Surely there are other drivers
(MMC?) that can't handle PAGE_SIZE and have to break up their I/Os.
Would it make sense for the common block code to be able to deal
with this?
Thank you!
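As a rough sketch of the splitting idea (a hypothetical helper, simplified from the discussion rather than taken from the posted RFC, and ignoring per-segment page offsets):

/*
 * With 64KB Linux pages a single page needs 16 grants of XEN_PAGE_SIZE
 * (4KB), while a plain ring request carries at most
 * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) grants, i.e. 44KB.  Anything larger
 * has to be spread over a second ring request, or use indirect
 * descriptors when the backend offers them.
 */
static bool blkif_needs_extra_req(struct request *req)
{
	unsigned int grants = DIV_ROUND_UP(blk_rq_bytes(req), XEN_PAGE_SIZE);

	return grants > BLKIF_MAX_SEGMENTS_PER_REQUEST;
}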
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: PAGE_SIZE (64KB), while block driver 'struct request' deals with < PAGE_SIZE (up to 44Kb). Was:Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-21 17:10 ` PAGE_SIZE (64KB), while block driver 'struct request' deals with < PAGE_SIZE (up to 44Kb). Was:Re: " Konrad Rzeszutek Wilk
@ 2015-08-27 17:51 ` Julien Grall
2015-09-04 14:04 ` Stefano Stabellini
0 siblings, 1 reply; 29+ messages in thread
From: Julien Grall @ 2015-08-27 17:51 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk, David Vrabel, axboe
Cc: xen-devel@lists.xen.org, Ian Campbell, Stefano Stabellini,
Roger Pau Monné
Hi,
On 21/08/15 18:10, Konrad Rzeszutek Wilk wrote:
> On Fri, Aug 21, 2015 at 05:08:35PM +0100, David Vrabel wrote:
>> On 21/08/15 17:05, Konrad Rzeszutek Wilk wrote:
>>>
>>> I have to concur with that. We can't mandate that ARM 64k page MUST use
>>> indirect descriptors.
>>
>> Then it has to be fixed in the block layer to allow < PAGE_SIZE segments
>> and to get the block layer to split requests for blkfront.
>
> Hey Jens,
>
> I am hoping you can help us figure this problem out.
>
> The Linux ARM is capable of using 4KB pages and 64KB pages. Our block
> driver (xen-blkfront) was built with 4KB pages in mind and without using
> any fancy flags (which some backends lack) the maximum amount of I/O it can
> fit on a ring is 44KB.
>
> This has the unfortunate effect that when the xen-blkfront
> gets an 'struct request' it can have on page (64KB) and it can't actually
> fit it on the ring! And the lowest segment size it advertises is PAGE_SIZE
> (64KB). I believe Julien (who found this) tried initially advertising
> smaller segment size than PAGE_SIZE (32KB). However looking at
> __blk_segment_map_sg it looks to assume smallest size is PAGE_SIZE so
> that would explain why it did not work.
To be honest, until today I hadn't tried to see how the block layer would
behave if I dropped those checks in blk-settings.c.
I don't see any assumption about PAGE_SIZE in __blk_segment_map_sg.
However, after dropping the checks in blk-settings (see the quick patch [1]),
I got the following error in the frontend:
bio too big device xvda (128 > 88)
Buffer I/O error on dev xvda, logical block 0, async page read
bio too big device xvda (128 > 88)
Buffer I/O error on dev xvda, logical block 0, async page read
The "bio too big device ..." comes from generic_make_request_checks
(linux/block/blk-core.c) and the stack trace is:
[<fffffe0000096c7c>] dump_backtrace+0x0/0x124
[<fffffe0000096db0>] show_stack+0x10/0x1c
[<fffffe00005885e8>] dump_stack+0x78/0xbc
[<fffffe00000bf7f8>] warn_slowpath_common+0x98/0xd0
[<fffffe00000bf8f0>] warn_slowpath_null+0x14/0x20
[<fffffe00002df304>] generic_make_request_checks+0x114/0x230
[<fffffe00002e0580>] generic_make_request+0x10/0x108
[<fffffe00002e070c>] submit_bio+0x94/0x1e0
[<fffffe00001d573c>] submit_bh_wbc.isra.36+0x100/0x1a8
[<fffffe00001d5bf8>] block_read_full_page+0x320/0x3e8
[<fffffe00001d877c>] blkdev_readpage+0x14/0x20
[<fffffe000014582c>] do_read_cache_page+0x16c/0x1a0
[<fffffe0000145870>] read_cache_page+0x10/0x1c
[<fffffe00002f2908>] read_dev_sector+0x30/0x9c
[<fffffe00002f3d84>] msdos_partition+0x84/0x554
[<fffffe00002f38e4>] check_partition+0xf8/0x21c
[<fffffe00002f2f28>] rescan_partitions+0xb0/0x2a4
[<fffffe00001d98b0>] __blkdev_get+0x228/0x34c
[<fffffe00001d9a14>] blkdev_get+0x40/0x364
[<fffffe00002f0b6c>] add_disk+0x398/0x424
[<fffffe00003d8500>] blkback_changed+0x1200/0x152c
[<fffffe000036a954>] xenbus_otherend_changed+0x9c/0xa4
[<fffffe000036c984>] backend_changed+0xc/0x18
[<fffffe000036a088>] xenwatch_thread+0xa0/0x13c
[<fffffe00000d98d0>] kthread+0xd8/0xf0
The fs buffer code seems to assume that the block driver will always support
at least a bio of PAGE_SIZE.
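For context, the check producing this message sits in generic_make_request_checks(); roughly (quoted from memory for that kernel generation, with 'b' a local BDEVNAME_SIZE buffer declared earlier in the function):

	if (likely(bio_is_rw(bio) &&
		   nr_sectors > queue_max_hw_sectors(q))) {
		printk(KERN_ERR "bio too big device %s (%u > %u)\n",
		       bdevname(bio->bi_bdev, b),
		       bio_sectors(bio),
		       queue_max_hw_sectors(q));
		goto end_io;
	}

The "128 > 88" above is that comparison: a 64KB bio is 128 sectors, while 11 * 4KB = 44KB is only 88 sectors.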
> One wya to make this work is for the driver (xen-blkfront) to split
> the 'struct request' I/O in two internal requests.
>
> But this has to be a normal problem. Surely there are other drivers
> (MMC?) that can't handle PAGE_SIZE and have to break their I/Os.
> Would it make sense for the common block code to be able to deal
> with this?
It will take me a bit of time to fully understand the block layer,
as the changes don't seem trivial from my point of view (I don't have
any knowledge of that code). So I will wait for feedback from Jens
before going further with this approach.
Regards,
[1] patch
diff --git a/block/blk-settings.c b/block/blk-settings.c
index e0057d0..ac024e7 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -251,12 +251,15 @@ EXPORT_SYMBOL(blk_queue_bounce_limit);
  **/
 void blk_limits_max_hw_sectors(struct queue_limits *limits, unsigned int max_hw_sectors)
 {
+#if 0
 	if ((max_hw_sectors << 9) < PAGE_CACHE_SIZE) {
 		max_hw_sectors = 1 << (PAGE_CACHE_SHIFT - 9);
 		printk(KERN_INFO "%s: set to minimum %d\n",
 		       __func__, max_hw_sectors);
 	}
+#endif
+
 	limits->max_sectors = limits->max_hw_sectors = max_hw_sectors;
 }
 EXPORT_SYMBOL(blk_limits_max_hw_sectors);
@@ -351,11 +354,14 @@ EXPORT_SYMBOL(blk_queue_max_segments);
  **/
 void blk_queue_max_segment_size(struct request_queue *q, unsigned int max_size)
 {
+#if 0
 	if (max_size < PAGE_CACHE_SIZE) {
 		max_size = PAGE_CACHE_SIZE;
 		printk(KERN_INFO "%s: set to minimum %d\n",
 		       __func__, max_size);
 	}
+#endif
+
 	q->limits.max_segment_size = max_size;
 }
@@ -777,11 +783,14 @@ EXPORT_SYMBOL_GPL(blk_queue_dma_drain);
  **/
 void blk_queue_segment_boundary(struct request_queue *q, unsigned long mask)
 {
+#if 0
 	if (mask < PAGE_CACHE_SIZE - 1) {
 		mask = PAGE_CACHE_SIZE - 1;
 		printk(KERN_INFO "%s: set to minimum %lx\n",
 		       __func__, mask);
 	}
+#endif
+
 	q->limits.seg_boundary_mask = mask;
 }
--
Julien Grall
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: PAGE_SIZE (64KB), while block driver 'struct request' deals with < PAGE_SIZE (up to 44Kb). Was:Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-08-27 17:51 ` Julien Grall
@ 2015-09-04 14:04 ` Stefano Stabellini
2015-09-04 15:41 ` Konrad Rzeszutek Wilk
0 siblings, 1 reply; 29+ messages in thread
From: Stefano Stabellini @ 2015-09-04 14:04 UTC (permalink / raw)
To: Julien Grall
Cc: Ian Campbell, Stefano Stabellini, xen-devel@lists.xen.org, axboe,
David Vrabel, Roger Pau Monné
On Thu, 27 Aug 2015, Julien Grall wrote:
> On 21/08/15 18:10, Konrad Rzeszutek Wilk wrote:
> > On Fri, Aug 21, 2015 at 05:08:35PM +0100, David Vrabel wrote:
> >> On 21/08/15 17:05, Konrad Rzeszutek Wilk wrote:
> >>>
> >>> I have to concur with that. We can't mandate that ARM 64k page MUST use
> >>> indirect descriptors.
> >>
> >> Then it has to be fixed in the block layer to allow < PAGE_SIZE segments
> >> and to get the block layer to split requests for blkfront.
> >
> > Hey Jens,
> >
> > I am hoping you can help us figure this problem out.
> >
> > The Linux ARM is capable of using 4KB pages and 64KB pages. Our block
> > driver (xen-blkfront) was built with 4KB pages in mind and without using
> > any fancy flags (which some backends lack) the maximum amount of I/O it can
> > fit on a ring is 44KB.
> >
> > This has the unfortunate effect that when the xen-blkfront
> > gets an 'struct request' it can have on page (64KB) and it can't actually
> > fit it on the ring! And the lowest segment size it advertises is PAGE_SIZE
> > (64KB). I believe Julien (who found this) tried initially advertising
> > smaller segment size than PAGE_SIZE (32KB). However looking at
> > __blk_segment_map_sg it looks to assume smallest size is PAGE_SIZE so
> > that would explain why it did not work.
>
> To be honest, I haven't tried to see how the block layer will act if I
> dropped those checks in blk-settings.c until today.
>
> I don't see any assumption about PAGE_SIZE in __blk_segment_map_sg.
> Although dropping the checks in blk-settings (see quick patch [1]),
> I got the following error in the frontend:
>
> bio too big device xvda (128 > 88)
> Buffer I/O error on dev xvda, logical block 0, async page read
> bio too big device xvda (128 > 88)
> Buffer I/O error on dev xvda, logical block 0, async page read
>
> The "bio too big device ..." comes from generic_make_request_checks
> (linux/block/blk-core.c) and the stack trace is:
>
> [<fffffe0000096c7c>] dump_backtrace+0x0/0x124
> [<fffffe0000096db0>] show_stack+0x10/0x1c
> [<fffffe00005885e8>] dump_stack+0x78/0xbc
> [<fffffe00000bf7f8>] warn_slowpath_common+0x98/0xd0
> [<fffffe00000bf8f0>] warn_slowpath_null+0x14/0x20
> [<fffffe00002df304>] generic_make_request_checks+0x114/0x230
> [<fffffe00002e0580>] generic_make_request+0x10/0x108
> [<fffffe00002e070c>] submit_bio+0x94/0x1e0
> [<fffffe00001d573c>] submit_bh_wbc.isra.36+0x100/0x1a8
> [<fffffe00001d5bf8>] block_read_full_page+0x320/0x3e8
> [<fffffe00001d877c>] blkdev_readpage+0x14/0x20
> [<fffffe000014582c>] do_read_cache_page+0x16c/0x1a0
> [<fffffe0000145870>] read_cache_page+0x10/0x1c
> [<fffffe00002f2908>] read_dev_sector+0x30/0x9c
> [<fffffe00002f3d84>] msdos_partition+0x84/0x554
> [<fffffe00002f38e4>] check_partition+0xf8/0x21c
> [<fffffe00002f2f28>] rescan_partitions+0xb0/0x2a4
> [<fffffe00001d98b0>] __blkdev_get+0x228/0x34c
> [<fffffe00001d9a14>] blkdev_get+0x40/0x364
> [<fffffe00002f0b6c>] add_disk+0x398/0x424
> [<fffffe00003d8500>] blkback_changed+0x1200/0x152c
> [<fffffe000036a954>] xenbus_otherend_changed+0x9c/0xa4
> [<fffffe000036c984>] backend_changed+0xc/0x18
> [<fffffe000036a088>] xenwatch_thread+0xa0/0x13c
> [<fffffe00000d98d0>] kthread+0xd8/0xf0
>
> The fs buffer code seems to assume that the block driver will always support
> at least a bio of PAGE_SIZE.
>
> > One wya to make this work is for the driver (xen-blkfront) to split
> > the 'struct request' I/O in two internal requests.
> >
> > But this has to be a normal problem. Surely there are other drivers
> > (MMC?) that can't handle PAGE_SIZE and have to break their I/Os.
> > Would it make sense for the common block code to be able to deal
> > with this?
>
> It will take me a bit of time to fully understand the block layer
> as the changes doesn't seem trivial from POV (I don't have any
> knowledge in it). So I will wait a feedback from Jens before
> going further on this approach.
Maybe we could fall back to the previous plan of modifying xen-blkfront
for the moment?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: PAGE_SIZE (64KB), while block driver 'struct request' deals with < PAGE_SIZE (up to 44Kb). Was:Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-09-04 14:04 ` Stefano Stabellini
@ 2015-09-04 15:41 ` Konrad Rzeszutek Wilk
2015-09-04 16:15 ` Julien Grall
0 siblings, 1 reply; 29+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-04 15:41 UTC (permalink / raw)
To: Stefano Stabellini
Cc: axboe, Ian Campbell, xen-devel@lists.xen.org, Julien Grall,
David Vrabel, Roger Pau Monné
On Fri, Sep 04, 2015 at 03:04:18PM +0100, Stefano Stabellini wrote:
> On Thu, 27 Aug 2015, Julien Grall wrote:
> > On 21/08/15 18:10, Konrad Rzeszutek Wilk wrote:
> > > On Fri, Aug 21, 2015 at 05:08:35PM +0100, David Vrabel wrote:
> > >> On 21/08/15 17:05, Konrad Rzeszutek Wilk wrote:
> > >>>
> > >>> I have to concur with that. We can't mandate that ARM 64k page MUST use
> > >>> indirect descriptors.
> > >>
> > >> Then it has to be fixed in the block layer to allow < PAGE_SIZE segments
> > >> and to get the block layer to split requests for blkfront.
> > >
> > > Hey Jens,
> > >
> > > I am hoping you can help us figure this problem out.
> > >
> > > The Linux ARM is capable of using 4KB pages and 64KB pages. Our block
> > > driver (xen-blkfront) was built with 4KB pages in mind and without using
> > > any fancy flags (which some backends lack) the maximum amount of I/O it can
> > > fit on a ring is 44KB.
> > >
> > > This has the unfortunate effect that when the xen-blkfront
> > > gets an 'struct request' it can have on page (64KB) and it can't actually
> > > fit it on the ring! And the lowest segment size it advertises is PAGE_SIZE
> > > (64KB). I believe Julien (who found this) tried initially advertising
> > > smaller segment size than PAGE_SIZE (32KB). However looking at
> > > __blk_segment_map_sg it looks to assume smallest size is PAGE_SIZE so
> > > that would explain why it did not work.
> >
> > To be honest, I haven't tried to see how the block layer will act if I
> > dropped those checks in blk-settings.c until today.
> >
> > I don't see any assumption about PAGE_SIZE in __blk_segment_map_sg.
> > Although dropping the checks in blk-settings (see quick patch [1]),
> > I got the following error in the frontend:
> >
> > bio too big device xvda (128 > 88)
> > Buffer I/O error on dev xvda, logical block 0, async page read
> > bio too big device xvda (128 > 88)
> > Buffer I/O error on dev xvda, logical block 0, async page read
> >
> > The "bio too big device ..." comes from generic_make_request_checks
> > (linux/block/blk-core.c) and the stack trace is:
> >
> > [<fffffe0000096c7c>] dump_backtrace+0x0/0x124
> > [<fffffe0000096db0>] show_stack+0x10/0x1c
> > [<fffffe00005885e8>] dump_stack+0x78/0xbc
> > [<fffffe00000bf7f8>] warn_slowpath_common+0x98/0xd0
> > [<fffffe00000bf8f0>] warn_slowpath_null+0x14/0x20
> > [<fffffe00002df304>] generic_make_request_checks+0x114/0x230
> > [<fffffe00002e0580>] generic_make_request+0x10/0x108
> > [<fffffe00002e070c>] submit_bio+0x94/0x1e0
> > [<fffffe00001d573c>] submit_bh_wbc.isra.36+0x100/0x1a8
> > [<fffffe00001d5bf8>] block_read_full_page+0x320/0x3e8
> > [<fffffe00001d877c>] blkdev_readpage+0x14/0x20
> > [<fffffe000014582c>] do_read_cache_page+0x16c/0x1a0
> > [<fffffe0000145870>] read_cache_page+0x10/0x1c
> > [<fffffe00002f2908>] read_dev_sector+0x30/0x9c
> > [<fffffe00002f3d84>] msdos_partition+0x84/0x554
> > [<fffffe00002f38e4>] check_partition+0xf8/0x21c
> > [<fffffe00002f2f28>] rescan_partitions+0xb0/0x2a4
> > [<fffffe00001d98b0>] __blkdev_get+0x228/0x34c
> > [<fffffe00001d9a14>] blkdev_get+0x40/0x364
> > [<fffffe00002f0b6c>] add_disk+0x398/0x424
> > [<fffffe00003d8500>] blkback_changed+0x1200/0x152c
> > [<fffffe000036a954>] xenbus_otherend_changed+0x9c/0xa4
> > [<fffffe000036c984>] backend_changed+0xc/0x18
> > [<fffffe000036a088>] xenwatch_thread+0xa0/0x13c
> > [<fffffe00000d98d0>] kthread+0xd8/0xf0
> >
> > The fs buffer code seems to assume that the block driver will always support
> > at least a bio of PAGE_SIZE.
> >
> > > One wya to make this work is for the driver (xen-blkfront) to split
> > > the 'struct request' I/O in two internal requests.
> > >
> > > But this has to be a normal problem. Surely there are other drivers
> > > (MMC?) that can't handle PAGE_SIZE and have to break their I/Os.
> > > Would it make sense for the common block code to be able to deal
> > > with this?
> >
> > It will take me a bit of time to fully understand the block layer
> > as the changes doesn't seem trivial from POV (I don't have any
> > knowledge in it). So I will wait a feedback from Jens before
> > going further on this approach.
>
> Maybe we could fall back to the previous plan of modifying xen-blkfront
> for the moment?
Which afaic needs to be reposted?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: PAGE_SIZE (64KB), while block driver 'struct request' deals with < PAGE_SIZE (up to 44Kb). Was:Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-09-04 15:41 ` Konrad Rzeszutek Wilk
@ 2015-09-04 16:15 ` Julien Grall
2015-09-04 17:32 ` Konrad Rzeszutek Wilk
0 siblings, 1 reply; 29+ messages in thread
From: Julien Grall @ 2015-09-04 16:15 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk, Stefano Stabellini
Cc: axboe, Ian Campbell, xen-devel@lists.xen.org, David Vrabel,
Roger Pau Monné
On 04/09/15 16:41, Konrad Rzeszutek Wilk wrote:
>> Maybe we could fall back to the previous plan of modifying xen-blkfront
>> for the moment?
>
> Which afaic need to be reposted?
Right, although I didn't see any comments on the patch itself; all the
comments were about the problem. Does that mean that you and Roger are
"happy" with the way it's done?
Regards,
--
Julien Grall
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: PAGE_SIZE (64KB), while block driver 'struct request' deals with < PAGE_SIZE (up to 44Kb). Was:Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-09-04 16:15 ` Julien Grall
@ 2015-09-04 17:32 ` Konrad Rzeszutek Wilk
2015-09-04 22:05 ` Julien Grall
0 siblings, 1 reply; 29+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-04 17:32 UTC (permalink / raw)
To: Julien Grall
Cc: Ian Campbell, Stefano Stabellini, xen-devel@lists.xen.org, axboe,
David Vrabel, Roger Pau Monné
On Fri, Sep 04, 2015 at 05:15:13PM +0100, Julien Grall wrote:
> On 04/09/15 16:41, Konrad Rzeszutek Wilk wrote:
> >> Maybe we could fall back to the previous plan of modifying xen-blkfront
> >> for the moment?
> >
> > Which afaic need to be reposted?
>
> Right. Although I didn't see any comment on the patch. All the comments
> was about the problem. Does it mean that you and Roger are "happy" with
> the way it's done?
There were some #ifdefs I think? And I recall seeing the segment limit being
advertised as PAGE_SIZE / 2?
I dug into the other block drivers to figure out what they do when the
underlying storage cannot handle a full PAGE_SIZE of data, and I couldn't
find any. Which means this should really be dealt with on the driver side
(xen-blkfront) as a quirk.
Anyhow, what I am going to do is - once it is reposted - force the driver
under x86 to work under this condition, that is, disable persistent grant
support and only use 2048 .. and then drive some I/O.
If all goes well I will send it in a git pull to Jens.
>
> Regards,
>
> --
> Julien Grall
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: PAGE_SIZE (64KB), while block driver 'struct request' deals with < PAGE_SIZE (up to 44Kb). Was:Re: [RFC] Support of non-indirect grant backend on 64KB guest
2015-09-04 17:32 ` Konrad Rzeszutek Wilk
@ 2015-09-04 22:05 ` Julien Grall
0 siblings, 0 replies; 29+ messages in thread
From: Julien Grall @ 2015-09-04 22:05 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: Ian Campbell, Stefano Stabellini, xen-devel@lists.xen.org, axboe,
David Vrabel, Roger Pau Monné
Hi Konrad,
On 04/09/2015 18:32, Konrad Rzeszutek Wilk wrote:
> On Fri, Sep 04, 2015 at 05:15:13PM +0100, Julien Grall wrote:
>> On 04/09/15 16:41, Konrad Rzeszutek Wilk wrote:
>>>> Maybe we could fall back to the previous plan of modifying xen-blkfront
>>>> for the moment?
>>>
>>> Which afaic need to be reposted?
>>
>> Right. Although I didn't see any comment on the patch. All the comments
>> was about the problem. Does it mean that you and Roger are "happy" with
>> the way it's done?
>
> There were some #idef I think? And I recall seeing the segment limit being
> advertised as PAGE_SIZE / 2?
There is no #ifdef, but there is a PAGE_SIZE / 2. I know about the latter
and plan to replace it; it was only there to have a quick patch that
exposes my problem.
> I dug in the other block drivers to figure out what they
> do when the underlaying storage cannot handle < PAGE_SIZE data.
> And I couldn't find them. Which means this should be really dealt
> on the drivers side (xen-blkfront) as an quirk.
>
> Anyhow what I am going to do is - once it is reposted, force the
> driver under x86 to work under this condition. That is disable
> persistent support and only use 2048 .. and then drive some IO.
> If all goes well I will send it in a git pull to Jens.
It will also depend on the 64KB series, which I plan to repost next week.
I will send it as a follow-up.
To test it properly on x86 it will be necessary to drop one BUG_ON in the
code which ensures that the extra request is never used with 4KB pages
(see BUG_ON((XEN_PAGE_SIZE == PAGE_SIZE) && require_extra_req)).
Regards,
--
Julien Grall
^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2015-09-04 22:05 UTC | newest]
Thread overview: 29+ messages
2015-08-18 6:29 [RFC] Support of non-indirect grant backend on 64KB guest Julien Grall
2015-08-18 7:09 ` Roger Pau Monné
2015-08-18 7:26 ` Jan Beulich
2015-08-18 18:45 ` Julien Grall
2015-08-19 8:50 ` Roger Pau Monné
2015-08-19 14:54 ` Julien Grall
2015-08-19 15:17 ` Roger Pau Monné
2015-08-19 15:52 ` Julien Grall
2015-08-19 23:44 ` Stefano Stabellini
2015-08-20 8:31 ` Roger Pau Monné
2015-08-20 9:43 ` David Vrabel
2015-08-20 16:16 ` Julien Grall
2015-08-20 17:23 ` Stefano Stabellini
2015-08-21 16:05 ` Konrad Rzeszutek Wilk
2015-08-21 16:08 ` David Vrabel
2015-08-21 16:49 ` Stefano Stabellini
2015-08-21 17:10 ` PAGE_SIZE (64KB), while block driver 'struct request' deals with < PAGE_SIZE (up to 44Kb). Was:Re: " Konrad Rzeszutek Wilk
2015-08-27 17:51 ` Julien Grall
2015-09-04 14:04 ` Stefano Stabellini
2015-09-04 15:41 ` Konrad Rzeszutek Wilk
2015-09-04 16:15 ` Julien Grall
2015-09-04 17:32 ` Konrad Rzeszutek Wilk
2015-09-04 22:05 ` Julien Grall
2015-08-20 9:37 ` Jan Beulich
2015-08-19 8:58 ` Jan Beulich
2015-08-19 15:25 ` Julien Grall
2015-08-20 17:42 ` David Vrabel
2015-08-21 1:30 ` Julien Grall
2015-08-21 16:07 ` Konrad Rzeszutek Wilk
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).