* Block ring protocol (segment expansion, multi-page, etc).
From: Konrad Rzeszutek Wilk @ 2012-09-05 13:29 UTC (permalink / raw)
To: ronghui.duan, justing, donald.d.dugger, JBeulich, xen-devel
Please correct me if I got something wrong.
About two or three years ago Citrix (and Red Hat I think?) posted a
multi-page extension protocol (max-ring-page-order, max-ring-pages,
ring-page-order and ring-pages), which never got upstream (it just
needed to be rebased on the driver that went into the kernel, I think?).
Then about a year ago SpectraLogic started enhancing the FreeBSD variant
of blkback and realized, as Ronghui also did, that just doing a
multi-page extension is not enough. The issue was that if one just
expanded to a ring composed of two pages, 1/4 of a page was wasted
because the number of segments per request is constrained to 11.
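To make the 1/4-page figure concrete, here is a rough back-of-the-envelope
sketch in Python. It assumes a 4096-byte page, a 64-byte shared-ring
header, a 112-byte request slot (the 64-bit layout with 11 inline
segments), and a slot count rounded down to a power of two. Those numbers
and the helper names are assumptions for illustration; none of them is
stated in this thread.

```python
PAGE_SIZE = 4096    # assumed page size in bytes
RING_HEADER = 64    # assumed size of the shared-ring header
SLOT_SIZE = 112     # assumed size of one request slot (11 inline segments)


def round_down_pow2(n):
    """Largest power of two that is <= n (n must be >= 1)."""
    return 1 << (n.bit_length() - 1)


def ring_layout(pages):
    """Return (usable slots, unused bytes) for a ring spanning `pages` pages."""
    usable = pages * PAGE_SIZE - RING_HEADER
    slots = round_down_pow2(usable // SLOT_SIZE)
    return slots, usable - slots * SLOT_SIZE


for pages in (1, 2, 4):
    slots, unused = ring_layout(pages)
    print(f"{pages} page(s): {slots} slots, {unused} bytes unused")
```

Under these assumptions a two-page ring holds 64 slots and leaves 960
bytes unused, which is roughly a quarter of a page.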
Justin (SpectraLogic) came up with a protocol enhancement where the
existing blkif protocol stays the same, but
BLKIF_MAX_SEGMENTS_PER_REQUEST is negotiated via max-request-segments.
Then there is max-request-size, which combines the segment size and the
size of the ring to give you an idea of the biggest I/O you can fit on a
ring in a single transaction. This solves the wastage problem and
expands the ring.
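As a rough illustration of what negotiating the segment count could look
like, here is a minimal Python sketch. It is not Justin's implementation:
the function name, the use of None for a peer that does not advertise
max-request-segments, and the fallback to the legacy limit of 11 are all
assumptions made for this example.

```python
LEGACY_MAX_SEGMENTS = 11  # per-request segment limit of the existing protocol


def negotiate_segments(frontend_max, backend_max):
    """Pick a per-request segment count that both ends can handle.

    None stands for a peer that did not advertise max-request-segments.
    Treating that case as the legacy limit is an assumption of this sketch.
    """
    front = LEGACY_MAX_SEGMENTS if frontend_max is None else frontend_max
    back = LEGACY_MAX_SEGMENTS if backend_max is None else backend_max
    # Neither side may be asked to handle more than it advertised.
    return min(front, back)


print(negotiate_segments(255, 128))   # both ends advertise a limit
print(negotiate_segments(None, 255))  # old frontend, new backend
```

The point of taking the minimum is that an old peer keeps working
unchanged, which matches the idea that the existing blkif protocol stays
the same.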
Ronghui did something similar, but instead of re-using the existing
blkif structure he split it in two. One ring is for
blkif_request_header (which has the segments ripped out), and the other
is just for blkif_request_segments. This solves the wastage and also
allows the ring to be expanded.
The three major outstanding issues that exist with the current protocol
that I know of are:
- We split up the I/O requests. This ends up eating a lot of CPU
cycles.
- We might have huge I/O requests. Justin mentioned 1MB single I/Os -
and to fit that on a ring it has to be .. well, be able to fit 256
segments. Jan mentioned 256kB for SCSI - since the protocol
extensions here could very well be carried over.
- Concurrent usage. If we have more than 4 VBDs, blkback suffers when it
tries to get a page, as there is a "global" pool shared across all
guests instead of something 'per guest' or 'per VBD'.
So.. Ronghui - I am curious as to why you chose the path of making two
separate rings? Was the mechanism that Justin came up with not really
that good, or was this just easier to implement?
Thanks.
* Re: Block ring protocol (segment expansion, multi-page, etc).
From: Konrad Rzeszutek Wilk @ 2012-09-06 10:47 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk, Oliver.Chick
Cc: justing, ronghui.duan, JBeulich, donald.d.dugger, xen-devel
On Wed, Sep 05, 2012 at 09:29:21AM -0400, Konrad Rzeszutek Wilk wrote:
> Please correct me if I got something wrong.
CC-ing a Citrix person who has also expressed interest in implementing
persistent grants in the block backend.
>
> About two or three years ago Citrix (and Red Hat I think?) posted a
> multi-page extension protocol (max-ring-page-order, max-ring-pages,
> ring-page-order and ring-pages), which never got upstream (it just
> needed to be rebased on the driver that went into the kernel, I think?).
>
> Then about a year ago SpectraLogic started enhancing the FreeBSD variant
> of blkback and realized, as Ronghui also did, that just doing a
> multi-page extension is not enough. The issue was that if one just
> expanded to a ring composed of two pages, 1/4 of a page was wasted
> because the number of segments per request is constrained to 11.
>
> Justin (SpectraLogic) came up with a protocol enhancement where the
> existing blkif protocol stays the same, but
> BLKIF_MAX_SEGMENTS_PER_REQUEST is negotiated via max-request-segments.
> Then there is max-request-size, which combines the segment size and the
> size of the ring to give you an idea of the biggest I/O you can fit on a
> ring in a single transaction. This solves the wastage problem and
> expands the ring.
>
> Ronghui did something similar, but instead of re-using the existing
> blkif structure he split it in two. One ring is for
> blkif_request_header (which has the segments ripped out), and the other
> is just for blkif_request_segments. This solves the wastage and also
> allows the ring to be expanded.
>
> The three major outstanding issues that exist with the current protocol
> that I know of are:
> - We split up the I/O requests. This ends up eating a lot of CPU
> cycles.
> - We might have huge I/O requests. Justin mentioned 1MB single I/Os -
> and to fit that on a ring it has to be .. well, be able to fit 256
> segments. Jan mentioned 256kB for SCSI - since the protocol
> extensions here could very well be carried over.
> - Concurrent usage. If we have more than 4 VBDs, blkback suffers when it
> tries to get a page, as there is a "global" pool shared across all
> guests instead of something 'per guest' or 'per VBD'.
>
> So.. Ronghui - I am curious as to why you chose the path of making two
> separate rings? Was the mechanism that Justin came up with not really
> that good, or was this just easier to implement?
>
> Thanks.
* Re: Block ring protocol (segment expansion, multi-page, etc).
From: Jan Beulich @ 2012-09-06 11:02 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: justing, ronghui.duan, donald.d.dugger, xen-devel
>>> On 05.09.12 at 15:29, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> The three major outstanding issues that exist with the current protocol
> that I know of are:
> - We split up the I/O requests. This ends up eating a lot of CPU
> cycles.
> - We might have huge I/O requests. Justin mentioned 1MB single I/Os -
> and to fit that on a ring it has to be .. well, be able to fit 256
> segments. Jan mentioned 256kB for SCSI - since the protocol
> extensions here could very well be carried over.
This one is at least partly solved with the higher segment count.
With Justin's scheme, up to 255 segments (i.e. slightly less than
1MB) can be transferred at a time. With Ronghui's scheme (and
provided the segment count field is wider than a byte), there
shouldn't be any really limiting upper bound anymore.
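The "slightly less than 1MB" figure follows from the width of the segment
count field. A small Python sketch of the arithmetic, assuming every
segment covers one full 4096-byte page and the count is an unsigned field
(the function names are made up for this example):

```python
PAGE_SIZE = 4096  # assumed bytes per segment (one full page)


def max_io_bytes(count_field_bits):
    """Largest single request when the segment count is an unsigned field
    of the given width and every segment covers a full page."""
    max_segments = (1 << count_field_bits) - 1
    return max_segments * PAGE_SIZE


def segments_needed(io_bytes):
    """Page-sized segments required for an I/O of io_bytes, rounded up."""
    return -(-io_bytes // PAGE_SIZE)


one_mib = 1 << 20
print(max_io_bytes(8), "bytes with an 8-bit count")
print(segments_needed(one_mib), "segments needed for 1 MiB")
print(max_io_bytes(16), "bytes with a 16-bit count")
```

With an 8-bit count the limit is 255 segments, or 1044480 bytes, which is
one page short of 1 MiB. A full 1 MiB request needs 256 segments, so it
only fits once the count field is wider than a byte.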
> - Concurrent usage. If we have more than 4 VBDs, blkback suffers when it
> tries to get a page, as there is a "global" pool shared across all
> guests instead of something 'per guest' or 'per VBD'.
Per-VBD would be what we currently have, where a pointlessly large
number of pages is set aside for little-used VBDs. Per-guest
is what I think it needs to be (to prevent multiple guests from
starving one another).
But then it's not just the page pool, but also the number
of grants used/mapped - without a command line override there
are 32 map track frames, allowing 32k grants to be mapped in a
single domain (e.g. Dom0). Scaling the larger segment and
request counts with the number of guests, and considering that
other backends also need to be able to do their jobs, this could
become a noticeable limit quite quickly (especially considering
that failed grant map operations fail the request in the backend
rather than deferring it, at least when GNTST_no_device_space
gets returned).
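A quick Python sketch of how fast that limit can be reached. It takes the
32k figure from above and assumes a worst case where every slot of every
ring is in flight, each segment needs one mapped grant, and no other
backend maps anything. The ring sizes, segment counts, and function names
are illustrative assumptions, not measurements.

```python
MAPTRACK_GRANT_LIMIT = 32 * 1024  # 32k mappable grants, as quoted above


def grants_at_full_load(vbds, ring_slots, segments_per_request):
    """Grants mapped if every slot of every ring is in flight at once,
    assuming one mapped grant per segment."""
    return vbds * ring_slots * segments_per_request


def max_vbds(ring_slots, segments_per_request, limit=MAPTRACK_GRANT_LIMIT):
    """How many fully loaded VBDs fit under the grant limit."""
    return limit // (ring_slots * segments_per_request)


# Assumed configurations: (ring slots, segments per request).
for slots, segs in ((32, 11), (32, 255), (64, 255)):
    print(f"{slots} slots x {segs} segments: "
          f"{max_vbds(slots, segs)} fully loaded VBDs")
```

Under these assumptions a 32-slot ring with 11 segments per request
leaves room for 93 fully loaded VBDs, while 255 segments per request
drops that to 4, and doubling the ring to 64 slots drops it to 2.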
Jan