From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: "axboe@kernel.dk" <axboe@kernel.dk>,
"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
Felipe Franciosi <felipe.franciosi@citrix.com>,
"martin.petersen@oracle.com" <martin.petersen@oracle.com>,
"matthew@wil.cx" <matthew@wil.cx>
Subject: Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
Date: Wed, 20 Feb 2013 16:31:24 -0500
Message-ID: <20130220213124.GA5292@phenom.dumpdata.com>
In-Reply-To: <20130122194640.GB10733@phenom.dumpdata.com>
> > > Which has a nice power of two ring to it. (Ah the puns!)
> > >
> > > I like the idea of putting the request on a diet - but trimming too
> > > much could cause us to miss the opportunity to insert other flags on it.
> > > If I recall correctly, the DIF/DIX only needs 8 bytes of data.
> > > If we make the assumption that:
> > > I/O request = one ring entry
> >
> > So we only need to reserve 8 bytes for each DIF/DIX, even if the request
> > contains a variable amount of data? (I mean, block requests can at a
> > minimum contain 4096 bytes, or much more)
>
> I need to double check with Martin (CC-ed here). But my recollection
> is that it is just attached to the 'bio'. So if the BIO is 4K or 1MB -
> it would only have one DIF/DIX data type.
And that is semi-correct. If the user did a horrible job (say using
dd) the pages are chained together - and we end up with a linked list
of bios. The last bio would point to a page filled with checksums - one
for each sector's worth of data. Each checksum occupies 8 bytes. So if the
total bio length is, say, 1MB, this last page is filled with 256
checksums - so 2048 bytes of data.
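For reference, the 8 bytes per checksum come from the T10 protection
information tuple defined by the standard - this layout is from the spec,
not something introduced in this thread, and the field names are just the
conventional ones:

struct dif_tuple {
        uint16_t guard_tag; /* CRC-16 over the data block (big-endian on disk) */
        uint16_t app_tag;   /* application/owner tag                           */
        uint32_t ref_tag;   /* typically the low 32 bits of the LBA            */
};                          /* 8 bytes - one per sector's worth of data        */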
>
> Hmm, but then we operate on the 'struct request', so that might not
> be the case...
> >
> > > and the "one ring entry" can use the '4' grants if we just have a
> > > 16KB I/O request, but if it is more than that - we use the indirect page
> >
> > Well, in my proposal I've limited the number of segments of a "rw"
> > request to 2, so it's only 8K; anything bigger has to use indirect
> > descriptors, which can fit 4M of data (because I'm passing 4 grant
> > frames full of "blkif_request_indirect_entry" entries).
>
> <nods>
> >
> > > and can stuff 1MB of data in there.
> > > That leaves an extra 32 bytes of space for such things as DIF/DIX. This also
> > > means we could unify the 'struct request' with the 'discard' operation
> > > and it could utilize the 32 bytes of extra unused payload data.
> > >
> > >>>
> > >>>
> > >>> The 'operation' would be BLKIF_OP_INDIRECT. The read/write/discard,
> > >>> etc. operation would now be in indirect.op. The indirect.gref points to
> > >>> a page that is filled with:
> > >>>
> > >>>
> > >>> struct blkif_request_indirect_entry {
> > >>>         blkif_sector_t sector_number;
> > >>>         struct blkif_request_segment seg;
> > >>> } __attribute__((__packed__));
> > >>> /* 16 bytes, so we can fit 256 of these structures in a page. */
> > >>>
> > >>>
> > >>> This means that with the existing 32 slots in the ring (single page)
> > >>> we can cover: 32 slots * (256 * 4096 bytes per blkif_request_indirect)
> > >>> = 32M. If we don't want to use indirect descriptors we can still use
> > >>> up to 4 pages of the request (as it has enough space to contain four
> > >>> segments and the structure will still be cache-aligned).
> > >>>
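(To spell out the arithmetic above: 32 ring slots * 256 indirect entries
per page * 4096 bytes per entry = 33554432 bytes = 32MB exactly.)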
Martin asked me why we even do this via these entries. Meaning, why
have this tuple of information for each page: <lba, first_sect, last_sect, gref>?
The lba on each subsequent indirect entry is just going to be incremented by
one, and the first_sect and last_sect likewise. So why not just do:
struct blkif_request_indirect {
        uint8_t        operation;
        blkif_vdev_t   handle;        /* only for read/write requests         */
#ifdef CONFIG_X86_64
        uint32_t       _pad1;         /* offsetof(blkif_request,u.rw.id) == 8 */
#endif
        uint64_t       id;            /* private guest value, echoed in resp  */
        blkif_sector_t sector_number; /* start sector idx on disk (r/w only)  */
        grant_ref_t    indirect_desc;
        uint16_t       nr_elems;
};
And the 'indirect_desc' would point to a page that looks quite close to
what a scatterlist looks like:
struct indirect_chain {
        uint16_t op_flag; /* can be D_NEXT, D_START, or D_END? */
        uint16_t next;
        uint16_t offset;
        uint16_t length;
        uint32_t gref;
        uint32_t _pad;    /* need this in case we ever want to
                           * make gref + _pad be a physical addr */
};
And the page itself would be:
        struct indirect_chain chain[256];
The 'next' field would just contain the index within the indirect_chain page -
so from 0 to 255. The offset and length would describe where in the page the
data is contained.
This way the 'lba' information is part of the 'blkif_request_indirect' and the
payload info is all in the indirect descriptors.
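To make the intent concrete, here is a rough sketch of how a backend could
walk such a page. Only the layout above is from the proposal; the
submit_segment() helper, the D_* flag semantics, and the start-at-index-0
convention are hypothetical, just for illustration:

static void walk_indirect_chain(struct indirect_chain *chain,
                                uint16_t nr_elems, uint64_t lba)
{
        uint16_t i = 0;                 /* assume D_START lives at index 0 */

        while (nr_elems--) {
                struct indirect_chain *e = &chain[i];

                /* e->gref maps the data page; offset/length say where
                 * in that page the payload actually lives. */
                submit_segment(lba, e->gref, e->offset, e->length);

                lba += e->length >> 9;  /* advance by sectors consumed */
                if (e->op_flag & D_END)
                        break;
                i = e->next;            /* index 0..255 within this page */
        }
}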