From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: "axboe@kernel.dk" <axboe@kernel.dk>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	Felipe Franciosi <felipe.franciosi@citrix.com>,
	"martin.petersen@oracle.com" <martin.petersen@oracle.com>,
	"matthew@wil.cx" <matthew@wil.cx>
Subject: Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
Date: Wed, 20 Feb 2013 16:31:24 -0500	[thread overview]
Message-ID: <20130220213124.GA5292@phenom.dumpdata.com> (raw)
In-Reply-To: <20130122194640.GB10733@phenom.dumpdata.com>

> > > Which has a nice power of two ring to it. (Ah the puns!)
> > > 
> > > I like the idea of putting the request on a diet - but too much
> > > could cause us to miss the opportunity to insert other flags on it.
> > > If I recall correctly, the DIF/DIX only need 8 bytes of data.
> > > If we make the assumption that:
> > >         I/O request = one ring entry
> > 
> > So we only need to reserve 8 bytes for each DIF/DIX, even if the request
> > contains a variable amount of data? (I mean, block requests can at a
> > minimum contain 4096 bytes, or much more)
> 
> I need to double check with Martin (CC-ed here). But my recollection
> is that it is just attached to the 'bio'. So if the BIO is 4K or 1MB -
> it would only have one DIF/DIX data type.

And that is semi-correct. If the user did a horrible job (say using
dd) the pages are chained together - and we end up with a linked list
of bios. The last bio would point to a page filled with checksums -
one per sector's worth of data. Each checksum occupies 8 bytes. So if
the total 'bio' length is, say, 1MB, this last page is filled with 256
checksums - 2048 bytes of data.

> 
> Hmm, but then we operate on the 'struct request' so that might not
> be the case..
> > 
> > > and the "one ring entry" can use the '4' grants if we just have a
> > > 16KB I/O request, but if it is more than that - we use the indirect page
> > 
> > Well, in my proposal I've limited the number of segments of a "rw"
> > request to 2, so it's only 8K; anything bigger has to use indirect
> > descriptors, which can fit 4M of data (because I'm passing 4 grant
> > frames full of "blkif_request_indirect_entry" entries).
> 
> <nods>
> > 
> > > and can stuff 1MB of data in there.
> > > The extra 32-bytes of space for such things as 'DIF/DIX'. This also
> > > means we could unify the 'struct request' with the 'discard' operation
> > > and it could utilize the 32-bytes of extra unused payload data.
> > > 
> > >>>
> > >>>
> > >>> The ‘operation’ would be BLKIF_OP_INDIRECT. The read/write/discard,
> > >>> etc operation would now be in indirect.op. The indirect.gref points to
> > >>> a page that is filled with:
> > >>>
> > >>>
> > >>> struct blkif_request_indirect_entry {
> > >>>         blkif_sector_t sector_number;
> > >>>         struct blkif_request_segment seg;
> > >>> } __attribute__((__packed__));
> > >>> //16 bytes, so we can fit in a page 256 of these structures.
> > >>>
> > >>>
> > >>> This means that with the existing 36 slots in the ring (single page)
> > >>> we can cover: 32 slots * each blkif_request_indirect covers: 256 * 4096
> > >>> ~= 32M. If we don’t want to use indirect descriptor we can still use
> > >>> up to 4 pages of the request (as it has enough space to contain four
> > >>> segments and the structure will still be cache-aligned).
> > >>>


Martin asked me why we even do this via these entries. That is, why
have this tuple of information for each page: <lba, first_sect, last_sect, gref>?
The lba on each subsequent indirect entry is just going to be incremented by
one, and the first_sect and last_sect likewise... So why not just do:

struct blkif_request_indirect {
        uint8_t        operation;
        blkif_vdev_t   handle;       /* only for read/write requests         */
#ifdef CONFIG_X86_64
        uint32_t       _pad1;        /* offsetof(blkif_request,u.rw.id) == 8 */
#endif
        uint64_t       id;           /* private guest value, echoed in resp  */
        blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */

        grant_ref_t    indirect_desc;
        uint16_t       nr_elems;
};

And the 'indirect_desc' would point to a page that looks quite close to
what the scatterlist looks like:

	struct indirect_chain {
		uint16_t	op_flag;	/* can be D_NEXT, D_START, D_END? */
		uint16_t	next;
		uint16_t	offset;
		uint16_t	length;
		uint32_t	gref;
		uint32_t	_pad;		/* in case we ever want to make
						 * gref + _pad a physical addr. */
	};

And the page itself would be:
	struct indirect_chain[256];

The 'next' field would just contain the index of the next entry inside the
indirect_chain page - so from 0 to 255. The offset and length would describe
where in the page the data is contained.

This way the 'lba' information is part of the 'blkif_request_indirect' and the
payload info is all in the indirect descriptors.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
