Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: "axboe@kernel.dk" <axboe@kernel.dk>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	Felipe Franciosi <felipe.franciosi@citrix.com>,
	"martin.petersen@oracle.com" <martin.petersen@oracle.com>,
	"matthew@wil.cx" <matthew@wil.cx>
Subject: Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
Date: Wed, 20 Feb 2013 16:31:24 -0500
Message-ID: <20130220213124.GA5292@phenom.dumpdata.com>
In-Reply-To: <20130122194640.GB10733@phenom.dumpdata.com>

> > > Which has a nice power of two ring to it. (Ah the puns!)
> > > 
> > > I like the idea of putting the request on a diet - but too much
> > > could cause us to miss the opportunity to insert other flags on it.
> > > If I recall correctly, the DIF/DIX only need 8 bytes of data.
> > > If we make the assumption that:
> > >         I/O request = one ring entry
> > 
> > So we only need to reserve 8 bytes for each DIF/DIX, even if the request
> > contains a variable amount of data? (I mean, block requests can at a
> > minimum contain 4096 bytes, or much more)
> 
> I need to double check with Martin (CC-ed here). But my recollection
> is that it is just attached to the 'bio'. So if the BIO is 4K or 1MB -
> it would only have one DIF/DIX data type.

And that is semi-correct. If the user did a horrible job (say using
dd) the pages are chained together - and we end up with a linked list
of bio's. The last bio would point to a page filled with checksums, one
for each 'sector's worth of data. Each checksum occupies 8 bytes. So if
the total 'bio' length is say 1MB, this last page is filled with 256
checksums - so 2048 bytes of data.
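
To make the arithmetic above concrete, here is a rough sketch in C. The
helper names and the 'interval_bytes' parameter are made up for
illustration and are not from any existing header. Note that the 256
figure works out if one checksum covers 4096 bytes of data - that
interval size is an assumption here, and with 512-byte intervals the
same 1MB would need 2048 checksums (16384 bytes) instead.

```c
#include <assert.h>
#include <stddef.h>

/* From the thread: each checksum occupies 8 bytes. */
#define DIF_TUPLE_BYTES 8u

/*
 * Number of checksums needed for an I/O of io_bytes, assuming one
 * checksum per interval_bytes of data (hypothetical parameter).
 */
static inline size_t dif_tuple_count(size_t io_bytes, size_t interval_bytes)
{
    assert(interval_bytes != 0);
    /* Round up so a partial trailing interval still gets a checksum. */
    return (io_bytes + interval_bytes - 1) / interval_bytes;
}

/* Bytes of checksum payload that travel alongside the data. */
static inline size_t dif_payload_bytes(size_t io_bytes, size_t interval_bytes)
{
    return dif_tuple_count(io_bytes, interval_bytes) * DIF_TUPLE_BYTES;
}
```

Calling dif_payload_bytes(1 << 20, 4096) gives the 2048 bytes mentioned
above, which still fits in a single 4K page.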

> 
> Hmm, but then we operate on the 'struct request' so that might not
> be the case..
> > 
> > > and the "one ring entry" can use the '4' grants if we just have a
> > > 16KB I/O request, but if it is more than that - we use the indirect page
> > 
> > Well, in my proposal I've limited the number of segments of a "rw"
> > request to 2, so it's only 8K, anything bigger has to use indirect
> > descriptors, which can fit 4M of data (because I'm passing 4 grant
> > frames full of "blkif_request_indirect_entry" entries).
> 
> <nods>
> > 
> > > and can stuff 1MB of data in there.
> > > The extra 32-bytes of space for such things as 'DIF/DIX'. This also
> > > means we could unify the 'struct request' with the 'discard' operation
> > > and it could utilize the 32-bytes of extra unused payload data.
> > > 
> > >>>
> > >>>
> > >>> The ‘operation’ would be BLKIF_OP_INDIRECT. The read/write/discard,
> > >>> etc operation would now be in indirect.op. The indirect.gref points to
> > >>> a page that is filled with:
> > >>>
> > >>>
> > >>> struct blkif_request_indirect_entry {
> > >>>         blkif_sector_t sector_number;
> > >>>         struct blkif_request_segment seg;
> > >>> } __attribute__((__packed__));
> > >>> //16 bytes, so we can fit in a page 256 of these structures.
> > >>>
> > >>>
> > >>> This means that with the existing 36 slots in the ring (single page)
> > >>> we can cover: 32 slots * each blkif_request_indirect covers: 256 * 4096
> > >>> ~= 32M. If we don’t want to use indirect descriptor we can still use
> > >>> up to 4 pages of the request (as it has enough space to contain four
> > >>> segments and the structure will still be cache-aligned).
> > >>>


Martin asked me why we even do this via these entries. Meaning, why
have this tuple of information for each page: <lba, first_sect, last_sect, gref>?
The lba on each subsequent indirect entry is just going to be incremented by
one. The same goes for first_sect and last_sect... So why not just do:

struct blkif_request_indirect {
        uint8_t        operation;
        blkif_vdev_t   handle;       /* only for read/write requests         */
#ifdef CONFIG_X86_64
        uint32_t       _pad1;        /* offsetof(blkif_request,u.rw.id) == 8 */
#endif
        uint64_t       id;           /* private guest value, echoed in resp  */
        blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */

        grant_ref_t    indirect_desc;
        uint16_t       nr_elems;
};

And the 'indirect_desc' would point to a page that looks quite close to
what the scatterlist looks like:

	struct indirect_chain {
		uint16_t	op_flag;	/* Can be D_NEXT, D_START, D_END? */
		uint16_t	next;
		uint16_t	offset;
		uint16_t	length;
		uint32_t	gref;
		uint32_t	_pad;		/* Need this in case we ever want to
						 * make gref + _pad be a physical addr. */
	};

And the page itself would be:
	struct indirect_chain chain[256];	/* 256 * 16 bytes = one 4K page */

The 'next' field would just contain an index into the indirect_chain page - so
from 0 to 255.  The offset and length would describe where within the page the
data is contained.

This way the 'lba' information is part of the 'blkif_request_indirect' and the
payload info is all in the indirect descriptors.
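
To show how a backend might consume such a page, here is a minimal
sketch in C. It is not an implementation of anything that exists: the
D_* flag values, the chain_total_bytes() helper and its 'first'
argument are all invented for illustration, and the thread has not
settled what the flags mean. The sketch only checks that an entry is
16 bytes (so 256 fit in a 4K page) and follows 'next' until it sees
D_END, using nr_elems as an upper bound so a looping chain cannot spin
forever.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHAIN_ENTRIES 256u
#define PAGE_BYTES    4096u

/* Hypothetical flag values; the thread only floats the names. */
#define D_START 0x1u
#define D_NEXT  0x2u
#define D_END   0x4u

struct indirect_chain {
    uint16_t op_flag;
    uint16_t next;      /* index of the next entry in the same page */
    uint16_t offset;    /* where the data starts inside the granted page */
    uint16_t length;    /* how many bytes of data it holds */
    uint32_t gref;
    uint32_t _pad;
};

static_assert(sizeof(struct indirect_chain) == 16,
              "entry must be 16 bytes");
static_assert(sizeof(struct indirect_chain) * CHAIN_ENTRIES == PAGE_BYTES,
              "256 entries must fill exactly one 4K page");

/*
 * Follow the chain starting at index 'first' and return the total
 * number of data bytes it describes, or -1 if the chain is malformed
 * (index out of range, data running past the page, or no D_END within
 * nr_elems entries).
 */
static long chain_total_bytes(const struct indirect_chain *page,
                              uint16_t first, uint16_t nr_elems)
{
    long total = 0;
    uint16_t idx = first;

    for (uint16_t seen = 0; seen < nr_elems; seen++) {
        if (idx >= CHAIN_ENTRIES)
            return -1;

        const struct indirect_chain *e = &page[idx];

        if ((uint32_t)e->offset + e->length > PAGE_BYTES)
            return -1;

        total += e->length;
        if (e->op_flag & D_END)
            return total;
        idx = e->next;
    }
    return -1; /* ran out of entries without seeing D_END */
}
```

With all 256 entries describing a full page each, one request would
cover 256 * 4096 = 1MB of data.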
