From: Markus Pargmann <mpa@pengutronix.de>
To: Wouter Verhelst <w@uter.be>
Cc: nbd-general@lists.sourceforge.net, Kevin Wolf <kwolf@redhat.com>,
qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>,
"Denis V. Lunev" <den@openvz.org>
Subject: Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
Date: Mon, 04 Apr 2016 12:18:37 +0200 [thread overview]
Message-ID: <1580868.I2qeg0bkmp@adelgunde> (raw)
In-Reply-To: <20160325084929.GA2671@grep.be>
[-- Attachment #1: Type: text/plain, Size: 7678 bytes --]
Hi,
back from my easter vacation. A bit surprised to find 200 mails in the
NBD mailing list ;).
On Friday 25 March 2016 09:49:29 Wouter Verhelst wrote:
> On Thu, Mar 24, 2016 at 04:08:13PM -0600, Eric Blake wrote:
> > On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > >
> > > With the availability of sparse storage formats, it is often needed to
> > > query status of a particular LBA range and read only those blocks of
> > > data that are actually present on the block device.
> > >
> > > To provide such information, the patch adds GET_LBA_STATUS extension
> > > with one new NBD_CMD_GET_LBA_STATUS command.
> > >
> >
> > > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > > +
> > > + An LBA range status query request. Length and offset define the range
> > > + of interest. The server MUST reply with a reply header, followed
> > > + immediately by the following data:
> > > +
> > > + - 32 bits, length of parameter data that follow (unsigned)
> > > + - zero or more LBA status descriptors, each having the following
> > > + structure:
> > > +
> > > + * 64 bits, offset (unsigned)
> > > + * 32 bits, length (unsigned)
> > > + * 16 bits, status (unsigned)
> > > +
> > > + unless an error condition has occurred.
> >
> > To date, only the NBD_CMD_READ command caused the server to send data
> > after the handle in its reply. This would be the second command with a
> > data field in the response, but it is returning a variable amount of
> > data, not directly tied to the client's length - but at least you did
> > make it structured so that the client knows how much to read. However,
> > your patch is incomplete; you'll need to edit the "Transmission" section
> > of the document to mention the rules on the server sending data, as the
> > existing text would now be wrong:
> >
> > > The server replies with:
> > >
> > > S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC)
> > > S: 32 bits, error
> > > S: 64 bits, handle
> > > S: (length bytes of data if the request is of type NBD_CMD_READ)
> >
> > You may also want to add a rule that for all future extensions, any
> > command that requires data in the server response (other than the server
> > response to NBD_CMD_READ) must include a 32-bit length as the first
> > field of its data payload, so that the server reply is always structured.
>
> Right.
>
> > Hmm, I still think it would be hard to write a wireshark dissector for
> > server replies, since there is no indication whether a given reply will
> > include data or not.
>
> Well, a wireshark dissector already exists. However, it is very old; it
> doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't
> support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't
> deal with negotiation at all. It was written when newstyle negotiation
> didn't exist yet, and scanning for INIT_PASSWD at the time was too hard,
> according to the guy who wrote it (don't remember who that was).
>
> > The _client_ knows (well, any well-written client
> > that uses a different value for 'handle' for every command sent to the
> > server), because it can read the returned 'handle' field to know what
> > command it matches to and therefore what data to expect; but a
> > third-party observer seeing _just_ the server messages has no idea which
> > server responses should have payload.
>
> It can if it knows enough about the protocol, but granted, that makes it
> harder for us to extend the protocol without breaking the dissector.
>
> > Scanning the stream and assuming that a new reply starts each time
> > NBD_REPLY_MAGIC is encountered is a close approximation, but hits
> > false positives if the data being returned for NBD_CMD_READ contains
> > the same magic number as part of its contents.
>
> Indeed.
>
> > Obviously, back-compat says we can't change the response to
> > NBD_CMD_READ, but that means that a wireshark dissector has to either
> > maintain context, or hunt through returned data looking for potential
> > magic numbers and possibly hitting false positives.
> >
> > That said, maybe we want to allow global flag negotiation prior to
> > transmission to add a new flag on both server and client side - the
> > server would advertise that it is capable of a new reply mode, and the
> > client then has to reply that it wants to use the new reply mode; in
>
> Global flag negotiation is already possible. There is a client flags
> field, which is sent before option haggling, that could be used for
> negotiation of such backwards-incompatible features.
>
> > that mode, all server replies starting with magic 0x67446698
> > (NBD_REPLY_MAGIC) will never have a data payload, and all commands that
> > cause a reply with payload (both NBD_CMD_READ and the new
> > NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it)
> > will reply with a NEW magic number:
> >
> > S: 32 bits, XXXX, magic (NBD_REPLY_MAGIC2)
> > S: 32 bits, error
> > S: 64 bits, handle
> > S: 32 bits, length
> > S: length bytes of data
>
> I like this. However, before committing to it, I would like to see
> Markus' feedback on that (explicit Cc added, even though he's on the
> list, too).
>
> We'd also need a way to communicate the ability to speak this protocol
> from the kernel to userspace before telling the server that the client
> understands something which maybe its kernel doesn't.
>
> Markus?
>
> > so that the server's data stream is fully structured without having to
> > maintain context of the client's requests. That is, a dissector can now
> > merely scan for both magic numbers; and on a stream between a client and
> > server that have negotiated the new mode, the old magic number will
> > never have a payload, and the new magic number will always be
> > accompanied with a payload that describes how much data to read to the
> > boundary of the next reply.
> >
> > For that matter, right now, NBD_CMD_READ is required to either end the
> > session or return length bytes of data even when error is non-zero (but
> > where those bytes MAY be invalid); but by adding a negotiated flag for
> > structured length replies, it would be possible to allow for short reads
> > and/or returning an error with 0 bytes of payload but still keeping the
> > connection to the client open, without having to send wasted bytes over
> > the wire.
>
> Yes. This has been discussed on the nbd-general list in the past. There
> is also the (significant) problem of the server having maybe already
> sent out the header before discovering there is an error, at which point
> it can *only* drop the connection.
I am still not through all the new mails on the list, so there may be
some more discussions about this. But I will answer here simply.
I generally like the idea of having this new kind of reply. Is this
solving our issues where we want to "stream" data directly from a
filedescriptor into a tcp-stream?
Does it make sense to extend this for reads with an offset? This way we
could not only send in chunks but also order them randomly. Is there any
use-case where it does make sense to read data not sequentially?
Best Regards,
Markus
--
Pengutronix e.K. | |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
next prev parent reply other threads:[~2016-04-04 10:19 UTC|newest]
Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-03-23 14:16 [Qemu-devel] [PATCH 0/2] NBD protocol extensions: WRITE_ZEROES and GET_LBA_STATUS Denis V. Lunev
2016-03-23 14:16 ` [Qemu-devel] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension Denis V. Lunev
2016-03-23 15:14 ` Eric Blake
2016-03-23 17:40 ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-03-24 7:16 ` [Qemu-devel] " Pavel Borzenkov
2016-03-24 7:36 ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-03-23 17:21 ` Wouter Verhelst
2016-03-24 7:57 ` Pavel Borzenkov
2016-03-24 8:26 ` Wouter Verhelst
2016-03-24 11:35 ` Pavel Borzenkov
2016-03-24 11:37 ` Paolo Bonzini
2016-03-24 12:31 ` Wouter Verhelst
2016-03-24 14:53 ` Eric Blake
2016-03-23 14:16 ` [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension Denis V. Lunev
2016-03-23 16:27 ` Eric Blake
2016-03-24 12:30 ` Pavel Borzenkov
2016-03-24 15:04 ` Eric Blake
2016-03-24 16:36 ` Pavel Borzenkov
2016-03-23 17:58 ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-03-23 18:14 ` Kevin Wolf
2016-03-24 8:25 ` Pavel Borzenkov
2016-03-24 8:41 ` Wouter Verhelst
2016-03-24 11:36 ` Pavel Borzenkov
2016-03-24 12:32 ` Wouter Verhelst
2016-03-24 8:43 ` Pavel Borzenkov
2016-03-24 9:33 ` Wouter Verhelst
2016-03-24 10:32 ` Alex Bligh
2016-03-24 11:58 ` Paolo Bonzini
2016-03-24 12:17 ` Alex Bligh
2016-03-24 12:32 ` Paolo Bonzini
2016-03-24 13:31 ` Alex Bligh
2016-03-24 13:32 ` Paolo Bonzini
2016-03-24 11:55 ` Paolo Bonzini
2016-03-24 12:43 ` Wouter Verhelst
2016-03-24 15:25 ` Eric Blake
2016-03-24 15:33 ` Paolo Bonzini
2016-03-24 15:53 ` Wouter Verhelst
2016-03-24 16:04 ` Eric Blake
2016-03-24 16:07 ` Kevin Wolf
2016-03-24 16:47 ` Wouter Verhelst
2016-03-29 9:38 ` Kevin Wolf
2016-03-29 9:53 ` Wouter Verhelst
2016-03-29 10:25 ` Paolo Bonzini
2016-03-24 22:08 ` [Qemu-devel] " Eric Blake
2016-03-25 8:49 ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-03-25 9:01 ` Alex Bligh
2016-03-28 15:58 ` Eric Blake
2016-04-04 10:32 ` Markus Pargmann
2016-04-04 10:18 ` Markus Pargmann [this message]
2016-04-04 16:54 ` Eric Blake
2016-04-04 22:17 ` Wouter Verhelst
2016-04-04 16:40 ` [Qemu-devel] " Eric Blake
2016-04-04 20:16 ` Denis V. Lunev
2016-04-04 20:36 ` [Qemu-devel] [Nbd] " Eric Blake
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1580868.I2qeg0bkmp@adelgunde \
--to=mpa@pengutronix.de \
--cc=den@openvz.org \
--cc=kwolf@redhat.com \
--cc=nbd-general@lists.sourceforge.net \
--cc=pbonzini@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@redhat.com \
--cc=w@uter.be \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).