From: Kevin Wolf
Date: Fri, 26 Oct 2012 09:59:41 +0200
Message-ID: <508A42ED.10006@redhat.com>
In-Reply-To: <20121025170933.GE20318@jl-vm1.vm.bytemark.co.uk>
Subject: Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
To: Jamie Lokier
Cc: pbonzini@redhat.com, qemu-devel@nongnu.org, Nicholas Thomas

On 25.10.2012 19:09, Jamie Lokier wrote:
> Kevin Wolf wrote:
>> On 24.10.2012 16:32, Jamie Lokier wrote:
>>> Kevin Wolf wrote:
>>>> On 24.10.2012 14:16, Nicholas Thomas wrote:
>>>>> On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
>>>>>> Since the I/O _order_ before, and sometimes after, flush, is
>>>>>> important for data integrity, it needs to be maintained when I/Os
>>>>>> are queued in the disconnected state -- including those which were
>>>>>> in flight at the time the disconnect was detected and then retried
>>>>>> on reconnect.
>>>>>
>>>>> Hmm, discussing this on IRC I was told that it wasn't necessary to
>>>>> preserve order - although I forget the fine detail. Depending on the
>>>>> implementation of qemu's coroutine mutexes, operations may not
>>>>> actually be performed in order right now - it's not too easy to work
>>>>> out what's happening.
>>>>
>>>> It's possible to reorder, but it must be consistent with the order in
>>>> which completion is signalled to the guest. The semantics of flush
>>>> are that at the point the flush completes, all writes to the disk
>>>> that have already completed successfully are stable. It says nothing
>>>> about writes that are still in flight; they may or may not be flushed
>>>> to disk.
>>>
>>> I admit I wasn't thinking clearly about how much ordering NBD actually
>>> guarantees (or whether there's ordering the guest depends on
>>> implicitly even if it isn't guaranteed by the specification), and how
>>> that relates within QEMU to the virtio/FUA/NCQ/TCQ/SCSI-ORDERED
>>> ordering guarantees that the guest expects for various emulated
>>> devices and their settings.
>>>
>>> The ordering (if any) needed from the NBD driver (or any backend) is
>>> going to depend on the assumptions baked into the interface between
>>> QEMU device emulation <-> backend.
>>>
>>> E.g. if every device emulation waited for all outstanding writes to
>>> complete before sending a flush, then it wouldn't matter how the
>>> backend reordered its requests, even getting the completions out of
>>> order.
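To make the example concrete, a minimal sketch of that drain-before-flush
invariant in plain C. All of the names here are made up (this is not qemu
code); it only shows a device model that never has a flush and a write in
flight at the same time, which makes any backend reordering harmless:

#include <stdbool.h>

/* Hypothetical device state; not qemu code. */
struct DiskDevice {
    int writes_in_flight;   /* writes submitted but not yet completed */
    bool flush_pending;     /* guest flush deferred until writes drain */
};

/* Hypothetical backend entry points. */
void backend_submit_write(struct DiskDevice *d);
void backend_submit_flush(struct DiskDevice *d);

static void on_guest_write(struct DiskDevice *d)
{
    d->writes_in_flight++;
    backend_submit_write(d);   /* completion calls on_write_complete() */
}

static void on_write_complete(struct DiskDevice *d)
{
    if (--d->writes_in_flight == 0 && d->flush_pending) {
        d->flush_pending = false;
        backend_submit_flush(d);
    }
}

static void on_guest_flush(struct DiskDevice *d)
{
    if (d->writes_in_flight > 0) {
        /* The flush only has to cover writes already signalled as
         * complete, so deferring it until the queue drains is safe,
         * and the backend may then reorder freely underneath. */
        d->flush_pending = true;
    } else {
        backend_submit_flush(d);
    }
}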
>>>
>>> Is that relationship documented (and conformed to)?
>>
>> No, like so many other things in qemu it's not spelt out explicitly.
>> However, as I understand it, it's the same behaviour as real hardware
>> has, so device emulation, at least for the common devices, doesn't
>> have to implement anything special for it - if the hardware even
>> supports parallel requests; otherwise it automatically has only a
>> single request in flight (like IDE).
>
> That's why I mention virtio/FUA/NCQ/TCQ/SCSI-ORDERED, which are quite
> common.
>
> They are features of devices which support multiple parallel requests,
> but with certain ordering constraints conveyed by or expected by the
> guest, which have to be ensured when they are mapped onto a fully
> asynchronous QEMU backend.
>
> That means they are features of the hardware which device emulations
> _do_ have to implement. If they don't, the storage is unreliable under
> things like host power removal and virtual power removal.

Yes, device emulations that need to maintain a given order must take
care to wait for the completion of the previous requests.

> If the backends are allowed to explicitly have no coupling between
> different request types (even flush/discard and write), and ordering
> constraints are enforced by the order in which device emulations
> submit and wait, that's fine.
>
> I mention this because POSIX aio_fsync() is _not_ fully decoupled
> according to its specification.
>
> So it might be that some device emulations by now depend on the
> semantics of aio_fsync() or the QEMU equivalent; and randomly
> reordering requests in the NBD driver (or in any other backend) under
> unusual circumstances would break those semantics.

qemu AIO has had these semantics ever since bdrv_aio_flush() was
introduced, and it behaves the same way for image files. So I don't see
any problem with NBD making use of the same.

Kevin
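For illustration, a sketch of that wait-for-completion pattern against
the 2012-era callback API: the device model submits the next ordered
request (e.g. for SCSI ORDERED tags) only from the completion callback
of the previous one. The OrderedQueue structure and helper names are
hypothetical; only bdrv_aio_writev() and the completion-callback
signature are the real qemu interfaces, and error handling is omitted:

#include <stdint.h>
#include <stdbool.h>
#include "block.h"   /* BlockDriverState, QEMUIOVector, bdrv_aio_writev() */

/* Hypothetical per-device FIFO of ordered requests not yet submitted
 * to the block layer (enqueueing is not shown). */
typedef struct OrderedReq {
    int64_t sector_num;
    QEMUIOVector *qiov;
    int nb_sectors;
    struct OrderedReq *next;
} OrderedReq;

typedef struct OrderedQueue {
    BlockDriverState *bs;
    OrderedReq *head;    /* next request to submit, in guest order */
    bool in_flight;      /* at most one ordered request at a time */
} OrderedQueue;

static void ordered_submit_next(OrderedQueue *q);

/* Completion callback, matching qemu's BlockDriverCompletionFunc. */
static void ordered_cb(void *opaque, int ret)
{
    OrderedQueue *q = opaque;

    q->in_flight = false;
    /* Only now may the next ordered request reach the backend, so the
     * completion order seen by the guest matches submission order no
     * matter how the backend reorders internally.  (A real device
     * would also check ret and report errors to the guest.) */
    ordered_submit_next(q);
}

static void ordered_submit_next(OrderedQueue *q)
{
    OrderedReq *req = q->head;

    if (!req || q->in_flight) {
        return;
    }
    q->head = req->next;
    q->in_flight = true;
    bdrv_aio_writev(q->bs, req->sector_num, req->qiov, req->nb_sectors,
                    ordered_cb, q);
}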