From: Stefan Hajnoczi <stefanha@gmail.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>,
Eduardo Habkost <ehabkost@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
qemu-devel@nongnu.org, Igor Mammedov <imammedo@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Richard Henderson <rth@twiddle.net>,
Xiao Guangrong <xiaoguangrong.eric@gmail.com>,
Christoph Hellwig <hch@infradead.org>
Subject: Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Date: Fri, 21 Apr 2017 14:56:34 +0100 [thread overview]
Message-ID: <20170421135634.GB28249@stefanha-x1.localdomain> (raw)
In-Reply-To: <CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com>
On Thu, Apr 20, 2017 at 12:49:21PM -0700, Dan Williams wrote:
> On Tue, Apr 11, 2017 at 7:56 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> > [ adding Christoph ]
> >
> > On Tue, Apr 11, 2017 at 1:41 AM, Haozhong Zhang
> > <haozhong.zhang@intel.com> wrote:
> >> On 04/06/17 20:02 +0800, Xiao Guangrong wrote:
> >>>
> >>>
> >>> On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
> >>> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> >>> > > This patch series constructs the flush hint address structures for
> >>> > > nvdimm devices in QEMU.
> >>> > >
> >>> > > It's of course not for 2.9. I'm sending it out early in order to get
> >>> > > comments on one point I'm uncertain about (see the detailed
> >>> > > explanation below). Thanks for any comments in advance!
> >>> > >
> >>> > > Background
> >>> > > ---------------
> >>> >
> >>> > Extra background:
> >>> >
> >>> > Flush Hint Addresses are necessary because:
> >>> >
> >>> > 1. Some hardware configurations may require them. In other words, a
> >>> > cache flush instruction is not enough to persist data.
> >>> >
> >>> > 2. The host file system may need fsync(2) calls (e.g. to persist
> >>> > metadata changes).
> >>> >
> >>> > Without Flush Hint Addresses only some NVDIMM configurations actually
> >>> > guarantee data persistence.
> >>> >
> >>> > > Flush hint address structure is a substructure of NFIT and specifies
> >>> > > one or more addresses, namely Flush Hint Addresses. Software can write
> >>> > > to any one of these flush hint addresses to cause any preceding writes
> >>> > > to the NVDIMM region to be flushed out of the intervening platform
> >>> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> >>> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
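For reference, the structure described there is a small fixed header followed
by a variable number of 64-bit system physical addresses. A rough C sketch of
the layout (field names are paraphrased for illustration, not copied from the
spec or from any existing header):

    /* Flush Hint Address Structure, ACPI 6.1 Section 5.2.25.8 (sketch) */
    #include <stdint.h>

    struct nfit_flush_hint_address {
        uint16_t type;               /* 6 = Flush Hint Address Structure */
        uint16_t length;             /* total size of this structure */
        uint32_t nfit_device_handle; /* NVDIMM this structure applies to */
        uint16_t hint_count;         /* number of Flush Hint Addresses */
        uint8_t  reserved[6];
        uint64_t hint_address[];     /* writing to any of these flushes
                                        preceding writes to the NVDIMM */
    } __attribute__((packed));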
> >>> >
> >>> > Do you have performance data? I'm concerned that the Flush Hint
> >>> > Address hardware interface is not virtualization-friendly.
> >>> >
> >>> > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> >>> >
> >>> >     wmb();
> >>> >     for (i = 0; i < nd_region->ndr_mappings; i++)
> >>> >         if (ndrd_get_flush_wpq(ndrd, i, 0))
> >>> >             writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> >>> >     wmb();
> >>> >
> >>> > That looks pretty lightweight - it's an MMIO write between write
> >>> > barriers.
> >>> >
> >>> > This patch implements the MMIO write like this:
> >>> >
> >>> >     void nvdimm_flush(NVDIMMDevice *nvdimm)
> >>> >     {
> >>> >         if (nvdimm->backend_fd != -1) {
> >>> >             /*
> >>> >              * If the backend store is a physical NVDIMM device, fsync()
> >>> >              * will trigger the flush via the flush hint on the host device.
> >>> >              */
> >>> >             fsync(nvdimm->backend_fd);
> >>> >         }
> >>> >     }
> >>> >
> >>> > The MMIO store instruction is turned into a synchronous fsync(2) system
> >>> > call plus vmexit/vmenter and a QEMU userspace context switch:
> >>> >
> >>> > 1. The vcpu blocks during the fsync(2) system call. The MMIO write
> >>> > instruction has unexpectedly high latency.
> >>> >
> >>> > 2. The vcpu thread holds the QEMU global mutex so all other threads
> >>> > (including the monitor) are blocked during fsync(2). Other vcpu
> >>> > threads may block if they vmexit.
> >>> >
> >>> > It is hard to implement this efficiently in QEMU. This is why I said
> >>> > the hardware interface is not virtualization-friendly. It's cheap on
> >>> > real hardware but expensive under virtualization.
> >>> >
> >>> > We should think about the optimal way of implementing Flush Hint
> >>> > Addresses in QEMU. But if there is no reasonable way to implement them
> >>> > then I think it's better *not* to implement them, just like the Block
> >>> > Window feature which is also not virtualization-friendly. Users who
> >>> > want a block device can use virtio-blk. I don't think NVDIMM Block
> >>> > Window can achieve better performance than virtio-blk under
> >>> > virtualization (although I'm happy to be proven wrong).
> >>> >
> >>> > Some ideas for a faster implementation:
> >>> >
> >>> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
> >>> > global mutex. Little synchronization is necessary as long as the
> >>> > NVDIMM device isn't hot unplugged (not yet supported anyway).
> >>> >
> >>> > 2. Can the host kernel provide a way to mmap Address Flush Hints from
> >>> > the physical NVDIMM in cases where the configuration does not require
> >>> > host kernel interception? That way QEMU can map the physical
> >>> > NVDIMM's Address Flush Hints directly into the guest. The hypervisor
> >>> > is bypassed and performance would be good.
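A minimal sketch of what idea 1 could look like, assuming the backend_fd field
this series adds; the function and region names are made up for illustration
and this is not the code from the patch:

    #include "qemu/osdep.h"
    #include "exec/memory.h"
    #include "hw/mem/nvdimm.h"

    static uint64_t flush_hint_read(void *opaque, hwaddr addr, unsigned size)
    {
        return 0;    /* reads of the flush hint page have no defined meaning */
    }

    static void flush_hint_write(void *opaque, hwaddr addr,
                                 uint64_t data, unsigned size)
    {
        NVDIMMDevice *nvdimm = opaque;

        /*
         * Runs without the global mutex once global locking is cleared, so
         * it must only touch state that is safe to access concurrently.
         */
        if (nvdimm->backend_fd != -1) {
            fsync(nvdimm->backend_fd);
        }
    }

    static const MemoryRegionOps flush_hint_ops = {
        .read = flush_hint_read,
        .write = flush_hint_write,
        .endianness = DEVICE_LITTLE_ENDIAN,
    };

    static void nvdimm_map_flush_hints(NVDIMMDevice *nvdimm, MemoryRegion *mr)
    {
        memory_region_init_io(mr, OBJECT(nvdimm), &flush_hint_ops, nvdimm,
                              "nvdimm-flush-hint", 4096);
        /* Idea 1: dispatch writes to this region outside the global mutex. */
        memory_region_clear_global_locking(mr);
    }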
> >>> >
> >>> > I'm not sure there is anything we can do to make the case where the host
> >>> > kernel wants an fsync(2) fast :(.
> >>>
> >>> Good point.
> >>>
> >>> We can assume that flushing the CPU cache is always enough to make data
> >>> persistent on Intel hardware, so the flush hint table is not needed if
> >>> the vNVDIMM is backed by a real Intel NVDIMM device.
> >>>
> >>
> >> We can let users of qemu (e.g. libvirt) detect whether the backend
> >> device supports ADR, and pass 'flush-hint' option to qemu only if ADR
> >> is not supported.
> >>
> >
> > There currently is no ACPI mechanism to detect the presence of ADR.
> > Also, you still need the flush for fs metadata management.
> >
> >>> If the vNVDIMM device is backed by a regular file, I think fsync() is
> >>> the bottleneck rather than the MMIO virtualization. :(
> >>>
> >>
> >> Yes, fsync() on a regular file is the bottleneck. We may either
> >>
> >> 1/ perform the host-side flush in an asynchronous way which will not
> >>    block the vcpu for too long,
> >>
> >> or
> >>
> >> 2/ not provide a strong durability guarantee for non-NVDIMM backends and
> >>    not emulate flush hints for the guest at all. (I know 1/ does not
> >>    provide a strong durability guarantee either.)
> >
> > or
> >
> > 3/ Use device-dax as a stop-gap until we can get an efficient fsync()
> > overhead reduction (or bypass) mechanism built and accepted for
> > filesystem-dax.
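On option 1/ above, a rough sketch of handing the host-side fsync(2) to QEMU's
thread pool so the vcpu thread is not blocked for the whole flush. The function
names are made up, and it keeps exactly the weaker durability ordering Haozhong
notes, since the guest's MMIO write completes before the host flush does:

    #include "qemu/osdep.h"
    #include "block/aio.h"
    #include "block/thread-pool.h"

    static int flush_worker(void *opaque)
    {
        int fd = (intptr_t)opaque;

        return fsync(fd);                /* runs in a pool thread */
    }

    static void flush_complete(void *opaque, int ret)
    {
        /* Nothing to do here; the guest has already moved on. */
    }

    static void nvdimm_flush_async(int backend_fd)
    {
        ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());

        thread_pool_submit_aio(pool, flush_worker,
                               (void *)(intptr_t)backend_fd,
                               flush_complete, NULL);
    }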
>
> I didn't realize we have a bigger problem with host filesystem fsync()
> and that WPQ exits will not save us. Applications that use device-dax
> in the guest may never trigger a WPQ flush, because userspace flushing
> with device-dax is expected to be safe. WPQ flush was never meant to
> be a persistence mechanism the way it is proposed here; it's only
> meant to minimize the fallout from a potential ADR failure. My
> apologies for insinuating that it was viable.
>
> So, until we solve this userspace flushing problem, virtualization must
> not pass through any file except a device-dax instance for any
> production workload.
Okay. That's what I've assumed up until now and I think distros will
document this limitation.
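For reference, passing a device-dax instance through follows the usual vNVDIMM
setup with the backing path pointed at the dax character device; paths and
sizes below are placeholders:

    qemu-system-x86_64 -machine pc,nvdimm \
        -m 4G,slots=4,maxmem=32G \
        -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G \
        -device nvdimm,id=nvdimm1,memdev=mem1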
> Also these performance overheads seem prohibitive. We really want to
> take whatever fsync minimization / bypass mechanism we come up with on
> the host into a fast para-virtualized interface for the guest. Guests
> need to be able to avoid hypervisor and host syscall overhead in the
> fast path.
It's hard to avoid the hypervisor if the host kernel file system needs
an fsync() to persist everything. There should be a fast path for the
case where the host file is preallocated and no fancy file system
features (e.g. deduplication, copy-on-write snapshots) are in use, so
the host file system doesn't need fsync().
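One piece of that fast path is preallocating the backing file up front so the
data path never allocates blocks. A hedged POSIX sketch (the helper name is
made up, and this alone does not remove the need for fsync() on file systems
that still require it):

    #include <fcntl.h>
    #include <unistd.h>

    /* Reserve all blocks for the backing file before handing it to the guest. */
    static int preallocate_backing_file(const char *path, off_t size)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0) {
            return -1;
        }
        int ret = posix_fallocate(fd, 0, size);   /* returns an errno value */
        close(fd);
        return ret ? -1 : 0;
    }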
Stefan