From: Dan Williams
Date: Tue, 11 Apr 2017 07:56:41 -0700
Subject: Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
In-Reply-To: <20170411084133.wroqrhn5eczesqam@hz-desktop>
References: <20170331084147.32716-1-haozhong.zhang@intel.com>
 <20170406094359.GB21261@stefanha-x1.localdomain>
 <30db934f-d27e-1d15-6257-84224283dea9@gmail.com>
 <20170411084133.wroqrhn5eczesqam@hz-desktop>
To: Xiao Guangrong, Stefan Hajnoczi, Eduardo Habkost, "Michael S. Tsirkin",
 qemu-devel@nongnu.org, Igor Mammedov, Paolo Bonzini, Dan J Williams,
 Richard Henderson, Xiao Guangrong, Christoph Hellwig

[ adding Christoph ]

On Tue, Apr 11, 2017 at 1:41 AM, Haozhong Zhang wrote:
> On 04/06/17 20:02 +0800, Xiao Guangrong wrote:
>>
>> On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
>> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
>> > > This patch series constructs the flush hint address structures for
>> > > nvdimm devices in QEMU.
>> > >
>> > > It's of course not for 2.9. I'm sending it out early in order to get
>> > > comments on one point I'm uncertain about (see the detailed
>> > > explanation below). Thanks for any comments in advance!
>> > >
>> > > Background
>> > > ---------------
>> >
>> > Extra background:
>> >
>> > Flush Hint Addresses are necessary because:
>> >
>> > 1. Some hardware configurations may require them. In other words, a
>> >    cache flush instruction is not enough to persist data.
>> >
>> > 2. The host file system may need fsync(2) calls (e.g. to persist
>> >    metadata changes).
>> >
>> > Without Flush Hint Addresses, only some NVDIMM configurations actually
>> > guarantee data persistence.
>> >
>> > > The flush hint address structure is a substructure of the NFIT and
>> > > specifies one or more addresses, namely Flush Hint Addresses.
>> > > Software can write to any one of these flush hint addresses to cause
>> > > any preceding writes to the NVDIMM region to be flushed out of the
>> > > intervening platform buffers to the targeted NVDIMM. More details
>> > > can be found in ACPI Spec 6.1, Section 5.2.25.8 "Flush Hint Address
>> > > Structure".
>> >
>> > Do you have performance data? I'm concerned that the Flush Hint
>> > Address hardware interface is not virtualization-friendly.
>> >
>> > In Linux, drivers/nvdimm/region_devs.c:nvdimm_flush() does:
>> >
>> >     wmb();
>> >     for (i = 0; i < nd_region->ndr_mappings; i++)
>> >         if (ndrd_get_flush_wpq(ndrd, i, 0))
>> >             writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>> >     wmb();
>> >
>> > That looks pretty lightweight - it's an MMIO write between write
>> > barriers.
>> >
>> > This patch implements the MMIO write like this:
>> >
>> >     void nvdimm_flush(NVDIMMDevice *nvdimm)
>> >     {
>> >         if (nvdimm->backend_fd != -1) {
>> >             /*
>> >              * If the backend store is a physical NVDIMM device, fsync()
>> >              * will trigger the flush via the flush hint on the host
>> >              * device.
>> >              */
>> >             fsync(nvdimm->backend_fd);
>> >         }
>> >     }
>> >
>> > The MMIO store instruction is turned into a synchronous fsync(2)
>> > system call plus a vmexit/vmenter and a QEMU userspace context switch:
>> >
>> > 1. The vcpu blocks during the fsync(2) system call. The MMIO write
>> >    instruction has an unexpected and huge latency.
>> >
>> > 2. The vcpu thread holds the QEMU global mutex, so all other threads
>> >    (including the monitor) are blocked during fsync(2). Other vcpu
>> >    threads may block if they vmexit.
>> >
>> > It is hard to implement this efficiently in QEMU. This is why I said
>> > the hardware interface is not virtualization-friendly. It's cheap on
>> > real hardware but expensive under virtualization.
>> >
>> > We should think about the optimal way of implementing Flush Hint
>> > Addresses in QEMU. But if there is no reasonable way to implement
>> > them, then I think it's better *not* to implement them, just like the
>> > Block Window feature, which is also not virtualization-friendly. Users
>> > who want a block device can use virtio-blk. I don't think NVDIMM Block
>> > Window can achieve better performance than virtio-blk under
>> > virtualization (although I'm happy to be proven wrong).
>> >
>> > Some ideas for a faster implementation:
>> >
>> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>> >    global mutex. Little synchronization is necessary as long as the
>> >    NVDIMM device isn't hot-unplugged (not yet supported anyway).
>> >
>> > 2. Can the host kernel provide a way to mmap Address Flush Hints from
>> >    the physical NVDIMM in cases where the configuration does not
>> >    require host kernel interception? That way QEMU can map the
>> >    physical NVDIMM's Address Flush Hints directly into the guest. The
>> >    hypervisor is bypassed and performance would be good.
>> >
>> > I'm not sure there is anything we can do to make the case where the
>> > host kernel wants an fsync(2) fast :(.
>>
>> Good point.
>>
>> We can assume flush-CPU-cache-to-make-persistent is always available
>> on Intel hardware, so the flush hint table is not needed if the
>> vNVDIMM is backed by a real Intel NVDIMM device.
>>
>
> We can let users of qemu (e.g. libvirt) detect whether the backend
> device supports ADR, and pass the 'flush-hint' option to qemu only if
> ADR is not supported.
>

There currently is no ACPI mechanism to detect the presence of ADR.
Also, you still need the flush for fs metadata management.

>> If the vNVDIMM device is backed by a regular file, I think fsync is
>> the bottleneck rather than the MMIO virtualization. :(
>>
>
> Yes, fsync() on the regular file is the bottleneck. We may either
>
> 1/ perform the host-side flush in an asynchronous way which will not
>    block the vcpu for too long,
>
> or
>
> 2/ not provide a strong durability guarantee for a non-NVDIMM backend
>    and not emulate the flush hint for the guest at all. (I know 1/
>    does not provide a strong durability guarantee either.)

or 3/ use device-dax as a stop-gap until we can get an efficient
fsync() overhead reduction (or bypass) mechanism built and accepted
for filesystem-dax.
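
For concreteness, here is a minimal, untested sketch of Stefan's idea 1
above: model the flush hint page as an MMIO region whose write handler is
dispatched without the QEMU global mutex, using the
memory_region_clear_global_locking() API he mentions. The handler names,
the flush_hint_mr field, and the region size and base are illustrative
assumptions rather than code from the posted patch, and the fsync() in
the handler is still synchronous, so this only removes the global-mutex
serialization (Stefan's point 2), not the vcpu stall (point 1):

    #include "qemu/osdep.h"
    #include "exec/memory.h"
    #include "hw/mem/nvdimm.h"

    static uint64_t nvdimm_flush_hint_read(void *opaque, hwaddr addr,
                                           unsigned size)
    {
        /* Reads of the flush hint page have no side effects. */
        return 0;
    }

    static void nvdimm_flush_hint_write(void *opaque, hwaddr addr,
                                        uint64_t data, unsigned size)
    {
        NVDIMMDevice *nvdimm = opaque;

        /* Any store to the flush hint page triggers a backend flush. */
        if (nvdimm->backend_fd != -1) {
            fsync(nvdimm->backend_fd);
        }
    }

    static const MemoryRegionOps nvdimm_flush_hint_ops = {
        .read = nvdimm_flush_hint_read,
        .write = nvdimm_flush_hint_write,
        .endianness = DEVICE_LITTLE_ENDIAN,
        .valid = {
            .min_access_size = 1,
            .max_access_size = 8,
        },
    };

    static void nvdimm_register_flush_hint(NVDIMMDevice *nvdimm,
                                           MemoryRegion *sysmem, hwaddr base)
    {
        /* flush_hint_mr is a hypothetical field added for this sketch. */
        memory_region_init_io(&nvdimm->flush_hint_mr, OBJECT(nvdimm),
                              &nvdimm_flush_hint_ops, nvdimm,
                              "nvdimm-flush-hint", 4096);
        /* Dispatch accesses to this region without taking the BQL. */
        memory_region_clear_global_locking(&nvdimm->flush_hint_mr);
        memory_region_add_subregion(sysmem, base, &nvdimm->flush_hint_mr);
    }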
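
Similarly, a rough sketch of Haozhong's option 1/ above, deferring the
host-side fsync() to QEMU's thread pool so the MMIO handler returns
quickly instead of blocking the vcpu for the full flush.
nvdimm_flush_async, NvdimmFlushJob, and the completion handling are
hypothetical and for illustration only; as already noted in the thread,
the guest's flush-hint write then completes before the data is actually
durable:

    #include "qemu/osdep.h"
    #include "block/aio.h"
    #include "block/thread-pool.h"
    #include "hw/mem/nvdimm.h"

    typedef struct NvdimmFlushJob {
        int backend_fd;
    } NvdimmFlushJob;

    /* Runs in a worker thread, off the vcpu path. */
    static int nvdimm_flush_worker(void *opaque)
    {
        NvdimmFlushJob *job = opaque;

        return fsync(job->backend_fd);
    }

    /* Runs in the main loop once the fsync() has finished. */
    static void nvdimm_flush_done(void *opaque, int ret)
    {
        g_free(opaque);
    }

    void nvdimm_flush_async(NVDIMMDevice *nvdimm)
    {
        ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
        NvdimmFlushJob *job;

        if (nvdimm->backend_fd == -1) {
            return;
        }

        job = g_new0(NvdimmFlushJob, 1);
        job->backend_fd = nvdimm->backend_fd;

        /*
         * Returns immediately; the backing file is not guaranteed to be
         * durable until nvdimm_flush_done() runs.
         */
        thread_pool_submit_aio(pool, nvdimm_flush_worker, job,
                               nvdimm_flush_done, job);
    }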