From: Xiao Guangrong <guangrong.xiao@gmail.com>
To: Stefan Hajnoczi <stefanha@gmail.com>,
	Haozhong Zhang <haozhong.zhang@intel.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	qemu-devel@nongnu.org, Igor Mammedov <imammedo@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	dan.j.williams@intel.com, Richard Henderson <rth@twiddle.net>,
	Xiao Guangrong <xiaoguangrong.eric@gmail.com>
Subject: Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Date: Thu, 6 Apr 2017 20:02:51 +0800
Message-ID: <30db934f-d27e-1d15-6257-84224283dea9@gmail.com>
In-Reply-To: <20170406094359.GB21261@stefanha-x1.localdomain>



On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
> On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
>> This patch series constructs the flush hint address structures for
>> nvdimm devices in QEMU.
>>
>> It's of course not for 2.9. I send it out early in order to get
>> comments on one point I'm uncertain (see the detailed explanation
>> below). Thanks for any comments in advance!
>> Background
>> ---------------
>
> Extra background:
>
> Flush Hint Addresses are necessary because:
>
> 1. Some hardware configurations may require them.  In other words, a
>    cache flush instruction is not enough to persist data.
>
> 2. The host file system may need fsync(2) calls (e.g. to persist
>    metadata changes).
>
> Without Flush Hint Addresses only some NVDIMM configurations actually
> guarantee data persistence.
>
>> Flush hint address structure is a substructure of NFIT and specifies
>> one or more addresses, namely Flush Hint Addresses. Software can write
>> to any one of these flush hint addresses to cause any preceding writes
>> to the NVDIMM region to be flushed out of the intervening platform
>> buffers to the targeted NVDIMM. More details can be found in ACPI Spec
>> 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
>
> Do you have performance data?  I'm concerned that Flush Hint Address
> hardware interface is not virtualization-friendly.
>
> In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
>
>   wmb();
>   for (i = 0; i < nd_region->ndr_mappings; i++)
>       if (ndrd_get_flush_wpq(ndrd, i, 0))
>           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>   wmb();
>
> That looks pretty lightweight - it's an MMIO write between write
> barriers.
>
> This patch implements the MMIO write like this:
>
>   void nvdimm_flush(NVDIMMDevice *nvdimm)
>   {
>       if (nvdimm->backend_fd != -1) {
>           /*
>            * If the backend store is a physical NVDIMM device, fsync()
>            * will trigger the flush via the flush hint on the host device.
>            */
>           fsync(nvdimm->backend_fd);
>       }
>   }
>
> The MMIO store instruction turned into a synchronous fsync(2) system
> call plus vmexit/vmenter and QEMU userspace context switch:
>
> 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
>    instruction has an unexpected and huge latency.
>
> 2. The vcpu thread holds the QEMU global mutex so all other threads
>    (including the monitor) are blocked during fsync(2).  Other vcpu
>    threads may block if they vmexit.
>
> It is hard to implement this efficiently in QEMU.  This is why I said
> the hardware interface is not virtualization-friendly.  It's cheap on
> real hardware but expensive under virtualization.
>
> We should think about the optimal way of implementing Flush Hint
> Addresses in QEMU.  But if there is no reasonable way to implement them
> then I think it's better *not* to implement them, just like the Block
> Window feature which is also not virtualization-friendly.  Users who
> want a block device can use virtio-blk.  I don't think NVDIMM Block
> Window can achieve better performance than virtio-blk under
> virtualization (although I'm happy to be proven wrong).
>
> Some ideas for a faster implementation:
>
> 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>    global mutex.  Little synchronization is necessary as long as the
>    NVDIMM device isn't hot unplugged (not yet supported anyway).
>
> 2. Can the host kernel provide a way to mmap Address Flush Hints from
>    the physical NVDIMM in cases where the configuration does not require
>    host kernel interception?  That way QEMU can map the physical
>    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
>    is bypassed and performance would be good.
>
> I'm not sure there is anything we can do to make the case where the host
> kernel wants an fsync(2) fast :(.

Good point.

We can assume that flushing the CPU cache is always sufficient to
make data persistent on Intel hardware, so the flush hint table is
not needed when the vNVDIMM is backed by a real Intel NVDIMM device.

If the vNVDIMM is backed by a regular file, I think fsync is the
bottleneck rather than the MMIO virtualization. :(
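
A rough way to check the regular-file case is to time a single fsync(2)
on a dirtied file-backed mapping. The sketch below is only an
illustration and is not part of the patch series: the helper name
time_fsync_ns(), the probe path and the mapping size are all made up
here, and it measures the host side only, with no vmexit cost included.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static long long now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical helper (not from QEMU): create a regular file of 'len'
 * bytes, dirty it through a shared mapping the way a file-backed
 * vNVDIMM would be dirtied, then time one fsync() on the descriptor.
 * Returns the nanoseconds spent in fsync(), or -1 on any error.
 */
static long long time_fsync_ns(const char *path, size_t len)
{
    long long t0, t1;
    char *p;
    int fd, rc;

    assert(len > 0);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        unlink(path);
        return -1;
    }

    memset(p, 0xab, len);          /* dirty every page of the mapping */

    t0 = now_ns();
    rc = fsync(fd);                /* same call the quoted nvdimm_flush() makes */
    t1 = now_ns();

    munmap(p, len);
    close(fd);
    unlink(path);                  /* remove the probe file */
    return rc == 0 ? t1 - t0 : -1;
}
```

Calling time_fsync_ns("/tmp/fsync_probe.bin", 1 << 20) from a small
main() and printing the result gives one data point. The number depends
heavily on the host file system and storage, so it should be read as an
order-of-magnitude hint rather than a benchmark.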
