Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure


From: Haozhong Zhang <haozhong.zhang@intel.com>
To: Stefan Hajnoczi <stefanha@gmail.com>, dan.j.williams@intel.com
Cc: qemu-devel@nongnu.org,
	Xiao Guangrong <xiaoguangrong.eric@gmail.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Eduardo Habkost <ehabkost@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Igor Mammedov <imammedo@redhat.com>,
	Richard Henderson <rth@twiddle.net>
Subject: Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Date: Thu, 6 Apr 2017 18:31:17 +0800
Message-ID: <20170406103117.7elooodj2k5wrbnu@hz-desktop>
In-Reply-To: <20170406094359.GB21261@stefanha-x1.localdomain>

On 04/06/17 10:43 +0100, Stefan Hajnoczi wrote:
> On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > This patch series constructs the flush hint address structures for
> > nvdimm devices in QEMU.
> > 
> > It's of course not for 2.9. I send it out early in order to get
> > comments on one point I'm uncertain (see the detailed explanation
> > below). Thanks for any comments in advance!
> > Background
> > ---------------
> 
> Extra background:
> 
> Flush Hint Addresses are necessary because:
> 
> 1. Some hardware configurations may require them.  In other words, a
>    cache flush instruction is not enough to persist data.
> 
> 2. The host file system may need fsync(2) calls (e.g. to persist
>    metadata changes).
> 
> Without Flush Hint Addresses only some NVDIMM configurations actually
> guarantee data persistence.
> 
> > Flush hint address structure is a substructure of NFIT and specifies
> > one or more addresses, namely Flush Hint Addresses. Software can write
> > to any one of these flush hint addresses to cause any preceding writes
> > to the NVDIMM region to be flushed out of the intervening platform
> > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> 
> Do you have performance data?  I'm concerned that Flush Hint Address
> hardware interface is not virtualization-friendly.
>

I haven't tested how much vNVDIMM performance drops with this patch
series.

I tested the fsync latency of a regular file on bare metal by
writing 1 GB of random data to a file (on an ext4 file system on an
SSD) and then calling fsync. The average fsync latency in that case
is 3 ms. I currently don't have NVDIMM hardware, so I cannot measure
its latency. In any case, as you point out below, the latency should
be higher in a VM.
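
For reference, a measurement of this kind can be sketched roughly as
follows. This is not the program I used; fsync_latency_ms() is a
hypothetical helper, the path and size are up to the caller, and it
times a single fsync(2) rather than an average. The same 64 KiB
pseudo-random chunk is written repeatedly, which is a simplification.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch series): write `bytes` of
 * pseudo-random data to `path`, then return the wall-clock time of one
 * fsync(2) call in milliseconds. Returns -1.0 on any I/O error.
 * The file is removed again before returning.
 */
static double fsync_latency_ms(const char *path, size_t bytes)
{
    enum { CHUNK = 1 << 16 };
    static unsigned char buf[CHUNK];
    struct timespec t0, t1;
    double ms = -1.0;
    size_t left = bytes;
    int fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1.0;

    /* Fill one chunk with pseudo-random bytes and reuse it. */
    for (size_t i = 0; i < CHUNK; i++)
        buf[i] = (unsigned char)rand();

    while (left > 0) {
        size_t n = left < CHUNK ? left : CHUNK;
        ssize_t w = write(fd, buf, n);
        if (w <= 0)
            goto out;
        left -= (size_t)w;
    }

    /* Time only the fsync, not the buffered writes before it. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (fsync(fd) != 0)
        goto out;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
out:
    close(fd);
    unlink(path);
    return ms;
}
```

Calling fsync_latency_ms("/path/on/ext4/testfile", 1UL << 30) several
times and averaging the results would approximate the test described
above.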

> In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> 
>   wmb();
>   for (i = 0; i < nd_region->ndr_mappings; i++)
>       if (ndrd_get_flush_wpq(ndrd, i, 0))
>           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>   wmb();
> 
> That looks pretty lightweight - it's an MMIO write between write
> barriers.
> 
> This patch implements the MMIO write like this:
> 
>   void nvdimm_flush(NVDIMMDevice *nvdimm)
>   {
>       if (nvdimm->backend_fd != -1) {
>           /*
>            * If the backend store is a physical NVDIMM device, fsync()
>            * will trigger the flush via the flush hint on the host device.
>            */
>           fsync(nvdimm->backend_fd);
>       }
>   }
> 
> The MMIO store instruction turned into a synchronous fsync(2) system
> call plus vmexit/vmenter and QEMU userspace context switch:
> 
> 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
>    instruction has an unexpected and huge latency.
> 
> 2. The vcpu thread holds the QEMU global mutex so all other threads
>    (including the monitor) are blocked during fsync(2).  Other vcpu
>    threads may block if they vmexit.
> 
> It is hard to implement this efficiently in QEMU.  This is why I said
> the hardware interface is not virtualization-friendly.  It's cheap on
> real hardware but expensive under virtualization.
>

I don't have NVDIMM hardware, so I don't know the latency of a write
to a host flush hint address. Dan, do you have any latency data from
bare metal?

> We should think about the optimal way of implementing Flush Hint
> Addresses in QEMU.  But if there is no reasonable way to implement them
> then I think it's better *not* to implement them, just like the Block
> Window feature which is also not virtualization-friendly.  Users who
> want a block device can use virtio-blk.  I don't think NVDIMM Block
> Window can achieve better performance than virtio-blk under
> virtualization (although I'm happy to be proven wrong).
> 
> Some ideas for a faster implementation:
> 
> 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>    global mutex.  Little synchronization is necessary as long as the
>    NVDIMM device isn't hot unplugged (not yet supported anyway).
>

The ACPI spec does not say whether multiple parallel writes to the
same flush hint address are allowed. If they are, I think we can
remove the global locking requirement for the MMIO memory region of
the vNVDIMM flush hint address.

> 2. Can the host kernel provide a way to mmap Address Flush Hints from
>    the physical NVDIMM in cases where the configuration does not require
>    host kernel interception?  That way QEMU can map the physical
>    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
>    is bypassed and performance would be good.
>

It may work if the backend store of the vNVDIMM is a physical NVDIMM
region and writing to the host flush hint address has much lower
latency than performing fsync.

However, if the backend store is a regular file, then we still need to
use fsync.
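
The resulting decision could be sketched like this. It is only an
illustration under assumptions: the enum, both function names, and the
host_hint_mappable flag are hypothetical and are neither QEMU nor
kernel API. The flag stands for the host interface proposed above,
which does not exist today. The sketch also treats every backend that
is not a regular file as a device node, which ignores files on a
DAX-mounted file system.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative names only; not QEMU or kernel API. */
enum flush_method {
    FLUSH_NONE,             /* no backend fd: nothing to persist */
    FLUSH_FSYNC,            /* fall back to fsync(2) on the backend fd */
    FLUSH_HINT_PASSTHROUGH  /* guest writes the host flush hint directly */
};

/*
 * host_hint_mappable is an ASSUMPTION: it means "the host kernel lets
 * us mmap the physical flush hint address", which is only a proposal
 * in this thread.
 */
static enum flush_method pick_flush_method(int backend_fd,
                                           int host_hint_mappable)
{
    struct stat st;

    assert(host_hint_mappable == 0 || host_hint_mappable == 1);
    if (backend_fd < 0)
        return FLUSH_NONE;
    if (fstat(backend_fd, &st) != 0)
        return FLUSH_FSYNC;     /* be conservative on error */
    if (S_ISREG(st.st_mode))
        return FLUSH_FSYNC;     /* regular file: fsync is still needed */
    return host_hint_mappable ? FLUSH_HINT_PASSTHROUGH : FLUSH_FSYNC;
}

/* Convenience wrapper: open the backend read-only, classify, close. */
static enum flush_method flush_method_for_path(const char *path,
                                               int host_hint_mappable)
{
    int fd = open(path, O_RDONLY);
    enum flush_method m = pick_flush_method(fd, host_hint_mappable);

    if (fd >= 0)
        close(fd);
    return m;
}
```

With this shape, only the FLUSH_HINT_PASSTHROUGH case would avoid the
vmexit and the fsync; every other backend keeps the slow path.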

> I'm not sure there is anything we can do to make the case where the host
> kernel wants an fsync(2) fast :(.
> 
> Benchmark results would be important for deciding how big the problem
> is.

Let me collect performance data w/ and w/o this patch series.

Thanks,
Haozhong
