From: Stefan Hajnoczi <stefanha@redhat.com>
To: Pankaj Gupta <pagupta@redhat.com>
Cc: kvm@vger.kernel.org, qemu-devel@nongnu.org, riel@redhat.com,
pbonzini@redhat.com, kwolf@redhat.com,
Haozhong Zhang <haozhong.zhang@intel.com>,
Dan Williams <dan.j.williams@intel.com>,
Xiao Guangrong <xiaoguangrong.eric@gmail.com>
Subject: Re: [Qemu-devel] KVM "fake DAX" device flushing
Date: Thu, 11 May 2017 14:17:03 -0400
Message-ID: <20170511181703.GC8701@stefanha-x1.localdomain>
In-Reply-To: <1494431760-6455-1-git-send-email-pagupta@redhat.com>
On Wed, May 10, 2017 at 09:26:00PM +0530, Pankaj Gupta wrote:
> We are sharing initial project proposal for
> 'KVM "fake DAX" device flushing' project for feedback.
> Got the idea during discussion with 'Rik van Riel'.

CCing NVDIMM folks.

>
> Also, request answers to 'Questions' section.
>
> Abstract :
> ----------
> Project idea is to use fake persistent memory with direct
> access(DAX) in virtual machines. Overall goal of project
> is to increase the number of virtual machines that can be
> run on a physical machine, in order to increase the density
> of customer virtual machines.
>
> The idea is to avoid the guest page cache, and minimize the
> memory footprint of virtual machines. By presenting a disk
> image as a nvdimm direct access (DAX) memory region in a
> virtual machine, the guest OS can avoid using page cache
> memory for most file accesses.
>
> Problem Statement :
> ------------------
> * Guest uses page cache in memory to process fast requests
> for disk read/write. This results in big memory footprint
> of guests without host knowing much details of the guest
> memory.
>
> * If guests use direct access(DAX) with fake persistent
> storage, the host manages the page cache for guests,
> allowing the host to easily reclaim/evict less frequently
> used page cache pages without requiring guest cooperation,
> like ballooning would.
>
> * Host manages the guest cache as an ‘mmapped’ disk image area in
> qemu address space. This region is passed to guest as fake
> persistent memory range. We need a new flushing interface
> to flush this cache to secondary storage to persist guest
> writes.
>
> * A new asynchronous flushing interface will allow guests to
> cause the host to flush dirty data to the backing storage file.
> Systems with pmem storage make use of CLFLUSH instruction
> to flush single cache line to persistent storage and it
> takes care of flushing. With fake persistent storage in
> guest we cannot depend on CLFLUSH instruction to flush entire
> dirty cache to backing storage. Even if we trap and emulate the
> CLFLUSH instruction, the guest vCPU has to wait until we flush all
> the dirty memory. Instead of this we need to implement a new
> asynchronous guest flushing interface, which allows the guest
> to specify a larger range to be flushed at once, and allows
> the vCPU to run something else while the data is being synced
> to disk.
>
> * The new flushing interface will consist of a paravirt driver for a
> new fake-nvdimm-like device which will process guest flushing
> requests like fsync/msync etc instead of pmem library calls
> like clflush. The corresponding device at host side will be
> responsible for flushing requests for guest dirty pages.
> Guest can put current task in sleep and vCPU can run any other
> task while host side flushing of guests pages is in progress.
>
> Host controlled fake nvdimm DAX to avoid guest page cache :
> -------------------------------------------------------------
> * Bypass guest page cache by using a fake persistent storage
> like nvdimm & DAX. Guest Read/Write is directly done on
> fake persistent storage without involving guest kernel for
> caching data.
>
> * Fake nvdimm device passed to guest is backed by a regular
> file in host stored in secondary storage.
>
> * Qemu has implementation of fake NVDIMM/DAX device. Use this
> capability of passing regular host file(disk) as nvdimm device
> to guest.
>
> * Nvdimm with DAX works for ext4/xfs filesystem. Supported
> filesystem should be DAX compatible.
>
> * As we are using guest disk as fake DAX/NVDIMM device, we
> need a mechanism for persistence of data backed on regular
> host storage file.
>
> * For live migration use case, if host side backing file is
> shared storage, we need to flush the page cache for the disk
> image at the destination (new fadvise interface, FADV_INVALIDATE_CACHE?)
> before starting execution of the guest on the destination host.

Good point. QEMU currently only supports live migration with O_DIRECT.
I think the problem was that userspace cannot guarantee consistency in
the general case. If you find a solution to this problem for fake
NVDIMM then maybe the QEMU block layer can also begin supporting live
migration with buffered I/O.
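
For reference, the closest existing call I am aware of is
posix_fadvise(POSIX_FADV_DONTNEED). The sketch below is only an
illustration of that call, not a solution: it is advisory, it skips dirty
pages (hence the fsync first), and it does not provide the invalidation
guarantee that a new FADV_INVALIDATE_CACHE-style interface would need to
give. The helper name drop_cached_pages() is made up for this example, and
Linux is assumed.

```python
import os
import tempfile


def drop_cached_pages(path):
    """Best effort: write back dirty pages, then ask the kernel to drop
    the cached pages of `path`. The kernel is free to keep some of them."""
    fd = os.open(path, os.O_RDWR)
    try:
        # Dirty pages are not dropped by DONTNEED, so write them back first.
        os.fsync(fd)
        # offset=0, len=0 means "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Stand-in for a disk image on shared storage (hypothetical file).
    with tempfile.NamedTemporaryFile() as image:
        image.write(b"\0" * 8192)
        image.flush()
        drop_cached_pages(image.name)
```

Because the call is only a hint, the destination host cannot rely on it to
prove that no stale pages remain, which is the consistency gap mentioned
above.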
>
> Design :
> ---------
> * In order to not have page cache inside the guest, qemu would:
>
> 1) mmap the guest's disk image and present that disk image to
> the guest as a persistent memory range.
>
> 2) Present information to the guest telling it that the persistent
> memory range is not physical persistent memory.

Steps 1 & 2 are already supported by QEMU NVDIMM emulation today.

> 3) Present an additional paravirt device alongside the persistent
> memory range, that can be used to sync (ranges of) data to disk.
>
> * Guest would use the disk image mostly like a persistent memory
> device, with two exceptions:
>
> 1) It would not tell userspace that the files on that device are
> persistent memory. This is done so userspace knows to call
> fsync/msync, instead of the pmem clflush library call.

Not sure I agree with hiding the nvdimm nature of the device. Instead I
think you need to build this capability into the Linux nvdimm code.
libpmem will detect these types of devices and issue fsync/msync when
the application wants to flush.

> 2) When userspace calls fsync/msync on files on the fake persistent
> memory device, issue a request through the paravirt device that
> causes the host to flush the device back end.
>
> * Guest uses fake persistent storage data updates can be still in
> qemu memory. We need a way to flush cached data in host to backed

s/qemu memory/host memory/
I guess you mean that host userspace needs a way to reliably flush an
address range to the underlying storage.
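
To make "flush an address range" concrete, here is a minimal userspace
sketch using a MAP_SHARED file mapping and msync (via Python's mmap.flush,
which uses msync on Unix). It is an illustration under assumptions, not
QEMU code: flush_range() and demo() are names invented for this example,
and a temporary file stands in for the disk image.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


def flush_range(mm, offset, length):
    """Write back [offset, offset + length) of a shared file mapping.

    msync needs a page-aligned start address, so the range is widened
    to page boundaries and clamped to the end of the mapping.
    Returns the (start, size) that was actually flushed.
    """
    start = (offset // PAGE) * PAGE
    end = min(len(mm), -(-(offset + length) // PAGE) * PAGE)
    mm.flush(start, end - start)
    return start, end - start


def demo():
    with tempfile.NamedTemporaryFile() as image:
        image.truncate(4 * PAGE)
        mm = mmap.mmap(image.fileno(), 4 * PAGE,
                       flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            # A store into the mapping only dirties host page cache.
            mm[PAGE + 10:PAGE + 15] = b"hello"
            flushed = flush_range(mm, PAGE + 10, 5)
        finally:
            mm.close()
        # Read the bytes back through the file to confirm the write.
        with open(image.name, "rb") as check:
            check.seek(PAGE + 10)
            return flushed, check.read(5)


if __name__ == "__main__":
    print(demo())
```

The open question for the project is which layer issues this call and on
whose behalf, since the guest only sees the mapped memory.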
> secondary storage.
>
> * Once the guest receives a completion event from the host, it will
> allow userspace programs that were waiting on the fsync/msync to
> continue running.
>
> * Host is responsible for paging in pages in host backing area for
> guest persistent memory as they are accessed by the guest, and
> for evicting pages as host memory fills up.
>
> Questions :
> -----------
> * What should the flushing interface between guest and host look
> like?

A simple hack for prototyping is to instantiate a virtio-blk-pci for
the mmapped host file. The guest can send flush commands on the
virtio-blk-pci device but will otherwise use the mapped memory directly.
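
The split between data path and flush path can be modelled in a few lines
of userspace code. This is a toy sketch, not a virtio implementation:
FlushChannel is a hypothetical stand-in for the flush-only control device,
a worker thread plays the host, and os.fsync on the backing file plays the
flush command (on Linux, to my understanding, fsync also writes back pages
dirtied through a shared mapping of that file). The returned Future models
the completion event the guest waits on.

```python
import mmap
import os
import queue
import tempfile
import threading
from concurrent.futures import Future


class FlushChannel:
    """Toy control path: accepts flush requests, completes them later."""

    def __init__(self, fd):
        self._fd = fd
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._serve, daemon=True)
        self._worker.start()

    def flush(self):
        """Queue a flush and return a Future completed by the 'host'."""
        done = Future()
        self._requests.put(done)
        return done

    def close(self):
        self._requests.put(None)  # sentinel: stop the worker
        self._worker.join()

    def _serve(self):
        while True:
            done = self._requests.get()
            if done is None:
                return
            try:
                os.fsync(self._fd)  # host-side flush of the backing file
                done.set_result(True)
            except OSError as exc:
                done.set_exception(exc)


def demo():
    with tempfile.NamedTemporaryFile() as image:
        image.truncate(mmap.PAGESIZE)
        mm = mmap.mmap(image.fileno(), mmap.PAGESIZE,
                       flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
        chan = FlushChannel(image.fileno())
        try:
            mm[0:4] = b"data"       # data path: plain stores to the mapping
            pending = chan.flush()  # control path: asynchronous request
            # The caller is free to do other work here before waiting.
            return pending.result(timeout=10)
        finally:
            chan.close()
            mm.close()


if __name__ == "__main__":
    print(demo())
```

The caller only blocks in result(), which is where a guest task would
sleep until the host reports that the flush has completed.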
> * Any suggestions to hook the IO caching code with KVM/Qemu or
> thoughts on how we should do it?
>
> * Thinking of implementing a guest para virt driver which will send
> guest requests to Qemu to flush data to disk. Not sure at this
> point how to tell userspace to work on this device as any regular
> device without considering it as persistent device. Any suggestions
> on this?
>
> * Not thought yet about ballooning impact. But feel this solution
> could be better than ballooning in long term? As we will be
> managing all guests cache from host side.
>
> * Not sure this solution works for ARM and other architectures and
> Windows?