From: Pankaj Gupta <pagupta@redhat.com>
To: Haozhong Zhang <haozhong.zhang@intel.com>
Cc: kvm-devel <kvm@vger.kernel.org>,
Qemu Developers <qemu-devel@nongnu.org>,
"linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>,
Rik van Riel <riel@redhat.com>,
Dan Williams <dan.j.williams@intel.com>,
Stefan Hajnoczi <stefanha@redhat.com>,
ross zwisler <ross.zwisler@linux.intel.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Kevin Wolf <kwolf@redhat.com>,
Nitesh Narayan Lal <nilal@redhat.com>,
xiaoguangrong eric <xiaoguangrong.eric@gmail.com>
Subject: Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
Date: Fri, 21 Jul 2017 06:21:39 -0400 (EDT)
Message-ID: <813318776.33377694.1500632499830.JavaMail.zimbra@redhat.com>
In-Reply-To: <20170721095131.ule4owoayuqwh6d3@hz-desktop>
> >
> > Hello,
> >
> > We shared a proposal for 'KVM fake DAX flushing interface'.
> >
> > https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html
> >
>
> In above link,
> "Overall goal of project
> is to increase the number of virtual machines that can be
> run on a physical machine, in order to *increase the density*
> of customer virtual machines"
>
> Is the fake persistent memory used as normal RAM in guest? If no, how
> is it expected to be used in guest?
Yes, the guest will have an NVDIMM DAX device and will not use the page
cache for most operations. The host will manage the memory requirements
of all the guests.
>
> > We did an initial POC in which we used a 'virtio-blk' device to
> > perform a device flush on pmem fsync on an ext4 filesystem. There are
> > a few hacks to make things work. We need suggestions on the points
> > below before we start the actual implementation.
> >
> > A] Problems to solve:
> > ------------------
> >
> > 1] We are considering two approaches for 'fake DAX flushing interface'.
> >
> > 1.1] fake dax with NVDIMM flush hints & KVM async page fault
> >
> > - Existing interface.
> >
> > - The approach to use flush hint address is already nacked upstream.
> >
> > - The flush hint is not a queued interface for flushing. Applications
> > might avoid using it.
> >
> > - The flush hint address traps from guest to host and does an entire
> > fsync on the backing file, which is itself costly.
> >
> > - Can be used to flush specific pages on host backing disk. We can
> > send data(pages information) equal to cache-line size(limitation)
> > and tell host to sync corresponding pages instead of entire disk
> > sync.
> >
> > - This will be an asynchronous operation and vCPU control is returned
> > quickly.
> >
> >
> > 1.2] Using additional para virt device in addition to pmem device(fake dax
> > with device flush)
> >
> > - New interface
> >
> > - Guest maintains information of DAX dirty pages as exceptional
> > entries in the radix tree.
> >
> > - If we want to flush specific pages from guest to host, we need to
> > send the list of dirty pages corresponding to the file on which we
> > are doing fsync.
> >
> > - This will require implementation of a new interface, a new
> > paravirt device for sending flush requests.
> >
> > - Host side will perform fsync/fdatasync on the list of dirty pages
> > or on the entire block-device-backed file.
> >
> > 2] Questions:
> > -----------
> >
> > 2.1] Not sure why WPQ flush is not a queued interface? Can we force
> > applications to call this? Device DAX calls neither fsync nor msync?
> >
> > 2.2] Depending upon the interface we decide on, we need an optimal
> > solution to sync a range of pages?
> >
> > - Send a range of pages from guest to host to sync asynchronously
> > instead of syncing the entire block device?
>
> e.g. a new virtio device to deliver sync requests to host?
>
> >
> > - Other option is to sync the entire disk backing file to make sure
> > all the writes are persistent. In our case, the backing file is a
> > regular file on a non-NVDIMM device, so the host page cache has the
> > list of dirty pages, which can be used either with fsync or a
> > similar interface.
>
> As the amount of dirty pages can vary, the latency of each host
> fsync is likely to vary over a large range.
>
> >
> > 2.3] If we do host fsync on the entire disk we will be flushing all
> > the dirty data to the backend file. Just thinking what would be the
> > better approach: flushing pages on the corresponding guest file
> > fsync, or the entire block device?
> >
> > 2.4] If we decide to choose one of the above approaches, we need to
> > consider all DAX supporting filesystems (ext4/xfs). Would hooking
> > code into the corresponding fsync code of each fs seem reasonable?
> > Just thinking for the flush hint address use-case? Or how would
> > flush hint addresses be invoked with fsync or a similar API?
> >
> > 2.5] Also with filesystem journalling and other mount options like
> > barriers, ordered etc., how do we decide to use the page flush hint
> > or a regular fsync on the file?
> >
> > 2.6] If at guest side we have the PFNs of all the dirty pages in the
> > radix tree and we send these to the host, would the host side be
> > able to find the corresponding pages and flush them all?
>
> That may require the host file system to provide an API to flush
> specified blocks/extents and their metadata in the file system. I'm
> not familiar with this part and don't know whether such an API exists.
>
> Haozhong
>
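To make 1.2 and 2.6 a bit more concrete, here is a rough sketch only, not
something we have implemented. It assumes the host has the backing file
mapped MAP_SHARED and that the guest sends a sorted list of dirty page
indices. The names struct flush_range, coalesce_pages() and flush_ranges()
are hypothetical and only for illustration. msync(2) with MS_SYNC is a
standard call, but which metadata a given host file system writes out for
a sub-range is exactly the open question above, and the sketch does not
answer it.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical request format: one contiguous run of dirty pages. */
struct flush_range {
    uint64_t first_page;   /* page index within the backing file */
    uint64_t nr_pages;     /* number of contiguous pages in the run */
};

/*
 * Guest side (sketch): coalesce a sorted array of dirty page indices
 * into contiguous ranges so fewer requests cross to the host.
 * Duplicates are skipped. Returns the number of ranges written to out,
 * which is at most max_out.
 */
static size_t coalesce_pages(const uint64_t *pages, size_t n,
                             struct flush_range *out, size_t max_out)
{
    size_t nr = 0;

    assert(pages != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (nr > 0) {
            uint64_t end = out[nr - 1].first_page + out[nr - 1].nr_pages;

            if (pages[i] == end) {      /* extends the current run */
                out[nr - 1].nr_pages++;
                continue;
            }
            if (pages[i] < end)         /* duplicate of a page already seen */
                continue;
        }
        if (nr == max_out)              /* out of space: stop early */
            break;
        out[nr].first_page = pages[i];
        out[nr].nr_pages = 1;
        nr++;
    }
    return nr;
}

/*
 * Host side (sketch): write back each requested range of a shared file
 * mapping with msync(MS_SYNC). Ranges come from the guest, so they are
 * bounds-checked against the mapping before use.
 * Returns 0 on success, -1 on a bad range or an msync failure.
 */
static int flush_ranges(void *base, size_t map_len,
                        const struct flush_range *r, size_t nr)
{
    long psz = sysconf(_SC_PAGESIZE);

    if (psz <= 0)
        return -1;

    uint64_t map_pages = map_len / (uint64_t)psz;

    for (size_t i = 0; i < nr; i++) {
        if (r[i].first_page >= map_pages ||
            r[i].nr_pages > map_pages - r[i].first_page)
            return -1;
        if (msync((char *)base + r[i].first_page * (uint64_t)psz,
                  (size_t)(r[i].nr_pages * (uint64_t)psz), MS_SYNC) != 0)
            return -1;
    }
    return 0;
}
```

With dirty pages {3, 4, 5, 9, 9, 10, 20} the guest would send three ranges
(3..5, 9..10, 20) instead of seven page numbers. Falling back to a plain
fsync() on the backing file remains the simple option when the list gets
too long.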