From: Haozhong Zhang <haozhong.zhang@intel.com>
To: Pankaj Gupta <pagupta@redhat.com>
Cc: kvm-devel <kvm@vger.kernel.org>,
	Qemu Developers <qemu-devel@nongnu.org>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>,
	Rik van Riel <riel@redhat.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	ross.zwisler@linux.intel.com, Paolo Bonzini <pbonzini@redhat.com>,
	Kevin Wolf <kwolf@redhat.com>,
	Nitesh Narayan Lal <nilal@redhat.com>,
	xiaoguangrong.eric@gmail.com
Subject: Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
Date: Fri, 21 Jul 2017 17:51:31 +0800
Message-ID: <20170721095131.ule4owoayuqwh6d3@hz-desktop>
In-Reply-To: <945864462.33340808.1500620194836.JavaMail.zimbra@redhat.com>

On 07/21/17 02:56 -0400, Pankaj Gupta wrote:
> 
> Hello,
> 
> We shared a proposal for 'KVM fake DAX flushing interface'.
> 
> https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html
>

In the above link,
  "Overall goal of project 
   is to increase the number of virtual machines that can be 
   run on a physical machine, in order to *increase the density*
   of customer virtual machines"

Is the fake persistent memory used as normal RAM in the guest? If not,
how is it expected to be used in the guest?

> We did an initial POC in which we used a 'virtio-blk' device to perform 
> a device flush on pmem fsync on an ext4 filesystem. There are a few hacks 
> to make things work. We need suggestions on the points below before we 
> start the actual implementation.
>
> A] Problems to solve:
> ------------------
> 
> 1] We are considering two approaches for 'fake DAX flushing interface'.
>     
>  1.1] fake dax with NVDIMM flush hints & KVM async page fault
> 
>      - Existing interface.
> 
>      - The approach to use flush hint address is already nacked upstream.
> 
>      - Flush hint is not a queued interface for flushing. Applications might 
>        avoid using it.
> 
>      - A flush hint address write traps from guest to host and does an entire 
>        fsync on the backing file, which is itself costly.
> 
>      - Can be used to flush specific pages on host backing disk. We can 
>        send data(pages information) equal to cache-line size(limitation) 
>        and tell host to sync corresponding pages instead of entire disk sync.
> 
>      - This will be an asynchronous operation and vCPU control is returned 
>        quickly.
> 
> 
>  1.2] Using additional para virt device in addition to pmem device(fake dax with device flush)
> 
>      - New interface
> 
>      - Guest maintains information of DAX dirty pages as exceptional entries in 
>        radix tree.
> 
>      - If we want to flush specific pages from guest to host, we need to send 
>        list of the dirty pages corresponding to file on which we are doing fsync.
> 
>      - This will require implementation of new interface, a new paravirt device 
>        for sending flush requests.
> 
>      - Host side will perform fsync/fdatasync on list of dirty pages or entire 
>        block device backed file.
> 
> 2] Questions:
> -----------
> 
>  2.1] Not sure why WPQ flush is not a queued interface? Can we force applications 
>       to call this? Device DAX calls neither fsync nor msync?
> 
>  2.2] Depending upon interface we decide, we need optimal solution to sync 
>       range of pages?
> 
>      - Send range of pages from guest to host to sync asynchronously instead 
>        of syncing entire block device?

e.g. a new virtio device to deliver sync requests to host?
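
A minimal sketch of what such sync requests could carry, assuming the
guest already has the list of dirty page indices for the file being
fsync'ed. The wire format (two little-endian u64 fields: byte offset
and byte length), the 4 KiB page size, and the names FLUSH_REQ,
coalesce and encode_flush_requests are all hypothetical; nothing here
is an existing virtio device or spec.

```python
import struct

PAGE_SIZE = 4096  # assumed guest page size

# Hypothetical request layout: u64 byte offset, u64 byte length.
FLUSH_REQ = struct.Struct("<QQ")


def coalesce(dirty_pages):
    """Merge dirty page indices into (first_page, page_count) runs."""
    runs = []
    for page in sorted(set(dirty_pages)):
        if runs and page == runs[-1][0] + runs[-1][1]:
            # Page directly follows the last run: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((page, 1))
    return runs


def encode_flush_requests(dirty_pages):
    """Encode one hypothetical flush request per contiguous run."""
    return [
        FLUSH_REQ.pack(first * PAGE_SIZE, count * PAGE_SIZE)
        for first, count in coalesce(dirty_pages)
    ]
```

Coalescing first keeps the number of requests proportional to the
number of contiguous dirty runs rather than to the number of dirty
pages.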

> 
>      - Other option is to sync entire disk backing file to make sure all the 
>        writes are persistent. In our case, backing file is a regular file on 
>        non NVDIMM device so host page cache has list of dirty pages which
>        can be used either with fsync or similar interface.

Since the number of dirty pages can vary, the latency of each host
fsync is likely to vary over a wide range.
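
One rough way to observe this on a host is to dirty a varying number
of pages in a regular file and time a single fsync. This is only a
sketch: time_fsync is a hypothetical helper, the 4 KiB page size is an
assumption, and the numbers depend entirely on the storage and file
system behind the temporary directory (on tmpfs the fsync is close to
a no-op).

```python
import os
import tempfile
import time

PAGE_SIZE = 4096  # assumed page size


def time_fsync(dirty_pages):
    """Dirty `dirty_pages` pages of a temp file, then time one os.fsync()."""
    with tempfile.TemporaryFile() as f:
        f.write(b"\xa5" * (PAGE_SIZE * dirty_pages))
        f.flush()  # move Python's buffer into the host page cache
        start = time.perf_counter()
        os.fsync(f.fileno())  # whole-file sync, as in the "entire file" option
        return time.perf_counter() - start
```

Calling it as `for n in (1, 256, 4096): print(n, time_fsync(n))` gives
a quick feel for how the cost scales with the amount of dirty data.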

> 
>  2.3] If we do host fsync on entire disk we will be flushing all the dirty data
>       to backend file. Just thinking what would be better approach, flushing 
>       pages on corresponding guest file fsync or entire block device?
> 
>  2.4] If we decide to choose one of the above approaches, we need to consider 
>       all DAX supporting filesystems(ext4/xfs). Would hooking code into the
>       corresponding fsync code of the fs seem reasonable? Just thinking for the
>       flush hint address use-case? Or how would flush hint addresses be invoked
>       with fsync or a similar api?
> 
>  2.5] Also with filesystem journalling and other mount options like barriers, 
>       ordered etc, how we decide to use page flush hint or regular fsync on file?
>  
>  2.6] If on the guest side we have the PFNs of all the dirty pages in the radix
>       tree and we send these to the host, would the host side be able to find the
>       corresponding pages and flush them all?

That may require the host file system to provide an API to flush
specified blocks/extents and their metadata in the file system. I'm not
familiar with this part and don't know whether such an API exists.
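
As a sketch of the data half of that question: if the host process has
the backing file mapped, a byte range can be written back with msync()
on just that part of the mapping. The example below uses Python's
mmap.flush(offset, size), which wraps msync() on Unix. flush_range is a
hypothetical helper; it assumes a non-empty file and an offset inside
it, and it does not establish whether the file system metadata
mentioned above is covered.

```python
import mmap


def flush_range(path, offset, length):
    """Write back only [offset, offset + length) of a file via msync()."""
    page = mmap.PAGESIZE
    start = offset - offset % page  # msync() needs a page-aligned start
    end = offset + length
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the mapping is shared on Unix.
        with mmap.mmap(f.fileno(), 0) as m:
            end = min(end, len(m))  # clamp to the mapped size
            m.flush(start, end - start)
    return start, end - start
```

The return value is the aligned (offset, size) actually passed to
flush(), which makes the rounding visible to the caller.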

Haozhong
