qemu-devel.nongnu.org archive mirror
From: Pankaj Gupta <pagupta@redhat.com>
To: kvm-devel <kvm@vger.kernel.org>,
	Qemu Developers <qemu-devel@nongnu.org>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>
Cc: Rik van Riel <riel@redhat.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	ross.zwisler@linux.intel.com, Paolo Bonzini <pbonzini@redhat.com>,
	Kevin Wolf <kwolf@redhat.com>,
	Nitesh Narayan Lal <nilal@redhat.com>,
	xiaoguangrong.eric@gmail.com,
	Haozhong Zhang <haozhong.zhang@intel.com>
Subject: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
Date: Fri, 21 Jul 2017 02:56:34 -0400 (EDT)
Message-ID: <945864462.33340808.1500620194836.JavaMail.zimbra@redhat.com>
In-Reply-To: <1455443283.33337333.1500618150787.JavaMail.zimbra@redhat.com>


Hello,

We shared a proposal for a 'KVM fake DAX flushing interface':

https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html

We did an initial POC in which we used a 'virtio-blk' device to perform
a device flush on pmem fsync on an ext4 filesystem. There are a few hacks
to make things work. We need suggestions on the points below before we
start the actual implementation.

A] Problems to solve:
------------------

1] We are considering two approaches for the 'fake DAX flushing interface'.

 1.1] Fake DAX with NVDIMM flush hints & KVM async page fault

     - Existing interface.

     - The approach of using the flush hint address has already been NACKed
       upstream.

     - The flush hint is not a queued interface for flushing, so applications
       might avoid using it.

     - The flush hint address traps from guest to host and performs an entire
       fsync on the backing file, which is itself costly.

     - It can be used to flush specific pages on the host backing disk. We can
       send data (page information) of at most one cache line in size (a
       limitation) and tell the host to sync the corresponding pages instead
       of syncing the entire disk.

     - This will be an asynchronous operation, and vCPU control is returned
       quickly.
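
To make the cache-line limitation concrete, here is a minimal sketch of how
page information could be packed into one cache-line-sized message. The
64-byte line size, the (first page, page count) layout, and every name in it
are assumptions for illustration only; nothing like this exists in the POC.

```python
import struct

CACHE_LINE_SIZE = 64            # assumption: 64-byte cache line
RANGE_FORMAT = "<QQ"            # hypothetical layout: (first page, page count), two u64
RANGE_SIZE = struct.calcsize(RANGE_FORMAT)
MAX_RANGES = CACHE_LINE_SIZE // RANGE_SIZE


def pack_flush_message(ranges):
    """Pack (first_page, page_count) pairs into one cache-line-sized message."""
    if len(ranges) > MAX_RANGES:
        raise ValueError("too many ranges for a single cache line")
    payload = b"".join(
        struct.pack(RANGE_FORMAT, first, count) for first, count in ranges
    )
    # Unused slots are zero-filled; a zero page count marks an empty slot.
    return payload.ljust(CACHE_LINE_SIZE, b"\0")


def unpack_flush_message(message):
    """Recover the (first_page, page_count) pairs from a packed message."""
    ranges = []
    for offset in range(0, CACHE_LINE_SIZE, RANGE_SIZE):
        first, count = struct.unpack_from(RANGE_FORMAT, message, offset)
        if count:
            ranges.append((first, count))
    return ranges
```

With this layout only four page ranges fit in one message, so a guest with
many scattered dirty pages would need several messages or would have to fall
back to a full sync.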


 1.2] Using an additional paravirt device alongside the pmem device (fake DAX with device flush)

     - New interface.

     - The guest maintains information about DAX dirty pages as exceptional
       entries in the radix tree.

     - If we want to flush specific pages from guest to host, we need to send
       the list of dirty pages belonging to the file on which we are doing
       fsync.

     - This will require implementing a new interface: a new paravirt device
       for sending flush requests.

     - The host side will perform fsync/fdatasync on the list of dirty pages
       or on the entire file backing the block device.
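
As an illustration of the "list of dirty pages" a flush request might carry,
the sketch below coalesces dirty page indices into contiguous byte ranges.
The function name, the 4 KiB page size, and the (offset, length) request
format are assumptions, not part of the proposal.

```python
def coalesce_dirty_pages(page_indices, page_size=4096):
    """Turn dirty page indices into sorted (byte_offset, byte_length) ranges.

    Duplicate indices are ignored and adjacent pages are merged, so the
    request sent to the host stays as short as possible.
    """
    ranges = []  # each entry is [first_page, page_count]
    for index in sorted(set(page_indices)):
        if ranges and ranges[-1][0] + ranges[-1][1] == index:
            ranges[-1][1] += 1          # extends the previous range
        else:
            ranges.append([index, 1])   # starts a new range
    return [(first * page_size, count * page_size) for first, count in ranges]
```

For example, dirty pages 0, 1, 2, 5, 7 and 8 collapse into three ranges, which
is what the guest would hand to the paravirt device in this sketch.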

2] Questions:
-----------

 2.1] Why is WPQ flush not a queued interface? Can we force applications
      to call it? Device DAX calls neither fsync nor msync?

 2.2] Depending on the interface we choose, we need an optimal solution to
      sync a range of pages:

     - Send a range of pages from guest to host to sync asynchronously,
       instead of syncing the entire block device?

     - The other option is to sync the entire disk backing file to make sure
       all the writes are persistent. In our case, the backing file is a
       regular file on a non-NVDIMM device, so the host page cache has the
       list of dirty pages, which can be used with fsync or a similar
       interface.
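
The two options above can be contrasted with a small host-side sketch. It
assumes the host process has the backing file mapped, and it uses Python's
mmap.flush(offset, size) as a stand-in for a range-limited sync; the helper
names and the demo are hypothetical and are not how QEMU does it.

```python
import mmap
import os
import tempfile

PAGE_SIZE = mmap.PAGESIZE


def flush_whole_file(fd):
    """Option 2: write back every dirty page of the backing file."""
    os.fsync(fd)


def flush_page_range(mapping, first_page, num_pages):
    """Option 1: write back only the given pages of a shared mapping."""
    mapping.flush(first_page * PAGE_SIZE, num_pages * PAGE_SIZE)


def demo():
    """Dirty one page of a temporary backing file and flush it both ways."""
    with tempfile.TemporaryFile() as backing:
        backing.truncate(8 * PAGE_SIZE)
        with mmap.mmap(backing.fileno(), 8 * PAGE_SIZE) as mapping:
            start = 2 * PAGE_SIZE
            mapping[start:start + 5] = b"dirty"
            flush_page_range(mapping, 2, 1)     # sync just page 2
            flush_whole_file(backing.fileno())  # sync everything
            return bytes(mapping[start:start + 5])
```

The range variant touches only the pages the guest reported, while the
whole-file variant needs no page list at all but pays for every dirty page in
the host page cache.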

 2.3] If we do a host fsync on the entire disk, we will be flushing all the
      dirty data to the backing file. Which would be the better approach:
      flushing pages on the corresponding guest file fsync, or flushing the
      entire block device?

 2.4] If we choose one of the above approaches, we need to consider all
      DAX-supporting filesystems (ext4/xfs). Would hooking into the
      corresponding fsync code of each filesystem be reasonable? I am thinking
      of the flush hint address use case here. Or how else would flush hint
      addresses be invoked with fsync or a similar API?

 2.5] Also, with filesystem journalling and mount options such as barriers,
      ordered etc., how do we decide whether to use the page flush hint or a
      regular fsync on the file?

 2.6] If the guest has the PFNs of all the dirty pages in the radix tree and
      we send these to the host, would the host be able to find the
      corresponding pages and flush them all?

Suggestions & ideas are welcome.

Thanks,
Pankaj

