public inbox for kvm@vger.kernel.org
From: Jan Kara <jack@suse.cz>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>, Pankaj Gupta <pagupta@redhat.com>,
	Rik van Riel <riel@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Stefan Hajnoczi <stefanha@gmail.com>,
	kvm-devel <kvm@vger.kernel.org>,
	Qemu Developers <qemu-devel@nongnu.org>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>,
	ross zwisler <ross.zwisler@linux.intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Kevin Wolf <kwolf@redhat.com>,
	Nitesh Narayan Lal <nilal@redhat.com>,
	xiaoguangrong eric <xiaoguangrong.eric@gmail.com>,
	Haozhong Zhang <haozhong.zhang@intel.com>,
	Ross Zwisler <ross.zwisler@intel.com>
Subject: Re: KVM "fake DAX" flushing interface - discussion
Date: Mon, 24 Jul 2017 17:48:49 +0200
Message-ID: <20170724154849.GQ652@quack2.suse.cz>
In-Reply-To: <CAPcyv4g-X8S95=uQBo5MzxpwKfqdTbmBis3B56i59wqWiPnCBA@mail.gmail.com>

On Mon 24-07-17 08:10:05, Dan Williams wrote:
> On Mon, Jul 24, 2017 at 5:37 AM, Jan Kara <jack@suse.cz> wrote:
> > On Mon 24-07-17 08:06:07, Pankaj Gupta wrote:
> >>
> >> > On Sun 23-07-17 13:10:34, Dan Williams wrote:
> >> > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@redhat.com> wrote:
> >> > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
> >> > > >> [ adding Ross and Jan ]
> >> > > >>
> >> > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@redhat.com>
> >> > > >> wrote:
> >> > > >> >
> >> > > >> > The goal is to increase density of guests, by moving page
> >> > > >> > cache into the host (where it can be easily reclaimed).
> >> > > >> >
> >> > > >> > If we assume the guests will be backed by relatively fast
> >> > > >> > SSDs, a "whole device flush" from filesystem journaling
> >> > > >> > code (issued where the filesystem issues a barrier or
> >> > > >> > disk cache flush today) may be just what we need to make
> >> > > >> > that work.
> >> > > >>
> >> > > >> Ok, apologies, I indeed had some pieces of the proposal confused.
> >> > > >>
> >> > > >> However, it still seems like the storage interface is not capable of
> >> > > >> expressing what is needed, because the operation that is needed is a
> >> > > >> range flush. In the guest you want the DAX page dirty tracking to
> >> > > >> communicate range flush information to the host, but there's no
> >> > > >> readily available block i/o semantic that software running on top of
> >> > > >> the fake pmem device can use to communicate with the host. Instead
> >> > > >> you want to intercept the dax_flush() operation and turn it into a
> >> > > >> queued request on the host.
> >> > > >>
> >> > > >> In 4.13 we have turned this dax_flush() operation into an explicit
> >> > > >> driver call. That seems a better interface to modify than trying to
> >> > > >> map block-storage flush-cache / force-unit-access commands to this
> >> > > >> host request.
> >> > > >>
> >> > > >> The additional piece you would need to consider is whether to track
> >> > > >> all writes in addition to mmap writes in the guest as DAX-page-cache
> >> > > >> dirtying events, or arrange for every dax_copy_from_iter()
> >> > > >> operation to also queue a sync on the host, but that essentially
> >> > > >> turns the host page cache into a pseudo write-through mode.
> >> > > >
> >> > > > I suspect initially it will be fine to not offer DAX
> >> > > > semantics to applications using these "fake DAX" devices
> >> > > > from a virtual machine, because the DAX APIs are designed
> >> > > > for a much higher performance device than these fake DAX
> >> > > > setups could ever give.
> >> > >
> >> > > Right, we don't need DAX, per se, in the guest.
> >> > >
> >> > > >
> >> > > > Having userspace call fsync/msync like done normally, and
> >> > > > having those coarser calls be turned into somewhat efficient
> >> > > > backend flushes would be perfectly acceptable.
> >> > > >
> >> > > > The big question is, what should that kind of interface look
> >> > > > like?
> >> > >
> >> > > To me, this looks much like the dirty cache tracking that is done in
> >> > > the address_space radix for the DAX case, but modified to coordinate
> >> > > queued / page-based flushing when the guest wants to persist data.
> >> > > The similarity to DAX is not storing guest allocated pages in the
> >> > > radix but entries that track dirty guest physical addresses.
> >> >
> >> > Let me check whether I understand the problem correctly. So we want to
> >> > export a block device (essentially a page cache of this block device) to a
> >> > guest as PMEM and use DAX in the guest to save guest's page cache. The
> >>
> >> that's correct.
> >>
> >> > natural way to make the persistence work would be to make the ->flush
> >> > callback of the PMEM device do an upcall to the host, which could then
> >> > fdatasync() the appropriate image file range. However, the performance
> >> > would suck in that case, since DAX calls ->flush for ranges of at most
> >> > one page.
> >>
> >> The discussion is: sync a range using a paravirt device or flush hit
> >> addresses, vs. a block device flush.
> >>
> >> >
> >> > So what you could do instead is to completely ignore ->flush calls for the
> >> > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the
> >> > PMEM device (generated by blkdev_issue_flush() or the journalling
> >> > machinery) and fdatasync() the whole image file at that moment - in fact
> >> > you must do that for metadata IO to hit persistent storage anyway in your
> >> > setting. This would very closely follow how exporting block devices with
> >> > volatile cache works with KVM these days AFAIU and the performance will be
> >> > the same.
> >>
> >> Yes, 'blkdev_issue_flush' does set the 'REQ_OP_WRITE | REQ_PREFLUSH' flags.
> >> As per the suggestions, it looks like a block flushing device is the way
> >> ahead.
> >>
> >> If we do an asynchronous block flush on the guest side (put the current
> >> task on a wait queue until the host-side fdatasync completes), would that
> >> serve the purpose? Or do we need another paravirt device for this?
> >
> > Well, even currently, if you have a PMEM device, you also have a block
> > device and a request queue associated with it, and metadata IO goes through
> > that path. So in your case you will have the same in the guest as a result
> > of exposing a virtual PMEM device to the guest, and you just need to make
> > sure this virtual block device behaves the same way as traditional
> > virtualized block devices in KVM in response to
> > 'REQ_OP_WRITE | REQ_PREFLUSH' requests.
> 
> This approach would turn into a full fsync on the host. The question
> in my mind is whether there is any optimization to be had by trapping
> dax_flush() and calling msync() on host ranges, but Jan is right:
> trapping blkdev_issue_flush() and turning around and calling host
> fsync() is the most straightforward approach that does not need driver
> interface changes. The dax_flush() approach would need to modify it
> into an async completion interface.

If the backing device on the host is actually a normal block device or an
image file, doing a full fsync() is the most efficient implementation
anyway...
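
To make the two options concrete, here is a minimal userspace sketch of what
the host side could do on a guest flush. It is Python rather than QEMU or
kernel code, and everything in it is an assumption for illustration only: the
temporary file stands in for the guest image file, and flush_whole_image(),
flush_range() and demo() are hypothetical names, not existing interfaces.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def flush_whole_image(fd):
    # Full flush: roughly what a host could do when the guest's virtual
    # block device sees a REQ_PREFLUSH request (hypothetical handler).
    os.fdatasync(fd)


def flush_range(mapping, offset, length):
    # Range flush: msync() only the pages covering [offset, offset + length).
    # The start must be page aligned, so round it down.
    start = (offset // PAGE) * PAGE
    end = min(len(mapping), offset + length)
    mapping.flush(start, end - start)


def demo():
    # The temporary file is a stand-in for the guest image file.
    with tempfile.TemporaryFile() as image:
        image.truncate(4 * PAGE)
        fd = image.fileno()
        with mmap.mmap(fd, 4 * PAGE) as mapping:
            payload = b"guest data"
            offset = PAGE + 100
            # A store through the shared mapping, as a DAX guest would do.
            mapping[offset:offset + len(payload)] = payload
            flush_range(mapping, offset, len(payload))   # msync()-style option
            flush_whole_image(fd)                        # fdatasync() option
            return os.pread(fd, len(payload), offset)
```

The sketch only shows the shape of the two calls. It says nothing about their
relative cost, which depends on the backing device and the host filesystem.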

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
