Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Alex Williamson <alex.williamson@redhat.com>
To: Haozhong Zhang <haozhong.zhang@intel.com>
Cc: "Dan Williams" <dan.j.williams@intel.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	kvm@vger.kernel.org, "Qemu Developers" <qemu-devel@nongnu.org>,
	"Eduardo Habkost" <ehabkost@redhat.com>,
	"Igor Mammedov" <imammedo@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	dgilbert@redhat.com,
	"Xiao Guangrong" <xiaoguangrong.eric@gmail.com>,
	"Stefan Hajnoczi" <stefanha@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
Date: Thu, 1 Feb 2018 10:43:30 -0700	[thread overview]
Message-ID: <20180201104330.3efb58a4@w520.home> (raw)
In-Reply-To: <20180201101744.wkkqcag5dfcwxjak@hz-desktop>

On Thu, 1 Feb 2018 18:17:44 +0800
Haozhong Zhang <haozhong.zhang@intel.com> wrote:

> On 01/31/18 19:02 -0800, Dan Williams wrote:
> > On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
> > <haozhong.zhang@intel.com> wrote:  
> > > + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect.
> > >
> > > On 01/31/18 16:32 -0800, Dan Williams wrote:  
> > >> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
> > >> <haozhong.zhang@intel.com> wrote:  
> > >> > On 01/31/18 16:08 -0800, Dan Williams wrote:  
> > >> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> > >> >> <haozhong.zhang@intel.com> wrote:  
> > >> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:  
> > >> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> > >> >> >> <haozhong.zhang@intel.com> wrote:  
> > >> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> > >> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> > >> >> >> > files on ext4/xfs file system mounted with '-o dax').  
> > >> >> >>
> > >> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> > >> >> >> metadata is in sync after a fault. However, that does not make
> > >> >> >> filesystem-DAX safe for use with QEMU, because we still need to
> > >> >> >> coordinate DMA with fileystem operations. There is no way to do that
> > >> >> >> coordination from within a guest. QEMU needs to use device-dax if the
> > >> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> > >> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
> > >> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> > >> >> >> pages to be mapped in EPT entries unless / until we have a solution to
> > >> >> >> the DMA synchronization problem. Apologies for not noticing this
> > >> >> >> earlier.  
> > >> >> >
> > >> >> > QEMU does not truncate or punch holes of the file once it has been
> > >> >> > mmap()'ed. Does the problem [1] still exist in such case?  
> > >> >>
> > >> >> Something else on the system might. The only agent that could enforce
> > >> >> protection is the kernel, and the kernel will likely just disallow
> > >> >> passing addresses from filesystem-dax vmas through to a guest
> > >> >> altogether. I think there's even a problem in the non-DAX case unless
> > >> >> KVM is pinning pages while they are handed out to a guest. The problem
> > >> >> is that we don't have a page cache page to pin in the DAX case.
> > >> >>  
> > >> >
> > >> > Does it mean any user-space code like
> > >> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
> > >> >   // make DMA to ptr
> > >> > is unsafe?  
> > >>
> > >> Yes, it is currently unsafe because there is no coordination with the
> > >> filesytem if it decides to make block layout changes. We can fix that
> > >> in the non-virtualization case by having the filesystem wait for DMA
> > >> completion callbacks (i.e. what for all pages to be idle), but as far
> > >> as I can see we can't do the same coordination for DMA initiated by a
> > >> guest device driver.
> > >>  
> > >
> > > I think that fix [1] also works for KVM/QEMU. The guest DMA are
> > > performed on two types of devices:
> > >
> > > 1. For emulated devices, the guest DMA requests are trapped and
> > >    actually performed by QEMU on the host side. The host side fix [1]
> > >    can cover this case.
> > >
> > > 2. For passthrough devices, vfio pins all pages, including those
> > >    backed by dax mode files, used by the guest if any device is
> > >    passthroughed to it. If I read the commit message in [2] correctly,
> > >    operations that change the page-to-file offset association of pages
> > >    from dax mode files will be deferred until the reference count of
> > >    the affected pages becomes 1.  That is, if any passthrough device
> > >    is used with a VM, the changes of page-to-file offset will not be
> > >    able to happen until the VM is shutdown, so the fix [1] still takes
> > >    effect here.  
> > 
> > This sounds like a longterm mapping under control of vfio and not the
> > filesystem. See get_user_pages_longterm(), it is a problem if pages
> > are pinned indefinitely especially DAX. It sounds like vfio is in the
> > same boat as RDMA and cannot support long lived pins of DAX pages. As
> > of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
> > fix will be to create a "memory-registration with lease" semantic
> > available for RDMA so that the kernel can forcibly revoke page pinning
> > to perform physical layout changes. In the near it seems
> > vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> > that filesystem-dax mappings are explicitly disallowed.  
> 
> It seems that KVM and VFIO need to switch to get_user_pages_longterm()
> which fails getting pages backed by dax mode files.
> 
> However, as get_user_pages() and its variants in the current KVM and
> VFIO may be called after a VM starts running, e.g., handling EPT
> violation on demand, and hotplugging a passthrough device to VM,
> simply switching to the longterm version would cause VM crash in those
> cases. Therefore, it also needs to patch or document in QEMU to not
> use dax files with memory-backend-file. Paolo, Radim and Alex, what do
> you think?

Yeah, it looks like vaddr_get_pfn() needs to do its own vma_is_fsdax()
check or convert it to the _longterm gup variant.  On hot-adding an
assigned device to a VM, QEMU should just fail the initfn of the
device, which would be non-fatal to the VM.  OTOH, if one of these
problem mappings can be hot added to the VM, such as via memory
hotplug, I think the mapping failure would be fatal to the VM.  Thanks,

Alex

next prev parent reply	other threads:[~2018-02-01 17:43 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-31  6:02 [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file Haozhong Zhang
2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 1/6] util/mmap-alloc: switch qemu_ram_mmap() to 'flags' parameter Haozhong Zhang
2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 2/6] exec: switch qemu_ram_alloc_from_{file, fd} to the " Haozhong Zhang
2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 3/6] memory: switch memory_region_init_ram_from_file() to " Haozhong Zhang
2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 4/6] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap() Haozhong Zhang
2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 5/6] hostmem: add more information in error messages Haozhong Zhang
2018-01-31  6:02 ` [Qemu-devel] [PATCH v4 6/6] hostmem-file: add 'sync' option Haozhong Zhang
2018-01-31 22:25 ` [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file Dan Williams
2018-02-01  0:02   ` Haozhong Zhang
2018-02-01  0:08     ` Dan Williams
2018-02-01  0:24       ` Haozhong Zhang
2018-02-01  0:32         ` Dan Williams
2018-02-01  2:29           ` Haozhong Zhang
2018-02-01  3:02             ` Dan Williams
2018-02-01  3:11               ` Dan Williams
2018-02-01 10:17               ` Haozhong Zhang
2018-02-01 17:43                 ` Alex Williamson [this message]
2018-02-01 17:05               ` Michael S. Tsirkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180201104330.3efb58a4@w520.home \
    --to=alex.williamson@redhat.com \
    --cc=dan.j.williams@intel.com \
    --cc=dgilbert@redhat.com \
    --cc=ehabkost@redhat.com \
    --cc=haozhong.zhang@intel.com \
    --cc=imammedo@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=rkrcmar@redhat.com \
    --cc=stefanha@redhat.com \
    --cc=xiaoguangrong.eric@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).