From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:43751) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ehIti-00057h-EW for qemu-devel@nongnu.org; Thu, 01 Feb 2018 12:43:43 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ehIte-0006iF-Gq for qemu-devel@nongnu.org; Thu, 01 Feb 2018 12:43:42 -0500 Received: from mx1.redhat.com ([209.132.183.28]:36916) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1ehIte-0006hY-8L for qemu-devel@nongnu.org; Thu, 01 Feb 2018 12:43:38 -0500 Date: Thu, 1 Feb 2018 10:43:30 -0700 From: Alex Williamson Message-ID: <20180201104330.3efb58a4@w520.home> In-Reply-To: <20180201101744.wkkqcag5dfcwxjak@hz-desktop> References: <20180131060229.9294-1-haozhong.zhang@intel.com> <20180201000240.zmwa6wwnthg2su36@hz-desktop> <20180201002457.hgyvph7z6iys5yvy@hz-desktop> <20180201022900.4znup2fovqr7bu5m@hz-desktop> <20180201101744.wkkqcag5dfcwxjak@hz-desktop> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Haozhong Zhang Cc: Dan Williams , Paolo Bonzini , Radim =?UTF-8?B?S3LEjW3DocWZ?= , kvm@vger.kernel.org, Qemu Developers , Eduardo Habkost , Igor Mammedov , "Michael S. Tsirkin" , dgilbert@redhat.com, Xiao Guangrong , Stefan Hajnoczi On Thu, 1 Feb 2018 18:17:44 +0800 Haozhong Zhang wrote: > On 01/31/18 19:02 -0800, Dan Williams wrote: > > On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang > > wrote: > > > + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect. > > > > > > On 01/31/18 16:32 -0800, Dan Williams wrote: > > >> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang > > >> wrote: > > >> > On 01/31/18 16:08 -0800, Dan Williams wrote: > > >> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang > > >> >> wrote: > > >> >> > On 01/31/18 14:25 -0800, Dan Williams wrote: > > >> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang > > >> >> >> wrote: > > >> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to > > >> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g., > > >> >> >> > files on ext4/xfs file system mounted with '-o dax'). > > >> >> >> > > >> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the > > >> >> >> metadata is in sync after a fault. However, that does not make > > >> >> >> filesystem-DAX safe for use with QEMU, because we still need to > > >> >> >> coordinate DMA with fileystem operations. There is no way to do that > > >> >> >> coordination from within a guest. QEMU needs to use device-dax if the > > >> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch > > >> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to > > >> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX > > >> >> >> pages to be mapped in EPT entries unless / until we have a solution to > > >> >> >> the DMA synchronization problem. Apologies for not noticing this > > >> >> >> earlier. > > >> >> > > > >> >> > QEMU does not truncate or punch holes of the file once it has been > > >> >> > mmap()'ed. Does the problem [1] still exist in such case? > > >> >> > > >> >> Something else on the system might. The only agent that could enforce > > >> >> protection is the kernel, and the kernel will likely just disallow > > >> >> passing addresses from filesystem-dax vmas through to a guest > > >> >> altogether. I think there's even a problem in the non-DAX case unless > > >> >> KVM is pinning pages while they are handed out to a guest. The problem > > >> >> is that we don't have a page cache page to pin in the DAX case. > > >> >> > > >> > > > >> > Does it mean any user-space code like > > >> > ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem > > >> > // make DMA to ptr > > >> > is unsafe? > > >> > > >> Yes, it is currently unsafe because there is no coordination with the > > >> filesytem if it decides to make block layout changes. We can fix that > > >> in the non-virtualization case by having the filesystem wait for DMA > > >> completion callbacks (i.e. what for all pages to be idle), but as far > > >> as I can see we can't do the same coordination for DMA initiated by a > > >> guest device driver. > > >> > > > > > > I think that fix [1] also works for KVM/QEMU. The guest DMA are > > > performed on two types of devices: > > > > > > 1. For emulated devices, the guest DMA requests are trapped and > > > actually performed by QEMU on the host side. The host side fix [1] > > > can cover this case. > > > > > > 2. For passthrough devices, vfio pins all pages, including those > > > backed by dax mode files, used by the guest if any device is > > > passthroughed to it. If I read the commit message in [2] correctly, > > > operations that change the page-to-file offset association of pages > > > from dax mode files will be deferred until the reference count of > > > the affected pages becomes 1. That is, if any passthrough device > > > is used with a VM, the changes of page-to-file offset will not be > > > able to happen until the VM is shutdown, so the fix [1] still takes > > > effect here. > > > > This sounds like a longterm mapping under control of vfio and not the > > filesystem. See get_user_pages_longterm(), it is a problem if pages > > are pinned indefinitely especially DAX. It sounds like vfio is in the > > same boat as RDMA and cannot support long lived pins of DAX pages. As > > of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual > > fix will be to create a "memory-registration with lease" semantic > > available for RDMA so that the kernel can forcibly revoke page pinning > > to perform physical layout changes. In the near it seems > > vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so > > that filesystem-dax mappings are explicitly disallowed. > > It seems that KVM and VFIO need to switch to get_user_pages_longterm() > which fails getting pages backed by dax mode files. > > However, as get_user_pages() and its variants in the current KVM and > VFIO may be called after a VM starts running, e.g., handling EPT > violation on demand, and hotplugging a passthrough device to VM, > simply switching to the longterm version would cause VM crash in those > cases. Therefore, it also needs to patch or document in QEMU to not > use dax files with memory-backend-file. Paolo, Radim and Alex, what do > you think? Yeah, it looks like vaddr_get_pfn() needs to do its own vma_is_fsdax() check or convert it to the _longterm gup variant. On hot-adding an assigned device to a VM, QEMU should just fail the initfn of the device, which would be non-fatal to the VM. OTOH, if one of these problem mappings can be hot added to the VM, such as via memory hotplug, I think the mapping failure would be fatal to the VM. Thanks, Alex