From: Dan Williams
Date: Thu, 18 Jan 2018 09:38:13 -0800
Subject: Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
To: David Hildenbrand
Cc: Pankaj Gupta, Paolo Bonzini, Rik van Riel, Xiao Guangrong,
 Christoph Hellwig, Jan Kara, Stefan Hajnoczi, Stefan Hajnoczi,
 kvm-devel, Qemu Developers, linux-nvdimm@lists.01.org, Ross Zwisler,
 Kevin Wolf, Nitesh Narayan Lal, Haozhong Zhang, Ross Zwisler

On Thu, Jan 18, 2018 at 8:53 AM, David Hildenbrand wrote:
> On 24.11.2017 13:40, Pankaj Gupta wrote:
>>
>> Hello,
>>
>> Thank you all for all the useful suggestions.
>> I want to summarize the discussions so far in the
>> thread. Please see below:
>>
>>>>>
>>>>>> We can go with the "best" interface for what
>>>>>> could be a relatively slow flush (fsync on a
>>>>>> file on ssd/disk on the host), which requires
>>>>>> that the flushing task wait on completion
>>>>>> asynchronously.
>>>>>
>>>>>
>>>>> I'd like to clarify the interface of "wait on completion
>>>>> asynchronously" and KVM async page fault a bit more.
>>>>>
>>>>> The current design of async-page-fault only works on RAM rather
>>>>> than MMIO, i.e., if the page fault is caused by accessing the
>>>>> device memory of an emulated device, it needs to go to
>>>>> userspace (QEMU), which emulates the operation in the vCPU's
>>>>> thread.
>>>>>
>>>>> As I mentioned before, the memory region used for the vNVDIMM
>>>>> flush interface should be MMIO, and considering its support
>>>>> on other hypervisors, we had better push this async
>>>>> mechanism into the flush interface design itself rather
>>>>> than depend on KVM async-page-fault.
>>>>
>>>> I would expect this interface to be virtio-ring based, to queue flush
>>>> requests asynchronously to the host.
>>>
>>> Could we reuse the virtio-blk device, only with a different device id?
>>
>> As per previous discussions, there were suggestions on the two main parts of the project:
>>
>> 1] Expose the vNVDIMM memory range to the KVM guest.
>>
>> - Add a flag in the ACPI NFIT table for this new memory type. Do we need NVDIMM spec
>> changes for this?
>>
>> - The guest should be able to add this memory to its system memory map. The name of the
>> added memory in '/proc/iomem' should be different (shared memory?) from persistent memory,
>> as it does not satisfy the exact definition of persistent memory (it requires an explicit flush).
>>
>> - The guest should not allow 'device-dax' and other fancy features which are not
>> virtualization friendly.
>>
>> 2] Flushing interface to persist guest changes.
>>
>> - As per the suggestion by ChristophH (CCed), we explored options other than virtio, like MMIO etc.
>> It looks like most of these options are not use-case friendly: we want to do fsync on a
>> file on ssd/disk on the host, and we cannot make the guest's vCPUs wait for that time.
>>
>> - Adding a new driver (virtio-pmem) looks like repeated work and is not needed, so we can
>> go with the existing pmem driver and add a flush specific to this new memory type.
>
> I'd like to emphasize again that I would prefer a virtio-pmem-only
> solution.
>
> There are architectures out there (e.g. s390x) that don't support
> NVDIMMs - there is no HW interface to expose any such stuff.
>
> However, with virtio-pmem, we could also make it work on architectures
> that don't have ACPI and friends.

ACPI and virtio-only can share the same pmem driver. There are two
parts to this: region discovery and setting up the pmem driver. For
discovery you can either have an NFIT-bus-defined range, or have a new
virtio-pmem bus define it. The pmem driver itself is agnostic to how
the range is discovered. In other words, pmem consumes 'regions' from
libnvdimm, and a bus provider like nfit, e820, or a new virtio
mechanism produces 'regions'.
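
To make that provider/consumer split concrete, here is a rough, hypothetical
sketch of what such a virtio region provider's probe routine could look like.
Nothing on the virtio side had been specified at the time of this thread, so
the device config layout (struct virtio_pmem_config), the provider name, and
the driver shape are assumptions for illustration only; nvdimm_bus_register()
and nvdimm_pmem_region_create() are the existing libnvdimm provider interface
that nfit and e820 already use. The asynchronous flush path discussed above
is deliberately not shown, since that is the genuinely new piece.

/*
 * Hypothetical sketch only -- not an implementation.  A minimal virtio
 * "region provider": discover the range from an assumed virtio config
 * layout instead of NFIT/e820, then hand it to libnvdimm so the existing
 * pmem driver can bind to the resulting region.
 */
#include <linux/module.h>
#include <linux/ioport.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <linux/libnvdimm.h>

struct virtio_pmem_config {	/* assumed config layout, for illustration */
	__u64 start;		/* guest-physical base of the backing range */
	__u64 size;		/* length of the range in bytes */
};

static struct nvdimm_bus_descriptor nd_desc = {
	.provider_name = "virtio-pmem",	/* hypothetical provider name */
	.module        = THIS_MODULE,
};

static int virtio_pmem_probe(struct virtio_device *vdev)
{
	struct nd_region_desc ndr_desc = { };
	struct nvdimm_bus *nvdimm_bus;
	struct resource res = { };
	u64 start, size;

	/* Region discovery: read the range the host advertises. */
	virtio_cread(vdev, struct virtio_pmem_config, start, &start);
	virtio_cread(vdev, struct virtio_pmem_config, size, &size);

	res.start = start;
	res.end   = start + size - 1;
	res.flags = IORESOURCE_MEM;

	/* Register a bus and produce a 'region'; pmem consumes it as usual. */
	nvdimm_bus = nvdimm_bus_register(&vdev->dev, &nd_desc);
	if (!nvdimm_bus)
		return -ENXIO;

	ndr_desc.res = &res;
	ndr_desc.numa_node = NUMA_NO_NODE;
	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);

	if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc)) {
		nvdimm_bus_unregister(nvdimm_bus);
		return -ENXIO;
	}

	/*
	 * The async flush interface discussed in this thread (a virtio queue
	 * carrying flush requests, completed when the host's fsync returns)
	 * would hook in on top of this, and is not sketched here.
	 */
	return 0;
}

The point is simply that once such a provider produces the region, the
existing pmem driver binds to it unchanged, regardless of whether the range
was discovered via NFIT, e820, or virtio.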