From mboxrd@z Thu Jan 1 00:00:00 1970
From: Haozhong Zhang
Subject: Re: [RFC Design Doc] Add vNVDIMM support for Xen
Date: Wed, 17 Feb 2016 17:03:55 +0800
Message-ID: <20160217090355.GE5459@hz-desktop.sh.intel.com>
References: <20160203070052.GA4248@hz-desktop.sh.intel.com>
 <56B219D4.9030507@citrix.com>
 <20160204025526.GA3504@hz-desktop.sh.intel.com>
 <20160215031657.GA8938@hz-desktop.sh.intel.com>
 <56C32A6302000078000D2A1C@prv-mh.provo.novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
In-Reply-To: <56C32A6302000078000D2A1C@prv-mh.provo.novell.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Xiao Guangrong
Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
 George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
 "xen-devel@lists.xen.org", Jan Beulich, Jun Nakajima, Keir Fraser
List-Id: xen-devel@lists.xenproject.org

On 02/16/16 05:55, Jan Beulich wrote:
> >>> On 16.02.16 at 12:14, wrote:
> > On Mon, 15 Feb 2016, Zhang, Haozhong wrote:
> >> On 02/04/16 20:24, Stefano Stabellini wrote:
> >> > On Thu, 4 Feb 2016, Haozhong Zhang wrote:
> >> > > On 02/03/16 15:22, Stefano Stabellini wrote:
> >> > > > On Wed, 3 Feb 2016, George Dunlap wrote:
> >> > > > > On 03/02/16 12:02, Stefano Stabellini wrote:
> >> > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> >> > > > > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> >> > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> >> > > > > >> those files to QEMU. In this way, non-root QEMU should be able to
> >> > > > > >> mmap those files.
> >> > > > > >
> >> > > > > > Maybe that would work. Worth adding it to the design, I would like to
> >> > > > > > read more details on it.
> >> > > > > >
> >> > > > > > Also note that QEMU initially runs as root but drops privileges to
> >> > > > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> >> > > > > > *could* mmap /dev/pmem0 while it is still running as root, but then it
> >> > > > > > wouldn't work for any devices that need to be mmap'ed at run time
> >> > > > > > (hotplug scenario).
> >> > > > >
> >> > > > > This is basically the same problem we have for a bunch of other things,
> >> > > > > right? Having xl open a file and then pass it via qmp to qemu should
> >> > > > > work in theory, right?
> >> > > >
> >> > > > Is there one /dev/pmem? per assignable region?
> >> > >
> >> > > Yes.
> >> > >
> >> > > BTW, I'm wondering whether and how non-root qemu works with xl disk
> >> > > configuration that is going to access a host block device, e.g.
> >> > >      disk = [ '/dev/sdb,,hda' ]
> >> > > If that works with non-root qemu, I may take a similar solution for
> >> > > pmem.
> >> >
> >> > Today the user is required to give the correct ownership and access mode
> >> > to the block device, so that non-root QEMU can open it. However in the
> >> > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
> >> > the feature doesn't work at all with non-root QEMU
> >> > (http://marc.info/?l=xen-devel&m=145261763600528).
> >> >
> >> > If there is one /dev/pmem device per assignable region, then it would be
> >> > conceivable to change its ownership so that non-root QEMU can open it.
> >> > Or, better, the file descriptor could be passed by the toolstack via
> >> > qmp.
> >>
> >> Passing a file descriptor via qmp is not enough.
> >>
> >> Let me clarify where the requirement for root/privileged permissions
> >> comes from. The primary workflow in my design that maps a host pmem
> >> region or files in host pmem region to guest is shown below:
> >>  (1) QEMU in Dom0 mmaps the host pmem (the host /dev/pmem0 or files on
> >>      /dev/pmem0) to its virtual address space, i.e.
> >>      the guest virtual address space.
> >>  (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA
> >>      occupied by the host pmem to a DomU. This step requires the
> >>      translation from the guest virtual address (where the host pmem is
> >>      mmap'ed in (1)) to the host physical address. The translation can be
> >>      done by either
> >>     (a) QEMU that parses its own /proc/self/pagemap,
> >>      or
> >>     (b) Xen hypervisor that does the translation by itself [1] (though
> >>         this choice is not quite doable from Konrad's comments [2]).
> >>
> >> [1] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html
> >> [2] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html
> >>
> >> For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability
> >> since Linux kernel 4.0. Furthermore, if we don't mlock the mapped host
> >> pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap),
> >> pagemap will not contain all mappings. However, mlock may require
> >> privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because
> >> mlock operates on memory, the permission to open(2) the host pmem files
> >> does not solve the problem and therefore passing a file descriptor via
> >> qmp does not help.
> >>
> >> For 2-b, from Konrad's comments [2], mlock is also required and
> >> privileged permission may be required consequently.
> >>
> >> Note that the mapping and the address translation are done before QEMU
> >> drops privileged permissions, so non-root QEMU should be able to work
> >> with the above design until we start considering vNVDIMM hotplug (which
> >> is not yet supported by the current vNVDIMM implementation in QEMU). In
> >> the hotplug case, we may let Xen pass explicit flags to QEMU to keep it
> >> running with root permissions.
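
For illustration only, option 2-a above could look roughly like the
sketch below. It is not taken from the QEMU or Xen sources: the
function names (pagemap_entry_to_pfn, read_pagemap_entry,
lock_and_translate) are made up for this example, and it assumes the
documented Linux pagemap entry layout (bit 63 = present, bits 0-54 =
page frame number). The address it yields is whatever the Dom0 kernel
reports as a page frame, which may or may not be the SPA that Xen needs.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* One 64-bit pagemap entry per virtual page (layout assumed from the
 * kernel's pagemap documentation). */
#define PM_PRESENT  (1ULL << 63)        /* page is present in RAM */
#define PM_PFN_MASK ((1ULL << 55) - 1)  /* bits 0-54: page frame number */

/*
 * Decode a raw pagemap entry. Returns 0 and stores the PFN, or -1 if the
 * page is not present or the PFN field is zero. A zero PFN is treated as
 * "hidden" here (an assumption of this sketch): kernels >= 4.2 zero the
 * field for callers without CAP_SYS_ADMIN.
 */
int pagemap_entry_to_pfn(uint64_t entry, uint64_t *pfn)
{
    if (!(entry & PM_PRESENT))
        return -1;
    *pfn = entry & PM_PFN_MASK;
    return *pfn ? 0 : -1;
}

/* Read the pagemap entry covering vaddr from /proc/self/pagemap. */
int read_pagemap_entry(const void *vaddr, uint64_t *entry)
{
    long page_size = sysconf(_SC_PAGESIZE);
    off_t off;
    ssize_t n;
    int fd;

    if (page_size <= 0)
        return -1;
    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;  /* e.g. EPERM on 4.0/4.1 without CAP_SYS_ADMIN */

    /* The file holds one 8-byte entry per virtual page number. */
    off = (off_t)(((uintptr_t)vaddr / (uintptr_t)page_size) * sizeof(*entry));
    if (lseek(fd, off, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    n = read(fd, entry, sizeof(*entry));
    close(fd);
    return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}

/*
 * Hypothetical helper for step (2)(a): pin [addr, addr + len) with mlock()
 * so the pages stay resident, then translate the first byte to the
 * physical address reported by the kernel.
 */
int lock_and_translate(const void *addr, size_t len, uint64_t *paddr)
{
    uint64_t entry, pfn;
    long page_size = sysconf(_SC_PAGESIZE);

    if (page_size <= 0)
        return -1;
    if (mlock(addr, len) != 0)
        return -1;  /* can fail (ENOMEM/EPERM) above RLIMIT_MEMLOCK */
    if (read_pagemap_entry(addr, &entry) != 0 ||
        pagemap_entry_to_pfn(entry, &pfn) != 0)
        return -1;
    *paddr = pfn * (uint64_t)page_size
             + (uintptr_t)addr % (uintptr_t)page_size;
    return 0;
}
```

QEMU would call something like lock_and_translate() on the pointer
returned by mmap() in step (1), while it is still running as root.
Called after privileges are dropped, it is expected to return -1,
either from mlock() or because the PFN is not exposed, which is the
same limitation described in the 2-a paragraph above.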
> >
> > Are we all good with the fact that vNVDIMM hotplug won't work (unless
> > the user explicitly asks for it at domain creation time, which is
> > very unlikely otherwise she could use coldplug)?
>
> No, at least there needs to be a road towards hotplug, even if
> initially this may not be supported/implemented.

Guangrong: any plan or design for vNVDIMM hotplug in QEMU?

Haozhong