From: Alex Bennée
To: Juergen Gross
Cc: Andrew Cooper, Viresh Kumar, Stefano Stabellini,
 "xen-devel@lists.xen.org", "Stratos-dev@op-lists.linaro.org",
 "mathieu.poirier@linaro.com", "christopher.w.clark@gmail.com",
 "boris.ostrovsky@oracle.com", "gregkh@linuxfoundation.org",
 "vincent.guittot@linaro.org", "olekstysh@gmail.com"
Subject: Re: Understanding osdep_xenforeignmemory_map mmap behaviour
Date: Wed, 31 Aug 2022 17:02:03 +0100
Message-ID: <875yi8pfhc.fsf@linaro.org>
In-reply-to: <9697a559-fe42-3efb-5715-1f6bc631f57d@suse.com>
References: <57549560-879d-f705-8693-9bfdc73e3f7f@suse.com>
 <20220824091918.szrd6e6ll6h6u2pw@vireshk-i7>
 <63eecf7a-aa69-28bf-9664-8ee8554a161f@citrix.com>
 <877d2xj23o.fsf@linaro.org> <87y1vdh9or.fsf@linaro.org>
 <9697a559-fe42-3efb-5715-1f6bc631f57d@suse.com>

Juergen Gross writes:

> [[PGP Signed Part:Undecided]]
> On 24.08.22 17:58, Alex Bennée wrote:
>> Juergen Gross writes:
>>
>>> [[PGP Signed Part:Undecided]]
>>> On 24.08.22 13:22, Alex Bennée wrote:
>>>> Andrew Cooper writes:
>>>>
>>>>> On 24/08/2022 10:19, Viresh Kumar wrote:
>>>>>> On 24-03-22, 06:12, Juergen Gross wrote:
>>>>>>> For a rather long time we were using "normal" user pages for this purpose,
>>>>>>> which were just locked into memory for doing the hypercall.
>>>>>>>
>>>>>>> Unfortunately there have been very rare problems with that approach, as
>>>>>>> the Linux kernel can set a user page related PTE to invalid for short
>>>>>>> periods of time, which led to EFAULT in the hypervisor when trying to
>>>>>>> access the hypercall data.
>>>>>>>
>>>>>>> In Linux this can be avoided only by using kernel memory, which is the
>>>>>>> reason why the hypercall buffers are allocated and mmap()-ed through the
>>>>>>> privcmd driver.
>>>>>> Hi Juergen,
>>>>>>
>>>>>> I understand why we moved from user pages to kernel pages, but I don't
>>>>>> fully understand why we need to make two separate calls to map the
>>>>>> guest memory, i.e. mmap() followed by ioctl(IOCTL_PRIVCMD_MMAPBATCH).
>>>>>>
>>>>>> Why aren't we doing all of it from mmap() itself? I hacked it up to
>>>>>> check on it and it works fine if we do it all from mmap() itself.
>>>> As I understand it the MMAPBATCH ioctl is being treated like every
>>>> other hypercall proxy through the ioctl interface. Which makes sense from
>>>> the point of view of having a consistent interface to the hypervisor but
>>>> not from the point of view of providing a consistent userspace interface
>>>> for mapping memory which doesn't care about the hypervisor details.
>>>> The privcmd_mmapbatch_v2 interface is slightly richer than what you
>>>> could expose via mmap() because it allows the handling of partial
>>>> mappings with what I presume is a per-page *err array. If you issued the
>>>> hypercall directly from the mmap() and one of the pages wasn't mapped by
>>>> the hypervisor you would have to unwind everything before returning
>>>> EFAULT to the user.
>>>>
>>>>>> Aren't we abusing the Linux userspace ABI here? As standard userspace
>>>>>> code would expect just mmap() to be enough to map the memory. Yes, the
>>>>>> current user, Xen itself, is adapted to make two calls, but it breaks
>>>>>> as soon as we want to use something that relies on Linux userspace
>>>>>> ABI.
>>>>>>
>>>>>> For instance, in our case, where we are looking to create
>>>>>> hypervisor-agnostic virtio backends, the rust-vmm library [1] issues
>>>>>> mmap() only and expects it to work. It doesn't know it is running on a
>>>>>> Xen system, and it shouldn't know that either.
>>>>>
>>>>> Use /dev/xen/hypercall which has a sane ABI for getting "safe" memory.
>>>>> privcmd is very much not sane.
>>>>>
>>>>> In practice you'll need to use both. /dev/xen/hypercall for getting
>>>>> "safe" memory, and /dev/xen/privcmd for issuing hypercalls for now.
>>>> I'm unsure what is meant by safe memory here. privcmd_buf_mmap() looks
>>>> like it just allocates a bunch of GFP_KERNEL pages rather than
>>>> interacting with the hypervisor directly. Are these the same pages that
>>>> get used when you eventually call privcmd_ioctl_mmap_batch()?
>>>
>>> privcmd_buf_mmap() is allocating kernel pages which are used for data being
>>> accessed by the hypervisor when doing the hypercall later. This is a generic
>>> interface being used for all hypercalls, not only for
>>> privcmd_ioctl_mmap_batch().
>>>
>>>> The fact that /dev/xen/hypercall is specified by xen_privcmdbuf_dev is a
>>>> little confusing TBH.
>>>> Anyway the goal here is to provide a non-Xen aware userspace with a
>>>> standard userspace API to access the guest's memory. Perhaps messing
>>>
>>> This is what the Xen related libraries are meant for. Your decision to
>>> ignore those is backfiring now.
>> We didn't ignore them - the initial version of the xen-vhost-master
>> binary was built with Rust, linking to the Xen libraries. We are
>> however in the process of moving to more pure Rust (with the xen-sys
>> crate being a pure Rust ioctl/hypercall wrapper).
>
> Ah, okay, I wasn't aware of this.
>
>> However I was under the impression there were two classes of
>> hypercalls. ABI stable ones which won't change (which is all we are
>> planning to implement for xen-sys) and non-stable ABIs which would need
>> mediating by the Xen libs. We are hoping we can do all of VirtIO with
>> just the stable ABI.
>
> Okay.
>
>>
>>>> around with the semantics of the /dev/xen/[hypercall|privcmd] device
>>>> nodes is too confusing.
>>>> Maybe we could instead:
>>>> 1. Have the Xen aware VMM ask to make the guest's memory visible to
>>>>    the host kernel's address space.
>>>
>>> Urgh.
>>> This would be a major breach of the Xen security concept.
>>>
>>>> 2. When this is done explicitly create a device node to represent it
>>>>    (/dev/xen/dom-%d-mem?)
>>>> 3. Pass this new device to the non-Xen aware userspace which uses the
>>>>    standard mmap() call to make the kernel pages visible to userspace
>>>> Does that make sense?
>>>
>>> Maybe from your point of view, but not from the Xen architectural point
>>> of view IMHO. You are removing basically the main security advantages of
>>> Xen by generating a kernel interface for mapping arbitrary guest memory
>>> easily.
>> We are not talking about doing an end-run around the Xen
>> architecture.
>> The guest still has to instruct the hypervisor to grant access to its
>> memory. Currently this is a global thing (i.e. whole address space or
>> nothing) but obviously more fine grained grants can be done on a
>> transaction by transaction basis although we are exploring more
>> efficient mechanisms for this (shared pools and carve outs).
>
> Happy to hear that.
>
>> This does raise questions for the mmap interface though - each
>> individually granted region would need to be mapped into the dom0
>> userspace virtual address space or perhaps a new flag for mmap() so we
>> can map the whole address space but expect SIGBUS faults if we access
>> something that hasn't been granted.
>
> Do I understand that correctly? You want the guest to grant a memory
> region to the backend, and the backend should be able to map this region
> not using grants, but the guest physical addresses?

Yes - although it doesn't have to be the whole GPA range. The vhost-user
protocol communicates what offset into the GPA space the various memory
regions exist at.

>
>
> Juergen
>
> [2. OpenPGP public key --- application/pgp-keys; OpenPGP_0xB0DE9DD628BF132F.asc]...
>
> [[End of PGP Signed Part]]

-- 
Alex Bennée
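
To make the last point about memory regions a bit more concrete, here is a
rough sketch of how a backend might turn a guest physical address into an
offset within a mapped region. It is only an illustration: the MemRegion
type, its field names and the translate() helper are hypothetical, loosely
modelled on the idea of a memory region description (guest address, size,
offset into the backing mapping), and are not taken from rust-vmm, xen-sys
or the privcmd driver. Nothing is actually mapped here.

```rust
/// One guest memory region as announced to the backend.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MemRegion {
    /// Start of the region in guest physical address (GPA) space.
    pub guest_phys_addr: u64,
    /// Length of the region in bytes.
    pub memory_size: u64,
    /// Offset of the region within the backing mapping.
    pub mmap_offset: u64,
}

/// Find the region that fully contains [gpa, gpa + len) and return the
/// region index together with the offset into the backing mapping.
/// Returns None if no single region covers the access, which is where a
/// backend would refuse the request instead of touching unmapped memory.
pub fn translate(regions: &[MemRegion], gpa: u64, len: u64) -> Option<(usize, u64)> {
    regions.iter().enumerate().find_map(|(idx, r)| {
        // checked_add guards against wrap-around on bogus region or access sizes.
        let region_end = r.guest_phys_addr.checked_add(r.memory_size)?;
        let access_end = gpa.checked_add(len)?;
        if gpa >= r.guest_phys_addr && access_end <= region_end {
            r.mmap_offset
                .checked_add(gpa - r.guest_phys_addr)
                .map(|off| (idx, off))
        } else {
            None
        }
    })
}
```

With a table like this the backend never needs the whole GPA range to be
visible: an access that falls outside every announced region, or straddles
the end of one, simply fails the lookup.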