Subject: Re: [RFC 00/16] KVM protected memory extension
From: Liran Alon <liran.alon@oracle.com>
To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
 Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
 Jim Mattson, Joerg Roedel
Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
 "Edgecombe, Rick P", "Kleen, Andi", x86@kernel.org, kvm@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Kirill A. Shutemov
Date: Mon, 25 May 2020 16:47:18 +0300
Message-ID: <42685c32-a7a9-b971-0cf4-e8af8d9a40c6@oracle.com>
In-Reply-To: <20200522125214.31348-1-kirill.shutemov@linux.intel.com>
References: <20200522125214.31348-1-kirill.shutemov@linux.intel.com>

On 22/05/2020 15:51, Kirill A.
Shutemov wrote:
> == Background / Problem ==
>
> There are a number of hardware features (MKTME, SEV) which protect guest
> memory from some unauthorized host access. The patchset proposes a purely
> software feature that mitigates some of the same host-side read-only
> attacks.
>
>
> == What does this set mitigate? ==
>
>   - Host kernel "accidental" access to guest data (think speculation)

Just to clarify: this covers any host kernel memory info-leak
vulnerability, not just speculative-execution info-leaks but
architectural ones as well.

In addition, note that removing guest data from the host kernel VA space
also makes guest<->host memory exploits more difficult. E.g. the guest
cannot use an already-available memory buffer in the kernel VA space for
ROP, or place valuable guest-controlled code/data there in general.

>
>   - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>
>   - Host userspace access to guest data (compromised qemu)

I don't quite understand the benefit of preventing userspace VMM access
to guest data while the host kernel can still access it.

QEMU is more easily compromised than the host kernel because its
guest<->host attack surface is larger (e.g. the various device emulation
code). But such a compromise comes from the guest itself, not from other
guests, in contrast to the host kernel attack surface, where an
info-leak can be exploited from one guest to leak another guest's data.

>
> == What does this set NOT mitigate? ==
>
>   - Full host kernel compromise. Kernel will just map the pages again.
>
>   - Hardware attacks
>
>
> The patchset is RFC-quality: it works but has known issues that must be
> addressed before it can be considered for applying.
>
> We are looking for high-level feedback on the concept. Some open
> questions:
>
>   - This protects from some kernel and host userspace read-only attacks,
>     but does not place the host kernel outside the trust boundary. Is it
>     still valuable?

I don't currently see a good argument for preventing host userspace
access to guest data while the host kernel can still access it.
But there is definitely a strong benefit in mitigating kernel info-leaks
exploitable from one guest to leak another guest's data.

>
>   - Can this approach be used to avoid cache-coherency problems with
>     hardware encryption schemes that repurpose physical bits?
>
>   - The guest kernel must be modified for this to work. Is that a deal
>     breaker, especially for public clouds?
>
>   - Are the costs of removing pages from the direct map too high to be
>     feasible?

If I remember correctly, this perf cost was deemed too high when the
XPFO (eXclusive Page Frame Ownership) patch-series was considered.
It created two major perf costs:
1) Removing pages from the direct-map prevents the direct-map from
simply being mapped entirely as 1GB huge-pages.
2) Frequent allocation/free of userspace pages results in frequent TLB
invalidations.

Having said that, (1) can be mitigated if guest data is completely
allocated from 1GB hugetlbfs, to guarantee it will not create smaller
holes in the direct-map. And (2) is not relevant for the QEMU/KVM
use-case.

This makes me wonder:
The XPFO patch-series, applied to the context of QEMU/KVM, seems to
provide exactly the functionality of this patch-series, with the
exception of the additional "feature" of preventing guest data from also
being accessible to the host userspace VMM. I.e. XPFO will unmap guest
pages from the host kernel direct-map while still keeping them mapped in
the host userspace VMM page-tables.

If I understand correctly, this "feature" is what brings most of the
extra complexity of this patch-series compared to XPFO. It requires
guest modification to explicitly specify to the host which pages can be
accessed by the userspace VMM, it requires changes to add the new
VM_KVM_PROTECTED VMA flag & FOLL_KVM for GUP, and it creates issues with
Live-Migration support.
So if there is no strong convincing argument for the motivation to
prevent userspace VMM access to guest data *while the host kernel can
still access guest data*, I don't see a good reason for using this
approach.

Furthermore, I would like to point out that just unmapping guest data
from the kernel direct-map is not sufficient to prevent all
guest-to-guest info-leaks via a kernel memory info-leak vulnerability.
This is because the host kernel VA space has other regions which contain
guest sensitive data. For example, the KVM per-vCPU struct (which holds
vCPU state) is allocated on the slab and is therefore still leakable.

I recommend you have a look at my (and Alexandre Chartre's) KVM Forum
2019 talk on KVM ASI, which provides extensive background on the various
attempts done by the community to mitigate host kernel memory info-leaks
exploitable by a guest to leak other guests' data:
https://static.sched.com/hosted_files/kvmforum2019/34/KVM%20Forum%202019%20KVM%20ASI.pdf

>
> == Series Overview ==
>
> The hardware features protect guest data by encrypting it and then
> ensuring that only the right guest can decrypt it. This has the
> side-effect of making the kernel direct map and userspace mapping
> (QEMU et al) useless. But, this teaches us something very useful:
> neither the kernel nor userspace mappings are really necessary for normal
> guest operations.
>
> Instead of using encryption, this series simply unmaps the memory. One
> advantage compared to allowing access to ciphertext is that it allows bad
> accesses to be caught instead of simply reading garbage.
>
> Protection from physical attacks needs to be provided by some other means.
> On Intel platforms, (single-key) Total Memory Encryption (TME) provides
> mitigation against physical attacks, such as DIMM interposers sniffing
> memory bus traffic.
>
> The patchset modifies both host and guest kernel. The guest OS must enable
> the feature via hypercall and mark any memory range that has to be shared
> with the host: DMA regions, bounce buffers, etc. SEV does this marking via a
> bit in the guest's page table while this approach uses a hypercall.
>
> For removing the userspace mapping, use a trick similar to what NUMA
> balancing does: convert memory that belongs to KVM memory slots to
> PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> VMA must be treated in a special way in the GUP and fault paths. The flag
> allows GUP to return the page even though it is mapped with PROT_NONE, but
> only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> would result in -EFAULT.
>
> Any anonymous page faulted into the VM_KVM_PROTECTED VMA gets removed from
> the direct mapping with kernel_map_pages(). Note that kernel_map_pages() only
> flushes local TLB. I think it's a reasonable compromise between security and
> performance.
>
> Zapping the PTE would bring the page back to the direct mapping after clearing.
> At least for now, we don't remove file-backed pages from the direct mapping.
> File-backed pages could be accessed via read/write syscalls. It adds
> complexity.
>
> Occasionally, the host kernel has to access guest memory that was not made
> shared by the guest. For instance, it happens for instruction emulation.
> Normally, it's done via copy_to/from_user() which would fail with -EFAULT
> now. We introduced a new pair of helpers: copy_to/from_guest(). The new
> helpers acquire the page via GUP, map it into kernel address space with a
> kmap_atomic()-style mechanism and only then copy the data.
>
> For some instruction emulation copying is not good enough: cmpxchg
> emulation has to have direct access to the guest memory. __kvm_map_gfn()
> is modified to accommodate the case.
>
> The patchset is on top of v5.7-rc6 plus this patch:
>
> https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@gmail.com
>
> == Open Issues ==
>
> Unmapping the pages from the direct mapping brings a few issues that
> have not been rectified yet:
>
>   - Touching direct mapping leads to fragmentation. We need to be able to
>     recover from it. I have a buggy patch that aims at recovering 2M/1G page.
>     It has to be fixed and tested properly

As I've mentioned above, not mapping all guest memory from 1GB hugetlbfs
will lead to holes in the kernel direct-map, which force it to no longer
be mapped as a series of 1GB huge-pages. This has a non-trivial
performance cost. Thus, I am not sure addressing this use-case is
valuable.

>
>   - Page migration and KSM are not supported yet.
>
>   - Live migration of a guest would require a new flow. Not sure yet how
>     it would look.

Note that the Live-Migration issue is a result of not making guest data
accessible to the host userspace VMM.

-Liran

>
>   - The feature interferes with NUMA balancing. Not sure yet if it's
>     possible to make them work together.
>
>   - Guests have no mechanism to ensure that even a well-behaving host has
>     unmapped its private data. With SEV, for instance, the guest only has
>     to trust the hardware to encrypt a page after the C bit is set in a
>     guest PTE. A mechanism for a guest to query the host mapping state, or
>     to constantly assert the intent for a page to be Private would be
>     valuable.