Message-ID: <4d224020-f26f-60a4-c7ab-721a024c7a6d@redhat.com>
Date: Wed, 25 Jan 2023 10:27:04 +0100
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
Subject: Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
To: "Edgecombe, Rick P", "bsingharora@gmail.com", "hpa@zytor.com",
	"Syromiatnikov, Eugene", "peterz@infradead.org", "rdunlap@infradead.org",
	"keescook@chromium.org", "Eranian, Stephane",
	"kirill.shutemov@linux.intel.com", "dave.hansen@linux.intel.com",
	"linux-mm@kvack.org", "fweimer@redhat.com", "nadav.amit@gmail.com",
	"jannh@google.com", "dethoma@microsoft.com", "kcc@google.com",
	"linux-arch@vger.kernel.org", "pavel@ucw.cz", "oleg@redhat.com",
	"hjl.tools@gmail.com", "Yang, Weijiang", "Lutomirski, Andy",
	"bp@alien8.de", "jamorris@linux.microsoft.com",
	"linux-doc@vger.kernel.org", "Schimpe, Christina",
	"mike.kravetz@oracle.com", "x86@kernel.org", "akpm@linux-foundation.org",
	"tglx@linutronix.de", "arnd@arndb.de", "john.allen@amd.com",
	"rppt@kernel.org", "andrew.cooper3@citrix.com", "mingo@redhat.com",
	"corbet@lwn.net", "linux-kernel@vger.kernel.org",
	"linux-api@vger.kernel.org", "gorcunov@gmail.com"
Cc: "Yu, Yu-cheng"
In-Reply-To: <8c3820ae1448de4baffe7c476b4b5d9ba0a309ff.camel@intel.com>
References: <20230119212317.8324-1-rick.p.edgecombe@intel.com>
	<20230119212317.8324-19-rick.p.edgecombe@intel.com>
	<7f63d13d-7940-afb6-8b25-26fdf3804e00@redhat.com>
	<50cf64932507ba60639eca28692e7df285bcc0a7.camel@intel.com>
	<1327c608-1473-af4f-d962-c24f04f3952c@redhat.com>
	<8c3820ae1448de4baffe7c476b4b5d9ba0a309ff.camel@intel.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
On 24.01.23 19:14, Edgecombe, Rick P wrote:
> On Tue, 2023-01-24 at 17:24 +0100, David Hildenbrand wrote:
>> On 23.01.23 21:47, Edgecombe, Rick P wrote:
>>> On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote:
>>>> On 19.01.23 22:22, Rick Edgecombe wrote:
>>>>> The x86 Control-flow Enforcement Technology (CET) feature includes
>>>>> a new type of memory called shadow stack. This shadow stack memory
>>>>> has some unusual properties, which require some core mm changes to
>>>>> function properly.
>>>>>
>>>>> Since shadow stack memory can be changed from userspace, it is both
>>>>> VM_SHADOW_STACK and VM_WRITE. But it should not be made
>>>>> conventionally writable (i.e. pte_mkwrite()). So some code that
>>>>> calls pte_mkwrite() needs to be adjusted.
>>>>>
>>>>> One such case is when memory is made writable without an actual
>>>>> write fault. This happens in some mprotect operations, and also
>>>>> prot_numa faults. In both cases code checks whether it should be
>>>>> made (conventionally) writable by calling
>>>>> vma_wants_manual_pte_write_upgrade().
>>>>>
>>>>> One way to fix this would be to have code actually check if memory
>>>>> is also VM_SHADOW_STACK and in that case call pte_mkwrite_shstk().
>>>>> But since most memory won't be shadow stack, just have simpler
>>>>> logic and skip this optimization by changing
>>>>> vma_wants_manual_pte_write_upgrade() to not return true for
>>>>> VM_SHADOW_STACK_MEMORY. This will simply handle all cases of this
>>>>> type.
>>>>>
>>>>> Cc: David Hildenbrand
>>>>> Tested-by: Pengfei Xu
>>>>> Tested-by: John Allen
>>>>> Signed-off-by: Yu-cheng Yu
>>>>> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>>> Signed-off-by: Rick Edgecombe
>>>>> ---
>>>>
>>>> Instead of having these x86 shadow stack details all over the MM
>>>> space, was the option explored to handle this more in arch-specific
>>>> code?
>>>>
>>>> IIUC, one way to get it working would be
>>>>
>>>> 1) Have a SW "shadowstack" PTE flag.
>>>> 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when "write=0".
>>>
>>> I don't think that idea came up. So vma->vm_page_prot would have the
>>> SW shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do
>>> the Write=0,Dirty=1 part. It seems like it should work.
>>
>> Right, if we include it in vma->vm_page_prot, we'd immediately let
>> mk_pte() just handle that.
>>
>> Otherwise, we'd have to refactor e.g., mk_pte() to consume a vma
>> instead of the vma->vm_page_prot. Let's see if we can avoid that for
>> now.
>>
>>>> pte_mkwrite(), pte_write(), pte_dirty() ... can then make decisions
>>>> based on the "shadowstack" PTE flag and hide all these details from
>>>> core-mm.
>>>>
>>>> When mapping a shadowstack page (new page, migration, swapin, ...),
>>>> which can be obtained by looking at the VMA flags, the first thing
>>>> you'd do is set the "shadowstack" PTE flag.
>>>
>>> I guess the downside is that it uses an extra software bit. But the
>>> other positive is that it's less error prone, so that someone
>>> writing core-mm code won't introduce a change that makes shadow
>>> stack VMAs Write=1 if they don't know to also check for
>>> VM_SHADOW_STACK.
>>
>> Right. And I think this mimics what I would have expected HW to
>> provide: a dedicated HW bit, not somehow mangling this into the
>> semantics of existing bits.
>
> Yea.
>
>> Roughly speaking: if we abstract it that way and get all of the "how
>> to set it writable now?" out of core-MM, it not only is cleaner and
>> less error prone, it might even allow other architectures that
>> implement something comparable (e.g., using a dedicated HW bit) to
>> actually reuse some of that work. Otherwise most of that "shstk" is
>> really just x86 specific ...
>>
>> I guess the only cases we have to special-case would be page pinning
>> code where pte_write() would indicate that the PTE is writable (well,
>> it is, just not by "ordinary CPU instruction" context directly): but
>> you do that already, so ... :)
>>
>> Sorry for stumbling over that this late, I only started looking into
>> this when you CCed me on that one patch.
>
> Sorry for not calling more attention to it earlier. Appreciate your
> comments.
>
> Previous versions of this series had changed some of these
> pte_mkwrite() calls to maybe_mkwrite(), which of course takes a vma.
> This way an x86 implementation could use the VM_SHADOW_STACK vma flag
> to decide between pte_mkwrite() and pte_mkwrite_shstk(). The feedback
> was that in some of these code paths "maybe" isn't really an option,
> it *needs* to make it writable. Even though the logic was the same,
> the name of the function made it look wrong.
>
> But another option could be to change pte_mkwrite() to take a vma.
> This would save using another software bit on x86, but instead
> requires a small change to each arch's pte_mkwrite().

I played with that idea shortly as well, but discarded it. I was not
able to convince myself that it wouldn't be required to pass in the VMA
as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ...
which would end up fairly ugly (or even impossible in things like
GUP-fast).

For example, I wonder how we'd handle stuff like do_numa_page() cleanly
and correctly, where we use pte_modify() + pte_mkwrite(), and either
call might set the PTE writable and maintain the dirty bit ...
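To make Rick's "pte_mkwrite() takes a vma" option concrete, here is a
rough userspace model -- explicitly not kernel code: pte_t, the bit
positions, and the VM_SHADOW_STACK value are all made up for
illustration; only the Write=0,Dirty=1 shadow-stack encoding follows
the discussion above.

```c
#include <stdint.h>

/* Toy model of the "pass the VMA to pte_mkwrite()" idea -- not the
 * real kernel implementation; bit positions and flag values are
 * illustrative only. */

#define PTE_WRITE        (1ULL << 1)   /* hardware Write bit */
#define PTE_DIRTY        (1ULL << 6)   /* hardware Dirty bit */

#define VM_WRITE         0x00000002UL
#define VM_SHADOW_STACK  0x00800000UL  /* hypothetical flag value */

typedef struct { uint64_t val; } pte_t;
struct vm_area_struct { unsigned long vm_flags; };

/* Conventionally writable: Write=1. */
pte_t pte_mkwrite_novma(pte_t pte)
{
	pte.val |= PTE_WRITE;
	return pte;
}

/* Shadow-stack "writable" on x86 CET: Write=0, Dirty=1. */
pte_t pte_mkwrite_shstk(pte_t pte)
{
	pte.val &= ~PTE_WRITE;
	pte.val |= PTE_DIRTY;
	return pte;
}

/* The vma-aware variant: the arch picks the right encoding, so
 * core-mm callers no longer need to know about VM_SHADOW_STACK. */
pte_t pte_mkwrite(pte_t pte, const struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_SHADOW_STACK)
		return pte_mkwrite_shstk(pte);
	return pte_mkwrite_novma(pte);
}
```

As suggested further down, an x86 implementation of the vma-aware
variant could additionally WARN if the vma lacks VM_WRITE.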
Having said that, maybe it could work with only a single saved-dirty
bit and passing in the VMA for pte_mkwrite() only:

* pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty
  bit to the saved-dirty bit instead, resulting in
  "writable=0,dirty=0,saved-dirty=1".
* pte_dirty() would return dirty==1 || saved-dirty==1.
* pte_mkdirty() would either set dirty=1 or saved-dirty=1, depending on
  the writable bit.
* pte_mkclean() would clean both bits.
* pte_write() would detect "writable==1 || (writable==0 && dirty==1)".
* pte_mkwrite() would act according to the VMA, and in addition, merge
  the saved-dirty bit into the dirty bit.
* pte_modify() and mk_pte() ... would require more thought ...

Further, ptep_modify_prot_commit() might have to be adjusted to
properly flush in all relevant cases IIRC.

> x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(),
> but maybe it could additionally warn if the vma is not writable. It
> also seems more aligned with your changes to stop taking hints from
> PTE bits and just look at the VMA? (I'm thinking about the dropping
> of the dirty check in GUP and dropping pte_saved_write())

The soft-shstk bit wouldn't be a hint, it would be logically changing
the "type" of the PTE such that any other PTE functions can do the
right thing without having to consume the VMA.

-- 
Thanks,

David / dhildenb
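P.S.: a rough userspace model of the saved-dirty scheme sketched above,
in case it helps. Everything here is illustrative -- in particular the
PTE_SAVED_DIRTY bit position and the VM_SHADOW_STACK value are
invented; this is not the kernel implementation.

```c
#include <stdint.h>

/* Toy model of the single saved-dirty software bit; bit positions and
 * flag values are invented for this sketch. */

#define PTE_WRITE        (1ULL << 1)
#define PTE_DIRTY        (1ULL << 6)
#define PTE_SAVED_DIRTY  (1ULL << 58)  /* hypothetical software bit */

#define VM_SHADOW_STACK  0x00800000UL  /* hypothetical flag value */

typedef struct { uint64_t val; } pte_t;
struct vm_area_struct { unsigned long vm_flags; };

/* dirty==1 || saved-dirty==1 */
int pte_dirty(pte_t pte)
{
	return !!(pte.val & (PTE_DIRTY | PTE_SAVED_DIRTY));
}

/* writable==1 || (writable==0 && dirty==1): shadow-stack PTEs also
 * count as writable. */
int pte_write(pte_t pte)
{
	return !!(pte.val & (PTE_WRITE | PTE_DIRTY));
}

/* Write=0,Dirty=1 would look like shadow stack to the HW, so park the
 * dirty bit in saved-dirty when write-protecting. */
pte_t pte_wrprotect(pte_t pte)
{
	pte.val &= ~PTE_WRITE;
	if (pte.val & PTE_DIRTY) {
		pte.val &= ~PTE_DIRTY;
		pte.val |= PTE_SAVED_DIRTY;
	}
	return pte;
}

/* Set dirty=1 or saved-dirty=1 depending on the writable bit. */
pte_t pte_mkdirty(pte_t pte)
{
	pte.val |= (pte.val & PTE_WRITE) ? PTE_DIRTY : PTE_SAVED_DIRTY;
	return pte;
}

pte_t pte_mkclean(pte_t pte)
{
	pte.val &= ~(PTE_DIRTY | PTE_SAVED_DIRTY);
	return pte;
}

/* Act according to the VMA; in addition, merge saved-dirty back into
 * the dirty bit. */
pte_t pte_mkwrite(pte_t pte, const struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_SHADOW_STACK) {
		pte.val &= ~PTE_WRITE;      /* shstk: Write=0,Dirty=1 */
		pte.val |= PTE_DIRTY;
	} else {
		pte.val |= PTE_WRITE;
	}
	if (pte.val & PTE_SAVED_DIRTY) {
		pte.val &= ~PTE_SAVED_DIRTY;
		pte.val |= PTE_DIRTY;
	}
	return pte;
}
```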