Message-ID: <37a9442e-f6e5-35f5-0d51-669d60936b5f@redhat.com>
Date: Wed, 7 Dec 2022 20:53:36 +0100
From: David Hildenbrand
Organization: Red Hat
To: Peter Xu
Cc: Hugh Dickins, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 Ives van Hoorne, Andrew Morton, Alistair Popple, Mike Rapoport,
 Nadav Amit, Andrea Arcangeli
Subject: Re: [PATCH RFC] mm/userfaultfd: enable writenotify while
 userfaultfd-wp is enabled for a VMA
References: <20221202122748.113774-1-david@redhat.com>
 <690afe0f-c9a0-9631-b365-d11d98fdf56f@redhat.com>
 <19800718-9cb6-9355-da1c-c7961b01e922@redhat.com>
 <92173bad-caa3-6b43-9d1e-9a471fdbc184@redhat.com>
 <22d8e8ac-d75-a66-2650-b4d59f89855e@google.com>

>>
>> On upstream during the next write fault, we'll end up in do_numa_page() and
>> simply remap the page writable due to vm_page_prot, not triggering a write
>> fault. I can see the "numa_hint_faults" counter in /proc/vmstat increasing
>> accordingly, so we're really in do_numa_page().
>
> Seems true.  I think fundamentally it's because numa hint rely on PROT_NONE
> as the hint, and it explicitly checks against mprotect(PROT_NONE) using the
> accessible check:
>
> 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> 		return do_numa_page(vmf);
>
> I'm not sure whether we should also add a pte_uffd_wp(vmf->orig_pte) here
> to mask out the uffd-wp cases.

:/ more special UFFD-wp casing, I'm not sure about that. Most
importantly, once someone unlocks NUMA hinting for shmem (e.g.,
MPOL_MF_LAZY, MPOL_F_NUMA_BALANCING) this might be problematic.

That at least makes it sound fragile to me.

>
> So far it seems the outcome is not extremely bad - PROT_WRITE only mappings
> are rare in real life, and also with the protnone recovery code (and along
> with the vm_page_prot patch coming) we'll be able to still recover the pte
> into a uffd-wp-ed pte without PROTNONE bit set.  But I don't have a solid
> clue yet on what's the best.
>

Yes, just another way to trigger surprising uffd-wp behavior (at least
surprising for me ;) ). But this time, not involving mprotect(). I
suspect there are more cases, but I might be wrong. I was primarily
trying to find out which other cases might be affected.

[..]

>>
>>
>> Independent of uffd-wp on shmem, we seem to be missing propagating the
>> uffd-wp bit when splitting the huge zeropage.
>> So uffd-wp'ing the huge zeropage and then splitting it loses the uffd-wp
>> markers. :/
>
> For this one, thanks for the reproducer.  I'm not extremely sure whether
> it's a bug.
>
> Firstly, I think your reproducer should just work well with shmem, afaiu,
> because shmem is based on pte markers and it should only work on pte level
> (not pmd).  The huge zero pmd should got split right after wr-protected.
> So the reproducer shouldn't go wrong on shmem at all with/without any
> recent fix.  Let me know otherwise.

shmem doesn't use the huge shared zeropage, so it should be fine.

>
> For anon, I'm not sure it's a bug, because there's a semantic difference on
> anon/shmem.  The thing is losing wr-protect on the zero page is the same as
> losing wr-protect on a page that is not mapped.  For anon currently we
> can't track a page that is not mapped and we skip those ranges (being zero
> when read).  So fundamentally I am not sure whether it'll be an issue for
> existing anon uffd-wp users because if it is then it's more than zero
> pages.

I think it's a bug, although most probably a low-priority one. Once user
space has successfully placed a uffd-wp marker and, e.g., verified using
pagemap that it is indeed placed, the system should not silently drop it.

The behavior of an ordinary THP and a huge zeropage differs. For
THP, we handle the split correctly and don't lose the marker.

If the huge zeropage were disabled, the behavior would be (IMHO)
correct, and the test case would pass.

For example, QEMU with uffd-wp based snapshotting will make sure that all
virtual addresses are populated (e.g., mapping the shared, eventually the
huge zeropage -- populate_read_range()), before protecting using
uffd-wp. Losing a uffd-wp marker would be problematic. The good news is
that we will rarely end up PTE-mapping the huge zeropage unless there is
real user-space interaction (mprotect(), mremap(), mmap()), so this
shouldn't trigger in the QEMU use-case.

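As an aside on the "verified using pagemap" step above: a minimal sketch of
how user space might decode pagemap entries to check for the marker. It
assumes the documented /proc/<pid>/pagemap layout (one 64-bit entry per
virtual page, bit 57 reporting uffd-wp, bit 63 reporting "present"). The
helper names are made up for illustration and are not taken from the
reproducer or from any kernel selftest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions in a 64-bit pagemap entry (documented pagemap ABI). */
#define PM_UFFD_WP	(UINT64_C(1) << 57)	/* pte is uffd-wp write-protected */
#define PM_PRESENT	(UINT64_C(1) << 63)	/* page is present in RAM */

/* Hypothetical helper: byte offset of the pagemap entry covering vaddr. */
static inline uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	assert(page_size != 0);
	return (vaddr / page_size) * sizeof(uint64_t);
}

/* Hypothetical helper: does this entry report the uffd-wp bit? */
static inline bool pagemap_entry_uffd_wp(uint64_t entry)
{
	return (entry & PM_UFFD_WP) != 0;
}

/*
 * Hypothetical helper: count the entries in a previously read range that
 * do not report uffd-wp, i.e., pages whose marker is missing.
 */
static inline size_t count_missing_uffd_wp(const uint64_t *entries, size_t nr)
{
	size_t i, missing = 0;

	for (i = 0; i < nr; i++)
		if (!pagemap_entry_uffd_wp(entries[i]))
			missing++;
	return missing;
}
```

A test could pread() the entries for the protected range from
/proc/self/pagemap starting at pagemap_offset(), once right after
write-protecting and once after forcing the split, and compare the
count_missing_uffd_wp() results.
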
Anyhow, I'll send a patch in a couple of days and we can discuss
further. It's independent of the other discussion; I just wanted to
report my findings after staring at that code for way too long today.

-- 
Thanks,

David / dhildenb