From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B8A88C4321E for ; Fri, 2 Dec 2022 16:56:34 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C3A3F6B0071; Fri, 2 Dec 2022 11:56:33 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id BEA806B0073; Fri, 2 Dec 2022 11:56:33 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A65606B0074; Fri, 2 Dec 2022 11:56:33 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 9404D6B0071 for ; Fri, 2 Dec 2022 11:56:33 -0500 (EST) Received: from smtpin18.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 6C244A0A74 for ; Fri, 2 Dec 2022 16:56:33 +0000 (UTC) X-FDA: 80197969866.18.A60401F Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf08.hostedemail.com (Postfix) with ESMTP id E15A8160013 for ; Fri, 2 Dec 2022 16:56:32 +0000 (UTC) Authentication-Results: imf08.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=bCw0BQeD; spf=pass (imf08.hostedemail.com: domain of david@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=david@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1670000193; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=b4eKbFg7g83Utc9laDTG1bfOmc5hP5ybhCQkldsbAH0=; b=IUxcOz8YdUbbDrVRyCiCeHxGok7gSMIkMcIjcNoua1bA7Eb5SOI/4Y6lHuqAGKVCOH3GJY bUAEqqFPBzFLjfs6IXuyRmBHafXjxRN+yHLGpXbuRu8nI2IXMAgfZRA9oTvlbKkaH36Nvc X7XGeq/Tjs4wDBGYHqSgwCRb10TTR5E= ARC-Authentication-Results: i=1; imf08.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=bCw0BQeD; spf=pass (imf08.hostedemail.com: domain of david@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=david@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1670000193; a=rsa-sha256; cv=none; b=QyANUpvZqTEffx/KWSPNujhStKZIU5ZJFaAfwNpKvWEiqA6U6YQaxcgrm8lUpwWk24t2yP HV/9gu3QfWmpB3YYGFN8KgPjxguSDTucXa0E0xgv2d8YXb6u0PxEcsSkakJmrYg7Au3DoA utuWBdwn+5p2Yb93MYvtY0Cm556riUs= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1670000192; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=b4eKbFg7g83Utc9laDTG1bfOmc5hP5ybhCQkldsbAH0=; b=bCw0BQeD60rFD7Nz14BJotNzPVhgMSmRzInIbVMPUf2BG5pnnO+zj31OAYvv104Ar9Dkm4 bnkgVHqJzodWVrZWVcomz5pkJaIIIyV4N/4wRTVMeJIXuO1ED+cmP4bsERKLLGLbzHyY5e N9zkw6keyanH4iB2cnIKpMBkSLCrqHQ= Received: from mail-wm1-f70.google.com (mail-wm1-f70.google.com [209.85.128.70]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-318-aG87dQd3MX-pAKkA41UFpw-1; Fri, 02 Dec 2022 11:56:26 -0500 X-MC-Unique: aG87dQd3MX-pAKkA41UFpw-1 Received: by mail-wm1-f70.google.com with SMTP id i8-20020a1c3b08000000b003d0683389daso4378379wma.6 for ; Fri, 02 Dec 2022 08:56:26 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:subject:organization:from :references:cc:to:content-language:user-agent:mime-version:date :message-id:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=b4eKbFg7g83Utc9laDTG1bfOmc5hP5ybhCQkldsbAH0=; b=CkHGvGWBQmH7D2wv5GrqCvT//vtjooGIrfjHuuJjVDas848kJ+WT2o195ekQipgDSI EBjmBCHrkwYMrhY3ZcV/LKHkk/QOcU3iYCa4V3UiMeg7Z1XXmAKvw0RLmrhFxNHYzIHg dNf38uo905GDtCPUvwNGqPAwBeQ+4yhet29Pabthfk8U/pFCAnp5WTkyPP+J01VScs1Z n8xkP1+Bz2241SmpbXbnAR3mFyFoa982jiEgjxjoDV/I9EmAFHpJJJGXSfa8q+Aeqlxv EoUES5XweE2q9xhaJ/0zXQ50BbKjguUsg5VsC7dvm4K/TfVHEzrw2OUOQXnFL3oEax6D tNFw== X-Gm-Message-State: ANoB5pnICtDxhdLGUTPScuyUarOovwe5b/fJ5Fo8GAVTryZRMFMav88v 6yAP8ApkRUChBhR2xdNcO4R8r4Tvd1xSaDpP9QZAZ6dncFe49hu5u9B6YW7ofxwlEPBe48F5EAA aCmx1/Hn1dTQ= X-Received: by 2002:adf:a117:0:b0:242:d27:c359 with SMTP id o23-20020adfa117000000b002420d27c359mr18435640wro.152.1670000185305; Fri, 02 Dec 2022 08:56:25 -0800 (PST) X-Google-Smtp-Source: AA0mqf7K8uiiQgOvY60dYoLLbXPrh/iZzkx1xZ9Vi6Ebb/jodwg3ulKWAaHj6tRrBwlKdxWfBo7N6g== X-Received: by 2002:adf:a117:0:b0:242:d27:c359 with SMTP id o23-20020adfa117000000b002420d27c359mr18435628wro.152.1670000184965; Fri, 02 Dec 2022 08:56:24 -0800 (PST) Received: from ?IPV6:2003:cb:c703:7a00:852e:72cd:ed76:d72f? (p200300cbc7037a00852e72cded76d72f.dip0.t-ipconnect.de. [2003:cb:c703:7a00:852e:72cd:ed76:d72f]) by smtp.gmail.com with ESMTPSA id u17-20020a05600c19d100b003c6f8d30e40sm14189295wmq.31.2022.12.02.08.56.23 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 02 Dec 2022 08:56:24 -0800 (PST) Message-ID: <690afe0f-c9a0-9631-b365-d11d98fdf56f@redhat.com> Date: Fri, 2 Dec 2022 17:56:23 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.5.0 To: Peter Xu Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Ives van Hoorne , stable@vger.kernel.org, Andrew Morton , Hugh Dickins , Alistair Popple , Mike Rapoport , Nadav Amit , Andrea Arcangeli References: <20221202122748.113774-1-david@redhat.com> From: David Hildenbrand Organization: Red Hat Subject: Re: [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spamd-Result: default: False [-5.40 / 9.00]; BAYES_HAM(-6.00)[100.00%]; SUSPICIOUS_RECIPS(1.50)[]; DMARC_POLICY_ALLOW(-0.50)[redhat.com,none]; R_DKIM_ALLOW(-0.20)[redhat.com:s=mimecast20190719]; R_SPF_ALLOW(-0.20)[+ip4:170.10.133.0/24]; RCVD_NO_TLS_LAST(0.10)[]; MIME_GOOD(-0.10)[text/plain]; TAGGED_RCPT(0.00)[]; RCPT_COUNT_SEVEN(0.00)[11]; FROM_HAS_DN(0.00)[]; ARC_NA(0.00)[]; MIME_TRACE(0.00)[0:+]; FROM_EQ_ENVFROM(0.00)[]; TO_DN_SOME(0.00)[]; DKIM_TRACE(0.00)[redhat.com:+]; RCVD_COUNT_THREE(0.00)[4]; PREVIOUSLY_DELIVERED(0.00)[linux-mm@kvack.org]; TO_MATCH_ENVRCPT_SOME(0.00)[]; ARC_SIGNED(0.00)[hostedemail.com:s=arc-20220608:i=1]; HAS_ORG_HEADER(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[] X-Rspam-User: X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: E15A8160013 X-Stat-Signature: xbek5o4odyfqja3bkrq3xyty6kdaprtr X-HE-Tag: 1670000192-720297 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 02.12.22 17:33, Peter Xu wrote: > On Fri, Dec 02, 2022 at 01:27:48PM +0100, David Hildenbrand wrote: >> Currently, we don't enable writenotify when enabling userfaultfd-wp on >> a shared writable mapping (for now we only support SHMEM). The consequence > > and hugetlbfs > >> is that vma->vm_page_prot will still include write permissions, to be set >> as default for all PTEs that get remapped (e.g., mprotect(), NUMA hinting, >> page migration, ...). > > The thing is by default I think we want the write bit.. > > The simple example is (1) register UFFD_WP on shmem writable, (2) write a > page. Here we didn't wr-protect anything, so we want the write bit there. > > Or the other example is when UFFDIO_COPY with flags==0 even if with > VM_UFFD_WP. We definitely wants the write bit. > > We only doesn't want the write bit when uffd-wp is explicitly set. > > I think fundamentally the core is uffd-wp is pte-based, so the information > resides in pte not vma. I'm not strongly objecting this patch, especially > you mentioned auto-numa so I need to have a closer look later there. > However I do think uffd-wp is slightly special because we always need to > consider pte information anyway, so a per-vma information doesn't hugely > help, IMHO. That's the same as softdirty tracking, IMHO. [...] >> Running the mprotect() reproducer [1] without this commit: >> $ ./uffd-wp-mprotect >> FAIL: uffd-wp did not fire >> Running the mprotect() reproducer with this commit: >> $ ./uffd-wp-mprotect >> PASS: uffd-wp fired >> >> [1] https://lore.kernel.org/all/222fc0b2-6ec0-98e7-833f-ea868b248446@redhat.com/T/#u > > I still hope for a formal patch (non-rfc) we can have a reproducer outside > mprotect(). IMHO mprotect() is really ambiguously here being used with > uffd-wp, so not a good example IMO as I explained in the other thread [1]. I took the low hanging fruit to showcase that this is a more generic problem. The reproducer is IMHO nice because it's simple and race-free. > > I'll need to off-work most of the rest of today, but maybe I can also have > a look in the weekend or Monday more on the numa paths. Before that, can > we first reach a consensus that we have the mm/migrate patch there to be > merged first? These are two issues, IMHO. > > I know you're against me for some reason, but until now I sincerely don't > know why. That patch sololy recovers write bit status (by removing it for > read-only) for a migration entry and that definitely makes sense to me. As > I also mentioned in the old version of that thread, we can rework migration > entries and merge READ|WRITE entries into a GENERIC entry one day if you > think proper, but that's for later. I'm not against you. I'm against changing well-working, common code when it doesn't make any sense to me to change it. And now we have proof that mprotect() just behaves exactly the same way, using the basic rules of vma->vm_page_prot. Yes, there is broken sparc64 (below), but that shouldn't dictate our implementation. What *would* make sense to me, as I raised, is: diff --git a/mm/migrate.c b/mm/migrate.c index dff333593a8a..9fc181fd3c5a 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -213,8 +213,10 @@ static bool remove_migration_pte(struct folio *folio, pte = pte_mkdirty(pte); if (is_writable_migration_entry(entry)) pte = maybe_mkwrite(pte, vma); - else if (pte_swp_uffd_wp(*pvmw.pte)) + else if (pte_swp_uffd_wp(*pvmw.pte)) { pte = pte_mkuffd_wp(pte); + pt = pte_wrprotect(pte); + } if (folio_test_anon(folio) && !is_readable_migration_entry(entry)) rmap_flags |= RMAP_EXCLUSIVE; It still requires patch each and every possible code location, which I dislike as described in the patch description. The fact that there are still uffd-wp bugs with your patch makes that hopefully clear. I'd be interested if they can be reproduced witht his patch. > > Let me provide another example why I think recovering write bit may matter > outside uffd too so it's common and it's always good to explicit check it. > > If you still remember for sparc64 (I think you're also in the loop) > pte_mkdirty will also apply write bit there; even though I don't know why > but it works like that for years. Consider below sequence: Yes, and I consider that having to be fixed properly on in sparc64 code. pte_mkdirty() must not allow write access. That's why we have pte_mkwrite() after all. It's just plain wrong and requires custom hacks all over the place. As raised, the whole maybe_mkwrite() logic is completely broken on sparc64. > > - map a page writable, write to page (fault in, set dirty) > - mprotect(RO) on the page (keep dirty bit, vma/pte becomes RO) > - migrate the page > - mk_pte returns with WRITE bit cleared (which is correct) > - pte_mkdirty set dirty and write bit (because dirty used to set) > > If without the previous mm/migrate patch [1] IIUC it'll allow the pte > writable even without VM_WRITE here after migration. That would be my reaction: diff --git a/mm/migrate.c b/mm/migrate.c index dff333593a8a..05183cd22f0f 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -209,12 +209,20 @@ static bool remove_migration_pte(struct folio *folio, entry = pte_to_swp_entry(*pvmw.pte); if (!is_migration_entry_young(entry)) pte = pte_mkold(pte); + /* + * HACK alert. sparc64 really has to fix it's pte_mkwrite() + * implementation to not allow random write access. + */ +#ifndef CONFIG_SPARC64 if (folio_test_dirty(folio) && is_migration_entry_dirty(entry)) pte = pte_mkdirty(pte); +#endif > > I'm not sure whether I missed something, nor can I write a reproducer > because I don't have sparc64 systems on hand, not to mention time. But > hopefully I explained why I think it's safer to just always double check on > the write bit to be the same as when before migration, irrelevant of > uffd-wp, vma pgprot or mk_pte(). > > For this patch itself, I'll rethink again when I read more on numa. Most probably we'll have to agree to disagree. -- Thanks, David / dhildenb