From: Peter Xu <peterx@redhat.com>
To: John Hubbard
Cc: Mike Kravetz, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    James Houghton, Jann Horn, Andrew Morton, Andrea Arcangeli,
    Rik van Riel, Nadav Amit, Miaohe Lin, Muchun Song, David Hildenbrand
Subject: Re: [PATCH 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare
Date: Tue, 6 Dec 2022 19:07:43 -0500
References: <20221129193526.3588187-1-peterx@redhat.com>
 <20221129193526.3588187-9-peterx@redhat.com>
 <0813b9ed-3c92-088c-4fb9-45fb648c6e73@nvidia.com>
 <97e3a8f2-df75-306e-2edf-85976c168955@nvidia.com>
On Tue, Dec 06, 2022 at 02:31:30PM -0800, John Hubbard wrote:
> On 12/6/22 13:51, Peter Xu wrote:
> > On Tue, Dec 06, 2022 at 01:03:45PM -0800, John Hubbard wrote:
> > > On 12/6/22 08:45, Peter Xu wrote:
> > > > I've got a fixup attached.  John, since this got your attention,
> > > > please have a look too in case there are further issues.
> > >
> > > Well, one question: Normally, the pattern of "release_lock(A); call f();
> > > acquire_lock(A);" is tricky, because one must revalidate that the state
> > > protected by A has not changed while the lock was released.  However, in
> > > this case, it's letting page fault handling proceed, which already
> > > assumes that pages might be gone, so generally that seems OK.
> >
> > Yes it's tricky, but not as tricky in this case.
> >
> > I hope the documentation I added (in the fixup patch) covers that:
> >
> > + * @hugetlb_entry:	if set, called for each hugetlb entry. Note that
> > + *			currently the hook function is protected by hugetlb
> > + *			vma lock to make sure pte_t* and the spinlock is valid
> > + *			to access. If the hook function needs to yield the

[1]

>
> So far so good...
>
> > + *			thread or retake the vma lock for some reason, it
> > + *			needs to properly release the vma lock manually,
> > + *			and retake it before the function returns.
>
> ...but you can actually delete this second sentence.  It does not add
> any real information--clearly, if you must drop the lock, then you must
> "manually" drop the lock.
>
> And it still ignores my original question, which I don't think I've
> fully communicated.  Basically, what can happen to the protected data
> during the time when the lock is not held?

I thought I answered this one at [1] above.  If not, I can extend the
answer.

What can happen is that another thread first unshares the pmd pgtable
page (e.g. by clearing the PUD entry in the current mm), then releases
the pmd pgtable page (e.g. by unmapping it) even while the current thread
is still accessing it.  That causes a use-after-free on the pmd pgtable
page for this thread, in various ways.

One way to trigger this is when the current thread tries to take the
pgtable lock; it then triggers a warning like the call stack referenced
in the cover letter of this series:

https://lore.kernel.org/r/20221129193526.3588187-1-peterx@redhat.com

Please also feel free to read the reproducer attached in the cover
letter; it has details on how this can trigger (it's so hard to trigger
that I added a delay in the kernel to make it reproduce).  The idea
should be the same.

> > The vma lock here makes sure the pte_t and the pgtable spinlock stay
> > stable.  Without the lock, they're prone to be freed in parallel.
>
> Yes, but think about this: if the vma lock protects against the pte
> going away, then:
>
>     lock()
>     get a pte
>     unlock()
>
>     ...let hmm_vma_fault() cond_resched() run...
>
>     lock()
>     ...whoops, something else released the pte that I'd previously
>     retrieved.

Here the pte_t* is never referenced again after hugetlb_entry() returns.
The loop looks like:

	do {
		next = hugetlb_entry_end(h, addr, end);
		pte = hugetlb_walk(vma, addr & hmask, sz);

		if (pte)
			err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
		else if (ops->pte_hole)
			err = ops->pte_hole(addr, next, -1, walk);

		if (err)
			break;
	} while (addr = next, addr != end);

After hugetlb_entry() returns, we _never_ touch that pte again, whether
it came from huge_pte_offset() or (with this patchset applied) from
hugetlb_walk().  If we touched it, that would be a potential bug, as you
mentioned.  But we don't.
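For reference, the hook side of the contract then looks roughly like the
below (a simplified sketch modeled on the hmm fixup in this series; the
required_fault computation and the pfn filling are elided, and the names
follow mm/hmm.c):

	static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
					      unsigned long start, unsigned long end,
					      struct mm_walk *walk)
	{
		struct vm_area_struct *vma = walk->vma;
		unsigned int required_fault;
		spinlock_t *ptl;
		pte_t entry;

		/*
		 * walk_hugetlb_range() holds the hugetlb vma read lock, so
		 * both *pte and the pgtable spinlock are stable here.
		 */
		ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
		entry = huge_ptep_get(pte);

		required_fault = 0; /* ... decide whether we must fault ... */
		if (required_fault) {
			int ret;

			spin_unlock(ptl);
			/*
			 * Drop the vma lock before the blocking call; from
			 * this point a concurrent pmd unshare may free the
			 * pgtable page, so *pte must not be touched again.
			 */
			hugetlb_vma_unlock_read(vma);
			ret = hmm_vma_fault(start, end, required_fault, walk);
			/* Retake it so the caller's unlock stays balanced. */
			hugetlb_vma_lock_read(vma);
			return ret;
		}

		/* ... fill pfns from "entry", still under ptl ... */
		spin_unlock(ptl);
		return 0;
	}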
Hope it explains.

> > > However, I'm lagging behind on understanding what the vma lock actually
> > > protects.  It seems to be a hugetlb-specific protection for concurrent
> > > freeing of the page tables?
> >
> > Not exactly freeing, but unsharing.  Mike probably has more to say.  The
> > series is here:
> >
> > https://lore.kernel.org/all/20220914221810.95771-1-mike.kravetz@oracle.com/#t
> >
> > > If so, then running a page fault handler seems safe.  If there's something
> > > else it protects, then we might need to revalidate that after
> > > re-acquiring the vma lock.
> >
> > Nothing to validate here.  The only reason to take the vma lock is to
> > match the caller, who assumes the lock is taken, so either it'll be
> > released very soon or it prepares for the next hugetlb pgtable walk
> > (huge_pte_offset).
>
> ummm, see above. :)
>
> > > Also, scattering hugetlb-specific locks throughout mm seems like an
> > > unfortunate thing, I wonder if there is a longer term plan to Not Do
> > > That?
> >
> > So far HMM is really the only one - normally the hugetlb_entry() hook is
> > pretty light, so it's not really throughout the whole mm yet.  It's not
> > even urgently needed for the other two places calling cond_resched(); I
> > added it mostly for completeness, and with the slight hope that maybe we
> > can yield earlier for some pmd unsharers.
> >
> > But yes it's unfortunate, I just didn't come up with a good solution.
> > Suggestions are always welcome.
>
> I guess it's on me to think of something cleaner, so if I do I'll pipe
> up. :)

That'll be very much appreciated.  It's really that I don't know how to
make this better; I can still rework the series as long as it hasn't
landed upstream.

Thanks,

-- 
Peter Xu