From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9C308C433F5 for ; Tue, 4 Oct 2022 12:19:46 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id A70E56B0072; Tue, 4 Oct 2022 08:19:45 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id A20746B0073; Tue, 4 Oct 2022 08:19:45 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 89A706B0074; Tue, 4 Oct 2022 08:19:45 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 75FA96B0072 for ; Tue, 4 Oct 2022 08:19:45 -0400 (EDT) Received: from smtpin27.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 47C28160934 for ; Tue, 4 Oct 2022 12:19:45 +0000 (UTC) X-FDA: 79983173130.27.FDC70E5 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by imf25.hostedemail.com (Postfix) with ESMTP id AF0F5A0004 for ; Tue, 4 Oct 2022 12:19:43 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1664885982; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=CGLX/ljMcH9ugjySzTmYidS+NlMSbNJzq8WccMn70Z8=; b=ewipdacI7yBnMfRXfskDFDY4bLMzSdnezPBNYtR2LQ+xyX5kuuVgqd0LwvG1nCLPhpI1sz f1P350AQwswWsEZ5SHC0u3zPn3RoR6Tw8e/UlHyHNInjvzyqAYxAi2l7qkBusmOHbDYDi3 6866qaIUkqluEZ9umGI2EPvYos6r+rk= Received: from mail-wr1-f69.google.com (mail-wr1-f69.google.com [209.85.221.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-451-UgcUwYBoPmuk7jTpB2G_Xw-1; Tue, 04 Oct 2022 08:19:39 -0400 X-MC-Unique: UgcUwYBoPmuk7jTpB2G_Xw-1 Received: by mail-wr1-f69.google.com with SMTP id m3-20020adfc583000000b0022cd60175bbso3984165wrg.6 for ; Tue, 04 Oct 2022 05:19:39 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:organization:from:references :cc:to:content-language:subject:user-agent:mime-version:date :message-id:x-gm-message-state:from:to:cc:subject:date; bh=CGLX/ljMcH9ugjySzTmYidS+NlMSbNJzq8WccMn70Z8=; b=P7hCRUked/pMmCd/jP0s7OGbQaPp30zdvMElcE+X2tk1Jato3+LkyOA8Rbm6XHd0XC Tgdi4aHJAiWhHYwWahzN+ZdcRFMDg83u/12Zv8JRTST5amOBgKFmrQmeCQmIO/OUi8L4 7Hq6xV19v3FnZMjJv2tOL+s20zfNQmANI/Njp++qNtzE1XnqB31TAP9EmrgnTivCNpzT 8VCl2CK/ekMxvjGC/cwDRIrUvenVvQaOg+VKYLbRG5PyxHGn6eyPVlZazHmvJ7iiUlCz DeKwUh51XVVburTqJizA1KP5JH9NyrCiiuwEWNSZP72pVrpPckEu01V4qLrQcPp/ZbDL XMBw== X-Gm-Message-State: ACrzQf3tRAru4mFezotRDzTgfUA+huV6/CRls4FVIgHOqwGHOhln1GcN 8oNk/vOrvxUUj0q2ekk+RWiWQAS1beLn9Qt/6E8savNolcxeIxCZcRZHNPOREtPpgqHhsDnufNc /MMPvu7KidlM= X-Received: by 2002:adf:db0d:0:b0:22a:eeed:5145 with SMTP id s13-20020adfdb0d000000b0022aeeed5145mr16093045wri.590.1664885978629; Tue, 04 Oct 2022 05:19:38 -0700 (PDT) X-Google-Smtp-Source: AMsMyM50kctHe5RVsmDHaLGVaOyZ1F/I7XlFLf51A/N/qAKqGSHorWVQRXtqrUqNcrh7JdYaYygADA== X-Received: by 2002:adf:db0d:0:b0:22a:eeed:5145 with SMTP id s13-20020adfdb0d000000b0022aeeed5145mr16093017wri.590.1664885978307; Tue, 04 Oct 2022 05:19:38 -0700 (PDT) Received: from ?IPV6:2003:cb:c706:5000:4fff:1dd6:7868:a36? (p200300cbc70650004fff1dd678680a36.dip0.t-ipconnect.de. [2003:cb:c706:5000:4fff:1dd6:7868:a36]) by smtp.gmail.com with ESMTPSA id bv14-20020a0560001f0e00b0022ae8b862a7sm12583632wrb.35.2022.10.04.05.19.37 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 04 Oct 2022 05:19:37 -0700 (PDT) Message-ID: <41fb1d6c-0d36-e88c-39fe-ea1e9d80a1fc@redhat.com> Date: Tue, 4 Oct 2022 14:19:36 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.3.0 Subject: Re: [PATCH v2 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling To: Peter Xu , linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Andrew Morton , Nadav Amit , Mike Kravetz , Andrea Arcangeli , Axel Rasmussen , Mike Rapoport References: <20221004003705.497782-1-peterx@redhat.com> <20221004003705.497782-2-peterx@redhat.com> From: David Hildenbrand Organization: Red Hat In-Reply-To: <20221004003705.497782-2-peterx@redhat.com> X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1664885984; a=rsa-sha256; cv=none; b=bCF+XDK+bRlBmUSkSDIYW1KEmX8Xe8r9g9MPdDiWic7HQRj5m5O8QnDrPaq7FOL7ZnPgio x4+jsP6PzSHetWhySAdhFRBvtx3HZk6oA0idHNNQ1Fkdu8dac1hkDlwWIjE5w0cr1+EEcL tBdMgB6LU3WjqRyKfgn+pQE2zsRTsuc= ARC-Authentication-Results: i=1; imf25.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=ewipdacI; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf25.hostedemail.com: domain of david@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=david@redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1664885984; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=CGLX/ljMcH9ugjySzTmYidS+NlMSbNJzq8WccMn70Z8=; b=IHwdNNmriGtcDrmoQZbkxq4Vtrf2k0zeFWFEfWzCU6Gy3ZUjCxZJpKcIG7MsHFYDa0iFeN MBPvKsALhKoT92zoMInrzOOkhbJDYXPCjFgVmoZKOFj0k2MllloGubwNs3fcmjrB7wtWEA +D0pEe5Shmv9xSLWup5WkruKd7PhBOk= X-Rspam-User: Authentication-Results: imf25.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=ewipdacI; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf25.hostedemail.com: domain of david@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=david@redhat.com X-Stat-Signature: jmodmrnd8fxjr3gsisqnjjwn71fkog8w X-Rspamd-Queue-Id: AF0F5A0004 X-Rspamd-Server: rspam09 X-HE-Tag: 1664885983-758754 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 04.10.22 02:37, Peter Xu wrote: > After the recent rework patchset of hugetlb locking on pmd sharing, > kselftest for userfaultfd sometimes fails on hugetlb private tests with > unexpected write fault checks. > > It turns out there's nothing wrong within the locking series regarding this > matter, but it could have changed the timing of threads so it can trigger > an old bug. > > The real bug is when we call hugetlb_no_page() we're not with the pgtable > lock. It means we're reading the pte values lockless. It's perfectly fine > in most cases because before we do normal page allocations we'll take the > lock and check pte_same() again. However before that, there are actually > two paths on userfaultfd missing/minor handling that may directly move on > with the fault process without checking the pte values. > > It means for these two paths we may be generating an uffd message based on > an unstable pte, while an unstable pte can legally be anything as long as > the modifier holds the pgtable lock. > > One example, which is also what happened in the failing kselftest and > caused the test failure, is that for private mappings wr-protection changes > can happen on one page. While hugetlb_change_protection() generally > requires pte being cleared before being changed, then there can be a race > condition like: > > thread 1 thread 2 > -------- -------- > > UFFDIO_WRITEPROTECT hugetlb_fault > hugetlb_change_protection > pgtable_lock() > huge_ptep_modify_prot_start > pte==NULL > hugetlb_no_page > generate uffd missing event > even if page existed!! > huge_ptep_modify_prot_commit > pgtable_unlock() > > Fix this by recheck the pte after pgtable lock for both userfaultfd missing > & minor fault paths. > > This bug should have been around starting from uffd hugetlb introduced, so > attaching a Fixes to the commit. Also attach another Fixes to the minor > support commit for easier tracking. > > Note that userfaultfd is actually fine with false positives (e.g. caused by > pte changed), but not wrong logical events (e.g. caused by reading a pte > during changing). The latter can confuse the userspace, so the strictness > is very much preferred. E.g., MISSING event should never happen on the > page after UFFDIO_COPY has correctly installed the page and returned. > > Cc: Andrea Arcangeli > Cc: Mike Kravetz > Cc: Axel Rasmussen > Cc: Nadav Amit > Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook") > Fixes: 7677f7fd8be7 ("userfaultfd: add minor fault registration mode") > Co-developed-by: Mike Kravetz > Signed-off-by: Peter Xu > --- > mm/hugetlb.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 49 insertions(+), 6 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 9679fe519b90..fa3fcdb0c4b8 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -5521,6 +5521,23 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma, > return ret; > } > > +/* > + * Recheck pte with pgtable lock. Returns true if pte didn't change, or > + * false if pte changed or is changing. > + */ > +static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm, > + pte_t *ptep, pte_t old_pte) > +{ > + spinlock_t *ptl; > + bool same; > + > + ptl = huge_pte_lock(h, mm, ptep); > + same = pte_same(huge_ptep_get(ptep), old_pte); > + spin_unlock(ptl); > + > + return same; > +} > + > static vm_fault_t hugetlb_no_page(struct mm_struct *mm, > struct vm_area_struct *vma, > struct address_space *mapping, pgoff_t idx, > @@ -5562,9 +5579,30 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, > goto out; > /* Check for page in userfault range */ > if (userfaultfd_missing(vma)) { > - ret = hugetlb_handle_userfault(vma, mapping, idx, > - flags, haddr, address, > - VM_UFFD_MISSING); > + /* > + * Since hugetlb_no_page() was examining pte > + * without pgtable lock, we need to re-test under > + * lock because the pte may not be stable and could > + * have changed from under us. Try to detect > + * either changed or during-changing ptes and retry > + * properly when needed. > + * > + * Note that userfaultfd is actually fine with > + * false positives (e.g. caused by pte changed), > + * but not wrong logical events (e.g. caused by > + * reading a pte during changing). The latter can > + * confuse the userspace, so the strictness is very > + * much preferred. E.g., MISSING event should > + * never happen on the page after UFFDIO_COPY has > + * correctly installed the page and returned. > + */ > + if (hugetlb_pte_stable(h, mm, ptep, old_pte)) > + ret = hugetlb_handle_userfault( > + vma, mapping, idx, flags, haddr, > + address, VM_UFFD_MISSING); That looks kind-of ugly now. I wonder if it would be worth factoring that handling out into a separate function and reusing it at two places. Would get rid of one level of code indent at least. Apart from that, LGTM. Although the lockless reading of the PTE screams for more trouble in the future :) -- Thanks, David / dhildenb