Message-ID: <7d4e7f47-30a5-3cc6-dc9f-aa89120847d8@redhat.com>
Date: Tue, 23 Aug 2022 12:23:27 +0200
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
Subject: Re: [PATCH v2 1/5] mm/hugetlb: fix races when looking up a CONT-PTE
 size hugetlb page
To: Baolin Wang <baolin.wang@linux.alibaba.com>, akpm@linux-foundation.org,
 songmuchun@bytedance.com, mike.kravetz@oracle.com
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <0e5d92da043d147a867f634b17acbcc97a7f0e64.1661240170.git.baolin.wang@linux.alibaba.com>
 <4c24b891-04ce-2608-79d2-a75dc236533f@redhat.com>
 <376d2e0a-d28a-984b-903c-1f6451b04a15@linux.alibaba.com>
In-Reply-To: <376d2e0a-d28a-984b-903c-1f6451b04a15@linux.alibaba.com>

On 23.08.22 12:02, Baolin Wang wrote:
> 
> 
> On 8/23/2022 4:29 PM, David Hildenbrand wrote:
>> On 23.08.22 09:50, Baolin Wang wrote:
>>> Some architectures (like ARM64) support CONT-PTE/PMD size hugetlb,
>>> which means they support not only PMD/PUD size hugetlb pages (2M and
>>> 1G), but also CONT-PTE/PMD sizes (64K and 32M) if a 4K base page
>>> size is specified.
>>>
>>> So when looking up a CONT-PTE size hugetlb page via follow_page(),
>>> follow_page_pte() uses pte_offset_map_lock() to take the pte entry
>>> lock. However, that lock is incorrect for a CONT-PTE size hugetlb
>>> page; we should use huge_pte_lock() to take the correct lock, which
>>> is mm->page_table_lock.
>>>
>>> That means the pte entry of a CONT-PTE size hugetlb page is unstable
>>> under the lock follow_page_pte() currently takes: the entry can
>>> still be migrated or poisoned concurrently, which can cause
>>> potential races, and the subsequent pte_xxx() checks in
>>> follow_page_pte() are equally unstable, even though they run under
>>> the 'pte lock'.
>>>
>>> Moreover, we should use huge_ptep_get() to read the pte entry of a
>>> CONT-PTE size hugetlb page, since it already accounts for the
>>> subpages' dirty and young bits, so we do not miss the dirty or young
>>> state of the CONT-PTE size hugetlb page.
>>>
>>> To fix the above issues, introduce a new helper, follow_huge_pte(),
>>> to look up a CONT-PTE size hugetlb page. It uses huge_pte_lock() to
>>> take the correct pte entry lock to keep the pte entry stable, and
>>> also handles non-present pte entries.
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>>   include/linux/hugetlb.h |  8 ++++++++
>>>   mm/gup.c                | 11 ++++++++++
>>>   mm/hugetlb.c            | 53 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>   3 files changed, 72 insertions(+)
>>>
>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>> index 3ec981a..d491138 100644
>>> --- a/include/linux/hugetlb.h
>>> +++ b/include/linux/hugetlb.h
>>> @@ -207,6 +207,8 @@ struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
>>>  struct page *follow_huge_pd(struct vm_area_struct *vma,
>>>  			    unsigned long address, hugepd_t hpd,
>>>  			    int flags, int pdshift);
>>> +struct page *follow_huge_pte(struct vm_area_struct *vma, unsigned long address,
>>> +			     pmd_t *pmd, int flags);
>>>  struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>>>  			     pmd_t *pmd, int flags);
>>>  struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
>>> @@ -312,6 +314,12 @@ static inline struct page *follow_huge_pd(struct vm_area_struct *vma,
>>>  	return NULL;
>>>  }
>>>  
>>> +static inline struct page *follow_huge_pte(struct vm_area_struct *vma,
>>> +			unsigned long address, pmd_t *pmd, int flags)
>>> +{
>>> +	return NULL;
>>> +}
>>> +
>>>  static inline struct page *follow_huge_pmd(struct mm_struct *mm,
>>>  			unsigned long address, pmd_t *pmd, int flags)
>>>  {
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 3b656b7..87a94f5 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -534,6 +534,17 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>>>  	if (unlikely(pmd_bad(*pmd)))
>>>  		return no_page_table(vma, flags);
>>>  
>>> +	/*
>>> +	 * Considering PTE level hugetlb, like continuous-PTE hugetlb on
>>> +	 * ARM64 architecture.
>>> +	 */
>>> +	if (is_vm_hugetlb_page(vma)) {
>>> +		page = follow_huge_pte(vma, address, pmd, flags);
>>> +		if (page)
>>> +			return page;
>>> +		return no_page_table(vma, flags);
>>> +	}
>>> +
>>>  	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
>>>  	pte = *ptep;
>>>  	if (!pte_present(pte)) {
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 6c00ba1..cf742d1 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -6981,6 +6981,59 @@ struct page * __weak
>>>  	return NULL;
>>>  }
>>>  
>>> +/* Support looking up a CONT-PTE size hugetlb page. */
>>> +struct page * __weak
>>> +follow_huge_pte(struct vm_area_struct *vma, unsigned long address,
>>> +		pmd_t *pmd, int flags)
>>> +{
>>> +	struct mm_struct *mm = vma->vm_mm;
>>> +	struct hstate *hstate = hstate_vma(vma);
>>> +	unsigned long size = huge_page_size(hstate);
>>> +	struct page *page = NULL;
>>> +	spinlock_t *ptl;
>>> +	pte_t *ptep, pte;
>>> +
>>> +	/*
>>> +	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
>>> +	 * follow_hugetlb_page().
>>> +	 */
>>> +	if (WARN_ON_ONCE(flags & FOLL_PIN))
>>> +		return NULL;
>>> +
>>> +	ptep = huge_pte_offset(mm, address, size);
>>> +	if (!ptep)
>>> +		return NULL;
>>> +
>>> +retry:
>>> +	ptl = huge_pte_lock(hstate, mm, ptep);
>>> +	pte = huge_ptep_get(ptep);
>>> +	if (pte_present(pte)) {
>>> +		page = pte_page(pte);
>>> +		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
>>> +			page = NULL;
>>> +			goto out;
>>> +		}
>>> +	} else {
>>> +		if (!(flags & FOLL_MIGRATION)) {
>>> +			page = NULL;
>>> +			goto out;
>>> +		}
>>> +
>>> +		if (is_hugetlb_entry_migration(pte)) {
>>> +			spin_unlock(ptl);
>>> +			__migration_entry_wait_huge(ptep, ptl);
>>> +			goto retry;
>>> +		}
>>> +		/*
>>> +		 * hwpoisoned entry is treated as no_page_table in
>>> +		 * follow_page_mask().
>>> +		 */
>>> +	}
>>> +out:
>>> +	spin_unlock(ptl);
>>> +	return page;
>>> +}
>>> +
>>>  struct page * __weak
>>>  follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>>>  		pmd_t *pmd, int flags)
>>
>>
>> Can someone explain why:
>> * follow_page() goes via follow_page_mask() for hugetlb
>> * __get_user_pages() goes via follow_hugetlb_page() and never via
>>   follow_page_mask() for hugetlb?
>>
>> IOW, why can't we make follow_page_mask() just not handle hugetlb and
>> route everything via follow_hugetlb_page() -- we primarily only have to
>> teach it to not trigger faults.
> 
> IMHO, these follow_huge_xxx() functions were arch-specific at first and
> were moved into the common hugetlb.c by commit 9e5fc74c3025 ("mm:
> hugetlb: Copy general hugetlb code from x86 to mm"), and there are
> still some arch-specific follow_huge_xxx() definitions, for example:
> ia64: follow_huge_addr
> powerpc: follow_huge_pd
> s390: follow_huge_pud
> 
> What I mean is that follow_hugetlb_page() is a common, non-arch-specific
> function; is it suitable to change it to be arch-specific?
> 
> And thinking about it more, can we rename follow_hugetlb_page() to
> hugetlb_page_faultin() and simplify it to only handle hugetlb page
> faults, like faultin_page() does for normal pages? That way we can make
> sure that only follow_page_mask() handles hugetlb lookups.

If follow_hugetlb_page() can be arch-independent, why do we need the
other arch-dependent functions? It all looks a bit weird to have two
functions that walk page tables and are hugetlb aware.

Either this screams for a cleanup or I am missing something fundamental.

-- 
Thanks,

David / dhildenb
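
For readers following the thread, here is a rough sketch of the two
hugetlb-aware lookup paths being discussed, assuming the v2 patch above
is applied. The function names are taken from the patch and the thread;
the call graph is a simplification, not an authoritative trace of the
kernel:

    follow_page()
      -> follow_page_mask()
           -> follow_huge_addr() / follow_huge_pd() /
              follow_huge_pud() / follow_huge_pmd()
                                      (per-level hugetlb hooks, some
                                       with arch-specific overrides)
           -> follow_page_pte()
                -> follow_huge_pte()  (new in this patch; takes
                                       huge_pte_lock() instead of the
                                       per-PTE lock)

    __get_user_pages()
      -> follow_hugetlb_page()        (hugetlb VMAs are handled here
                                       and never reach
                                       follow_page_mask())

David's question is why these two hugetlb-aware page-table walkers
coexist; Baolin's suggestion is to reduce follow_hugetlb_page() to pure
fault handling (the proposed hugetlb_page_faultin()) so that only
follow_page_mask() performs hugetlb lookups.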