From: Peter Xu <peterx@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Muchun Song, Miaohe Lin, Andrea Arcangeli, Nadav Amit, James Houghton, peterx@redhat.com, Mike Kravetz, David Hildenbrand, Rik van Riel, John Hubbard, Andrew Morton, Jann Horn
Subject: [PATCH v4 4/9] mm/hugetlb: Move swap entry handling into vma lock when faulted
Date: Fri, 16 Dec 2022 10:50:55 -0500
Message-Id: <20221216155100.2043537-5-peterx@redhat.com>
In-Reply-To: <20221216155100.2043537-1-peterx@redhat.com>
References: <20221216155100.2043537-1-peterx@redhat.com>
In hugetlb_fault(), there used to be a special path at the entrance to
handle swap entries using huge_pte_offset().  That is unsafe, because
huge_pte_offset() on a pmd-sharable range can access freed pgtables when
no lock protects the pgtable from being freed after a pmd unshare.

The simplest way to make this safe is to move the swap handling to after
the vma lock is held.  We may now need to take the fault mutex on either
migration or hwpoison entries (and also the vma lock, but that one is
genuinely required); however, neither of these is a hot path.
Note that the vma lock cannot be released in hugetlb_fault() when the
migration entry is detected, because in migration_entry_wait_huge() the
pgtable page will be used again (by taking the pgtable lock), so it also
needs to be protected by the vma lock.  Modify migration_entry_wait_huge()
so that it must be called with the vma read lock held, and properly
release the lock in __migration_entry_wait_huge().

Reviewed-by: Mike Kravetz
Reviewed-by: John Hubbard
Signed-off-by: Peter Xu
---
 include/linux/swapops.h |  6 ++++--
 mm/hugetlb.c            | 37 ++++++++++++++++---------------------
 mm/migrate.c            | 25 +++++++++++++++++++++----
 3 files changed, 41 insertions(+), 27 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index b982dd614572..3a451b7afcb3 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -337,7 +337,8 @@ extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 		unsigned long address);
 #ifdef CONFIG_HUGETLB_PAGE
-extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl);
+extern void __migration_entry_wait_huge(struct vm_area_struct *vma,
+					pte_t *ptep, spinlock_t *ptl);
 extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
 #endif	/* CONFIG_HUGETLB_PAGE */
 #else  /* CONFIG_MIGRATION */
@@ -366,7 +367,8 @@ static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 		unsigned long address) { }
 #ifdef CONFIG_HUGETLB_PAGE
-static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { }
+static inline void __migration_entry_wait_huge(struct vm_area_struct *vma,
+					       pte_t *ptep, spinlock_t *ptl) { }
 static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
 					     pte_t *pte) { }
 #endif	/* CONFIG_HUGETLB_PAGE */
 static inline int is_writable_migration_entry(swp_entry_t entry)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8ccd55f9fbd3..64512a151567 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5972,22 +5972,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int need_wait_lock = 0;
 	unsigned long haddr = address & huge_page_mask(h);
 
-	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
-	if (ptep) {
-		/*
-		 * Since we hold no locks, ptep could be stale.  That is
-		 * OK as we are only making decisions based on content and
-		 * not actually modifying content here.
-		 */
-		entry = huge_ptep_get(ptep);
-		if (unlikely(is_hugetlb_entry_migration(entry))) {
-			migration_entry_wait_huge(vma, ptep);
-			return 0;
-		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
-			return VM_FAULT_HWPOISON_LARGE |
-				VM_FAULT_SET_HINDEX(hstate_index(h));
-	}
-
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
@@ -6002,10 +5986,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Acquire vma lock before calling huge_pte_alloc and hold
 	 * until finished with ptep.  This prevents huge_pmd_unshare from
 	 * being called elsewhere and making the ptep no longer valid.
-	 *
-	 * ptep could have already be assigned via huge_pte_offset.  That
-	 * is OK, as huge_pte_alloc will return the same value unless
-	 * something has changed.
 	 */
 	hugetlb_vma_lock_read(vma);
 	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
@@ -6034,8 +6014,23 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * fault, and is_hugetlb_entry_(migration|hwpoisoned) check will
 	 * properly handle it.
 	 */
-	if (!pte_present(entry))
+	if (!pte_present(entry)) {
+		if (unlikely(is_hugetlb_entry_migration(entry))) {
+			/*
+			 * Release the hugetlb fault lock now, but retain
+			 * the vma lock, because it is needed to guard the
+			 * huge_pte_lockptr() later in
+			 * migration_entry_wait_huge().  The vma lock will
+			 * be released there.
+			 */
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			migration_entry_wait_huge(vma, ptep);
+			return 0;
+		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+			ret = VM_FAULT_HWPOISON_LARGE |
+				VM_FAULT_SET_HINDEX(hstate_index(h));
 		goto out_mutex;
+	}
 
 	/*
 	 * If we are going to COW/unshare the mapping later, we examine the
diff --git a/mm/migrate.c b/mm/migrate.c
index a4d3fc65085f..98de7ce2b576 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -329,24 +329,41 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl)
+/*
+ * The vma read lock must be held upon entry.  Holding that lock prevents
+ * either the pte or the ptl from being freed.
+ *
+ * This function will release the vma lock before returning.
+ */
+void __migration_entry_wait_huge(struct vm_area_struct *vma,
+				 pte_t *ptep, spinlock_t *ptl)
 {
 	pte_t pte;
 
+	hugetlb_vma_assert_locked(vma);
 	spin_lock(ptl);
 	pte = huge_ptep_get(ptep);
 
-	if (unlikely(!is_hugetlb_entry_migration(pte)))
+	if (unlikely(!is_hugetlb_entry_migration(pte))) {
 		spin_unlock(ptl);
-	else
+		hugetlb_vma_unlock_read(vma);
+	} else {
+		/*
+		 * If migration entry existed, safe to release vma lock
+		 * here because the pgtable page won't be freed without the
+		 * pgtable lock released.  See comment right above pgtable
+		 * lock release in migration_entry_wait_on_locked().
+		 */
+		hugetlb_vma_unlock_read(vma);
 		migration_entry_wait_on_locked(pte_to_swp_entry(pte), NULL, ptl);
+	}
 }
 
 void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
 {
 	spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte);
 
-	__migration_entry_wait_huge(pte, ptl);
+	__migration_entry_wait_huge(vma, pte, ptl);
 }
 #endif
-- 
2.37.3