All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: Rik van Riel <riel@surriel.com>
Cc: Chris Mason <clm@meta.com>, David Hildenbrand <david@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kernel-team@meta.com, Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [BUG] hugetlbfs_no_page vs MADV_DONTNEED race leading to SIGBUS
Date: Tue, 25 Oct 2022 20:07:42 -0700	[thread overview]
Message-ID: <Y1ikfoLS0SdA6dna@monkey> (raw)
In-Reply-To: <215d225585ff3c5ea90c64e6c9bdff04ab548156.camel@surriel.com>

On 10/25/22 16:37, Rik van Riel wrote:
> Hi Mike,
> 
> After getting promising results initially, we discovered there
> is yet another bug left with hugetlbfs MADV_DONTNEED.
> 
> This one involves a page fault on a hugetlbfs address, while
> another thread in the same process is in the middle of MADV_DONTNEED
> on that same memory address.
> 
> The code in __unmap_hugepage_range() will clear the page table
> entry, and then at some point later the lazy TLB code will 
> actually free the huge page back into the hugetlbfs free page
> pool.
> 
> Meanwhile, hugetlb_no_page will call alloc_huge_page, and that
> will fail because the code calling __unmap_hugepage_range() has
> not actually returned the page to the free list yet.
> 
> The result is that the process gets killed with SIGBUS.

Thanks for the details Rik.  That makes perfect sense.

> I have thought of a few different solutions to this problem, but
> none of them look good:
> - Make MADV_DONTNEED take a write lock on mmap_sem, to exclude
>   page faults. This could make MADV_DONTNEED on VMAs with 4kB
>   pages unacceptably slow.
> - Some sort of atomic counter kept by __unmap_hugepage_range()
>   that huge pages may be getting placed in the tlb gather, and
>   freed later by tlb_finish_mmu().  This would involve changes
>   to the MMU gather code, outside of hugetlbfs.
> - Some sort of generation counter that tracks tlb_gather_mmu
>   cycles in progress, with the alloc_huge_page failure path
>   waiting until all mmu gather operations that started before
>   it to finish, before retrying the allocation. This requires
>   changes to the generic code, outside of hugetlbfs.
> 
> What are the reasonable alternatives here?
> 
> Should we see if anybody can come up with a simple solution
> to the problem, or would it be better to just disable
> MADV_DONTNEED on hugetlbfs for now?

I am thinking that we could use the vma_lock to prevent faults on the
vma until the MADV_DONTNEED processing is totally complete.  IIUC, it is
the tlb_finish_mmu call that actually drops the ref count on the pages
and returns them to the free list.  Currently, we hold the vma_lock in
write mode during unmap, and acquire it in read mode during faults.
However, we drop it in the MADV_DONTNEED path (just a bit) before calling
tlb_finish_mmu.  So, holding a bit longer may address this issue rather
simply.

Below is another version of the "hugetlb: don't delete vma_lock in hugetlb
MADV_DONTNEED processing" patch.  It is very much like the version in
Andrew's tree but the code has been shuffled a bit to hold the vma_lock
longer in the MADV_DONTNEED path.  Can you take a look and see if that
addresses the issue for you?

From 9a1f047ce42f76b3befd673da5ee2bd49bea6c75 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Tue, 25 Oct 2022 20:02:58 -0700
Subject: [PATCH v4] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED
 processing

madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the
page tables associated with the address range.  For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final.  However,
__unmap_hugepage_range_final assumes the passed vma is about to be
removed and deletes the vma_lock to prevent pmd sharing as the vma is
on the way out.  In the case of madvise(MADV_DONTNEED) the vma remains,
but the missing vma_lock prevents pmd sharing and could potentially
lead to issues with truncation/fault races.

This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing.  Subsequent faults on this vma were confused as
VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping
was not set in new pages added to the page table.  This resulted in
pages that appeared anonymous in a VM_SHARED vma and triggered the BUG.

Create a new routine clear_hugetlb_page_range() that can be called from
madvise(MADV_DONTNEED) for hugetlb vmas.  It has the same setup as
zap_page_range, but does not delete the vma_lock.

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/

Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Reported-by: Wei Chen <harperchen1110@gmail.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/hugetlb.h |  7 +++++++
 mm/hugetlb.c            | 30 ++++++++++++++++++++++++++++++
 mm/madvise.c            |  5 ++++-
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a899bc76d677..0246e77be3a3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -158,6 +158,8 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 void unmap_hugepage_range(struct vm_area_struct *,
 			  unsigned long, unsigned long, struct page *,
 			  zap_flags_t);
+void clear_hugetlb_page_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end);
 void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma,
 			  unsigned long start, unsigned long end,
@@ -426,6 +428,11 @@ static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 	BUG();
 }
 
+static void __maybe_unused clear_hugetlb_page_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+}
+
 static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long address,
 			unsigned int flags)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 931789a8f734..0a52c3938ce9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5221,9 +5221,39 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 	 * will be issues when accessed by someone else.
 	 */
 	__hugetlb_vma_unlock_write_free(vma);
+	i_mmap_unlock_write(vma->vm_file->f_mapping);
+}
+
+#ifdef CONFIG_ADVISE_SYSCALLS
+/*
+ * Similar setup as in zap_page_range().  madvise(MADV_DONTNEED) can not call
+ * zap_page_range for hugetlb vmas as __unmap_hugepage_range_final will delete
+ * the associated vma_lock.
+ */
+void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
+				unsigned long end)
+{
+	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+				start, end);
+	tlb_gather_mmu(&tlb, vma->vm_mm);
+	update_hiwater_rss(vma->vm_mm);
+
+	hugetlb_vma_lock_write(vma);
+	i_mmap_lock_write(vma->vm_file->f_mapping);
+
+	__unmap_hugepage_range(&tlb, vma, start, end, NULL, 0);
 
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
+
+	mmu_notifier_invalidate_range_end(&range);
+	tlb_finish_mmu(&tlb);
+
+	hugetlb_vma_unlock_write(vma);
 }
+#endif
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page,
diff --git a/mm/madvise.c b/mm/madvise.c
index 2baa93ca2310..90577a669635 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
-	zap_page_range(vma, start, end - start);
+	if (!is_vm_hugetlb_page(vma))
+		zap_page_range(vma, start, end - start);
+	else
+		clear_hugetlb_page_range(vma, start, end);
 	return 0;
 }
 
-- 
2.37.3



  reply	other threads:[~2022-10-26  3:07 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-25 20:37 [BUG] hugetlbfs_no_page vs MADV_DONTNEED race leading to SIGBUS Rik van Riel
2022-10-26  3:07 ` Mike Kravetz [this message]
2022-10-26  4:13   ` Mike Kravetz
2022-10-26 19:38 ` Mike Kravetz
2022-10-27 15:07   ` Johannes Weiner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y1ikfoLS0SdA6dna@monkey \
    --to=mike.kravetz@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=clm@meta.com \
    --cc=david@redhat.com \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=riel@surriel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.