* [PATCH v10 1/3] madvise: use zap_page_range_single for madvise dontneed
2022-11-14 23:55 [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling Mike Kravetz
@ 2022-11-14 23:55 ` Mike Kravetz
2022-11-14 23:55 ` [PATCH v10 2/3] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing Mike Kravetz
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Mike Kravetz @ 2022-11-14 23:55 UTC
To: linux-mm, linux-kernel
Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
Peter Xu, Nadav Amit, Rik van Riel, Vlastimil Babka,
Matthew Wilcox, Andrew Morton, Mike Kravetz, Wei Chen, stable
Expose the routine zap_page_range_single to zap a range within a single
vma. The madvise routine madvise_dontneed_single_vma can use this
routine as it explicitly operates on a single vma. Also, update the mmu
notification range in zap_page_range_single to take hugetlb pmd sharing
into account. This is required as MADV_DONTNEED supports hugetlb vmas.
Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Wei Chen <harperchen1110@gmail.com>
Cc: <stable@vger.kernel.org>
---
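A note on the range handling described in the changelog, as a rough
userspace sketch rather than the kernel code: the mmu notifier range may
be widened to PUD boundaries when pmd sharing is possible, while the range
that is actually unmapped stays at address..end. Everything prefixed with
model_ or MODEL_ is hypothetical, the 1 GiB PUD size assumes x86-64 with
4 KiB base pages, and the alignment logic only approximates what
adjust_range_if_pmd_sharing_possible() is used for here.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: x86-64 with 4 KiB base pages, so one PUD entry spans 1 GiB. */
#define MODEL_PUD_SIZE (1UL << 30)

#define MODEL_ALIGN_DOWN(x, a) ((x) & ~((a) - 1))
#define MODEL_ALIGN_UP(x, a)   MODEL_ALIGN_DOWN((x) + (a) - 1, (a))

/* Hypothetical stand-in for the few vm_area_struct fields used here. */
struct model_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	bool may_share;		/* models VM_MAYSHARE */
};

/* What one modelled zap notifies versus what it unmaps. */
struct model_zap {
	unsigned long notify_start, notify_end;
	unsigned long unmap_start, unmap_end;
};

/* Widen [*start, *end) to PUD boundaries when the vma could share pmds. */
static void model_adjust_range(const struct model_vma *vma,
			       unsigned long *start, unsigned long *end)
{
	unsigned long v_start = MODEL_ALIGN_UP(vma->vm_start, MODEL_PUD_SIZE);
	unsigned long v_end = MODEL_ALIGN_DOWN(vma->vm_end, MODEL_PUD_SIZE);

	assert(*start <= *end);

	/* No sharing: vma not shared, too small, or range outside of it. */
	if (!vma->may_share || v_end <= v_start ||
	    *end <= v_start || *start >= v_end)
		return;

	if (*start > v_start)
		*start = MODEL_ALIGN_DOWN(*start, MODEL_PUD_SIZE);
	if (*end < v_end)
		*end = MODEL_ALIGN_UP(*end, MODEL_PUD_SIZE);
}

/* Models the range bookkeeping of zap_page_range_single() after this patch. */
struct model_zap model_zap_single(const struct model_vma *vma,
				  unsigned long address, unsigned long size)
{
	const unsigned long end = address + size;
	struct model_zap z = {
		.notify_start = address, .notify_end = end,
		.unmap_start = address, .unmap_end = end,
	};

	/* Only the notifier range is widened; the unmap range is untouched. */
	model_adjust_range(vma, &z.notify_start, &z.notify_end);
	return z;
}
```

For a shared vma spanning 1 GiB..3 GiB and a 2 MiB zap starting at
1 GiB + 2 MiB, the modelled notifier range becomes 1 GiB..2 GiB while the
unmap range is still 2 MiB long. That difference is why the patch passes
'end' rather than 'range.end' to unmap_single_vma().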
include/linux/mm.h | 27 +++++++++++++++++++--------
mm/madvise.c | 6 +++---
mm/memory.c | 23 +++++++++++------------
3 files changed, 33 insertions(+), 23 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9838b535fa21..dd5a38682537 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1870,6 +1870,23 @@ static void __maybe_unused show_free_areas(unsigned int flags, nodemask_t *nodem
__show_free_areas(flags, nodemask, MAX_NR_ZONES - 1);
}
+/*
+ * Parameter block passed down to zap_pte_range in exceptional cases.
+ */
+struct zap_details {
+ struct folio *single_folio; /* Locked folio to be unmapped */
+ bool even_cows; /* Zap COWed private pages too? */
+ zap_flags_t zap_flags; /* Extra flags for zapping */
+};
+
+/*
+ * Whether to drop the pte markers, for example, the uffd-wp information for
+ * file-backed memory. This should only be specified when we will completely
+ * drop the page in the mm, either by truncation or unmapping of the vma. By
+ * default, the flag is not set.
+ */
+#define ZAP_FLAG_DROP_MARKER ((__force zap_flags_t) BIT(0))
+
#ifdef CONFIG_MMU
extern bool can_do_mlock(void);
#else
@@ -1887,6 +1904,8 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
void zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
+void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+ unsigned long size, struct zap_details *details);
void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *start_vma, unsigned long start,
unsigned long end);
@@ -3518,12 +3537,4 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
}
#endif
-/*
- * Whether to drop the pte markers, for example, the uffd-wp information for
- * file-backed memory. This should only be specified when we will completely
- * drop the page in the mm, either by truncation or unmapping of the vma. By
- * default, the flag is not set.
- */
-#define ZAP_FLAG_DROP_MARKER ((__force zap_flags_t) BIT(0))
-
#endif /* _LINUX_MM_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index df62d9e1035a..a21b186eb7a0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -785,8 +785,8 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
* Application no longer needs these pages. If the pages are dirty,
* it's OK to just throw them away. The app will be more careful about
* data it wants to keep. Be sure to free swap resources too. The
- * zap_page_range call sets things up for shrink_active_list to actually free
- * these pages later if no one else has touched them in the meantime,
+ * zap_page_range_single call sets things up for shrink_active_list to actually
+ * free these pages later if no one else has touched them in the meantime,
* although we could add these pages to a global reuse list for
* shrink_active_list to pick up before reclaiming other pages.
*
@@ -803,7 +803,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
- zap_page_range(vma, start, end - start);
+ zap_page_range_single(vma, start, end - start, NULL);
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index 98ddb91df9a7..a177f6bbfafc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1294,15 +1294,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
return ret;
}
-/*
- * Parameter block passed down to zap_pte_range in exceptional cases.
- */
-struct zap_details {
- struct folio *single_folio; /* Locked folio to be unmapped */
- bool even_cows; /* Zap COWed private pages too? */
- zap_flags_t zap_flags; /* Extra flags for zapping */
-};
-
/* Whether we should zap all COWed (private) pages too */
static inline bool should_zap_cows(struct zap_details *details)
{
@@ -1736,19 +1727,27 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
*
* The range must fit into one VMA.
*/
-static void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details)
{
+ const unsigned long end = address + size;
struct mmu_notifier_range range;
struct mmu_gather tlb;
lru_add_drain();
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
- address, address + size);
+ address, end);
+ if (is_vm_hugetlb_page(vma))
+ adjust_range_if_pmd_sharing_possible(vma, &range.start,
+ &range.end);
tlb_gather_mmu(&tlb, vma->vm_mm);
update_hiwater_rss(vma->vm_mm);
mmu_notifier_invalidate_range_start(&range);
- unmap_single_vma(&tlb, vma, address, range.end, details);
+ /*
+ * unmap 'address-end' not 'range.start-range.end' as range
+ * could have been expanded for hugetlb pmd sharing.
+ */
+ unmap_single_vma(&tlb, vma, address, end, details);
mmu_notifier_invalidate_range_end(&range);
tlb_finish_mmu(&tlb);
}
--
2.38.1
* [PATCH v10 2/3] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
2022-11-14 23:55 [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling Mike Kravetz
2022-11-14 23:55 ` [PATCH v10 1/3] madvise: use zap_page_range_single for madvise dontneed Mike Kravetz
@ 2022-11-14 23:55 ` Mike Kravetz
2022-11-14 23:55 ` [PATCH v10 3/3] hugetlb: remove duplicate mmu notifications Mike Kravetz
2022-11-23 2:07 ` [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling Andrew Morton
3 siblings, 0 replies; 6+ messages in thread
From: Mike Kravetz @ 2022-11-14 23:55 UTC
To: linux-mm, linux-kernel
Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
Peter Xu, Nadav Amit, Rik van Riel, Vlastimil Babka,
Matthew Wilcox, Andrew Morton, Mike Kravetz, Wei Chen, stable
madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
tables associated with the address range. For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final. However,
__unmap_hugepage_range_final assumes the passed vma is about to be removed
and deletes the vma_lock to prevent pmd sharing as the vma is on the way
out. In the case of madvise(MADV_DONTNEED) the vma remains, but the
missing vma_lock prevents pmd sharing and could potentially lead to issues
with truncation/fault races.
This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap. Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing. Subsequent faults on this vma were then mishandled:
VM_MAYSHARE is what identifies a sharable vma, and because it had been
cleared, page_mapping was not set in new pages added to the page table.
This resulted in pages that appeared anonymous in a VM_SHARED vma and
triggered the BUG.
Address the issue by adding a new zap flag, ZAP_FLAG_UNMAP, to indicate an
unmap call from unmap_vmas(). This is used to indicate the 'final'
unmapping of a hugetlb vma. When called via MADV_DONTNEED, this flag is
not set and the vma_lock is not deleted.
[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Wei Chen <harperchen1110@gmail.com>
Cc: <stable@vger.kernel.org>
---
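To illustrate the flag-based decision described in the changelog, here is
a small userspace model. It is a sketch under assumptions, not kernel
code: all model_ and MODEL_ names are hypothetical, the vma_lock is
reduced to a boolean, and treating a NULL details pointer as "no flags
set" is an assumption about how the zap path derives zap_flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; the kernel uses a __bitwise zap_flags_t and BIT(). */
typedef unsigned int model_zap_flags_t;
#define MODEL_ZAP_FLAG_DROP_MARKER (1u << 0)
#define MODEL_ZAP_FLAG_UNMAP       (1u << 1)

struct model_zap_details {
	bool even_cows;
	model_zap_flags_t zap_flags;
};

/* Hypothetical hugetlb vma: only tracks whether a vma_lock is attached. */
struct model_hugetlb_vma {
	bool has_vma_lock;
};

/* Assumption: a NULL details pointer means no zap flags are set. */
static model_zap_flags_t model_flags_of(const struct model_zap_details *d)
{
	return d ? d->zap_flags : 0;
}

/* Models the tail of __unmap_hugepage_range_final() after this patch. */
static void model_unmap_final(struct model_hugetlb_vma *vma,
			      model_zap_flags_t zap_flags)
{
	assert(vma != NULL);

	if (zap_flags & MODEL_ZAP_FLAG_UNMAP)
		vma->has_vma_lock = false;	/* final unmap: lock is freed */
	/* otherwise the lock is only released and stays with the vma */
}

/* Teardown path: unmap_vmas() sets both flags in its details. */
void model_unmap_vmas(struct model_hugetlb_vma *vma)
{
	struct model_zap_details details = {
		.zap_flags = MODEL_ZAP_FLAG_DROP_MARKER | MODEL_ZAP_FLAG_UNMAP,
		.even_cows = true,
	};

	model_unmap_final(vma, model_flags_of(&details));
}

/* MADV_DONTNEED path: zap_page_range_single() is called with NULL details. */
void model_madvise_dontneed(struct model_hugetlb_vma *vma)
{
	model_unmap_final(vma, model_flags_of(NULL));
}
```

In this model, calling model_madvise_dontneed() leaves has_vma_lock set,
while model_unmap_vmas() clears it, which mirrors the behavior the
changelog asks for.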
include/linux/mm.h | 2 ++
mm/hugetlb.c | 27 ++++++++++++++++-----------
mm/memory.c | 2 +-
3 files changed, 19 insertions(+), 12 deletions(-)
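The changelog's point that "the vma remains" after madvise(MADV_DONTNEED)
can be observed from userspace. The sketch below is only an illustration
and makes some assumptions: it uses an ordinary anonymous private mapping
of one base page rather than a hugetlb mapping (which would additionally
need MAP_HUGETLB and reserved huge pages), and dontneed_readback() is a
hypothetical helper name.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS and madvise() */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one anonymous private page, dirty it, drop it with MADV_DONTNEED and
 * return the first byte read back afterwards, or -1 on any failure.
 */
int dontneed_readback(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int value;

	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Dirty the page so there is something to throw away. */
	memset(p, 0xaa, (size_t)page);
	assert(p[0] == 0xaa);

	if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
		munmap(p, (size_t)page);
		return -1;
	}

	/* The vma is still there: this read faults in a fresh zero page. */
	value = p[0];

	munmap(p, (size_t)page);
	return value;
}
```

On Linux the read after MADV_DONTNEED succeeds and sees zero-filled
memory, because only the pages were dropped and the mapping itself was
kept. Later faults on the same vma are exactly the case where a deleted
vma_lock would matter for hugetlb.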
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dd5a38682537..a4e24dd2d96e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1886,6 +1886,8 @@ struct zap_details {
* default, the flag is not set.
*/
#define ZAP_FLAG_DROP_MARKER ((__force zap_flags_t) BIT(0))
+/* Set in unmap_vmas() to indicate a final unmap call. Only used by hugetlb */
+#define ZAP_FLAG_UNMAP ((__force zap_flags_t) BIT(1))
#ifdef CONFIG_MMU
extern bool can_do_mlock(void);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d765364231e..7559b9dfe782 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5210,17 +5210,22 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
- /*
- * Unlock and free the vma lock before releasing i_mmap_rwsem. When
- * the vma_lock is freed, this makes the vma ineligible for pmd
- * sharing. And, i_mmap_rwsem is required to set up pmd sharing.
- * This is important as page tables for this unmapped range will
- * be asynchrously deleted. If the page tables are shared, there
- * will be issues when accessed by someone else.
- */
- __hugetlb_vma_unlock_write_free(vma);
-
- i_mmap_unlock_write(vma->vm_file->f_mapping);
+ if (zap_flags & ZAP_FLAG_UNMAP) { /* final unmap */
+ /*
+ * Unlock and free the vma lock before releasing i_mmap_rwsem.
+ * When the vma_lock is freed, this makes the vma ineligible
+ * for pmd sharing. And, i_mmap_rwsem is required to set up
+ * pmd sharing. This is important as page tables for this
+ * unmapped range will be asynchronously deleted. If the page
+ * tables are shared, there will be issues when accessed by
+ * someone else.
+ */
+ __hugetlb_vma_unlock_write_free(vma);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
+ } else {
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
+ hugetlb_vma_unlock_write(vma);
+ }
}
void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
diff --git a/mm/memory.c b/mm/memory.c
index a177f6bbfafc..6d77bc00bca1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1673,7 +1673,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
{
struct mmu_notifier_range range;
struct zap_details details = {
- .zap_flags = ZAP_FLAG_DROP_MARKER,
+ .zap_flags = ZAP_FLAG_DROP_MARKER | ZAP_FLAG_UNMAP,
/* Careful - we need to zap private pages too! */
.even_cows = true,
};
--
2.38.1
* [PATCH v10 3/3] hugetlb: remove duplicate mmu notifications
2022-11-14 23:55 [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling Mike Kravetz
2022-11-14 23:55 ` [PATCH v10 1/3] madvise: use zap_page_range_single for madvise dontneed Mike Kravetz
2022-11-14 23:55 ` [PATCH v10 2/3] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing Mike Kravetz
@ 2022-11-14 23:55 ` Mike Kravetz
2022-11-23 2:07 ` [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling Andrew Morton
3 siblings, 0 replies; 6+ messages in thread
From: Mike Kravetz @ 2022-11-14 23:55 UTC
To: linux-mm, linux-kernel
Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
Peter Xu, Nadav Amit, Rik van Riel, Vlastimil Babka,
Matthew Wilcox, Andrew Morton, Mike Kravetz
The common hugetlb unmap routine __unmap_hugepage_range performs mmu
notification calls. However, in the case where __unmap_hugepage_range
is called via __unmap_hugepage_range_final, mmu notification calls are
performed earlier in other calling routines.
Remove mmu notification calls from __unmap_hugepage_range. Add
notification calls to the only other caller: unmap_hugepage_range.
unmap_hugepage_range is called for truncation and hole punch, so
change notification type from UNMAP to CLEAR as this is more appropriate.
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
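The duplicate notification described in the changelog can be pictured
with a toy counter model. This is only a sketch: the model_ names are
hypothetical, each notifier call is reduced to a counter increment, and
the two paths are simplified to "a caller that already brackets the call"
versus unmap_hugepage_range().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counts of invalidate_range_start/end calls seen on one path. */
struct model_counts {
	int start;
	int end;
};

/*
 * Common unmap routine. 'notifies' models whether it still issues the
 * notifier calls itself (true before this patch, false after).
 */
static void model_unmap_common(struct model_counts *c, bool notifies)
{
	assert(c != NULL);

	if (notifies)
		c->start++;
	/* page table teardown would happen here */
	if (notifies)
		c->end++;
}

/* Path via __unmap_hugepage_range_final(): the caller already notifies. */
void model_zap_path(struct model_counts *c, bool patched)
{
	c->start++;
	model_unmap_common(c, !patched);
	c->end++;
}

/* Truncation and hole punch path: unmap_hugepage_range(). */
void model_truncate_path(struct model_counts *c, bool patched)
{
	if (patched)
		c->start++;	/* notification moved into this caller */
	model_unmap_common(c, !patched);
	if (patched)
		c->end++;
}
```

With patched set to false, the zap path counts two start/end pairs and the
truncation path one. With patched set to true, both paths count exactly
one pair, which is the intent of moving the calls out of the common
routine.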
mm/hugetlb.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7559b9dfe782..0cdefa63f474 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5074,7 +5074,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
struct page *page;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
- struct mmu_notifier_range range;
unsigned long last_addr_mask;
bool force_flush = false;
@@ -5089,13 +5088,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
tlb_change_page_size(tlb, sz);
tlb_start_vma(tlb, vma);
- /*
- * If sharing possible, alert mmu notifiers of worst case.
- */
- mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, mm, start,
- end);
- adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
- mmu_notifier_invalidate_range_start(&range);
last_addr_mask = hugetlb_mask_last_page(h);
address = start;
for (; address < end; address += sz) {
@@ -5180,7 +5172,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
if (ref_page)
break;
}
- mmu_notifier_invalidate_range_end(&range);
tlb_end_vma(tlb, vma);
/*
@@ -5208,6 +5199,7 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(vma->vm_file->f_mapping);
+ /* mmu notification performed in caller */
__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
if (zap_flags & ZAP_FLAG_UNMAP) { /* final unmap */
@@ -5232,10 +5224,18 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end, struct page *ref_page,
zap_flags_t zap_flags)
{
+ struct mmu_notifier_range range;
struct mmu_gather tlb;
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+ start, end);
+ adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
+ mmu_notifier_invalidate_range_start(&range);
tlb_gather_mmu(&tlb, vma->vm_mm);
+
__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
+
+ mmu_notifier_invalidate_range_end(&range);
tlb_finish_mmu(&tlb);
}
--
2.38.1
* Re: [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling
2022-11-14 23:55 [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling Mike Kravetz
` (2 preceding siblings ...)
2022-11-14 23:55 ` [PATCH v10 3/3] hugetlb: remove duplicate mmu notifications Mike Kravetz
@ 2022-11-23 2:07 ` Andrew Morton
2022-11-23 2:21 ` Mike Kravetz
3 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2022-11-23 2:07 UTC
To: Mike Kravetz
Cc: linux-mm, linux-kernel, Naoya Horiguchi, David Hildenbrand,
Axel Rasmussen, Mina Almasry, Peter Xu, Nadav Amit, Rik van Riel,
Vlastimil Babka, Matthew Wilcox
Could this series be implicated in
https://lkml.kernel.org/r/00000000000041a69905edf8c1e3@google.com?
* Re: [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling
2022-11-23 2:07 ` [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling Andrew Morton
@ 2022-11-23 2:21 ` Mike Kravetz
0 siblings, 0 replies; 6+ messages in thread
From: Mike Kravetz @ 2022-11-23 2:21 UTC
To: Andrew Morton
Cc: linux-mm, linux-kernel, Naoya Horiguchi, David Hildenbrand,
Axel Rasmussen, Mina Almasry, Peter Xu, Nadav Amit, Rik van Riel,
Vlastimil Babka, Matthew Wilcox
On 11/22/22 18:07, Andrew Morton wrote:
> Could this series be implicated in
> https://lkml.kernel.org/r/00000000000041a69905edf8c1e3@google.com?
If I am reading the report correctly, I would say that this series (at least
the first two patches) would address that issue. The bot is running against
6.1-rcX and those patches have not yet been sent to the 6.1 stream.
--
Mike Kravetz