From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Hugh Dickins <hughd@google.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Yang Shi <shy828301@gmail.com>,
Alistair Popple <apopple@nvidia.com>, Jan Kara <jack@suse.cz>,
Jue Wang <juew@google.com>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Miaohe Lin <linmiaohe@huawei.com>,
Minchan Kim <minchan@kernel.org>,
Naoya Horiguchi <naoya.horiguchi@nec.com>,
Oscar Salvador <osalvador@suse.de>, Peter Xu <peterx@redhat.com>,
Ralph Campbell <rcampbell@nvidia.com>,
Shakeel Butt <shakeelb@google.com>,
Wang Yugui <wangyugui@e16-tech.com>, Zi Yan <ziy@nvidia.com>,
Andrew Morton <akpm@linux-foundation.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Sasha Levin <sashal@kernel.org>
Subject: [PATCH 4.19 09/34] mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
Date: Fri, 9 Jul 2021 15:20:25 +0200 [thread overview]
Message-ID: <20210709131649.858659055@linuxfoundation.org> (raw)
In-Reply-To: <20210709131644.969303901@linuxfoundation.org>
From: Hugh Dickins <hughd@google.com>
[ Upstream commit 22061a1ffabdb9c3385de159c5db7aac3a4df1cc ]
There is a race between THP unmapping and truncation, when truncate sees
pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
it, but before its page_remove_rmap() gets to decrement
compound_mapcount: generating false "BUG: Bad page cache" reports that
the page is still mapped when deleted. This commit fixes that, but not
in the way I hoped.
The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
instead of unmap_mapping_range() in truncate_cleanup_page(): it has
often been an annoyance that we usually call unmap_mapping_range() with
no pages locked, but there apply it to a single locked page.
try_to_unmap() looks more suitable for a single locked page.
However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
it is used to insert THP migration entries, but not used to unmap THPs.
Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
needs are different, I'm too ignorant of the DAX cases, and couldn't
decide how far to go for anon+swap. Set that aside.
The second attempt took a different tack: make no change in truncate.c,
but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
clearing it initially, then pmd_clear() between page_remove_rmap() and
unlocking at the end. Nice. But powerpc blows that approach out of the
water, with its serialize_against_pte_lookup(), and interesting pgtable
usage. It would need serious help to get working on powerpc (with a
minor optimization issue on s390 too). Set that aside.
Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
that's likely to reduce or eliminate the number of incidents, it would
give less assurance of whether we had identified the problem correctly.
This successful iteration introduces "unmap_mapping_page(page)" instead
of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
with an addition to details. Then zap_pmd_range() watches for this
case, and does spin_unlock(pmd_lock) if so - just like
page_vma_mapped_walk() now does in the PVMW_SYNC case. Not pretty, but
safe.
Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
assert its interface; but currently that's only used to make sure that
page->mapping is stable, and zap_pmd_range() doesn't care if the page is
locked or not. Along these lines, in invalidate_inode_pages2_range()
move the initial unmap_mapping_range() out from under page lock, before
then calling unmap_mapping_page() under page lock if still mapped.
Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
Fixes: fc127da085c2 ("truncate: handle file thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Note on stable backport: fixed up call to truncate_cleanup_page()
in truncate_inode_pages_range(). Use hpage_nr_pages() in
unmap_mapping_page().
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
include/linux/mm.h | 3 +++
mm/memory.c | 41 +++++++++++++++++++++++++++++++++++++++++
mm/truncate.c | 43 +++++++++++++++++++------------------------
3 files changed, 63 insertions(+), 24 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6ecf41aea83..c736c677b876 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1338,6 +1338,7 @@ struct zap_details {
struct address_space *check_mapping; /* Check page->mapping if set */
pgoff_t first_index; /* Lowest page->index to unmap */
pgoff_t last_index; /* Highest page->index to unmap */
+ struct page *single_page; /* Locked page to be unmapped */
};
struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
@@ -1428,6 +1429,7 @@ extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
+void unmap_mapping_page(struct page *page);
void unmap_mapping_pages(struct address_space *mapping,
pgoff_t start, pgoff_t nr, bool even_cows);
void unmap_mapping_range(struct address_space *mapping,
@@ -1448,6 +1450,7 @@ static inline int fixup_user_fault(struct task_struct *tsk,
BUG();
return -EFAULT;
}
+static inline void unmap_mapping_page(struct page *page) { }
static inline void unmap_mapping_pages(struct address_space *mapping,
pgoff_t start, pgoff_t nr, bool even_cows) { }
static inline void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
index c2011c51f15d..49b546cdce0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1439,7 +1439,18 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
else if (zap_huge_pmd(tlb, vma, pmd, addr))
goto next;
/* fall through */
+ } else if (details && details->single_page &&
+ PageTransCompound(details->single_page) &&
+ next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) {
+ spinlock_t *ptl = pmd_lock(tlb->mm, pmd);
+ /*
+ * Take and drop THP pmd lock so that we cannot return
+ * prematurely, while zap_huge_pmd() has cleared *pmd,
+ * but not yet decremented compound_mapcount().
+ */
+ spin_unlock(ptl);
}
+
/*
* Here there can be other concurrent MADV_DONTNEED or
* trans huge page faults running, and if the pmd is
@@ -2924,6 +2935,36 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
}
}
+/**
+ * unmap_mapping_page() - Unmap single page from processes.
+ * @page: The locked page to be unmapped.
+ *
+ * Unmap this page from any userspace process which still has it mmaped.
+ * Typically, for efficiency, the range of nearby pages has already been
+ * unmapped by unmap_mapping_pages() or unmap_mapping_range(). But once
+ * truncation or invalidation holds the lock on a page, it may find that
+ * the page has been remapped again: and then uses unmap_mapping_page()
+ * to unmap it finally.
+ */
+void unmap_mapping_page(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct zap_details details = { };
+
+ VM_BUG_ON(!PageLocked(page));
+ VM_BUG_ON(PageTail(page));
+
+ details.check_mapping = mapping;
+ details.first_index = page->index;
+ details.last_index = page->index + hpage_nr_pages(page) - 1;
+ details.single_page = page;
+
+ i_mmap_lock_write(mapping);
+ if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
+ unmap_mapping_range_tree(&mapping->i_mmap, &details);
+ i_mmap_unlock_write(mapping);
+}
+
/**
* unmap_mapping_pages() - Unmap pages from processes.
* @mapping: The address space containing pages to be unmapped.
diff --git a/mm/truncate.c b/mm/truncate.c
index 71b65aab8077..43c73db17a0a 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -175,13 +175,10 @@ void do_invalidatepage(struct page *page, unsigned int offset,
* its lock, b) when a concurrent invalidate_mapping_pages got there first and
* c) when tmpfs swizzles a page between a tmpfs inode and swapper_space.
*/
-static void
-truncate_cleanup_page(struct address_space *mapping, struct page *page)
+static void truncate_cleanup_page(struct page *page)
{
- if (page_mapped(page)) {
- pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
- unmap_mapping_pages(mapping, page->index, nr, false);
- }
+ if (page_mapped(page))
+ unmap_mapping_page(page);
if (page_has_private(page))
do_invalidatepage(page, 0, PAGE_SIZE);
@@ -226,7 +223,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
if (page->mapping != mapping)
return -EIO;
- truncate_cleanup_page(mapping, page);
+ truncate_cleanup_page(page);
delete_from_page_cache(page);
return 0;
}
@@ -364,7 +361,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
pagevec_add(&locked_pvec, page);
}
for (i = 0; i < pagevec_count(&locked_pvec); i++)
- truncate_cleanup_page(mapping, locked_pvec.pages[i]);
+ truncate_cleanup_page(locked_pvec.pages[i]);
delete_from_page_cache_batch(mapping, &locked_pvec);
for (i = 0; i < pagevec_count(&locked_pvec); i++)
unlock_page(locked_pvec.pages[i]);
@@ -703,6 +700,16 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
continue;
}
+ if (!did_range_unmap && page_mapped(page)) {
+ /*
+ * If page is mapped, before taking its lock,
+ * zap the rest of the file in one hit.
+ */
+ unmap_mapping_pages(mapping, index,
+ (1 + end - index), false);
+ did_range_unmap = 1;
+ }
+
lock_page(page);
WARN_ON(page_to_index(page) != index);
if (page->mapping != mapping) {
@@ -710,23 +717,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
continue;
}
wait_on_page_writeback(page);
- if (page_mapped(page)) {
- if (!did_range_unmap) {
- /*
- * Zap the rest of the file in one hit.
- */
- unmap_mapping_pages(mapping, index,
- (1 + end - index), false);
- did_range_unmap = 1;
- } else {
- /*
- * Just zap this page
- */
- unmap_mapping_pages(mapping, index,
- 1, false);
- }
- }
+
+ if (page_mapped(page))
+ unmap_mapping_page(page);
BUG_ON(page_mapped(page));
+
ret2 = do_launder_page(mapping, page);
if (ret2 == 0) {
if (!invalidate_complete_page2(mapping, page))
--
2.30.2
next prev parent reply other threads:[~2021-07-09 13:21 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-07-09 13:20 [PATCH 4.19 00/34] 4.19.197-rc1 review Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 01/34] mm: add VM_WARN_ON_ONCE_PAGE() macro Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 02/34] mm/rmap: remove unneeded semicolon in page_not_mapped() Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 03/34] mm/rmap: use page_not_mapped in try_to_unmap() Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 04/34] mm/thp: fix __split_huge_pmd_locked() on shmem migration entry Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 05/34] mm/thp: make is_huge_zero_pmd() safe and quicker Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 06/34] mm/thp: try_to_unmap() use TTU_SYNC for safe splitting Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 07/34] mm/thp: fix vma_address() if virtual address below file offset Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 08/34] mm/thp: fix page_address_in_vma() on file THP tails Greg Kroah-Hartman
2021-07-09 13:20 ` Greg Kroah-Hartman [this message]
2021-07-09 13:20 ` [PATCH 4.19 10/34] mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 11/34] mm: page_vma_mapped_walk(): use page for pvmw->page Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 12/34] mm: page_vma_mapped_walk(): settle PageHuge on entry Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 13/34] mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 14/34] mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 15/34] mm: page_vma_mapped_walk(): crossing page table boundary Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 16/34] mm: page_vma_mapped_walk(): add a level of indentation Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 17/34] mm: page_vma_mapped_walk(): use goto instead of while (1) Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 18/34] mm: page_vma_mapped_walk(): get vma_address_end() earlier Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 19/34] mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 20/34] mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk() Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 21/34] mm, futex: fix shared futex pgoff on shmem huge page Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 22/34] scsi: sr: Return appropriate error code when disk is ejected Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 23/34] drm/nouveau: fix dma_address check for CPU/GPU sync Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 24/34] ext4: eliminate bogus error in ext4_data_block_valid_rcu() Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 25/34] KVM: SVM: Periodically schedule when unregistering regions on destroy Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 26/34] ARM: dts: imx6qdl-sabresd: Remove incorrect power supply assignment Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 27/34] kthread_worker: split code for canceling the delayed work timer Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 28/34] kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 29/34] xen/events: reset active flag for lateeoi events later Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 30/34] KVM: SVM: Call SEV Guest Decommission if ASID binding fails Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 31/34] ARM: OMAP: replace setup_irq() by request_irq() Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 32/34] clocksource/drivers/timer-ti-dm: Add clockevent and clocksource support Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 33/34] clocksource/drivers/timer-ti-dm: Prepare to handle dra7 timer wrap issue Greg Kroah-Hartman
2021-07-09 13:20 ` [PATCH 4.19 34/34] clocksource/drivers/timer-ti-dm: Handle dra7 timer wrap errata i940 Greg Kroah-Hartman
2021-07-09 17:11 ` [PATCH 4.19 00/34] 4.19.197-rc1 review Jon Hunter
2021-07-09 21:43 ` Shuah Khan
2021-07-10 10:36 ` Sudip Mukherjee
2021-07-10 13:44 ` Naresh Kamboju
2021-07-10 19:51 ` Guenter Roeck
2021-07-11 7:59 ` Pavel Machek
2021-07-12 0:58 ` Samuel Zou
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210709131649.858659055@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=akpm@linux-foundation.org \
--cc=apopple@nvidia.com \
--cc=hughd@google.com \
--cc=jack@suse.cz \
--cc=juew@google.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linmiaohe@huawei.com \
--cc=linux-kernel@vger.kernel.org \
--cc=minchan@kernel.org \
--cc=naoya.horiguchi@nec.com \
--cc=osalvador@suse.de \
--cc=peterx@redhat.com \
--cc=rcampbell@nvidia.com \
--cc=sashal@kernel.org \
--cc=shakeelb@google.com \
--cc=shy828301@gmail.com \
--cc=stable@vger.kernel.org \
--cc=torvalds@linux-foundation.org \
--cc=wangyugui@e16-tech.com \
--cc=willy@infradead.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.