* + mm-mremap-prevent-racing-change-of-old-pmd-type.patch added to mm-hotfixes-unstable branch
From: Andrew Morton @ 2024-10-02 22:44 UTC
To: mm-commits, willy, stable, hughd, david, jannh, akpm
The patch titled
Subject: mm/mremap: prevent racing change of old pmd type
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-mremap-prevent-racing-change-of-old-pmd-type.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-mremap-prevent-racing-change-of-old-pmd-type.patch
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Jann Horn <jannh@google.com>
Subject: mm/mremap: prevent racing change of old pmd type
Date: Wed, 02 Oct 2024 23:07:06 +0200
Prevent move_normal_pmd() in mremap() from racing with
retract_page_tables() in MADV_COLLAPSE such that
pmd_populate(mm, new_pmd, pmd_pgtable(pmd))
operates on an empty source pmd, causing creation of a new pmd which maps
physical address 0 as a page table.
This bug is only reachable if either CONFIG_READ_ONLY_THP_FOR_FS is set or
THP shmem is usable. (Unprivileged namespaces can be used to set up a
tmpfs that can contain THP shmem pages with "huge=advise".)
If userspace triggers this bug *in multiple processes*, this could likely
be used to create stale TLB entries pointing to freed pages or cause
kernel UAF by breaking an invariant the rmap code relies on.
Fix it by moving the rmap locking up so that it covers the span from
reading the PMD entry to moving the page table.
Link: https://lkml.kernel.org/r/20241002-move_normal_pmd-vs-collapse-fix-v1-1-78290e5dece6@google.com
Fixes: 1d65b771bc08 ("mm/khugepaged: retract_page_tables() without mmap or vma lock")
Signed-off-by: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/mremap.c | 68 +++++++++++++++++++++++++++-----------------------
1 file changed, 38 insertions(+), 30 deletions(-)
--- a/mm/mremap.c~mm-mremap-prevent-racing-change-of-old-pmd-type
+++ a/mm/mremap.c
@@ -136,17 +136,17 @@ static pte_t move_soft_dirty_pte(pte_t p
static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
unsigned long old_addr, unsigned long old_end,
struct vm_area_struct *new_vma, pmd_t *new_pmd,
- unsigned long new_addr, bool need_rmap_locks)
+ unsigned long new_addr)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
spinlock_t *old_ptl, *new_ptl;
bool force_flush = false;
unsigned long len = old_end - old_addr;
- int err = 0;
/*
- * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
+ * When need_rmap_locks is true in the caller, we are holding the
+ * i_mmap_rwsem and anon_vma
* locks to ensure that rmap will always observe either the old or the
* new ptes. This is the easiest way to avoid races with
* truncate_pagecache(), page migration, etc...
@@ -163,23 +163,18 @@ static int move_ptes(struct vm_area_stru
* serialize access to individual ptes, but only rmap traversal
* order guarantees that we won't miss both the old and new ptes).
*/
- if (need_rmap_locks)
- take_rmap_locks(vma);
/*
* We don't have to worry about the ordering of src and dst
* pte locks because exclusive mmap_lock prevents deadlock.
*/
old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
- if (!old_pte) {
- err = -EAGAIN;
- goto out;
- }
+ if (!old_pte)
+ return -EAGAIN;
new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
if (!new_pte) {
pte_unmap_unlock(old_pte, old_ptl);
- err = -EAGAIN;
- goto out;
+ return -EAGAIN;
}
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
@@ -217,10 +212,7 @@ static int move_ptes(struct vm_area_stru
spin_unlock(new_ptl);
pte_unmap(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
-out:
- if (need_rmap_locks)
- drop_rmap_locks(vma);
- return err;
+ return 0;
}
#ifndef arch_supports_page_table_move
@@ -447,17 +439,14 @@ static __always_inline unsigned long get
/*
* Attempts to speedup the move by moving entry at the level corresponding to
* pgt_entry. Returns true if the move was successful, else false.
+ * rmap locks are held by the caller.
*/
static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
unsigned long old_addr, unsigned long new_addr,
- void *old_entry, void *new_entry, bool need_rmap_locks)
+ void *old_entry, void *new_entry)
{
bool moved = false;
- /* See comment in move_ptes() */
- if (need_rmap_locks)
- take_rmap_locks(vma);
-
switch (entry) {
case NORMAL_PMD:
moved = move_normal_pmd(vma, old_addr, new_addr, old_entry,
@@ -483,9 +472,6 @@ static bool move_pgt_entry(enum pgt_entr
break;
}
- if (need_rmap_locks)
- drop_rmap_locks(vma);
-
return moved;
}
@@ -550,6 +536,7 @@ unsigned long move_page_tables(struct vm
struct mmu_notifier_range range;
pmd_t *old_pmd, *new_pmd;
pud_t *old_pud, *new_pud;
+ int move_res;
if (!len)
return 0;
@@ -573,6 +560,12 @@ unsigned long move_page_tables(struct vm
old_addr, old_end);
mmu_notifier_invalidate_range_start(&range);
+ /*
+ * Hold rmap locks to ensure the type of the old PUD/PMD entry doesn't
+ * change under us due to khugepaged or folio splitting.
+ */
+ take_rmap_locks(vma);
+
for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
cond_resched();
/*
@@ -590,14 +583,14 @@ unsigned long move_page_tables(struct vm
if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
if (extent == HPAGE_PUD_SIZE) {
move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr,
- old_pud, new_pud, need_rmap_locks);
+ old_pud, new_pud);
/* We ignore and continue on error? */
continue;
}
} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
- old_pud, new_pud, true))
+ old_pud, new_pud))
continue;
}
@@ -613,7 +606,7 @@ again:
pmd_devmap(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE &&
move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr,
- old_pmd, new_pmd, need_rmap_locks))
+ old_pmd, new_pmd))
continue;
split_huge_pmd(vma, old_pmd, old_addr);
} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) &&
@@ -623,17 +616,32 @@ again:
* moving at the PMD level if possible.
*/
if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr,
- old_pmd, new_pmd, true))
+ old_pmd, new_pmd))
continue;
}
if (pmd_none(*old_pmd))
continue;
- if (pte_alloc(new_vma->vm_mm, new_pmd))
+
+ /*
+ * Temporarily drop the rmap locks while we do a potentially
+ * slow move_ptes() operation, unless move_ptes() wants them
+ * held (see comment inside there).
+ */
+ if (!need_rmap_locks)
+ drop_rmap_locks(vma);
+ if (pte_alloc(new_vma->vm_mm, new_pmd)) {
+ if (!need_rmap_locks)
+ take_rmap_locks(vma);
break;
- if (move_ptes(vma, old_pmd, old_addr, old_addr + extent,
- new_vma, new_pmd, new_addr, need_rmap_locks) < 0)
+ }
+ move_res = move_ptes(vma, old_pmd, old_addr, old_addr + extent,
+ new_vma, new_pmd, new_addr);
+ if (!need_rmap_locks)
+ take_rmap_locks(vma);
+ if (move_res < 0)
goto again;
}
+ drop_rmap_locks(vma);
mmu_notifier_invalidate_range_end(&range);
_
Patches currently in -mm which might be from jannh@google.com are
mm-mremap-prevent-racing-change-of-old-pmd-type.patch
* Re: + mm-mremap-prevent-racing-change-of-old-pmd-type.patch added to mm-hotfixes-unstable branch
From: Marek Szyprowski @ 2024-10-04 16:51 UTC
To: Andrew Morton, mm-commits, willy, stable, hughd, david, jannh
Dear All,
The original mail with this patch is not available in lore, so I decided
to reply to this one.
On 03.10.2024 00:44, Andrew Morton wrote:
> The patch titled
> Subject: mm/mremap: prevent racing change of old pmd type
> has been added to the -mm mm-hotfixes-unstable branch. Its filename is
> mm-mremap-prevent-racing-change-of-old-pmd-type.patch
>
> [...]
This patch landed in today's linux-next as commit 46c1b3279220
("mm/mremap: prevent racing change of old pmd type"). In my tests I
found that it introduces a lockdep warning about possible circular
locking dependency on ARM64 machines. Reverting $subject together with
commits a2fbe16f45a8 ("mm: mremap: move_ptes() use
pte_offset_map_rw_nolock()") and 46c1b3279220 ("mm/mremap: prevent
racing change of old pmd type") on top of next-20241004 fixes this problem.
Here is the observed lockdep warning:
Freeing unused kernel memory: 13824K
Run /sbin/init as init process
======================================================
WARNING: possible circular locking dependency detected
6.12.0-rc1+ #15391 Not tainted
------------------------------------------------------
init/1 is trying to acquire lock:
ffff000006943588 (&anon_vma->rwsem){+.+.}-{3:3}, at: vma_prepare+0x70/0x158
but task is already holding lock:
ffff0000048c9970 (&mapping->i_mmap_rwsem){+.+.}-{3:3}, at:
vma_prepare+0x28/0x158
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&mapping->i_mmap_rwsem){+.+.}-{3:3}:
down_write+0x50/0xe8
dma_resv_lockdep+0x140/0x300
do_one_initcall+0x68/0x300
kernel_init_freeable+0x28c/0x50c
kernel_init+0x20/0x1d8
ret_from_fork+0x10/0x20
-> #1 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0xd0/0xe4
__alloc_pages_noprof+0xe4/0x10d0
alloc_pages_mpol_noprof+0x88/0x23c
alloc_pages_noprof+0x48/0xc0
__pud_alloc+0x44/0x254
alloc_new_pud.constprop.0+0x154/0x160
move_page_tables+0x1b0/0xc38
relocate_vma_down+0xe4/0x1f8
setup_arg_pages+0x190/0x370
load_elf_binary+0x370/0x15c4
bprm_execve+0x290/0x7a0
kernel_execve+0xf8/0x16c
run_init_process+0xa8/0xbc
kernel_init+0xec/0x1d8
ret_from_fork+0x10/0x20
-> #0 (&anon_vma->rwsem){+.+.}-{3:3}:
__lock_acquire+0x1374/0x2224
lock_acquire+0x200/0x340
down_write+0x50/0xe8
vma_prepare+0x70/0x158
__split_vma+0x26c/0x388
vma_modify+0x45c/0x7f4
vma_modify_flags+0x90/0xc4
mprotect_fixup+0x8c/0x2c0
do_mprotect_pkey+0x2a8/0x464
__arm64_sys_mprotect+0x20/0x30
invoke_syscall+0x48/0x110
el0_svc_common.constprop.0+0x40/0xe8
do_el0_svc_compat+0x20/0x3c
el0_svc_compat+0x44/0xe0
el0t_32_sync_handler+0x98/0x148
el0t_32_sync+0x194/0x198
other info that might help us debug this:
Chain exists of:
&anon_vma->rwsem --> fs_reclaim --> &mapping->i_mmap_rwsem
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&mapping->i_mmap_rwsem);
lock(fs_reclaim);
lock(&mapping->i_mmap_rwsem);
lock(&anon_vma->rwsem);
*** DEADLOCK ***
2 locks held by init/1:
#0: ffff000006998188 (&mm->mmap_lock){++++}-{3:3}, at:
do_mprotect_pkey+0xb4/0x464
#1: ffff0000048c9970 (&mapping->i_mmap_rwsem){+.+.}-{3:3}, at:
vma_prepare+0x28/0x158
stack backtrace:
CPU: 1 UID: 0 PID: 1 Comm: init Not tainted 6.12.0-rc1+ #15391
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x94/0xec
show_stack+0x18/0x24
dump_stack_lvl+0x90/0xd0
dump_stack+0x18/0x24
print_circular_bug+0x298/0x37c
check_noncircular+0x15c/0x170
__lock_acquire+0x1374/0x2224
lock_acquire+0x200/0x340
down_write+0x50/0xe8
vma_prepare+0x70/0x158
__split_vma+0x26c/0x388
vma_modify+0x45c/0x7f4
vma_modify_flags+0x90/0xc4
mprotect_fixup+0x8c/0x2c0
do_mprotect_pkey+0x2a8/0x464
__arm64_sys_mprotect+0x20/0x30
invoke_syscall+0x48/0x110
el0_svc_common.constprop.0+0x40/0xe8
do_el0_svc_compat+0x20/0x3c
el0_svc_compat+0x44/0xe0
el0t_32_sync_handler+0x98/0x148
el0t_32_sync+0x194/0x198
INIT: version 2.88 booting
> ...
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
* Re: + mm-mremap-prevent-racing-change-of-old-pmd-type.patch added to mm-hotfixes-unstable branch
From: Jann Horn @ 2024-10-08 15:18 UTC
To: Marek Szyprowski; +Cc: Andrew Morton, mm-commits, willy, stable, hughd, david
Hi!
On Fri, Oct 4, 2024 at 6:51 PM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
> This patch landed in today's linux-next as commit 46c1b3279220
> ("mm/mremap: prevent racing change of old pmd type"). In my tests I
> found that it introduces a lockdep warning about possible circular
> locking dependency on ARM64 machines. Reverting $subject together with
> commits a2fbe16f45a8 ("mm: mremap: move_ptes() use
> pte_offset_map_rw_nolock()") and 46c1b3279220 ("mm/mremap: prevent
> racing change of old pmd type") on top of next-20241004 fixes this problem.
Thanks for the report; that patch has now been removed, and a
different approach
(https://lore.kernel.org/all/20241007-move_normal_pmd-vs-collapse-fix-2-v1-1-5ead9631f2ea@google.com/)
will probably be used instead.