* [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
@ 2025-07-23 16:26 Jann Horn
  2025-07-23 17:32 ` Vlastimil Babka
  2025-07-23 18:14 ` Lorenzo Stoakes
  0 siblings, 2 replies; 31+ messages in thread
From: Jann Horn @ 2025-07-23 16:26 UTC (permalink / raw)
  To: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Suren Baghdasaryan, Vlastimil Babka
  Cc: Pedro Falcato, Linux-MM, kernel list

[-- Attachment #1: Type: text/plain, Size: 12118 bytes --]

There's a racy UAF in `vma_refcount_put()` when called on the
`lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
without sufficient protection against concurrent object reuse:

lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under
rcu_read_lock(). At that point, the VMA may be concurrently freed, and
it can be recycled by another process. vma_start_read() then
increments the vma->vm_refcnt (if it is in an acceptable range), and
if this succeeds, vma_start_read() can return a recycled VMA. (As a
sidenote, this goes against what the surrounding comments above
vma_start_read() and in lock_vma_under_rcu() say - it would probably
be cleaner to perform the vma->vm_mm check inside vma_start_read().)
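
To make the flow concrete, the path looks roughly like this (paraphrased
sketch of the steps described above, not verbatim kernel code):

```
rcu_read_lock();
vma = mas_walk(&mas);             /* may return a freed-and-recycled VMA */
vma = vma_start_read(mm, vma);    /* grabs a vm_refcnt reference */
if (!IS_ERR_OR_NULL(vma) && vma->vm_mm != mm) {
	/* lost the race: we read-locked a foreign VMA, drop it again... */
	vma_end_read(vma);        /* ...which calls vma_refcount_put() */
	vma = NULL;
}
rcu_read_unlock();
```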

In this scenario where the VMA has been recycled, lock_vma_under_rcu()
will then detect the mismatching ->vm_mm pointer and drop the VMA
through vma_end_read(), which calls vma_refcount_put().
vma_refcount_put() does this:

```
static inline void vma_refcount_put(struct vm_area_struct *vma)
{
        /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
        struct mm_struct *mm = vma->vm_mm;
        int oldcnt;

        rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
        if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {

                if (is_vma_writer_only(oldcnt - 1))
                        rcuwait_wake_up(&mm->vma_writer_wait);
        }
}
```

This is wrong: It implicitly assumes that the caller is keeping the
VMA's mm alive, but in this scenario the caller has no relation to the
VMA's mm, so the rcuwait_wake_up() can cause UAF.

In theory, this could happen to any multithreaded process where thread
A is in the middle of pagefault handling while thread B is
manipulating adjacent VMAs such that VMA merging frees the VMA looked
up by thread A - but in practice, for UAF to actually happen, I think
you need to at least hit three race windows in a row that are each on
the order of a single memory access wide.

The interleaving leading to UAF is the following, where threads A1 and
A2 are part of one process and thread B1 is part of another process:
```
A1               A2               A3
==               ==               ==
lock_vma_under_rcu
  mas_walk
                 <VMA modification removes the VMA>
                                  mmap
                                    <reallocate the VMA>
  vma_start_read
    READ_ONCE(vma->vm_lock_seq)
    __refcount_inc_not_zero_limited_acquire
                                  munmap
                                    __vma_enter_locked
                                      refcount_add_not_zero
  vma_end_read
    vma_refcount_put
      __refcount_dec_and_test
                                      rcuwait_wait_event
                                    <finish operation>
      rcuwait_wake_up [UAF]
```

I'm not sure what the right fix is; I guess one approach would be to
have a special version of vma_refcount_put() for cases where the VMA
has been recycled by another MM that grabs an extra reference to the
MM? But then dropping a reference to the MM afterwards might be a bit
annoying and might require something like mmdrop_async()...
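
To illustrate the idea (a hypothetical helper, not a tested fix - the open
question is exactly whether pinning the mm is safe at this point and how the
pin gets dropped):

```
/*
 * Hypothetical sketch only: pin the foreign mm so the rcuwait_wake_up()
 * target can't be freed under us. Whether mmgrab() is even safe here, and
 * whether the mmdrop() would have to become something like mmdrop_async(),
 * is the unresolved part.
 */
static inline void vma_refcount_put_foreign(struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;
	int oldcnt;

	mmgrab(mm);	/* keep the mm_struct alive across the wakeup */
	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
		if (is_vma_writer_only(oldcnt - 1))
			rcuwait_wake_up(&mm->vma_writer_wait);
	}
	mmdrop(mm);	/* might need to be mmdrop_async() */
}
```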


# Reproducer
If you want to actually reproduce this, uh, I have a way to reproduce
it but it's ugly: First apply the KASAN patch
https://lore.kernel.org/all/20250723-kasan-tsbrcu-noquarantine-v1-1-846c8645976c@google.com/
, then apply the attached diff vma-lock-delay-inject.diff to inject
delays in four different places and add some logging, then build with:

CONFIG_KASAN=y
CONFIG_PREEMPT=y
CONFIG_SLUB_RCU_DEBUG must be explicitly disabled!

Then run the resulting kernel, move everything off CPU 0 by running
"for pid in $(ls /proc | grep '^[1-9]'); do taskset -p -a 0xe $pid;
done" as root, and then run the attached testcase vmalock-uaf.c.

That should result in output like:
```
[  105.018129][ T1334] vma_start_read: PRE-INCREMENT DELAY START on
VMA ffff888134b31180
[  106.026145][ T1335] vm_area_alloc: writer allocated vma ffff888134b31180
[  107.024146][ T1334] vma_start_read: PRE-INCREMENT DELAY END
[  107.025836][ T1334] vma_start_read: returning vma
[  107.026800][ T1334] vma_refcount_put: PRE-DECREMENT DELAY START
[  107.528751][ T1335] __vma_enter_locked: BEGIN DELAY
[  110.535863][ T1334] vma_refcount_put: PRE-DECREMENT DELAY END
[  110.537553][ T1334] vma_refcount_put: PRE-WAKEUP DELAY START
(is_vma_writer_only()=1)
[  111.529833][ T1335] __vma_enter_locked: END DELAY
[  121.037571][ T1334] vma_refcount_put: PRE-WAKEUP DELAY END
[  121.039259][ T1334]
==================================================================
[  121.040792][ T1334] BUG: KASAN: slab-use-after-free in
rcuwait_wake_up+0x33/0x60
[  121.042345][ T1334] Read of size 8 at addr ffff8881223545f0 by task TEST/1334
[  121.043698][ T1334]
[  121.044175][ T1334] CPU: 0 UID: 1000 PID: 1334 Comm: TEST Not
tainted 6.16.0-rc7-00002-g0df7d6c9705b-dirty #179 PREEMPT
[  121.044180][ T1334] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[  121.044182][ T1334] Call Trace:
[  121.044193][ T1334]  <TASK>
[  121.044198][ T1334]  __dump_stack+0x15/0x20
[  121.044216][ T1334]  dump_stack_lvl+0x6c/0xa0
[  121.044219][ T1334]  print_report+0xbc/0x250
[  121.044224][ T1334]  ? rcuwait_wake_up+0x33/0x60
[  121.044228][ T1334]  kasan_report+0x148/0x180
[  121.044249][ T1334]  ? rcuwait_wake_up+0x33/0x60
[  121.044251][ T1334]  __asan_report_load8_noabort+0x10/0x20
[  121.044257][ T1334]  rcuwait_wake_up+0x33/0x60
[  121.044259][ T1334]  vma_refcount_put+0xbd/0x180
[  121.044267][ T1334]  lock_vma_under_rcu+0x438/0x490
[  121.044271][ T1334]  do_user_addr_fault+0x24c/0xbf0
[  121.044278][ T1334]  exc_page_fault+0x5d/0x90
[  121.044297][ T1334]  asm_exc_page_fault+0x22/0x30
[  121.044304][ T1334] RIP: 0033:0x556803378682
[  121.044317][ T1334] Code: ff 89 45 d4 83 7d d4 ff 75 19 48 8d 05 d7
0b 00 00 48 89 c6 bf 01 00 00 00 b8 00 00 00 00 e8 35 fa ff ff 48 8b
05 26 2a 00 00 <0f> b6 00 48 8d 05 d7 0b 00 00 48 89 c6 bf 0f 00 00 00
b8 00 00 00
[  121.044322][ T1334] RSP: 002b:00007ffcd73e98a0 EFLAGS: 00010213
[  121.044325][ T1334] RAX: 00007fcc5c28c000 RBX: 00007ffcd73e9aa8
RCX: 00007fcc5c18b23a
[  121.044327][ T1334] RDX: 0000000000000007 RSI: 000055680337923a
RDI: 000000000000000f
[  121.044329][ T1334] RBP: 00007ffcd73e9990 R08: 00000000ffffffff
R09: 0000000000000000
[  121.044330][ T1334] R10: 00007fcc5c1869c2 R11: 0000000000000202
R12: 0000000000000000
[  121.044332][ T1334] R13: 00007ffcd73e9ac0 R14: 00007fcc5c2cc000
R15: 000055680337add8
[  121.044335][ T1334]  </TASK>
[  121.044336][ T1334]
[  121.077313][ T1334] Allocated by task 1334:
[  121.078117][ T1334]  kasan_save_track+0x3a/0x80
[  121.078982][ T1334]  kasan_save_alloc_info+0x38/0x50
[  121.080012][ T1334]  __kasan_slab_alloc+0x47/0x60
[  121.080920][ T1334]  kmem_cache_alloc_noprof+0x19e/0x370
[  121.081936][ T1334]  copy_mm+0xb7/0x400
[  121.082724][ T1334]  copy_process+0xe1f/0x2ac0
[  121.083585][ T1334]  kernel_clone+0x14b/0x540
[  121.084511][ T1334]  __x64_sys_clone+0x11d/0x150
[  121.085397][ T1334]  x64_sys_call+0x2c55/0x2fa0
[  121.086271][ T1334]  do_syscall_64+0x48/0x120
[  121.087114][ T1334]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  121.088297][ T1334]
[  121.088730][ T1334] Freed by task 1334:
[  121.090219][ T1334]  kasan_save_track+0x3a/0x80
[  121.091083][ T1334]  kasan_save_free_info+0x42/0x50
[  121.092122][ T1334]  __kasan_slab_free+0x3d/0x60
[  121.093008][ T1334]  kmem_cache_free+0xf5/0x300
[  121.093878][ T1334]  __mmdrop+0x260/0x360
[  121.094645][ T1334]  finish_task_switch+0x29c/0x6d0
[  121.095645][ T1334]  __schedule+0x1396/0x2140
[  121.096521][ T1334]  preempt_schedule_irq+0x67/0xc0
[  121.097456][ T1334]  raw_irqentry_exit_cond_resched+0x30/0x40
[  121.098560][ T1334]  irqentry_exit+0x3f/0x50
[  121.099386][ T1334]  sysvec_apic_timer_interrupt+0x3e/0x80
[  121.100489][ T1334]  asm_sysvec_apic_timer_interrupt+0x16/0x20
[  121.101642][ T1334]
[  121.102132][ T1334] The buggy address belongs to the object at
ffff888122354500
[  121.102132][ T1334]  which belongs to the cache mm_struct of size 1304
[  121.106296][ T1334] The buggy address is located 240 bytes inside of
[  121.106296][ T1334]  freed 1304-byte region [ffff888122354500,
ffff888122354a18)
[  121.109452][ T1334]
[  121.109964][ T1334] The buggy address belongs to the physical page:
[  121.111363][ T1334] page: refcount:0 mapcount:0
mapping:0000000000000000 index:0xffff888122354ac0 pfn:0x122350
[  121.113626][ T1334] head: order:3 mapcount:0 entire_mapcount:0
nr_pages_mapped:-1 pincount:0
[  121.115475][ T1334] memcg:ffff88811b750641
[  121.116414][ T1334] flags: 0x200000000000240(workingset|head|node=0|zone=2)
[  121.117949][ T1334] page_type: f5(slab)
[  121.118815][ T1334] raw: 0200000000000240 ffff888100050640
ffff88810004e9d0 ffff88810004e9d0
[  121.121931][ T1334] raw: ffff888122354ac0 000000000016000d
00000000f5000000 ffff88811b750641
[  121.123820][ T1334] head: 0200000000000240 ffff888100050640
ffff88810004e9d0 ffff88810004e9d0
[  121.125722][ T1334] head: ffff888122354ac0 000000000016000d
00000000f5000000 ffff88811b750641
[  121.127596][ T1334] head: 0200000000000003 ffffea000488d401
ffffea00ffffffff 00000000ffffffff
[  121.129469][ T1334] head: ffffffffffffffff 0000000000000000
00000000ffffffff 0000000000000008
[  121.131335][ T1334] page dumped because: kasan: bad access detected
[  121.132823][ T1334] page_owner tracks the page as allocated
[  121.134065][ T1334] page last allocated via order 3, migratetype
Unmovable, gfp_mask
0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC),
pid 1165, tgid 1165 (bash), ts 98173533301, free_ts 98103770759
[  121.139486][ T1334]  post_alloc_hook+0x17a/0x190
[  121.140534][ T1334]  get_page_from_freelist+0x2edc/0x2f90
[  121.141731][ T1334]  __alloc_frozen_pages_noprof+0x1c1/0x4d0
[  121.143022][ T1334]  alloc_pages_mpol+0x14e/0x2b0
[  121.144098][ T1334]  alloc_frozen_pages_noprof+0xc4/0xf0
[  121.145318][ T1334]  allocate_slab+0x8f/0x280
[  121.146296][ T1334]  ___slab_alloc+0x3d5/0x8e0
[  121.147292][ T1334]  kmem_cache_alloc_noprof+0x229/0x370
[  121.148489][ T1334]  copy_mm+0xb7/0x400
[  121.149355][ T1334]  copy_process+0xe1f/0x2ac0
[  121.150352][ T1334]  kernel_clone+0x14b/0x540
[  121.151330][ T1334]  __x64_sys_clone+0x11d/0x150
[  121.152472][ T1334]  x64_sys_call+0x2c55/0x2fa0
[  121.153491][ T1334]  do_syscall_64+0x48/0x120
[  121.154479][ T1334]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  121.156725][ T1334] page last free pid 1321 tgid 1321 stack trace:
[  121.158098][ T1334]  __free_frozen_pages+0xa1c/0xc80
[  121.159205][ T1334]  free_frozen_pages+0xc/0x20
[  121.160229][ T1334]  __free_slab+0xad/0xc0
[  121.161149][ T1334]  free_slab+0x17/0x100
[  121.162102][ T1334]  free_to_partial_list+0x48f/0x5b0
[  121.163239][ T1334]  __slab_free+0x1e5/0x240
[  121.164210][ T1334]  ___cache_free+0xb3/0xf0
[  121.165174][ T1334]  qlist_free_all+0xb7/0x160
[  121.166176][ T1334]  kasan_quarantine_reduce+0x14b/0x170
[  121.167358][ T1334]  __kasan_slab_alloc+0x1e/0x60
[  121.168456][ T1334]  kmem_cache_alloc_noprof+0x19e/0x370
[  121.169638][ T1334]  getname_flags+0x9c/0x490
[  121.170631][ T1334]  do_sys_openat2+0x55/0x100
[  121.171629][ T1334]  __x64_sys_openat+0xf4/0x120
[  121.173713][ T1334]  x64_sys_call+0x1ab/0x2fa0
[  121.174716][ T1334]  do_syscall_64+0x48/0x120
[  121.175697][ T1334]
[  121.176216][ T1334] Memory state around the buggy address:
[  121.177433][ T1334]  ffff888122354480: fc fc fc fc fc fc fc fc fc
fc fc fc fc fc fc fc
[  121.179173][ T1334]  ffff888122354500: fa fb fb fb fb fb fb fb fb
fb fb fb fb fb fb fb
[  121.180922][ T1334] >ffff888122354580: fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb fb
[  121.182718][ T1334]
             ^
[  121.184634][ T1334]  ffff888122354600: fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb fb
[  121.186398][ T1334]  ffff888122354680: fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb fb
[  121.189204][ T1334]
==================================================================
```

[-- Attachment #2: vma-lock-delay-inject.diff --]
[-- Type: text/x-patch, Size: 3621 bytes --]

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 5da384bd0a26..1acac1ab9d30 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -12,6 +12,7 @@ extern int rcuwait_wake_up(struct rcuwait *w);
 #include <linux/tracepoint-defs.h>
 #include <linux/types.h>
 #include <linux/cleanup.h>
+#include <linux/delay.h>
 
 #define MMAP_LOCK_INITIALIZER(name) \
 	.mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock),
@@ -139,8 +140,18 @@ static inline void vma_refcount_put(struct vm_area_struct *vma)
 	int oldcnt;
 
 	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+	if (strcmp(current->comm, "TEST") == 0) {
+		pr_warn("%s: PRE-DECREMENT DELAY START\n", __func__);
+		mdelay(2000);
+		pr_warn("%s: PRE-DECREMENT DELAY END\n", __func__);
+	}
 	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
 
+		if (strcmp(current->comm, "TEST") == 0) {
+			pr_warn("%s: PRE-WAKEUP DELAY START (is_vma_writer_only()=%d)\n", __func__, is_vma_writer_only(oldcnt-1));
+			mdelay(10000);
+			pr_warn("%s: PRE-WAKEUP DELAY END\n", __func__);
+		}
 		if (is_vma_writer_only(oldcnt - 1))
 			rcuwait_wake_up(&mm->vma_writer_wait);
 	}
@@ -170,6 +181,11 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
 	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
 		return NULL;
 
+	if (strcmp(current->comm, "TEST") == 0) {
+		pr_warn("%s: PRE-INCREMENT DELAY START on VMA %px\n", __func__, vma);
+		mdelay(2000);
+		pr_warn("%s: PRE-INCREMENT DELAY END\n", __func__);
+	}
 	/*
 	 * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited_acquire()
 	 * will fail because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
@@ -179,6 +195,8 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
 	if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt,
 							      VMA_REF_LIMIT))) {
 		/* return EAGAIN if vma got detached from under us */
+		if (strcmp(current->comm, "TEST") == 0)
+			pr_warn("%s: refcount acquire failed\n", __func__);
 		return oldcnt ? NULL : ERR_PTR(-EAGAIN);
 	}
 
@@ -195,10 +213,14 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
 	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
+		if (strcmp(current->comm, "TEST") == 0)
+			pr_warn("%s: seqcount mismatch\n", __func__);
 		vma_refcount_put(vma);
 		return NULL;
 	}
 
+	if (strcmp(current->comm, "TEST") == 0)
+		pr_warn("%s: returning vma\n", __func__);
 	return vma;
 }
 
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 5f725cc67334..7c5ca1c0d331 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -60,6 +60,12 @@ static inline bool __vma_enter_locked(struct vm_area_struct *vma, bool detaching
 	if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
 		return false;
 
+	if (strcmp(current->comm, "WRITER") == 0) {
+		pr_warn("%s: BEGIN DELAY\n", __func__);
+		mdelay(2000);
+		pr_warn("%s: END DELAY\n", __func__);
+	}
+
 	rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
 	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
 		   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
diff --git a/mm/vma_init.c b/mm/vma_init.c
index 8e53c7943561..fb57d57205de 100644
--- a/mm/vma_init.c
+++ b/mm/vma_init.c
@@ -31,6 +31,8 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 	vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 	if (!vma)
 		return NULL;
+	if (strcmp(current->comm, "WRITER-PREP") == 0)
+		pr_warn("%s: writer allocated vma %px\n", __func__, vma);
 
 	vma_init(vma, mm);
 

[-- Attachment #3: vmalock-uaf.c --]
[-- Type: text/x-csrc, Size: 1876 bytes --]

#define _GNU_SOURCE
#include <pthread.h>
#include <unistd.h>
#include <sched.h>
#include <stdio.h>
#include <err.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/mman.h>
#include <sys/eventfd.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

static void *bugvma;
int wake_child_fd = -1;
static void *thread_fn(void *dummy) {
  sleep(1); // PRE-LAUNCH

  sleep(1);
  // T1
  SYSCHK(munmap(bugvma, 0x1000));
  SYSCHK(eventfd_write(wake_child_fd, 1));
  usleep(500*1000);
  return NULL;
}

int main(void) {
  setbuf(stdout, NULL);

  wake_child_fd = SYSCHK(eventfd(0, 0));
  // pin all to same core to ensure we use the same SLUB freelist
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(0, &cpuset);
  SYSCHK(sched_setaffinity(0, sizeof(cpuset), &cpuset));

  pid_t child = SYSCHK(fork());
  if (child == 0) {
    eventfd_t dummyval;
    SYSCHK(eventfd_read(wake_child_fd, &dummyval));

    // T1
    SYSCHK(prctl(PR_SET_NAME, "WRITER-PREP"));
    void *realloc = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0));
    usleep(1500*1000);
    // T3
    printf("B1: begin munmap\n");
    SYSCHK(prctl(PR_SET_NAME, "WRITER"));
    SYSCHK(munmap(realloc, 0x1000)); // continues at T5
    SYSCHK(prctl(PR_SET_NAME, "WRITER-ENDED"));
    printf("B1: munmap done\n");
    exit(0);
  }

  pthread_t thread;
  if (pthread_create(&thread, NULL, thread_fn, NULL))
    errx(1, "pthread_create");

  sleep(1); // PRE-LAUNCH

  bugvma = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0));
  SYSCHK(prctl(PR_SET_NAME, "TEST"));
  // T0
  *(volatile char *)bugvma; // increment at T2; decrement at T4; UAF at T6
  SYSCHK(prctl(PR_SET_NAME, "TEST-ENDED"));
}


* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 16:26 [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU Jann Horn
@ 2025-07-23 17:32 ` Vlastimil Babka
  2025-07-23 17:49   ` Jann Horn
  2025-07-23 18:14 ` Lorenzo Stoakes
  1 sibling, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2025-07-23 17:32 UTC (permalink / raw)
  To: Jann Horn, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Suren Baghdasaryan
  Cc: Pedro Falcato, Linux-MM, kernel list

On 7/23/25 18:26, Jann Horn wrote:
> There's a racy UAF in `vma_refcount_put()` when called on the
> `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> without sufficient protection against concurrent object reuse:

Oof.

> I'm not sure what the right fix is; I guess one approach would be to
> have a special version of vma_refcount_put() for cases where the VMA
> has been recycled by another MM that grabs an extra reference to the
> MM? But then dropping a reference to the MM afterwards might be a bit
> annoying and might require something like mmdrop_async()...

Would we need mmdrop_async()? Isn't this the case for mmget_not_zero() and
mmput_async()?



* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 17:32 ` Vlastimil Babka
@ 2025-07-23 17:49   ` Jann Horn
  2025-07-23 17:55     ` Suren Baghdasaryan
  2025-07-23 18:10     ` Vlastimil Babka
  0 siblings, 2 replies; 31+ messages in thread
From: Jann Horn @ 2025-07-23 17:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Suren Baghdasaryan, Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 7:32 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> On 7/23/25 18:26, Jann Horn wrote:
> > There's a racy UAF in `vma_refcount_put()` when called on the
> > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > without sufficient protection against concurrent object reuse:
>
> Oof.
>
> > I'm not sure what the right fix is; I guess one approach would be to
> > have a special version of vma_refcount_put() for cases where the VMA
> > has been recycled by another MM that grabs an extra reference to the
> > MM? But then dropping a reference to the MM afterwards might be a bit
> > annoying and might require something like mmdrop_async()...
>
> Would we need mmdrop_async()? Isn't this the case for mmget_not_zero() and
> mmput_async()?

Now I'm not sure anymore if either of those approaches would work,
because they rely on the task that's removing the VMA to wait until we
do __refcount_dec_and_test() before deleting the MM... but I don't
think we have any such guarantee...


* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 17:49   ` Jann Horn
@ 2025-07-23 17:55     ` Suren Baghdasaryan
  2025-07-23 19:01       ` Lorenzo Stoakes
  2025-07-23 18:10     ` Vlastimil Babka
  1 sibling, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2025-07-23 17:55 UTC (permalink / raw)
  To: Jann Horn
  Cc: Vlastimil Babka, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 10:50 AM Jann Horn <jannh@google.com> wrote:
>
> On Wed, Jul 23, 2025 at 7:32 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > On 7/23/25 18:26, Jann Horn wrote:
> > > There's a racy UAF in `vma_refcount_put()` when called on the
> > > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > > without sufficient protection against concurrent object reuse:
> >
> > Oof.

Thanks for analyzing this Jann. Yeah, I missed the fact that
vma_refcount_put() uses vma->vm_mm.

> >
> > > I'm not sure what the right fix is; I guess one approach would be to
> > > have a special version of vma_refcount_put() for cases where the VMA
> > > has been recycled by another MM that grabs an extra reference to the
> > > MM? But then dropping a reference to the MM afterwards might be a bit
> > > annoying and might require something like mmdrop_async()...
> >
> > Would we need mmdrop_async()? Isn't this the case for mmget_not_zero() and
> > mmput_async()?
>
> Now I'm not sure anymore if either of those approaches would work,
> because they rely on the task that's removing the VMA to wait until we
> do __refcount_dec_and_test() before deleting the MM... but I don't
> think we have any such guarantee...

This is tricky. Let me look into it some more before suggesting any fixes.


* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 17:49   ` Jann Horn
  2025-07-23 17:55     ` Suren Baghdasaryan
@ 2025-07-23 18:10     ` Vlastimil Babka
  2025-07-23 18:19       ` Jann Horn
  1 sibling, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2025-07-23 18:10 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Suren Baghdasaryan, Pedro Falcato, Linux-MM, kernel list

On 7/23/25 19:49, Jann Horn wrote:
> On Wed, Jul 23, 2025 at 7:32 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>> On 7/23/25 18:26, Jann Horn wrote:
>> > There's a racy UAF in `vma_refcount_put()` when called on the
>> > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
>> > without sufficient protection against concurrent object reuse:
>>
>> Oof.
>>
>> > I'm not sure what the right fix is; I guess one approach would be to
>> > have a special version of vma_refcount_put() for cases where the VMA
>> > has been recycled by another MM that grabs an extra reference to the
>> > MM? But then dropping a reference to the MM afterwards might be a bit
>> > annoying and might require something like mmdrop_async()...
>>
>> Would we need mmdrop_async()? Isn't this the case for mmget_not_zero() and
>> mmput_async()?
> 
> Now I'm not sure anymore if either of those approaches would work,
> because they rely on the task that's removing the VMA to wait until we
> do __refcount_dec_and_test() before deleting the MM... but I don't
> think we have any such guarantee...

I think it would be waiting in exit_mmap->vma_mark_detached(), but then
AFAIU you're right and we'd really need to work with mmgrab/mmdrop because
at that point the  mmget_not_zero() would already be failing...


* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 16:26 [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU Jann Horn
  2025-07-23 17:32 ` Vlastimil Babka
@ 2025-07-23 18:14 ` Lorenzo Stoakes
  2025-07-23 18:30   ` Jann Horn
  1 sibling, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-07-23 18:14 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
	Vlastimil Babka, Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 06:26:53PM +0200, Jann Horn wrote:
> There's a racy UAF in `vma_refcount_put()` when called on the
> `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> without sufficient protection against concurrent object reuse:
>
> lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under
> rcu_read_lock(). At that point, the VMA may be concurrently freed, and
> it can be recycled by another process. vma_start_read() then
> increments the vma->vm_refcnt (if it is in an acceptable range), and
> if this succeeds, vma_start_read() can return a recycled VMA. (As a
> sidenote, this goes against what the surrounding comments above
> vma_start_read() and in lock_vma_under_rcu() say - it would probably
> be cleaner to perform the vma->vm_mm check inside vma_start_read().)
>
> In this scenario where the VMA has been recycled, lock_vma_under_rcu()
> will then detect the mismatching ->vm_mm pointer and drop the VMA
> through vma_end_read(), which calls vma_refcount_put().

So in _correctly_ identifying the recycling, we then hit a problem. Fun!

> vma_refcount_put() does this:
>
> ```
> static inline void vma_refcount_put(struct vm_area_struct *vma)
> {
>         /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
>         struct mm_struct *mm = vma->vm_mm;

Are we at a point where we _should_ be looking at a VMA with vma->vm_mm ==
current->mm here?

Or can we not safely assume this?

Because if we can, can we not check for that here?

Do we need to keep the old mm around to wake up waiters if we're happily
freeing it anyway?

If not, then surely we can just do this check, and not do the wake up if
false?

I may be missing something or being stupid here so :P

>         int oldcnt;
>
>         rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
>         if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
>
>                 if (is_vma_writer_only(oldcnt - 1))
>                         rcuwait_wake_up(&mm->vma_writer_wait);
>         }
> }
> ```
>
> This is wrong: It implicitly assumes that the caller is keeping the
> VMA's mm alive, but in this scenario the caller has no relation to the
> VMA's mm, so the rcuwait_wake_up() can cause UAF.

Oh yikes. Yikes yikes yikes.

>
> In theory, this could happen to any multithreaded process where thread
> A is in the middle of pagefault handling while thread B is
> manipulating adjacent VMAs such that VMA merging frees the VMA looked
> up by thread A - but in practice, for UAF to actually happen, I think
> you need to at least hit three race windows in a row that are each on
> the order of a single memory access wide.

Hmm ok this is confusing, below you reference threads by number with
letter=process, and here you say thread 'A' and 'B'.


>
> The interleaving leading to UAF is the following, where threads A1 and
> A2 are part of one process and thread B1 is part of another process:

Err but you refer to A1, A2, A3 below, should A3=B1?

> ```
> A1               A2               A3
> ==               ==               ==
> lock_vma_under_rcu
>   mas_walk
>                  <VMA modification removes the VMA>
>                                   mmap
>                                     <reallocate the VMA>
>   vma_start_read
>     READ_ONCE(vma->vm_lock_seq)
>     __refcount_inc_not_zero_limited_acquire
>                                   munmap
                                   ^
				   |
                                   OK so here, we have unmapped and
				   returned VMA to the slab so
				   SLAB_TYPESAFE_BY_RCU may now result in
				   you having a VMA with vma->vm_mm !=
				   current->mm right?

>                                     __vma_enter_locked
>                                       refcount_add_not_zero
>   vma_end_read
>     vma_refcount_put
      ^
      |
      Then here, we're grabbing process
      B's mm... whoops...

>       __refcount_dec_and_test
>                                       rcuwait_wait_event
>                                     <finish operation>
                                      ^
				      |
				      Then here, the VMA is finally
				      freed right?

>       rcuwait_wake_up [UAF]
        ^
	|
	And then here we try to deref freed mm->vma_writer_wait
	and boom.
> ```
>

This accurate?

> I'm not sure what the right fix is; I guess one approach would be to
> have a special version of vma_refcount_put() for cases where the VMA
> has been recycled by another MM that grabs an extra reference to the
> MM? But then dropping a reference to the MM afterwards might be a bit
> annoying and might require something like mmdrop_async()...

This seems a bit convoluted... And I worry about us getting the refcount
stuff wrong somehow, we'd have to be very careful about this to avoid
leaking VMAs somehow...


* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 18:10     ` Vlastimil Babka
@ 2025-07-23 18:19       ` Jann Horn
  2025-07-23 18:39         ` Lorenzo Stoakes
  2025-07-23 20:27         ` Suren Baghdasaryan
  0 siblings, 2 replies; 31+ messages in thread
From: Jann Horn @ 2025-07-23 18:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Suren Baghdasaryan, Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 8:10 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> On 7/23/25 19:49, Jann Horn wrote:
> > On Wed, Jul 23, 2025 at 7:32 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> On 7/23/25 18:26, Jann Horn wrote:
> >> > There's a racy UAF in `vma_refcount_put()` when called on the
> >> > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> >> > without sufficient protection against concurrent object reuse:
> >>
> >> Oof.
> >>
> >> > I'm not sure what the right fix is; I guess one approach would be to
> >> > have a special version of vma_refcount_put() for cases where the VMA
> >> > has been recycled by another MM that grabs an extra reference to the
> >> > MM? But then dropping a reference to the MM afterwards might be a bit
> >> > annoying and might require something like mmdrop_async()...
> >>
> >> Would we need mmdrop_async()? Isn't this the case for mmget_not_zero() and
> >> mmput_async()?
> >
> > Now I'm not sure anymore if either of those approaches would work,
> > because they rely on the task that's removing the VMA to wait until we
> > do __refcount_dec_and_test() before deleting the MM... but I don't
> > think we have any such guarantee...
>
> I think it would be waiting in exit_mmap->vma_mark_detached(), but then
> AFAIU you're right and we'd really need to work with mmgrab/mmdrop because
> at that point the  mmget_not_zero() would already be failing...

Ah, I see! vma_mark_detached() drops its reference, then does
__vma_enter_locked() to bump the refcount by VMA_LOCK_OFFSET again
(after which the reader path can't acquire it anymore), then waits
until the refcount drops to VMA_LOCK_OFFSET, and then decrements it
down to 0 from there. Makes sense.
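
In other words, roughly (an illustrative pseudo-sketch of that protocol as I
understand it, not the exact mm/mmap_lock.c code):

```
static void vma_mark_detached_sketch(struct vm_area_struct *vma)
{
	/* drop the "attached" reference */
	if (likely(refcount_dec_and_test(&vma->vm_refcnt)))
		return;		/* no readers held temporary references */

	/* __vma_enter_locked(): after this, new readers fail the
	 * VMA_REF_LIMIT check in vma_start_read() */
	if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
		return;		/* readers finished in the meantime */

	/* wait for the straggling readers' vma_refcount_put() calls */
	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
			   refcount_read(&vma->vm_refcnt) == VMA_LOCK_OFFSET,
			   TASK_UNINTERRUPTIBLE);

	/* drop from VMA_LOCK_OFFSET down to 0: fully detached */
	WARN_ON(!refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt));
}
```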


* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 18:14 ` Lorenzo Stoakes
@ 2025-07-23 18:30   ` Jann Horn
  2025-07-23 18:45     ` Lorenzo Stoakes
  0 siblings, 1 reply; 31+ messages in thread
From: Jann Horn @ 2025-07-23 18:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
	Vlastimil Babka, Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 8:14 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Wed, Jul 23, 2025 at 06:26:53PM +0200, Jann Horn wrote:
> > There's a racy UAF in `vma_refcount_put()` when called on the
> > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > without sufficient protection against concurrent object reuse:
> >
> > lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under
> > rcu_read_lock(). At that point, the VMA may be concurrently freed, and
> > it can be recycled by another process. vma_start_read() then
> > increments the vma->vm_refcnt (if it is in an acceptable range), and
> > if this succeeds, vma_start_read() can return a recycled VMA. (As a
> > sidenote, this goes against what the surrounding comments above
> > vma_start_read() and in lock_vma_under_rcu() say - it would probably
> > be cleaner to perform the vma->vm_mm check inside vma_start_read().)
> >
> > In this scenario where the VMA has been recycled, lock_vma_under_rcu()
> > will then detect the mismatching ->vm_mm pointer and drop the VMA
> > through vma_end_read(), which calls vma_refcount_put().
>
> So in _correctly_ identifying the recycling, we then hit a problem. Fun!
>
> > vma_refcount_put() does this:
> >
> > ```
> > static inline void vma_refcount_put(struct vm_area_struct *vma)
> > {
> >         /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> >         struct mm_struct *mm = vma->vm_mm;
>
> Are we at a point where we _should_ be looking at a VMA with vma->vm_mm ==
> current->mm here?

Well, you _hope_ to be looking at a VMA with vma->vm_mm==current->mm,
but if you lose a race it is intentional that you can end up with
another MM's VMA here.

> Or can we not safely assume this?

No.

> Because if we can, can we not check for that here?
>
> Do we need to keep the old mm around to wake up waiters if we're happily
> freeing it anyway?

Well, we don't know if the MM has already been freed, or if it is
still alive and well and has writers who are stuck waiting for our
wakeup.

> If not, then surely we can just do this check, and not do the wake up if
> false?
>
> > I may be missing something or being stupid here so :P
>
> >         int oldcnt;
> >
> >         rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> >         if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> >
> >                 if (is_vma_writer_only(oldcnt - 1))
> >                         rcuwait_wake_up(&mm->vma_writer_wait);
> >         }
> > }
> > ```
> >
> > This is wrong: It implicitly assumes that the caller is keeping the
> > VMA's mm alive, but in this scenario the caller has no relation to the
> > VMA's mm, so the rcuwait_wake_up() can cause UAF.
>
> Oh yikes. Yikes yikes yikes.
>
> >
> > In theory, this could happen to any multithreaded process where thread
> > A is in the middle of pagefault handling while thread B is
> > manipulating adjacent VMAs such that VMA merging frees the VMA looked
> > up by thread A - but in practice, for UAF to actually happen, I think
> > you need to at least hit three race windows in a row that are each on
> > the order of a single memory access wide.
>
> Hmm ok this is confusing, below you reference threads by number with
> letter=process, and here you say thread 'A' and 'B'.

Ah, yeah, sorry.

> > The interleaving leading to UAF is the following, where threads A1 and
> > A2 are part of one process and thread B1 is part of another process:
>
> Err but you refer to A1, A2, A3 below, should A3=B1?

... clearly I should have proofread this more carefully. Yes, A3 should be B1.

>
> > ```
> > A1               A2               A3
> > ==               ==               ==
> > lock_vma_under_rcu
> >   mas_walk
> >                  <VMA modification removes the VMA>
> >                                   mmap
> >                                     <reallocate the VMA>
> >   vma_start_read
> >     READ_ONCE(vma->vm_lock_seq)
> >     __refcount_inc_not_zero_limited_acquire
> >                                   munmap
>                                    ^
>                                    |
>                                    OK so here, we have unmapped and
>                                    returned VMA to the slab so
>                                    SLAB_TYPESAFE_BY_RCU may now result in
>                                    you having a VMA with vma->vm_mm !=
>                                    current->mm right?

Yes. The VMA is unmapped and returned to the slab at the "<VMA
modification removes the VMA>" in A2 above, and then reallocated when
B1 does "mmap".

> >                                     __vma_enter_locked
> >                                       refcount_add_not_zero
> >   vma_end_read
> >     vma_refcount_put
>       ^
>       |
>       Then here, we're grabbing process
>       B's mm... whoops...

Yeah.

> >       __refcount_dec_and_test
> >                                       rcuwait_wait_event
> >                                     <finish operation>
>                                       ^
>                                       |
>                                       Then here, the VMA is finally
>                                       freed right?

Yeah.

> >       rcuwait_wake_up [UAF]
>         ^
>         |
>         And then here we try to deref freed mm->vma_writer_wait
>         and boom.

Yeah.

> > I'm not sure what the right fix is; I guess one approach would be to
> > have a special version of vma_refcount_put() for cases where the VMA
> > has been recycled by another MM that grabs an extra reference to the
> > MM? But then dropping a reference to the MM afterwards might be a bit
> > annoying and might require something like mmdrop_async()...
>
> This seems a bit convoluted... And I worry about us getting the refcount
> stuff wrong somehow, we'd have to be very careful about this to avoid
> leaking VMAs somehow...

FWIW, the slow and boring solution would be to use wake_up_var() and
wait_var_event(), but those might be too slow for this usecase.
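
Roughly like this (untested sketch): the wait would be keyed on the address
of vma->vm_refcnt, which wake_up_var()/wait_var_event() only use as a hash
key into a global waitqueue table, so the waker never has to dereference the
mm at all:

```
/* untested sketch, not actual kernel code */

/* writer side, replacing the rcuwait_wait_event() in __vma_enter_locked(): */
wait_var_event(&vma->vm_refcnt,
	       refcount_read(&vma->vm_refcnt) == tgt_refcnt);

/* reader side, replacing the wakeup in vma_refcount_put(): */
if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
	if (is_vma_writer_only(oldcnt - 1))
		wake_up_var(&vma->vm_refcnt);	/* no mm dereference */
}
```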


* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 18:19       ` Jann Horn
@ 2025-07-23 18:39         ` Lorenzo Stoakes
  2025-07-23 19:52           ` Jann Horn
  2025-07-23 20:27         ` Suren Baghdasaryan
  1 sibling, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-07-23 18:39 UTC (permalink / raw)
  To: Jann Horn
  Cc: Vlastimil Babka, Andrew Morton, Liam R. Howlett,
	Suren Baghdasaryan, Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 08:19:09PM +0200, Jann Horn wrote:
> On Wed, Jul 23, 2025 at 8:10 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > On 7/23/25 19:49, Jann Horn wrote:
> > > On Wed, Jul 23, 2025 at 7:32 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >> On 7/23/25 18:26, Jann Horn wrote:
> > >> > There's a racy UAF in `vma_refcount_put()` when called on the
> > >> > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > >> > without sufficient protection against concurrent object reuse:
> > >>
> > >> Oof.
> > >>
> > >> > I'm not sure what the right fix is; I guess one approach would be to
> > >> > have a special version of vma_refcount_put() for cases where the VMA
> > >> > has been recycled by another MM that grabs an extra reference to the
> > >> > MM? But then dropping a reference to the MM afterwards might be a bit
> > >> > annoying and might require something like mmdrop_async()...
> > >>
> > >> Would we need mmdrop_async()? Isn't this the case for mmget_not_zero() and
> > >> mmput_async()?
> > >
> > > Now I'm not sure anymore if either of those approaches would work,
> > > because they rely on the task that's removing the VMA to wait until we
> > > do __refcount_dec_and_test() before deleting the MM... but I don't
> > > think we have any such guarantee...
> >
> > I think it would be waiting in exit_mmap->vma_mark_detached(), but then
> > AFAIU you're right and we'd really need to work with mmgrab/mmdrop because
> > at that point the  mmget_not_zero() would already be failing...
>
> Ah, I see! vma_mark_detached() drops its reference, then does
> __vma_enter_locked() to bump the refcount by VMA_LOCK_OFFSET again
> (after which the reader path can't acquire it anymore), then waits
> until the refcount drops to VMA_LOCK_OFFSET, and then decrements it
> down to 0 from there. Makes sense.

Sorry, this is really my fault because I didn't follow the
reimplementation of the VMA locks closely enough and so am a little behind
here (I'll fix this, probably by documenting them fully in the relevant doc
page).

So forgive me if I'm asking stupid questions.

What exactly is the issue with the waiter not being triggered?

I see in vma_mark_detached():

	/*
	 * We are the only writer, so no need to use vma_refcount_put().
	 * The condition below is unlikely because the vma has been already
	 * write-locked and readers can increment vm_refcnt only temporarily
	 * before they check vm_lock_seq, realize the vma is locked and drop
	 * back the vm_refcnt. That is a narrow window for observing a raised
	 * vm_refcnt.
	 */

So, if this is happening at the point of the unmap, and we're unlucky enough
that some readers have spuriously incremented the refcnt before they check
vm_lock_seq, we trigger __vma_enter_locked() and wait for those readers'
vma_refcount_put() calls to wake us up via rcuwait_wake_up() if the refcount
is still raised (which it should be, right?)

So are we actually going to be left with readers sat around waiting forever
if the scenario mentioned happens?

If we make the rando mm we are now referencing stick around, aren't we just
spuriously triggering this thing while potentially leaving actual waiters
waiting?

I may be completely misunderstanding here...


* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 18:30   ` Jann Horn
@ 2025-07-23 18:45     ` Lorenzo Stoakes
  2025-07-23 19:43       ` Jann Horn
  0 siblings, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-07-23 18:45 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
	Vlastimil Babka, Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 08:30:30PM +0200, Jann Horn wrote:
> On Wed, Jul 23, 2025 at 8:14 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Wed, Jul 23, 2025 at 06:26:53PM +0200, Jann Horn wrote:
> > > There's a racy UAF in `vma_refcount_put()` when called on the
> > > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > > without sufficient protection against concurrent object reuse:
> > >
> > > lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under
> > > rcu_read_lock(). At that point, the VMA may be concurrently freed, and
> > > it can be recycled by another process. vma_start_read() then
> > > increments the vma->vm_refcnt (if it is in an acceptable range), and
> > > if this succeeds, vma_start_read() can return a recycled VMA. (As a
> > > sidenote, this goes against what the surrounding comments above
> > > vma_start_read() and in lock_vma_under_rcu() say - it would probably
> > > be cleaner to perform the vma->vm_mm check inside vma_start_read().)
> > >
> > > In this scenario where the VMA has been recycled, lock_vma_under_rcu()
> > > will then detect the mismatching ->vm_mm pointer and drop the VMA
> > > through vma_end_read(), which calls vma_refcount_put().
> >
> > So in _correctly_ identifying the recycling, we then hit a problem. Fun!
> >
> > > vma_refcount_put() does this:
> > >
> > > ```
> > > static inline void vma_refcount_put(struct vm_area_struct *vma)
> > > {
> > >         /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> > >         struct mm_struct *mm = vma->vm_mm;
> >
> > Are we at a point where we _should_ be looking at a VMA with vma->vm_mm ==
> > current->mm here?
>
> Well, you _hope_ to be looking at a VMA with vma->vm_mm==current->mm,
> but if you lose a race it is intentional that you can end up with
> another MM's VMA here.
>
> > Or can we not safely assume this?
>
> No.

What code paths lead to vma_refcount_put() with a foreign mm?

I mean maybe it's an unsafe assumption.

I realise we are doing stuff for _reasons_, but I sort of HATE that we have
put ourselves in a position where we might always see a recycled VMA and
rely on a very very complicated seqnum-based locking mechanism to make sure
this doesn't happen.

It feels like we've made ourselves a challenging and uncomfy bed...

>
> > Because if we can, can we not check for that here?
> >
> > Do we need to keep the old mm around to wake up waiters if we're happily
> > freeing it anyway?
>
> Well, we don't know if the MM has already been freed, or if it is
> still alive and well and has writers who are stuck waiting for our
> wakeup.

But the mm we're talking about here is some recycled one from another
thread?

Right so, we have:

'mm we meant to get' (which apparently can't be assumed to be current->mm)
'mm we actually got' (which may or may not be freed at any time)

The _meant to get_ one might have eternal waiters. Or might not even need
to be woken up.

I don't see why keeping the 'actually got' one around really helps us? Am I
missing something?

Or is this not the proposal?

>
> > If not, then surely we can just do this check, and not do the wake up if
> > false?
> >
> > > I may be missing something or being stupid here so :P
> >
> > >         int oldcnt;
> > >
> > >         rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > >         if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> > >
> > >                 if (is_vma_writer_only(oldcnt - 1))
> > >                         rcuwait_wake_up(&mm->vma_writer_wait);
> > >         }
> > > }
> > > ```
> > >
> > > This is wrong: It implicitly assumes that the caller is keeping the
> > > VMA's mm alive, but in this scenario the caller has no relation to the
> > > VMA's mm, so the rcuwait_wake_up() can cause UAF.
> >
> > Oh yikes. Yikes yikes yikes.
> >
> > >
> > > In theory, this could happen to any multithreaded process where thread
> > > A is in the middle of pagefault handling while thread B is
> > > manipulating adjacent VMAs such that VMA merging frees the VMA looked
> > > up by thread A - but in practice, for UAF to actually happen, I think
> > > you need to at least hit three race windows in a row that are each on
> > > the order of a single memory access wide.
> >
> > Hmm ok this is confusing, below you reference threads by number with
> > letter=process, and here you say thread 'A' and 'B'.
>
> Ah, yeah, sorry.
>
> > > The interleaving leading to UAF is the following, where threads A1 and
> > > A2 are part of one process and thread B1 is part of another process:
> >
> > Err but you refer to A1, A2, A3 below, should A3=B1?
>
> ... clearly I should have proofread this more carefully. Yes, A3 should be B1.
>
> >
> > > ```
> > > A1               A2               A3
> > > ==               ==               ==
> > > lock_vma_under_rcu
> > >   mas_walk
> > >                  <VMA modification removes the VMA>
> > >                                   mmap
> > >                                     <reallocate the VMA>
> > >   vma_start_read
> > >     READ_ONCE(vma->vm_lock_seq)
> > >     __refcount_inc_not_zero_limited_acquire
> > >                                   munmap
> >                                    ^
> >                                    |
> >                                    OK so here, we have unmapped and
> >                                    returned VMA to the slab so
> >                                    SLAB_TYPESAFE_BY_RCU may now result in
> >                                    you having a VMA with vma->vm_mm !=
> >                                    current->mm right?
>
> Yes. The VMA is unmapped and returned to the slab at the "<VMA
> modification removes the VMA>" in A2 above, and then reallocated when
> B1 does "mmap".
>
> > >                                     __vma_enter_locked
> > >                                       refcount_add_not_zero
> > >   vma_end_read
> > >     vma_refcount_put
> >       ^
> >       |
> >       Then here, we're grabbing process
> >       B's mm... whoops...
>
> Yeah.
>
> > >       __refcount_dec_and_test
> > >                                       rcuwait_wait_event
> > >                                     <finish operation>
> >                                       ^
> >                                       |
> >                                       Then here, the VMA is finally
> >                                       freed right?
>
> Yeah.
>
> > >       rcuwait_wake_up [UAF]
> >         ^
> >         |
> >         And then here we try to deref freed mm->vma_writer_wait
> >         and boom.
>
> Yeah.
>
> > > I'm not sure what the right fix is; I guess one approach would be to
> > > have a special version of vma_refcount_put() for cases where the VMA
> > > has been recycled by another MM that grabs an extra reference to the
> > > MM? But then dropping a reference to the MM afterwards might be a bit
> > > annoying and might require something like mmdrop_async()...
> >
> > This seems a bit convoluted... And I worry about us getting the refcount
> > stuff wrong somehow, we'd have to be very careful about this to avoid
> > leaking VMAs somehow...
>
> FWIW, the slow and boring solution would be to use wake_up_var() and
> wait_var_event(), but those might be too slow for this usecase.

I don't know what those are, but they seem like some general waiter/wake up
mechanism and yeah... the whole point of this VMA lock stuff is perf so
it's probably a no-go.


* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 17:55     ` Suren Baghdasaryan
@ 2025-07-23 19:01       ` Lorenzo Stoakes
  2025-07-24  0:13         ` Suren Baghdasaryan
  0 siblings, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-07-23 19:01 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Jann Horn, Vlastimil Babka, Andrew Morton, Liam R. Howlett,
	Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 10:55:06AM -0700, Suren Baghdasaryan wrote:
> On Wed, Jul 23, 2025 at 10:50 AM Jann Horn <jannh@google.com> wrote:
> >
> > On Wed, Jul 23, 2025 at 7:32 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > On 7/23/25 18:26, Jann Horn wrote:
> > > > There's a racy UAF in `vma_refcount_put()` when called on the
> > > > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > > > without sufficient protection against concurrent object reuse:
> > >
> > > Oof.
>
> Thanks for analyzing this Jann. Yeah, I missed the fact that
> vma_refcount_put() uses vma->vm_mm.
>
> > >
> > > > I'm not sure what the right fix is; I guess one approach would be to
> > > > have a special version of vma_refcount_put() for cases where the VMA
> > > > has been recycled by another MM that grabs an extra reference to the
> > > > MM? But then dropping a reference to the MM afterwards might be a bit
> > > > annoying and might require something like mmdrop_async()...
> > >
> > > Would we need mmdrop_async()? Isn't this the case for mmget_not_zero() and
> > > mmput_async()?
> >
> > Now I'm not sure anymore if either of those approaches would work,
> > because they rely on the task that's removing the VMA to wait until we
> > do __refcount_dec_and_test() before deleting the MM... but I don't
> > think we have any such guarantee...
>
> This is tricky. Let me look into it some more before suggesting any fixes.

Thanks Suren! :)

I feel the strong desire to document this seqnum approach as it is
intricate, so will find some time to do that for my own benefit at least.

The fact VMAs can be recycled like this at any time makes me super nervous,
so I wonder if we could find ways to, at least in a debug mode (perhaps
even in a CONFIG_DEBUG_VM_MAPLE_TREE-style 'we are fine with this being
very very slow' sort of way), pick up on potentially super weird
small-race-window style issues like this.

Because it feels like debugging them 'in the wild' might be really horrid.

Or maybe it's even possible to shuffle things around and do some testing in
userland via the VMA userland tests... possibly pipe dream though given the
mechanisms that would need to be put there.

It's sort of hard now to separate VMA locks from VMA operations in general,
so that's something I need to think about anyway.

But I'm almost certainly going to document this in an 'internal' portion of
the process addrs doc page we have, at least to teach myself the deeper
internals...

Cheers, Lorenzo


* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 18:45     ` Lorenzo Stoakes
@ 2025-07-23 19:43       ` Jann Horn
  2025-07-24  5:13         ` Lorenzo Stoakes
  0 siblings, 1 reply; 31+ messages in thread
From: Jann Horn @ 2025-07-23 19:43 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
	Vlastimil Babka, Pedro Falcato, Linux-MM, kernel list

Sorry, while typing up this mail I realized I didn't have this stuff
particularly straight in my head myself when writing my previous mails
about this...

On Wed, Jul 23, 2025 at 8:45 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Wed, Jul 23, 2025 at 08:30:30PM +0200, Jann Horn wrote:
> > On Wed, Jul 23, 2025 at 8:14 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > On Wed, Jul 23, 2025 at 06:26:53PM +0200, Jann Horn wrote:
> > > > There's a racy UAF in `vma_refcount_put()` when called on the
> > > > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > > > without sufficient protection against concurrent object reuse:
> > > >
> > > > lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under
> > > > rcu_read_lock(). At that point, the VMA may be concurrently freed, and
> > > > it can be recycled by another process. vma_start_read() then
> > > > increments the vma->vm_refcnt (if it is in an acceptable range), and
> > > > if this succeeds, vma_start_read() can return a recycled VMA. (As a
> > > > sidenote, this goes against what the surrounding comments above
> > > > vma_start_read() and in lock_vma_under_rcu() say - it would probably
> > > > be cleaner to perform the vma->vm_mm check inside vma_start_read().)
> > > >
> > > > In this scenario where the VMA has been recycled, lock_vma_under_rcu()
> > > > will then detect the mismatching ->vm_mm pointer and drop the VMA
> > > > through vma_end_read(), which calls vma_refcount_put().
> > >
> > > So in _correctly_ identifying the recycling, we then hit a problem. Fun!
> > >
> > > > vma_refcount_put() does this:
> > > >
> > > > ```
> > > > static inline void vma_refcount_put(struct vm_area_struct *vma)
> > > > {
> > > >         /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> > > >         struct mm_struct *mm = vma->vm_mm;
> > >
> > > Are we at a point where we _should_ be looking at a VMA with vma->vm_mm ==
> > > current->mm here?
> >
> > Well, you _hope_ to be looking at a VMA with vma->vm_mm==current->mm,
> > but if you lose a race it is intentional that you can end up with
> > another MM's VMA here.

(I forgot: The mm passed to lock_vma_under_rcu() is potentially
different from current->mm if we're coming from uffd_mfill_lock(),
which would be intentional and desired, but that's not relevant here.
Sorry for making things more confusing.)

> > > Or can we not safely assume this?
> >
> > No.
>
> What code paths lead to vma_refcount_put() with a foreign mm?

Calls to vma_refcount_put() from vma_start_read() or from
lock_vma_under_rcu() can have an MM different from the mm that was
passed to lock_vma_under_rcu().

Basically, lock_vma_under_rcu() locklessly looks up a VMA in the maple
tree of the provided MM; and so immediately after the maple tree
lookup, before we grab any sort of reference on the VMA, the VMA can
be freed, and reallocated by another process. If we then essentially
read-lock this VMA which is used by another MM (by incrementing its
refcount), waiters in that other MM might start waiting for us; and in
that case, when we notice we got the wrong VMA and bail out, we have
to wake those waiters up again after unlocking the VMA by dropping its
refcount.
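
To make the shape of the race concrete, the path looks roughly like this
(a condensed sketch from my reading of the code, not the exact mainline
source; the retry/error handling is elided):

```
struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
					  unsigned long address)
{
	MA_STATE(mas, &mm->mm_mt, address, address);
	struct vm_area_struct *vma;

	rcu_read_lock();
	vma = mas_walk(&mas);	/* the VMA can be freed+recycled from here */
	if (!vma)
		goto inval;

	vma = vma_start_read(mm, vma);	/* essentially a read-trylock */
	if (IS_ERR_OR_NULL(vma))
		goto inval;

	/* recheck: did we lock a VMA that was recycled into another MM? */
	if (unlikely(vma->vm_mm != mm ||
		     address < vma->vm_start || address >= vma->vm_end))
		goto inval_end_read;

	rcu_read_unlock();
	return vma;

inval_end_read:
	vma_end_read(vma);	/* -> vma_refcount_put(), the UAF spot */
inval:
	rcu_read_unlock();
	return NULL;
}
```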

> I mean maybe it's an unsafe assumption.
>
> I realise we are doing stuff for _reasons_, but I sort of HATE that we have
> put ourselves in a position where we might always see a recycled VMA and
> rely on a very very complicated seqnum-based locking mechanism to make sure
> this doesn't happen.

Yes, that is pretty much the definition of SLAB_TYPESAFE_BY_RCU. ^^
You get higher data cache hit rates in exchange for complicated "grab
some kinda stable object reference and then recheck if we got the
right one" stuff, though it is less annoying when dealing with a
normal refcount or spinlock or such rather than this kind of
open-coded sleepable read-write semaphore.
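
For comparison, with a plain refcount the usual pattern looks something
like this (generic illustrative pseudocode, not from any particular
subsystem; lookup()/put_obj() are made-up names):

```
rcu_read_lock();
obj = lookup(key);	/* may race with free + reallocation of obj */
if (obj && !refcount_inc_not_zero(&obj->refcnt))
	obj = NULL;	/* object was being freed, treat as not found */
rcu_read_unlock();

/* obj is stable now, but it may have been recycled for another user: */
if (obj && obj->key != key) {
	put_obj(obj);	/* this put must be safe even though obj now
			 * belongs to someone else - which is exactly
			 * the property vma_refcount_put() is missing */
	obj = NULL;
}
```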

> It feels like we've made ourselves a challenging and uncomfy bed...
>
> >
> > > Because if we can, can we not check for that here?
> > >
> > > Do we need to keep the old mm around to wake up waiters if we're happily
> > > freeing it anyway?
> >
> > Well, we don't know if the MM has already been freed, or if it is
> > still alive and well and has writers who are stuck waiting for our
> > wakeup.
>
> But the mm we're talking about here is some recycled one from another
> thread?

The MM is not recycled, the VMA is recycled.

> Right so, we have:
>
> 'mm we meant to get' (which apparently can't be assumed to be current->mm)
> 'mm we actually got' (which may or may not be freed at any time)
>
> The _meant to get_ one might have eternal waiters. Or might not even need
> to be woken up.
>
> I don't see why keeping the 'actually got' one around really helps us? Am I
> missing something?

We basically have taken a read lock on a VMA that is part of the
"actually got" MM, and so we may have caused writers from that MM to
block and sleep, and since we did that we have to wake them back up
and say "sorry, locked the wrong object, please continue".

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 18:39         ` Lorenzo Stoakes
@ 2025-07-23 19:52           ` Jann Horn
  2025-07-23 20:00             ` Jann Horn
  2025-07-24  5:23             ` Lorenzo Stoakes
  0 siblings, 2 replies; 31+ messages in thread
From: Jann Horn @ 2025-07-23 19:52 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Vlastimil Babka, Andrew Morton, Liam R. Howlett,
	Suren Baghdasaryan, Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 8:39 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Wed, Jul 23, 2025 at 08:19:09PM +0200, Jann Horn wrote:
> > On Wed, Jul 23, 2025 at 8:10 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > On 7/23/25 19:49, Jann Horn wrote:
> > > > On Wed, Jul 23, 2025 at 7:32 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > >> On 7/23/25 18:26, Jann Horn wrote:
> > > >> > There's a racy UAF in `vma_refcount_put()` when called on the
> > > >> > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > > >> > without sufficient protection against concurrent object reuse:
> > > >>
> > > >> Oof.
> > > >>
> > > >> > I'm not sure what the right fix is; I guess one approach would be to
> > > >> > have a special version of vma_refcount_put() for cases where the VMA
> > > >> > has been recycled by another MM that grabs an extra reference to the
> > > >> > MM? But then dropping a reference to the MM afterwards might be a bit
> > > >> > annoying and might require something like mmdrop_async()...
> > > >>
> > > >> Would we need mmdrop_async()? Isn't this the case for mmget_not_zero() and
> > > >> mmput_async()?
> > > >
> > > > Now I'm not sure anymore if either of those approaches would work,
> > > > because they rely on the task that's removing the VMA to wait until we
> > > > do __refcount_dec_and_test() before deleting the MM... but I don't
> > > > think we have any such guarantee...
> > >
> > > I think it would be waiting in exit_mmap->vma_mark_detached(), but then
> > > AFAIU you're right and we'd really need to work with mmgrab/mmdrop because
> > > at that point the  mmget_not_zero() would already be failing...
> >
> > Ah, I see! vma_mark_detached() drops its reference, then does
> > __vma_enter_locked() to bump the refcount by VMA_LOCK_OFFSET again
> > (after which the reader path can't acquire it anymore), then waits
> > until the refcount drops to VMA_LOCK_OFFSET, and then decrements it
> > down to 0 from there. Makes sense.
>
> Sorry, this is really my fault because I didn't closely follow the
> reimplementation of the VMA locks closely enough and so am a little behind
> here (I'll fix this, probably by documenting them fully in the relevant doc
> page).
>
> So forgive me if I'm asking stupid questions.
>
> What exactly is the issue with the waiter not being triggered?
>
> I see in vma_mark_detached():
>
>         /*
>          * We are the only writer, so no need to use vma_refcount_put().
>          * The condition below is unlikely because the vma has been already
>          * write-locked and readers can increment vm_refcnt only temporarily
>          * before they check vm_lock_seq, realize the vma is locked and drop
>          * back the vm_refcnt. That is a narrow window for observing a raised
>          * vm_refcnt.
>          */
>
> So, if this is happening at the point of the unmap, and we're unlucky enough to
> have some readers have spuriously incremented the refcnt before they check
> vm_lock_seq, we trigger __vma_enter_locked() and wait on other VMAs to
> vma_refcount_put() to wake it up via rcuwait_wake_up() if the refcount is still
> raised (which it should be right?)

I'm not sure if I'm understanding you correctly; but yes,
__vma_enter_locked() waits for all the readers to drop their
"refcounts". (It's not really a refcount, you can also think of it as
a sleepable read-write lock where the low bits are the number of
readers.)

> So actually are we going to be left with readers sat around waiting forever? If
> the scenario mentioned happens?

I'm not sure I understand the question. Readers don't wait, they bail
out if they hit contention and retry with the mmap lock. As far as VMA
locks are concerned, readers basically always trylock, only writers
can wait.

> If we make the rando mm we are now referencing stick around, aren't we just
> spuriously triggering this thing while potentially leaving actual waiters
> waiting?

In that case, the thing we've read-locked is not part of the MM we
were trying to operate on, it is part of the rando other MM, so the
writers we've blocked are also part of the rando other MM, and so the
rando other MM is where we have to do a wakeup.

> I may be completely misunderstanding here...

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 19:52           ` Jann Horn
@ 2025-07-23 20:00             ` Jann Horn
  2025-07-24  5:24               ` Lorenzo Stoakes
  2025-07-24  5:23             ` Lorenzo Stoakes
  1 sibling, 1 reply; 31+ messages in thread
From: Jann Horn @ 2025-07-23 20:00 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Vlastimil Babka, Andrew Morton, Liam R. Howlett,
	Suren Baghdasaryan, Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 9:52 PM Jann Horn <jannh@google.com> wrote:
> I'm not sure if I'm understanding you correctly; but yes,
> __vma_enter_locked() waits for all the readers to drop their
> "refcounts". (It's not really a refcount, you can also think of it as
> a sleepable read-write lock where the low bits are the number of
> readers.)

Sorry, that's not entirely true, since an attached VMA has a refcount
elevated by one. It's kind of a refcount, and kind of forms part of a
sleepable read-write lock, it's complicated.
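
To spell out my mental model of the encoding (my reading of it, not
authoritative):

```
/*
 * vma->vm_refcnt, roughly (VMA_LOCK_OFFSET is a high bit far above any
 * plausible number of readers):
 *
 *   0                    detached; readers must not take a reference
 *   1                    attached, no readers
 *   1 + N                attached, N readers hold the vma
 *   >= VMA_LOCK_OFFSET   a writer is in __vma_enter_locked(), possibly
 *                        sleeping until the reader bits drain away
 */
```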

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 18:19       ` Jann Horn
  2025-07-23 18:39         ` Lorenzo Stoakes
@ 2025-07-23 20:27         ` Suren Baghdasaryan
  2025-07-24  2:30           ` Suren Baghdasaryan
  1 sibling, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2025-07-23 20:27 UTC (permalink / raw)
  To: Jann Horn
  Cc: Vlastimil Babka, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 11:19 AM Jann Horn <jannh@google.com> wrote:
>
> On Wed, Jul 23, 2025 at 8:10 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > On 7/23/25 19:49, Jann Horn wrote:
> > > On Wed, Jul 23, 2025 at 7:32 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >> On 7/23/25 18:26, Jann Horn wrote:
> > >> > There's a racy UAF in `vma_refcount_put()` when called on the
> > >> > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > >> > without sufficient protection against concurrent object reuse:
> > >>
> > >> Oof.
> > >>
> > >> > I'm not sure what the right fix is; I guess one approach would be to
> > >> > have a special version of vma_refcount_put() for cases where the VMA
> > >> > has been recycled by another MM that grabs an extra reference to the
> > >> > MM? But then dropping a reference to the MM afterwards might be a bit
> > >> > annoying and might require something like mmdrop_async()...
> > >>
> > >> Would we need mmdrop_async()? Isn't this the case for mmget_not_zero() and
> > >> mmput_async()?
> > >
> > > Now I'm not sure anymore if either of those approaches would work,
> > > because they rely on the task that's removing the VMA to wait until we
> > > do __refcount_dec_and_test() before deleting the MM... but I don't
> > > think we have any such guarantee...
> >
> > I think it would be waiting in exit_mmap->vma_mark_detached(), but then
> > AFAIU you're right and we'd really need to work with mmgrab/mmdrop because
> > at that point the  mmget_not_zero() would already be failing...
>
> Ah, I see! vma_mark_detached() drops its reference, then does
> __vma_enter_locked() to bump the refcount by VMA_LOCK_OFFSET again
> (after which the reader path can't acquire it anymore), then waits
> until the refcount drops to VMA_LOCK_OFFSET, and then decrements it
> down to 0 from there. Makes sense.

Yes, that's what I was checking to understand the race. In your explanation:

A1 found the vma
A2 detached it
A3 attached it to another mm
A1 refcounts the vma
A1 realizes it's from another mm and calls vma_end_read() which tries
to wake up another mm's waiter.

Vlastimil is right that if A1 was able to successfully elevate vma's
refcount then:
1. vma must be attached to some valid mm. This is true because if the
vma is detached, vma_start_read() would not be able to elevate its
refcount. Once vma_start_read() elevates the refcount, vma will not
detach from under us because vma_mark_detached() will block until no
readers are using the vma.
2. vma->mm can't be destroyed from under us because of that
exit_mmap()->vma_mark_detached() which again will ensure no readers
are holding a reference to the vmas of that mm.

So, a special version of vma_refcount_put() that takes mm as a
parameter and does mmgrab/mmdrop before using that mm might work. I'll
do some more digging and maybe test this solution with your reproducer
to see if that works as I would expect.
Thanks,
Suren.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 19:01       ` Lorenzo Stoakes
@ 2025-07-24  0:13         ` Suren Baghdasaryan
  2025-07-24  4:40           ` Lorenzo Stoakes
  0 siblings, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2025-07-24  0:13 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jann Horn, Vlastimil Babka, Andrew Morton, Liam R. Howlett,
	Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 12:01 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Jul 23, 2025 at 10:55:06AM -0700, Suren Baghdasaryan wrote:
> > On Wed, Jul 23, 2025 at 10:50 AM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Wed, Jul 23, 2025 at 7:32 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > On 7/23/25 18:26, Jann Horn wrote:
> > > > > There's a racy UAF in `vma_refcount_put()` when called on the
> > > > > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > > > > without sufficient protection against concurrent object reuse:
> > > >
> > > > Oof.
> >
> > Thanks for analyzing this Jann. Yeah, I missed the fact that
> > vma_refcount_put() uses vma->vm_mm.
> >
> > > >
> > > > > I'm not sure what the right fix is; I guess one approach would be to
> > > > > have a special version of vma_refcount_put() for cases where the VMA
> > > > > has been recycled by another MM that grabs an extra reference to the
> > > > > MM? But then dropping a reference to the MM afterwards might be a bit
> > > > > annoying and might require something like mmdrop_async()...
> > > >
> > > > Would we need mmdrop_async()? Isn't this the case for mmget_not_zero() and
> > > > mmput_async()?
> > >
> > > Now I'm not sure anymore if either of those approaches would work,
> > > because they rely on the task that's removing the VMA to wait until we
> > > do __refcount_dec_and_test() before deleting the MM... but I don't
> > > think we have any such guarantee...
> >
> > This is tricky. Let me look into it some more before suggesting any fixes.
>
> Thanks Suren! :)
>
> I feel the strong desire to document this seqnum approach as it is
> intricate, so will find some time to do that for my own benefit at least.

I think we documented most of it in your document:
https://elixir.bootlin.com/linux/v6.16-rc7/source/Documentation/mm/process_addrs.rst#L705
but maybe we can improve?
This issue is a result of my failure to notice that vma_refcount_put()
uses vma->mm. This is my oversight, not a conceptual design flaw (at
least in my mind). We knew that lock_vma_under_rcu() should never
dereference vma->mm, that's why we added an `mm` parameter to that
function.  But unfortunately I missed this indirect usage.

>
> The fact VMAs can be recycled like this at any time makes me super nervous,
> so I wonder if we could find ways to, at least in a debug mode (perhaps
> even in a CONFIG_DEBUG_VM_MAPLE_TREE-style 'we are fine with this being
> very very slow' sort of way), pick up on potentially super weird
> small-race-window style issues like this.
>
> Because it feels like debugging them 'in the wild' might be really horrid.

What do you have in mind? A specialized test that simulates some races
for vma lookup/allocation/reuse?

>
> Or maybe it's even possible to shuffle things around and do some testing in
> userland via the VMA userland tests... possibly pipe dream though given the
> mechanisms that would need to be put there.

I think that would be difficult, if not impossible. We would have to
inject delays like Jann did to reveal such rare races in a repeatable
manner. Maybe if we had an in-kernel delay injection mechanism that
would be possible?
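
Something as dumb as this might already be enough to make such races
repeatable (completely hypothetical sketch, all names invented):

```
#ifdef CONFIG_DEBUG_VMA_RACES
extern unsigned int vma_race_delay_us;		/* set via debugfs */
static inline void vma_race_delay(void)
{
	if (vma_race_delay_us)
		udelay(vma_race_delay_us);	/* widen the race window */
}
#else
static inline void vma_race_delay(void) { }
#endif

/* e.g. in lock_vma_under_rcu(), right after the lockless lookup: */
	vma = mas_walk(&mas);
	vma_race_delay();	/* give a racing unmap time to free the vma */
```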

>
> It's sort of hard now to separate VMA locks from VMA operations in general,
> so that's something I need to think about anyway.
>
> But I'm almost certainly going to document this in an 'internal' portion of
> the process addrs doc page we have, at least to teach myself the deeper
> internals...
>
> Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 20:27         ` Suren Baghdasaryan
@ 2025-07-24  2:30           ` Suren Baghdasaryan
  2025-07-24  8:38             ` Vlastimil Babka
  0 siblings, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2025-07-24  2:30 UTC (permalink / raw)
  To: Jann Horn
  Cc: Vlastimil Babka, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 1:27 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Jul 23, 2025 at 11:19 AM Jann Horn <jannh@google.com> wrote:
> >
> > On Wed, Jul 23, 2025 at 8:10 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > On 7/23/25 19:49, Jann Horn wrote:
> > > > On Wed, Jul 23, 2025 at 7:32 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > >> On 7/23/25 18:26, Jann Horn wrote:
> > > >> > There's a racy UAF in `vma_refcount_put()` when called on the
> > > >> > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > > >> > without sufficient protection against concurrent object reuse:
> > > >>
> > > >> Oof.
> > > >>
> > > >> > I'm not sure what the right fix is; I guess one approach would be to
> > > >> > have a special version of vma_refcount_put() for cases where the VMA
> > > >> > has been recycled by another MM that grabs an extra reference to the
> > > >> > MM? But then dropping a reference to the MM afterwards might be a bit
> > > >> > annoying and might require something like mmdrop_async()...
> > > >>
> > > >> Would we need mmdrop_async()? Isn't this the case for mmget_not_zero() and
> > > >> mmput_async()?
> > > >
> > > > Now I'm not sure anymore if either of those approaches would work,
> > > > because they rely on the task that's removing the VMA to wait until we
> > > > do __refcount_dec_and_test() before deleting the MM... but I don't
> > > > think we have any such guarantee...
> > >
> > > I think it would be waiting in exit_mmap->vma_mark_detached(), but then
> > > AFAIU you're right and we'd really need to work with mmgrab/mmdrop because
> > > at that point the  mmget_not_zero() would already be failing...
> >
> > Ah, I see! vma_mark_detached() drops its reference, then does
> > __vma_enter_locked() to bump the refcount by VMA_LOCK_OFFSET again
> > (after which the reader path can't acquire it anymore), then waits
> > until the refcount drops to VMA_LOCK_OFFSET, and then decrements it
> > down to 0 from there. Makes sense.
>
> Yes, that's what I was checking to understand the race. In your explanation:
>
> A1 found the vma
> A2 detached it
> A3 attached it to another mm
> A1 refcounts the vma
> A1 realizes it's from another mm and calls vma_end_read() which tries
> to wake up another mm's waiter.

Ok, I finally got the entire picture. Now I understand why it would be
so hard to reproduce and that it depends on a very specific order of
execution. These steps should happen in precisely this order:

1. A3 calls __vma_enter_locked() because refcount_dec_and_test() in
vma_mark_detached() fails due to A1 holding a refcount (the usual
situation);
2. by the time A3 calls rcuwait_wait_event(), A1 must have dropped its
refcount, so that rcuwait_wait_event() does not enter the wait;
3. by the time A1 calls rcuwait_wake_up(), A3 must have freed the mm,
leading to A1's UAF.

Very clever.
I was wrong in thinking that we could call rcuwait_wake_up() on the
original mm that the vma was attached to before. We do have to
rcuwait_wake_up() the mm that the vma is attached to at the time of
vma_refcount_put(), so using vma->vm_mm in vma_refcount_put() is the
right thing to do, because our refcount might be blocking operations on
the current vma->mm, not the one the vma was originally attached to. We
just have to stabilize vma->mm.

So, I think vma_refcount_put() can mmgrab(vma->mm) before calling
__refcount_dec_and_test(), to stabilize that mm and then mmdrop()
after it calls rcuwait_wake_up(). What do you think about this
approach, folks?
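
I.e. something along these lines (untested sketch of the idea, not a
proper patch):

```
static inline void vma_refcount_put(struct vm_area_struct *vma)
{
	/* The elevated vm_refcnt keeps vma->vm_mm alive at this point. */
	struct mm_struct *mm = vma->vm_mm;
	int oldcnt;

	mmgrab(mm);	/* pin mm before we give up our vm_refcnt */
	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
		if (is_vma_writer_only(oldcnt - 1))
			rcuwait_wake_up(&mm->vma_writer_wait);
	}
	mmdrop(mm);	/* may end up in __mmdrop(); see the discussion
			 * below about whether that's safe in this context */
}
```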

>
> Vlastimil is right that if A1 was able to successfully elevate vma's
> refcount then:
> 1. vma must be attached to some valid mm. This is true because if the
> vma is detached, vma_start_read() would not be able to elevate its
> refcount. Once vma_start_read() elevates the refcount, vma will not
> detach from under us because vma_mark_detached() will block until no
> readers are using the vma.
> 2. vma->mm can't be destroyed from under us because of that
> exit_mmap()->vma_mark_detached() which again will ensure no readers
> are holding a reference to the vmas of that mm.
>
> So, a special version of vma_refcount_put() that takes mm as a
> parameter and does mmgrab/mmdrop before using that mm might work. I'll
> do some more digging and maybe test this solution with your reproducer
> to see if that works as I would expect.
> Thanks,
> Suren.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-24  0:13         ` Suren Baghdasaryan
@ 2025-07-24  4:40           ` Lorenzo Stoakes
  0 siblings, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24  4:40 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Jann Horn, Vlastimil Babka, Andrew Morton, Liam R. Howlett,
	Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 05:13:56PM -0700, Suren Baghdasaryan wrote:
> On Wed, Jul 23, 2025 at 12:01 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Jul 23, 2025 at 10:55:06AM -0700, Suren Baghdasaryan wrote:
> > > On Wed, Jul 23, 2025 at 10:50 AM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Wed, Jul 23, 2025 at 7:32 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > > On 7/23/25 18:26, Jann Horn wrote:
> > > > > > There's a racy UAF in `vma_refcount_put()` when called on the
> > > > > > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > > > > > without sufficient protection against concurrent object reuse:
> > > > >
> > > > > Oof.
> > >
> > > Thanks for analyzing this Jann. Yeah, I missed the fact that
> > > vma_refcount_put() uses vma->vm_mm.
> > >
> > > > >
> > > > > > I'm not sure what the right fix is; I guess one approach would be to
> > > > > > have a special version of vma_refcount_put() for cases where the VMA
> > > > > > has been recycled by another MM that grabs an extra reference to the
> > > > > > MM? But then dropping a reference to the MM afterwards might be a bit
> > > > > > annoying and might require something like mmdrop_async()...
> > > > >
> > > > > Would we need mmdrop_async()? Isn't this the case for mmget_not_zero() and
> > > > > mmput_async()?
> > > >
> > > > Now I'm not sure anymore if either of those approaches would work,
> > > > because they rely on the task that's removing the VMA to wait until we
> > > > do __refcount_dec_and_test() before deleting the MM... but I don't
> > > > think we have any such guarantee...
> > >
> > > This is tricky. Let me look into it some more before suggesting any fixes.
> >
> > Thanks Suren! :)
> >
> > I feel the strong desire to document this seqnum approach as it is
> > intricate, so will find some time to do that for my own benefit at least.
>
> I think we documented most of it in your document:
> https://elixir.bootlin.com/linux/v6.16-rc7/source/Documentation/mm/process_addrs.rst#L705
> but maybe we can improve?

The whole problem is - I wrote that document and understood the locking,
and then you radically changed the implementation :)

I hope we settle on this one finally...

Anyway, it's entirely my fault for not having more closely followed things
at the time, esp. given my M: entry for VMA locking... in my defence, my workload
has been crazy... but it's clear to me now that this implementation is a
lot more complicated than I thought.

Your doc changes in commit 795f29616e85 ("docs/mm: document latest changes
to vm_lock") provide an overview, but they don't provide anywhere near
enough detail to understand the implementation.

I should have reviewed and checked that, again this is my fault.

Anyway, in the interests of 'catching up' on this I'll document these
details myself as, at least in part, a pedagogical exercise.

> This issue is a result of my failure to notice that vma_refcount_put()
> uses vma->mm. This is my oversight, not a conceptual design flaw (at
> least in my mind). We knew that lock_vma_under_rcu() should never
> dereference vma->mm, that's why we added an `mm` parameter to that
> function.  But unfortunately I missed this indirect usage.

I've looked at the implementation now and find it generally difficult to
follow, so regardless of the root cause here, this is complex stuff.

Anyway again this is my fault, I should have scrutinised the new
implementation as closely as Liam and Vlastimil did, and should have
reviewed the doc change and told you to write a lot more :)

Let me fix both of those issues.

>
> >
> > The fact VMAs can be recycled like this at any time makes me super nervous,
> > so I wonder if we could find ways to, at least in a debug mode (perhaps
> > even in a CONFIG_DEBUG_VM_MAPLE_TREE-style 'we are fine with this being
> > very very slow' sort of way), pick up on potentially super weird
> > small-race-window style issues like this.
> >
> > Because it feels like debugging them 'in the wild' might be really horrid.
>
> What do you have in mind? A specialized test that simulates some races
> for vma lookup/allocation/reuse?

I mean where we introduce validation tests and check expectations at key
moments in a debug mode similar to the way CONFIG_DEBUG_VM_MAPLE_TREE does
this for the maple tree stuff.

>
> >
> > Or maybe it's even possible to shuffle things around and do some testing in
> > userland via the VMA userland tests... possibly pipe dream though given the
> > mechanisms that would need to be put there.
>
> I think that would be difficult if not impossible. We would have to
> inject delays like Jann did to reveal such rare races in a repeatable
> manner. Maybe if we had an in-kernel delay injection mechanism that
> would be possible?

I think that'd be a lot easier in userland.

My point is that a core part of how VMAs work is this locking mechanism
now, and for the userland 'VMA' tests to be meaningful we probably need to
recreate that accurately.

>
> >
> > It's sort of hard now to separate VMA locks from VMA operations in general,
> > so that's something I need to think about anyway.
> >
> > But I'm almost certainly going to document this in an 'internal' portion of
> > the process addrs doc page we have, at least to teach myself the deeper
> > internals...
> >
> > Cheers, Lorenzo

Thanks

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 19:43       ` Jann Horn
@ 2025-07-24  5:13         ` Lorenzo Stoakes
  2025-07-24 14:50           ` Jann Horn
  0 siblings, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24  5:13 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
	Vlastimil Babka, Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 09:43:35PM +0200, Jann Horn wrote:
> Sorry, while typing up this mail I realized I didn't have this stuff
> particularly straight in my head myself when writing my previous mails
> about this...
>
> On Wed, Jul 23, 2025 at 8:45 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Wed, Jul 23, 2025 at 08:30:30PM +0200, Jann Horn wrote:
> > > On Wed, Jul 23, 2025 at 8:14 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > On Wed, Jul 23, 2025 at 06:26:53PM +0200, Jann Horn wrote:
> > > > > There's a racy UAF in `vma_refcount_put()` when called on the
> > > > > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > > > > without sufficient protection against concurrent object reuse:
> > > > >
> > > > > lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under
> > > > > rcu_read_lock(). At that point, the VMA may be concurrently freed, and
> > > > > it can be recycled by another process. vma_start_read() then
> > > > > increments the vma->vm_refcnt (if it is in an acceptable range), and
> > > > > if this succeeds, vma_start_read() can return a recycled VMA. (As a
> > > > > sidenote, this goes against what the surrounding comments above
> > > > > vma_start_read() and in lock_vma_under_rcu() say - it would probably
> > > > > be cleaner to perform the vma->vm_mm check inside vma_start_read().)
> > > > >
> > > > > In this scenario where the VMA has been recycled, lock_vma_under_rcu()
> > > > > will then detect the mismatching ->vm_mm pointer and drop the VMA
> > > > > through vma_end_read(), which calls vma_refcount_put().
> > > >
> > > > So in _correctly_ identifying the recycling, we then hit a problem. Fun!
> > > >
> > > > > vma_refcount_put() does this:
> > > > >
> > > > > ```
> > > > > static inline void vma_refcount_put(struct vm_area_struct *vma)
> > > > > {
> > > > >         /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> > > > >         struct mm_struct *mm = vma->vm_mm;
> > > >
> > > > Are we at a point where we _should_ be looking at a VMA with vma->vm_mm ==
> > > > current->mm here?
> > >
> > > Well, you _hope_ to be looking at a VMA with vma->vm_mm==current->mm,
> > > but if you lose a race it is intentional that you can end up with
> > > another MM's VMA here.

Right I get the SLAB_TYPESAFE_BY_RCU thing, what I'm saying overall is 'can we
detect that we lost the race by knowing what mm this should be'...

>
> (I forgot: The mm passed to lock_vma_under_rcu() is potentially
> different from current->mm if we're coming from uffd_mfill_lock(),
> which would be intentional and desired, but that's not relevant here.
> Sorry for making things more confusing.)

...and because of this, no we can't. I hate how uffd is implemented.

But even so, even if we could, it would SUCK as it would leave us with a
_requirement_ that this had to be current->mm or a race-recycled thing.

Stuff like mshare also wants to pull out assumptions about current->mm, so
we don't really want to be doing this anywhere even if we could.

So moot really.

>
> > > > Or can we not safely assume this?
> > >
> > > No.
> >
> > What code paths lead to vma_refcount_put() with a foreign mm?
>
> Calls to vma_refcount_put() from vma_start_read() or from
> lock_vma_under_rcu() can have an MM different from the mm that was
> passed to lock_vma_under_rcu().
>
> Basically, lock_vma_under_rcu() locklessly looks up a VMA in the maple
> tree of the provided MM; and so immediately after the maple tree
> lookup, before we grab any sort of reference on the VMA, the VMA can
> be freed, and reallocated by another process. If we then essentially
> read-lock this VMA which is used by another MM (by incrementing its
> refcount), waiters in that other MM might start waiting for us; and in
> that case, when we notice we got the wrong VMA and bail out, we have
> to wake those waiters up again after unlocking the VMA by dropping its
> refcount.

Thanks for the explanation but I _think_ I understood this bit, what I
meant was 'except for cases where slab recycled the VMA' :), as in
_intentionally_ vma->vm_mm != current->mm.

You've answered above re: uffd.

>
> > I mean maybe it's an unsafe assumption.
> >
> > I realise we are doing stuff for _reasons_, but I sort of HATE that we have
> > put ourselves in a position where we might always see a recycled VMA and
> > rely on a very very complicated seqnum-based locking mechanism to make sure
> > this doesn't happen.
>
> Yes, that is pretty much the definition of SLAB_TYPESAFE_BY_RCU. ^^
> You get higher data cache hit rates in exchange for complicated "grab
> some kinda stable object reference and then recheck if we got the
> right one" stuff, though it is less annoying when dealing with a
> normal refcount or spinlock or such rather than this kind of
> open-coded sleepable read-write semaphore.

Yes, and again I should have more closely scrutinised the original
series... or at least more closely _understood_ it. Entirely my fault,
though due to workload more than anything (hey guess I need to work some
more weekends :)

I understand we're making this trade-off for a reason. But we're asking for
problems here.

>
> > It feels like we've made ourselves a challenging and uncomfy bed...
> >
> > >
> > > > Because if we can, can we not check for that here?
> > > >
> > > > Do we need to keep the old mm around to wake up waiters if we're happily
> > > > freeing it anyway?
> > >
> > > Well, we don't know if the MM has already been freed, or if it is
> > > still alive and well and has writers who are stuck waiting for our
> > > wakeup.
> >
> > But the mm we're talking about here is some recycled one from another
> > thread?
>
> The MM is not recycled, the VMA is recycled.

Yeah sorry it was late, this is what I meant.

VMA is recycled, so different mm.

>
> > Right so, we have:
> >
> > 'mm we meant to get' (which apparently can't be assumed to be current->mm)
> > 'mm we actually got' (which may or may not be freed at any time)
> >
> > The _meant to get_ one might have eternal waiters. Or might not even need
> > to be woken up.
> >
> > I don't see why keeping the 'actually got' one around really helps us? Am I
> > missing something?
>
> We basically have taken a read lock on a VMA that is part of the
> "actually got" MM, and so we may have caused writers from that MM to
> block and sleep, and since we did that we have to wake them back up
> and say "sorry, locked the wrong object, please continue".

OK I think this is the crux of it then, and what I've been missing here -
we have in effect taken a read lock _by mistake_ on the recycled VMA, which
may turn out to be a spurious lock that we need to immediately drop, but
because of this we might have waiters that could wait forever.

OK I get it. But to safely reference the mm here we need to be assured it
stays around, because otherwise we have nothing to prevent that mm from
going away, right?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 19:52           ` Jann Horn
  2025-07-23 20:00             ` Jann Horn
@ 2025-07-24  5:23             ` Lorenzo Stoakes
  1 sibling, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24  5:23 UTC (permalink / raw)
  To: Jann Horn
  Cc: Vlastimil Babka, Andrew Morton, Liam R. Howlett,
	Suren Baghdasaryan, Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 09:52:41PM +0200, Jann Horn wrote:
> On Wed, Jul 23, 2025 at 8:39 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Wed, Jul 23, 2025 at 08:19:09PM +0200, Jann Horn wrote:
> > > On Wed, Jul 23, 2025 at 8:10 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > On 7/23/25 19:49, Jann Horn wrote:
> > > > > On Wed, Jul 23, 2025 at 7:32 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > >> On 7/23/25 18:26, Jann Horn wrote:
> > > > >> > There's a racy UAF in `vma_refcount_put()` when called on the
> > > > >> > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > > > >> > without sufficient protection against concurrent object reuse:
> > > > >>
> > > > >> Oof.
> > > > >>
> > > > >> > I'm not sure what the right fix is; I guess one approach would be to
> > > > >> > have a special version of vma_refcount_put() for cases where the VMA
> > > > >> > has been recycled by another MM that grabs an extra reference to the
> > > > >> > MM? But then dropping a reference to the MM afterwards might be a bit
> > > > >> > annoying and might require something like mmdrop_async()...
> > > > >>
> > > > >> Would we need mmdrop_async()? Isn't this the case for mmget_not_zero() and
> > > > >> mmput_async()?
> > > > >
> > > > > Now I'm not sure anymore if either of those approaches would work,
> > > > > because they rely on the task that's removing the VMA to wait until we
> > > > > do __refcount_dec_and_test() before deleting the MM... but I don't
> > > > > think we have any such guarantee...
> > > >
> > > > I think it would be waiting in exit_mmap->vma_mark_detached(), but then
> > > > AFAIU you're right and we'd really need to work with mmgrab/mmdrop because
> > > > at that point the  mmget_not_zero() would already be failing...
> > >
> > > Ah, I see! vma_mark_detached() drops its reference, then does
> > > __vma_enter_locked() to bump the refcount by VMA_LOCK_OFFSET again
> > > (after which the reader path can't acquire it anymore), then waits
> > > until the refcount drops to VMA_LOCK_OFFSET, and then decrements it
> > > down to 0 from there. Makes sense.
> >
> > Sorry, this is really my fault because I didn't closely follow the
> > reimplementation of the VMA locks closely enough and so am a little behind
> > here (I'll fix this, probably by documenting them fully in the relevant doc
> > page).
> >
> > So forgive me if I'm asking stupid questions.
> >
> > What exactly is the issue with the waiter not being triggered?
> >
> > I see in vma_mark_detached():
> >
> >         /*
> >          * We are the only writer, so no need to use vma_refcount_put().
> >          * The condition below is unlikely because the vma has been already
> >          * write-locked and readers can increment vm_refcnt only temporarily
> >          * before they check vm_lock_seq, realize the vma is locked and drop
> >          * back the vm_refcnt. That is a narrow window for observing a raised
> >          * vm_refcnt.
> >          */
> >
> > So, if this is happening at the point of the unmap, and we're unlucky enough to
> > have some readers have spuriously incremented the refcnt before they check
> > vm_lock_seq, we trigger __vma_enter_locked() and wait on other VMAs to
> > vma_refcount_put() to wake it up via rcuwait_wake_up() if the refcount is still
> > raised (which it should be right?)
>
> I'm not sure if I'm understanding you correctly; but yes,
> __vma_enter_locked() waits for all the readers to drop their
> "refcounts". (It's not really a refcount, you can also think of it as
> a sleepable read-write lock where the low bits are the number of
> readers.)

Yes I understand this bit.

>
> > So actually are we going to be left with readers sat around waiting forever? If
> > the scenario mentioned happens?
>
> I'm not sure I understand the question. Readers don't wait, they bail
> out if they hit contention and retry with the mmap lock. As far as VMA
> locks are concerned, readers basically always trylock, only writers
> can wait.

No I understand that bit, I'm not putting this clearly. I meant to say
__vma_enter_locked() rcuwait_wait_event()'s until:

refcount_read(&vma->vm_refcnt) == tgt_refcnt

But it won't be the reader will it... it'll be the writer, waiting for any
readers that obtained a spurious refcount to drop it.

>
> > If we make the rando mm we are now referencing stick around, aren't we just
> > spuriously triggering this thing while potentially leaving actual waiters
> > waiting?
>
> In that case, the thing we've read-locked is not part of the MM we
> were trying to operate on, it is part of the rando other MM, so the
> writers we've blocked are also part of the rando other MM, and so the
> rando other MM is where we have to do a wakeup.

I am glad you perpetuated my use of 'rando' :P

And yes this is the crux of what I found confusing, as mentioned in the
other thread - so it's this rando other mm we need to keep around...

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-23 20:00             ` Jann Horn
@ 2025-07-24  5:24               ` Lorenzo Stoakes
  0 siblings, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24  5:24 UTC (permalink / raw)
  To: Jann Horn
  Cc: Vlastimil Babka, Andrew Morton, Liam R. Howlett,
	Suren Baghdasaryan, Pedro Falcato, Linux-MM, kernel list

On Wed, Jul 23, 2025 at 10:00:40PM +0200, Jann Horn wrote:
> On Wed, Jul 23, 2025 at 9:52 PM Jann Horn <jannh@google.com> wrote:
> > I'm not sure if I'm understanding you correctly; but yes,
> > __vma_enter_locked() waits for all the readers to drop their
> > "refcounts". (It's not really a refcount, you can also think of it as
> > a sleepable read-write lock where the low bits are the number of
> > readers.)
>
> Sorry, that's not entirely true, since an attached VMA has a refcount
> elevated by one. It's kind of a refcount, and kind of forms part of a
> sleepable read-write lock, it's complicated.

And needs internal impl detail documentation IMO. Which I'll provide...

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-24  2:30           ` Suren Baghdasaryan
@ 2025-07-24  8:38             ` Vlastimil Babka
  2025-07-24 10:53               ` Lorenzo Stoakes
  2025-07-24 14:45               ` Jann Horn
  0 siblings, 2 replies; 31+ messages in thread
From: Vlastimil Babka @ 2025-07-24  8:38 UTC (permalink / raw)
  To: Suren Baghdasaryan, Jann Horn
  Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Pedro Falcato,
	Linux-MM, kernel list

On 7/24/25 04:30, Suren Baghdasaryan wrote:
> So, I think vma_refcount_put() can mmgrab(vma->mm) before calling
> __refcount_dec_and_test(), to stabilize that mm and then mmdrop()
> after it calls rcuwait_wake_up(). What do you think about this
> approach, folks?

Yeah except it would be wasteful to do for all vma_refcount_put(). Should be
enough to have this version (as Jann suggested) for inval_end_read: part of
lock_vma_under_rcu. I think we need it also for the vma_refcount_put() done
in vma_start_read() when we fail the seqcount check? I think in that case
the same thing can be happening too, just with different race windows?

Also as Jann suggested, maybe it's not great (or even safe) to perform
__mmdrop() under rcu? And maybe some vma_start_read() users are even more
restricted? Maybe then we'd need to make __mmdrop_delayed() not RT-only, and
use that.

>>
>> Vlastimil is right that if A1 was able to successfully elevate vma's
>> refcount then:
>> 1. vma must be attached to some valid mm. This is true because if the
>> vma is detached, vma_start_read() would not be able to elevate its
>> refcount. Once vma_start_read() elevates the refcount, vma will not
>> detach from under us because vma_mark_detached() will block until no
>> readers are using the vma.
>> 2. vma->mm can't be destroyed from under us because of that
>> exit_mmap()->vma_mark_detached() which again will ensure no readers
>> are holding a reference to the vmas of that mm.
>>
>> So, a special version of vma_refcount_put() that takes mm as a
>> parameter and does mmgrab/mmdrop before using that mm might work. I'll
>> do some more digging and maybe test this solution with your reproducer
>> to see if that works as I would expect.
>> Thanks,
>> Suren.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-24  8:38             ` Vlastimil Babka
@ 2025-07-24 10:53               ` Lorenzo Stoakes
  2025-07-24 14:29                 ` Suren Baghdasaryan
  2025-07-24 14:45               ` Jann Horn
  1 sibling, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 10:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Jann Horn, Andrew Morton, Liam R. Howlett,
	Pedro Falcato, Linux-MM, kernel list

On Thu, Jul 24, 2025 at 10:38:06AM +0200, Vlastimil Babka wrote:
> On 7/24/25 04:30, Suren Baghdasaryan wrote:
> > So, I think vma_refcount_put() can mmgrab(vma->mm) before calling
> > __refcount_dec_and_test(), to stabilize that mm and then mmdrop()
> > after it calls rcuwait_wake_up(). What do you think about this
> > approach, folks?
>
> Yeah except it would be wasteful to do for all vma_refcount_put(). Should be
> enough to have this version (as Jann suggested) for inval_end_read: part of
> lock_vma_under_rcu. I think we need it also for the vma_refcount_put() done
> in vma_start_read() when we fail the seqcount check? I think in that case
> the same thing can be happening too, just with different race windows?
>
> Also as Jann suggested, maybe it's not great (or even safe) to perform
> __mmdrop() under rcu? And maybe some vma_start_read() users are even more
> restricted? Maybe then we'd need to make __mmdrop_delayed() not RT-only, and
> use that.

Agreed that doing this under RCU seems unwise.

I know PTL relies on this, as does page table reclaim on the zap PTE path;
likely these wouldn't interact with an mm going away (you'd hope!) but it
seems unwise to play around with assumptions here.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-24 10:53               ` Lorenzo Stoakes
@ 2025-07-24 14:29                 ` Suren Baghdasaryan
  2025-07-24 14:45                   ` Vlastimil Babka
  0 siblings, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2025-07-24 14:29 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Vlastimil Babka, Jann Horn, Andrew Morton, Liam R. Howlett,
	Pedro Falcato, Linux-MM, kernel list

On Thu, Jul 24, 2025 at 3:53 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Jul 24, 2025 at 10:38:06AM +0200, Vlastimil Babka wrote:
> > On 7/24/25 04:30, Suren Baghdasaryan wrote:
> > > So, I think vma_refcount_put() can mmgrab(vma->mm) before calling
> > > __refcount_dec_and_test(), to stabilize that mm and then mmdrop()
> > > after it calls rcuwait_wake_up(). What do you think about this
> > > approach, folks?
> >
> > Yeah except it would be wasteful to do for all vma_refcount_put(). Should be
> > enough to have this version (as Jann suggested) for inval_end_read: part of
> > lock_vma_under_rcu.

Yes, definitely.

> > I think we need it also for the vma_refcount_put() done
> > in vma_start_read() when we fail the seqcount check? I think in that case
> > the same thing can be happening too, just with different race windows?

Yes.

> >
> > Also as Jann suggested, maybe it's not great (or even safe) to perform
> > __mmdrop() under rcu? And maybe some vma_start_read() users are even more
> > restricted? Maybe then we'd need to make __mmdrop_delayed() not RT-only, and
> > use that.
>
> Agreed that doing this under RCU seems unwise.
>
> I know PTL relies on this as well as zap PTE page table reclaim, likely these
> wouldn't interact with an mm going away (you'd hope!) but it seems unwise to
> play around with assumptions here.

__mmdrop_delayed() is a viable option but I would hate adding
mm->delayed_drop for !CONFIG_PREEMPT_RT just for this one case.
Alternatively, we don't need to be in the rcu read section when we
call vma_end_read() inside lock_vma_under_rcu(). We refcounted the
vma, so it's locked and stable, no need for RCU at that point. We can
move rcu_read_unlock() before vma_end_read(). However, that's not the
case with the vma_refcount_put() inside vma_start_read(). We could
change vma_start_read() semantics so that it drops rcu_read_lock()
before it returns. That way we could do vma_refcount_put() after
rcu_read_unlock() in both places. The retry case in
lock_vma_under_rcu() would have to reacquire rcu_read_lock() but that
retry is not the usual path, so should not affect overall locking
performance. What do you think about this alternative?
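
Roughly like this (hypothetical, simplified; the mismatch check would
then run outside the RCU read section):

```
	rcu_read_lock();
	vma = mas_walk(&mas);
	if (!vma) {
		rcu_read_unlock();
		return NULL;
	}
	/* new semantics: vma_start_read() drops the rcu read lock itself,
	 * both on success and before any internal vma_refcount_put() */
	vma = vma_start_read(mm, vma);
	if (IS_ERR_OR_NULL(vma))
		return NULL;	/* the retry path re-takes rcu_read_lock() */

	if (unlikely(vma->vm_mm != mm ||
		     address < vma->vm_start || address >= vma->vm_end)) {
		/* outside RCU now, so a mmgrab()/mmdrop() pair inside
		 * vma_refcount_put() would not run in atomic context */
		vma_end_read(vma);
		return NULL;
	}
	return vma;
```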

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-24 14:29                 ` Suren Baghdasaryan
@ 2025-07-24 14:45                   ` Vlastimil Babka
  2025-07-24 14:52                     ` Suren Baghdasaryan
  0 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2025-07-24 14:45 UTC (permalink / raw)
  To: Suren Baghdasaryan, Lorenzo Stoakes
  Cc: Jann Horn, Andrew Morton, Liam R. Howlett, Pedro Falcato,
	Linux-MM, kernel list

On 7/24/25 16:29, Suren Baghdasaryan wrote:
> On Thu, Jul 24, 2025 at 3:53 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
>>
>> On Thu, Jul 24, 2025 at 10:38:06AM +0200, Vlastimil Babka wrote:
>> > On 7/24/25 04:30, Suren Baghdasaryan wrote:
>> > > So, I think vma_refcount_put() can mmgrab(vma->mm) before calling
>> > > __refcount_dec_and_test(), to stabilize that mm and then mmdrop()
>> > > after it calls rcuwait_wake_up(). What do you think about this
>> > > approach, folks?
>> >
>> > Yeah except it would be wasteful to do for all vma_refcount_put(). Should be
>> > enough to have this version (as Jann suggested) for inval_end_read: part of
>> > lock_vma_under_rcu.
> 
> Yes, definitely.
> 
>> > I think we need it also for the vma_refcount_put() done
>> > in vma_start_read() when we fail the seqcount check? I think in that case
>> > the same thing can be happening too, just with different race windows?
> 
> Yes.
> 
>> >
>> > Also as Jann suggested, maybe it's not great (or even safe) to perform
>> > __mmdrop() under rcu? And maybe some vma_start_read() users are even more
>> > restricted? Maybe then we'd need to make __mmdrop_delayed() not RT-only, and
>> > use that.
>>
>> Agreed that doing this under RCU seems unwise.
>>
>> I know PTL relies on this as well as zap PTE page table reclaim, likely these
>> wouldn't interact with an mm going away (you'd hope!) but it seems unwise to
>> play around with assumptions here.
> 
> __mmdrop_delayed() is a viable option but I would hate adding
> mm->delayed_drop for !CONFIG_PREEMPT_RT just for this one case.
> Alternatively, we don't need to be in the rcu read section when we
> call vma_end_read() inside lock_vma_under_rcu(). We refcounted the
> vma, so it's locked and stable, no need for RCU at that point. We can
> move rcu_read_unlock() before vma_end_read(). However that's not the

Seems correct.

> case with the vma_refcount_put() inside vma_start_read(). We could
> change vma_start_read() semantics so that it drops rcu_read_lock
> before it returns.

Looks like there are no users of vma_start_read() other than
lock_vma_under_rcu() itself that we need to care about, so that seems fine.

> That way we could do vma_refcount_put() after
> rcu_read_unlock() in both places.  The retry case in
> lock_vma_under_rcu() would have to reacquire rcu_read_lock() but that
> retry is not the usual path, so should not affect overall locking
> performance. What do you think about this alternative?

Sounds feasible.

I guess at that point it would be cleaner to combine vma_start_read() with
mas_walk() into a new function that does both and starts with
rcu_read_lock() itself so we don't have the ugly scheme of entering under
rcu lock and returning without it?
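
Something like this, perhaps (sketch only; the function name and the two
helpers are invented):

```
static struct vm_area_struct *lock_vma_walk(struct mm_struct *mm,
					    unsigned long address)
{
	MA_STATE(mas, &mm->mm_mt, address, address);
	struct vm_area_struct *vma;

	rcu_read_lock();
	vma = mas_walk(&mas);
	if (!vma || !vma_refcount_try_get(vma)) {	/* invented helper */
		rcu_read_unlock();
		return NULL;
	}
	/* the refcount now keeps the vma (and thus its mm) stable */
	rcu_read_unlock();

	if (vma_lock_seq_stale(mm, vma)) {	/* invented: seqcount check */
		vma_refcount_put(vma);		/* outside RCU, mmdrop-safe */
		return NULL;
	}
	return vma;
}
```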


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-24  8:38             ` Vlastimil Babka
  2025-07-24 10:53               ` Lorenzo Stoakes
@ 2025-07-24 14:45               ` Jann Horn
  2025-07-24 16:36                 ` Suren Baghdasaryan
  1 sibling, 1 reply; 31+ messages in thread
From: Jann Horn @ 2025-07-24 14:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Pedro Falcato, Linux-MM, kernel list

On Thu, Jul 24, 2025 at 10:38 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> On 7/24/25 04:30, Suren Baghdasaryan wrote:
> > So, I think vma_refcount_put() can mmgrab(vma->mm) before calling
> > __refcount_dec_and_test(), to stabilize that mm and then mmdrop()
> > after it calls rcuwait_wake_up(). What do you think about this
> > approach, folks?
>
> Yeah except it would be wasteful to do for all vma_refcount_put(). Should be
> enough to have this version (as Jann suggested) for inval_end_read: part of
> lock_vma_under_rcu. I think we need it also for the vma_refcount_put() done
> in vma_start_read() when we fail the seqcount check? I think in that case
> the same thing can be happening too, just with different race windows?
>
> Also as Jann suggested, maybe it's not great (or even safe) to perform
> __mmdrop() under rcu? And maybe some vma_start_read() users are even more
> restricted? Maybe then we'd need to make __mmdrop_delayed() not RT-only, and
> use that.

FWIW, I think I have been mixing things up in my head - mmdrop_async()
exists, but this comment in free_signal_struct() explains that it exists
because __mmdrop() is not softirq-safe on x86, where the pgd_lock spinlock
does not disable IRQs.

/*
 * __mmdrop is not safe to call from softirq context on x86 due to
 * pgd_dtor so postpone it to the async context
 */

So I guess using mmdrop() here might actually be fine, since we're
just in atomic context, not in softirq.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-24  5:13         ` Lorenzo Stoakes
@ 2025-07-24 14:50           ` Jann Horn
  2025-07-24 14:56             ` Lorenzo Stoakes
  0 siblings, 1 reply; 31+ messages in thread
From: Jann Horn @ 2025-07-24 14:50 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
	Vlastimil Babka, Pedro Falcato, Linux-MM, kernel list

On Thu, Jul 24, 2025 at 7:13 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Wed, Jul 23, 2025 at 09:43:35PM +0200, Jann Horn wrote:
> > Sorry, while typing up this mail I realized I didn't have this stuff
> > particularly straight in my head myself when writing my previous mails
> > about this...
> >
> > On Wed, Jul 23, 2025 at 8:45 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > On Wed, Jul 23, 2025 at 08:30:30PM +0200, Jann Horn wrote:
> > > > On Wed, Jul 23, 2025 at 8:14 PM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > On Wed, Jul 23, 2025 at 06:26:53PM +0200, Jann Horn wrote:
> > > > > > There's a racy UAF in `vma_refcount_put()` when called on the
> > > > > > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > > > > > without sufficient protection against concurrent object reuse:
> > > > > >
> > > > > > lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under
> > > > > > rcu_read_lock(). At that point, the VMA may be concurrently freed, and
> > > > > > it can be recycled by another process. vma_start_read() then
> > > > > > increments the vma->vm_refcnt (if it is in an acceptable range), and
> > > > > > if this succeeds, vma_start_read() can return a recycled VMA. (As a
> > > > > > sidenote, this goes against what the surrounding comments above
> > > > > > vma_start_read() and in lock_vma_under_rcu() say - it would probably
> > > > > > be cleaner to perform the vma->vm_mm check inside vma_start_read().)
> > > > > >
> > > > > > In this scenario where the VMA has been recycled, lock_vma_under_rcu()
> > > > > > will then detect the mismatching ->vm_mm pointer and drop the VMA
> > > > > > through vma_end_read(), which calls vma_refcount_put().
> > > > >
> > > > > So in _correctly_ identifying the recycling, we then hit a problem. Fun!
> > > > >
> > > > > > vma_refcount_put() does this:
> > > > > >
> > > > > > ```
> > > > > > static inline void vma_refcount_put(struct vm_area_struct *vma)
> > > > > > {
> > > > > >         /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> > > > > >         struct mm_struct *mm = vma->vm_mm;
> > > > >
> > > > > Are we at a point where we _should_ be looking at a VMA with vma->vm_mm ==
> > > > > current->mm here?
> > > >
> > > > Well, you _hope_ to be looking at a VMA with vma->vm_mm==current->mm,
> > > > but if you lose a race it is intentional that you can end up with
> > > > another MM's VMA here.
>
> Right I get the SLAB_TYPESAFE_BY_RCU thing, what I'm saying overall is 'can we
> detect that we lost the race by knowing what mm this should be'...
>
> >
> > (I forgot: The mm passed to lock_vma_under_rcu() is potentially
> > different from current->mm if we're coming from uffd_mfill_lock(),
> > which would be intentional and desired, but that's not relevant here.
> > Sorry for making things more confusing.)
>
> ...and because of this, no we can't. I hate how uffd is implemented.

I mean, we are in a context where we're looking up a VMA under a
specific MM, so we know which MM the VMA should be from. And we have a
bailout that checks for this. It's just that by the time we can check
whether the MM matches the expected one, we've already grabbed the VMA.
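
Schematically, the ordering problem looks like this (a simplified
sketch of the lock_vma_under_rcu() flow, not verbatim mm/memory.c -
helper signatures approximate):

```
rcu_read_lock();
vma = mas_walk(&mas);            /* lockless lookup; the VMA may already
                                    be freed and recycled by another mm */
if (!vma)
        goto inval;
vma = vma_start_read(mm, vma);   /* takes a vm_refcnt reference *before*
                                    we can tell whose VMA this is */
if (!vma)
        goto inval;
/* only now can we detect that we grabbed a recycled VMA... */
if (unlikely(vma->vm_mm != mm)) {
        vma_end_read(vma);       /* ...and this put is where the
                                    mm_struct UAF can happen */
        goto inval;
}
```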

> > > Right so, we have:
> > >
> > > 'mm we meant to get' (which apparently can't be assumed to be current->mm)
> > > 'mm we actually got' (which may or may not be freed at any time)
> > >
> > > The _meant to get_ one might have eternal waiters. Or might not even need
> > > to be woken up.
> > >
> > > I don't see why keeping the 'actually got' one around really helps us? Am I
> > > missing something?
> >
> > We basically have taken a read lock on a VMA that is part of the
> > "actually got" MM, and so we may have caused writers from that MM to
> > block and sleep, and since we did that we have to wake them back up
> > and say "sorry, locked the wrong object, please continue".
>
> OK I think this is the crux of it then, and what I've been missing here -
> we have in effect taken a read lock _by mistake_ on the recycled VMA, a
> spurious lock that we need to immediately drop, but because of it we
> might have waiters that could wait forever.
>
> OK I get it. But to safely reference the mm here we need to be assured it
> stays around - otherwise we have nothing preventing that mm from going
> away, right?

Yes - as Suren explained, as long as we hold a reference to the VMA,
the MM also stays around, but to access the MM after dropping the VMA
we need to somehow grab a reference on the MM first.
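
Something like this untested sketch (the _mm_safe name is made up for
illustration):

```
/* Like vma_refcount_put(), but safe when the vma may belong to a foreign mm. */
static inline void vma_refcount_put_mm_safe(struct vm_area_struct *vma)
{
        struct mm_struct *mm = vma->vm_mm;
        int oldcnt;

        mmgrab(mm);     /* pin the mm while our vm_refcnt still pins the vma */
        rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
        if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
                if (is_vma_writer_only(oldcnt - 1))
                        rcuwait_wake_up(&mm->vma_writer_wait);
        }
        mmdrop(mm);     /* may free the mm; note the softirq caveat
                           discussed elsewhere in this thread */
}
```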

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-24 14:45                   ` Vlastimil Babka
@ 2025-07-24 14:52                     ` Suren Baghdasaryan
  0 siblings, 0 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2025-07-24 14:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Lorenzo Stoakes, Jann Horn, Andrew Morton, Liam R. Howlett,
	Pedro Falcato, Linux-MM, kernel list

On Thu, Jul 24, 2025 at 2:45 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/24/25 16:29, Suren Baghdasaryan wrote:
> > On Thu, Jul 24, 2025 at 3:53 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> >>
> >> On Thu, Jul 24, 2025 at 10:38:06AM +0200, Vlastimil Babka wrote:
> >> > On 7/24/25 04:30, Suren Baghdasaryan wrote:
> >> > > So, I think vma_refcount_put() can mmgrab(vma->vm_mm) before calling
> >> > > __refcount_dec_and_test(), to stabilize that mm and then mmdrop()
> >> > > after it calls rcuwait_wake_up(). What do you think about this
> >> > > approach, folks?
> >> >
> >> > Yeah except it would be wasteful to do for all vma_refcount_put(). Should be
> >> > enough to have this version (as Jann suggested) for inval_end_read: part of
> >> > lock_vma_under_rcu.
> >
> > Yes, definitely.
> >
> >> > I think we need it also for the vma_refcount_put() done
> >> > in vma_start_read() when we fail the seqcount check? I think in that case
> >> > the same thing can be happening too, just with different race windows?
> >
> > Yes.
> >
> >> >
> >> > Also as Jann suggested, maybe it's not great (or even safe) to perform
> >> > __mmdrop() under rcu? And maybe some vma_start_read() users are even more
> >> > restricted? Maybe then we'd need to make __mmdrop_delayed() not RT-only, and
> >> > use that.
> >>
> >> Agreed that doing this under RCU seems unwise.
> >>
> >> I know the PTL relies on this, as does zap PTE page table reclaim; likely
> >> these wouldn't interact with an mm going away (you'd hope!), but it seems
> >> unwise to play around with assumptions here.
> >
> > __mmdrop_delayed() is a viable option but I would hate adding
> > mm->delayed_drop for !CONFIG_PREEMPT_RT just for this one case.
> > Alternatively, we don't need to be in the rcu read section when we
> > call vma_end_read() inside lock_vma_under_rcu(). We refcounted the
> > vma, so it's locked and stable, no need for RCU at that point. We can
> > move rcu_read_unlock() before vma_end_read(). However that's not the
>
> Seems correct.
>
> > case with the vma_refcount_put() inside vma_start_read(). We could
> > change vma_start_read() semantics so that it drops rcu_read_lock
> > before it returns.
>
> Looks like there's no other users of vma_start_read() than
> lock_vma_under_rcu() itself that we need to care about, so seems fine.
>
> > That way we could do vma_refcount_put() after
> > rcu_read_unlock() in both places.  The retry case in
> > lock_vma_under_rcu() would have to reacquire rcu_read_lock() but that
> > retry is not the usual path, so should not affect overall locking
> > performance. What do you think about this alternative?
>
> Sounds feasible.
>
> I guess at that point it would be cleaner to combine vma_start_read() with
> mas_walk() into a new function that does both and starts with
> rcu_read_lock() itself so we don't have the ugly scheme of entering under
> rcu lock and returning without it?

Yes, I was thinking along the same lines, and maybe we can also move
the vma->vm_mm != mm checks there too, as Jann suggested. I'll see which
way looks better and, if I get it to a satisfactory state, will post a
patch.
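
For concreteness, the reordering discussed above would be roughly this
(hand-written sketch, not the actual patch):

```
        /*
         * vm_refcnt now pins both the vma and, transitively, its mm, so
         * we no longer need the RCU read lock to keep them alive.
         */
        rcu_read_unlock();
        if (unlikely(vma->vm_mm != mm)) {
                vma_end_read(vma);      /* final mmdrop() runs outside RCU */
                rcu_read_lock();        /* the rare retry path re-enters RCU */
                goto retry;
        }
```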
Thanks!

>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-24 14:50           ` Jann Horn
@ 2025-07-24 14:56             ` Lorenzo Stoakes
  0 siblings, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 14:56 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
	Vlastimil Babka, Pedro Falcato, Linux-MM, kernel list

On Thu, Jul 24, 2025 at 04:50:49PM +0200, Jann Horn wrote:
> On Thu, Jul 24, 2025 at 7:13 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Wed, Jul 23, 2025 at 09:43:35PM +0200, Jann Horn wrote:
> > > Sorry, while typing up this mail I realized I didn't have this stuff
> > > particularly straight in my head myself when writing my previous mails
> > > about this...
> > >
> > > On Wed, Jul 23, 2025 at 8:45 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > On Wed, Jul 23, 2025 at 08:30:30PM +0200, Jann Horn wrote:
> > > > > On Wed, Jul 23, 2025 at 8:14 PM Lorenzo Stoakes
> > > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > > On Wed, Jul 23, 2025 at 06:26:53PM +0200, Jann Horn wrote:
> > > > > > > There's a racy UAF in `vma_refcount_put()` when called on the
> > > > > > > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > > > > > > without sufficient protection against concurrent object reuse:
> > > > > > >
> > > > > > > lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under
> > > > > > > rcu_read_lock(). At that point, the VMA may be concurrently freed, and
> > > > > > > it can be recycled by another process. vma_start_read() then
> > > > > > > increments the vma->vm_refcnt (if it is in an acceptable range), and
> > > > > > > if this succeeds, vma_start_read() can return a recycled VMA. (As a
> > > > > > > sidenote, this goes against what the surrounding comments above
> > > > > > > vma_start_read() and in lock_vma_under_rcu() say - it would probably
> > > > > > > be cleaner to perform the vma->vm_mm check inside vma_start_read().)
> > > > > > >
> > > > > > > In this scenario where the VMA has been recycled, lock_vma_under_rcu()
> > > > > > > will then detect the mismatching ->vm_mm pointer and drop the VMA
> > > > > > > through vma_end_read(), which calls vma_refcount_put().
> > > > > >
> > > > > > So in _correctly_ identifying the recycling, we then hit a problem. Fun!
> > > > > >
> > > > > > > vma_refcount_put() does this:
> > > > > > >
> > > > > > > ```
> > > > > > > static inline void vma_refcount_put(struct vm_area_struct *vma)
> > > > > > > {
> > > > > > >         /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> > > > > > >         struct mm_struct *mm = vma->vm_mm;
> > > > > >
> > > > > > Are we at a point where we _should_ be looking at a VMA with vma->vm_mm ==
> > > > > > current->mm here?
> > > > >
> > > > > Well, you _hope_ to be looking at a VMA with vma->vm_mm==current->mm,
> > > > > but if you lose a race it is intentional that you can end up with
> > > > > another MM's VMA here.
> >
> > Right, I get the SLAB_TYPESAFE_BY_RCU thing; what I'm saying overall is 'can we
> > detect that we lost the race by knowing what mm this should be'...
> >
> > >
> > > (I forgot: The mm passed to lock_vma_under_rcu() is potentially
> > > different from current->mm if we're coming from uffd_mfill_lock(),
> > > which would be intentional and desired, but that's not relevant here.
> > > Sorry for making things more confusing.)
> >
> > ...and because of this, no we can't. I hate how uffd is implemented.
>
> I mean, we are in a context where we're looking up a VMA under a
> specific MM, so we know which MM the VMA should be from. And we have a
> bailout that checks for this. It's just that by the time we can check
> whether the MM matches the expected one, we've already grabbed the VMA.

OK.

>
> > > > Right so, we have:
> > > >
> > > > 'mm we meant to get' (which apparently can't be assumed to be current->mm)
> > > > 'mm we actually got' (which may or may not be freed at any time)
> > > >
> > > > The _meant to get_ one might have eternal waiters. Or might not even need
> > > > to be woken up.
> > > >
> > > > I don't see why keeping the 'actually got' one around really helps us? Am I
> > > > missing something?
> > >
> > > We basically have taken a read lock on a VMA that is part of the
> > > "actually got" MM, and so we may have caused writers from that MM to
> > > block and sleep, and since we did that we have to wake them back up
> > > and say "sorry, locked the wrong object, please continue".
> >
> > OK I think this is the crux of it then, and what I've been missing here -
> > we have in effect taken a read lock _by mistake_ on the recycled VMA, a
> > spurious lock that we need to immediately drop, but because of it we
> > might have waiters that could wait forever.
> >
> > OK I get it. But to safely reference the mm here we need to be assured it
> > stays around - otherwise we have nothing preventing that mm from going
> > away, right?
>
> Yes - as Suren explained, as long as we hold a reference to the VMA,
> the MM also stays around, but to access the MM after dropping the VMA
> we need to somehow grab a reference on the MM first.

OK.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-24 14:45               ` Jann Horn
@ 2025-07-24 16:36                 ` Suren Baghdasaryan
  2025-07-28 17:14                   ` Suren Baghdasaryan
  0 siblings, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2025-07-24 16:36 UTC (permalink / raw)
  To: Jann Horn
  Cc: Vlastimil Babka, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Pedro Falcato, Linux-MM, kernel list

On Thu, Jul 24, 2025 at 2:45 PM Jann Horn <jannh@google.com> wrote:
>
> On Thu, Jul 24, 2025 at 10:38 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > On 7/24/25 04:30, Suren Baghdasaryan wrote:
> > > So, I think vma_refcount_put() can mmgrab(vma->vm_mm) before calling
> > > __refcount_dec_and_test(), to stabilize that mm and then mmdrop()
> > > after it calls rcuwait_wake_up(). What do you think about this
> > > approach, folks?
> >
> > Yeah except it would be wasteful to do for all vma_refcount_put(). Should be
> > enough to have this version (as Jann suggested) for inval_end_read: part of
> > lock_vma_under_rcu. I think we need it also for the vma_refcount_put() done
> > in vma_start_read() when we fail the seqcount check? I think in that case
> > the same thing can be happening too, just with different race windows?
> >
> > Also as Jann suggested, maybe it's not great (or even safe) to perform
> > __mmdrop() under rcu? And maybe some vma_start_read() users are even more
> > restricted? Maybe then we'd need to make __mmdrop_delayed() not RT-only, and
> > use that.
>
> FWIW, I think I have been mixing things up in my head - mmdrop_async()
> exists, but this comment in free_signal_struct() explains why: __mmdrop()
> is not softirq-safe on x86, because the pgd_lock spinlock does not
> disable IRQs.
>
> /*
>  * __mmdrop is not safe to call from softirq context on x86 due to
>  * pgd_dtor so postpone it to the async context
>  */
>
> So I guess using mmdrop() here might actually be fine, since we're
> just in atomic context, not in softirq.

Thanks for looking more into this. Even if it's safe, I would still
prefer to call mmdrop() outside of the RCU read section. The code might
actually end up cleaner that way too.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
  2025-07-24 16:36                 ` Suren Baghdasaryan
@ 2025-07-28 17:14                   ` Suren Baghdasaryan
  0 siblings, 0 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2025-07-28 17:14 UTC (permalink / raw)
  To: Jann Horn
  Cc: Vlastimil Babka, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Pedro Falcato, Linux-MM, kernel list

On Thu, Jul 24, 2025 at 9:36 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Jul 24, 2025 at 2:45 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Thu, Jul 24, 2025 at 10:38 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > On 7/24/25 04:30, Suren Baghdasaryan wrote:
> > > > So, I think vma_refcount_put() can mmgrab(vma->vm_mm) before calling
> > > > __refcount_dec_and_test(), to stabilize that mm and then mmdrop()
> > > > after it calls rcuwait_wake_up(). What do you think about this
> > > > approach, folks?
> > >
> > > Yeah except it would be wasteful to do for all vma_refcount_put(). Should be
> > > enough to have this version (as Jann suggested) for inval_end_read: part of
> > > lock_vma_under_rcu. I think we need it also for the vma_refcount_put() done
> > > in vma_start_read() when we fail the seqcount check? I think in that case
> > > the same thing can be happening too, just with different race windows?
> > >
> > > Also as Jann suggested, maybe it's not great (or even safe) to perform
> > > __mmdrop() under rcu? And maybe some vma_start_read() users are even more
> > > restricted? Maybe then we'd need to make __mmdrop_delayed() not RT-only, and
> > > use that.
> >
> > FWIW, I think I have been mixing things up in my head - mmdrop_async()
> > exists, but this comment in free_signal_struct() explains why: __mmdrop()
> > is not softirq-safe on x86, because the pgd_lock spinlock does not
> > disable IRQs.
> >
> > /*
> >  * __mmdrop is not safe to call from softirq context on x86 due to
> >  * pgd_dtor so postpone it to the async context
> >  */
> >
> > So I guess using mmdrop() here might actually be fine, since we're
> > just in atomic context, not in softirq.
>
> Thanks for looking more into this. Even if it's safe, I would still
> prefer to call mmdrop() outside of the RCU read section. The code might
> actually end up cleaner that way too.

Sorry for the delay; I got some time over the weekend to work on this.
Unfortunately vma_start_read() is used in one more place in
mm-unstable, and that caller uses vma iterators for the lookup, so
combining the lookup with vma_start_read() is not as clean as we
thought. After trying a couple of ways to fix this, I decided to
follow the KISS principle. The fix is posted at
https://lore.kernel.org/all/20250728170950.2216966-1-surenb@google.com/.
Reviews and feedback are appreciated.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2025-07-28 17:14 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-23 16:26 [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU Jann Horn
2025-07-23 17:32 ` Vlastimil Babka
2025-07-23 17:49   ` Jann Horn
2025-07-23 17:55     ` Suren Baghdasaryan
2025-07-23 19:01       ` Lorenzo Stoakes
2025-07-24  0:13         ` Suren Baghdasaryan
2025-07-24  4:40           ` Lorenzo Stoakes
2025-07-23 18:10     ` Vlastimil Babka
2025-07-23 18:19       ` Jann Horn
2025-07-23 18:39         ` Lorenzo Stoakes
2025-07-23 19:52           ` Jann Horn
2025-07-23 20:00             ` Jann Horn
2025-07-24  5:24               ` Lorenzo Stoakes
2025-07-24  5:23             ` Lorenzo Stoakes
2025-07-23 20:27         ` Suren Baghdasaryan
2025-07-24  2:30           ` Suren Baghdasaryan
2025-07-24  8:38             ` Vlastimil Babka
2025-07-24 10:53               ` Lorenzo Stoakes
2025-07-24 14:29                 ` Suren Baghdasaryan
2025-07-24 14:45                   ` Vlastimil Babka
2025-07-24 14:52                     ` Suren Baghdasaryan
2025-07-24 14:45               ` Jann Horn
2025-07-24 16:36                 ` Suren Baghdasaryan
2025-07-28 17:14                   ` Suren Baghdasaryan
2025-07-23 18:14 ` Lorenzo Stoakes
2025-07-23 18:30   ` Jann Horn
2025-07-23 18:45     ` Lorenzo Stoakes
2025-07-23 19:43       ` Jann Horn
2025-07-24  5:13         ` Lorenzo Stoakes
2025-07-24 14:50           ` Jann Horn
2025-07-24 14:56             ` Lorenzo Stoakes
