* Oops in secretmem_fault()
@ 2025-10-30 15:34 Big Sleep
2025-10-31 1:36 ` Andrew Morton
2025-10-31 9:18 ` [PATCH 1/1] mm/secretmem: fix use-after-free race in fault handler Lance Yang
0 siblings, 2 replies; 10+ messages in thread
From: Big Sleep @ 2025-10-30 15:34 UTC (permalink / raw)
To: Mike Rapoport, Andrew Morton
Cc: linux-mm, linux-kernel, Matthew Wilcox (Oracle), Lorenzo Stoakes
Hello Mike and Andrew,
we found a bug in secretmem_fault() - please see below for details!
--Google Big Sleep
## Bug Details
Unprivileged code can provoke a kernel Oops by exploiting a race
condition in the page fault handler for the `memfd_secret(2)`
feature. Through the race condition, there is a short time window in
which a newly allocated page can be missing from the direct map after
it is acquired, and when the allocating code attempts to access the
page in the direct map, it results in an unhandleable page fault Oops.
**We believe that the bug has no security impact.**
## Analysis
When a page fault occurs in a secret memory file created with
`memfd_secret(2)`, the kernel will allocate a new folio for it, mark
the underlying page as not-present in the direct map, and add it to
the file mapping.
If two tasks cause a fault in the same page concurrently, both could
end up allocating a folio and removing the page from the direct map,
but only one would succeed in adding the folio to the file
mapping. The task that failed undoes the effects of its attempt by (a)
freeing the folio again and (b) putting the page back into the direct
map. However, by doing these two operations in this order, the page
becomes available to the allocator again before it is placed back in
the direct mapping.
If another task attempts to allocate the page between (a) and (b), and
the kernel tries to access it via the direct map, it would result in a
supervisor not-present page fault.
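Schematically, the interleaving looks like this (the task names are ours and purely illustrative; "C" can be any task that happens to allocate memory at the wrong moment):
```
/*
 * Illustrative interleaving:
 *
 *   A: filemap_lock_folio()             -> error, enters slow path
 *   B: filemap_lock_folio()             -> error, enters slow path
 *   A: folio_alloc()
 *   A: set_direct_map_invalid_noflush()
 *   B: folio_alloc()
 *   B: set_direct_map_invalid_noflush()
 *   A: filemap_add_folio()              -> 0, wins the race
 *   B: filemap_add_folio()              -> -EEXIST
 *   B: (a) folio_put()                  -> B's page goes back to the page
 *                                          allocator, still unmapped
 *   C: allocates that page and touches it via the direct map => Oops
 *   B: (b) set_direct_map_default_noflush()   (too late)
 */
```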
The relevant code is in `secretmem_fault()` in `mm/secretmem.c`:
```
static vm_fault_t secretmem_fault(struct vm_fault *vmf)
{
	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
	struct inode *inode = file_inode(vmf->vma->vm_file);
	pgoff_t offset = vmf->pgoff;
	gfp_t gfp = vmf->gfp_mask;
	unsigned long addr;
	struct folio *folio;
	vm_fault_t ret;
	int err;

	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
		return vmf_error(-EINVAL);

	filemap_invalidate_lock_shared(mapping);

retry:
	folio = filemap_lock_folio(mapping, offset);			// (0)
	if (IS_ERR(folio)) {
		folio = folio_alloc(gfp | __GFP_ZERO, 0);
		if (!folio) {
			ret = VM_FAULT_OOM;
			goto out;
		}

		err = set_direct_map_invalid_noflush(folio_page(folio, 0));	// (1)
		if (err) {
			folio_put(folio);
			ret = vmf_error(err);
			goto out;
		}

		__folio_mark_uptodate(folio);
		err = filemap_add_folio(mapping, folio, offset, gfp);		// (2)
		if (unlikely(err)) {
			folio_put(folio);					// (3)
			/*
			 * If a split of large page was required, it
			 * already happened when we marked the page invalid
			 * which guarantees that this call won't fail
			 */
			set_direct_map_default_noflush(folio_page(folio, 0));	// (4)
			if (err == -EEXIST)
				goto retry;

			ret = vmf_error(err);
			goto out;
		}

		addr = (unsigned long)folio_address(folio);
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
	}

	vmf->page = folio_file_page(folio, vmf->pgoff);
	ret = VM_FAULT_LOCKED;

out:
	filemap_invalidate_unlock_shared(mapping);
	return ret;
}
```
When two tasks cause a page fault concurrently in the same page for
the first time, the call to `filemap_lock_folio()` (0) will return an
error, so both tasks will enter the conditional. Both tasks will call
`folio_alloc()` and `set_direct_map_invalid_noflush(folio_page(folio,
0));` to remove the page from the direct map (1).
Only one task will succeed in the call to `filemap_add_folio(mapping,
folio, offset, gfp);` (2), while the other will get `-EEXIST`.
The failing task then calls `folio_put(folio)` to free the folio (3),
making the underlying page available for allocation. Immediately
after, the code calls `set_direct_map_default_noflush()` to mark the
page as present in the direct map (4), but at this point the page
could have been allocated elsewhere. The code should not use the folio
after the call to `folio_put(folio)` in (3).
## Affected Versions
The issue has been successfully reproduced in the following Linux
versions:
* v5.15.195 (longterm)
* v6.17.6 (stable)
When spot-checking the versions in between, we were also able to
reproduce the problem.
## Reproduction
Our reproducer is written for x86_64.
### Reproducer Code
We can reproduce the bug with the following self-contained C++
program:
```
#include <err.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#include <atomic>
#include <barrier>
#include <thread>
#include <vector>

#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447
#endif

// Raw syscall wrapper, in case the libc in use does not provide memfd_secret().
static int memfd_secret_syscall(unsigned int flags) {
  return syscall(SYS_memfd_secret, flags);
}

constexpr size_t kPageSz = 4096;
constexpr size_t kMemFdSz = kPageSz * 1;
constexpr size_t kNumIters = 10000000;
constexpr size_t kNumThreads = 2;
constexpr size_t kMmapSz = kPageSz * 512 * 512 * 5;
constexpr size_t kAccessStep = kPageSz * 512;
constexpr size_t kNumMappers = 1;

int main(void) {
  std::atomic<bool> should_stop = false;
  std::vector<std::thread> mappers;
  std::atomic<size_t> bw = kNumThreads;

  // Mapper threads keep mapping and touching fresh anonymous memory so that
  // freshly freed pages (including the one freed in the race window) are
  // handed out again quickly.
  for (size_t i = 0; i < kNumMappers; i++) {
    mappers.emplace_back([&should_stop]() {
      while (!should_stop.load()) {
        void* addr = mmap(nullptr, kMmapSz, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
          err(EXIT_FAILURE, "mmap");
        }
        uintptr_t mapping = reinterpret_cast<uintptr_t>(addr);
        for (size_t target = mapping; target < mapping + kMmapSz;
             target += kAccessStep) {
          __asm__ volatile("movl $1, (%0)" ::"r"(target));
        }
        munmap(addr, kMmapSz);
      }
    });
  };

  size_t last = 1;
  for (size_t i = 0; i < kNumIters; ++i) {
    // Print progress roughly every time the iteration count doubles.
    if (i >= last * 2) {
      last = i;
      fprintf(stderr, "Iteration %zu\n", i);
    }
    int fd = memfd_secret_syscall(0);
    if (fd == -1) {
      err(EXIT_FAILURE, "memfd_secret");
    }
    if (ftruncate(fd, kMemFdSz) == -1) {
      err(EXIT_FAILURE, "ftruncate");
    }
    void* addr =
        mmap(nullptr, kMemFdSz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
      err(EXIT_FAILURE, "mmap");
    }
    // The threads spin on `bw` so that they fault the same secretmem page as
    // close to simultaneously as possible.
    std::vector<std::thread> threads;
    uintptr_t target = reinterpret_cast<uintptr_t>(addr);
    for (size_t i = 0; i < kNumThreads; i++) {
      threads.emplace_back([&bw, target]() {
        bw--;
        while (bw.load() != 0) {
          __asm__ volatile("nop");
        }
        __asm__ volatile("movl $1, (%0)" ::"r"(target));
      });
    }
    for (auto& thread : threads) {
      thread.join();
    }
    bw = kNumThreads;
    munmap(addr, kMemFdSz);
    close(fd);
  }

  should_stop = true;
  for (auto& mapper : mappers) {
    mapper.join();
  }
  return 0;
}
```
### Build Instructions
#### Reproducer
The reproducer program can be built using C++20:
```
$ CXX=clang++ CXXFLAGS=--std=c++20 make repro
```
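Equivalently, without a Makefile (assuming the source is saved as `repro.cpp`; depending on the toolchain, `-pthread` may be needed for `std::thread`):
```
$ clang++ --std=c++20 -pthread -o repro repro.cpp
```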
#### Kernel
To make the race condition easier to reproduce, we add two
`mdelay(100)` calls in `secretmem_fault()`:
* One `mdelay()` before the call to `filemap_add_folio()` to simplify
the race condition between the two faulting tasks.
* The other `mdelay()` before `set_direct_map_default_noflush()` to
simplify the race condition between the task in the error case and
another task that allocates memory.
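Roughly, the delays go in like this (illustrative placement only, not the exact patch; `mdelay()` additionally needs `#include <linux/delay.h>`):
```
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ ... @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 		__folio_mark_uptodate(folio);
+		mdelay(100);	/* let both faulting tasks reach filemap_add_folio() */
 		err = filemap_add_folio(mapping, folio, offset, gfp);
 		if (unlikely(err)) {
 			folio_put(folio);
+			mdelay(100);	/* hold open the freed-but-unmapped window */
 			/*
 			 * If a split of large page was required, ...
 			 */
 			set_direct_map_default_noflush(folio_page(folio, 0));
```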
We build Linux with a default KVM guest configuration, with
`CONFIG_PREEMPT`:
```
make defconfig kvm_guest.config
echo CONFIG_PREEMPT=y >> .config
make olddefconfig
make bzImage
```
### Command
#### Kernel
We ran the test kernels in QEMU with the following invocation:
```
qemu-system-x86_64 -nographic -m 8G -smp 4 -net nic,model=e1000 \
    -enable-kvm \
    -append "root=/dev/sda1 kernel.sysrq=0 console=ttyS0 selinux=0 secretmem.enable=y" \
    -action "reboot=shutdown,shutdown=poweroff" \
    -kernel "${BZIMAGE}" -snapshot \
    -hda ${HOME}/rootdisk.qcow2 \
    -net "user,hostfwd=tcp::10022-:22"
```
The root disk is a simple Syzkaller-like image.
#### Userspace
From within the booted kernel, run the reproducer as an unprivileged
user:
```
$ ./repro
```
### Crash Report
Note: The crash happens when another task attempts to allocate and
access the page. It is hard to control exactly which task that will
be, so the crashes can look different.
```
[ 18.126690] BUG: unable to handle page fault for address: ffff9a3d060de000
[ 18.128071] #PF: supervisor write access in kernel mode
[ 18.129101] #PF: error_code(0x0002) - not-present page
[ 18.130117] PGD 1b7c01067 P4D 1b7c01067 PUD 23ffff067 PMD 1061a8063
PTE 800ffffef9f21020
[ 18.131714] Oops: Oops: 0002 [#1] SMP NOPTI
[ 18.132550] CPU: 3 UID: 0 PID: 0 Comm: swapper/3 Not tainted
6.17.6-00001-g40e6d463b4d0 #2 PREEMPT(full)
[ 18.134417] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 18.136358] RIP: 0010:ioread32_rep+0x44/0x60
[ 18.137234] Code: 96 04 8b 10 89 16 48 83 c6 04 48 39 ce 75 f3 c3
cc cc cc cc c3 cc cc cc cc 48 81 ff 00 00 01 00 76 0f 48 89 d1 48 89
f7 89 c2 <f3> 6d c3 cc cc f
[ 18.140937] RSP: 0018:ffffa61780140e10 EFLAGS: 00010002
[ 18.141965] RAX: 0000000000010170 RBX: 0000000000000008 RCX: 0000000000000002
[ 18.143363] RDX: 0000000000010170 RSI: ffff9a3d060de000 RDI: ffff9a3d060de000
[ 18.144765] RBP: ffff9a3d060de000 R08: 0000000000000008 R09: ffff9a3d01298130
[ 18.146179] R10: ffff9a3d002fbb00 R11: ffffa61780140ff8 R12: 0000000000010170
[ 18.147607] R13: 0000000000000000 R14: 0000000000000008 R15: ffff9a3d01298130
[ 18.148794] FS: 0000000000000000(0000) GS:ffff9a3e81469000(0000)
knlGS:0000000000000000
[ 18.150053] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 18.150957] CR2: ffff9a3d060de000 CR3: 0000000101aa7000 CR4: 00000000000006f0
[ 18.152095] Call Trace:
[ 18.152515] <IRQ>
[ 18.152842] ata_sff_data_xfer32+0x88/0x170
[ 18.153518] ata_sff_hsm_move+0x466/0x9b0
[ 18.154146] __ata_sff_port_intr+0x94/0x150
[ 18.154817] ata_bmdma_port_intr+0x33/0x190
[ 18.155490] ata_bmdma_interrupt+0xcc/0x1e0
[ 18.156144] __handle_irq_event_percpu+0x45/0x1a0
[ 18.156909] handle_irq_event+0x33/0x80
[ 18.157536] handle_edge_irq+0xc2/0x1b0
[ 18.158139] __common_interrupt+0x40/0xd0
[ 18.158785] common_interrupt+0x7a/0x90
[ 18.159436] </IRQ>
[ 18.159788] <TASK>
[ 18.160140] asm_common_interrupt+0x26/0x40
[ 18.160836] RIP: 0010:pv_native_safe_halt+0xf/0x20
[ 18.161612] Code: 11 86 00 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d e5 62 1e
00 fb f4 <c3> cc cc cc cc 0
[ 18.164562] RSP: 0018:ffffa617800b3ee0 EFLAGS: 00000206
[ 18.165396] RAX: ffff9a3e81469000 RBX: ffff9a3d002fbb00 RCX: 0000000433dd2480
[ 18.166541] RDX: 4000000000000000 RSI: 0000000000000002 RDI: 000000000000a7c4
[ 18.167684] RBP: 0000000000000003 R08: 000000000000a7c4 R09: 0000000433dd2480
[ 18.168804] R10: 0000000433dd2480 R11: ffff9a3d001f5900 R12: 0000000000000000
[ 18.169930] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 18.171051] default_idle+0x9/0x10
[ 18.171617] default_idle_call+0x2b/0x100
[ 18.172268] do_idle+0x1ca/0x230
[ 18.172792] cpu_startup_entry+0x24/0x30
[ 18.173432] start_secondary+0xf3/0x100
[ 18.174049] common_startup_64+0x13e/0x148
[ 18.174717] </TASK>
[ 18.175081] Modules linked in:
[ 18.175597] CR2: ffff9a3d060de000
[ 18.176132] ---[ end trace 0000000000000000 ]---
[ 18.176888] RIP: 0010:ioread32_rep+0x44/0x60
[ 18.177589] Code: 96 04 8b 10 89 16 48 83 c6 04 48 39 ce 75 f3 c3
cc cc cc cc c3 cc cc cc cc 48 81 ff 00 00 01 00 76 0f 48 89 d1 48 89
f7 89 c2 <f3> 6d c3 cc cc f
[ 18.180884] RSP: 0018:ffffa61780140e10 EFLAGS: 00010002
[ 18.181911] RAX: 0000000000010170 RBX: 0000000000000008 RCX: 0000000000000002
[ 18.183446] RDX: 0000000000010170 RSI: ffff9a3d060de000 RDI: ffff9a3d060de000
[ 18.184862] RBP: ffff9a3d060de000 R08: 0000000000000008 R09: ffff9a3d01298130
[ 18.186309] R10: ffff9a3d002fbb00 R11: ffffa61780140ff8 R12: 0000000000010170
[ 18.187741] R13: 0000000000000000 R14: 0000000000000008 R15: ffff9a3d01298130
[ 18.189172] FS: 0000000000000000(0000) GS:ffff9a3e81469000(0000)
knlGS:0000000000000000
[ 18.190804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 18.191791] CR2: ffff9a3d060de000 CR3: 0000000101aa7000 CR4: 00000000000006f0
[ 18.193164] Kernel panic - not syncing: Fatal exception in interrupt
[ 18.194272] Kernel Offset: 0x33200000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 18.195998] ---[ end Kernel panic - not syncing: Fatal exception in
interrupt ]---
```
## Reporter Credit
Google Big Sleep
## Disclosure Policy
Our assessment concluded that this finding has NO security
impact. Therefore, we are NOT applying a disclosure deadline to this
report.
For more information, visit
[https://goo.gle/bigsleep](https://goo.gle/bigsleep)
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Oops in secretmem_fault()
2025-10-30 15:34 Oops in secretmem_fault() Big Sleep
@ 2025-10-31 1:36 ` Andrew Morton
2025-10-31 11:48 ` Big Sleep
2025-10-31 9:18 ` [PATCH 1/1] mm/secretmem: fix use-after-free race in fault handler Lance Yang
1 sibling, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2025-10-31 1:36 UTC (permalink / raw)
To: Big Sleep
Cc: Mike Rapoport, linux-mm, linux-kernel, Matthew Wilcox (Oracle),
Lorenzo Stoakes
On Thu, 30 Oct 2025 16:34:29 +0100 Big Sleep <big-sleep-vuln-reports@google.com> wrote:
> Hello Mike and Andrew,
>
> we found a bug in secretmem_fault() - please see below for details!
>
> --Google Big Sleep
Didn't know about this - it looks neat.
https://issuetracker.google.com/issues/430375499
: Big Sleep is a collaboration between Google Project Zero and Google
: DeepMind to build an agentic AI system to help automate software
: vulnerability research
>
> ## Reporter Credit
>
> Google Big Sleep
>
You might want to include a token here so we (you!) can track the
report through to its resolution. See what the syzbot people are
doing. For example,
https://lkml.rescloud.iu.edu/2408.2/07972.html included:
: IMPORTANT: if you fix the issue, please add the following tag to the commit:
: Reported-by: syzbot+5054473a31f78f735416@syzkaller.appspotmail.com
and when Dmitry fixed this he included that info in the patch metadata:
https://lore.kernel.org/all/20251028101447.693289-1-dmantipov@yandex.ru/T/#u
and that Reported-by: will be carried all the way into the mainline tree.
btw, it would be nice to Cc some human on these reports. One cannot
be very confident that emails sent to big-sleep-vuln-reports@google.com
will actually be read by someone.
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/1] mm/secretmem: fix use-after-free race in fault handler
2025-10-30 15:34 Oops in secretmem_fault() Big Sleep
2025-10-31 1:36 ` Andrew Morton
@ 2025-10-31 9:18 ` Lance Yang
2025-10-31 9:59 ` Mike Rapoport
1 sibling, 1 reply; 10+ messages in thread
From: Lance Yang @ 2025-10-31 9:18 UTC (permalink / raw)
To: akpm
Cc: big-sleep-vuln-reports, linux-kernel, linux-mm, lorenzo.stoakes,
rppt, willy, david, stable, Lance Yang
From: Lance Yang <lance.yang@linux.dev>
The error path in secretmem_fault() frees a folio before restoring its
direct map status, which is a race leading to a panic.
Fix the ordering to restore the map before the folio is freed.
Cc: <stable@vger.kernel.org>
Reported-by: Google Big Sleep <big-sleep-vuln-reports@google.com>
Closes: https://lore.kernel.org/linux-mm/CAEXGt5QeDpiHTu3K9tvjUTPqo+d-=wuCNYPa+6sWKrdQJ-ATdg@mail.gmail.com/
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
mm/secretmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/secretmem.c b/mm/secretmem.c
index c1bd9a4b663d..37f6d1097853 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -82,13 +82,13 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 		__folio_mark_uptodate(folio);
 		err = filemap_add_folio(mapping, folio, offset, gfp);
 		if (unlikely(err)) {
-			folio_put(folio);
 			/*
 			 * If a split of large page was required, it
 			 * already happened when we marked the page invalid
 			 * which guarantees that this call won't fail
 			 */
 			set_direct_map_default_noflush(folio_page(folio, 0));
+			folio_put(folio);
 			if (err == -EEXIST)
 				goto retry;
--
2.49.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 1/1] mm/secretmem: fix use-after-free race in fault handler
2025-10-31 9:18 ` [PATCH 1/1] mm/secretmem: fix use-after-free race in fault handler Lance Yang
@ 2025-10-31 9:59 ` Mike Rapoport
2025-10-31 10:19 ` David Hildenbrand
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Mike Rapoport @ 2025-10-31 9:59 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, big-sleep-vuln-reports, linux-kernel, linux-mm,
lorenzo.stoakes, willy, david, stable
On Fri, Oct 31, 2025 at 05:18:18PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> The error path in secretmem_fault() frees a folio before restoring its
> direct map status, which is a race leading to a panic.
Let's use the issue description from the report:
When a page fault occurs in a secret memory file created with
`memfd_secret(2)`, the kernel will allocate a new folio for it, mark
the underlying page as not-present in the direct map, and add it to
the file mapping.
If two tasks cause a fault in the same page concurrently, both could
end up allocating a folio and removing the page from the direct map,
but only one would succeed in adding the folio to the file
mapping. The task that failed undoes the effects of its attempt by (a)
freeing the folio again and (b) putting the page back into the direct
map. However, by doing these two operations in this order, the page
becomes available to the allocator again before it is placed back in
the direct mapping.
If another task attempts to allocate the page between (a) and (b), and
the kernel tries to access it via the direct map, it would result in a
supervisor not-present page fault.
> Fix the ordering to restore the map before the folio is freed.
... restore the direct map
With these changes
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>
> Cc: <stable@vger.kernel.org>
> Reported-by: Google Big Sleep <big-sleep-vuln-reports@google.com>
> Closes: https://lore.kernel.org/linux-mm/CAEXGt5QeDpiHTu3K9tvjUTPqo+d-=wuCNYPa+6sWKrdQJ-ATdg@mail.gmail.com/
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> mm/secretmem.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index c1bd9a4b663d..37f6d1097853 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -82,13 +82,13 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> __folio_mark_uptodate(folio);
> err = filemap_add_folio(mapping, folio, offset, gfp);
> if (unlikely(err)) {
> - folio_put(folio);
> /*
> * If a split of large page was required, it
> * already happened when we marked the page invalid
> * which guarantees that this call won't fail
> */
> set_direct_map_default_noflush(folio_page(folio, 0));
> + folio_put(folio);
> if (err == -EEXIST)
> goto retry;
>
> --
> 2.49.0
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/1] mm/secretmem: fix use-after-free race in fault handler
2025-10-31 9:59 ` Mike Rapoport
@ 2025-10-31 10:19 ` David Hildenbrand
2025-10-31 10:35 ` Lance Yang
2025-10-31 10:24 ` Lorenzo Stoakes
2025-10-31 10:34 ` Lance Yang
2 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-10-31 10:19 UTC (permalink / raw)
To: Mike Rapoport, Lance Yang
Cc: akpm, big-sleep-vuln-reports, linux-kernel, linux-mm,
lorenzo.stoakes, willy, stable
On 31.10.25 10:59, Mike Rapoport wrote:
> On Fri, Oct 31, 2025 at 05:18:18PM +0800, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> The error path in secretmem_fault() frees a folio before restoring its
>> direct map status, which is a race leading to a panic.
>
> Let's use the issue description from the report:
>
> When a page fault occurs in a secret memory file created with
> `memfd_secret(2)`, the kernel will allocate a new folio for it, mark
> the underlying page as not-present in the direct map, and add it to
> the file mapping.
>
> If two tasks cause a fault in the same page concurrently, both could
> end up allocating a folio and removing the page from the direct map,
> but only one would succeed in adding the folio to the file
> mapping. The task that failed undoes the effects of its attempt by (a)
> freeing the folio again and (b) putting the page back into the direct
> map. However, by doing these two operations in this order, the page
> becomes available to the allocator again before it is placed back in
> the direct mapping.
>
> If another task attempts to allocate the page between (a) and (b), and
> the kernel tries to access it via the direct map, it would result in a
> supervisor not-present page fault.
>
>> Fix the ordering to restore the map before the folio is freed.
>
> ... restore the direct map
>
> With these changes
>
> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Fully agreed
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/1] mm/secretmem: fix use-after-free race in fault handler
2025-10-31 9:59 ` Mike Rapoport
2025-10-31 10:19 ` David Hildenbrand
@ 2025-10-31 10:24 ` Lorenzo Stoakes
2025-10-31 10:34 ` Lance Yang
2025-10-31 10:34 ` Lance Yang
2 siblings, 1 reply; 10+ messages in thread
From: Lorenzo Stoakes @ 2025-10-31 10:24 UTC (permalink / raw)
To: Mike Rapoport
Cc: Lance Yang, akpm, big-sleep-vuln-reports, linux-kernel, linux-mm,
willy, david, stable
Small thing, sorry to be a pain buuuut could we please not send patches
in-reply to another mail, it makes it harder for people to see :)
On Fri, Oct 31, 2025 at 11:59:16AM +0200, Mike Rapoport wrote:
> On Fri, Oct 31, 2025 at 05:18:18PM +0800, Lance Yang wrote:
> > From: Lance Yang <lance.yang@linux.dev>
> >
> > The error path in secretmem_fault() frees a folio before restoring its
> > direct map status, which is a race leading to a panic.
>
> Let's use the issue description from the report:
>
> When a page fault occurs in a secret memory file created with
> `memfd_secret(2)`, the kernel will allocate a new folio for it, mark
> the underlying page as not-present in the direct map, and add it to
> the file mapping.
>
> If two tasks cause a fault in the same page concurrently, both could
> end up allocating a folio and removing the page from the direct map,
> but only one would succeed in adding the folio to the file
> mapping. The task that failed undoes the effects of its attempt by (a)
> freeing the folio again and (b) putting the page back into the direct
> map. However, by doing these two operations in this order, the page
> becomes available to the allocator again before it is placed back in
> the direct mapping.
>
> If another task attempts to allocate the page between (a) and (b), and
> the kernel tries to access it via the direct map, it would result in a
> supervisor not-present page fault.
>
> > Fix the ordering to restore the map before the folio is freed.
>
> ... restore the direct map
>
> With these changes
>
> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Agree with David and Mike, this looks 'obviously correct'; thanks for
addressing it.
But also as per Mike, please update message accordingly and send v2
not-in-reply-to-anything :P
With that said:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> >
> > Cc: <stable@vger.kernel.org>
> > Reported-by: Google Big Sleep <big-sleep-vuln-reports@google.com>
> > Closes: https://lore.kernel.org/linux-mm/CAEXGt5QeDpiHTu3K9tvjUTPqo+d-=wuCNYPa+6sWKrdQJ-ATdg@mail.gmail.com/
> > Signed-off-by: Lance Yang <lance.yang@linux.dev>
> > ---
> > mm/secretmem.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > index c1bd9a4b663d..37f6d1097853 100644
> > --- a/mm/secretmem.c
> > +++ b/mm/secretmem.c
> > @@ -82,13 +82,13 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > __folio_mark_uptodate(folio);
> > err = filemap_add_folio(mapping, folio, offset, gfp);
> > if (unlikely(err)) {
> > - folio_put(folio);
> > /*
> > * If a split of large page was required, it
> > * already happened when we marked the page invalid
> > * which guarantees that this call won't fail
> > */
> > set_direct_map_default_noflush(folio_page(folio, 0));
> > + folio_put(folio);
> > if (err == -EEXIST)
> > goto retry;
> >
> > --
> > 2.49.0
> >
>
> --
> Sincerely yours,
> Mike.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/1] mm/secretmem: fix use-after-free race in fault handler
2025-10-31 9:59 ` Mike Rapoport
2025-10-31 10:19 ` David Hildenbrand
2025-10-31 10:24 ` Lorenzo Stoakes
@ 2025-10-31 10:34 ` Lance Yang
2 siblings, 0 replies; 10+ messages in thread
From: Lance Yang @ 2025-10-31 10:34 UTC (permalink / raw)
To: Mike Rapoport
Cc: akpm, big-sleep-vuln-reports, linux-kernel, linux-mm,
lorenzo.stoakes, willy, david, stable
On 2025/10/31 17:59, Mike Rapoport wrote:
> On Fri, Oct 31, 2025 at 05:18:18PM +0800, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> The error path in secretmem_fault() frees a folio before restoring its
>> direct map status, which is a race leading to a panic.
>
> Let's use the issue description from the report:
Will do. I'll also add the missing Fixes: tag.
>
> When a page fault occurs in a secret memory file created with
> `memfd_secret(2)`, the kernel will allocate a new folio for it, mark
> the underlying page as not-present in the direct map, and add it to
> the file mapping.
>
> If two tasks cause a fault in the same page concurrently, both could
> end up allocating a folio and removing the page from the direct map,
> but only one would succeed in adding the folio to the file
> mapping. The task that failed undoes the effects of its attempt by (a)
> freeing the folio again and (b) putting the page back into the direct
> map. However, by doing these two operations in this order, the page
> becomes available to the allocator again before it is placed back in
> the direct mapping.
>
> If another task attempts to allocate the page between (a) and (b), and
> the kernel tries to access it via the direct map, it would result in a
> supervisor not-present page fault.
>
>> Fix the ordering to restore the map before the folio is freed.
>
> ... restore the direct map
>
> With these changes
>
> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Thanks!
Lance
>
>>
>> Cc: <stable@vger.kernel.org>
>> Reported-by: Google Big Sleep <big-sleep-vuln-reports@google.com>
>> Closes: https://lore.kernel.org/linux-mm/CAEXGt5QeDpiHTu3K9tvjUTPqo+d-=wuCNYPa+6sWKrdQJ-ATdg@mail.gmail.com/
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> mm/secretmem.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/secretmem.c b/mm/secretmem.c
>> index c1bd9a4b663d..37f6d1097853 100644
>> --- a/mm/secretmem.c
>> +++ b/mm/secretmem.c
>> @@ -82,13 +82,13 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
>> __folio_mark_uptodate(folio);
>> err = filemap_add_folio(mapping, folio, offset, gfp);
>> if (unlikely(err)) {
>> - folio_put(folio);
>> /*
>> * If a split of large page was required, it
>> * already happened when we marked the page invalid
>> * which guarantees that this call won't fail
>> */
>> set_direct_map_default_noflush(folio_page(folio, 0));
>> + folio_put(folio);
>> if (err == -EEXIST)
>> goto retry;
>>
>> --
>> 2.49.0
>>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/1] mm/secretmem: fix use-after-free race in fault handler
2025-10-31 10:24 ` Lorenzo Stoakes
@ 2025-10-31 10:34 ` Lance Yang
0 siblings, 0 replies; 10+ messages in thread
From: Lance Yang @ 2025-10-31 10:34 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, big-sleep-vuln-reports, linux-kernel, linux-mm, willy,
david, stable, Mike Rapoport
On 2025/10/31 18:24, Lorenzo Stoakes wrote:
> Small thing, sorry to be a pain buuuut could we please not send patches
> in-reply to another mail, it makes it harder for people to see :)
>
> On Fri, Oct 31, 2025 at 11:59:16AM +0200, Mike Rapoport wrote:
>> On Fri, Oct 31, 2025 at 05:18:18PM +0800, Lance Yang wrote:
>>> From: Lance Yang <lance.yang@linux.dev>
>>>
>>> The error path in secretmem_fault() frees a folio before restoring its
>>> direct map status, which is a race leading to a panic.
>>
>> Let's use the issue description from the report:
>>
>> When a page fault occurs in a secret memory file created with
>> `memfd_secret(2)`, the kernel will allocate a new folio for it, mark
>> the underlying page as not-present in the direct map, and add it to
>> the file mapping.
>>
>> If two tasks cause a fault in the same page concurrently, both could
>> end up allocating a folio and removing the page from the direct map,
>> but only one would succeed in adding the folio to the file
>> mapping. The task that failed undoes the effects of its attempt by (a)
>> freeing the folio again and (b) putting the page back into the direct
>> map. However, by doing these two operations in this order, the page
>> becomes available to the allocator again before it is placed back in
>> the direct mapping.
>>
>> If another task attempts to allocate the page between (a) and (b), and
>> the kernel tries to access it via the direct map, it would result in a
>> supervisor not-present page fault.
>>
>>> Fix the ordering to restore the map before the folio is freed.
>>
>> ... restore the direct map
>>
>> With these changes
>>
>> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>
> Agree with David, Mike this looks 'obviously correct' thanks for addressing
> it.
>
> But also as per Mike, please update message accordingly and send v2
> not-in-reply-to-anything :P
Sure. V2 is on the way ;)
>
> With that said:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Thanks!
Lance
>
>>
>>>
>>> Cc: <stable@vger.kernel.org>
>>> Reported-by: Google Big Sleep <big-sleep-vuln-reports@google.com>
>>> Closes: https://lore.kernel.org/linux-mm/CAEXGt5QeDpiHTu3K9tvjUTPqo+d-=wuCNYPa+6sWKrdQJ-ATdg@mail.gmail.com/
>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>> ---
>>> mm/secretmem.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/secretmem.c b/mm/secretmem.c
>>> index c1bd9a4b663d..37f6d1097853 100644
>>> --- a/mm/secretmem.c
>>> +++ b/mm/secretmem.c
>>> @@ -82,13 +82,13 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
>>> __folio_mark_uptodate(folio);
>>> err = filemap_add_folio(mapping, folio, offset, gfp);
>>> if (unlikely(err)) {
>>> - folio_put(folio);
>>> /*
>>> * If a split of large page was required, it
>>> * already happened when we marked the page invalid
>>> * which guarantees that this call won't fail
>>> */
>>> set_direct_map_default_noflush(folio_page(folio, 0));
>>> + folio_put(folio);
>>> if (err == -EEXIST)
>>> goto retry;
>>>
>>> --
>>> 2.49.0
>>>
>>
>> --
>> Sincerely yours,
>> Mike.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/1] mm/secretmem: fix use-after-free race in fault handler
2025-10-31 10:19 ` David Hildenbrand
@ 2025-10-31 10:35 ` Lance Yang
0 siblings, 0 replies; 10+ messages in thread
From: Lance Yang @ 2025-10-31 10:35 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, big-sleep-vuln-reports, linux-kernel, linux-mm,
lorenzo.stoakes, willy, stable, Mike Rapoport
On 2025/10/31 18:19, David Hildenbrand wrote:
> On 31.10.25 10:59, Mike Rapoport wrote:
>> On Fri, Oct 31, 2025 at 05:18:18PM +0800, Lance Yang wrote:
>>> From: Lance Yang <lance.yang@linux.dev>
>>>
>>> The error path in secretmem_fault() frees a folio before restoring its
>>> direct map status, which is a race leading to a panic.
>>
>> Let's use the issue description from the report:
>>
>> When a page fault occurs in a secret memory file created with
>> `memfd_secret(2)`, the kernel will allocate a new folio for it, mark
>> the underlying page as not-present in the direct map, and add it to
>> the file mapping.
>>
>> If two tasks cause a fault in the same page concurrently, both could
>> end up allocating a folio and removing the page from the direct map,
>> but only one would succeed in adding the folio to the file
>> mapping. The task that failed undoes the effects of its attempt by (a)
>> freeing the folio again and (b) putting the page back into the direct
>> map. However, by doing these two operations in this order, the page
>> becomes available to the allocator again before it is placed back in
>> the direct mapping.
>>
>> If another task attempts to allocate the page between (a) and (b), and
>> the kernel tries to access it via the direct map, it would result in a
>> supervisor not-present page fault.
>>> Fix the ordering to restore the map before the folio is freed.
>>
>> ... restore the direct map
>>
>> With these changes
>>
>> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>
> Fully agreed
>
> Acked-by: David Hildenbrand <david@redhat.com>
Cheers!
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Oops in secretmem_fault()
2025-10-31 1:36 ` Andrew Morton
@ 2025-10-31 11:48 ` Big Sleep
0 siblings, 0 replies; 10+ messages in thread
From: Big Sleep @ 2025-10-31 11:48 UTC (permalink / raw)
To: big-sleep-vuln-reports
Cc: Mike Rapoport, linux-mm, linux-kernel, Matthew Wilcox (Oracle),
Lorenzo Stoakes, Lance Yang
Hello Andrew!
On Fri, Oct 31, 2025 at 2:36 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 30 Oct 2025 16:34:29 +0100 Big Sleep <big-sleep-vuln-reports@google.com> wrote:
> You might want to include a token here so we (you!) can track the
> report through to its resolution. See what the sysbot people are
> doing. For example,
>
> https://lkml.rescloud.iu.edu/2408.2/07972.html included:
>
> : IMPORTANT: if you fix the issue, please add the following tag to the commit:
> : Reported-by: syzbot+5054473a31f78f735416@syzkaller.appspotmail.com
>
> and when Dmitry fixed this he included that info in the patch metadata:
>
> https://lore.kernel.org/all/20251028101447.693289-1-dmantipov@yandex.ru/T/#u
>
> and that Reported-by: will be carried all the way into the mainline tree.
Thank you for the suggestion! We are not currently tracking reports
which we categorize as having no security impact, but we will include
a tagging scheme like this for future reports to the kernel which do
have a security impact.
> btw, it would be nice to Cc some human on these reports. One cannot
> be very confident that emails sent to big-sleep-vuln-reports@google.com
> will actually be read by someone.
We are monitoring and replying to our reports and the surrounding
conversations from our project alias. We'll try to clarify this on
future reports.
--Google Big Sleep Team
^ permalink raw reply [flat|nested] 10+ messages in thread