[PATCH 0/2] mm/memory: fix memory tearing on threaded fork

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] mm/memory: fix memory tearing on threaded fork
@ 2025-06-03 18:21 Jann Horn
  2025-06-03 18:21 ` [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot Jann Horn
  2025-06-03 18:21 ` [PATCH 2/2] mm/memory: Document how we make a " Jann Horn
  0 siblings, 2 replies; 24+ messages in thread
From: Jann Horn @ 2025-06-03 18:21 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm
  Cc: Peter Xu, linux-kernel, Jann Horn, stable

The first patch is a fix with an explanation of the issue, you should
read that first.
The second patch adds a comment to document the rules because figuring
this out from scratch causes brain pain.

Accidentally hitting this issue and getting negative consequences from
it would require several stars to line up just right; but if someone out
there is using a malloc() implementation that uses lockless data
structures across threads or such, this could actually be a problem.

In case someone wants a testcase, here's a very artificial one:

```
 #include <pthread.h>
 #include <err.h>
 #include <stdio.h>
 #include <unistd.h>
 #include <sys/syscall.h>
 #include <sys/uio.h>
 #include <sys/mman.h>
 #include <sys/wait.h>
 #include <linux/io_uring.h>

 #define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

 #define NUM_SQ_PAGES 4
static int uring_init(struct io_uring_sqe **sqesp, void **cqesp) {
  struct io_uring_sqe *sqes = SYSCHK(mmap(NULL, NUM_SQ_PAGES*0x1000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0));
  void *cqes = SYSCHK(mmap(NULL, NUM_SQ_PAGES*0x1000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0));
  *(volatile unsigned int *)(cqes+4) = 64 * NUM_SQ_PAGES;
  struct io_uring_params params = {
    .flags = IORING_SETUP_NO_MMAP|IORING_SETUP_NO_SQARRAY,
    .sq_off = { .user_addr = (unsigned long)sqes },
    .cq_off = { .user_addr = (unsigned long)cqes }
  };
  int uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/10, &params));
  if (sqesp)
    *sqesp = sqes;
  if (cqesp)
    *cqesp = cqes;
  return uring_fd;
}

static char *bufmem[0x3000] __attribute__((aligned(0x1000)));

static void *thread_fn(void *dummy) {
  unsigned long i = 0;
  while (1) {
    *(volatile unsigned long *)(bufmem + 0x0000) = i;
    *(volatile unsigned long *)(bufmem + 0x0f00) = i;
    *(volatile unsigned long *)(bufmem + 0x1000) = i;
    *(volatile unsigned long *)(bufmem + 0x1f00) = i;
    *(volatile unsigned long *)(bufmem + 0x2000) = i;
    *(volatile unsigned long *)(bufmem + 0x2f00) = i;
    i++;
  }
}

int main(void) {
 #if 1
  int uring_fd = uring_init(NULL, NULL);
  struct iovec reg_iov = { .iov_base = bufmem, .iov_len = 0x2000 };
  SYSCHK(syscall(__NR_io_uring_register, uring_fd, IORING_REGISTER_BUFFERS, &reg_iov, 1));
 #endif

  pthread_t thread;
  if (pthread_create(&thread, NULL, thread_fn, NULL))
    errx(1, "pthread_create");

  sleep(1);
  int child = SYSCHK(fork());
  if (child == 0) {
    printf("bufmem values:\n");
    printf("  0x0000: 0x%lx\n", *(volatile unsigned long *)(bufmem + 0x0000));
    printf("  0x0f00: 0x%lx\n", *(volatile unsigned long *)(bufmem + 0x0f00));
    printf("  0x1000: 0x%lx\n", *(volatile unsigned long *)(bufmem + 0x1000));
    printf("  0x1f00: 0x%lx\n", *(volatile unsigned long *)(bufmem + 0x1f00));
    printf("  0x2000: 0x%lx\n", *(volatile unsigned long *)(bufmem + 0x2000));
    printf("  0x2f00: 0x%lx\n", *(volatile unsigned long *)(bufmem + 0x2f00));
    return 0;
  }
  int wstatus;
  SYSCHK(wait(&wstatus));
  return 0;
}
```

Without this series, the child will usually print results that are
apart by more than 1, which is not a state that ever occurred in
the parent; in my opinion, that counts as a bug.
If you change the "#if 1" to "#if 0", the bug won't manifest.

Signed-off-by: Jann Horn <jannh@google.com>
---
Jann Horn (2):
      mm/memory: ensure fork child sees coherent memory snapshot
      mm/memory: Document how we make a coherent memory snapshot

 kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++
 mm/memory.c   | 18 ++++++++++++++++++
 2 files changed, 52 insertions(+)
---
base-commit: 8477ab143069c6b05d6da4a8184ded8b969240f5
change-id: 20250530-fork-tearing-71da211a50cf

-- 
Jann Horn <jannh@google.com>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-03 18:21 [PATCH 0/2] mm/memory: fix memory tearing on threaded fork Jann Horn
@ 2025-06-03 18:21 ` Jann Horn
  2025-06-03 18:29   ` Matthew Wilcox
                     ` (4 more replies)
  2025-06-03 18:21 ` [PATCH 2/2] mm/memory: Document how we make a " Jann Horn
  1 sibling, 5 replies; 24+ messages in thread
From: Jann Horn @ 2025-06-03 18:21 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm
  Cc: Peter Xu, linux-kernel, Jann Horn, stable

When fork() encounters possibly-pinned pages, those pages are immediately
copied instead of just marking PTEs to make CoW happen later. If the parent
is multithreaded, this can cause the child to see memory contents that are
inconsistent in multiple ways:

1. We are copying the contents of a page with a memcpy() while userspace
   may be writing to it. This can cause the resulting data in the child to
   be inconsistent.
2. After we've copied this page, future writes to other pages may
   continue to be visible to the child while future writes to this page are
   no longer visible to the child.

This means the child could theoretically see incoherent states where
allocator freelists point to objects that are actually in use or stuff like
that. A mitigating factor is that, unless userspace already has a deadlock
bug, userspace can pretty much only observe such issues when fancy lockless
data structures are used (because if another thread was in the middle of
mutating data during fork() and the post-fork child tried to take the mutex
protecting that data, it might wait forever).

On top of that, this issue is only observable when pages are either
DMA-pinned or appear false-positive-DMA-pinned due to a page having >=1024
references and the parent process having used DMA-pinning at least once
before.

Fixes: 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
---
 mm/memory.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 49199410805c..b406dfda976b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -917,7 +917,25 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	/*
 	 * We have a prealloc page, all good!  Take it
 	 * over and copy the page & arm it.
+	 *
+	 * One nasty aspect is that we could be in a multithreaded process or
+	 * such, where another thread is in the middle of writing to memory
+	 * while this thread is forking. As long as we're just marking PTEs as
+	 * read-only to make copy-on-write happen *later*, that's easy; we just
+	 * need to do a single TLB flush before dropping the mmap/VMA locks, and
+	 * that's enough to guarantee that the child gets a coherent snapshot of
+	 * memory.
+	 * But here, where we're doing an immediate copy, we must ensure that
+	 * threads in the parent process can no longer write into the page being
+	 * copied until we're done forking.
+	 * This means that we still need to mark the source PTE as read-only,
+	 * with an immediate TLB flush.
+	 * (To make the source PTE writable again after fork() is done, we can
+	 * rely on the page fault handler to do that lazily, thanks to
+	 * PageAnonExclusive().)
 	 */
+	ptep_set_wrprotect(src_vma->vm_mm, addr, src_pte);
+	flush_tlb_page(src_vma, addr);

 	if (copy_mc_user_highpage(&new_folio->page, page, addr, src_vma))
 		return -EHWPOISON;

-- 
2.49.0.1204.g71687c7c1d-goog

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/2] mm/memory: Document how we make a coherent memory snapshot
  2025-06-03 18:21 [PATCH 0/2] mm/memory: fix memory tearing on threaded fork Jann Horn
  2025-06-03 18:21 ` [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot Jann Horn
@ 2025-06-03 18:21 ` Jann Horn
  2025-06-04 17:03   ` Peter Xu
  1 sibling, 1 reply; 24+ messages in thread
From: Jann Horn @ 2025-06-03 18:21 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm
  Cc: Peter Xu, linux-kernel, Jann Horn

It is not currently documented that the child of fork() should receive a
coherent snapshot of the parent's memory, or how we get such a snapshot.
Add a comment block to explain this.

Signed-off-by: Jann Horn <jannh@google.com>
---
 kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 85afccfdf3b1..f78f5df596a9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -604,6 +604,40 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
 }
 
 #ifdef CONFIG_MMU
+/*
+ * Anonymous memory inherited by the child MM must, on success, contain a
+ * coherent snapshot of corresponding anonymous memory in the parent MM.
+ * (An exception are anonymous memory regions which are concurrently written
+ * by kernel code or hardware devices through page references obtained via GUP.)
+ * We effectively snapshot the parent's memory just before
+ * mmap_write_unlock(oldmm); any writes after that point are invisible to the
+ * child, while attempted writes before that point are either visible to the
+ * child or delayed until after mmap_write_unlock(oldmm).
+ *
+ * To make that work while only needing a single pass through the parent's VMA
+ * tree and page tables, we follow these rules:
+ *
+ *  - Before mmap_write_unlock(), a TLB flush ensures that parent threads can't
+ *    write to copy-on-write pages anymore.
+ *  - Before dup_mmap() copies page contents (which happens rarely), the
+ *    parent's PTE for the page is made read-only and a TLB flush is issued, so
+ *    subsequent writes are delayed until mmap_write_unlock().
+ *  - Before dup_mmap() starts walking the page tables of a VMA in the parent,
+ *    the VMA is write-locked to ensure that the parent can't perform writes
+ *    that won't be visible in the child before mmap_write_unlock():
+ *      a) through concurrent copy-on-write handling
+ *      b) by upgrading read-only PTEs to writable
+ *
+ * Not following these rules, and giving the child a torn copy of the parent's
+ * memory contents where different segments come from different points in time,
+ * would likely _mostly_ work:
+ * Any memory to which a concurrent parent thread could be writing under a lock
+ * can't be accessed from the child without risking deadlocks (since the child
+ * might inherit the lock in a locked state, in which case the lock will stay
+ * locked forever in the child).
+ * But if userspace is using trylock or lock-free algorithms, providing a torn
+ * view of memory could break the child.
+ */
 static __latent_entropy int dup_mmap(struct mm_struct *mm,
 					struct mm_struct *oldmm)
 {

-- 
2.49.0.1204.g71687c7c1d-goog


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-03 18:21 ` [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot Jann Horn
@ 2025-06-03 18:29   ` Matthew Wilcox
  2025-06-03 18:37     ` David Hildenbrand
  2025-06-03 19:03     ` Jann Horn
  2025-06-03 18:33   ` David Hildenbrand
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 24+ messages in thread
From: Matthew Wilcox @ 2025-06-03 18:29 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, Peter Xu,
	linux-kernel, stable

On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
> When fork() encounters possibly-pinned pages, those pages are immediately
> copied instead of just marking PTEs to make CoW happen later. If the parent
> is multithreaded, this can cause the child to see memory contents that are
> inconsistent in multiple ways:
> 
> 1. We are copying the contents of a page with a memcpy() while userspace
>    may be writing to it. This can cause the resulting data in the child to
>    be inconsistent.
> 2. After we've copied this page, future writes to other pages may
>    continue to be visible to the child while future writes to this page are
>    no longer visible to the child.
> 
> This means the child could theoretically see incoherent states where
> allocator freelists point to objects that are actually in use or stuff like
> that. A mitigating factor is that, unless userspace already has a deadlock
> bug, userspace can pretty much only observe such issues when fancy lockless
> data structures are used (because if another thread was in the middle of
> mutating data during fork() and the post-fork child tried to take the mutex
> protecting that data, it might wait forever).

Um, OK, but isn't that expected behaviour?  POSIX says:

: A process shall be created with a single thread. If a multi-threaded
: process calls fork(), the new process shall contain a replica of the
: calling thread and its entire address space, possibly including the
: states of mutexes and other resources. Consequently, the application
: shall ensure that the child process only executes async-signal-safe
: operations until such time as one of the exec functions is successful.

It's always been my understanding that you really, really shouldn't call
fork() from a multithreaded process.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-03 18:21 ` [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot Jann Horn
  2025-06-03 18:29   ` Matthew Wilcox
@ 2025-06-03 18:33   ` David Hildenbrand
  2025-06-03 20:32   ` Pedro Falcato
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2025-06-03 18:33 UTC (permalink / raw)
  To: Jann Horn, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm
  Cc: Peter Xu, linux-kernel, stable

On 03.06.25 20:21, Jann Horn wrote:
> When fork() encounters possibly-pinned pages, those pages are immediately
> copied instead of just marking PTEs to make CoW happen later. If the parent
> is multithreaded, this can cause the child to see memory contents that are
> inconsistent in multiple ways:
> 
> 1. We are copying the contents of a page with a memcpy() while userspace
>     may be writing to it. This can cause the resulting data in the child to
>     be inconsistent.
 > 2. After we've copied this page, future writes to other pages may> 
  continue to be visible to the child while future writes to this page are
>     no longer visible to the child.
 > > This means the child could theoretically see incoherent states where
> allocator freelists point to objects that are actually in use or stuff like
> that. A mitigating factor is that, unless userspace already has a deadlock
> bug, userspace can pretty much only observe such issues when fancy lockless
> data structures are used (because if another thread was in the middle of
> mutating data during fork() and the post-fork child tried to take the mutex
> protecting that data, it might wait forever).
> 

Hmm, interesting.

> On top of that, this issue is only observable when pages are either
> DMA-pinned or appear false-positive-DMA-pinned due to a page having >=1024
> references and the parent process having used DMA-pinning at least once
> before.

Right.

> 
> Fixes: 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jann Horn <jannh@google.com>
> ---
>   mm/memory.c | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 49199410805c..b406dfda976b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -917,7 +917,25 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>   	/*
>   	 * We have a prealloc page, all good!  Take it
>   	 * over and copy the page & arm it.
> +	 *
> +	 * One nasty aspect is that we could be in a multithreaded process or
> +	 * such, where another thread is in the middle of writing to memory
> +	 * while this thread is forking. As long as we're just marking PTEs as
> +	 * read-only to make copy-on-write happen *later*, that's easy; we just
> +	 * need to do a single TLB flush before dropping the mmap/VMA locks, and
> +	 * that's enough to guarantee that the child gets a coherent snapshot of
> +	 * memory.
> +	 * But here, where we're doing an immediate copy, we must ensure that
> +	 * threads in the parent process can no longer write into the page being
> +	 * copied until we're done forking.
> +	 * This means that we still need to mark the source PTE as read-only,
> +	 * with an immediate TLB flush.
> +	 * (To make the source PTE writable again after fork() is done, we can
> +	 * rely on the page fault handler to do that lazily, thanks to
> +	 * PageAnonExclusive().)
>   	 */
> +	ptep_set_wrprotect(src_vma->vm_mm, addr, src_pte);
> +	flush_tlb_page(src_vma, addr);

Would we need something similar for hugetlb, or is that already handled?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-03 18:29   ` Matthew Wilcox
@ 2025-06-03 18:37     ` David Hildenbrand
  2025-06-03 19:09       ` Jann Horn
  2025-06-03 19:03     ` Jann Horn
  1 sibling, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2025-06-03 18:37 UTC (permalink / raw)
  To: Matthew Wilcox, Jann Horn
  Cc: Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm,
	Peter Xu, linux-kernel, stable

On 03.06.25 20:29, Matthew Wilcox wrote:
> On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
>> When fork() encounters possibly-pinned pages, those pages are immediately
>> copied instead of just marking PTEs to make CoW happen later. If the parent
>> is multithreaded, this can cause the child to see memory contents that are
>> inconsistent in multiple ways:
>>
>> 1. We are copying the contents of a page with a memcpy() while userspace
>>     may be writing to it. This can cause the resulting data in the child to
>>     be inconsistent.
>> 2. After we've copied this page, future writes to other pages may
>>     continue to be visible to the child while future writes to this page are
>>     no longer visible to the child.
>>
>> This means the child could theoretically see incoherent states where
>> allocator freelists point to objects that are actually in use or stuff like
>> that. A mitigating factor is that, unless userspace already has a deadlock
>> bug, userspace can pretty much only observe such issues when fancy lockless
>> data structures are used (because if another thread was in the middle of
>> mutating data during fork() and the post-fork child tried to take the mutex
>> protecting that data, it might wait forever).
> 
> Um, OK, but isn't that expected behaviour?  POSIX says:
> 
> : A process shall be created with a single thread. If a multi-threaded
> : process calls fork(), the new process shall contain a replica of the
> : calling thread and its entire address space, possibly including the
> : states of mutexes and other resources. Consequently, the application
> : shall ensure that the child process only executes async-signal-safe
> : operations until such time as one of the exec functions is successful.
> 
> It's always been my understanding that you really, really shouldn't call
> fork() from a multithreaded process.

I have the same recollection, but rather because of concurrent O_DIRECT 
and locking (pthread_atfork ...).

Using the allocator above example: what makes sure that no other thread 
is halfway through modifying allocator state? You really have to sync 
somehow before calling fork() -- e.g., grabbing allocator locks in 
pthread_atfork().


For Linux we document in the man page

"After  a  fork() in a multithreaded program, the child can safely call 
only async-signal-safe functions (see signal-safety(7)) until such time 
as it calls execve(2)."

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-03 18:29   ` Matthew Wilcox
  2025-06-03 18:37     ` David Hildenbrand
@ 2025-06-03 19:03     ` Jann Horn
  2025-06-04 12:22       ` David Hildenbrand
  1 sibling, 1 reply; 24+ messages in thread
From: Jann Horn @ 2025-06-03 19:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, Peter Xu,
	linux-kernel, stable

On Tue, Jun 3, 2025 at 8:29 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
> > When fork() encounters possibly-pinned pages, those pages are immediately
> > copied instead of just marking PTEs to make CoW happen later. If the parent
> > is multithreaded, this can cause the child to see memory contents that are
> > inconsistent in multiple ways:
> >
> > 1. We are copying the contents of a page with a memcpy() while userspace
> >    may be writing to it. This can cause the resulting data in the child to
> >    be inconsistent.
> > 2. After we've copied this page, future writes to other pages may
> >    continue to be visible to the child while future writes to this page are
> >    no longer visible to the child.
> >
> > This means the child could theoretically see incoherent states where
> > allocator freelists point to objects that are actually in use or stuff like
> > that. A mitigating factor is that, unless userspace already has a deadlock
> > bug, userspace can pretty much only observe such issues when fancy lockless
> > data structures are used (because if another thread was in the middle of
> > mutating data during fork() and the post-fork child tried to take the mutex
> > protecting that data, it might wait forever).
>
> Um, OK, but isn't that expected behaviour?  POSIX says:

I don't think it is expected behavior that locklessly-updated data
structures in application code could break.

> : A process shall be created with a single thread. If a multi-threaded
> : process calls fork(), the new process shall contain a replica of the
> : calling thread and its entire address space, possibly including the
> : states of mutexes and other resources. Consequently, the application
> : shall ensure that the child process only executes async-signal-safe
> : operations until such time as one of the exec functions is successful.

I think that is only talking about ways in which you can interact with
libc, and does not mean that application code couldn't access its own
lockless data structures or such.

Though admittedly that is a fairly theoretical point, since in
practice the most likely place where you'd encounter this kind of
issue would be in an allocator implementation or such.

> It's always been my understanding that you really, really shouldn't call
> fork() from a multithreaded process.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-03 18:37     ` David Hildenbrand
@ 2025-06-03 19:09       ` Jann Horn
  2025-06-03 20:17         ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Jann Horn @ 2025-06-03 19:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, Peter Xu, linux-kernel, stable

On Tue, Jun 3, 2025 at 8:37 PM David Hildenbrand <david@redhat.com> wrote:
> On 03.06.25 20:29, Matthew Wilcox wrote:
> > On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
> >> When fork() encounters possibly-pinned pages, those pages are immediately
> >> copied instead of just marking PTEs to make CoW happen later. If the parent
> >> is multithreaded, this can cause the child to see memory contents that are
> >> inconsistent in multiple ways:
> >>
> >> 1. We are copying the contents of a page with a memcpy() while userspace
> >>     may be writing to it. This can cause the resulting data in the child to
> >>     be inconsistent.
> >> 2. After we've copied this page, future writes to other pages may
> >>     continue to be visible to the child while future writes to this page are
> >>     no longer visible to the child.
> >>
> >> This means the child could theoretically see incoherent states where
> >> allocator freelists point to objects that are actually in use or stuff like
> >> that. A mitigating factor is that, unless userspace already has a deadlock
> >> bug, userspace can pretty much only observe such issues when fancy lockless
> >> data structures are used (because if another thread was in the middle of
> >> mutating data during fork() and the post-fork child tried to take the mutex
> >> protecting that data, it might wait forever).
> >
> > Um, OK, but isn't that expected behaviour?  POSIX says:
> >
> > : A process shall be created with a single thread. If a multi-threaded
> > : process calls fork(), the new process shall contain a replica of the
> > : calling thread and its entire address space, possibly including the
> > : states of mutexes and other resources. Consequently, the application
> > : shall ensure that the child process only executes async-signal-safe
> > : operations until such time as one of the exec functions is successful.
> >
> > It's always been my understanding that you really, really shouldn't call
> > fork() from a multithreaded process.
>
> I have the same recollection, but rather because of concurrent O_DIRECT
> and locking (pthread_atfork ...).
>
> Using the allocator above example: what makes sure that no other thread
> is halfway through modifying allocator state? You really have to sync
> somehow before calling fork() -- e.g., grabbing allocator locks in
> pthread_atfork().

Yeah, like what glibc does for its malloc implementation to prevent
allocator calls from racing with fork(), so that malloc() keeps
working after fork(), even though POSIX says that the libc doesn't
have to guarantee that.

> For Linux we document in the man page
>
> "After  a  fork() in a multithreaded program, the child can safely call
> only async-signal-safe functions (see signal-safety(7)) until such time
> as it calls execve(2)."

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-03 19:09       ` Jann Horn
@ 2025-06-03 20:17         ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2025-06-03 20:17 UTC (permalink / raw)
  To: Jann Horn
  Cc: Matthew Wilcox, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, Peter Xu, linux-kernel, stable

On 03.06.25 21:09, Jann Horn wrote:
> On Tue, Jun 3, 2025 at 8:37 PM David Hildenbrand <david@redhat.com> wrote:
>> On 03.06.25 20:29, Matthew Wilcox wrote:
>>> On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
>>>> When fork() encounters possibly-pinned pages, those pages are immediately
>>>> copied instead of just marking PTEs to make CoW happen later. If the parent
>>>> is multithreaded, this can cause the child to see memory contents that are
>>>> inconsistent in multiple ways:
>>>>
>>>> 1. We are copying the contents of a page with a memcpy() while userspace
>>>>      may be writing to it. This can cause the resulting data in the child to
>>>>      be inconsistent.
>>>> 2. After we've copied this page, future writes to other pages may
>>>>      continue to be visible to the child while future writes to this page are
>>>>      no longer visible to the child.
>>>>
>>>> This means the child could theoretically see incoherent states where
>>>> allocator freelists point to objects that are actually in use or stuff like
>>>> that. A mitigating factor is that, unless userspace already has a deadlock
>>>> bug, userspace can pretty much only observe such issues when fancy lockless
>>>> data structures are used (because if another thread was in the middle of
>>>> mutating data during fork() and the post-fork child tried to take the mutex
>>>> protecting that data, it might wait forever).
>>>
>>> Um, OK, but isn't that expected behaviour?  POSIX says:
>>>
>>> : A process shall be created with a single thread. If a multi-threaded
>>> : process calls fork(), the new process shall contain a replica of the
>>> : calling thread and its entire address space, possibly including the
>>> : states of mutexes and other resources. Consequently, the application
>>> : shall ensure that the child process only executes async-signal-safe
>>> : operations until such time as one of the exec functions is successful.
>>>
>>> It's always been my understanding that you really, really shouldn't call
>>> fork() from a multithreaded process.
>>
>> I have the same recollection, but rather because of concurrent O_DIRECT
>> and locking (pthread_atfork ...).
>>
>> Using the allocator above example: what makes sure that no other thread
>> is halfway through modifying allocator state? You really have to sync
>> somehow before calling fork() -- e.g., grabbing allocator locks in
>> pthread_atfork().
> 
> Yeah, like what glibc does for its malloc implementation to prevent
> allocator calls from racing with fork(), so that malloc() keeps
> working after fork(), even though POSIX says that the libc doesn't
> have to guarantee that.

I mean, the patch here is simple, and there is already a performance 
penalty when allocating+copying the page, so it's not really the common 
hot path.

Merely a question if this was ever officially supported and warrents a 
"Fixes:".

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-03 18:21 ` [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot Jann Horn
  2025-06-03 18:29   ` Matthew Wilcox
  2025-06-03 18:33   ` David Hildenbrand
@ 2025-06-03 20:32   ` Pedro Falcato
  2025-06-04 15:41     ` Jann Horn
  2025-06-05  7:33   ` Vlastimil Babka
  2025-06-06 12:49   ` Jann Horn
  4 siblings, 1 reply; 24+ messages in thread
From: Pedro Falcato @ 2025-06-03 20:32 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, Peter Xu,
	linux-kernel, stable

On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
> When fork() encounters possibly-pinned pages, those pages are immediately
> copied instead of just marking PTEs to make CoW happen later. If the parent
> is multithreaded, this can cause the child to see memory contents that are
> inconsistent in multiple ways:
> 
> 1. We are copying the contents of a page with a memcpy() while userspace
>    may be writing to it. This can cause the resulting data in the child to
>    be inconsistent.

This is an interesting problem, but we'll get to it later.

> 2. After we've copied this page, future writes to other pages may
>    continue to be visible to the child while future writes to this page are
>    no longer visible to the child.
>

Yes, and this is not fixable. It's also a problem for the regular write-protect
pte path where inevitably only a part of the address space will be write-protected.
This would only be fixable if e.g we suspended every thread on a multi-threaded fork.

> This means the child could theoretically see incoherent states where
> allocator freelists point to objects that are actually in use or stuff like
> that. A mitigating factor is that, unless userspace already has a deadlock
> bug, userspace can pretty much only observe such issues when fancy lockless
> data structures are used (because if another thread was in the middle of
> mutating data during fork() and the post-fork child tried to take the mutex
> protecting that data, it might wait forever).
> 

Ok, so the issue here is that atomics + memcpy (or our kernel variants) will
possibly observe tearing. This is indeed a problem, and POSIX doesn't _really_
tell us anything about this. _However_:

POSIX says:
> Any locks held by any thread in the calling process that have been set to be process-shared
> shall not be held by the child process. For locks held by any thread in the calling process
> that have not been set to be process-shared, any attempt by the child process to perform
> any operation on the lock results in undefined behavior (regardless of whether the calling
> process is single-threaded or multi-threaded).

The interesting bit here is "For locks held by any thread [...] any attempt by
the child [...] results in UB". I don't think it's entirely far-fetched to say
the spirit of the law is that atomics may also be UB (just like a lock[1] that was
held by a separate thread, then unlocked mid-concurrent-fork is in a UB state).

In any way, I think the bottom-line is that fork memory snapshot coherency is
a fallacy. It's really impossible to reach without adding insane constraints
(like the aforementioned thread suspending + resume). It's not even possible
when going through normal write-protect paths that have been conceptually stable since
the BSDs in the 1980s (due to the write-protect-a-page-at-a-time-problem).

Thus, personally I don't think this is worth fixing.

[1] This (at least in theory) covers every lock, so it also encompasses pthread spinlocks

-- 
Pedro

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-03 19:03     ` Jann Horn
@ 2025-06-04 12:22       ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2025-06-04 12:22 UTC (permalink / raw)
  To: Jann Horn, Matthew Wilcox
  Cc: Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm,
	Peter Xu, linux-kernel, stable

On 03.06.25 21:03, Jann Horn wrote:
> On Tue, Jun 3, 2025 at 8:29 PM Matthew Wilcox <willy@infradead.org> wrote:
>> On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
>>> When fork() encounters possibly-pinned pages, those pages are immediately
>>> copied instead of just marking PTEs to make CoW happen later. If the parent
>>> is multithreaded, this can cause the child to see memory contents that are
>>> inconsistent in multiple ways:
>>>
>>> 1. We are copying the contents of a page with a memcpy() while userspace
>>>     may be writing to it. This can cause the resulting data in the child to
>>>     be inconsistent.
>>> 2. After we've copied this page, future writes to other pages may
>>>     continue to be visible to the child while future writes to this page are
>>>     no longer visible to the child.
>>>
>>> This means the child could theoretically see incoherent states where
>>> allocator freelists point to objects that are actually in use or stuff like
>>> that. A mitigating factor is that, unless userspace already has a deadlock
>>> bug, userspace can pretty much only observe such issues when fancy lockless
>>> data structures are used (because if another thread was in the middle of
>>> mutating data during fork() and the post-fork child tried to take the mutex
>>> protecting that data, it might wait forever).
>>
>> Um, OK, but isn't that expected behaviour?  POSIX says:
> 
> I don't think it is expected behavior that locklessly-updated data
> structures in application code could break.
> 
>> : A process shall be created with a single thread. If a multi-threaded
>> : process calls fork(), the new process shall contain a replica of the
>> : calling thread and its entire address space, possibly including the
>> : states of mutexes and other resources. Consequently, the application
>> : shall ensure that the child process only executes async-signal-safe
>> : operations until such time as one of the exec functions is successful.
> 
> I think that is only talking about ways in which you can interact with
> libc, and does not mean that application code couldn't access its own
> lockless data structures or such.
> 
> Though admittedly that is a fairly theoretical point, since in
> practice the most likely place where you'd encounter this kind of
> issue would be in an allocator implementation or such.

I thought a bit further about that.

The thing is, that another thread in the parent might be doing something 
lockless using atomics etc. And in the parent, that thread will make 
progress as soon as fork() is done. In the child, that thread will never 
make progress again, as it is gone.

So the assumption must be that another thread possibly stalling forever 
and not making any progress will not affect the algorithm in the child.

I am not sure if that is actually the case for many lockless algorithms 
in allocators. I'm curious, do we have any examples of such algorithms, 
in particular, regarding allocators?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-03 20:32   ` Pedro Falcato
@ 2025-06-04 15:41     ` Jann Horn
  2025-06-04 16:16       ` Pedro Falcato
  0 siblings, 1 reply; 24+ messages in thread
From: Jann Horn @ 2025-06-04 15:41 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, Peter Xu,
	linux-kernel, stable

On Tue, Jun 3, 2025 at 10:32 PM Pedro Falcato <pfalcato@suse.de> wrote:
> On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
> > When fork() encounters possibly-pinned pages, those pages are immediately
> > copied instead of just marking PTEs to make CoW happen later. If the parent
> > is multithreaded, this can cause the child to see memory contents that are
> > inconsistent in multiple ways:
> >
> > 1. We are copying the contents of a page with a memcpy() while userspace
> >    may be writing to it. This can cause the resulting data in the child to
> >    be inconsistent.
>
> This is an interesting problem, but we'll get to it later.
>
> > 2. After we've copied this page, future writes to other pages may
> >    continue to be visible to the child while future writes to this page are
> >    no longer visible to the child.
> >
>
> Yes, and this is not fixable. It's also a problem for the regular write-protect
> pte path where inevitably only a part of the address space will be write-protected.

I don't understand what you mean by "inevitably only a part of the
address space will be write-protected". Are you talking about how
shared pages are kept shared between parent in child? Or are you
talking about how there is a point in time at which part of the
address space is write-protected while another part is not yet
write-protected? In that case: Yes, that can happen, but that's not a
problem.

> This would only be fixable if e.g we suspended every thread on a multi-threaded fork.

No, I think it is fine to keep threads running in parallel on a
multi-threaded fork as long as all the writes they do are guaranteed
to also be observable in the child. Such writes are no different from
writes performed before fork().

It would only get problematic if something in the parent first wrote
to page A, which has already been copied to the child (so the child no
longer sees the write) and then wrote to page B, which is CoWed (so
the child would see the write). I prevent this scenario by effectively
suspending the thread that tries to write to page A until the fork is
over (by making it block on the mmap lock in the fault handling path).

> > This means the child could theoretically see incoherent states where
> > allocator freelists point to objects that are actually in use or stuff like
> > that. A mitigating factor is that, unless userspace already has a deadlock
> > bug, userspace can pretty much only observe such issues when fancy lockless
> > data structures are used (because if another thread was in the middle of
> > mutating data during fork() and the post-fork child tried to take the mutex
> > protecting that data, it might wait forever).
> >
>
> Ok, so the issue here is that atomics + memcpy (or our kernel variants) will
> possibly observe tearing. This is indeed a problem, and POSIX doesn't _really_
> tell us anything about this. _However_:
>
> POSIX says:
> > Any locks held by any thread in the calling process that have been set to be process-shared
> > shall not be held by the child process. For locks held by any thread in the calling process
> > that have not been set to be process-shared, any attempt by the child process to perform
> > any operation on the lock results in undefined behavior (regardless of whether the calling
> > process is single-threaded or multi-threaded).
>
> The interesting bit here is "For locks held by any thread [...] any attempt by
> the child [...] results in UB". I don't think it's entirely far-fetched to say
> the spirit of the law is that atomics may also be UB (just like a lock[1] that was
> held by a separate thread, then unlocked mid-concurrent-fork is in a UB state).

I think interpreting atomic operations as locks is far-fetched. Also,
POSIX is a sort of minimal bar, and if we only implemented things
explicitly required by POSIX, we might not have a particularly useful
operating system.

Besides, I think things specified by the C standard override whatever
POSIX says, and C23 specifies that there are atomic operations, and I
haven't seen anything in C23 that restricts availability of those to
before fork().

> In any way, I think the bottom-line is that fork memory snapshot coherency is
> a fallacy. It's really impossible to reach without adding insane constraints
> (like the aforementioned thread suspending + resume). It's not even possible
> when going through normal write-protect paths that have been conceptually stable since
> the BSDs in the 1980s (due to the write-protect-a-page-at-a-time-problem).

No, Linux already had memory snapshot coherency before commit
70e806e4e645 ("mm: Do early cow for pinned pages during fork() for
ptes"). Write-protecting a page at a time does not cause coherency
issues, because letting a concurrent thread write into such memory
during fork() is no different from letting it do so before fork() from
a memory coherency perspective, as long as fork() write-locks memory
management for the process.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-04 15:41     ` Jann Horn
@ 2025-06-04 16:16       ` Pedro Falcato
  0 siblings, 0 replies; 24+ messages in thread
From: Pedro Falcato @ 2025-06-04 16:16 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, Peter Xu,
	linux-kernel, stable

On Wed, Jun 04, 2025 at 05:41:47PM +0200, Jann Horn wrote:
> On Tue, Jun 3, 2025 at 10:32 PM Pedro Falcato <pfalcato@suse.de> wrote:
> > On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
> > > When fork() encounters possibly-pinned pages, those pages are immediately
> > > copied instead of just marking PTEs to make CoW happen later. If the parent
> > > is multithreaded, this can cause the child to see memory contents that are
> > > inconsistent in multiple ways:
> > >
> > > 1. We are copying the contents of a page with a memcpy() while userspace
> > >    may be writing to it. This can cause the resulting data in the child to
> > >    be inconsistent.
> >
> > This is an interesting problem, but we'll get to it later.
> >
> > > 2. After we've copied this page, future writes to other pages may
> > >    continue to be visible to the child while future writes to this page are
> > >    no longer visible to the child.
> > >
> >
> > Yes, and this is not fixable. It's also a problem for the regular write-protect
> > pte path where inevitably only a part of the address space will be write-protected.
> 
> I don't understand what you mean by "inevitably only a part of the
> address space will be write-protected". Are you talking about how
> shared pages are kept shared between parent in child? Or are you
> talking about how there is a point in time at which part of the
> address space is write-protected while another part is not yet
> write-protected? In that case: Yes, that can happen, but that's not a
> problem.
> 
> > This would only be fixable if e.g we suspended every thread on a multi-threaded fork.
> 
> No, I think it is fine to keep threads running in parallel on a
> multi-threaded fork as long as all the writes they do are guaranteed
> to also be observable in the child. Such writes are no different from
> writes performed before fork().
> 
> It would only get problematic if something in the parent first wrote
> to page A, which has already been copied to the child (so the child no
> longer sees the write) and then wrote to page B, which is CoWed (so
> the child would see the write). I prevent this scenario by effectively
> suspending the thread that tries to write to page A until the fork is
> over (by making it block on the mmap lock in the fault handling path).
> 

Ah yes, I see my mistake - we write lock all VMAs as we dup them, so
the problem I described can't happen. Thanks for the explanation :)

-- 
Pedro

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] mm/memory: Document how we make a coherent memory snapshot
  2025-06-03 18:21 ` [PATCH 2/2] mm/memory: Document how we make a " Jann Horn
@ 2025-06-04 17:03   ` Peter Xu
  2025-06-04 18:11     ` Jann Horn
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2025-06-04 17:03 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel

On Tue, Jun 03, 2025 at 08:21:03PM +0200, Jann Horn wrote:
> It is not currently documented that the child of fork() should receive a
> coherent snapshot of the parent's memory, or how we get such a snapshot.
> Add a comment block to explain this.
> 
> Signed-off-by: Jann Horn <jannh@google.com>
> ---
>  kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 85afccfdf3b1..f78f5df596a9 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -604,6 +604,40 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
>  }
>  
>  #ifdef CONFIG_MMU
> +/*
> + * Anonymous memory inherited by the child MM must, on success, contain a
> + * coherent snapshot of corresponding anonymous memory in the parent MM.

Should we better define what is a coherent snapshot?  Or maybe avoid using
this term which seems to apply to the whole mm?

I think it's at least not a snapshot of whole mm at a specific time,
because as long as there can be more than one concurrent writers (hence, it
needs to be at least 3 threads in the parent process, 1 in charge of fork),
this can happen:

  parent writer 1      parent writer 2    parent fork thr
  ---------------      ---------------    ---------------
                                          wr-protect P1
  write P1                                                  <---- T1
  (trapped, didn't happen)
                       write PN                             <---- T2
                       (went through)
                                          ...
                                          wr-protect PN

The result of above would be that child process will see a mixture of old
P1 (at timestamp T1) but updated P2 (timestamp T2).  I don't think it's
impossible that the userapp could try to serialize "write P1" and "write
PN" operations in a way that it would also get a surprise seeing in the
child PN updated but P1 didn't.

I do agree it at least recovered the per-page coherence, though, no matter
what is the POSIX definition of that.  IIUC an userapp can always fix such
problem, but maybe it's too complicated in some cases, and if Linux used to
at least maintain per-page coherency, then it may make sense to recover the
behavior especially when it only affects pinned.

Said that, maybe we still want to be specific on the goal of the change.

Thanks,

> + * (An exception are anonymous memory regions which are concurrently written
> + * by kernel code or hardware devices through page references obtained via GUP.)
> + * We effectively snapshot the parent's memory just before
> + * mmap_write_unlock(oldmm); any writes after that point are invisible to the
> + * child, while attempted writes before that point are either visible to the
> + * child or delayed until after mmap_write_unlock(oldmm).
> + *
> + * To make that work while only needing a single pass through the parent's VMA
> + * tree and page tables, we follow these rules:
> + *
> + *  - Before mmap_write_unlock(), a TLB flush ensures that parent threads can't
> + *    write to copy-on-write pages anymore.
> + *  - Before dup_mmap() copies page contents (which happens rarely), the
> + *    parent's PTE for the page is made read-only and a TLB flush is issued, so
> + *    subsequent writes are delayed until mmap_write_unlock().
> + *  - Before dup_mmap() starts walking the page tables of a VMA in the parent,
> + *    the VMA is write-locked to ensure that the parent can't perform writes
> + *    that won't be visible in the child before mmap_write_unlock():
> + *      a) through concurrent copy-on-write handling
> + *      b) by upgrading read-only PTEs to writable
> + *
> + * Not following these rules, and giving the child a torn copy of the parent's
> + * memory contents where different segments come from different points in time,
> + * would likely _mostly_ work:
> + * Any memory to which a concurrent parent thread could be writing under a lock
> + * can't be accessed from the child without risking deadlocks (since the child
> + * might inherit the lock in a locked state, in which case the lock will stay
> + * locked forever in the child).
> + * But if userspace is using trylock or lock-free algorithms, providing a torn
> + * view of memory could break the child.
> + */
>  static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  					struct mm_struct *oldmm)
>  {
> 
> -- 
> 2.49.0.1204.g71687c7c1d-goog
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] mm/memory: Document how we make a coherent memory snapshot
  2025-06-04 17:03   ` Peter Xu
@ 2025-06-04 18:11     ` Jann Horn
  2025-06-04 20:10       ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Jann Horn @ 2025-06-04 18:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel

On Wed, Jun 4, 2025 at 7:04 PM Peter Xu <peterx@redhat.com> wrote:
> On Tue, Jun 03, 2025 at 08:21:03PM +0200, Jann Horn wrote:
> > It is not currently documented that the child of fork() should receive a
> > coherent snapshot of the parent's memory, or how we get such a snapshot.
> > Add a comment block to explain this.
> >
> > Signed-off-by: Jann Horn <jannh@google.com>
> > ---
> >  kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 34 insertions(+)
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 85afccfdf3b1..f78f5df596a9 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -604,6 +604,40 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
> >  }
> >
> >  #ifdef CONFIG_MMU
> > +/*
> > + * Anonymous memory inherited by the child MM must, on success, contain a
> > + * coherent snapshot of corresponding anonymous memory in the parent MM.
>
> Should we better define what is a coherent snapshot?  Or maybe avoid using
> this term which seems to apply to the whole mm?
>
> I think it's at least not a snapshot of whole mm at a specific time,
> because as long as there can be more than one concurrent writers (hence, it
> needs to be at least 3 threads in the parent process, 1 in charge of fork),
> this can happen:
>
>   parent writer 1      parent writer 2    parent fork thr
>   ---------------      ---------------    ---------------
>                                           wr-protect P1
>   write P1                                                  <---- T1
>   (trapped, didn't happen)
>                        write PN                             <---- T2
>                        (went through)
>                                           ...
>                                           wr-protect PN
>
> The result of above would be that child process will see a mixture of old
> P1 (at timestamp T1) but updated P2 (timestamp T2).  I don't think it's
> impossible that the userapp could try to serialize "write P1" and "write
> PN" operations in a way that it would also get a surprise seeing in the
> child PN updated but P1 didn't.

If the write at T1 hits a page fault, then it doesn't actually happen
at T1. The write instruction starts doing something at T1, but it does
not fully retire, and the architectural register state does not
change, and in particular the instruction pointer does not advance
past this instruction; just like when speculative execution is aborted
after a branch misprediction, except that the CPU raises an exception
and we enter the page fault handler. The write actually happens when
the instruction is executed a second time after page fault handling
has completed after the mmap lock is dropped. (Unless something during
page fault handling raises a signal, in which case the instruction
might never architecturally execute.)

(There is a caveat to what I just said, which makes this more complex
but does not fundamentally change the outcome: An instruction that
performs multiple memory writes without specific atomicity guarantees
can successfully do some writes and then fail on a later write. In
this case, after the page fault handler resolves the fault, the entire
instruction will run from the start again, including re-doing the
writes that were already done on the first execution, and this works
because such instructions are designed to be idempotent in this regard
and they don't make atomicity guarantees.)

> I do agree it at least recovered the per-page coherence, though, no matter
> what is the POSIX definition of that.  IIUC an userapp can always fix such
> problem, but maybe it's too complicated in some cases, and if Linux used to
> at least maintain per-page coherency, then it may make sense to recover the
> behavior especially when it only affects pinned.
>
> Said that, maybe we still want to be specific on the goal of the change.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] mm/memory: Document how we make a coherent memory snapshot
  2025-06-04 18:11     ` Jann Horn
@ 2025-06-04 20:10       ` Peter Xu
  2025-06-04 20:28         ` David Hildenbrand
  2025-06-06 14:11         ` Jann Horn
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Xu @ 2025-06-04 20:10 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel

On Wed, Jun 04, 2025 at 08:11:08PM +0200, Jann Horn wrote:
> On Wed, Jun 4, 2025 at 7:04 PM Peter Xu <peterx@redhat.com> wrote:
> > On Tue, Jun 03, 2025 at 08:21:03PM +0200, Jann Horn wrote:
> > > It is not currently documented that the child of fork() should receive a
> > > coherent snapshot of the parent's memory, or how we get such a snapshot.
> > > Add a comment block to explain this.
> > >
> > > Signed-off-by: Jann Horn <jannh@google.com>
> > > ---
> > >  kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 34 insertions(+)
> > >
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 85afccfdf3b1..f78f5df596a9 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -604,6 +604,40 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
> > >  }
> > >
> > >  #ifdef CONFIG_MMU
> > > +/*
> > > + * Anonymous memory inherited by the child MM must, on success, contain a
> > > + * coherent snapshot of corresponding anonymous memory in the parent MM.
> >
> > Should we better define what is a coherent snapshot?  Or maybe avoid using
> > this term which seems to apply to the whole mm?
> >
> > I think it's at least not a snapshot of whole mm at a specific time,
> > because as long as there can be more than one concurrent writers (hence, it
> > needs to be at least 3 threads in the parent process, 1 in charge of fork),
> > this can happen:
> >
> >   parent writer 1      parent writer 2    parent fork thr
> >   ---------------      ---------------    ---------------
> >                                           wr-protect P1
> >   write P1                                                  <---- T1
> >   (trapped, didn't happen)
> >                        write PN                             <---- T2
> >                        (went through)
> >                                           ...
> >                                           wr-protect PN
> >
> > The result of above would be that child process will see a mixture of old
> > P1 (at timestamp T1) but updated P2 (timestamp T2).  I don't think it's
> > impossible that the userapp could try to serialize "write P1" and "write
> > PN" operations in a way that it would also get a surprise seeing in the
> > child PN updated but P1 didn't.
> 
> If the write at T1 hits a page fault, then it doesn't actually happen
> at T1. The write instruction starts doing something at T1, but it does
> not fully retire, and the architectural register state does not
> change, and in particular the instruction pointer does not advance
> past this instruction; just like when speculative execution is aborted
> after a branch misprediction, except that the CPU raises an exception
> and we enter the page fault handler. The write actually happens when
> the instruction is executed a second time after page fault handling
> has completed after the mmap lock is dropped. (Unless something during
> page fault handling raises a signal, in which case the instruction
> might never architecturally execute.)

Fair enough.  So maybe that's something like a best-effort whole mm
snapshot anytime happened during the fork() but before releasing mmap write
lock.

Your comment did mention one exception on the kernel, is it still pretty
easy to happen?  I'm thinking this use case of trying to load some data
from a O_DIRECT fd and then set the var to show it's loaded:

  bool data_read=0
  read(...);
  data_read=1;

Then IIUC this can happen:

    parent thread 1                        parent fork thr
    ---------------                        ---------------
    read(...)
      using O_DIRECT on priv-anon buffers P1
      pin_user_pages
                                           fork() happens
                                             Sees P1 pinned
                                             P1 early COW (child sees no data loaded)
      memcpy()
    set data_read=1
    (data_read can be a global private var on P2)
                                             P2 wr-protected (child sees data_read=1)

Hence in child even if it sees data_read=1 it is possible the buffer may be
uninitialized, or the buffer is partly loaded, still racing with the kernel
early COW.

I'm not sure if I understand it correct this time as you discussed in the
comment. If so, should we still not emphasize too much on the kernel
providing coherent mm snapshot, at least emphasize the best-effort part
(both in comment of patch 2, but also in patch subjects)?  After all, it
seems it isn't straightforward for any userapp to see when that coherency
will be violated.

From that POV, maybe it's better we should still suggest the undefined
behavior, even if it'll recover the old behavior some existing use case?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] mm/memory: Document how we make a coherent memory snapshot
  2025-06-04 20:10       ` Peter Xu
@ 2025-06-04 20:28         ` David Hildenbrand
  2025-06-06 14:11         ` Jann Horn
  1 sibling, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2025-06-04 20:28 UTC (permalink / raw)
  To: Peter Xu, Jann Horn
  Cc: Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm,
	linux-kernel

On 04.06.25 22:10, Peter Xu wrote:
> On Wed, Jun 04, 2025 at 08:11:08PM +0200, Jann Horn wrote:
>> On Wed, Jun 4, 2025 at 7:04 PM Peter Xu <peterx@redhat.com> wrote:
>>> On Tue, Jun 03, 2025 at 08:21:03PM +0200, Jann Horn wrote:
>>>> It is not currently documented that the child of fork() should receive a
>>>> coherent snapshot of the parent's memory, or how we get such a snapshot.
>>>> Add a comment block to explain this.
>>>>
>>>> Signed-off-by: Jann Horn <jannh@google.com>
>>>> ---
>>>>   kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++
>>>>   1 file changed, 34 insertions(+)
>>>>
>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>> index 85afccfdf3b1..f78f5df596a9 100644
>>>> --- a/kernel/fork.c
>>>> +++ b/kernel/fork.c
>>>> @@ -604,6 +604,40 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
>>>>   }
>>>>
>>>>   #ifdef CONFIG_MMU
>>>> +/*
>>>> + * Anonymous memory inherited by the child MM must, on success, contain a
>>>> + * coherent snapshot of corresponding anonymous memory in the parent MM.
>>>
>>> Should we better define what is a coherent snapshot?  Or maybe avoid using
>>> this term which seems to apply to the whole mm?
>>>
>>> I think it's at least not a snapshot of whole mm at a specific time,
>>> because as long as there can be more than one concurrent writers (hence, it
>>> needs to be at least 3 threads in the parent process, 1 in charge of fork),
>>> this can happen:
>>>
>>>    parent writer 1      parent writer 2    parent fork thr
>>>    ---------------      ---------------    ---------------
>>>                                            wr-protect P1
>>>    write P1                                                  <---- T1
>>>    (trapped, didn't happen)
>>>                         write PN                             <---- T2
>>>                         (went through)
>>>                                            ...
>>>                                            wr-protect PN
>>>
>>> The result of above would be that child process will see a mixture of old
>>> P1 (at timestamp T1) but updated P2 (timestamp T2).  I don't think it's
>>> impossible that the userapp could try to serialize "write P1" and "write
>>> PN" operations in a way that it would also get a surprise seeing in the
>>> child PN updated but P1 didn't.
>>
>> If the write at T1 hits a page fault, then it doesn't actually happen
>> at T1. The write instruction starts doing something at T1, but it does
>> not fully retire, and the architectural register state does not
>> change, and in particular the instruction pointer does not advance
>> past this instruction; just like when speculative execution is aborted
>> after a branch misprediction, except that the CPU raises an exception
>> and we enter the page fault handler. The write actually happens when
>> the instruction is executed a second time after page fault handling
>> has completed after the mmap lock is dropped. (Unless something during
>> page fault handling raises a signal, in which case the instruction
>> might never architecturally execute.)
> 
> Fair enough.  So maybe that's something like a best-effort whole mm
> snapshot anytime happened during the fork() but before releasing mmap write
> lock.
> 
> Your comment did mention one exception on the kernel, is it still pretty
> easy to happen?  I'm thinking this use case of trying to load some data
> from a O_DIRECT fd and then set the var to show it's loaded:
> 
>    bool data_read=0
>    read(...);
>    data_read=1;
> 
> Then IIUC this can happen:
> 
>      parent thread 1                        parent fork thr
>      ---------------                        ---------------
>      read(...)
>        using O_DIRECT on priv-anon buffers P1
>        pin_user_pages
>                                             fork() happens
>                                               Sees P1 pinned
>                                               P1 early COW (child sees no data loaded)
>        memcpy()
>      set data_read=1
>      (data_read can be a global private var on P2)
>                                               P2 wr-protected (child sees data_read=1)
> 
> Hence in child even if it sees data_read=1 it is possible the buffer may be
> uninitialized, or the buffer is partly loaded, still racing with the kernel
> early COW.

Just mentioning that O_DIRECT and fork() has had a problematic 
relationship for a long time, although we are getting better at handling 
it (IOW, not break common setups in nasty ways).

"man open" is still quite verbose on that "O_DIRECT  I/Os  should never 
be run concurrently with the fork(2) system call ..."

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-03 18:21 ` [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot Jann Horn
                     ` (2 preceding siblings ...)
  2025-06-03 20:32   ` Pedro Falcato
@ 2025-06-05  7:33   ` Vlastimil Babka
  2025-06-05 12:30     ` Pedro Falcato
  2025-06-06 12:55     ` Jann Horn
  2025-06-06 12:49   ` Jann Horn
  4 siblings, 2 replies; 24+ messages in thread
From: Vlastimil Babka @ 2025-06-05  7:33 UTC (permalink / raw)
  To: Jann Horn, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, Pedro Falcato
  Cc: Peter Xu, linux-kernel, stable

On 6/3/25 20:21, Jann Horn wrote:
> When fork() encounters possibly-pinned pages, those pages are immediately
> copied instead of just marking PTEs to make CoW happen later. If the parent
> is multithreaded, this can cause the child to see memory contents that are
> inconsistent in multiple ways:
> 
> 1. We are copying the contents of a page with a memcpy() while userspace
>    may be writing to it. This can cause the resulting data in the child to
>    be inconsistent.
> 2. After we've copied this page, future writes to other pages may
>    continue to be visible to the child while future writes to this page are
>    no longer visible to the child.
> 
> This means the child could theoretically see incoherent states where
> allocator freelists point to objects that are actually in use or stuff like
> that. A mitigating factor is that, unless userspace already has a deadlock
> bug, userspace can pretty much only observe such issues when fancy lockless
> data structures are used (because if another thread was in the middle of
> mutating data during fork() and the post-fork child tried to take the mutex
> protecting that data, it might wait forever).
> 
> On top of that, this issue is only observable when pages are either
> DMA-pinned or appear false-positive-DMA-pinned due to a page having >=1024
> references and the parent process having used DMA-pinning at least once
> before.

Seems the changelog seems to be missing the part describing what it's doing
to fix the issue? Some details are not immediately obvious (the writing
threads become blocked in page fault) as the conversation has shown.

> Fixes: 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jann Horn <jannh@google.com>

Given how the fix seems to be localized to the already rare slowpath and
doesn't require us to pessimize every trivial fork(), it seems reasonable to
me even if don't have a concrete example of a sane code in the wild that's
broken by the current behavior, so:

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/memory.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 49199410805c..b406dfda976b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -917,7 +917,25 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>  	/*
>  	 * We have a prealloc page, all good!  Take it
>  	 * over and copy the page & arm it.
> +	 *
> +	 * One nasty aspect is that we could be in a multithreaded process or
> +	 * such, where another thread is in the middle of writing to memory
> +	 * while this thread is forking. As long as we're just marking PTEs as
> +	 * read-only to make copy-on-write happen *later*, that's easy; we just
> +	 * need to do a single TLB flush before dropping the mmap/VMA locks, and
> +	 * that's enough to guarantee that the child gets a coherent snapshot of
> +	 * memory.
> +	 * But here, where we're doing an immediate copy, we must ensure that
> +	 * threads in the parent process can no longer write into the page being
> +	 * copied until we're done forking.
> +	 * This means that we still need to mark the source PTE as read-only,
> +	 * with an immediate TLB flush.
> +	 * (To make the source PTE writable again after fork() is done, we can
> +	 * rely on the page fault handler to do that lazily, thanks to
> +	 * PageAnonExclusive().)
>  	 */
> +	ptep_set_wrprotect(src_vma->vm_mm, addr, src_pte);
> +	flush_tlb_page(src_vma, addr);
>  
>  	if (copy_mc_user_highpage(&new_folio->page, page, addr, src_vma))
>  		return -EHWPOISON;
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-05  7:33   ` Vlastimil Babka
@ 2025-06-05 12:30     ` Pedro Falcato
  2025-06-06 12:55     ` Jann Horn
  1 sibling, 0 replies; 24+ messages in thread
From: Pedro Falcato @ 2025-06-05 12:30 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Jann Horn, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, Peter Xu, linux-kernel, stable

On Thu, Jun 05, 2025 at 09:33:24AM +0200, Vlastimil Babka wrote:
> On 6/3/25 20:21, Jann Horn wrote:
> > When fork() encounters possibly-pinned pages, those pages are immediately
> > copied instead of just marking PTEs to make CoW happen later. If the parent
> > is multithreaded, this can cause the child to see memory contents that are
> > inconsistent in multiple ways:
> > 
> > 1. We are copying the contents of a page with a memcpy() while userspace
> >    may be writing to it. This can cause the resulting data in the child to
> >    be inconsistent.
> > 2. After we've copied this page, future writes to other pages may
> >    continue to be visible to the child while future writes to this page are
> >    no longer visible to the child.
> > 
> > This means the child could theoretically see incoherent states where
> > allocator freelists point to objects that are actually in use or stuff like
> > that. A mitigating factor is that, unless userspace already has a deadlock
> > bug, userspace can pretty much only observe such issues when fancy lockless
> > data structures are used (because if another thread was in the middle of
> > mutating data during fork() and the post-fork child tried to take the mutex
> > protecting that data, it might wait forever).
> > 
> > On top of that, this issue is only observable when pages are either
> > DMA-pinned or appear false-positive-DMA-pinned due to a page having >=1024
> > references and the parent process having used DMA-pinning at least once
> > before.
> 
> Seems the changelog seems to be missing the part describing what it's doing
> to fix the issue? Some details are not immediately obvious (the writing
> threads become blocked in page fault) as the conversation has shown.
> 
> > Fixes: 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Jann Horn <jannh@google.com>
> 
> Given how the fix seems to be localized to the already rare slowpath and
> doesn't require us to pessimize every trivial fork(), it seems reasonable to
> me even if don't have a concrete example of a sane code in the wild that's
> broken by the current behavior, so:
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Pedro Falcato <pfalcato@suse.de>

-- 
Pedro

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-03 18:21 ` [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot Jann Horn
                     ` (3 preceding siblings ...)
  2025-06-05  7:33   ` Vlastimil Babka
@ 2025-06-06 12:49   ` Jann Horn
  2025-06-06 15:49     ` Vlastimil Babka
  4 siblings, 1 reply; 24+ messages in thread
From: Jann Horn @ 2025-06-06 12:49 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm
  Cc: Peter Xu, linux-kernel, stable

On Tue, Jun 3, 2025 at 8:21 PM Jann Horn <jannh@google.com> wrote:
> @@ -917,7 +917,25 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>         /*
>          * We have a prealloc page, all good!  Take it
>          * over and copy the page & arm it.
> +        *
> +        * One nasty aspect is that we could be in a multithreaded process or
> +        * such, where another thread is in the middle of writing to memory
> +        * while this thread is forking. As long as we're just marking PTEs as
> +        * read-only to make copy-on-write happen *later*, that's easy; we just
> +        * need to do a single TLB flush before dropping the mmap/VMA locks, and
> +        * that's enough to guarantee that the child gets a coherent snapshot of
> +        * memory.
> +        * But here, where we're doing an immediate copy, we must ensure that
> +        * threads in the parent process can no longer write into the page being
> +        * copied until we're done forking.
> +        * This means that we still need to mark the source PTE as read-only,
> +        * with an immediate TLB flush.
> +        * (To make the source PTE writable again after fork() is done, we can
> +        * rely on the page fault handler to do that lazily, thanks to
> +        * PageAnonExclusive().)
>          */
> +       ptep_set_wrprotect(src_vma->vm_mm, addr, src_pte);
> +       flush_tlb_page(src_vma, addr);

Hmm... this is actually wrong, because we did
arch_enter_lazy_mmu_mode() up in copy_pte_range(). So I guess I
actually have to do:

arch_leave_lazy_mmu_mode();
ptep_set_wrprotect(src_vma->vm_mm, addr, src_pte);
flush_tlb_page(src_vma, addr);
arch_enter_lazy_mmu_mode();

(arch_flush_lazy_mmu_mode() would look a bit nicer, but powerpc
doesn't implement that.)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-05  7:33   ` Vlastimil Babka
  2025-06-05 12:30     ` Pedro Falcato
@ 2025-06-06 12:55     ` Jann Horn
  2025-06-06 15:34       ` Vlastimil Babka
  1 sibling, 1 reply; 24+ messages in thread
From: Jann Horn @ 2025-06-06 12:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, Pedro Falcato, Peter Xu, linux-kernel, stable

On Thu, Jun 5, 2025 at 9:33 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> On 6/3/25 20:21, Jann Horn wrote:
> > When fork() encounters possibly-pinned pages, those pages are immediately
> > copied instead of just marking PTEs to make CoW happen later. If the parent
> > is multithreaded, this can cause the child to see memory contents that are
> > inconsistent in multiple ways:
> >
> > 1. We are copying the contents of a page with a memcpy() while userspace
> >    may be writing to it. This can cause the resulting data in the child to
> >    be inconsistent.
> > 2. After we've copied this page, future writes to other pages may
> >    continue to be visible to the child while future writes to this page are
> >    no longer visible to the child.
> >
> > This means the child could theoretically see incoherent states where
> > allocator freelists point to objects that are actually in use or stuff like
> > that. A mitigating factor is that, unless userspace already has a deadlock
> > bug, userspace can pretty much only observe such issues when fancy lockless
> > data structures are used (because if another thread was in the middle of
> > mutating data during fork() and the post-fork child tried to take the mutex
> > protecting that data, it might wait forever).
> >
> > On top of that, this issue is only observable when pages are either
> > DMA-pinned or appear false-positive-DMA-pinned due to a page having >=1024
> > references and the parent process having used DMA-pinning at least once
> > before.
>
> Seems the changelog seems to be missing the part describing what it's doing
> to fix the issue? Some details are not immediately obvious (the writing
> threads become blocked in page fault) as the conversation has shown.

I tried to document this in patch 2/2, where I wrote this (though I
guess I should maybe make it more verbose and not just say "subsequent
writes are delayed until mmap_write_unlock()"):

+ *  - Before mmap_write_unlock(), a TLB flush ensures that parent threads can't
+ *    write to copy-on-write pages anymore.
+ *  - Before dup_mmap() copies page contents (which happens rarely), the
+ *    parent's PTE for the page is made read-only and a TLB flush is issued, so
+ *    subsequent writes are delayed until mmap_write_unlock().

But I guess this way makes it hard to review patch 1/2 individually.
Should I just squash the two patches together, and then write in the
commit message "see the comment blocks I'm adding for the fix
approach"? Or is there value in repeating the explanation in the
commit message?

> > Fixes: 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Jann Horn <jannh@google.com>
>
> Given how the fix seems to be localized to the already rare slowpath and
> doesn't require us to pessimize every trivial fork(), it seems reasonable to
> me even if don't have a concrete example of a sane code in the wild that's
> broken by the current behavior, so:
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] mm/memory: Document how we make a coherent memory snapshot
  2025-06-04 20:10       ` Peter Xu
  2025-06-04 20:28         ` David Hildenbrand
@ 2025-06-06 14:11         ` Jann Horn
  1 sibling, 0 replies; 24+ messages in thread
From: Jann Horn @ 2025-06-06 14:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel

On Wed, Jun 4, 2025 at 10:10 PM Peter Xu <peterx@redhat.com> wrote:
> On Wed, Jun 04, 2025 at 08:11:08PM +0200, Jann Horn wrote:
> > On Wed, Jun 4, 2025 at 7:04 PM Peter Xu <peterx@redhat.com> wrote:
> > > On Tue, Jun 03, 2025 at 08:21:03PM +0200, Jann Horn wrote:
> > > > It is not currently documented that the child of fork() should receive a
> > > > coherent snapshot of the parent's memory, or how we get such a snapshot.
> > > > Add a comment block to explain this.
> > > >
> > > > Signed-off-by: Jann Horn <jannh@google.com>
> > > > ---
> > > >  kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 34 insertions(+)
> > > >
> > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > index 85afccfdf3b1..f78f5df596a9 100644
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -604,6 +604,40 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
> > > >  }
> > > >
> > > >  #ifdef CONFIG_MMU
> > > > +/*
> > > > + * Anonymous memory inherited by the child MM must, on success, contain a
> > > > + * coherent snapshot of corresponding anonymous memory in the parent MM.
> > >
> > > Should we better define what is a coherent snapshot?  Or maybe avoid using
> > > this term which seems to apply to the whole mm?
> > >
> > > I think it's at least not a snapshot of whole mm at a specific time,
> > > because as long as there can be more than one concurrent writers (hence, it
> > > needs to be at least 3 threads in the parent process, 1 in charge of fork),
> > > this can happen:
> > >
> > >   parent writer 1      parent writer 2    parent fork thr
> > >   ---------------      ---------------    ---------------
> > >                                           wr-protect P1
> > >   write P1                                                  <---- T1
> > >   (trapped, didn't happen)
> > >                        write PN                             <---- T2
> > >                        (went through)
> > >                                           ...
> > >                                           wr-protect PN
> > >
> > > The result of above would be that child process will see a mixture of old
> > > P1 (at timestamp T1) but updated P2 (timestamp T2).  I don't think it's
> > > impossible that the userapp could try to serialize "write P1" and "write
> > > PN" operations in a way that it would also get a surprise seeing in the
> > > child PN updated but P1 didn't.
> >
> > If the write at T1 hits a page fault, then it doesn't actually happen
> > at T1. The write instruction starts doing something at T1, but it does
> > not fully retire, and the architectural register state does not
> > change, and in particular the instruction pointer does not advance
> > past this instruction; just like when speculative execution is aborted
> > after a branch misprediction, except that the CPU raises an exception
> > and we enter the page fault handler. The write actually happens when
> > the instruction is executed a second time after page fault handling
> > has completed after the mmap lock is dropped. (Unless something during
> > page fault handling raises a signal, in which case the instruction
> > might never architecturally execute.)
>
> Fair enough.  So maybe that's something like a best-effort whole mm
> snapshot anytime happened during the fork() but before releasing mmap write
> lock.
>
> Your comment did mention one exception on the kernel, is it still pretty
> easy to happen?  I'm thinking this use case of trying to load some data
> from a O_DIRECT fd and then set the var to show it's loaded:
>
>   bool data_read=0
>   read(...);
>   data_read=1;
>
> Then IIUC this can happen:
>
>     parent thread 1                        parent fork thr
>     ---------------                        ---------------
>     read(...)
>       using O_DIRECT on priv-anon buffers P1
>       pin_user_pages
>                                            fork() happens
>                                              Sees P1 pinned
>                                              P1 early COW (child sees no data loaded)
>       memcpy()
>     set data_read=1
>     (data_read can be a global private var on P2)
>                                              P2 wr-protected (child sees data_read=1)
>
> Hence in child even if it sees data_read=1 it is possible the buffer may be
> uninitialized, or the buffer is partly loaded, still racing with the kernel
> early COW.

Urgh. True, I had not considered that case.

> I'm not sure if I understand it correct this time as you discussed in the
> comment. If so, should we still not emphasize too much on the kernel
> providing coherent mm snapshot, at least emphasize the best-effort part
> (both in comment of patch 2, but also in patch subjects)?  After all, it
> seems it isn't straightforward for any userapp to see when that coherency
> will be violated.

Yeah, at least I should add a big caveat in this comment about how
O_DIRECT read buffers might be stale in such a case, and that we're
doing the best we can with a single-pass approach. Urgh.

> From that POV, maybe it's better we should still suggest the undefined
> behavior, even if it'll recover the old behavior some existing use case?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-06 12:55     ` Jann Horn
@ 2025-06-06 15:34       ` Vlastimil Babka
  0 siblings, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2025-06-06 15:34 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, Pedro Falcato, Peter Xu, linux-kernel, stable

On 6/6/25 14:55, Jann Horn wrote:
> On Thu, Jun 5, 2025 at 9:33 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> + *  - Before mmap_write_unlock(), a TLB flush ensures that parent threads can't
> + *    write to copy-on-write pages anymore.
> + *  - Before dup_mmap() copies page contents (which happens rarely), the
> + *    parent's PTE for the page is made read-only and a TLB flush is issued, so
> + *    subsequent writes are delayed until mmap_write_unlock().
> 
> But I guess this way makes it hard to review patch 1/2 individually.
> Should I just squash the two patches together, and then write in the
> commit message "see the comment blocks I'm adding for the fix

That would be good unless it makes it increases the conflicts in stable
backports.

> approach"? Or is there value in repeating the explanation in the
> commit message?

Depending on the above, if the stable fix needs to stay minimal, it would be
valuable to make it more self-contained by repeating that in the commit
message. So that the LLM has an easier job marking it as a CVE. </sarcasm>

Vlastimil

>> > Fixes: 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes")
>> > Cc: stable@vger.kernel.org
>> > Signed-off-by: Jann Horn <jannh@google.com>
>>
>> Given how the fix seems to be localized to the already rare slowpath and
>> doesn't require us to pessimize every trivial fork(), it seems reasonable to
>> me even if don't have a concrete example of a sane code in the wild that's
>> broken by the current behavior, so:
>>
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Thanks!


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
  2025-06-06 12:49   ` Jann Horn
@ 2025-06-06 15:49     ` Vlastimil Babka
  0 siblings, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2025-06-06 15:49 UTC (permalink / raw)
  To: Jann Horn, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm
  Cc: Peter Xu, linux-kernel, stable

On 6/6/25 14:49, Jann Horn wrote:
> On Tue, Jun 3, 2025 at 8:21 PM Jann Horn <jannh@google.com> wrote:
>> @@ -917,7 +917,25 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>>         /*
>>          * We have a prealloc page, all good!  Take it
>>          * over and copy the page & arm it.
>> +        *
>> +        * One nasty aspect is that we could be in a multithreaded process or
>> +        * such, where another thread is in the middle of writing to memory
>> +        * while this thread is forking. As long as we're just marking PTEs as
>> +        * read-only to make copy-on-write happen *later*, that's easy; we just
>> +        * need to do a single TLB flush before dropping the mmap/VMA locks, and
>> +        * that's enough to guarantee that the child gets a coherent snapshot of
>> +        * memory.
>> +        * But here, where we're doing an immediate copy, we must ensure that
>> +        * threads in the parent process can no longer write into the page being
>> +        * copied until we're done forking.
>> +        * This means that we still need to mark the source PTE as read-only,
>> +        * with an immediate TLB flush.
>> +        * (To make the source PTE writable again after fork() is done, we can
>> +        * rely on the page fault handler to do that lazily, thanks to
>> +        * PageAnonExclusive().)
>>          */
>> +       ptep_set_wrprotect(src_vma->vm_mm, addr, src_pte);
>> +       flush_tlb_page(src_vma, addr);
> 
> Hmm... this is actually wrong, because we did
> arch_enter_lazy_mmu_mode() up in copy_pte_range(). So I guess I
> actually have to do:
> 
> arch_leave_lazy_mmu_mode();
> ptep_set_wrprotect(src_vma->vm_mm, addr, src_pte);
> flush_tlb_page(src_vma, addr);
> arch_enter_lazy_mmu_mode();
> 
> (arch_flush_lazy_mmu_mode() would look a bit nicer, but powerpc
> doesn't implement that.)

Hm isn't that kinda weird that an arch can
#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE and then just not implement an
important part of it?
IIUC think it's the same with arch/sparc/include/asm/tlbflush_64.h ?
Should be possible to implement reusing part of the code from the respective
arch_leave_lazy_mmu_mode() right?


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2025-06-06 15:49 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-03 18:21 [PATCH 0/2] mm/memory: fix memory tearing on threaded fork Jann Horn
2025-06-03 18:21 ` [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot Jann Horn
2025-06-03 18:29   ` Matthew Wilcox
2025-06-03 18:37     ` David Hildenbrand
2025-06-03 19:09       ` Jann Horn
2025-06-03 20:17         ` David Hildenbrand
2025-06-03 19:03     ` Jann Horn
2025-06-04 12:22       ` David Hildenbrand
2025-06-03 18:33   ` David Hildenbrand
2025-06-03 20:32   ` Pedro Falcato
2025-06-04 15:41     ` Jann Horn
2025-06-04 16:16       ` Pedro Falcato
2025-06-05  7:33   ` Vlastimil Babka
2025-06-05 12:30     ` Pedro Falcato
2025-06-06 12:55     ` Jann Horn
2025-06-06 15:34       ` Vlastimil Babka
2025-06-06 12:49   ` Jann Horn
2025-06-06 15:49     ` Vlastimil Babka
2025-06-03 18:21 ` [PATCH 2/2] mm/memory: Document how we make a " Jann Horn
2025-06-04 17:03   ` Peter Xu
2025-06-04 18:11     ` Jann Horn
2025-06-04 20:10       ` Peter Xu
2025-06-04 20:28         ` David Hildenbrand
2025-06-06 14:11         ` Jann Horn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).