Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
@ 2026-06-30 17:01 Gregg Leventhal
  2026-06-30 17:18 ` Gregg Leventhal
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Gregg Leventhal @ 2026-06-30 17:01 UTC (permalink / raw)
  To: To: Alexander Viro, Christian Brauner, Cc: Jan Kara,
	Matthew Wilcox, Andrew Morton, Song Liu, linux-fsdevel, linux-mm,
	linux-kernel, Eric Hagberg

[-- Attachment #1: Type: text/plain, Size: 7126 bytes --]

Hello,

We (Gregg Leventhal <gleventhal@janestreet.com> and Eric Hagberg

<ehagberg@janestreet.com>) have a reproducible data-loss issue involving file

THPs and write-open, impacting filesystems that do not support
writable large folios.


Attached are:


  - thp_write_open_cancel_dirty_repro.c

  - thp-open-writeback-before-purge.patch



Summary

=======


On an affected 6.12 kernel with CONFIG_READ_ONLY_THP_FOR_FS=y, a file can

contain read-only file THPs installed by khugepaged / MADV_COLLAPSE. When that

same file is later opened for write, do_dentry_open() notices

filemap_nr_thps() and drops the page cache:


        /*

         * XXX: Huge page cache doesn't support writing yet. Drop all page

         * cache for this file before processing writes.

         */

        if (f->f_mode & FMODE_WRITE) {

                if (filemap_nr_thps(inode->i_mapping)) {

                        struct address_space *mapping = inode->i_mapping;


                        filemap_invalidate_lock(inode->i_mapping);

                        unmap_mapping_range(mapping, 0, 0, 0);

                        truncate_inode_pages(mapping, 0);

                        filemap_invalidate_unlock(inode->i_mapping);

                }

        }


This is unsafe if the mapping also contains dirty folios.

truncate_inode_pages() is not just a clean cache-dropping primitive. It can

call truncate_cleanup_folio(), which calls folio_cancel_dirty().


In the attached reproducer, dirty appended data is discarded and later read(2)s

return zeros.


We observed this on btrfs and ext4, though most of the testing involved btrfs.


The same issue should apply to any filesystem where file THPs can be created

by READ_ONLY_THP_FOR_FS but writable large folios are not supported. The

do_dentry_open() block above is also unchanged in current mainline, so this

does not appear to be strictly 6.12-specific.



Instrumentation

===============


Tracing the failure shows the dirty folios being invalidated from the

write-open path. INVALIDATE_DIRTY and CANCEL_DIRTY below are labels from our

own probes:


        do_dentry_open / vfs_open

          truncate_inode_pages_range

            truncate_cleanup_folio

              btrfs_invalidate_folio

              folio_cancel_dirty


A representative stack from the failing path:


        INVALIDATE_DIRTY ...

          btrfs_invalidate_folio

          truncate_cleanup_folio

          truncate_inode_pages_range

          vfs_open


        CANCEL_DIRTY ...

          truncate_cleanup_folio

          truncate_inode_pages_range

          vfs_open


This confirms that the appended dirty page-cache contents are being discarded

by the open-time THP cache purge rather than written back.



Why this happens

================


The do_dentry_open() code is trying to handle the fact that some filesystems

do not support writing to file THPs. The problematic assumption is that

dropping the page cache is a safe cache-management operation.


It is not safe when dirty folios are present, because truncate_inode_pages()

cancels their dirty state without writeback.


Note that the read-only file THPs themselves are clean. The data that is lost

is unrelated dirty folios elsewhere in the same mapping, here the appended

tail, which get caught in the blanket truncate_inode_pages(mapping, 0) of the

entire mapping.



Suggested fix direction

=======================


Before dropping THP-bearing page cache on write-open, write back and wait for

any dirty folios. After writeback completes, the folios are clean, so the

subsequent truncate_inode_pages() has no dirty state to cancel and the data is

safe on disk. A later read() simply repopulates the cache from disk. If

writeback fails, fail the open rather than silently discarding the data.


The attached patch does this by adding filemap_write_and_wait(mapping) before

the unmap_mapping_range() / truncate_inode_pages() sequence.


Two caveats we are aware of with this approach:


  - filemap_write_and_wait() flushes the entire mapping, so any write-open of

    a file with filemap_nr_thps() > 0 now forces synchronous writeback. This

    path already did a full unmap + truncate, so the extra cost is probably

    acceptable, but it is a behavior change.


  - The writeback happens before unmap_mapping_range(). That is sufficient for

    the reproducer, where the dirty data comes from buffered write(2), so the

    folios are already marked dirty. We would appreciate guidance on whether

    unmap should precede the writeback in order to also cover data dirtied

    only via a writable shared mapping.


An alternative would be to replace truncate_inode_pages() with a

clean-page-only invalidation primitive, but then dirty file THPs / dirty pages

may remain in the mapping and need careful handling.



Mitigation

==========


As a temporary mitigation, setting khugepaged's scan interval very high

appears to prevent the issue by effectively stopping background file THP

collapse:


        echo 4294967295 >
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs


This is not a complete fix. It reduces or disables khugepaged background

collapse, including file THP collapse, and may reduce THP-related performance

benefits for workloads that rely on khugepaged promotion. Fault-time anonymous

THP allocation is not disabled by this knob.


Disabling CONFIG_READ_ONLY_THP_FOR_FS also seems to mitigate, but both are

suboptimal, performance-impacting trade-offs.



Reproducer

==========


The attached reproducer does the following:


  1. Creates a regular file with non-zero data.

  2. Maps part of the file read-only and uses MADV_COLLAPSE to force a file

     THP.

  3. Opens the file for writing and appends non-zero data, leaving it dirty in

     page cache.

  4. Closes the write fd.

  5. Re-collapses a read-only file range so filemap_nr_thps(mapping) is

     non-zero.

  6. Opens the file for write again, triggering the do_dentry_open() THP purge.

  7. Reads back the appended data.


Whether any single iteration reproduces is a race against background

writeback, so let the full iteration count run. A single clean pass does not

by itself prove the kernel is unaffected.


On an affected 6.12 host:


        # ./thp_write_open_cancel_dirty_repro Maybe_corrupted_file

        path=Maybe_corrupted_file base_size=67108864 append_size=16384 iters=200

        REPRODUCED iter=0 bad_bytes=16384 first_bad=0 zero_count=16384
append_off=67108864

        first 64 got:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...

        first 64 want: 5c df 63 e6 6a ed 71 f4 78 fb 7f 03 86 0a 8d 11 ...


The corrupted range is visible as a run of null bytes at the append offset:


        # rg --text '\x00{64}\x00*' $PWD --only-matching \

              --byte-offset --no-line-number \

            | awk -F: '{print $1, $2, length($3)}' | head -n1

        /root/Maybe_corrupted_file 67108864 16384



We are happy to test any preferred fix direction and can provide
additional traces, as-needed.


Thanks,

Gregg Leventhal

Eric Hagberg

[-- Attachment #2: thp_write_open_cancel_dirty_repro.c --]
[-- Type: text/x-c-code, Size: 7962 bytes --]

// SPDX-License-Identifier: GPL-2.0
/*
 * Reproducer for write-open file-THP page-cache purge discarding dirty data.
 *
 * Intended for Linux kernels with CONFIG_READ_ONLY_THP_FOR_FS=y and a filesystem
 * that does not opt into writable large folios (e.g. btrfs/ext4 on affected
 * kernels).  Run on such a filesystem, with THP enabled.
 *
 * Mechanism:
 *   1. Create a file and fault/collapse a read-only file THP.
 *   2. Open for write and append non-zero bytes, leaving them dirty in cache.
 *      The first write-open may purge the earlier THP; that's fine.
 *   3. Close the write fd, then collapse a clean read-only range again so the
 *      mapping has file THPs while the appended folios are still dirty.
 *   4. Open for write again.  Affected kernels call truncate_inode_pages()
 *      from do_dentry_open() to drop the THP-bearing cache.  That can call
 *      folio_cancel_dirty() on the dirty appended folios.
 *   5. Read back appended bytes.  On failure they are zero/incorrect.
 *
 * This intentionally uses MADV_COLLAPSE for determinism instead of waiting for
 * background khugepaged.  If MADV_COLLAPSE returns EINVAL, check that no fd is
 * open for write and that READ_ONLY_THP_FOR_FS / THP are enabled.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif
#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000
#endif

#define KiB 1024ULL
#define MiB (1024ULL * KiB)
#define HPAGE (2ULL * MiB)

static void die(const char *what)
{
  fprintf(stderr, "%s: %s\n", what, strerror(errno));
  exit(2);
}

static void xwrite_full(int fd, const void *buf, size_t len, off_t off)
{
  const char *p = buf;
  while (len > 0) {
    ssize_t n = pwrite(fd, p, len, off);
    if (n < 0) die("pwrite");
    if (n == 0) {
      fprintf(stderr, "short pwrite: 0\n");
      exit(2);
    }
    p += n;
    off += n;
    len -= (size_t)n;
  }
}

static void xread_full(int fd, void *buf, size_t len, off_t off)
{
  char *p = buf;
  while (len > 0) {
    ssize_t n = pread(fd, p, len, off);
    if (n < 0) die("pread");
    if (n == 0) {
      fprintf(stderr, "short pread at off=%jd\n", (intmax_t)off);
      exit(2);
    }
    p += n;
    off += n;
    len -= (size_t)n;
  }
}

static void fill_pattern(uint8_t *buf, size_t len, uint8_t seed)
{
  for (size_t i = 0; i < len; i++) {
    /* Deliberately avoid zero bytes, so any zero is suspicious. */
    buf[i] = (uint8_t)(1 + ((i * 131u + seed) % 255u));
  }
}

static void *map_file_aligned_ro(int fd, off_t file_off, size_t len)
{
  /* Reserve enough VA space to choose a PMD-aligned address. */
  size_t reserve_len = len + HPAGE;
  void *reserve = mmap(NULL, reserve_len, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (reserve == MAP_FAILED) die("mmap reserve");

  uintptr_t base = (uintptr_t)reserve;
  uintptr_t aligned = (base + HPAGE - 1) & ~(uintptr_t)(HPAGE - 1);
  /* Release the reservation, then map the file at the aligned address. */
  if (munmap(reserve, reserve_len) != 0) die("munmap reserve");

  void *p = mmap((void *)aligned, len, PROT_READ,
                 MAP_PRIVATE | MAP_FIXED_NOREPLACE, fd, file_off);
  if (p == MAP_FAILED) die("mmap file fixed");
  if ((uintptr_t)p != aligned) {
    fprintf(stderr, "mmap did not return requested address\n");
    exit(2);
  }
  return p;
}

static void collapse_ro_range(const char *path, off_t file_off)
{
  int fd = open(path, O_RDONLY | O_CLOEXEC);
  if (fd < 0) die("open ro for collapse");

  void *p = map_file_aligned_ro(fd, file_off, HPAGE);

  /* Fault every page first; MADV_COLLAPSE generally expects present pages. */
  volatile uint8_t acc = 0;
  for (size_t off = 0; off < HPAGE; off += 4096) acc ^= ((volatile uint8_t *)p)[off];
  (void)acc;

  if (madvise(p, HPAGE, MADV_COLLAPSE) != 0) {
    int e = errno;
    fprintf(stderr,
            "MADV_COLLAPSE failed at file_off=%jd: %s\n"
            "Hints: run on btrfs/ext4 with THP enabled, ensure no write fd is open, "
            "and ensure CONFIG_READ_ONLY_THP_FOR_FS=y.\n",
            (intmax_t)file_off, strerror(e));
    errno = e;
    die("madvise MADV_COLLAPSE");
  }

  if (munmap(p, HPAGE) != 0) die("munmap collapsed");
  if (close(fd) != 0) die("close collapse fd");
}

static int count_bad(const uint8_t *got, const uint8_t *want, size_t len,
                     size_t *first_bad, size_t *zero_count)
{
  int bad = 0;
  *first_bad = (size_t)-1;
  *zero_count = 0;
  for (size_t i = 0; i < len; i++) {
    if (got[i] == 0) (*zero_count)++;
    if (got[i] != want[i]) {
      if (*first_bad == (size_t)-1) *first_bad = i;
      bad++;
    }
  }
  return bad;
}

int main(int argc, char **argv)
{
  const char *path = argc > 1 ? argv[1] : "./thp-open-repro.dat";
  const size_t base_size = argc > 2 ? strtoull(argv[2], NULL, 0) : 64 * MiB;
  const size_t append_size = argc > 3 ? strtoull(argv[3], NULL, 0) : 16 * KiB;
  const int iters = argc > 4 ? atoi(argv[4]) : 200;

  if (base_size < 4 * HPAGE) {
    fprintf(stderr, "base_size must be at least 8MiB\n");
    return 2;
  }

  printf("path=%s base_size=%zu append_size=%zu iters=%d\n",
         path, base_size, append_size, iters);

  uint8_t *base = malloc(HPAGE);
  uint8_t *want = malloc(append_size);
  uint8_t *got = malloc(append_size);
  if (!base || !want || !got) die("malloc");
  fill_pattern(base, HPAGE, 7);
  fill_pattern(want, append_size, 91);

  int fd = open(path, O_CREAT | O_TRUNC | O_RDWR | O_CLOEXEC, 0666);
  if (fd < 0) die("create");
  for (off_t off = 0; off < (off_t)base_size; off += (off_t)HPAGE) {
    xwrite_full(fd, base, HPAGE, off);
  }
  if (fsync(fd) != 0) die("fsync initial");
  if (close(fd) != 0) die("close initial");

  const off_t append_off = (off_t)base_size;
  bool reproduced = false;

  for (int iter = 0; iter < iters; iter++) {
    /* Return to the original size and durable base contents. */
    fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd < 0) die("open reset");
    if (ftruncate(fd, (off_t)base_size) != 0) die("ftruncate reset");
    if (fsync(fd) != 0) die("fsync reset");
    if (close(fd) != 0) die("close reset");

    /* Create read-only file THPs, then open-for-write and dirty appended data. */
    collapse_ro_range(path, 0);

    fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd < 0) die("open append");
    xwrite_full(fd, want, append_size, append_off);
    /* Do not fsync.  The dirty page-cache state is what we are testing. */
    if (close(fd) != 0) die("close append");

    /* Reinstall a file THP while no write fd is open.  This is the state that
     * makes the next write-open purge the mapping. */
    collapse_ro_range(path, 0);

    fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd < 0) die("open trigger");
    if (close(fd) != 0) die("close trigger");

    int rfd = open(path, O_RDONLY | O_CLOEXEC);
    if (rfd < 0) die("open readback");
    xread_full(rfd, got, append_size, append_off);
    if (close(rfd) != 0) die("close readback");

    size_t first_bad, zero_count;
    int bad = count_bad(got, want, append_size, &first_bad, &zero_count);
    if (bad) {
      printf("REPRODUCED iter=%d bad_bytes=%d first_bad=%zu zero_count=%zu append_off=%jd\n",
             iter, bad, first_bad, zero_count, (intmax_t)append_off);
      printf("first 64 got:");
      for (int i = 0; i < 64 && i < (int)append_size; i++) printf(" %02x", got[i]);
      printf("\nfirst 64 want:");
      for (int i = 0; i < 64 && i < (int)append_size; i++) printf(" %02x", want[i]);
      printf("\n");
      reproduced = true;
      break;
    }

    if ((iter % 10) == 0) printf("iter=%d ok\n", iter);
  }

  if (!reproduced)
    printf("no corruption observed after %d iterations\n", iters);

  return reproduced ? 1 : 0;
}

[-- Attachment #3: thp-open-writeback-before-purge.patch --]
[-- Type: application/x-patch, Size: 1208 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-06-30 17:01 Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache Gregg Leventhal
@ 2026-06-30 17:18 ` Gregg Leventhal
  2026-06-30 18:31 ` Pedro Falcato
  2026-06-30 18:36 ` Matthew Wilcox
  2 siblings, 0 replies; 16+ messages in thread
From: Gregg Leventhal @ 2026-06-30 17:18 UTC (permalink / raw)
  To: To: Alexander Viro, Christian Brauner, Cc: Jan Kara,
	Matthew Wilcox, Andrew Morton, Song Liu, linux-fsdevel, linux-mm,
	linux-kernel, Eric Hagberg

Also, just to be explicit and prevent some potentially-wasted time:
This does not repro on XFS!   You need to point the reproducer at
Btrfs or Ext4 (or some other susceptible file system, but those are
the only ones Eric and I have confirmed, personally).


On Tue, Jun 30, 2026 at 1:01 PM Gregg Leventhal
<gleventhal@janestreet.com> wrote:
>
> Hello,
>
> We (Gregg Leventhal <gleventhal@janestreet.com> and Eric Hagberg
>
> <ehagberg@janestreet.com>) have a reproducible data-loss issue involving file
>
> THPs and write-open, impacting filesystems that do not support
> writable large folios.
>
>
> Attached are:
>
>
>   - thp_write_open_cancel_dirty_repro.c
>
>   - thp-open-writeback-before-purge.patch
>
>
>
> Summary
>
> =======
>
>
> On an affected 6.12 kernel with CONFIG_READ_ONLY_THP_FOR_FS=y, a file can
>
> contain read-only file THPs installed by khugepaged / MADV_COLLAPSE. When that
>
> same file is later opened for write, do_dentry_open() notices
>
> filemap_nr_thps() and drops the page cache:
>
>
>         /*
>
>          * XXX: Huge page cache doesn't support writing yet. Drop all page
>
>          * cache for this file before processing writes.
>
>          */
>
>         if (f->f_mode & FMODE_WRITE) {
>
>                 if (filemap_nr_thps(inode->i_mapping)) {
>
>                         struct address_space *mapping = inode->i_mapping;
>
>
>                         filemap_invalidate_lock(inode->i_mapping);
>
>                         unmap_mapping_range(mapping, 0, 0, 0);
>
>                         truncate_inode_pages(mapping, 0);
>
>                         filemap_invalidate_unlock(inode->i_mapping);
>
>                 }
>
>         }
>
>
> This is unsafe if the mapping also contains dirty folios.
>
> truncate_inode_pages() is not just a clean cache-dropping primitive. It can
>
> call truncate_cleanup_folio(), which calls folio_cancel_dirty().
>
>
> In the attached reproducer, dirty appended data is discarded and later read(2)s
>
> return zeros.
>
>
> We observed this on btrfs and ext4, though most of the testing involved btrfs.
>
>
> The same issue should apply to any filesystem where file THPs can be created
>
> by READ_ONLY_THP_FOR_FS but writable large folios are not supported. The
>
> do_dentry_open() block above is also unchanged in current mainline, so this
>
> does not appear to be strictly 6.12-specific.
>
>
>
> Instrumentation
>
> ===============
>
>
> Tracing the failure shows the dirty folios being invalidated from the
>
> write-open path. INVALIDATE_DIRTY and CANCEL_DIRTY below are labels from our
>
> own probes:
>
>
>         do_dentry_open / vfs_open
>
>           truncate_inode_pages_range
>
>             truncate_cleanup_folio
>
>               btrfs_invalidate_folio
>
>               folio_cancel_dirty
>
>
> A representative stack from the failing path:
>
>
>         INVALIDATE_DIRTY ...
>
>           btrfs_invalidate_folio
>
>           truncate_cleanup_folio
>
>           truncate_inode_pages_range
>
>           vfs_open
>
>
>         CANCEL_DIRTY ...
>
>           truncate_cleanup_folio
>
>           truncate_inode_pages_range
>
>           vfs_open
>
>
> This confirms that the appended dirty page-cache contents are being discarded
>
> by the open-time THP cache purge rather than written back.
>
>
>
> Why this happens
>
> ================
>
>
> The do_dentry_open() code is trying to handle the fact that some filesystems
>
> do not support writing to file THPs. The problematic assumption is that
>
> dropping the page cache is a safe cache-management operation.
>
>
> It is not safe when dirty folios are present, because truncate_inode_pages()
>
> cancels their dirty state without writeback.
>
>
> Note that the read-only file THPs themselves are clean. The data that is lost
>
> is unrelated dirty folios elsewhere in the same mapping, here the appended
>
> tail, which get caught in the blanket truncate_inode_pages(mapping, 0) of the
>
> entire mapping.
>
>
>
> Suggested fix direction
>
> =======================
>
>
> Before dropping THP-bearing page cache on write-open, write back and wait for
>
> any dirty folios. After writeback completes, the folios are clean, so the
>
> subsequent truncate_inode_pages() has no dirty state to cancel and the data is
>
> safe on disk. A later read() simply repopulates the cache from disk. If
>
> writeback fails, fail the open rather than silently discarding the data.
>
>
> The attached patch does this by adding filemap_write_and_wait(mapping) before
>
> the unmap_mapping_range() / truncate_inode_pages() sequence.
>
>
> Two caveats we are aware of with this approach:
>
>
>   - filemap_write_and_wait() flushes the entire mapping, so any write-open of
>
>     a file with filemap_nr_thps() > 0 now forces synchronous writeback. This
>
>     path already did a full unmap + truncate, so the extra cost is probably
>
>     acceptable, but it is a behavior change.
>
>
>   - The writeback happens before unmap_mapping_range(). That is sufficient for
>
>     the reproducer, where the dirty data comes from buffered write(2), so the
>
>     folios are already marked dirty. We would appreciate guidance on whether
>
>     unmap should precede the writeback in order to also cover data dirtied
>
>     only via a writable shared mapping.
>
>
> An alternative would be to replace truncate_inode_pages() with a
>
> clean-page-only invalidation primitive, but then dirty file THPs / dirty pages
>
> may remain in the mapping and need careful handling.
>
>
>
> Mitigation
>
> ==========
>
>
> As a temporary mitigation, setting khugepaged's scan interval very high
>
> appears to prevent the issue by effectively stopping background file THP
>
> collapse:
>
>
>         echo 4294967295 >
> /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
>
>
> This is not a complete fix. It reduces or disables khugepaged background
>
> collapse, including file THP collapse, and may reduce THP-related performance
>
> benefits for workloads that rely on khugepaged promotion. Fault-time anonymous
>
> THP allocation is not disabled by this knob.
>
>
> Disabling CONFIG_READ_ONLY_THP_FOR_FS also seems to mitigate, but both are
>
> suboptimal, performance-impacting trade-offs.
>
>
>
> Reproducer
>
> ==========
>
>
> The attached reproducer does the following:
>
>
>   1. Creates a regular file with non-zero data.
>
>   2. Maps part of the file read-only and uses MADV_COLLAPSE to force a file
>
>      THP.
>
>   3. Opens the file for writing and appends non-zero data, leaving it dirty in
>
>      page cache.
>
>   4. Closes the write fd.
>
>   5. Re-collapses a read-only file range so filemap_nr_thps(mapping) is
>
>      non-zero.
>
>   6. Opens the file for write again, triggering the do_dentry_open() THP purge.
>
>   7. Reads back the appended data.
>
>
> Whether any single iteration reproduces is a race against background
>
> writeback, so let the full iteration count run. A single clean pass does not
>
> by itself prove the kernel is unaffected.
>
>
> On an affected 6.12 host:
>
>
>         # ./thp_write_open_cancel_dirty_repro Maybe_corrupted_file
>
>         path=Maybe_corrupted_file base_size=67108864 append_size=16384 iters=200
>
>         REPRODUCED iter=0 bad_bytes=16384 first_bad=0 zero_count=16384
> append_off=67108864
>
>         first 64 got:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...
>
>         first 64 want: 5c df 63 e6 6a ed 71 f4 78 fb 7f 03 86 0a 8d 11 ...
>
>
> The corrupted range is visible as a run of null bytes at the append offset:
>
>
>         # rg --text '\x00{64}\x00*' $PWD --only-matching \
>
>               --byte-offset --no-line-number \
>
>             | awk -F: '{print $1, $2, length($3)}' | head -n1
>
>         /root/Maybe_corrupted_file 67108864 16384
>
>
>
> We are happy to test any preferred fix direction and can provide
> additional traces, as-needed.
>
>
> Thanks,
>
> Gregg Leventhal
>
> Eric Hagberg


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-06-30 17:01 Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache Gregg Leventhal
  2026-06-30 17:18 ` Gregg Leventhal
@ 2026-06-30 18:31 ` Pedro Falcato
  2026-06-30 18:49   ` Pedro Falcato
  2026-06-30 18:36 ` Matthew Wilcox
  2 siblings, 1 reply; 16+ messages in thread
From: Pedro Falcato @ 2026-06-30 18:31 UTC (permalink / raw)
  To: Gregg Leventhal
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox,
	Andrew Morton, Song Liu, linux-fsdevel, linux-mm, linux-kernel,
	Eric Hagberg, David Hildenbrand, Lorenzo Stoakes, Zi Yan

[-- Attachment #1: Type: text/plain, Size: 2840 bytes --]

+CC some relevant THP folks
Quick note, your email client's spacing seems to be all over the place, making
this extremely hard to read.

On Tue, Jun 30, 2026 at 01:01:53PM -0400, Gregg Leventhal wrote:
> Hello,
> 
> We (Gregg Leventhal <gleventhal@janestreet.com> and Eric Hagberg
> 
> <ehagberg@janestreet.com>) have a reproducible data-loss issue involving file
> 
> THPs and write-open, impacting filesystems that do not support
> writable large folios.
> 
> 
> Attached are:
> 
> 
>   - thp_write_open_cancel_dirty_repro.c
> 
>   - thp-open-writeback-before-purge.patch
> 
> 
> 
> Summary
> 
> =======
> 
> 
> On an affected 6.12 kernel with CONFIG_READ_ONLY_THP_FOR_FS=y, a file can
> 
> contain read-only file THPs installed by khugepaged / MADV_COLLAPSE. When that
> 
> same file is later opened for write, do_dentry_open() notices
> 
> filemap_nr_thps() and drops the page cache:
> 
> 
>         /*
> 
>          * XXX: Huge page cache doesn't support writing yet. Drop all page
> 
>          * cache for this file before processing writes.
> 
>          */
> 
>         if (f->f_mode & FMODE_WRITE) {
> 
>                 if (filemap_nr_thps(inode->i_mapping)) {
> 
>                         struct address_space *mapping = inode->i_mapping;
> 
> 
>                         filemap_invalidate_lock(inode->i_mapping);
> 
>                         unmap_mapping_range(mapping, 0, 0, 0);
> 
>                         truncate_inode_pages(mapping, 0);
> 
>                         filemap_invalidate_unlock(inode->i_mapping);
> 
>                 }
> 
>         }

Ugh, this is embarassing. So, good news: this code doesn't exist anymore
in mainline! Bad news: it exists on every other upstream-stable-maintained
release :|

FWIW I don't think your fix works, there's still a race there (what if
you write and wait, then someone dirties a folio, then you truncate the
pagecache? you lost data again.). I'm attaching a very quick WIP patch
that I wrote against 6.12 LTS (again, this does not exist in mainline).
I _think_ we want to go roughly in that direction, either here or in
collapse file paths. There are still problems which are invasive and
I haven't dealt with (GUP and other "temporary" folio releases being
the main one). Some of these problems may simply make it so opening
these files writable may fail (there is certainly, AFAIK, no way of
waiting for GUP and other temporary folio holders).

We would probably be served with a custom loop that forcibly yanks
only THPs out the pagecache, though. But that requires a bit more
code for a stable-only issue...

Anyway, the patch is obviously ungood and uncromulent and is only
here for a rough conversation starter. I don't think it works and
it will probably never work. mapping invalidation is simply too
best-effort for something that Just Needs(tm) to work.

-- 
Pedro

[-- Attachment #2: 0001-foo.patch --]
[-- Type: text/x-patch, Size: 6161 bytes --]

From 22c7255577e1efbca5186fa3a3afadf714743647 Mon Sep 17 00:00:00 2001
From: Pedro Falcato <pfalcato@suse.de>
Date: Tue, 30 Jun 2026 19:13:20 +0100
Subject: [PATCH] foo

Not-Signed-off-by: Pedro Falcato <pfalcato@suse.de>
---
 fs/open.c               | 15 ++--------
 include/linux/pagemap.h |  1 +
 mm/fadvise.c            |  2 +-
 mm/internal.h           |  3 +-
 mm/truncate.c           | 63 +++++++++++++++++++++++++++++++++++++++--
 5 files changed, 68 insertions(+), 16 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index be7b55260a75..8feaf87c06b8 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -985,18 +985,9 @@ static int do_dentry_open(struct file *f,
 		 * cache will fail.
 		 */
 		if (filemap_nr_thps(inode->i_mapping)) {
-			struct address_space *mapping = inode->i_mapping;
-
-			filemap_invalidate_lock(inode->i_mapping);
-			/*
-			 * unmap_mapping_range just need to be called once
-			 * here, because the private pages is not need to be
-			 * unmapped mapping (e.g. data segment of dynamic
-			 * shared libraries here).
-			 */
-			unmap_mapping_range(mapping, 0, 0, 0);
-			truncate_inode_pages(mapping, 0);
-			filemap_invalidate_unlock(inode->i_mapping);
+			error = filemap_truncate_thps(inode);
+			if (error)
+				goto cleanup_all;
 		}
 	}
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 68a5f1ff3301..401b03970f68 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -67,6 +67,7 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
 {
 	return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
 }
+int filemap_truncate_thps(struct inode *inode);
 
 /**
  * filemap_set_wb_err - set a writeback error on an address_space
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 588fe76c5a14..c44a7a11eee2 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -156,7 +156,7 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 			lru_add_drain();
 
 			mapping_try_invalidate(mapping, start_index, end_index,
-					&nr_failed);
+					&nr_failed, NULL);
 
 			/*
 			 * The failures may be due to the folio being
diff --git a/mm/internal.h b/mm/internal.h
index 3bfc1dc2d7ea..83e3bbbe18a6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -407,7 +407,8 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start,
 		loff_t end);
 long mapping_evict_folio(struct address_space *mapping, struct folio *folio);
 unsigned long mapping_try_invalidate(struct address_space *mapping,
-		pgoff_t start, pgoff_t end, unsigned long *nr_failed);
+		pgoff_t start, pgoff_t end, unsigned long *nr_failed,
+		pgoff_t *first_fail);
 
 /**
  * folio_evictable - Test whether a folio is evictable.
diff --git a/mm/truncate.c b/mm/truncate.c
index fb5c20b57bd4..3efcadd2be4f 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -490,12 +490,14 @@ EXPORT_SYMBOL(truncate_inode_pages_final);
  * @start: the offset 'from' which to invalidate
  * @end: the offset 'to' which to invalidate (inclusive)
  * @nr_failed: How many folio invalidations failed
+ * @first_fail: What was the first offset to fail invalidation?
  *
  * This function is similar to invalidate_mapping_pages(), except that it
  * returns the number of folios which could not be evicted in @nr_failed.
  */
 unsigned long mapping_try_invalidate(struct address_space *mapping,
-		pgoff_t start, pgoff_t end, unsigned long *nr_failed)
+		pgoff_t start, pgoff_t end, unsigned long *nr_failed,
+		pgoff_t *first_fail)
 {
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct folio_batch fbatch;
@@ -504,6 +506,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
 	unsigned long count = 0;
 	int i;
 	bool xa_has_values = false;
+	bool has_failed = false;
 
 	folio_batch_init(&fbatch);
 	while (find_lock_entries(mapping, &index, end, &fbatch, indices)) {
@@ -529,6 +532,9 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
 				/* Likely in the lru cache of a remote CPU */
 				if (nr_failed)
 					(*nr_failed)++;
+				if (!has_failed && first_fail)
+					*first_fail = folio_pgoff(folio);
+				has_failed = true;
 			}
 			count += ret;
 		}
@@ -560,7 +566,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t end)
 {
-	return mapping_try_invalidate(mapping, start, end, NULL);
+	return mapping_try_invalidate(mapping, start, end, NULL, NULL);
 }
 EXPORT_SYMBOL(invalidate_mapping_pages);
 
@@ -864,3 +870,56 @@ void truncate_pagecache_range(struct inode *inode, loff_t lstart, loff_t lend)
 	truncate_inode_pages_range(mapping, lstart, lend);
 }
 EXPORT_SYMBOL(truncate_pagecache_range);
+
+int filemap_truncate_thps(struct inode *inode)
+{
+	struct address_space *mapping = inode->i_mapping;
+	pgoff_t start_index = 0, first_fail;
+	unsigned long nr_failed = 0;
+	int err;
+
+	while (filemap_nr_thps(mapping)) {
+		nr_failed = 0;
+		first_fail = 0;
+		filemap_invalidate_lock(mapping);
+		/*
+		 * unmap_mapping_range just need to be called once
+		 * here, because the private pages is not need to be
+		 * unmapped mapping (e.g. data segment of dynamic
+		 * shared libraries here).
+		 */
+		unmap_mapping_range(mapping, 0, 0, 0);
+		lru_add_drain();
+		mapping_try_invalidate(mapping, start_index, LLONG_MAX,
+				       &nr_failed, &first_fail);
+		filemap_invalidate_unlock(mapping);
+		if (!nr_failed)
+			break;
+		/*
+		 * The failures may be due to the folio being
+		 * in the LRU cache of a remote CPU. Drain all
+		 * caches, do writeback and try again.
+		 */
+		lru_add_drain_all();
+		/*
+		 * We now know that up to first_fail, there are no THPs. Start
+		 * from there to ensure forward progress.
+		 */
+		start_index = first_fail;
+
+		/*
+		 * Attempt to writeback. If it fails, it's ok to fail the open. There's not
+		 * much we can do in that case.
+		 */
+		err = filemap_write_and_wait_range(mapping, start_index, LLONG_MAX);
+		if (err)
+			return err;
+	}
+
+	/*
+	 * It should not be possible to hit this case after the above loop
+	 * completes.
+	 */
+	WARN_ON_ONCE(filemap_nr_thps(mapping));
+	return 0;
+}
-- 
2.55.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-06-30 17:01 Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache Gregg Leventhal
  2026-06-30 17:18 ` Gregg Leventhal
  2026-06-30 18:31 ` Pedro Falcato
@ 2026-06-30 18:36 ` Matthew Wilcox
  2026-06-30 19:05   ` Zi Yan
  2 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2026-06-30 18:36 UTC (permalink / raw)
  To: Gregg Leventhal
  Cc: To: Alexander Viro, Christian Brauner, Cc: Jan Kara,
	Andrew Morton, Song Liu, linux-fsdevel, linux-mm, linux-kernel,
	Eric Hagberg

On Tue, Jun 30, 2026 at 01:01:53PM -0400, Gregg Leventhal wrote:
> On an affected 6.12 kernel with CONFIG_READ_ONLY_THP_FOR_FS=y, a file can
> contain read-only file THPs installed by khugepaged / MADV_COLLAPSE. When that
> same file is later opened for write, do_dentry_open() notices
> filemap_nr_thps() and drops the page cache:
[...]
> 
> This is unsafe if the mapping also contains dirty folios.

But there shouldn't be any.  It should not be possible to have
dirty folios and THPs in the same file unless the filesystem
supports large folios natively.

If the file is open for writing, the attempt to create THPs should fail.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-06-30 18:31 ` Pedro Falcato
@ 2026-06-30 18:49   ` Pedro Falcato
  2026-06-30 19:55     ` Pedro Falcato
  0 siblings, 1 reply; 16+ messages in thread
From: Pedro Falcato @ 2026-06-30 18:49 UTC (permalink / raw)
  To: Gregg Leventhal
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox,
	Andrew Morton, Song Liu, linux-fsdevel, linux-mm, linux-kernel,
	Eric Hagberg, David Hildenbrand, Lorenzo Stoakes, Zi Yan

On Tue, Jun 30, 2026 at 07:31:07PM +0100, Pedro Falcato wrote:
> +CC some relevant THP folks
> Quick note, your email client's spacing seems to be all over the place, making
> this extremely hard to read.
> 
> On Tue, Jun 30, 2026 at 01:01:53PM -0400, Gregg Leventhal wrote:
> > Hello,
> > 
> > We (Gregg Leventhal <gleventhal@janestreet.com> and Eric Hagberg
> > 
> > <ehagberg@janestreet.com>) have a reproducible data-loss issue involving file
> > 
> > THPs and write-open, impacting filesystems that do not support
> > writable large folios.
> > 
> > 
> > Attached are:
> > 
> > 
> >   - thp_write_open_cancel_dirty_repro.c
> > 
> >   - thp-open-writeback-before-purge.patch
> > 
> > 
> > 
> > Summary
> > 
> > =======
> > 
> > 
> > On an affected 6.12 kernel with CONFIG_READ_ONLY_THP_FOR_FS=y, a file can
> > 
> > contain read-only file THPs installed by khugepaged / MADV_COLLAPSE. When that
> > 
> > same file is later opened for write, do_dentry_open() notices
> > 
> > filemap_nr_thps() and drops the page cache:
> > 
> > 
> >         /*
> > 
> >          * XXX: Huge page cache doesn't support writing yet. Drop all page
> > 
> >          * cache for this file before processing writes.
> > 
> >          */
> > 
> >         if (f->f_mode & FMODE_WRITE) {
> > 
> >                 if (filemap_nr_thps(inode->i_mapping)) {
> > 
> >                         struct address_space *mapping = inode->i_mapping;
> > 
> > 
> >                         filemap_invalidate_lock(inode->i_mapping);
> > 
> >                         unmap_mapping_range(mapping, 0, 0, 0);
> > 
> >                         truncate_inode_pages(mapping, 0);
> > 
> >                         filemap_invalidate_unlock(inode->i_mapping);
> > 
> >                 }
> > 
> >         }
> 
> Ugh, this is embarassing. So, good news: this code doesn't exist anymore
> in mainline! Bad news: it exists on every other upstream-stable-maintained
> release :|
> 
> FWIW I don't think your fix works, there's still a race there (what if
> you write and wait, then someone dirties a folio, then you truncate the
> pagecache? you lost data again.). I'm attaching a very quick WIP patch
> that I wrote against 6.12 LTS (again, this does not exist in mainline).
> I _think_ we want to go roughly in that direction, either here or in
> collapse file paths. There are still problems which are invasive and
> I haven't dealt with (GUP and other "temporary" folio releases being
> the main one). Some of these problems may simply make it so opening
> these files writable may fail (there is certainly, AFAIK, no way of
> waiting for GUP and other temporary folio holders).
> 
> We would probably be served with a custom loop that forcibly yanks
> only THPs out the pagecache, though. But that requires a bit more
> code for a stable-only issue...
> 
> Anyway, the patch is obviously ungood and uncromulent and is only
> here for a rough conversation starter. I don't think it works and
> it will probably never work. mapping invalidation is simply too
> best-effort for something that Just Needs(tm) to work.

Other idea: perhaps doing filemap_write_and_wait() after the nr_thps
increment in collapse_file() will Just Work and result in a _much_
simpler fix. And it avoids any weird forward-progress issues as no one can
write to folios at that point.

-- 
Pedro


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-06-30 18:36 ` Matthew Wilcox
@ 2026-06-30 19:05   ` Zi Yan
  2026-06-30 19:07     ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Zi Yan @ 2026-06-30 19:05 UTC (permalink / raw)
  To: Matthew Wilcox, Gregg Leventhal
  Cc: To: Alexander Viro, Christian Brauner, Cc: Jan Kara,
	Andrew Morton, Song Liu, linux-fsdevel, linux-mm, linux-kernel,
	Eric Hagberg

On Tue Jun 30, 2026 at 2:36 PM EDT, Matthew Wilcox wrote:
> On Tue, Jun 30, 2026 at 01:01:53PM -0400, Gregg Leventhal wrote:
>> On an affected 6.12 kernel with CONFIG_READ_ONLY_THP_FOR_FS=y, a file can
>> contain read-only file THPs installed by khugepaged / MADV_COLLAPSE. When that
>> same file is later opened for write, do_dentry_open() notices
>> filemap_nr_thps() and drops the page cache:
> [...]
>> 
>> This is unsafe if the mapping also contains dirty folios.
>
> But there shouldn't be any.  It should not be possible to have
> dirty folios and THPs in the same file unless the filesystem
> supports large folios natively.
>
> If the file is open for writing, the attempt to create THPs should fail.

Maybe he means order-0 dirty folios?

-- 
Best Regards,
Yan, Zi



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-06-30 19:05   ` Zi Yan
@ 2026-06-30 19:07     ` Matthew Wilcox
  0 siblings, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2026-06-30 19:07 UTC (permalink / raw)
  To: Zi Yan
  Cc: Gregg Leventhal, To: Alexander Viro, Christian Brauner,
	Cc: Jan Kara, Andrew Morton, Song Liu, linux-fsdevel, linux-mm,
	linux-kernel, Eric Hagberg

On Tue, Jun 30, 2026 at 03:05:16PM -0400, Zi Yan wrote:
> On Tue Jun 30, 2026 at 2:36 PM EDT, Matthew Wilcox wrote:
> > On Tue, Jun 30, 2026 at 01:01:53PM -0400, Gregg Leventhal wrote:
> >> On an affected 6.12 kernel with CONFIG_READ_ONLY_THP_FOR_FS=y, a file can
> >> contain read-only file THPs installed by khugepaged / MADV_COLLAPSE. When that
> >> same file is later opened for write, do_dentry_open() notices
> >> filemap_nr_thps() and drops the page cache:
> > [...]
> >> 
> >> This is unsafe if the mapping also contains dirty folios.
> >
> > But there shouldn't be any.  It should not be possible to have
> > dirty folios and THPs in the same file unless the filesystem
> > supports large folios natively.
> >
> > If the file is open for writing, the attempt to create THPs should fail.
> 
> Maybe he means order-0 dirty folios?

Even then.  They're all supposed to be gone.  I don't want to guarantee
that the filesystem will properly skip over clean folios during
writeback.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-06-30 18:49   ` Pedro Falcato
@ 2026-06-30 19:55     ` Pedro Falcato
  2026-06-30 22:34       ` Matthew Wilcox
                         ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Pedro Falcato @ 2026-06-30 19:55 UTC (permalink / raw)
  To: Gregg Leventhal
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox,
	Andrew Morton, Song Liu, linux-fsdevel, linux-mm, linux-kernel,
	Eric Hagberg, David Hildenbrand, Lorenzo Stoakes, Zi Yan

[-- Attachment #1: Type: text/plain, Size: 586 bytes --]

On Tue, Jun 30, 2026 at 07:49:12PM +0100, Pedro Falcato wrote:
snip
> 
> Other idea: perhaps doing filemap_write_and_wait() after the nr_thps
> increment in collapse_file() will Just Work and result in a _much_
> simpler fix. And it avoids any weird forward-progress issues as no one can
> write to folios at that point.
> 

Gregg, if you could test this patch, it would be much appreciated. This patch
(hopefully) makes it so no dirty folio will ever coexist with a ro-THP, thus
hopefully sidestepping the entire issue in a simple way. Only compile-tested
and not reviewed.

-- 
Pedro

[-- Attachment #2: 0001-mm-khugepaged-write-all-dirty-folios-when-collapsing.patch --]
[-- Type: text/x-patch, Size: 2171 bytes --]

From 43d90a937f0b24656a8d0405035c3efcfdf0961e Mon Sep 17 00:00:00 2001
From: Pedro Falcato <pfalcato@suse.de>
Date: Tue, 30 Jun 2026 20:48:41 +0100
Subject: [PATCH] mm/khugepaged: write all dirty folios when collapsing

Signed-off-by: Pedro Falcato <pfalcato@suse.de>
---
 mm/khugepaged.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a97b20617869..3f0f90ab16ba 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -60,6 +60,7 @@ enum scan_result {
 	SCAN_STORE_FAILED,
 	SCAN_COPY_MC,
 	SCAN_PAGE_FILLED,
+	SCAN_WRITEBACK_FAIL,
 };
 
 #define CREATE_TRACE_POINTS
@@ -1812,7 +1813,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	pgoff_t index = 0, end = start + HPAGE_PMD_NR;
 	LIST_HEAD(pagelist);
 	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
-	int nr_none = 0, result = SCAN_SUCCEED;
+	int nr_none = 0, result = SCAN_SUCCEED, err;
 	bool is_shmem = shmem_file(file);
 
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
@@ -2043,6 +2044,17 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	 */
 	try_to_unmap_flush();
 
+	/*
+	 * If collapse looks to be successful, flush any dirty pages
+	 * out the page cache. With the nr_thps incremented, there won't be
+	 * any new writers (nor new dirties).
+	 */
+	if (result == SCAN_SUCCEED && !is_shmem) {
+		err = filemap_write_and_wait(mapping);
+		if (err)
+			result = SCAN_WRITEBACK_FAIL;
+	}
+
 	if (result == SCAN_SUCCEED && nr_none &&
 	    !shmem_charge(mapping->host, nr_none))
 		result = SCAN_FAIL;
@@ -2210,9 +2222,10 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	/*
 	 * Undo the updates of filemap_nr_thps_inc for non-SHMEM
 	 * file only. This undo is not needed unless failure is
-	 * due to SCAN_COPY_MC.
+	 * due to SCAN_COPY_MC or SCAN_WRITEBACK_FAIL.
 	 */
-	if (!is_shmem && result == SCAN_COPY_MC) {
+	if (!is_shmem && (result == SCAN_COPY_MC ||
+	    result == SCAN_WRITEBACK_FAIL)) {
 		filemap_nr_thps_dec(mapping);
 		/*
 		 * Paired with the fence in do_dentry_open() -> get_write_access()
-- 
2.55.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-06-30 19:55     ` Pedro Falcato
@ 2026-06-30 22:34       ` Matthew Wilcox
  2026-06-30 22:48       ` Zi Yan
  2026-07-01 11:54       ` Matthew Wilcox
  2 siblings, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2026-06-30 22:34 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Gregg Leventhal, Alexander Viro, Christian Brauner, Jan Kara,
	Andrew Morton, Song Liu, linux-fsdevel, linux-mm, linux-kernel,
	Eric Hagberg, David Hildenbrand, Lorenzo Stoakes, Zi Yan

On Tue, Jun 30, 2026 at 08:55:12PM +0100, Pedro Falcato wrote:
> Gregg, if you could test this patch, it would be much appreciated. This patch
> (hopefully) makes it so no dirty folio will ever coexist with a ro-THP, thus
> hopefully sidestepping the entire issue in a simple way. Only compile-tested
> and not reviewed.

I'd suggest this is slightly misplaced; it's actually called in some
failure places (eg:

                if (!folio_isolate_lru(folio)) {
                        result = SCAN_DEL_PAGE_LRU;
                        goto out_unlock;
...
out_unlock:
                folio_unlock(folio);
                folio_put(folio);
                goto xa_unlocked;
...
xa_unlocked:
        try_to_unmap_flush();


I'd expect to see your new code inside the existing 'if (!is_shmem) {' block
immediately before the xa_locked label.  I'd also follow the
inode_is_open_for_write() case and decrement nr_thps there rather than
introducing a new SCAN code.

> -- 
> Pedro

> >From 43d90a937f0b24656a8d0405035c3efcfdf0961e Mon Sep 17 00:00:00 2001
> From: Pedro Falcato <pfalcato@suse.de>
> Date: Tue, 30 Jun 2026 20:48:41 +0100
> Subject: [PATCH] mm/khugepaged: write all dirty folios when collapsing
> 
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> ---
>  mm/khugepaged.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a97b20617869..3f0f90ab16ba 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -60,6 +60,7 @@ enum scan_result {
>  	SCAN_STORE_FAILED,
>  	SCAN_COPY_MC,
>  	SCAN_PAGE_FILLED,
> +	SCAN_WRITEBACK_FAIL,
>  };
>  
>  #define CREATE_TRACE_POINTS
> @@ -1812,7 +1813,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	pgoff_t index = 0, end = start + HPAGE_PMD_NR;
>  	LIST_HEAD(pagelist);
>  	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
> -	int nr_none = 0, result = SCAN_SUCCEED;
> +	int nr_none = 0, result = SCAN_SUCCEED, err;
>  	bool is_shmem = shmem_file(file);
>  
>  	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> @@ -2043,6 +2044,17 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	 */
>  	try_to_unmap_flush();
>  
> +	/*
> +	 * If collapse looks to be successful, flush any dirty pages
> +	 * out the page cache. With the nr_thps incremented, there won't be
> +	 * any new writers (nor new dirties).
> +	 */
> +	if (result == SCAN_SUCCEED && !is_shmem) {
> +		err = filemap_write_and_wait(mapping);
> +		if (err)
> +			result = SCAN_WRITEBACK_FAIL;
> +	}
> +
>  	if (result == SCAN_SUCCEED && nr_none &&
>  	    !shmem_charge(mapping->host, nr_none))
>  		result = SCAN_FAIL;
> @@ -2210,9 +2222,10 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	/*
>  	 * Undo the updates of filemap_nr_thps_inc for non-SHMEM
>  	 * file only. This undo is not needed unless failure is
> -	 * due to SCAN_COPY_MC.
> +	 * due to SCAN_COPY_MC or SCAN_WRITEBACK_FAIL.
>  	 */
> -	if (!is_shmem && result == SCAN_COPY_MC) {
> +	if (!is_shmem && (result == SCAN_COPY_MC ||
> +	    result == SCAN_WRITEBACK_FAIL)) {
>  		filemap_nr_thps_dec(mapping);
>  		/*
>  		 * Paired with the fence in do_dentry_open() -> get_write_access()
> -- 
> 2.55.0
> 



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-06-30 19:55     ` Pedro Falcato
  2026-06-30 22:34       ` Matthew Wilcox
@ 2026-06-30 22:48       ` Zi Yan
  2026-07-01 12:05         ` Pedro Falcato
  2026-07-01 11:54       ` Matthew Wilcox
  2 siblings, 1 reply; 16+ messages in thread
From: Zi Yan @ 2026-06-30 22:48 UTC (permalink / raw)
  To: Pedro Falcato, Gregg Leventhal
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox,
	Andrew Morton, Song Liu, linux-fsdevel, linux-mm, linux-kernel,
	Eric Hagberg, David Hildenbrand, Lorenzo Stoakes

On Tue Jun 30, 2026 at 3:55 PM EDT, Pedro Falcato wrote:
> On Tue, Jun 30, 2026 at 07:49:12PM +0100, Pedro Falcato wrote:
> snip
>> 
>> Other idea: perhaps doing filemap_write_and_wait() after the nr_thps
>> increment in collapse_file() will Just Work and result in a _much_
>> simpler fix. And it avoids any weird forward-progress issues as no one can
>> write to folios at that point.
>> 
>
> Gregg, if you could test this patch, it would be much appreciated. This patch
> (hopefully) makes it so no dirty folio will ever coexist with a ro-THP, thus
> hopefully sidestepping the entire issue in a simple way. Only compile-tested
> and not reviewed.

I tested the patch on v6.12 kernel with the reproducer from Gregg and
confirmed that the issue is fixed.

Feel free to add
Tested-by: Zi Yan <ziy@nvidia.com>

-- 
Best Regards,
Yan, Zi



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-06-30 19:55     ` Pedro Falcato
  2026-06-30 22:34       ` Matthew Wilcox
  2026-06-30 22:48       ` Zi Yan
@ 2026-07-01 11:54       ` Matthew Wilcox
  2026-07-01 12:04         ` Pedro Falcato
  2026-07-01 12:48         ` Pedro Falcato
  2 siblings, 2 replies; 16+ messages in thread
From: Matthew Wilcox @ 2026-07-01 11:54 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Gregg Leventhal, Alexander Viro, Christian Brauner, Jan Kara,
	Andrew Morton, Song Liu, linux-fsdevel, linux-mm, linux-kernel,
	Eric Hagberg, David Hildenbrand, Lorenzo Stoakes, Zi Yan

On Tue, Jun 30, 2026 at 08:55:12PM +0100, Pedro Falcato wrote:
> @@ -2043,6 +2044,17 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	 */
>  	try_to_unmap_flush();
>  
> +	/*
> +	 * If collapse looks to be successful, flush any dirty pages
> +	 * out the page cache. With the nr_thps incremented, there won't be
> +	 * any new writers (nor new dirties).
> +	 */
> +	if (result == SCAN_SUCCEED && !is_shmem) {
> +		err = filemap_write_and_wait(mapping);
> +		if (err)
> +			result = SCAN_WRITEBACK_FAIL;
> +	}

Oh, the other thing is that this should be conditioned on
!mapping_large_folio_support().



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-07-01 11:54       ` Matthew Wilcox
@ 2026-07-01 12:04         ` Pedro Falcato
  2026-07-01 12:48         ` Pedro Falcato
  1 sibling, 0 replies; 16+ messages in thread
From: Pedro Falcato @ 2026-07-01 12:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gregg Leventhal, Alexander Viro, Christian Brauner, Jan Kara,
	Andrew Morton, Song Liu, linux-fsdevel, linux-mm, linux-kernel,
	Eric Hagberg, David Hildenbrand, Lorenzo Stoakes, Zi Yan

On Wed, Jul 01, 2026 at 12:54:52PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 30, 2026 at 08:55:12PM +0100, Pedro Falcato wrote:
> > @@ -2043,6 +2044,17 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >  	 */
> >  	try_to_unmap_flush();
> >  
> > +	/*
> > +	 * If collapse looks to be successful, flush any dirty pages
> > +	 * out the page cache. With the nr_thps incremented, there won't be
> > +	 * any new writers (nor new dirties).
> > +	 */
> > +	if (result == SCAN_SUCCEED && !is_shmem) {
> > +		err = filemap_write_and_wait(mapping);
> > +		if (err)
> > +			result = SCAN_WRITEBACK_FAIL;
> > +	}
> 
> Oh, the other thing is that this should be conditioned on
> !mapping_large_folio_support().

ACK to both of your suggestions, I'll send a proper version (once I figure
out the proper way to send a stable fix that does not hit mainline). 

-- 
Pedro


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-06-30 22:48       ` Zi Yan
@ 2026-07-01 12:05         ` Pedro Falcato
  0 siblings, 0 replies; 16+ messages in thread
From: Pedro Falcato @ 2026-07-01 12:05 UTC (permalink / raw)
  To: Zi Yan
  Cc: Gregg Leventhal, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, Andrew Morton, Song Liu, linux-fsdevel, linux-mm,
	linux-kernel, Eric Hagberg, David Hildenbrand, Lorenzo Stoakes

On Tue, Jun 30, 2026 at 06:48:39PM -0400, Zi Yan wrote:
> On Tue Jun 30, 2026 at 3:55 PM EDT, Pedro Falcato wrote:
> > On Tue, Jun 30, 2026 at 07:49:12PM +0100, Pedro Falcato wrote:
> > snip
> >> 
> >> Other idea: perhaps doing filemap_write_and_wait() after the nr_thps
> >> increment in collapse_file() will Just Work and result in a _much_
> >> simpler fix. And it avoids any weird forward-progress issues as no one can
> >> write to folios at that point.
> >> 
> >
> > Gregg, if you could test this patch, it would be much appreciated. This patch
> > (hopefully) makes it so no dirty folio will ever coexist with a ro-THP, thus
> > hopefully sidestepping the entire issue in a simple way. Only compile-tested
> > and not reviewed.
> 
> I tested the patch on v6.12 kernel with the reproducer from Gregg and
> confirmed that the issue is fixed.
> 
> Feel free to add
> Tested-by: Zi Yan <ziy@nvidia.com>

Awesome, thanks!

-- 
Pedro


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-07-01 11:54       ` Matthew Wilcox
  2026-07-01 12:04         ` Pedro Falcato
@ 2026-07-01 12:48         ` Pedro Falcato
  2026-07-01 13:07           ` Gregg Leventhal
  1 sibling, 1 reply; 16+ messages in thread
From: Pedro Falcato @ 2026-07-01 12:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gregg Leventhal, Alexander Viro, Christian Brauner, Jan Kara,
	Andrew Morton, Song Liu, linux-fsdevel, linux-mm, linux-kernel,
	Eric Hagberg, David Hildenbrand, Lorenzo Stoakes, Zi Yan

On Wed, Jul 01, 2026 at 12:54:52PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 30, 2026 at 08:55:12PM +0100, Pedro Falcato wrote:
> > @@ -2043,6 +2044,17 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >  	 */
> >  	try_to_unmap_flush();
> >  
> > +	/*
> > +	 * If collapse looks to be successful, flush any dirty pages
> > +	 * out the page cache. With the nr_thps incremented, there won't be
> > +	 * any new writers (nor new dirties).
> > +	 */
> > +	if (result == SCAN_SUCCEED && !is_shmem) {
> > +		err = filemap_write_and_wait(mapping);
> > +		if (err)
> > +			result = SCAN_WRITEBACK_FAIL;
> > +	}
> 
> Oh, the other thing is that this should be conditioned on
> !mapping_large_folio_support().

Thinking more about this, I don't think it works. On the file open side, we
always truncate the full page cache, even if large folios are supported (due
to possibly max folio < PMD_ORDER). So if we skip write-and-wait on such cases,
it will be possible to fully reproduce this issue.

-- 
Pedro


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-07-01 12:48         ` Pedro Falcato
@ 2026-07-01 13:07           ` Gregg Leventhal
  2026-07-01 14:23             ` Pedro Falcato
  0 siblings, 1 reply; 16+ messages in thread
From: Gregg Leventhal @ 2026-07-01 13:07 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Matthew Wilcox, Alexander Viro, Christian Brauner, Jan Kara,
	Andrew Morton, Song Liu, linux-fsdevel, linux-mm, linux-kernel,
	Eric Hagberg, David Hildenbrand, Lorenzo Stoakes, Zi Yan

Thanks all, it sounds like there is nothing for me to do quite yet, correct?
Also Zi Yan is correct, I meant order-0 folios.

Please let me know if there is anything I can do.
FWIW: The patch I suggested did seem to resolve the issue (based on my
testing) and did prevent my reproducer from reproducing (whereas it
repros ~instantly for me on a not patched kernel), but I (obviously)
defer to you all here.

Thanks!

On Wed, Jul 1, 2026 at 8:48 AM Pedro Falcato <pfalcato@suse.de> wrote:
>
> On Wed, Jul 01, 2026 at 12:54:52PM +0100, Matthew Wilcox wrote:
> > On Tue, Jun 30, 2026 at 08:55:12PM +0100, Pedro Falcato wrote:
> > > @@ -2043,6 +2044,17 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > >      */
> > >     try_to_unmap_flush();
> > >
> > > +   /*
> > > +    * If collapse looks to be successful, flush any dirty pages
> > > +    * out the page cache. With the nr_thps incremented, there won't be
> > > +    * any new writers (nor new dirties).
> > > +    */
> > > +   if (result == SCAN_SUCCEED && !is_shmem) {
> > > +           err = filemap_write_and_wait(mapping);
> > > +           if (err)
> > > +                   result = SCAN_WRITEBACK_FAIL;
> > > +   }
> >
> > Oh, the other thing is that this should be conditioned on
> > !mapping_large_folio_support().
>
> Thinking more about this, I don't think it works. On the file open side, we
> always truncate the full page cache, even if large folios are supported (due
> to possibly max folio < PMD_ORDER). So if we skip write-and-wait on such cases,
> it will be possible to fully reproduce this issue.
>
> --
> Pedro


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
  2026-07-01 13:07           ` Gregg Leventhal
@ 2026-07-01 14:23             ` Pedro Falcato
  0 siblings, 0 replies; 16+ messages in thread
From: Pedro Falcato @ 2026-07-01 14:23 UTC (permalink / raw)
  To: Gregg Leventhal
  Cc: Matthew Wilcox, Alexander Viro, Christian Brauner, Jan Kara,
	Andrew Morton, Song Liu, linux-fsdevel, linux-mm, linux-kernel,
	Eric Hagberg, David Hildenbrand, Lorenzo Stoakes, Zi Yan

On Wed, Jul 01, 2026 at 09:07:44AM -0400, Gregg Leventhal wrote:
> Thanks all, it sounds like there is nothing for me to do quite yet, correct?

Correct.

> Also Zi Yan is correct, I meant order-0 folios.
> 
> Please let me know if there is anything I can do.
> FWIW: The patch I suggested did seem to resolve the issue (based on my
> testing) and did prevent my reproducer from reproducing (whereas it
> repros ~instantly for me on a not patched kernel), but I (obviously)
> defer to you all here.

FWIW, yes, I think your patch approaches correctness, but there are subtle issues, like:

Thread 0                    | Thread 1
do_dentry_open()            | do_dentry_open()
 FMODE_WRITE                | FMODE_WRITE
  nr_thps > 0               | /* the rest of the THP truncate logic */
  write_and_wait            |
                            |  write(opened_file)
                            |  /* we now have dirtied folios */
  truncate()                |

where the effective result depends on whether the invalidate lock
is held across filemap_write_and_wait() as well. But it will always,
as far as I can see, result in weird cases where writes can simply go
missing (truncate isn't really supposed to be called without the inode
lock, I think).

Matthew raised that, fundamentally, large-folio unaware filesystems should
not see these file THPs mixed in with dirty folios. Which is sensible and
seems to result in a simpler and easier to reason about fix overall.


-- 
Pedro


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2026-07-01 14:23 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-30 17:01 Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache Gregg Leventhal
2026-06-30 17:18 ` Gregg Leventhal
2026-06-30 18:31 ` Pedro Falcato
2026-06-30 18:49   ` Pedro Falcato
2026-06-30 19:55     ` Pedro Falcato
2026-06-30 22:34       ` Matthew Wilcox
2026-06-30 22:48       ` Zi Yan
2026-07-01 12:05         ` Pedro Falcato
2026-07-01 11:54       ` Matthew Wilcox
2026-07-01 12:04         ` Pedro Falcato
2026-07-01 12:48         ` Pedro Falcato
2026-07-01 13:07           ` Gregg Leventhal
2026-07-01 14:23             ` Pedro Falcato
2026-06-30 18:36 ` Matthew Wilcox
2026-06-30 19:05   ` Zi Yan
2026-06-30 19:07     ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox