Linux-mm Archive on lore.kernel.org
* Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
@ 2026-06-30 17:01 Gregg Leventhal
  2026-06-30 17:18 ` Gregg Leventhal
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Gregg Leventhal @ 2026-06-30 17:01 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner; Cc: Jan Kara,
	Matthew Wilcox, Andrew Morton, Song Liu, linux-fsdevel, linux-mm,
	linux-kernel, Eric Hagberg

[-- Attachment #1: Type: text/plain, Size: 7126 bytes --]

Hello,

We (Gregg Leventhal <gleventhal@janestreet.com> and Eric Hagberg

<ehagberg@janestreet.com>) have a reproducible data-loss issue involving file

THPs and write-open, impacting filesystems that do not support
writable large folios.


Attached are:


  - thp_write_open_cancel_dirty_repro.c

  - thp-open-writeback-before-purge.patch



Summary

=======


On an affected 6.12 kernel with CONFIG_READ_ONLY_THP_FOR_FS=y, a file can

contain read-only file THPs installed by khugepaged / MADV_COLLAPSE. When that

same file is later opened for write, do_dentry_open() notices

filemap_nr_thps() and drops the page cache:


        /*
         * XXX: Huge page cache doesn't support writing yet. Drop all page
         * cache for this file before processing writes.
         */
        if (f->f_mode & FMODE_WRITE) {
                if (filemap_nr_thps(inode->i_mapping)) {
                        struct address_space *mapping = inode->i_mapping;

                        filemap_invalidate_lock(inode->i_mapping);
                        unmap_mapping_range(mapping, 0, 0, 0);
                        truncate_inode_pages(mapping, 0);
                        filemap_invalidate_unlock(inode->i_mapping);
                }
        }


This is unsafe if the mapping also contains dirty folios.

truncate_inode_pages() is not just a clean cache-dropping primitive. It can

call truncate_cleanup_folio(), which calls folio_cancel_dirty().


In the attached reproducer, dirty appended data is discarded and later read(2)s

return zeros.


We observed this on btrfs and ext4, though most of the testing involved btrfs.


The same issue should apply to any filesystem where file THPs can be created

by READ_ONLY_THP_FOR_FS but writable large folios are not supported. The

do_dentry_open() block above is also unchanged in current mainline, so this

does not appear to be strictly 6.12-specific.
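
To gauge whether a host currently has any file THPs in its page cache at
all, one option is to look at the FileHugePages and FilePmdMapped lines of
/proc/meminfo. The helper below is only a sketch: meminfo_kb() is a
hypothetical name, it parses text the caller has already read from
/proc/meminfo, and the counters are system-wide, so a non-zero value does not
by itself say which file is affected.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up "Key:   <n> kB" in /proc/meminfo-style text and return n,
 * or -1 if the key is absent or its value cannot be parsed.
 * (Hypothetical helper for illustration; not part of the reproducer.)
 */
static long meminfo_kb(const char *text, const char *key)
{
  size_t klen = strlen(key);
  const char *line = text;

  while (line && *line) {
    /* Require an exact key match followed by ':' so that "File" does not
     * match "FileHugePages". */
    if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
      long val;
      if (sscanf(line + klen + 1, "%ld", &val) == 1) return val;
      return -1;
    }
    line = strchr(line, '\n');
    if (line) line++;
  }
  return -1;
}
```

A caller would read /proc/meminfo into a NUL-terminated buffer and pass it
as text, for example meminfo_kb(buf, "FileHugePages").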



Instrumentation

===============


Tracing the failure shows the dirty folios being invalidated from the

write-open path. INVALIDATE_DIRTY and CANCEL_DIRTY below are labels from our

own probes:


        do_dentry_open / vfs_open
          truncate_inode_pages_range
            truncate_cleanup_folio
              btrfs_invalidate_folio
              folio_cancel_dirty


A representative stack from the failing path:


        INVALIDATE_DIRTY ...
          btrfs_invalidate_folio
          truncate_cleanup_folio
          truncate_inode_pages_range
          vfs_open

        CANCEL_DIRTY ...
          truncate_cleanup_folio
          truncate_inode_pages_range
          vfs_open


This confirms that the appended dirty page-cache contents are being discarded

by the open-time THP cache purge rather than written back.



Why this happens

================


The do_dentry_open() code is trying to handle the fact that some filesystems

do not support writing to file THPs. The problematic assumption is that

dropping the page cache is a safe cache-management operation.


It is not safe when dirty folios are present, because truncate_inode_pages()

cancels their dirty state without writeback.


Note that the read-only file THPs themselves are clean. The lost data belongs
to unrelated dirty folios elsewhere in the same mapping (here, the appended
tail), which are caught by the blanket truncate_inode_pages(mapping, 0) over
the entire mapping.



Suggested fix direction

=======================


Before dropping THP-bearing page cache on write-open, write back and wait for

any dirty folios. After writeback completes, the folios are clean, so the

subsequent truncate_inode_pages() has no dirty state to cancel and the data is

safe on disk. A later read() simply repopulates the cache from disk. If

writeback fails, fail the open rather than silently discarding the data.


The attached patch does this by adding filemap_write_and_wait(mapping) before

the unmap_mapping_range() / truncate_inode_pages() sequence.
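
The ordering argument can be illustrated with a toy userspace model. This is
not kernel code and not the attached patch: struct toy_file and the toy_*()
functions are invented for illustration, with toy_writeback() standing in
for filemap_write_and_wait() and toy_truncate_cache() standing in for
truncate_inode_pages(). It only shows why dropping the cache is lossy when
dirty pages are present, and why writing back first makes it safe.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES 4
#define PAGE_SZ 8

/* Toy stand-in for a file: backing store plus an optional cached copy. */
struct toy_file {
  char disk[NPAGES][PAGE_SZ];
  char cache[NPAGES][PAGE_SZ];
  bool cached[NPAGES];
  bool dirty[NPAGES];
};

/* Buffered write: update the cache only and mark the page dirty.
 * data must point to at least PAGE_SZ bytes. */
static void toy_write(struct toy_file *f, size_t page, const char *data)
{
  memcpy(f->cache[page], data, PAGE_SZ);
  f->cached[page] = true;
  f->dirty[page] = true;
}

/* Write every dirty page to the backing store and mark it clean. */
static void toy_writeback(struct toy_file *f)
{
  for (size_t i = 0; i < NPAGES; i++) {
    if (f->cached[i] && f->dirty[i]) {
      memcpy(f->disk[i], f->cache[i], PAGE_SZ);
      f->dirty[i] = false;
    }
  }
}

/* Drop the whole cache, cancelling dirty state without writing it back. */
static void toy_truncate_cache(struct toy_file *f)
{
  for (size_t i = 0; i < NPAGES; i++) {
    f->cached[i] = false;
    f->dirty[i] = false;
  }
}

/* Read: serve from cache if present, otherwise refill from backing store. */
static const char *toy_read(struct toy_file *f, size_t page)
{
  if (!f->cached[page]) {
    memcpy(f->cache[page], f->disk[page], PAGE_SZ);
    f->cached[page] = true;
  }
  return f->cache[page];
}
```

In this model, toy_write() followed directly by toy_truncate_cache() makes a
later toy_read() return the old (zeroed) backing-store contents, while
inserting toy_writeback() before toy_truncate_cache() preserves the written
data.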


Two caveats we are aware of with this approach:


  - filemap_write_and_wait() flushes the entire mapping, so any write-open of

    a file with filemap_nr_thps() > 0 now forces synchronous writeback. This

    path already did a full unmap + truncate, so the extra cost is probably

    acceptable, but it is a behavior change.


  - The writeback happens before unmap_mapping_range(). That is sufficient for

    the reproducer, where the dirty data comes from buffered write(2), so the

    folios are already marked dirty. We would appreciate guidance on whether

    unmap should precede the writeback in order to also cover data dirtied

    only via a writable shared mapping.


An alternative would be to replace truncate_inode_pages() with a

clean-page-only invalidation primitive, but then dirty file THPs / dirty pages

may remain in the mapping and need careful handling.



Mitigation

==========


As a temporary mitigation, setting khugepaged's scan interval very high

appears to prevent the issue by effectively stopping background file THP

collapse:


        echo 4294967295 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs


This is not a complete fix. It reduces or disables khugepaged background

collapse, including file THP collapse, and may reduce THP-related performance

benefits for workloads that rely on khugepaged promotion. Fault-time anonymous

THP allocation is not disabled by this knob.


Disabling CONFIG_READ_ONLY_THP_FOR_FS also seems to mitigate the issue, but

both workarounds are suboptimal because they trade away performance.
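
An application-level workaround that follows from the analysis above, but
that we have not verified as a complete mitigation, is for writers to fsync
before closing, so that their data is never left dirty-only in page cache
when a later write-open purges the mapping. The sketch below assumes a
hypothetical helper named append_durably(). Note that its own open for write
can still trigger the purge and discard dirty data left behind by some other
writer, so it only helps if every writer of the file behaves this way.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes at offset off, then fsync before closing so the data is
 * not left dirty-only in page cache.  Returns 0 on success, -1 on error
 * with errno set.  (Hypothetical helper; not part of the reproducer.)
 */
static int append_durably(const char *path, const void *buf, size_t len,
                          off_t off)
{
  int fd = open(path, O_WRONLY | O_CREAT, 0644);
  if (fd < 0) return -1;

  const char *p = buf;
  while (len > 0) {
    ssize_t n = pwrite(fd, p, len, off);
    if (n <= 0) {
      /* Treat a zero-length write as an I/O error to avoid looping. */
      int e = (n == 0) ? EIO : errno;
      close(fd);
      errno = e;
      return -1;
    }
    p += n;
    off += n;
    len -= (size_t)n;
  }

  /* Force writeback now; report failure instead of hiding it. */
  if (fsync(fd) != 0) {
    int e = errno;
    close(fd);
    errno = e;
    return -1;
  }
  return close(fd);
}
```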



Reproducer

==========


The attached reproducer does the following:


  1. Creates a regular file with non-zero data.

  2. Maps part of the file read-only and uses MADV_COLLAPSE to force a file

     THP.

  3. Opens the file for writing and appends non-zero data, leaving it dirty in

     page cache.

  4. Closes the write fd.

  5. Re-collapses a read-only file range so filemap_nr_thps(mapping) is

     non-zero.

  6. Opens the file for write again, triggering the do_dentry_open() THP purge.

  7. Reads back the appended data.


Whether any single iteration reproduces is a race against background

writeback, so let the full iteration count run. A single clean pass does not

by itself prove the kernel is unaffected.
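
Before running the reproducer it may help to confirm which THP mode is
selected, since the reproducer's header comment asks for THP to be enabled.
/sys/kernel/mm/transparent_hugepage/enabled marks the active mode with
square brackets, e.g. "always [madvise] never". The sketch below extracts
that token from a line the caller has already read; selected_mode() is a
hypothetical helper name, not something the reproducer contains.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed token from a sysfs-style selector line such as
 * "always [madvise] never" into out.  Returns 0 on success, -1 if no
 * bracketed token is present or it does not fit in out.
 */
static int selected_mode(const char *line, char *out, size_t out_len)
{
  const char *lb = strchr(line, '[');
  if (!lb) return -1;

  const char *rb = strchr(lb + 1, ']');
  if (!rb) return -1;

  size_t n = (size_t)(rb - (lb + 1));
  if (n + 1 > out_len) return -1;  /* need room for the trailing NUL */

  memcpy(out, lb + 1, n);
  out[n] = '\0';
  return 0;
}
```

A caller would fgets() the first line of the sysfs file and pass it in. A
result of "never" suggests the collapse step is unlikely to succeed.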


On an affected 6.12 host:


        # ./thp_write_open_cancel_dirty_repro Maybe_corrupted_file

        path=Maybe_corrupted_file base_size=67108864 append_size=16384 iters=200

        REPRODUCED iter=0 bad_bytes=16384 first_bad=0 zero_count=16384 append_off=67108864

        first 64 got:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...

        first 64 want: 5c df 63 e6 6a ed 71 f4 78 fb 7f 03 86 0a 8d 11 ...


The corrupted range is visible as a run of null bytes at the append offset:


        # rg --text '\x00{64}\x00*' $PWD --only-matching \
              --byte-offset --no-line-number \
            | awk -F: '{print $1, $2, length($3)}' | head -n1
        /root/Maybe_corrupted_file 67108864 16384
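
Where rg is not available, the same check can be approximated in C. The
function below is a minimal sketch, not part of the attached reproducer:
find_zero_run() is a hypothetical name, and it scans a buffer the caller has
already read from the file for the first run of at least min_run zero bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Find the first run of at least min_run consecutive zero bytes in buf.
 * On success, store the run's start offset and full length and return 1;
 * return 0 if no such run exists.
 */
static int find_zero_run(const uint8_t *buf, size_t len, size_t min_run,
                         size_t *start, size_t *run_len)
{
  size_t i = 0;

  while (i < len) {
    if (buf[i] != 0) {
      i++;
      continue;
    }

    /* Extend j to one past the end of this run of zero bytes. */
    size_t j = i;
    while (j < len && buf[j] == 0) j++;

    if (j - i >= min_run) {
      *start = i;
      *run_len = j - i;
      return 1;
    }
    i = j;
  }
  return 0;
}
```

With min_run set to 64 this mirrors the '\x00{64}\x00*' pattern above. The
reported offset is relative to the start of the buffer, so a caller reading
the file in chunks has to add the chunk's file offset.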



We are happy to test any preferred fix direction and can provide
additional traces as needed.


Thanks,

Gregg Leventhal

Eric Hagberg

[-- Attachment #2: thp_write_open_cancel_dirty_repro.c --]
[-- Type: text/x-c-code, Size: 7962 bytes --]

// SPDX-License-Identifier: GPL-2.0
/*
 * Reproducer for write-open file-THP page-cache purge discarding dirty data.
 *
 * Intended for Linux kernels with CONFIG_READ_ONLY_THP_FOR_FS=y and a filesystem
 * that does not opt into writable large folios (e.g. btrfs/ext4 on affected
 * kernels).  Run on such a filesystem, with THP enabled.
 *
 * Mechanism:
 *   1. Create a file and fault/collapse a read-only file THP.
 *   2. Open for write and append non-zero bytes, leaving them dirty in cache.
 *      The first write-open may purge the earlier THP; that's fine.
 *   3. Close the write fd, then collapse a clean read-only range again so the
 *      mapping has file THPs while the appended folios are still dirty.
 *   4. Open for write again.  Affected kernels call truncate_inode_pages()
 *      from do_dentry_open() to drop the THP-bearing cache.  That can call
 *      folio_cancel_dirty() on the dirty appended folios.
 *   5. Read back appended bytes.  On failure they are zero/incorrect.
 *
 * This intentionally uses MADV_COLLAPSE for determinism instead of waiting for
 * background khugepaged.  If MADV_COLLAPSE returns EINVAL, check that no fd is
 * open for write and that READ_ONLY_THP_FOR_FS / THP are enabled.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif
#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000
#endif

#define KiB 1024ULL
#define MiB (1024ULL * KiB)
#define HPAGE (2ULL * MiB)

static void die(const char *what)
{
  fprintf(stderr, "%s: %s\n", what, strerror(errno));
  exit(2);
}

static void xwrite_full(int fd, const void *buf, size_t len, off_t off)
{
  const char *p = buf;
  while (len > 0) {
    ssize_t n = pwrite(fd, p, len, off);
    if (n < 0) die("pwrite");
    if (n == 0) {
      fprintf(stderr, "short pwrite: 0\n");
      exit(2);
    }
    p += n;
    off += n;
    len -= (size_t)n;
  }
}

static void xread_full(int fd, void *buf, size_t len, off_t off)
{
  char *p = buf;
  while (len > 0) {
    ssize_t n = pread(fd, p, len, off);
    if (n < 0) die("pread");
    if (n == 0) {
      fprintf(stderr, "short pread at off=%jd\n", (intmax_t)off);
      exit(2);
    }
    p += n;
    off += n;
    len -= (size_t)n;
  }
}

static void fill_pattern(uint8_t *buf, size_t len, uint8_t seed)
{
  for (size_t i = 0; i < len; i++) {
    /* Deliberately avoid zero bytes, so any zero is suspicious. */
    buf[i] = (uint8_t)(1 + ((i * 131u + seed) % 255u));
  }
}

static void *map_file_aligned_ro(int fd, off_t file_off, size_t len)
{
  /* Reserve enough VA space to choose a PMD-aligned address. */
  size_t reserve_len = len + HPAGE;
  void *reserve = mmap(NULL, reserve_len, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (reserve == MAP_FAILED) die("mmap reserve");

  uintptr_t base = (uintptr_t)reserve;
  uintptr_t aligned = (base + HPAGE - 1) & ~(uintptr_t)(HPAGE - 1);
  /* Release the reservation, then map the file at the aligned address. */
  if (munmap(reserve, reserve_len) != 0) die("munmap reserve");

  void *p = mmap((void *)aligned, len, PROT_READ,
                 MAP_PRIVATE | MAP_FIXED_NOREPLACE, fd, file_off);
  if (p == MAP_FAILED) die("mmap file fixed");
  if ((uintptr_t)p != aligned) {
    fprintf(stderr, "mmap did not return requested address\n");
    exit(2);
  }
  return p;
}

static void collapse_ro_range(const char *path, off_t file_off)
{
  int fd = open(path, O_RDONLY | O_CLOEXEC);
  if (fd < 0) die("open ro for collapse");

  void *p = map_file_aligned_ro(fd, file_off, HPAGE);

  /* Fault every page first; MADV_COLLAPSE generally expects present pages. */
  volatile uint8_t acc = 0;
  for (size_t off = 0; off < HPAGE; off += 4096) acc ^= ((volatile uint8_t *)p)[off];
  (void)acc;

  if (madvise(p, HPAGE, MADV_COLLAPSE) != 0) {
    int e = errno;
    fprintf(stderr,
            "MADV_COLLAPSE failed at file_off=%jd: %s\n"
            "Hints: run on btrfs/ext4 with THP enabled, ensure no write fd is open, "
            "and ensure CONFIG_READ_ONLY_THP_FOR_FS=y.\n",
            (intmax_t)file_off, strerror(e));
    errno = e;
    die("madvise MADV_COLLAPSE");
  }

  if (munmap(p, HPAGE) != 0) die("munmap collapsed");
  if (close(fd) != 0) die("close collapse fd");
}

static int count_bad(const uint8_t *got, const uint8_t *want, size_t len,
                     size_t *first_bad, size_t *zero_count)
{
  int bad = 0;
  *first_bad = (size_t)-1;
  *zero_count = 0;
  for (size_t i = 0; i < len; i++) {
    if (got[i] == 0) (*zero_count)++;
    if (got[i] != want[i]) {
      if (*first_bad == (size_t)-1) *first_bad = i;
      bad++;
    }
  }
  return bad;
}

int main(int argc, char **argv)
{
  const char *path = argc > 1 ? argv[1] : "./thp-open-repro.dat";
  const size_t base_size = argc > 2 ? strtoull(argv[2], NULL, 0) : 64 * MiB;
  const size_t append_size = argc > 3 ? strtoull(argv[3], NULL, 0) : 16 * KiB;
  const int iters = argc > 4 ? atoi(argv[4]) : 200;

  if (base_size < 4 * HPAGE) {
    fprintf(stderr, "base_size must be at least 8MiB\n");
    return 2;
  }

  printf("path=%s base_size=%zu append_size=%zu iters=%d\n",
         path, base_size, append_size, iters);

  uint8_t *base = malloc(HPAGE);
  uint8_t *want = malloc(append_size);
  uint8_t *got = malloc(append_size);
  if (!base || !want || !got) die("malloc");
  fill_pattern(base, HPAGE, 7);
  fill_pattern(want, append_size, 91);

  int fd = open(path, O_CREAT | O_TRUNC | O_RDWR | O_CLOEXEC, 0666);
  if (fd < 0) die("create");
  for (off_t off = 0; off < (off_t)base_size; off += (off_t)HPAGE) {
    xwrite_full(fd, base, HPAGE, off);
  }
  if (fsync(fd) != 0) die("fsync initial");
  if (close(fd) != 0) die("close initial");

  const off_t append_off = (off_t)base_size;
  bool reproduced = false;

  for (int iter = 0; iter < iters; iter++) {
    /* Return to the original size and durable base contents. */
    fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd < 0) die("open reset");
    if (ftruncate(fd, (off_t)base_size) != 0) die("ftruncate reset");
    if (fsync(fd) != 0) die("fsync reset");
    if (close(fd) != 0) die("close reset");

    /* Create read-only file THPs, then open-for-write and dirty appended data. */
    collapse_ro_range(path, 0);

    fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd < 0) die("open append");
    xwrite_full(fd, want, append_size, append_off);
    /* Do not fsync.  The dirty page-cache state is what we are testing. */
    if (close(fd) != 0) die("close append");

    /* Reinstall a file THP while no write fd is open.  This is the state that
     * makes the next write-open purge the mapping. */
    collapse_ro_range(path, 0);

    fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd < 0) die("open trigger");
    if (close(fd) != 0) die("close trigger");

    int rfd = open(path, O_RDONLY | O_CLOEXEC);
    if (rfd < 0) die("open readback");
    xread_full(rfd, got, append_size, append_off);
    if (close(rfd) != 0) die("close readback");

    size_t first_bad, zero_count;
    int bad = count_bad(got, want, append_size, &first_bad, &zero_count);
    if (bad) {
      printf("REPRODUCED iter=%d bad_bytes=%d first_bad=%zu zero_count=%zu append_off=%jd\n",
             iter, bad, first_bad, zero_count, (intmax_t)append_off);
      printf("first 64 got:");
      for (int i = 0; i < 64 && i < (int)append_size; i++) printf(" %02x", got[i]);
      printf("\nfirst 64 want:");
      for (int i = 0; i < 64 && i < (int)append_size; i++) printf(" %02x", want[i]);
      printf("\n");
      reproduced = true;
      break;
    }

    if ((iter % 10) == 0) printf("iter=%d ok\n", iter);
  }

  if (!reproduced)
    printf("no corruption observed after %d iterations\n", iters);

  return reproduced ? 1 : 0;
}

[-- Attachment #3: thp-open-writeback-before-purge.patch --]
[-- Type: application/x-patch, Size: 1208 bytes --]

Thread overview: 16+ messages
2026-06-30 17:01 Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache Gregg Leventhal
2026-06-30 17:18 ` Gregg Leventhal
2026-06-30 18:31 ` Pedro Falcato
2026-06-30 18:49   ` Pedro Falcato
2026-06-30 19:55     ` Pedro Falcato
2026-06-30 22:34       ` Matthew Wilcox
2026-06-30 22:48       ` Zi Yan
2026-07-01 12:05         ` Pedro Falcato
2026-07-01 11:54       ` Matthew Wilcox
2026-07-01 12:04         ` Pedro Falcato
2026-07-01 12:48         ` Pedro Falcato
2026-07-01 13:07           ` Gregg Leventhal
2026-07-01 14:23             ` Pedro Falcato
2026-06-30 18:36 ` Matthew Wilcox
2026-06-30 19:05   ` Zi Yan
2026-06-30 19:07     ` Matthew Wilcox
