[BUG] iomap/io_uring: O_APPEND async buffered write silently re-appends a data chunk (corruption) on XFS, 6.1.y/6.12.y

* [BUG] iomap/io_uring: O_APPEND async buffered write silently re-appends a data chunk (corruption) on XFS, 6.1.y/6.12.y
@ 2026-06-04 18:46 Gregg Leventhal
  2026-06-05 15:55 ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: Gregg Leventhal @ 2026-06-04 18:46 UTC (permalink / raw)
  To: hch, djwong, bfoster, Eric Hagberg
  Cc: linux-xfs, linux-fsdevel, io-uring, Jens Axboe, stable


Hi all,

We're seeing silent data corruption -- a chunk of a buffered write being
silently repeated at a later offset -- when using io_uring async buffered
writes with O_APPEND on XFS. It reproduces on the longterm stable trees
6.1.y and 6.12.y under memory pressure, and is fixed in 6.18.y.

Summary
-------
On XFS, an io_uring async buffered write to a file opened O_APPEND can
silently write a chunk of data twice. This is not a harmless in-place
rewrite: a page-aligned, page-multiple sub-range is re-appended at a later
offset, so the file grows and ends up containing the same run of bytes
twice in sequence (and everything after the duplicate is shifted relative
to what was intended). The CQE reports the full requested byte count
(userspace sees success), the resulting file is larger than the total
bytes the kernel reported writing, and there is no error and no dmesg
warning.

Affected kernels (vanilla stable trees; we run ELRepo builds of userspace)
  - 6.1.173   (observed as 6.1.173-1.el8.x86_64)
  - 6.12.85   (observed as 6.12.85-1.el8.x86_64)
Filesystem: XFS

Unaffected:
  - 6.18.y

Trigger conditions
------------------
  - O_APPEND specific. With an explicit file offset (no O_APPEND) we do
    not observe the corruption.
  - Only manifests under memory pressure. The reproducer triggers
    reliably when the system is under enough memory/paging pressure and
    does not reproduce on an otherwise idle box.

Root cause (our understanding)
------------------------------
Under memory pressure the inline IOCB_NOWAIT attempt commits a partial,
non-page-aligned amount and returns short at a page boundary. The pre-fix
iomap_write_iter() reverts the iov_iter by the bytes already written and
returns -EAGAIN:

    } while (iov_iter_count(i) && length);

    if (status == -EAGAIN) {
        iov_iter_revert(i, total_written);
        return -EAGAIN;
    }
    return total_written ? total_written : status;

io_uring then reissues the page-aligned remainder on io-wq. Because the
write is O_APPEND, the offset is re-resolved to the current EOF, which now
already includes the bytes committed by the inline attempt. The result is
that a page-aligned sub-range is written a second time, re-appended past
the new EOF rather than landing where it was originally intended.
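
To make the mechanism concrete, here is a minimal user-space toy model of
the sequence described above. It is only a sketch under stated
assumptions, not kernel code: struct toy_file, toy_commit() and
toy_append_prefix() are hypothetical names, `acked` stands for bytes whose
progress is kept, and `reverted` stands for bytes that reached the file
but were then reverted out of the iov_iter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE 4096u
#define FILE_MAX (64u * PAGE)

/* Toy stand-in for a file: a byte array plus an EOF marker (i_size). */
struct toy_file {
  uint8_t data[FILE_MAX];
  size_t eof;
};

/* Copy len bytes to the file at pos, extending EOF if the write goes past it. */
static void toy_commit(struct toy_file *f, size_t pos, const uint8_t *src,
                       size_t len) {
  assert(pos + len <= FILE_MAX);
  memcpy(f->data + pos, src, len);
  if (pos + len > f->eof)
    f->eof = pos + len;
}

/*
 * Model of one O_APPEND request on the pre-fix path (assumed sequence):
 *  - the inline attempt puts acked + reverted bytes into the file at EOF;
 *  - only `acked` bytes count as consumed, the rest is reverted;
 *  - the retry re-resolves its offset to the already advanced EOF and
 *    writes everything after the acked bytes.
 * Returns the byte count the caller is told was written.
 */
static size_t toy_append_prefix(struct toy_file *f, const uint8_t *buf,
                                size_t len, size_t acked, size_t reverted) {
  assert(acked + reverted <= len);
  toy_commit(f, f->eof, buf, acked + reverted);    /* inline NOWAIT attempt */
  toy_commit(f, f->eof, buf + acked, len - acked); /* retry at the new EOF */
  return len;
}
```

In this model, a 40968-byte buffer with acked = 2 pages and reverted = 3
pages ends up as a 53256-byte file while 40968 bytes are reported: the 3
reverted pages appear twice in a row.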

What fixes it
-------------
We did not bisect. We identified Brian Foster's "iomap: incremental
per-operation iter advance" series as the likely relevant change,
backported it to the affected kernel, and confirmed it makes the
reproducer pass. The series was merged for v6.15:


https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.18.y&id=30f530096166202cf70e1b7d1de5a8cdfba42af1

It reworks iomap_write_iter() to advance iter->pos/iter->len incrementally
(iomap_iter_advance) and removes the iov_iter_revert/-EAGAIN handling, so
retries resume from the correct offset. The buffered-write change is in
"iomap: advance the iter directly on buffered writes" (d9dc477ff6a2), but
it depends on the earlier infrastructure patches in the same series.

Detection in the reproducer (both silent)
-----------------------------------------
  1) final file size > sum of CQE byte counts the kernel reported.
  2) the file is filled with a u64 "byte offset / 8" pattern, so on
     readback element j must equal j; the first mismatch marks the start
     of the duplicated copy (observed to be page-aligned).
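
The two checks are linked in the simple case of a single duplicated run.
A hedged sketch follows (first_mismatch() and dup_bytes() are illustrative
names, not part of the attached reproducer). If element j holds value v,
the content that belongs at byte offset 8*v reappeared at 8*j, so the
duplicated run is 8*(j - v) bytes long, which should match the excess file
size from check 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the first element with v[j] != j, or n if the pattern is intact. */
static size_t first_mismatch(const uint64_t *v, size_t n) {
  for (size_t j = 0; j < n; j++)
    if (v[j] != j)
      return j;
  return n;
}

/*
 * Assumes exactly one duplicated run: returns the distance in bytes between
 * where the content belongs (8 * found) and where it reappeared (8 * j).
 */
static uint64_t dup_bytes(size_t j, uint64_t found) {
  assert(found <= j);
  return 8 * ((uint64_t)j - found);
}
```

For example, the sequence 0..7 followed by 4..7 mismatches at j = 8 with
value 4, giving a 32-byte duplicated run.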

Reproducer
----------
Build: gcc -O2 -o repro_uring_dup repro_uring_dup.c -luring
Run:   ./repro_uring_dup /path/on/xfs/repro [seconds] [file_target_mb]
Needs the system under memory pressure to trigger; under those conditions
it reproduces reliably. Source attached (repro_uring_dup.c).

Notes on stable
---------------
The fix is a refactor with no Fixes: tag, and the buffered-write commit
builds on the preceding patches in the series, so a single-commit
cherry-pick into 6.1.y / 6.12.y doesn't look feasible. We're wondering
whether a smaller, targeted fix would be more backportable for the active
LTS trees -- e.g. ensuring the -EAGAIN retry path keeps the append
position consistent with the reverted iov_iter so the already-committed
range isn't re-appended -- but we'd defer to your judgment on whether that
is sound or whether backporting the series as a unit is the better path.
Given this is silent data corruption present since io_uring async buffered
write support (~v6.0), we'd appreciate guidance on the right approach.

Happy to test patches and provide any additional detail.

Regards,
Gregg Leventhal <gleventhal@janestreet.com> and
Eric Hagberg <ehagberg@janestreet.com>

[-- Attachment #2: repro_uring_dup.c --]
[-- Type: text/x-csrc, Size: 6588 bytes --]

/*
 * repro_uring_dup.c
 *
 * Reproducer for io_uring async buffered-write duplication on XFS.
 * Issues large, variable-size, non-page-aligned buffered writev's appended
 * to a file via io_uring with offset -1 ("use current position").
 *
 * Bug: when the inline IOCB_NOWAIT attempt does a partial-page short write
 * (landing on a page boundary) and the page-aligned remainder is reissued on
 * io-wq, a page-aligned, page-multiple sub-range of the remainder is written
 * TWICE, while the CQE still reports the full requested byte count. Result:
 * the file is larger than the bytes we were told succeeded, with a page-aligned
 * duplicated chunk.
 *
 * Detection (both silent - no error is ever returned):
 *   1) final file size > total bytes the kernel told us it wrote.
 *   2) file is filled with a u64 "byte offset / 8" pattern, so on readback
 *      element j must equal j; the first j where it doesn't is the start of the
 *      duplicated copy (expected to be page-aligned).
 *
 * Build:  gcc -O2 -o repro_uring_dup repro_uring_dup.c -luring
 * Run:    ./repro_uring_dup /path/on/xfs/repro [seconds] [file_target_mb]
 */
#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/uio.h>

#define QD 8
#define MB (1024UL * 1024UL)
#define MAXCHUNK (24UL * MB)
#define MINCHUNK (1UL * MB)

/* 1: O_APPEND / offset -1 variant (corrupts).
 * 0: no O_APPEND, explicit offset variant (does not corrupt). */
static int use_append = 1;

static uint64_t now_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Fill buf so the u64 at global byte offset (base+8*i) holds (base+8*i)/8. */
static void fill_pattern(uint64_t *buf, uint64_t base_bytes, size_t len) {
  uint64_t start_idx = base_bytes / 8;
  size_t n = len / 8;
  for (size_t i = 0; i < n; i++)
    buf[i] = start_idx + i;
}

/* One writev; loops over (legitimately) short *returned* results. */
static void write_all(struct io_uring *ring, int fd, uint8_t *buf, size_t len,
                      uint64_t expected) {
  size_t done = 0;
  while (done < len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    struct iovec iov = {.iov_base = buf + done, .iov_len = len - done};
    long long off = use_append ? -1LL : (long long)(expected + done);
    io_uring_prep_writev(sqe, fd, &iov, 1, (unsigned long long)off);

    int ret = io_uring_submit(ring);
    if (ret < 0) {
      fprintf(stderr, "submit: %s\n", strerror(-ret));
      exit(1);
    }

    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(ring, &cqe);
    if (ret < 0) {
      fprintf(stderr, "wait_cqe: %s\n", strerror(-ret));
      exit(1);
    }
    int res = cqe->res;
    io_uring_cqe_seen(ring, cqe);

    if (res < 0) {
      fprintf(stderr, "write: %s\n", strerror(-res));
      exit(1);
    }
    if (res == 0) {
      fprintf(stderr, "write returned 0\n");
      exit(1);
    }
    done += (size_t)res;
  }
}

int main(int argc, char **argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s <path-prefix-on-xfs> [seconds] [file_target_mb]\n",
            argv[0]);
    return 2;
  }
  const char *prefix = argv[1];
  int seconds = (argc > 2) ? atoi(argv[2]) : 60;
  uint64_t file_target = ((argc > 3) ? (uint64_t)atoll(argv[3]) : 48) * MB;

  srand((unsigned)(time(NULL) ^ getpid()));

  struct io_uring ring;
  if (io_uring_queue_init(QD, &ring, 0)) {
    perror("io_uring_queue_init");
    return 1;
  }

  uint8_t *buf = aligned_alloc(4096, MAXCHUNK);
  if (!buf) {
    perror("aligned_alloc");
    return 1;
  }

  static uint64_t rbuf[1 << 16];
  uint64_t deadline = now_ns() + (uint64_t)seconds * 1000000000ULL;
  long files = 0;

  while (now_ns() < deadline) {
    char fn[8192];
    snprintf(fn, sizeof fn, "%s.%ld", prefix, files);
    int flags = O_WRONLY | O_CREAT | O_TRUNC | (use_append ? O_APPEND : 0);
    int fd = open(fn, flags, 0644);
    if (fd < 0) {
      perror("open");
      return 1;
    }

    uint64_t expected = 0;
    while (expected < file_target) {
      size_t want = MINCHUNK + ((size_t)rand() % (MAXCHUNK - MINCHUNK));
      want &= ~((size_t)7); /* 8-align; deliberately NOT page-aligned */
      fill_pattern((uint64_t *)buf, expected, want);
      write_all(&ring, fd, buf, want, expected);
      expected += want; /* CQE reported full success */
    }
    close(fd);

    /* ---- verify ---- */
    struct stat st;
    if (stat(fn, &st)) {
      perror("stat");
      return 1;
    }

    long long first_bad = -1;
    uint64_t bad_val = 0;
    int rfd = open(fn, O_RDONLY);
    if (rfd < 0) {
      perror("open ro");
      return 1;
    }
    uint64_t idx = 0;
    ssize_t r;
    while ((r = read(rfd, rbuf, sizeof rbuf)) > 0) {
      size_t cnt = (size_t)r / 8;
      for (size_t i = 0; i < cnt; i++) {
        if (rbuf[i] != idx) {
          first_bad = (long long)(idx * 8);
          bad_val = rbuf[i];
          break;
        }
        idx++;
      }
      if (first_bad >= 0)
        break;
    }
    close(rfd);

    int bug = ((uint64_t)st.st_size != expected) || (first_bad >= 0);
    files++;

    if (bug) {
      printf("\n*** CORRUPTION DETECTED in %s ***\n", fn);
      printf("  bytes kernel said it wrote (sum of CQE results): %llu\n",
             (unsigned long long)expected);
      printf("  actual file size:                                %llu\n",
             (unsigned long long)st.st_size);
      printf("  extra (duplicated) bytes:                        %lld\n",
             (long long)st.st_size - (long long)expected);
      if (first_bad >= 0) {
        printf("  first mismatching offset: %lld (0x%llx)  page_aligned=%s\n", first_bad,
               (unsigned long long)first_bad, (first_bad % 4096 == 0) ? "YES" : "no");
        printf("    expected u64 %llu but found %llu "
               "(content from byte offset %llu reappeared here)\n",
               (unsigned long long)(first_bad / 8), (unsigned long long)bad_val,
               (unsigned long long)(bad_val * 8));
      }
      printf("  (file kept for inspection)\n");
      io_uring_queue_exit(&ring);
      return 0;
    }
    unlink(fn);
    if (files % 20 == 0)
      fprintf(stderr, "...%ld files clean\n", files);
  }

  printf("No corruption in %d s (%ld files). Try more time, parallel instances, "
         "or memory pressure.\n",
         seconds, files);
  io_uring_queue_exit(&ring);
  free(buf);
  return 0;
}

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] iomap/io_uring: O_APPEND async buffered write silently re-appends a data chunk (corruption) on XFS, 6.1.y/6.12.y
  2026-06-04 18:46 [BUG] iomap/io_uring: O_APPEND async buffered write silently re-appends a data chunk (corruption) on XFS, 6.1.y/6.12.y Gregg Leventhal
@ 2026-06-05 15:55 ` Brian Foster
  2026-06-08 16:02   ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2026-06-05 15:55 UTC (permalink / raw)
  To: Gregg Leventhal
  Cc: hch, djwong, Eric Hagberg, linux-xfs, linux-fsdevel, io-uring,
	Jens Axboe, stable

On Thu, Jun 04, 2026 at 02:46:33PM -0400, Gregg Leventhal wrote:
> Hi all,
> 
> We're seeing silent data corruption -- a chunk of a buffered write being
> silently repeated at a later offset -- when using io_uring async buffered
> writes with O_APPEND on XFS. It reproduces on the longterm stable trees
> 6.1.y and 6.12.y under memory pressure, and is fixed in 6.18.y.
> 
...
> What fixes it
> -------------
> We did not bisect. We identified Brian Foster's "iomap: incremental
> per-operation iter advance" series as the likely relevant change,
> backported it to the affected kernel, and confirmed it makes the
> reproducer pass. The series was merged for v6.15:
> 
>   [1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.18.y&id=30f530096166202cf70e1b7d1de5a8cdfba42af1
> 
> It reworks iomap_write_iter() to advance iter->pos/iter->len incrementally
> (iomap_iter_advance) and removes the iov_iter_revert/-EAGAIN handling, so
> retries resume from the correct offset. The buffered-write change is in
> "iomap: advance the iter directly on buffered writes" (d9dc477ff6a2), but
> it depends on the earlier infrastructure patches in the same series.
> 

Note: the correct hash for the buffered write change is 1a1a3b574b97
("iomap: advance the iter directly on buffered writes"). The hash
referenced above is the commit for the read path.

Thanks for the legwork here. I'm at least glad to see I accidentally
fixed something vs. breaking it for a change. ;) I am slightly wondering
if the fundamental issue here is splitting the append write into two
partial requests and whether that's racy wrt EOF, not necessarily the
pos tracking added by this patch.

For example, if we assume the same sort of append+nowait -> partial
write -> append retry loop via io_uring on the current code, but then
inject some other append write (or i_size change) between the two split
writes, wouldn't that result in a similar problem? I don't think we'd
rewrite the original data again, but maybe data ends up interleaved or
something.
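
A toy model of that hypothetical, to show what "interleaved" would look
like. This is a sketch of the scenario as described, not of kernel
behaviour: struct toy_log, toy_append() and toy_split_append() are made-up
names, and the other writer is assumed to run exactly between the two
halves of the split write.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy append-only file: a small text buffer plus its current length (EOF). */
struct toy_log {
  char data[64];
  size_t eof;
};

/* O_APPEND semantics in this model: every write lands at the current EOF. */
static void toy_append(struct toy_log *f, const char *src, size_t len) {
  assert(f->eof + len < sizeof f->data);
  memcpy(f->data + f->eof, src, len);
  f->eof += len;
  f->data[f->eof] = '\0';
}

/*
 * One logical append split in two: `first` bytes go in, then an unrelated
 * writer (if any) appends `other`, then the remainder is retried. Position
 * tracking is correct here: the retry resumes at msg + first, so nothing is
 * written twice.
 */
static void toy_split_append(struct toy_log *f, const char *msg, size_t first,
                             const char *other) {
  size_t len = strlen(msg);
  assert(first <= len);
  toy_append(f, msg, first);
  if (other)
    toy_append(f, other, strlen(other));
  toy_append(f, msg + first, len - first);
}
```

Splitting "AAAABBBB" after 4 bytes with no other writer yields "AAAABBBB";
with a concurrent "xx" in between, the model yields "AAAAxxBBBB": no
duplication, but the two halves are no longer contiguous.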

But anyways looking back at that commit, I think the relevant behavior
change is that ki_pos update is made consistent with whatever partial
completion iomap_write_iter() may have performed. More specifically, the
older code doesn't update iter->pos until after a successful iomap
iteration (via iomap_iter()). This means that if we loop within
iomap_write_iter() one or more times before hitting an -EAGAIN, the
local pos update is lost and doesn't reflect back into the iomap_iter or
thus the assignment to iocb->ki_pos in the caller
(iomap_file_buffered_write()). Therefore, we revert the iov iter so it
is consistent with ki_pos at the start of the current iter.  However we
have an append write in this case and i_size is updated within
iomap_write_iter(), so the write retry will overwrite ki_pos to the
new/updated EOF IIUC (via generic_write_checks_count(), called from the
fs before calling into iomap) and result in the observed behavior of
rewriting some amount of data to a new file offset.

The updated code bumps iter->pos incrementally within
iomap_write_iter(). We don't need to revert the iov_iter anymore because
incremental progress will be reflected back to iocb->ki_pos via
iter->pos. However this also happens to be consistent with incremental
i_size updates within iomap_write_iter(), so (as long as we don't race,
I think) the retry should be consistent and pick up where the partial
write left off.

One thing I might try here is to see if just deferring append writes to
!NOWAIT context avoids this problem because I wonder how sane that sort
of retry situation really is. I'm not quite sure what the expectations
are in that sort of case. Is that something that's easy to test? Of
course that wouldn't prevent this issue if other applications have this
same write pattern.

Another potential option for a stable only fix might be to tweak the iomap
code to not update i_size for append (&& nowait?) writes until an
iter->pos update is imminent, but I think we'd need to be careful there
due to the pagecache_isize_extended() call. I think that's more for
non-append cases, but I'd have to take a closer look. Maybe we could
also replace that iov_iter revert with a hardcoded advance of the iomap
iter and emulate the same behavior as newer kernels. That seems
cleanest actually, but again needs some thought and testing...

Brian


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] iomap/io_uring: O_APPEND async buffered write silently re-appends a data chunk (corruption) on XFS, 6.1.y/6.12.y
  2026-06-05 15:55 ` Brian Foster
@ 2026-06-08 16:02   ` Brian Foster
  2026-06-08 17:17     ` Eric Hagberg
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2026-06-08 16:02 UTC (permalink / raw)
  To: Gregg Leventhal
  Cc: hch, djwong, Eric Hagberg, linux-xfs, linux-fsdevel, io-uring,
	Jens Axboe, stable

On Fri, Jun 05, 2026 at 11:55:39AM -0400, Brian Foster wrote:
...
> 
> One thing I might try here is to see if just deferring append writes to
> !NOWAIT context avoids this problem because I wonder how sane that sort
> of retry situation really is. I'm not quite sure what the expectations
> are in that sort of case. Is that something that's easy to test? Of
> course that wouldn't prevent this issue if other applications have this
> same write pattern.
> 
> Another potential option for a stable only fix might be to tweak the iomap
> code to not update i_size for append (&& nowait?) writes until an
> iter->pos update is imminent, but I think we'd need to be careful there
> due to the pagecache_isize_extended() call. I think that's more for
> non-append cases, but I'd have to take a closer look. Maybe we could
> also replace that iov_iter revert with a hardcoded advance of the iomap
> iter and emulate the same behavior as newer kernels. That seems
> cleanest actually, but again needs some thought and testing...
> 

The above is harder than I initially thought because iomap_iter()
expects the iter->pos to reflect the original position. This gets into
some of the dependency patches in the associated rework series.

Another idea that came to mind is to try and just replace the -EAGAIN
return sequence from the low level iterator with a flag that triggers
-EAGAIN from the next iter advance. The idea here is to allow the write
to return partial completion (i.e. so no iov_iter revert) without having
to return an error from the lowest level in the stack. I had claude come
up with a quick patch [1] for reference/experimentation.

This is based on v6.12 stable and compile tested only. It needs more
review and testing in general but might be worth throwing your
reproducer at if you can..?

Brian

[1]

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0178292c1864..956700441f6a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1037,10 +1037,9 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		}
 	} while (iov_iter_count(i) && length);
 
-	if (status == -EAGAIN) {
-		iov_iter_revert(i, total_written);
-		return -EAGAIN;
-	}
+	if (status == -EAGAIN)
+		iter->iomap.flags |= IOMAP_F_ASYNC_RETRY;
+
 	return total_written ? total_written : status;
 }
 
diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index 79a0614eaab7..6021b09ddc2f 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -22,6 +22,7 @@
 static inline int iomap_iter_advance(struct iomap_iter *iter)
 {
 	bool stale = iter->iomap.flags & IOMAP_F_STALE;
+	bool async_retry = iter->iomap.flags & IOMAP_F_ASYNC_RETRY;
 
 	/* handle the previous iteration (if any) */
 	if (iter->iomap.length) {
@@ -35,6 +36,8 @@ static inline int iomap_iter_advance(struct iomap_iter *iter)
 		iter->len -= iter->processed;
 		if (!iter->len)
 			return 0;
+		if (async_retry)
+			return -EAGAIN;
 	}
 
 	/* clear the state for the next iteration */
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index d204dcd35063..8d60073a255d 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -74,9 +74,14 @@ struct vm_fault;
  * IOMAP_F_STALE indicates that the iomap is not valid any longer and the file
  * range it covers needs to be remapped by the high level before the operation
  * can proceed.
+ *
+ * IOMAP_F_ASYNC_RETRY indicates that a buffered write made partial progress
+ * but needs to return -EAGAIN to trigger an async retry. The iter has already
+ * been advanced to reflect the partial progress.
  */
 #define IOMAP_F_SIZE_CHANGED	(1U << 8)
 #define IOMAP_F_STALE		(1U << 9)
+#define IOMAP_F_ASYNC_RETRY	(1U << 10)
 
 /*
  * Flags from 0x1000 up are for file system specific usage:
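
For readers following along, the control flow of the patch can be sketched
in user space. This is a simplified model with hypothetical names (struct
toy_iter, toy_write_step(), toy_iter_advance(), TOY_F_ASYNC_RETRY); it
only mirrors the ordering in the diff, which is to commit partial progress
first and then surface -EAGAIN from the advance step.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the IOMAP_F_ASYNC_RETRY flag in the diff. */
#define TOY_F_ASYNC_RETRY (1u << 0)

struct toy_iter {
  size_t pos;     /* file position of the operation */
  size_t len;     /* bytes still to process */
  long processed; /* result of the last step: bytes done or -errno */
  unsigned flags;
};

/*
 * Write step: `done` bytes were written, then the step would have blocked
 * if `blocked` is set. Partial progress is returned as a byte count and the
 * retry request is recorded in a flag instead of an error return.
 */
static long toy_write_step(struct toy_iter *it, size_t done, int blocked) {
  assert(done <= it->len);
  if (blocked)
    it->flags |= TOY_F_ASYNC_RETRY;
  if (done)
    return (long)done;
  return blocked ? -EAGAIN : 0;
}

/*
 * Advance step: account for the bytes processed, then report the deferred
 * -EAGAIN. Returns 1 to keep iterating, 0 when done, or a negative errno.
 */
static int toy_iter_advance(struct toy_iter *it) {
  if (it->processed < 0)
    return (int)it->processed;
  it->pos += (size_t)it->processed;
  it->len -= (size_t)it->processed;
  if (!it->len)
    return 0;
  if (it->flags & TOY_F_ASYNC_RETRY)
    return -EAGAIN;
  return 1;
}
```

In this model a step that writes 20 of 50 bytes and then blocks leaves the
iterator at pos + 20 with 30 bytes remaining before -EAGAIN is returned, so
a retry would resume after the bytes already written.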


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [BUG] iomap/io_uring: O_APPEND async buffered write silently re-appends a data chunk (corruption) on XFS, 6.1.y/6.12.y
  2026-06-08 16:02   ` Brian Foster
@ 2026-06-08 17:17     ` Eric Hagberg
  2026-06-09 16:20       ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Hagberg @ 2026-06-08 17:17 UTC (permalink / raw)
  To: Brian Foster
  Cc: Gregg Leventhal, hch, djwong, linux-xfs, linux-fsdevel, io-uring,
	Jens Axboe, stable

On Mon, Jun 8, 2026 at 12:03 PM Brian Foster <bfoster@redhat.com> wrote:
> Another idea that came to mind is to try and just replace the -EAGAIN
> return sequence from the low level iterator with a flag that triggers
> -EAGAIN from the next iter advance. The idea here is to allow the write
> to return partial completion (i.e. so no iov_iter revert) without having
> to return an error from the lowest level in the stack. I had claude come
> up with a quick patch [1] for reference/experimentation.
>
> This is based on v6.12 stable and compile tested only. It needs more
> review and testing in general but might be worth throwing your
> reproducer at if you can..?

With that patch applied, the reproducer runs clean - no errors - and
gets roughly the same performance (maybe slightly better) as when run
against a 6.18 kernel on the same VM.

Thanks,
-Eric

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] iomap/io_uring: O_APPEND async buffered write silently re-appends a data chunk (corruption) on XFS, 6.1.y/6.12.y
  2026-06-08 17:17     ` Eric Hagberg
@ 2026-06-09 16:20       ` Brian Foster
  2026-06-09 17:14         ` Gregg Leventhal
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2026-06-09 16:20 UTC (permalink / raw)
  To: Eric Hagberg
  Cc: Gregg Leventhal, hch, djwong, linux-xfs, linux-fsdevel, io-uring,
	Jens Axboe, stable

On Mon, Jun 08, 2026 at 01:17:10PM -0400, Eric Hagberg wrote:
> With that patch applied, the reproducer runs clean - no errors - and
> gets roughly the same performance (maybe slightly better) as when run
> against a 6.18 kernel on the same VM.
> 

Thanks for testing. I'll look into some more regression testing of this
patch and try to clean it up and post it for proper review for stable.

Are you using the reproducer program in your original mail to test? If
so, does it require some concurrent memory pressure to reproduce, and
are you using anything in particular for that?

That test seems small enough that we could potentially include it in
fstests, though I'm still not so sure about the mem pressure part..
Since you guys wrote the test, any interest in porting into fstests? If
not I can look into it.

Brian


* Re: [BUG] iomap/io_uring: O_APPEND async buffered write silently re-appends a data chunk (corruption) on XFS, 6.1.y/6.12.y
  2026-06-09 16:20       ` Brian Foster
@ 2026-06-09 17:14         ` Gregg Leventhal
  2026-06-10 17:34           ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: Gregg Leventhal @ 2026-06-09 17:14 UTC (permalink / raw)
  To: Brian Foster
  Cc: Eric Hagberg, hch, djwong, linux-xfs, linux-fsdevel, io-uring,
	Jens Axboe, stable

I reproduce it by running ~25 concurrent instances of the attached reproducer,
each writing its own file, on an otherwise-idle 15 GB VM:

  DIR=$(mktemp -d /tmp/uring.XXXXXX)
  for i in {1..25}; do
      ./repro_uring_dup "$DIR/file_$i" 120 48 &
  done
  wait
...
*** CORRUPTION DETECTED in /tmp/UmgK/file_17.1 ***
  bytes kernel said it wrote (sum of CQE results): 53621960
  actual file size:                                56218824
  extra (duplicated) bytes:                        2596864
  first mismatching offset: 6791168 (0x67a000)  page_aligned=YES
    expected u64 848896 but found 524288 (content from byte offset 4194304 reappeared here)
  (file kept for inspection)

*** CORRUPTION DETECTED in /tmp/Gznx/file_18.2 ***
  bytes kernel said it wrote (sum of CQE results): 58112616
  actual file size:                                60303976
  extra (duplicated) bytes:                        2191360
  first mismatching offset: 2191360 (0x217000)  page_aligned=YES
    expected u64 273920 but found 0 (content from byte offset 0 reappeared here)
  (file kept for inspection)
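
The numbers in the reports above are consistent with a data pattern in which
every 8-byte word stores its own index (byte offset / 8). That pattern is
inferred from the "expected u64" lines, not taken from the reproducer source,
and the helper names below are hypothetical. A minimal C sketch of such a
check, assuming 4 KiB pages and native-endian words:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed pattern: word i of the file holds the value i.
 * Returns the byte offset of the first word that breaks the pattern,
 * or nwords * 8 if every word matches.
 */
static size_t first_mismatch_offset(const uint64_t *words, size_t nwords)
{
	for (size_t i = 0; i < nwords; i++) {
		if (words[i] != (uint64_t)i)
			return i * sizeof(uint64_t);
	}
	return nwords * sizeof(uint64_t);
}

/* Byte offset at which a stray word was originally meant to be written. */
static uint64_t origin_offset(uint64_t found_word)
{
	return found_word * sizeof(uint64_t);
}

/* Page-alignment test; the 4096-byte page size is an assumption. */
static int is_page_aligned(uint64_t off)
{
	return (off & 4095u) == 0;
}
```

For file_17.1 this gives 6791168 / 8 = 848896 for the expected word and
524288 * 8 = 4194304 for the origin of the stray data. The size difference
56218824 - 53621960 = 2596864 is 634 pages of 4096 bytes.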



* Re: [BUG] iomap/io_uring: O_APPEND async buffered write silently re-appends a data chunk (corruption) on XFS, 6.1.y/6.12.y
  2026-06-09 17:14         ` Gregg Leventhal
@ 2026-06-10 17:34           ` Brian Foster
  0 siblings, 0 replies; 7+ messages in thread
From: Brian Foster @ 2026-06-10 17:34 UTC (permalink / raw)
  To: Gregg Leventhal
  Cc: Eric Hagberg, hch, djwong, linux-xfs, linux-fsdevel, io-uring,
	Jens Axboe, stable

On Tue, Jun 09, 2026 at 01:14:40PM -0400, Gregg Leventhal wrote:
> I reproduce it by running ~25 concurrent instances of the attached reproducer,
> each writing its own file, on an otherwise-idle 15 GB VM:
>
>   DIR=$(mktemp -d /tmp/uring.XXXXXX)
>   for i in {1..25}; do
>       ./repro_uring_dup "$DIR/file_$i" 120 48 &
>   done
>   wait
> ...
> *** CORRUPTION DETECTED in /tmp/UmgK/file_17.1 ***
>   bytes kernel said it wrote (sum of CQE results): 53621960
>   actual file size:                                56218824
>   extra (duplicated) bytes:                        2596864
>   first mismatching offset: 6791168 (0x67a000)  page_aligned=YES
>     expected u64 848896 but found 524288 (content from byte offset 4194304 reappeared here)
>   (file kept for inspection)
>
> *** CORRUPTION DETECTED in /tmp/Gznx/file_18.2 ***
>   bytes kernel said it wrote (sum of CQE results): 58112616
>   actual file size:                                60303976
>   extra (duplicated) bytes:                        2191360
>   first mismatching offset: 2191360 (0x217000)  page_aligned=YES
>     expected u64 273920 but found 0 (content from byte offset 0 reappeared here)
>   (file kept for inspection)
> 

Thanks. I had to bump up the concurrency a bit and then was able to
reproduce.

The patch I sent survived my regression testing but when taking another
look at the upstream patch, I realized something else I had previously
missed. The code in master doesn't actually return -EAGAIN directly
along with partial completion. It just returns the partial completion,
loops again in iomap, and then presumably returns -EAGAIN at that point
which makes its way back to io_uring. I think that is mostly harmless
but technically a bug in the upstream patch as the intent was to be able
to advance the iter, return -EAGAIN, and let the operation unwind from
there.
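
As a rough illustration of why the revert matters with O_APPEND, here is a toy
userspace model in C. It is not kernel code: toy_file, toy_append and the two
write helpers are hypothetical names. The model assumes only that an O_APPEND
write always lands at the current EOF and that a retried request is reissued
against whatever the iterator still describes.

```c
#include <stddef.h>
#include <string.h>

/* Toy in-memory file opened O_APPEND: every write lands at current EOF. */
struct toy_file {
	unsigned char data[64];
	size_t size;
};

/*
 * Append at most `limit` of the `len` bytes in `buf`, modelling a
 * non-blocking write that can only make partial progress.
 * Returns the number of bytes actually copied.
 */
static size_t toy_append(struct toy_file *f, const unsigned char *buf,
			 size_t len, size_t limit)
{
	size_t n = len < limit ? len : limit;

	if (n > sizeof(f->data) - f->size)
		n = sizeof(f->data) - f->size;
	memcpy(f->data + f->size, buf, n);
	f->size += n;
	return n;
}

/*
 * Modelled problem case: the partial progress stays in the file, but the
 * iterator is reverted, so the retry reissues the whole buffer at the new EOF.
 */
static size_t write_revert_and_retry(struct toy_file *f,
				     const unsigned char *buf,
				     size_t len, size_t limit)
{
	(void)toy_append(f, buf, len, limit);	/* progress hidden from caller */
	return toy_append(f, buf, len, len);	/* full retry from byte 0 */
}

/*
 * Modelled partial completion: the bytes already written are reported,
 * so the retry resumes after them and nothing is appended twice.
 */
static size_t write_partial_then_resume(struct toy_file *f,
					const unsigned char *buf,
					size_t len, size_t limit)
{
	size_t done = toy_append(f, buf, len, limit);

	if (done < len)
		done += toy_append(f, buf + done, len - done, len - done);
	return done;
}
```

With an 8-byte buffer and a 3-byte first step, the first helper leaves 11 bytes
in the file (the first 3 bytes appear twice) while reporting 8 written; the
second leaves exactly 8.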

I think this actually leaves at least a couple options here. One is that
we could presumably just do the same thing on stable as current master:
forget the flag and just remove the iov revert and direct -EAGAIN return
at the cost of one more iter before returning to the caller. Another is
to fix up the code in master and use the patch I posted as a customized
stable backport of that.

WRT the latter I suppose we could also just stick with this patch for
stable and I can follow up with a separate patch for the loop thing on
master. Hmm.. I want to think about it a little more so if any iomap
folks have Opinions in the meantime, let me know.

Brian

