* [BUG] iomap/io_uring: O_APPEND async buffered write silently re-appends a data chunk (corruption) on XFS, 6.1.y/6.12.y
@ 2026-06-04 18:46 Gregg Leventhal
  2026-06-05 15:55 ` Brian Foster
From: Gregg Leventhal @ 2026-06-04 18:46 UTC (permalink / raw)
  To: hch, djwong, bfoster, Eric Hagberg
  Cc: linux-xfs, linux-fsdevel, io-uring, Jens Axboe, stable


Hi all,

We're seeing silent data corruption -- a chunk of a buffered write being
silently repeated at a later offset -- when using io_uring async buffered
writes with O_APPEND on XFS. It reproduces on the longterm stable trees
6.1.y and 6.12.y under memory pressure, and is fixed in 6.18.y.

Summary
-------
On XFS, an io_uring async buffered write to a file opened O_APPEND can
silently write a chunk of data twice. This is not a harmless in-place
rewrite: a page-aligned, page-multiple sub-range is re-appended at a later
offset, so the file grows and ends up containing the same run of bytes
twice in sequence (and everything after the duplicate is shifted relative
to what was intended). The CQE reports the full requested byte count
(userspace sees success), the resulting file is larger than the total
bytes the kernel reported writing, and there is no error and no dmesg
warning.

Affected kernels (vanilla stable trees; we run ELRepo builds of these kernels)
  - 6.1.173   (observed as 6.1.173-1.el8.x86_64)
  - 6.12.85   (observed as 6.12.85-1.el8.x86_64)
Filesystem: XFS

Unaffected:
  - 6.18.y

Trigger conditions
------------------
  - O_APPEND specific. With an explicit file offset (no O_APPEND) we do
    not observe the corruption.
  - Only manifests under memory pressure. The reproducer triggers
    reliably when the system is under enough memory/paging pressure and
    does not reproduce on an otherwise idle box.

Root cause (our understanding)
------------------------------
Under memory pressure the inline IOCB_NOWAIT attempt commits a partial,
non-page-aligned amount and returns short at a page boundary. The pre-fix
iomap_write_iter() reverts the iov_iter by the bytes already written and
returns -EAGAIN:

    } while (iov_iter_count(i) && length);

    if (status == -EAGAIN) {
        iov_iter_revert(i, total_written);
        return -EAGAIN;
    }
    return total_written ? total_written : status;

io_uring then reissues the page-aligned remainder on io-wq. Because the
write is O_APPEND, the offset is re-resolved to the current EOF, which
already includes the bytes committed by the inline attempt. As a result, a
page-aligned sub-range is written a second time: it is re-appended at the
new EOF instead of landing where it was originally intended.
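
To make the arithmetic concrete, here is a deliberately simplified userspace
model of that sequence. It is not kernel code and is not part of the attached
reproducer: struct toy_file, toy_append() and toy_append_with_retry() are
hypothetical names, and the parameters `committed` and `resume_from` are
assumptions standing in for "bytes the inline attempt wrote" and "buffer
offset the retry resumes from". It only illustrates why appending at a
re-resolved EOF grows the file by committed - resume_from bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy in-memory "file" that only supports appends (hypothetical model). */
struct toy_file {
  uint8_t data[1 << 16];
  size_t size; /* current EOF */
};

/* Append len bytes at the *current* EOF, as an O_APPEND write would. */
static void toy_append(struct toy_file *f, const uint8_t *src, size_t len) {
  assert(f->size + len <= sizeof f->data);
  memcpy(f->data + f->size, src, len);
  f->size += len;
}

/*
 * Model one request of len bytes:
 *   - the inline attempt commits the first `committed` bytes, then gives up;
 *   - the retry resumes from buffer offset `resume_from` and appends the
 *     rest at whatever EOF is by then.
 * Returns how many bytes the file grew beyond len, i.e. the duplicated bytes.
 */
static size_t toy_append_with_retry(struct toy_file *f, const uint8_t *buf,
                                    size_t len, size_t committed,
                                    size_t resume_from) {
  assert(resume_from <= committed && committed <= len);
  size_t before = f->size;
  toy_append(f, buf, committed);                       /* inline attempt */
  toy_append(f, buf + resume_from, len - resume_from); /* retry at new EOF */
  return (f->size - before) - len;
}
```

In this model a retry that resumes exactly where the inline attempt stopped
(resume_from == committed) duplicates nothing, while any retry that resumes
earlier re-appends buf[resume_from, committed) and leaves the caller believing
only len bytes were written.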

What fixes it
-------------
We did not bisect. We identified Brian Foster's "iomap: incremental
per-operation iter advance" series as the likely relevant change,
backported it to the affected kernel, and confirmed it makes the
reproducer pass. The series was merged for v6.15:


https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.18.y&id=30f530096166202cf70e1b7d1de5a8cdfba42af1

It reworks iomap_write_iter() to advance iter->pos/iter->len incrementally
(iomap_iter_advance) and removes the iov_iter_revert/-EAGAIN handling, so
retries resume from the correct offset. The buffered-write change is in
"iomap: advance the iter directly on buffered writes" (d9dc477ff6a2), but
it depends on the earlier infrastructure patches in the same series.

Detection in the reproducer (both silent)
-----------------------------------------
  1) final file size > sum of CQE byte counts the kernel reported.
  2) the file is filled with a u64 "byte offset / 8" pattern, so on
     readback element j must equal j; the first mismatch marks the start
     of the duplicated copy (observed to be page-aligned).
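
The two signals can also be cross-checked against each other. The sketch
below is not taken from the attached reproducer; first_mismatch() and
shift_bytes() are hypothetical helper names. It assumes the failure is a
single re-appended range, in which case the u64 found at the first mismatch
names the offset the content originally came from, and the distance between
the two should equal the size excess from check 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the first element where vals[j] != j, or -1 if the pattern is
 * intact over all n elements. */
static long long first_mismatch(const uint64_t *vals, size_t n) {
  for (size_t j = 0; j < n; j++)
    if (vals[j] != (uint64_t)j)
      return (long long)j;
  return -1;
}

/*
 * Bytes by which the content at mismatch index j is shifted forward, i.e.
 * (offset where it was found) - (offset it was generated for).
 * Returns 0 if the found value is not from an earlier offset, which would
 * not fit the simple "one range re-appended" assumption.
 */
static uint64_t shift_bytes(const uint64_t *vals, size_t n, long long j) {
  assert(j >= 0 && (size_t)j < n);
  if (vals[j] >= (uint64_t)j)
    return 0;
  return ((uint64_t)j - vals[j]) * 8;
}
```

For example, a file that should hold elements 0..15 but actually holds
0..15 followed by 8..15 has its first mismatch at element 16 (byte 128), a
shift of 64 bytes, and is 64 bytes larger than the reported write total. If
the shift and the size excess disagree, more than one duplication (or some
other damage) is likely involved.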

Reproducer
----------
Build: gcc -O2 -o repro_uring_dup repro_uring_dup.c -luring
Run:   ./repro_uring_dup /path/on/xfs/repro [seconds] [file_target_mb]
Needs the system under memory pressure to trigger; under those conditions
it reproduces reliably. Source attached (repro_uring_dup.c).
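
This report does not say how the memory pressure was generated. One common
approach, shown here purely as an assumption and not as the method used for
the results above, is a separate process that allocates a large region and
touches every page so the kernel has to back it. hog_alloc() is a
hypothetical helper; the size to request depends on the machine and is left
to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate `bytes` and write one byte into every page so that each page is
 * actually backed by memory rather than just reserved.
 * Stores the number of pages touched in *pages_touched.
 * Returns the allocation (caller frees it), or NULL if malloc failed.
 */
static uint8_t *hog_alloc(size_t bytes, size_t page_size,
                          size_t *pages_touched) {
  assert(page_size > 0 && pages_touched != NULL);
  uint8_t *mem = malloc(bytes);
  if (!mem)
    return NULL;
  volatile uint8_t *p = mem; /* volatile: keep the stores from being elided */
  size_t n = 0;
  for (size_t off = 0; off < bytes; off += page_size) {
    p[off] = (uint8_t)(off / page_size);
    n++;
  }
  *pages_touched = n;
  return mem;
}
```

A wrapper would call hog_alloc() with a size close to the machine's free
memory and then keep the process alive (for instance by sleeping) while
repro_uring_dup runs in another terminal. Whether anonymous-memory pressure
of this kind is sufficient to trigger the bug on a given system is not
something this report establishes.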

Notes on stable
---------------
The fix is a refactor with no Fixes: tag, and the buffered-write commit
builds on the preceding patches in the series, so a single-commit
cherry-pick into 6.1.y / 6.12.y doesn't look feasible. We're wondering
whether a smaller, targeted fix would be more backportable for the active
LTS trees -- e.g. ensuring the -EAGAIN retry path keeps the append
position consistent with the reverted iov_iter so the already-committed
range isn't re-appended -- but we'd defer to your judgment on whether that
is sound or whether backporting the series as a unit is the better path.
Given this is silent data corruption present since io_uring async buffered
write support (~v6.0), we'd appreciate guidance on the right approach.

Happy to test patches and provide any additional detail.

Regards,
Gregg Leventhal <gleventhal@janestreet.com> and
Eric Hagberg <ehagberg@janestreet.com>

[-- Attachment #2: repro_uring_dup.c --]
[-- Type: text/x-csrc, Size: 6588 bytes --]

/*
 * repro_uring_dup.c
 *
 * Reproducer for io_uring async buffered-write duplication on XFS.
 * Issues large, variable-size, non-page-aligned buffered writev's appended
 * to a file via io_uring with offset -1 ("use current position").
 *
 * Bug: when the inline IOCB_NOWAIT attempt does a partial-page short write
 * (landing on a page boundary) and the page-aligned remainder is reissued on
 * io-wq, a page-aligned, page-multiple sub-range of the remainder is written
 * TWICE, while the CQE still reports the full requested byte count. Result:
 * the file is larger than the bytes we were told succeeded, with a page-aligned
 * duplicated chunk.
 *
 * Detection (both silent - no error is ever returned):
 *   1) final file size > total bytes the kernel told us it wrote.
 *   2) file is filled with a u64 "byte offset / 8" pattern, so on readback
 *      element j must equal j; the first j where it doesn't is the start of the
 *      duplicated copy (expected to be page-aligned).
 *
 * Build:  gcc -O2 -o repro_uring_dup repro_uring_dup.c -luring
 * Run:    ./repro_uring_dup /path/on/xfs/repro [seconds] [file_target_mb]
 */
#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/uio.h>

#define QD 8
#define MB (1024UL * 1024UL)
#define MAXCHUNK (24UL * MB)
#define MINCHUNK (1UL * MB)

/* 1: O_APPEND / offset -1 variant (corrupts).
 * 0: no O_APPEND, explicit offset variant (does not corrupt). */
static int use_append = 1;

static uint64_t now_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Fill buf so the u64 at global byte offset (base+8*i) holds (base+8*i)/8. */
static void fill_pattern(uint64_t *buf, uint64_t base_bytes, size_t len) {
  uint64_t start_idx = base_bytes / 8;
  size_t n = len / 8;
  for (size_t i = 0; i < n; i++)
    buf[i] = start_idx + i;
}

/* Write len bytes, one writev SQE at a time; loops over (legitimately)
 * short *returned* results until everything has been submitted. */
static void write_all(struct io_uring *ring, int fd, uint8_t *buf, size_t len,
                      uint64_t expected) {
  size_t done = 0;
  while (done < len) {
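    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe) { /* NULL means the submission queue is full */
      fprintf(stderr, "get_sqe: submission queue full\n");
      exit(1);
    }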
    struct iovec iov = {.iov_base = buf + done, .iov_len = len - done};
    long long off = use_append ? -1LL : (long long)(expected + done);
    io_uring_prep_writev(sqe, fd, &iov, 1, (unsigned long long)off);

    int ret = io_uring_submit(ring);
    if (ret < 0) {
      fprintf(stderr, "submit: %s\n", strerror(-ret));
      exit(1);
    }

    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(ring, &cqe);
    if (ret < 0) {
      fprintf(stderr, "wait_cqe: %s\n", strerror(-ret));
      exit(1);
    }
    int res = cqe->res;
    io_uring_cqe_seen(ring, cqe);

    if (res < 0) {
      fprintf(stderr, "write: %s\n", strerror(-res));
      exit(1);
    }
    if (res == 0) {
      fprintf(stderr, "write returned 0\n");
      exit(1);
    }
    done += (size_t)res;
  }
}

int main(int argc, char **argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s <path-prefix-on-xfs> [seconds] [file_target_mb]\n",
            argv[0]);
    return 2;
  }
  const char *prefix = argv[1];
  int seconds = (argc > 2) ? atoi(argv[2]) : 60;
  uint64_t file_target = ((argc > 3) ? (uint64_t)atoll(argv[3]) : 48) * MB;

  srand((unsigned)(time(NULL) ^ getpid()));

  struct io_uring ring;
  if (io_uring_queue_init(QD, &ring, 0)) {
    perror("io_uring_queue_init");
    return 1;
  }

  uint8_t *buf = aligned_alloc(4096, MAXCHUNK);
  if (!buf) {
    perror("aligned_alloc");
    return 1;
  }

  static uint64_t rbuf[1 << 16];
  uint64_t deadline = now_ns() + (uint64_t)seconds * 1000000000ULL;
  long files = 0;

  while (now_ns() < deadline) {
    char fn[8192];
    snprintf(fn, sizeof fn, "%s.%ld", prefix, files);
    int flags = O_WRONLY | O_CREAT | O_TRUNC | (use_append ? O_APPEND : 0);
    int fd = open(fn, flags, 0644);
    if (fd < 0) {
      perror("open");
      return 1;
    }

    uint64_t expected = 0;
    while (expected < file_target) {
      size_t want = MINCHUNK + ((size_t)rand() % (MAXCHUNK - MINCHUNK));
      want &= ~((size_t)7); /* 8-align; deliberately NOT page-aligned */
      fill_pattern((uint64_t *)buf, expected, want);
      write_all(&ring, fd, buf, want, expected);
      expected += want; /* CQE reported full success */
    }
    close(fd);

    /* ---- verify ---- */
    struct stat st;
    if (stat(fn, &st)) {
      perror("stat");
      return 1;
    }

    long long first_bad = -1;
    uint64_t bad_val = 0;
    int rfd = open(fn, O_RDONLY);
    if (rfd < 0) {
      perror("open ro");
      return 1;
    }
    uint64_t idx = 0;
    ssize_t r;
    while ((r = read(rfd, rbuf, sizeof rbuf)) > 0) {
      size_t cnt = (size_t)r / 8;
      for (size_t i = 0; i < cnt; i++) {
        if (rbuf[i] != idx) {
          first_bad = (long long)(idx * 8);
          bad_val = rbuf[i];
          break;
        }
        idx++;
      }
      if (first_bad >= 0)
        break;
    }
    if (r < 0) { /* a failed read must not be mistaken for a clean file */
      perror("read");
      return 1;
    }
    close(rfd);

    int bug = ((uint64_t)st.st_size != expected) || (first_bad >= 0);
    files++;

    if (bug) {
      printf("\n*** CORRUPTION DETECTED in %s ***\n", fn);
      printf("  bytes kernel said it wrote (sum of CQE results): %llu\n",
             (unsigned long long)expected);
      printf("  actual file size:                                %llu\n",
             (unsigned long long)st.st_size);
      printf("  extra (duplicated) bytes:                        %lld\n",
             (long long)st.st_size - (long long)expected);
      if (first_bad >= 0) {
        printf("  first mismatching offset: %lld (0x%llx)  page_aligned=%s\n", first_bad,
               (unsigned long long)first_bad, (first_bad % 4096 == 0) ? "YES" : "no");
        printf("    expected u64 %llu but found %llu "
               "(content from byte offset %llu reappeared here)\n",
               (unsigned long long)(first_bad / 8), (unsigned long long)bad_val,
               (unsigned long long)(bad_val * 8));
      }
      printf("  (file kept for inspection)\n");
      io_uring_queue_exit(&ring);
      return 0;
    }
    unlink(fn);
    if (files % 20 == 0)
      fprintf(stderr, "...%ld files clean\n", files);
  }

  printf("No corruption in %d s (%ld files). Try more time, parallel instances, "
         "or memory pressure.\n",
         seconds, files);
  io_uring_queue_exit(&ring);
  free(buf);
  return 0;
}
