* [NYE DELUGE 2/4] xfs: online repair in its entirety
@ 2022-12-30 21:14 Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
` (2 more replies)
0 siblings, 3 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 21:14 UTC (permalink / raw)
To: Dave Chinner, Allison Henderson, Chandan Babu R, Catherine Hoang,
djwong
Cc: xfs, greg.marsden, shirley.ma, konrad.wilk, linux-fsdevel,
Matthew Wilcox, tpkelly, smahar, Christoph Hellwig, fstests,
Zorro Lang, Carlos Maiolino
Hi everyone,
As I've mentioned several times throughout 2022, I would like to merge
the online fsck feature in time for the 2023 LTS kernel. This is the
second part of that effort.
This deluge contains all of the online repair kernel code, a significant
amount of restructuring of how repairs work in the userspace driver
program, and a ton of fstests updates to provide automated fuzz testing
and stress testing of forced repairs.
Within the kernel section, the major pieces are the use of tmpfs files
to provide pageable kernel memory for staging repair information;
lightweight hooks into the main xfs filesystem for scrub via jump
labels; coordinated inode scans for live index construction; and the
atomic file mapping swap feature.
Changes to the userspace driver program fall into three main categories:
restructuring how repairs are scheduled so that they're tracked by inode
or AG; establishing data dependency chains so that we scan and repair
things in the correct order; and reworking the systemd background
services to be more secure, enable periodic media scans, and provide
some semblance of fs corruption reporting.
The fstests changes comprise a substantial reworking of the fuzzing code
to fit the testing described in the design documentation; new stress
testing of online repairs vs. fsstress; and functional tests for all the
new features that ride in with online repair.
For this review, I would like people to focus on the following:
- Are the major subsystems sufficiently documented that you could figure
out what the code does?
- Do you see any problems that are severe enough to cause long term
support hassles? (e.g. bad API design, writing weird metadata to disk)
- Can you spot mis-interactions between the subsystems?
- What were my blind spots in devising this feature?
- Are there missing pieces that you'd like to help build?
- Can I just merge all of this?
The one thing that is /not/ in scope for this review is requests for
more refactoring of existing subsystems. While there are usually valid
arguments for performing such cleanups, those are separate tasks to be
prioritized separately. I will get to them after merging online fsck,
because revising existing subsystems generally involves rebasing work
in this patchset, which means the affected patches need re-reviewing.
Unless it's absolutely necessary, this just creates more work for
everybody.
I've been running daily online repairs on every computer I own for the
last eight months. All modifications so far have been optimizations of
data structures (compacting holes in the xattr structures, rebuilding
excessively large rmap btrees) and fixes for bugs in quota resource
counter updates. So far, no damage has resulted from these operations.
All issues observed in that time have been corrected in this submission.
Fuzz and stress testing of online repairs have been running well for a
year now. As of this writing, online repair can fix slightly more
things than offline repair, and the fsstress+repair long soak test has
passed 100 million repairs with zero problems observed.
(For comparison, the long soak fsx test recently passed 92 billion file
operations, so online fsck has a ways to go...)
As a warning, the patches will likely take several days to trickle in.
While everyone else looks at this, I plan to prototype directory tree
reconstruction with Allison's parent pointers v27 patchset. Having a
user of that functionality is (I think) the last major hurdle to
ensuring that parent pointers are a good fit for the problems that need
solving, which in turn is the last requirement for merging that feature.
--D
* [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory
2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
2022-12-30 22:12 ` [PATCH 2/7] xfs: enable sorting of xfile-backed arrays Darrick J. Wong
` (6 more replies)
2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
2 siblings, 7 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
Hi all,
In general, online repair of an indexed record set walks the filesystem
looking for records. These records are sorted and bulk-loaded into a
new btree. To make this happen without pinning gigabytes of metadata in
memory, first create an abstraction ('xfile') of memfd files so that
kernel code can access paged memory, and then an array abstraction
('xfarray') based on xfiles so that online repair can create an array of
new records without pinning memory.
These two data storage abstractions are critical for repair of space
metadata -- the memory used is pageable, which helps us avoid pinning
kernel memory and driving OOM problems; and they are byte-accessible
enough that we can use them like (very slow and programmatic) memory
buffers.
Later patchsets will build on this functionality to provide blob storage
and btrees.
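To give a taste of the API before you dive into the patches, here's a
minimal sketch of how a repair function might stage, sort, and replay
records; the record type, key field, and helper names below are made up
for illustration:

	/* Hypothetical staging record; real repair code defines its own. */
	struct xrep_fakerec {
		uint64_t		key;
		uint64_t		data;
	};

	static int
	xrep_fakerec_cmp(const void *a, const void *b)
	{
		const struct xrep_fakerec	*ra = a;
		const struct xrep_fakerec	*rb = b;

		if (ra->key < rb->key)
			return -1;
		if (ra->key > rb->key)
			return 1;
		return 0;
	}

	static int
	xrep_stage_fakerecs(struct xfs_mount *mp)
	{
		struct xrep_fakerec	rec = { .key = 42, .data = 1 };
		struct xfarray		*array;
		xfarray_idx_t		cur = XFARRAY_CURSOR_INIT;
		int			error;

		error = xfarray_create(mp, "fake records", 0,
				sizeof(struct xrep_fakerec), &array);
		if (error)
			return error;

		/* Append records in whatever order the scan finds them. */
		error = xfarray_append(array, &rec);
		if (error)
			goto out;

		/* Sort just before bulk loading; allow fatal signals. */
		error = xfarray_sort(array, xrep_fakerec_cmp,
				XFARRAY_SORT_KILLABLE);
		if (error)
			goto out;

		/* Walk the sorted records, e.g. to feed a btree bulk load. */
		while ((error = xfarray_load_next(array, &cur, &rec)) == 0) {
			/* ...consume rec here... */
		}
		if (error == -ENODATA)
			error = 0;
	out:
		xfarray_destroy(array);
		return error;
	}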
If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.
This is an extraordinary way to destroy everything. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array
---
fs/xfs/Kconfig | 1
fs/xfs/Makefile | 2
fs/xfs/scrub/trace.c | 4
fs/xfs/scrub/trace.h | 262 ++++++++++++
fs/xfs/scrub/xfarray.c | 1084 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/xfarray.h | 142 ++++++
fs/xfs/scrub/xfile.c | 426 +++++++++++++++++++
fs/xfs/scrub/xfile.h | 78 +++
8 files changed, 1998 insertions(+), 1 deletion(-)
create mode 100644 fs/xfs/scrub/xfarray.c
create mode 100644 fs/xfs/scrub/xfarray.h
create mode 100644 fs/xfs/scrub/xfile.c
create mode 100644 fs/xfs/scrub/xfile.h
* [PATCH 2/7] xfs: enable sorting of xfile-backed arrays
2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
2022-12-30 22:12 ` [PATCH 6/7] xfs: cache pages used for xfarray quicksort convergence Darrick J. Wong
` (5 subsequent siblings)
6 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
The btree bulk loading code requires that records be provided in the
correct record sort order for the given btree type. In general, repair
code cannot be required to collect records in order, and it is not
feasible to insert new records in the middle of an array to maintain
sort order.
Implement a sorting algorithm so that we can sort the records just prior
to bulk loading. In principle, an xfarray could consume many gigabytes
of memory and its backing pages can be sent out to disk at any time.
This means that we cannot map the entire array into memory at once, so
we must find a way to divide the work into smaller portions (e.g. a
page) that /can/ be mapped into memory.
Quicksort seems like a reasonable fit for this purpose, since it uses a
divide and conquer strategy to keep its average runtime logarithmic.
The solution presented here is a port of the glibc implementation, which
itself is derived from the median-of-three and tail call recursion
strategies outlined by Sedgewick.
Subsequent patches will optimize the implementation further by utilizing
the kernel's heapsort on directly-mapped memory whenever possible, and
improving the quicksort pivot selection algorithm to try to avoid O(n^2)
collapses.
Note: The sorting functionality gets its own patch because the basic big
array mechanisms were plenty for a single code patch.
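To make the stack machine concrete, here is a minimal sketch of the
explicit-stack quicksort shape on a plain in-memory int array (Lomuto
partitioning for brevity; the patch itself scans from both ends and
uses median-of-three pivots):

	static void
	quicksort_iterative(int *a, long nr)
	{
		/* log2(nr) + 1 frames suffice if we always descend into
		 * the smaller partition; 64 covers any 64-bit count. */
		long	lo_stack[64], hi_stack[64];
		int	depth = 0;

		if (nr < 2)
			return;
		lo_stack[0] = 0;
		hi_stack[0] = nr - 1;

		while (depth >= 0) {
			long	lo = lo_stack[depth];
			long	hi = hi_stack[depth];
			long	i = lo, j;
			int	pivot, t;

			/* Nothing left in this partition; pop the stack. */
			if (lo >= hi) {
				depth--;
				continue;
			}

			/* Lomuto partition around a[hi]. */
			pivot = a[hi];
			for (j = lo; j < hi; j++) {
				if (a[j] < pivot) {
					t = a[i]; a[i] = a[j]; a[j] = t;
					i++;
				}
			}
			t = a[i]; a[i] = a[hi]; a[hi] = t;

			/* Reuse this frame for the larger sub-partition and
			 * push the smaller one on top so that it is sorted
			 * first, bounding the stack at log2(nr) frames. */
			if (i - lo < hi - i) {
				lo_stack[depth] = i + 1;
				depth++;
				lo_stack[depth] = lo;
				hi_stack[depth] = i - 1;
			} else {
				hi_stack[depth] = i - 1;
				depth++;
				lo_stack[depth] = i + 1;
				hi_stack[depth] = hi;
			}
		}
	}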
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/trace.h | 114 ++++++++++
fs/xfs/scrub/xfarray.c | 569 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/xfarray.h | 67 ++++++
3 files changed, 750 insertions(+)
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 84edfa7556ac..02f5f547c563 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -18,6 +18,7 @@
struct xfile;
struct xfarray;
+struct xfarray_sortinfo;
/*
* ftrace's __print_symbolic requires that all enum values be wrapped in the
@@ -849,6 +850,119 @@ TRACE_EVENT(xfarray_create,
__entry->obj_size_log)
);
+TRACE_EVENT(xfarray_isort,
+ TP_PROTO(struct xfarray_sortinfo *si, uint64_t lo, uint64_t hi),
+ TP_ARGS(si, lo, hi),
+ TP_STRUCT__entry(
+ __field(unsigned long, ino)
+ __field(unsigned long long, lo)
+ __field(unsigned long long, hi)
+ ),
+ TP_fast_assign(
+ __entry->ino = file_inode(si->array->xfile->file)->i_ino;
+ __entry->lo = lo;
+ __entry->hi = hi;
+ ),
+ TP_printk("xfino 0x%lx lo %llu hi %llu elts %llu",
+ __entry->ino,
+ __entry->lo,
+ __entry->hi,
+ __entry->hi - __entry->lo)
+);
+
+TRACE_EVENT(xfarray_qsort,
+ TP_PROTO(struct xfarray_sortinfo *si, uint64_t lo, uint64_t hi),
+ TP_ARGS(si, lo, hi),
+ TP_STRUCT__entry(
+ __field(unsigned long, ino)
+ __field(unsigned long long, lo)
+ __field(unsigned long long, hi)
+ __field(int, stack_depth)
+ __field(int, max_stack_depth)
+ ),
+ TP_fast_assign(
+ __entry->ino = file_inode(si->array->xfile->file)->i_ino;
+ __entry->lo = lo;
+ __entry->hi = hi;
+ __entry->stack_depth = si->stack_depth;
+ __entry->max_stack_depth = si->max_stack_depth;
+ ),
+ TP_printk("xfino 0x%lx lo %llu hi %llu elts %llu stack %d/%d",
+ __entry->ino,
+ __entry->lo,
+ __entry->hi,
+ __entry->hi - __entry->lo,
+ __entry->stack_depth,
+ __entry->max_stack_depth)
+);
+
+TRACE_EVENT(xfarray_sort,
+ TP_PROTO(struct xfarray_sortinfo *si, size_t bytes),
+ TP_ARGS(si, bytes),
+ TP_STRUCT__entry(
+ __field(unsigned long, ino)
+ __field(unsigned long long, nr)
+ __field(size_t, obj_size)
+ __field(size_t, bytes)
+ __field(unsigned int, max_stack_depth)
+ ),
+ TP_fast_assign(
+ __entry->nr = si->array->nr;
+ __entry->obj_size = si->array->obj_size;
+ __entry->ino = file_inode(si->array->xfile->file)->i_ino;
+ __entry->bytes = bytes;
+ __entry->max_stack_depth = si->max_stack_depth;
+ ),
+ TP_printk("xfino 0x%lx nr %llu objsz %zu stack %u bytes %zu",
+ __entry->ino,
+ __entry->nr,
+ __entry->obj_size,
+ __entry->max_stack_depth,
+ __entry->bytes)
+);
+
+TRACE_EVENT(xfarray_sort_stats,
+ TP_PROTO(struct xfarray_sortinfo *si, int error),
+ TP_ARGS(si, error),
+ TP_STRUCT__entry(
+ __field(unsigned long, ino)
+#ifdef DEBUG
+ __field(unsigned long long, loads)
+ __field(unsigned long long, stores)
+ __field(unsigned long long, compares)
+#endif
+ __field(unsigned int, max_stack_depth)
+ __field(unsigned int, max_stack_used)
+ __field(int, error)
+ ),
+ TP_fast_assign(
+ __entry->ino = file_inode(si->array->xfile->file)->i_ino;
+#ifdef DEBUG
+ __entry->loads = si->loads;
+ __entry->stores = si->stores;
+ __entry->compares = si->compares;
+#endif
+ __entry->max_stack_depth = si->max_stack_depth;
+ __entry->max_stack_used = si->max_stack_used;
+ __entry->error = error;
+ ),
+ TP_printk(
+#ifdef DEBUG
+ "xfino 0x%lx loads %llu stores %llu compares %llu stack_depth %u/%u error %d",
+#else
+ "xfino 0x%lx stack_depth %u/%u error %d",
+#endif
+ __entry->ino,
+#ifdef DEBUG
+ __entry->loads,
+ __entry->stores,
+ __entry->compares,
+#endif
+ __entry->max_stack_used,
+ __entry->max_stack_depth,
+ __entry->error)
+);
+
/* repair tracepoints */
#if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)
diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
index 8fdd7dd40193..2cd3a2f42e19 100644
--- a/fs/xfs/scrub/xfarray.c
+++ b/fs/xfs/scrub/xfarray.c
@@ -368,3 +368,572 @@ xfarray_load_next(
*idx = cur;
return 0;
}
+
+/* Sorting functions */
+
+#ifdef DEBUG
+# define xfarray_sort_bump_loads(si) do { (si)->loads++; } while (0)
+# define xfarray_sort_bump_stores(si) do { (si)->stores++; } while (0)
+# define xfarray_sort_bump_compares(si) do { (si)->compares++; } while (0)
+#else
+# define xfarray_sort_bump_loads(si)
+# define xfarray_sort_bump_stores(si)
+# define xfarray_sort_bump_compares(si)
+#endif /* DEBUG */
+
+/* Load an array element for sorting. */
+static inline int
+xfarray_sort_load(
+ struct xfarray_sortinfo *si,
+ xfarray_idx_t idx,
+ void *ptr)
+{
+ xfarray_sort_bump_loads(si);
+ return xfarray_load(si->array, idx, ptr);
+}
+
+/* Store an array element for sorting. */
+static inline int
+xfarray_sort_store(
+ struct xfarray_sortinfo *si,
+ xfarray_idx_t idx,
+ void *ptr)
+{
+ xfarray_sort_bump_stores(si);
+ return xfarray_store(si->array, idx, ptr);
+}
+
+/* Compare an array element for sorting. */
+static inline int
+xfarray_sort_cmp(
+ struct xfarray_sortinfo *si,
+ const void *a,
+ const void *b)
+{
+ xfarray_sort_bump_compares(si);
+ return si->cmp_fn(a, b);
+}
+
+/* Return a pointer to the low index stack for quicksort partitioning. */
+static inline xfarray_idx_t *xfarray_sortinfo_lo(struct xfarray_sortinfo *si)
+{
+ return (xfarray_idx_t *)(si + 1);
+}
+
+/* Return a pointer to the high index stack for quicksort partitioning. */
+static inline xfarray_idx_t *xfarray_sortinfo_hi(struct xfarray_sortinfo *si)
+{
+ return xfarray_sortinfo_lo(si) + si->max_stack_depth;
+}
+
+/* Allocate memory to handle the sort. */
+static inline int
+xfarray_sortinfo_alloc(
+ struct xfarray *array,
+ xfarray_cmp_fn cmp_fn,
+ unsigned int flags,
+ struct xfarray_sortinfo **infop)
+{
+ struct xfarray_sortinfo *si;
+ size_t nr_bytes = sizeof(struct xfarray_sortinfo);
+ int max_stack_depth;
+
+ /*
+ * Tail-call recursion during the partitioning phase means that
+ * quicksort will never recurse more than log2(nr) times. We need one
+ * extra level of stack to hold the initial parameters.
+ */
+ max_stack_depth = ilog2(array->nr) + 1;
+
+ /* Each level of quicksort uses a lo and a hi index */
+ nr_bytes += max_stack_depth * sizeof(xfarray_idx_t) * 2;
+
+ /* One record for the pivot */
+ nr_bytes += array->obj_size;
+
+ si = kvzalloc(nr_bytes, XCHK_GFP_FLAGS);
+ if (!si)
+ return -ENOMEM;
+
+ si->array = array;
+ si->cmp_fn = cmp_fn;
+ si->flags = flags;
+ si->max_stack_depth = max_stack_depth;
+ si->max_stack_used = 1;
+
+ xfarray_sortinfo_lo(si)[0] = 0;
+ xfarray_sortinfo_hi(si)[0] = array->nr - 1;
+
+ trace_xfarray_sort(si, nr_bytes);
+ *infop = si;
+ return 0;
+}
+
+/* Should this sort be terminated by a fatal signal? */
+static inline bool
+xfarray_sort_terminated(
+ struct xfarray_sortinfo *si,
+ int *error)
+{
+ /*
+ * If preemption is disabled, we need to yield to the scheduler every
+ * few seconds so that we don't run afoul of the soft lockup watchdog
+ * or RCU stall detector.
+ */
+ cond_resched();
+
+ if ((si->flags & XFARRAY_SORT_KILLABLE) &&
+ fatal_signal_pending(current)) {
+ if (*error == 0)
+ *error = -EINTR;
+ return true;
+ }
+ return false;
+}
+
+/* Do we want an insertion sort? */
+static inline bool
+xfarray_want_isort(
+ struct xfarray_sortinfo *si,
+ xfarray_idx_t start,
+ xfarray_idx_t end)
+{
+ /*
+ * For array subsets smaller than 8 elements, it's slightly faster to
+ * use insertion sort than quicksort's stack machine.
+ */
+ return (end - start) < 8;
+}
+
+/* Return the scratch space within the sortinfo structure. */
+static inline void *xfarray_sortinfo_isort_scratch(struct xfarray_sortinfo *si)
+{
+ return xfarray_sortinfo_hi(si) + si->max_stack_depth;
+}
+
+/*
+ * Perform an insertion sort on a subset of the array.
+ * Though insertion sort is an O(n^2) algorithm, for small set sizes it's
+ * faster than quicksort's stack machine, so we let it take over for
+ * small subsets. This ought to be replaced with something more efficient.
+ */
+STATIC int
+xfarray_isort(
+ struct xfarray_sortinfo *si,
+ xfarray_idx_t lo,
+ xfarray_idx_t hi)
+{
+ void *a = xfarray_sortinfo_isort_scratch(si);
+ void *b = xfarray_scratch(si->array);
+ xfarray_idx_t tmp;
+ xfarray_idx_t i;
+ xfarray_idx_t run;
+ int error;
+
+ trace_xfarray_isort(si, lo, hi);
+
+ /*
+ * Move the smallest element in a[lo..hi] to a[lo]. This
+ * simplifies the loop control logic below.
+ */
+ tmp = lo;
+ error = xfarray_sort_load(si, tmp, b);
+ if (error)
+ return error;
+ for (run = lo + 1; run <= hi; run++) {
+ /* if a[run] < a[tmp], tmp = run */
+ error = xfarray_sort_load(si, run, a);
+ if (error)
+ return error;
+ if (xfarray_sort_cmp(si, a, b) < 0) {
+ tmp = run;
+ memcpy(b, a, si->array->obj_size);
+ }
+
+ if (xfarray_sort_terminated(si, &error))
+ return error;
+ }
+
+ /*
+ * The smallest element is a[tmp]; swap with a[lo] if tmp != lo.
+ * Recall that a[tmp] is already in *b.
+ */
+ if (tmp != lo) {
+ error = xfarray_sort_load(si, lo, a);
+ if (error)
+ return error;
+ error = xfarray_sort_store(si, tmp, a);
+ if (error)
+ return error;
+ error = xfarray_sort_store(si, lo, b);
+ if (error)
+ return error;
+ }
+
+ /*
+ * Perform an insertion sort on a[lo+1..hi]. We already made sure
+ * that the smallest value in the original range is now in a[lo],
+ * so the inner loop should never underflow.
+ *
+ * For each a[lo+2..hi], make sure it's in the correct position
+ * with respect to the elements that came before it.
+ */
+ for (run = lo + 2; run <= hi; run++) {
+ error = xfarray_sort_load(si, run, a);
+ if (error)
+ return error;
+
+ /*
+ * Find the correct place for a[run] by walking leftwards
+ * towards the start of the range until a[tmp] is no longer
+ * greater than a[run].
+ */
+ tmp = run - 1;
+ error = xfarray_sort_load(si, tmp, b);
+ if (error)
+ return error;
+ while (xfarray_sort_cmp(si, a, b) < 0) {
+ tmp--;
+ error = xfarray_sort_load(si, tmp, b);
+ if (error)
+ return error;
+
+ if (xfarray_sort_terminated(si, &error))
+ return error;
+ }
+ tmp++;
+
+ /*
+ * If tmp != run, then a[tmp..run-1] are all less than a[run],
+ * so right barrel roll a[tmp..run] to get this range in
+ * sorted order.
+ */
+ if (tmp == run)
+ continue;
+
+ for (i = run; i >= tmp; i--) {
+ error = xfarray_sort_load(si, i - 1, b);
+ if (error)
+ return error;
+ error = xfarray_sort_store(si, i, b);
+ if (error)
+ return error;
+
+ if (xfarray_sort_terminated(si, &error))
+ return error;
+ }
+ error = xfarray_sort_store(si, tmp, a);
+ if (error)
+ return error;
+
+ if (xfarray_sort_terminated(si, &error))
+ return error;
+ }
+
+ return 0;
+}
+
+/* Return a pointer to the xfarray pivot record within the sortinfo struct. */
+static inline void *xfarray_sortinfo_pivot(struct xfarray_sortinfo *si)
+{
+ return xfarray_sortinfo_hi(si) + si->max_stack_depth;
+}
+
+/*
+ * Find a pivot value for quicksort partitioning, swap it with a[lo], and save
+ * the cached pivot record for the next step.
+ *
+ * Select the median value from a[lo], a[mid], and a[hi]. Put the median in
+ * a[lo], the lowest in a[mid], and the highest in a[hi]. Using the median of
+ * the three reduces the chances that we pick the worst case pivot value, since
+ * it's likely that our array values are nearly sorted.
+ */
+STATIC int
+xfarray_qsort_pivot(
+ struct xfarray_sortinfo *si,
+ xfarray_idx_t lo,
+ xfarray_idx_t hi)
+{
+ void *a = xfarray_sortinfo_pivot(si);
+ void *b = xfarray_scratch(si->array);
+ xfarray_idx_t mid = lo + ((hi - lo) / 2);
+ int error;
+
+ /* if a[mid] < a[lo], swap a[mid] and a[lo]. */
+ error = xfarray_sort_load(si, mid, a);
+ if (error)
+ return error;
+ error = xfarray_sort_load(si, lo, b);
+ if (error)
+ return error;
+ if (xfarray_sort_cmp(si, a, b) < 0) {
+ error = xfarray_sort_store(si, lo, a);
+ if (error)
+ return error;
+ error = xfarray_sort_store(si, mid, b);
+ if (error)
+ return error;
+ }
+
+ /* if a[hi] < a[mid], swap a[mid] and a[hi]. */
+ error = xfarray_sort_load(si, hi, a);
+ if (error)
+ return error;
+ error = xfarray_sort_load(si, mid, b);
+ if (error)
+ return error;
+ if (xfarray_sort_cmp(si, a, b) < 0) {
+ error = xfarray_sort_store(si, mid, a);
+ if (error)
+ return error;
+ error = xfarray_sort_store(si, hi, b);
+ if (error)
+ return error;
+ } else {
+ goto move_front;
+ }
+
+ /* if a[mid] < a[lo], swap a[mid] and a[lo]. */
+ error = xfarray_sort_load(si, mid, a);
+ if (error)
+ return error;
+ error = xfarray_sort_load(si, lo, b);
+ if (error)
+ return error;
+ if (xfarray_sort_cmp(si, a, b) < 0) {
+ error = xfarray_sort_store(si, lo, a);
+ if (error)
+ return error;
+ error = xfarray_sort_store(si, mid, b);
+ if (error)
+ return error;
+ }
+
+move_front:
+ /*
+ * Move our selected pivot to a[lo]. Recall that a == si->pivot, so
+ * this leaves us with the pivot cached in the sortinfo structure.
+ */
+ error = xfarray_sort_load(si, lo, b);
+ if (error)
+ return error;
+ error = xfarray_sort_load(si, mid, a);
+ if (error)
+ return error;
+ error = xfarray_sort_store(si, mid, b);
+ if (error)
+ return error;
+ return xfarray_sort_store(si, lo, a);
+}
+
+/*
+ * Set up the pointers for the next iteration. We push onto the stack all of
+ * the unsorted values between a[lo + 1] and a[end[i]], and we tweak the
+ * current stack frame to point to the unsorted values between a[beg[i]] and
+ * a[lo] so that those values will be sorted when we pop the stack.
+ */
+static inline int
+xfarray_qsort_push(
+ struct xfarray_sortinfo *si,
+ xfarray_idx_t *si_lo,
+ xfarray_idx_t *si_hi,
+ xfarray_idx_t lo,
+ xfarray_idx_t hi)
+{
+ /* Check for stack overflows */
+ if (si->stack_depth >= si->max_stack_depth - 1) {
+ ASSERT(si->stack_depth < si->max_stack_depth - 1);
+ return -EFSCORRUPTED;
+ }
+
+ si->max_stack_used = max_t(uint8_t, si->max_stack_used,
+ si->stack_depth + 2);
+
+ si_lo[si->stack_depth + 1] = lo + 1;
+ si_hi[si->stack_depth + 1] = si_hi[si->stack_depth];
+ si_hi[si->stack_depth++] = lo - 1;
+
+ /*
+ * Always start with the smaller of the two partitions to keep the
+ * amount of recursion in check.
+ */
+ if (si_hi[si->stack_depth] - si_lo[si->stack_depth] >
+ si_hi[si->stack_depth - 1] - si_lo[si->stack_depth - 1]) {
+ swap(si_lo[si->stack_depth], si_lo[si->stack_depth - 1]);
+ swap(si_hi[si->stack_depth], si_hi[si->stack_depth - 1]);
+ }
+
+ return 0;
+}
+
+/*
+ * Sort the array elements via quicksort. This implementation incorporates
+ * four optimizations discussed in Sedgewick:
+ *
+ * 1. Use an explicit stack of array indices to store the next array partition
+ * to sort. This helps us to avoid recursion in the call stack, which is
+ * particularly expensive in the kernel.
+ *
+ * 2. For arrays with records in arbitrary or user-controlled order, choose the
+ * pivot element using a median-of-three decision tree. This reduces the
+ * probability of selecting a bad pivot value which causes worst case
+ * behavior (i.e. partition sizes of 1).
+ *
+ * 3. The smaller of the two sub-partitions is pushed onto the stack to start
+ * the next level of recursion, and the larger sub-partition replaces the
+ * current stack frame. This guarantees that we won't need more than
+ * log2(nr) stack space.
+ *
+ * 4. Use insertion sort for small sets since insertion sort is faster
+ * for small, mostly sorted array segments. In the author's experience,
+ * substituting insertion sort for arrays smaller than 8 elements yields
+ * a ~10% reduction in runtime.
+ */
+
+/*
+ * Due to the use of signed indices, we can only support up to 2^63 records.
+ * Files can only grow to 2^63 bytes, so this is not much of a limitation.
+ */
+#define QSORT_MAX_RECS (1ULL << 63)
+
+int
+xfarray_sort(
+ struct xfarray *array,
+ xfarray_cmp_fn cmp_fn,
+ unsigned int flags)
+{
+ struct xfarray_sortinfo *si;
+ xfarray_idx_t *si_lo, *si_hi;
+ void *pivot;
+ void *scratch = xfarray_scratch(array);
+ xfarray_idx_t lo, hi;
+ int error = 0;
+
+ if (array->nr < 2)
+ return 0;
+ if (array->nr >= QSORT_MAX_RECS)
+ return -E2BIG;
+
+ error = xfarray_sortinfo_alloc(array, cmp_fn, flags, &si);
+ if (error)
+ return error;
+ si_lo = xfarray_sortinfo_lo(si);
+ si_hi = xfarray_sortinfo_hi(si);
+ pivot = xfarray_sortinfo_pivot(si);
+
+ while (si->stack_depth >= 0) {
+ lo = si_lo[si->stack_depth];
+ hi = si_hi[si->stack_depth];
+
+ trace_xfarray_qsort(si, lo, hi);
+
+ /* Nothing left in this partition to sort; pop stack. */
+ if (lo >= hi) {
+ si->stack_depth--;
+ continue;
+ }
+
+ /* If insertion sort can solve our problems, we're done. */
+ if (xfarray_want_isort(si, lo, hi)) {
+ error = xfarray_isort(si, lo, hi);
+ if (error)
+ goto out_free;
+ si->stack_depth--;
+ continue;
+ }
+
+ /* Pick a pivot, move it to a[lo] and stash it. */
+ error = xfarray_qsort_pivot(si, lo, hi);
+ if (error)
+ goto out_free;
+
+ /*
+ * Rearrange a[lo..hi] such that everything smaller than the
+ * pivot is on the left side of the range and everything larger
+ * than the pivot is on the right side of the range.
+ */
+ while (lo < hi) {
+ /*
+ * Decrement hi until it finds an a[hi] less than the
+ * pivot value.
+ */
+ error = xfarray_sort_load(si, hi, scratch);
+ if (error)
+ goto out_free;
+ while (xfarray_sort_cmp(si, scratch, pivot) >= 0 &&
+ lo < hi) {
+ if (xfarray_sort_terminated(si, &error))
+ goto out_free;
+
+ hi--;
+ error = xfarray_sort_load(si, hi, scratch);
+ if (error)
+ goto out_free;
+ }
+
+ if (xfarray_sort_terminated(si, &error))
+ goto out_free;
+
+ /* Copy that item (a[hi]) to a[lo]. */
+ if (lo < hi) {
+ error = xfarray_sort_store(si, lo++, scratch);
+ if (error)
+ goto out_free;
+ }
+
+ /*
+ * Increment lo until it finds an a[lo] greater than
+ * the pivot value.
+ */
+ error = xfarray_sort_load(si, lo, scratch);
+ if (error)
+ goto out_free;
+ while (xfarray_sort_cmp(si, scratch, pivot) <= 0 &&
+ lo < hi) {
+ if (xfarray_sort_terminated(si, &error))
+ goto out_free;
+
+ lo++;
+ error = xfarray_sort_load(si, lo, scratch);
+ if (error)
+ goto out_free;
+ }
+
+ if (xfarray_sort_terminated(si, &error))
+ goto out_free;
+
+ /* Copy that item (a[lo]) to a[hi]. */
+ if (lo < hi) {
+ error = xfarray_sort_store(si, hi--, scratch);
+ if (error)
+ goto out_free;
+ }
+
+ if (xfarray_sort_terminated(si, &error))
+ goto out_free;
+ }
+
+ /*
+ * Put our pivot value in the correct place at a[lo]. All
+ * values between a[beg[i]] and a[lo - 1] should be less than
+ * the pivot; and all values between a[lo + 1] and a[end[i]-1]
+ * should be greater than the pivot.
+ */
+ error = xfarray_sort_store(si, lo, pivot);
+ if (error)
+ goto out_free;
+
+ /* Set up the stack frame to process the two partitions. */
+ error = xfarray_qsort_push(si, si_lo, si_hi, lo, hi);
+ if (error)
+ goto out_free;
+
+ if (xfarray_sort_terminated(si, &error))
+ goto out_free;
+ }
+
+out_free:
+ trace_xfarray_sort_stats(si, error);
+ kvfree(si);
+ return error;
+}
diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h
index 26e2b594f121..b0cf818c6a7f 100644
--- a/fs/xfs/scrub/xfarray.h
+++ b/fs/xfs/scrub/xfarray.h
@@ -55,4 +55,71 @@ static inline int xfarray_append(struct xfarray *array, const void *ptr)
uint64_t xfarray_length(struct xfarray *array);
int xfarray_load_next(struct xfarray *array, xfarray_idx_t *idx, void *rec);
+/* Declarations for xfile array sort functionality. */
+
+typedef cmp_func_t xfarray_cmp_fn;
+
+struct xfarray_sortinfo {
+ struct xfarray *array;
+
+ /* Comparison function for the sort. */
+ xfarray_cmp_fn cmp_fn;
+
+ /* Maximum height of the partition stack. */
+ uint8_t max_stack_depth;
+
+ /* Current height of the partition stack. */
+ int8_t stack_depth;
+
+ /* Maximum stack depth ever used. */
+ uint8_t max_stack_used;
+
+ /* XFARRAY_SORT_* flags; see below. */
+ unsigned int flags;
+
+#ifdef DEBUG
+ /* Performance statistics. */
+ uint64_t loads;
+ uint64_t stores;
+ uint64_t compares;
+#endif
+
+ /*
+ * Extra bytes are allocated beyond the end of the structure to store
+ * quicksort information. C does not permit multiple VLAs per struct,
+ * so we document all of this in a comment.
+ *
+ * Pretend that we have a typedef for array records:
+ *
+ * typedef char[array->obj_size] xfarray_rec_t;
+ *
+ * First comes the quicksort partition stack:
+ *
+ * xfarray_idx_t lo[max_stack_depth];
+ * xfarray_idx_t hi[max_stack_depth];
+ *
+ * union {
+ *
+ * If for a given subset we decide to use an insertion sort, we use the
+ * scratchpad record after the xfarray and a second scratchpad record
+ * here to compare items:
+ *
+ * xfarray_rec_t scratch;
+ *
+ * Otherwise, we want to partition the array around a pivot record.
+ * We store the chosen pivot record here and use the xfarray scratchpad
+ * to rearrange the array around the pivot:
+ *
+ * xfarray_rec_t pivot;
+ *
+ * }
+ */
+};
+
+/* Sort can be interrupted by a fatal signal. */
+#define XFARRAY_SORT_KILLABLE (1U << 0)
+
+int xfarray_sort(struct xfarray *array, xfarray_cmp_fn cmp_fn,
+ unsigned int flags);
+
#endif /* __XFS_SCRUB_XFARRAY_H__ */
* [PATCH 1/7] xfs: create a big array data structure
2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
` (5 preceding siblings ...)
2022-12-30 22:12 ` [PATCH 5/7] xfs: speed up xfarray sort by sorting xfile page contents directly Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
6 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
Create a simple 'big array' data structure for storage of fixed-size
metadata records that will be used to reconstruct a btree index. For
repair operations, the most important operations are append, iterate,
and sort.
Earlier implementations of the big array used linked lists and suffered
from severe problems -- pinning all records in kernel memory was not a
good idea and frequently lead to OOM situations; random access was very
inefficient; and record overhead for the lists was unacceptably high at
40-60%.
Therefore, the big memory array relies on the 'xfile' abstraction, which
creates a memfd file and stores the records in page cache pages. Since
the memfd is created in tmpfs, the memory pages can be pushed out to
disk if necessary and we have a built-in usage limit of 50% of physical
memory.
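The xfile interface itself is deliberately small; here's a minimal usage
sketch (the offsets and values are illustrative only):

	struct xfile	*xf;
	uint64_t	val = 42, readback;
	int		error;

	error = xfile_create(mp, "illustrative data", 0, &xf);
	if (error)
		return error;

	/* Store at an arbitrary byte offset; pages appear on demand. */
	error = xfile_obj_store(xf, &val, sizeof(val), 512);
	if (!error)
		/* Loads of never-written ranges come back zeroed. */
		error = xfile_obj_load(xf, &readback, sizeof(readback), 512);

	xfile_destroy(xf);
	return error;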
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/Kconfig | 1
fs/xfs/Makefile | 2
fs/xfs/scrub/trace.c | 4 -
fs/xfs/scrub/trace.h | 123 ++++++++++++++++
fs/xfs/scrub/xfarray.c | 370 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/xfarray.h | 58 ++++++++
fs/xfs/scrub/xfile.c | 318 +++++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/xfile.h | 58 ++++++++
8 files changed, 933 insertions(+), 1 deletion(-)
create mode 100644 fs/xfs/scrub/xfarray.c
create mode 100644 fs/xfs/scrub/xfarray.h
create mode 100644 fs/xfs/scrub/xfile.c
create mode 100644 fs/xfs/scrub/xfile.h
diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 05bc865142b8..6077ac04c0c3 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -101,6 +101,7 @@ config XFS_ONLINE_SCRUB
bool "XFS online metadata check support"
default n
depends on XFS_FS
+ depends on TMPFS && SHMEM
select XFS_DRAIN_INTENTS
help
If you say Y here you will be able to check metadata on a
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 90f1f01277be..90cbba7dc550 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -162,6 +162,8 @@ xfs-y += $(addprefix scrub/, \
rmap.o \
scrub.o \
symlink.o \
+ xfarray.o \
+ xfile.o \
)
xfs-$(CONFIG_XFS_RT) += scrub/rtbitmap.o
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index b5f94676c37c..4a0385c97ea6 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -12,8 +12,10 @@
#include "xfs_mount.h"
#include "xfs_inode.h"
#include "xfs_btree.h"
-#include "scrub/scrub.h"
#include "xfs_ag.h"
+#include "scrub/scrub.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
/* Figure out which block the btree cursor was pointing to. */
static inline xfs_fsblock_t
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index cb33f42190df..84edfa7556ac 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -16,6 +16,9 @@
#include <linux/tracepoint.h>
#include "xfs_bit.h"
+struct xfile;
+struct xfarray;
+
/*
* ftrace's __print_symbolic requires that all enum values be wrapped in the
* TRACE_DEFINE_ENUM macro so that the enum value can be encoded in the ftrace
@@ -726,6 +729,126 @@ TRACE_EVENT(xchk_refcount_incorrect,
__entry->seen)
)
+TRACE_EVENT(xfile_create,
+ TP_PROTO(struct xfs_mount *mp, struct xfile *xf),
+ TP_ARGS(mp, xf),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(unsigned long, ino)
+ __array(char, pathname, 256)
+ ),
+ TP_fast_assign(
+ char pathname[257];
+ char *path;
+
+ __entry->dev = mp->m_super->s_dev;
+ __entry->ino = file_inode(xf->file)->i_ino;
+ memset(pathname, 0, sizeof(pathname));
+ path = file_path(xf->file, pathname, sizeof(pathname) - 1);
+ if (IS_ERR(path))
+ path = "(unknown)";
+ strncpy(__entry->pathname, path, sizeof(__entry->pathname));
+ ),
+ TP_printk("dev %d:%d xfino 0x%lx path '%s'",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ino,
+ __entry->pathname)
+);
+
+TRACE_EVENT(xfile_destroy,
+ TP_PROTO(struct xfile *xf),
+ TP_ARGS(xf),
+ TP_STRUCT__entry(
+ __field(unsigned long, ino)
+ __field(unsigned long long, bytes)
+ __field(loff_t, size)
+ ),
+ TP_fast_assign(
+ struct xfile_stat statbuf;
+ int ret;
+
+ ret = xfile_stat(xf, &statbuf);
+ if (!ret) {
+ __entry->bytes = statbuf.bytes;
+ __entry->size = statbuf.size;
+ } else {
+ __entry->bytes = -1;
+ __entry->size = -1;
+ }
+ __entry->ino = file_inode(xf->file)->i_ino;
+ ),
+ TP_printk("xfino 0x%lx mem_bytes 0x%llx isize 0x%llx",
+ __entry->ino,
+ __entry->bytes,
+ __entry->size)
+);
+
+DECLARE_EVENT_CLASS(xfile_class,
+ TP_PROTO(struct xfile *xf, loff_t pos, unsigned long long bytecount),
+ TP_ARGS(xf, pos, bytecount),
+ TP_STRUCT__entry(
+ __field(unsigned long, ino)
+ __field(unsigned long long, bytes_used)
+ __field(loff_t, pos)
+ __field(loff_t, size)
+ __field(unsigned long long, bytecount)
+ ),
+ TP_fast_assign(
+ struct xfile_stat statbuf;
+ int ret;
+
+ ret = xfile_stat(xf, &statbuf);
+ if (!ret) {
+ __entry->bytes_used = statbuf.bytes;
+ __entry->size = statbuf.size;
+ } else {
+ __entry->bytes_used = -1;
+ __entry->size = -1;
+ }
+ __entry->ino = file_inode(xf->file)->i_ino;
+ __entry->pos = pos;
+ __entry->bytecount = bytecount;
+ ),
+ TP_printk("xfino 0x%lx mem_bytes 0x%llx pos 0x%llx bytecount 0x%llx isize 0x%llx",
+ __entry->ino,
+ __entry->bytes_used,
+ __entry->pos,
+ __entry->bytecount,
+ __entry->size)
+);
+#define DEFINE_XFILE_EVENT(name) \
+DEFINE_EVENT(xfile_class, name, \
+ TP_PROTO(struct xfile *xf, loff_t pos, unsigned long long bytecount), \
+ TP_ARGS(xf, pos, bytecount))
+DEFINE_XFILE_EVENT(xfile_pread);
+DEFINE_XFILE_EVENT(xfile_pwrite);
+DEFINE_XFILE_EVENT(xfile_seek_data);
+
+TRACE_EVENT(xfarray_create,
+ TP_PROTO(struct xfarray *xfa, unsigned long long required_capacity),
+ TP_ARGS(xfa, required_capacity),
+ TP_STRUCT__entry(
+ __field(unsigned long, ino)
+ __field(uint64_t, max_nr)
+ __field(size_t, obj_size)
+ __field(int, obj_size_log)
+ __field(unsigned long long, required_capacity)
+ ),
+ TP_fast_assign(
+ __entry->max_nr = xfa->max_nr;
+ __entry->obj_size = xfa->obj_size;
+ __entry->obj_size_log = xfa->obj_size_log;
+ __entry->ino = file_inode(xfa->xfile->file)->i_ino;
+ __entry->required_capacity = required_capacity;
+ ),
+ TP_printk("xfino 0x%lx max_nr %llu reqd_nr %llu objsz %zu objszlog %d",
+ __entry->ino,
+ __entry->max_nr,
+ __entry->required_capacity,
+ __entry->obj_size,
+ __entry->obj_size_log)
+);
+
/* repair tracepoints */
#if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)
diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
new file mode 100644
index 000000000000..8fdd7dd40193
--- /dev/null
+++ b/fs/xfs/scrub/xfarray.c
@@ -0,0 +1,370 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/scrub.h"
+#include "scrub/trace.h"
+
+/*
+ * Large Arrays of Fixed-Size Records
+ * ==================================
+ *
+ * This memory array uses an xfile (which itself is a memfd "file") to store
+ * large numbers of fixed-size records in memory that can be paged out. This
+ * puts less stress on the memory reclaim algorithms during an online repair
+ * because we don't have to pin so much memory. However, array access is less
+ * direct than would be in a regular memory array. Access to the array is
+ * performed via indexed load and store methods, and an append method is
+ * provided for convenience. Array elements can be unset, which sets them to
+ * all zeroes. Unset entries are skipped during iteration, though direct loads
+ * will return a zeroed buffer. Callers are responsible for concurrency
+ * control.
+ */
+
+/*
+ * Pointer to scratch space. Because we can't access the xfile data directly,
+ * we allocate a small amount of memory on the end of the xfarray structure to
+ * buffer array items when we need space to store values temporarily.
+ */
+static inline void *xfarray_scratch(struct xfarray *array)
+{
+ return (array + 1);
+}
+
+/* Compute array index given an xfile offset. */
+static xfarray_idx_t
+xfarray_idx(
+ struct xfarray *array,
+ loff_t pos)
+{
+ if (array->obj_size_log >= 0)
+ return (xfarray_idx_t)pos >> array->obj_size_log;
+
+ return div_u64((xfarray_idx_t)pos, array->obj_size);
+}
+
+/* Compute xfile offset of array element. */
+static inline loff_t xfarray_pos(struct xfarray *array, xfarray_idx_t idx)
+{
+ if (array->obj_size_log >= 0)
+ return idx << array->obj_size_log;
+
+ return idx * array->obj_size;
+}
+
+/*
+ * Initialize a big memory array. Array records cannot be larger than a
+ * page, and the array cannot span more bytes than the page cache supports.
+ * If @required_capacity is nonzero, the maximum array size will be set to this
+ * quantity and the array creation will fail if the underlying storage cannot
+ * support that many records.
+ */
+int
+xfarray_create(
+ struct xfs_mount *mp,
+ const char *description,
+ unsigned long long required_capacity,
+ size_t obj_size,
+ struct xfarray **arrayp)
+{
+ struct xfarray *array;
+ struct xfile *xfile;
+ int error;
+
+ ASSERT(obj_size < PAGE_SIZE);
+
+ error = xfile_create(mp, description, 0, &xfile);
+ if (error)
+ return error;
+
+ error = -ENOMEM;
+ array = kzalloc(sizeof(struct xfarray) + obj_size, XCHK_GFP_FLAGS);
+ if (!array)
+ goto out_xfile;
+
+ array->xfile = xfile;
+ array->obj_size = obj_size;
+
+ if (is_power_of_2(obj_size))
+ array->obj_size_log = ilog2(obj_size);
+ else
+ array->obj_size_log = -1;
+
+ array->max_nr = xfarray_idx(array, MAX_LFS_FILESIZE);
+ trace_xfarray_create(array, required_capacity);
+
+ if (required_capacity > 0) {
+ if (array->max_nr < required_capacity) {
+ error = -ENOMEM;
+ goto out_xfarray;
+ }
+ array->max_nr = required_capacity;
+ }
+
+ *arrayp = array;
+ return 0;
+
+out_xfarray:
+ kfree(array);
+out_xfile:
+ xfile_destroy(xfile);
+ return error;
+}
+
+/* Destroy the array. */
+void
+xfarray_destroy(
+ struct xfarray *array)
+{
+ xfile_destroy(array->xfile);
+ kfree(array);
+}
+
+/* Load an element from the array. */
+int
+xfarray_load(
+ struct xfarray *array,
+ xfarray_idx_t idx,
+ void *ptr)
+{
+ if (idx >= array->nr)
+ return -ENODATA;
+
+ return xfile_obj_load(array->xfile, ptr, array->obj_size,
+ xfarray_pos(array, idx));
+}
+
+/* Is this array element potentially unset? */
+static inline bool
+xfarray_is_unset(
+ struct xfarray *array,
+ loff_t pos)
+{
+ void *temp = xfarray_scratch(array);
+ int error;
+
+ if (array->unset_slots == 0)
+ return false;
+
+ error = xfile_obj_load(array->xfile, temp, array->obj_size, pos);
+ if (!error && xfarray_element_is_null(array, temp))
+ return true;
+
+ return false;
+}
+
+/*
+ * Unset an array element. If @idx is the last element in the array, the
+ * array will be truncated. Otherwise, the entry will be zeroed.
+ */
+int
+xfarray_unset(
+ struct xfarray *array,
+ xfarray_idx_t idx)
+{
+ void *temp = xfarray_scratch(array);
+ loff_t pos = xfarray_pos(array, idx);
+ int error;
+
+ if (idx >= array->nr)
+ return -ENODATA;
+
+ if (idx == array->nr - 1) {
+ array->nr--;
+ return 0;
+ }
+
+ if (xfarray_is_unset(array, pos))
+ return 0;
+
+ memset(temp, 0, array->obj_size);
+ error = xfile_obj_store(array->xfile, temp, array->obj_size, pos);
+ if (error)
+ return error;
+
+ array->unset_slots++;
+ return 0;
+}
+
+/*
+ * Store an element in the array. The element must not be completely zeroed,
+ * because those are considered unset sparse elements.
+ */
+int
+xfarray_store(
+ struct xfarray *array,
+ xfarray_idx_t idx,
+ const void *ptr)
+{
+ int ret;
+
+ if (idx >= array->max_nr)
+ return -EFBIG;
+
+ ASSERT(!xfarray_element_is_null(array, ptr));
+
+ ret = xfile_obj_store(array->xfile, ptr, array->obj_size,
+ xfarray_pos(array, idx));
+ if (ret)
+ return ret;
+
+ array->nr = max(array->nr, idx + 1);
+ return 0;
+}
+
+/* Is this array element NULL? */
+bool
+xfarray_element_is_null(
+ struct xfarray *array,
+ const void *ptr)
+{
+ return !memchr_inv(ptr, 0, array->obj_size);
+}
+
+/*
+ * Store an element anywhere in the array that is unset. If there are no
+ * unset slots, append the element to the array.
+ */
+int
+xfarray_store_anywhere(
+ struct xfarray *array,
+ const void *ptr)
+{
+ void *temp = xfarray_scratch(array);
+ loff_t endpos = xfarray_pos(array, array->nr);
+ loff_t pos;
+ int error;
+
+ /* Find an unset slot to put it in. */
+ for (pos = 0;
+ pos < endpos && array->unset_slots > 0;
+ pos += array->obj_size) {
+ error = xfile_obj_load(array->xfile, temp, array->obj_size,
+ pos);
+ if (error || !xfarray_element_is_null(array, temp))
+ continue;
+
+ error = xfile_obj_store(array->xfile, ptr, array->obj_size,
+ pos);
+ if (error)
+ return error;
+
+ array->unset_slots--;
+ return 0;
+ }
+
+ /* No unset slots found; attach it on the end. */
+ array->unset_slots = 0;
+ return xfarray_append(array, ptr);
+}
+
+/* Return length of array. */
+uint64_t
+xfarray_length(
+ struct xfarray *array)
+{
+ return array->nr;
+}
+
+/*
+ * Decide which array item we're going to read as part of an _iter_get.
+ * @cur is the array index, and @pos is the file offset of that array index in
+ * the backing xfile. Returns ENODATA if we reach the end of the records.
+ *
+ * Reading from a hole in a sparse xfile causes page instantiation, so for
+ * iterating a (possibly sparse) array we need to figure out if the cursor is
+ * pointing at a totally uninitialized hole and move the cursor up if
+ * necessary.
+ */
+static inline int
+xfarray_find_data(
+ struct xfarray *array,
+ xfarray_idx_t *cur,
+ loff_t *pos)
+{
+ unsigned int pgoff = offset_in_page(*pos);
+ loff_t end_pos = *pos + array->obj_size - 1;
+ loff_t new_pos;
+
+ /*
+ * If the current array record is not adjacent to a page boundary, we
+ * are in the middle of the page. We do not need to move the cursor.
+ */
+ if (pgoff != 0 && pgoff + array->obj_size - 1 < PAGE_SIZE)
+ return 0;
+
+ /*
+ * Call SEEK_DATA on the last byte in the record we're about to read.
+ * If the record ends at (or crosses) the end of a page then we know
+ * that the first byte of the record is backed by pages and don't need
+ * to query it. If instead the record begins at the start of the page
+ * then we know that querying the last byte is just as good as querying
+ * the first byte, since records cannot be larger than a page.
+ *
+ * If the call returns the same file offset, we know this record is
+ * backed by real pages. We do not need to move the cursor.
+ */
+ new_pos = xfile_seek_data(array->xfile, end_pos);
+ if (new_pos == -ENXIO)
+ return -ENODATA;
+ if (new_pos < 0)
+ return new_pos;
+ if (new_pos == end_pos)
+ return 0;
+
+ /*
+ * Otherwise, SEEK_DATA told us how far up to move the file pointer to
+ * find more data. Move the array index to the first record past the
+ * byte offset we were given.
+ */
+ new_pos = roundup_64(new_pos, array->obj_size);
+ *cur = xfarray_idx(array, new_pos);
+ *pos = xfarray_pos(array, *cur);
+ return 0;
+}
+
+/*
+ * Starting at *idx, fetch the next non-null array entry and advance the index
+ * to set up the next _load_next call. Returns ENODATA if we reach the end of
+ * the array. Callers must set @*idx to XFARRAY_CURSOR_INIT before the first
+ * call to this function.
+ */
+int
+xfarray_load_next(
+ struct xfarray *array,
+ xfarray_idx_t *idx,
+ void *rec)
+{
+ xfarray_idx_t cur = *idx;
+ loff_t pos = xfarray_pos(array, cur);
+ int error;
+
+ do {
+ if (cur >= array->nr)
+ return -ENODATA;
+
+ /*
+ * Ask the backing store for the location of next possible
+ * written record, then retrieve that record.
+ */
+ error = xfarray_find_data(array, &cur, &pos);
+ if (error)
+ return error;
+ error = xfarray_load(array, cur, rec);
+ if (error)
+ return error;
+
+ cur++;
+ pos += array->obj_size;
+ } while (xfarray_element_is_null(array, rec));
+
+ *idx = cur;
+ return 0;
+}
diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h
new file mode 100644
index 000000000000..26e2b594f121
--- /dev/null
+++ b/fs/xfs/scrub/xfarray.h
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_XFARRAY_H__
+#define __XFS_SCRUB_XFARRAY_H__
+
+/* xfile array index type, along with cursor initialization */
+typedef uint64_t xfarray_idx_t;
+#define XFARRAY_CURSOR_INIT ((__force xfarray_idx_t)0)
+
+/* Iterate each index of an xfile array. */
+#define foreach_xfarray_idx(array, idx) \
+ for ((idx) = XFARRAY_CURSOR_INIT; \
+ (idx) < xfarray_length(array); \
+ (idx)++)
+
+struct xfarray {
+ /* Underlying file that backs the array. */
+ struct xfile *xfile;
+
+ /* Number of array elements. */
+ xfarray_idx_t nr;
+
+ /* Maximum possible array size. */
+ xfarray_idx_t max_nr;
+
+ /* Number of unset slots in the array below @nr. */
+ uint64_t unset_slots;
+
+ /* Size of an array element. */
+ size_t obj_size;
+
+ /* log2 of array element size, if possible. */
+ int obj_size_log;
+};
+
+int xfarray_create(struct xfs_mount *mp, const char *descr,
+ unsigned long long required_capacity, size_t obj_size,
+ struct xfarray **arrayp);
+void xfarray_destroy(struct xfarray *array);
+int xfarray_load(struct xfarray *array, xfarray_idx_t idx, void *ptr);
+int xfarray_unset(struct xfarray *array, xfarray_idx_t idx);
+int xfarray_store(struct xfarray *array, xfarray_idx_t idx, const void *ptr);
+int xfarray_store_anywhere(struct xfarray *array, const void *ptr);
+bool xfarray_element_is_null(struct xfarray *array, const void *ptr);
+
+/* Append an element to the array. */
+static inline int xfarray_append(struct xfarray *array, const void *ptr)
+{
+ return xfarray_store(array, array->nr, ptr);
+}
+
+uint64_t xfarray_length(struct xfarray *array);
+int xfarray_load_next(struct xfarray *array, xfarray_idx_t *idx, void *rec);
+
+#endif /* __XFS_SCRUB_XFARRAY_H__ */
diff --git a/fs/xfs/scrub/xfile.c b/fs/xfs/scrub/xfile.c
new file mode 100644
index 000000000000..43455aa78243
--- /dev/null
+++ b/fs/xfs/scrub/xfile.c
@@ -0,0 +1,318 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_format.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/scrub.h"
+#include "scrub/trace.h"
+#include <linux/shmem_fs.h>
+
+/*
+ * Swappable Temporary Memory
+ * ==========================
+ *
+ * Online checking sometimes needs to be able to stage a large amount of data
+ * in memory. This information might not fit in the available memory and it
+ * doesn't all need to be accessible at all times. In other words, we want an
+ * indexed data buffer to store data that can be paged out.
+ *
+ * When CONFIG_TMPFS=y, shmemfs is enough of a filesystem to meet those
+ * requirements. Therefore, the xfile mechanism uses an unlinked shmem file to
+ * store our staging data. This file is not installed in the file descriptor
+ * table so that user programs cannot access the data, which means that the
+ * xfile must be freed with xfile_destroy.
+ *
+ * xfiles assume that the caller will handle all required concurrency
+ * management; standard vfs locks (freezer and inode) are not taken. Reads
+ * and writes are satisfied directly from the page cache.
+ *
+ * NOTE: The current shmemfs implementation has a quirk that in-kernel reads
+ * of a hole cause a page to be mapped into the file. If you are going to
+ * create a sparse xfile, please be careful about reading from uninitialized
+ * parts of the file. These pages are !Uptodate and will eventually be
+ * reclaimed if not written, but in the short term this boosts memory
+ * consumption.
+ */
+
+/*
+ * xfiles must not be exposed to userspace and require upper layers to
+ * coordinate access to the one handle returned by the constructor, so
+ * establish a separate lock class for xfiles to avoid confusing lockdep.
+ */
+static struct lock_class_key xfile_i_mutex_key;
+
+/*
+ * Create an xfile of the given size. The description will be used in the
+ * trace output.
+ */
+int
+xfile_create(
+ struct xfs_mount *mp,
+ const char *description,
+ loff_t isize,
+ struct xfile **xfilep)
+{
+ char *fname;
+ struct xfile *xf;
+ int error = -ENOMEM;
+
+ xf = kmalloc(sizeof(struct xfile), XCHK_GFP_FLAGS);
+ if (!xf)
+ return -ENOMEM;
+
+ fname = kmalloc(MAXNAMELEN, XCHK_GFP_FLAGS);
+ if (!fname)
+ goto out_xfile;
+
+ snprintf(fname, MAXNAMELEN - 1, "XFS (%s): %s", mp->m_super->s_id,
+ description);
+ fname[MAXNAMELEN - 1] = 0;
+
+ xf->file = shmem_file_setup(fname, isize, 0);
+ if (!xf->file)
+ goto out_fname;
+ if (IS_ERR(xf->file)) {
+ error = PTR_ERR(xf->file);
+ goto out_fname;
+ }
+
+ /*
+ * We want a large sparse file that we can pread, pwrite, and seek.
+ * xfile users are responsible for keeping the xfile hidden away from
+ * all other callers, so we skip timestamp updates and security checks.
+ */
+ xf->file->f_mode |= FMODE_PREAD | FMODE_PWRITE | FMODE_NOCMTIME |
+ FMODE_LSEEK;
+ xf->file->f_flags |= O_RDWR | O_LARGEFILE | O_NOATIME;
+ xf->file->f_inode->i_flags |= S_PRIVATE | S_NOCMTIME | S_NOATIME;
+
+ lockdep_set_class(&file_inode(xf->file)->i_rwsem, &xfile_i_mutex_key);
+
+ trace_xfile_create(mp, xf);
+
+ kfree(fname);
+ *xfilep = xf;
+ return 0;
+out_fname:
+ kfree(fname);
+out_xfile:
+ kfree(xf);
+ return error;
+}
+
+/* Close the file and release all resources. */
+void
+xfile_destroy(
+ struct xfile *xf)
+{
+ struct inode *inode = file_inode(xf->file);
+
+ trace_xfile_destroy(xf);
+
+ lockdep_set_class(&inode->i_rwsem, &inode->i_sb->s_type->i_mutex_key);
+ fput(xf->file);
+ kfree(xf);
+}
+
+/*
+ * Read a memory object directly from the xfile's page cache. Unlike regular
+ * pread, we return -E2BIG and -EFBIG for reads that are too large or at too
+ * high an offset, instead of truncating the read. Otherwise, we return
+ * bytes read or an error code, like regular pread.
+ */
+ssize_t
+xfile_pread(
+ struct xfile *xf,
+ void *buf,
+ size_t count,
+ loff_t pos)
+{
+ struct inode *inode = file_inode(xf->file);
+ struct address_space *mapping = inode->i_mapping;
+ struct page *page = NULL;
+ ssize_t read = 0;
+ unsigned int pflags;
+ int error = 0;
+
+ if (count > MAX_RW_COUNT)
+ return -E2BIG;
+ if (inode->i_sb->s_maxbytes - pos < count)
+ return -EFBIG;
+
+ trace_xfile_pread(xf, pos, count);
+
+ pflags = memalloc_nofs_save();
+ while (count > 0) {
+ void *p, *kaddr;
+ unsigned int len;
+
+ len = min_t(ssize_t, count, PAGE_SIZE - offset_in_page(pos));
+
+ /*
+ * In-kernel reads of a shmem file cause it to allocate a page
+ * if the mapping shows a hole. Therefore, if we hit ENOMEM
+ * we can continue by zeroing the caller's buffer.
+ */
+ page = shmem_read_mapping_page_gfp(mapping, pos >> PAGE_SHIFT,
+ __GFP_NOWARN);
+ if (IS_ERR(page)) {
+ error = PTR_ERR(page);
+ if (error != -ENOMEM)
+ break;
+
+ memset(buf, 0, len);
+ goto advance;
+ }
+
+ if (PageUptodate(page)) {
+ /*
+ * xfile pages must never be mapped into userspace, so
+ * we skip the dcache flush.
+ */
+ kaddr = kmap_local_page(page);
+ p = kaddr + offset_in_page(pos);
+ memcpy(buf, p, len);
+ kunmap_local(kaddr);
+ } else {
+ memset(buf, 0, len);
+ }
+ put_page(page);
+
+advance:
+ count -= len;
+ pos += len;
+ buf += len;
+ read += len;
+ }
+ memalloc_nofs_restore(pflags);
+
+ if (read > 0)
+ return read;
+ return error;
+}
+
+/*
+ * Write a memory object directly to the xfile's page cache. Unlike regular
+ * pwrite, we return -E2BIG and -EFBIG for writes that are too large or at too
+ * high an offset, instead of truncating the write. Otherwise, we return
+ * bytes written or an error code, like regular pwrite.
+ */
+ssize_t
+xfile_pwrite(
+ struct xfile *xf,
+ const void *buf,
+ size_t count,
+ loff_t pos)
+{
+ struct inode *inode = file_inode(xf->file);
+ struct address_space *mapping = inode->i_mapping;
+ const struct address_space_operations *aops = mapping->a_ops;
+ struct page *page = NULL;
+ ssize_t written = 0;
+ unsigned int pflags;
+ int error = 0;
+
+ if (count > MAX_RW_COUNT)
+ return -E2BIG;
+ if (inode->i_sb->s_maxbytes - pos < count)
+ return -EFBIG;
+
+ trace_xfile_pwrite(xf, pos, count);
+
+ pflags = memalloc_nofs_save();
+ while (count > 0) {
+ void *fsdata = NULL;
+ void *p, *kaddr;
+ unsigned int len;
+ int ret;
+
+ len = min_t(ssize_t, count, PAGE_SIZE - offset_in_page(pos));
+
+ /*
+ * We call write_begin directly here to avoid all the freezer
+ * protection lock-taking that happens in the normal path.
+ * shmem doesn't support fs freeze, but lockdep doesn't know
+ * that and will trip over that.
+ */
+ error = aops->write_begin(NULL, mapping, pos, len, &page,
+ &fsdata);
+ if (error)
+ break;
+
+ /*
+ * xfile pages must never be mapped into userspace, so we skip
+ * the dcache flush. If the page is not uptodate, zero it
+ * before writing data.
+ */
+ kaddr = kmap_local_page(page);
+ if (!PageUptodate(page)) {
+ memset(kaddr, 0, PAGE_SIZE);
+ SetPageUptodate(page);
+ }
+ p = kaddr + offset_in_page(pos);
+ memcpy(p, buf, len);
+ kunmap_local(kaddr);
+
+ ret = aops->write_end(NULL, mapping, pos, len, len, page,
+ fsdata);
+ if (ret < 0) {
+ error = ret;
+ break;
+ }
+
+ written += ret;
+ if (ret != len)
+ break;
+
+ count -= ret;
+ pos += ret;
+ buf += ret;
+ }
+ memalloc_nofs_restore(pflags);
+
+ if (written > 0)
+ return written;
+ return error;
+}
+
+/* Find the next written area in the xfile data for a given offset. */
+loff_t
+xfile_seek_data(
+ struct xfile *xf,
+ loff_t pos)
+{
+ loff_t ret;
+
+ ret = vfs_llseek(xf->file, pos, SEEK_DATA);
+ trace_xfile_seek_data(xf, pos, ret);
+ return ret;
+}
+
+/* Query stat information for an xfile. */
+int
+xfile_stat(
+ struct xfile *xf,
+ struct xfile_stat *statbuf)
+{
+ struct kstat ks;
+ int error;
+
+ error = vfs_getattr_nosec(&xf->file->f_path, &ks,
+ STATX_SIZE | STATX_BLOCKS, AT_STATX_DONT_SYNC);
+ if (error)
+ return error;
+
+ statbuf->size = ks.size;
+ statbuf->bytes = ks.blocks << SECTOR_SHIFT;
+ return 0;
+}
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
new file mode 100644
index 000000000000..b37dba1961d8
--- /dev/null
+++ b/fs/xfs/scrub/xfile.h
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_XFILE_H__
+#define __XFS_SCRUB_XFILE_H__
+
+struct xfile {
+ struct file *file;
+};
+
+int xfile_create(struct xfs_mount *mp, const char *description, loff_t isize,
+ struct xfile **xfilep);
+void xfile_destroy(struct xfile *xf);
+
+ssize_t xfile_pread(struct xfile *xf, void *buf, size_t count, loff_t pos);
+ssize_t xfile_pwrite(struct xfile *xf, const void *buf, size_t count,
+ loff_t pos);
+
+/*
+ * Load an object. Since we're treating this file as "memory", any error or
+ * short IO is treated as a failure to allocate memory.
+ */
+static inline int
+xfile_obj_load(struct xfile *xf, void *buf, size_t count, loff_t pos)
+{
+ ssize_t ret = xfile_pread(xf, buf, count, pos);
+
+ if (ret < 0 || ret != count)
+ return -ENOMEM;
+ return 0;
+}
+
+/*
+ * Store an object. Since we're treating this file as "memory", any error or
+ * short IO is treated as a failure to allocate memory.
+ */
+static inline int
+xfile_obj_store(struct xfile *xf, const void *buf, size_t count, loff_t pos)
+{
+ ssize_t ret = xfile_pwrite(xf, buf, count, pos);
+
+ if (ret < 0 || ret != count)
+ return -ENOMEM;
+ return 0;
+}
+
+loff_t xfile_seek_data(struct xfile *xf, loff_t pos);
+
+struct xfile_stat {
+ loff_t size;
+ unsigned long long bytes;
+};
+
+int xfile_stat(struct xfile *xf, struct xfile_stat *statbuf);
+
+#endif /* __XFS_SCRUB_XFILE_H__ */
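
As an aside for reviewers, a minimal usage sketch of this API might look
like the following. This is an illustration only -- xrep_example_rec is
an invented record type, and real callers go through the xfarray layer
rather than calling these primitives directly:

struct xrep_example_rec {
	__u64			key;
	__u64			value;
};

STATIC int
xrep_example_stage(
	struct xfs_mount	*mp)
{
	struct xrep_example_rec	rec = { .key = 1, .value = 2 };
	struct xfile		*xf;
	int			error;

	error = xfile_create(mp, "example staging data", 0, &xf);
	if (error)
		return error;

	/* Stage a record; short writes are reported as -ENOMEM. */
	error = xfile_obj_store(xf, &rec, sizeof(rec), 0);
	if (error)
		goto out_destroy;

	/* Read it back as if it were ordinary pageable memory. */
	error = xfile_obj_load(xf, &rec, sizeof(rec), 0);
out_destroy:
	xfile_destroy(xf);
	return error;
}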
* [PATCH 3/7] xfs: convert xfarray insertion sort to heapsort using scratchpad memory
2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
` (3 preceding siblings ...)
2022-12-30 22:12 ` [PATCH 4/7] xfs: teach xfile to pass back direct-map pages to caller Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
2022-12-30 22:12 ` [PATCH 5/7] xfs: speed up xfarray sort by sorting xfile page contents directly Darrick J. Wong
2022-12-30 22:12 ` [PATCH 1/7] xfs: create a big array data structure Darrick J. Wong
6 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
In the previous patch, we created a very basic quicksort implementation
for xfile arrays. While the use of an alternate sorting algorithm to
avoid quicksort recursion on very small subsets reduces the runtime
modestly, we can do better than a load- and store-heavy insertion sort,
particularly since each load and store requires a page mapping lookup in
the xfile.
For a small increase in kernel memory requirements, we could instead
bulk load the xfarray records into memory, use the kernel's existing
heapsort implementation to sort the records, and bulk store the memory
buffer back into the xfile. On the author's computer, this reduces the
runtime by about 5% on a 500,000 element array.
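In outline, the conversion boils down to the sketch below -- a condensed
version of the new xfarray_isort in the diff that follows, with the
statistics counters and termination checks elided:

	void	*scratch = xfarray_sortinfo_isort_scratch(si);
	loff_t	lo_pos = xfarray_pos(si->array, lo);
	loff_t	len = xfarray_pos(si->array, hi - lo + 1);
	int	error;

	/* Bulk-read the records into contiguous scratchpad memory... */
	error = xfile_obj_load(si->array->xfile, scratch, len, lo_pos);
	if (error)
		return error;

	/* ...heapsort them in memory with the kernel's sort()... */
	sort(scratch, hi - lo + 1, si->array->obj_size, si->cmp_fn, NULL);

	/* ...and bulk-write the sorted run back into the xfile. */
	return xfile_obj_store(si->array->xfile, scratch, len, lo_pos);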
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/trace.h | 5 +-
fs/xfs/scrub/xfarray.c | 142 +++++++++---------------------------------------
fs/xfs/scrub/xfarray.h | 12 +++-
3 files changed, 39 insertions(+), 120 deletions(-)
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 02f5f547c563..9de9d4f795e8 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -930,6 +930,7 @@ TRACE_EVENT(xfarray_sort_stats,
__field(unsigned long long, loads)
__field(unsigned long long, stores)
__field(unsigned long long, compares)
+ __field(unsigned long long, heapsorts)
#endif
__field(unsigned int, max_stack_depth)
__field(unsigned int, max_stack_used)
@@ -941,6 +942,7 @@ TRACE_EVENT(xfarray_sort_stats,
__entry->loads = si->loads;
__entry->stores = si->stores;
__entry->compares = si->compares;
+ __entry->heapsorts = si->heapsorts;
#endif
__entry->max_stack_depth = si->max_stack_depth;
__entry->max_stack_used = si->max_stack_used;
@@ -948,7 +950,7 @@ TRACE_EVENT(xfarray_sort_stats,
),
TP_printk(
#ifdef DEBUG
- "xfino 0x%lx loads %llu stores %llu compares %llu stack_depth %u/%u error %d",
+ "xfino 0x%lx loads %llu stores %llu compares %llu heapsorts %llu stack_depth %u/%u error %d",
#else
"xfino 0x%lx stack_depth %u/%u error %d",
#endif
@@ -957,6 +959,7 @@ TRACE_EVENT(xfarray_sort_stats,
__entry->loads,
__entry->stores,
__entry->compares,
+ __entry->heapsorts,
#endif
__entry->max_stack_used,
__entry->max_stack_depth,
diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
index 2cd3a2f42e19..171c40d04e6c 100644
--- a/fs/xfs/scrub/xfarray.c
+++ b/fs/xfs/scrub/xfarray.c
@@ -375,10 +375,12 @@ xfarray_load_next(
# define xfarray_sort_bump_loads(si) do { (si)->loads++; } while (0)
# define xfarray_sort_bump_stores(si) do { (si)->stores++; } while (0)
# define xfarray_sort_bump_compares(si) do { (si)->compares++; } while (0)
+# define xfarray_sort_bump_heapsorts(si) do { (si)->heapsorts++; } while (0)
#else
# define xfarray_sort_bump_loads(si)
# define xfarray_sort_bump_stores(si)
# define xfarray_sort_bump_compares(si)
+# define xfarray_sort_bump_heapsorts(si)
#endif /* DEBUG */
/* Load an array element for sorting. */
@@ -441,15 +443,19 @@ xfarray_sortinfo_alloc(
/*
* Tail-call recursion during the partitioning phase means that
* quicksort will never recurse more than log2(nr) times. We need one
- * extra level of stack to hold the initial parameters.
+ * extra level of stack to hold the initial parameters. In-memory
+ * sort will always take care of the last few levels of recursion for
+ * us, so we can reduce the stack depth by that much.
*/
- max_stack_depth = ilog2(array->nr) + 1;
+ max_stack_depth = ilog2(array->nr) + 1 - (XFARRAY_ISORT_SHIFT - 1);
+ if (max_stack_depth < 1)
+ max_stack_depth = 1;
/* Each level of quicksort uses a lo and a hi index */
nr_bytes += max_stack_depth * sizeof(xfarray_idx_t) * 2;
- /* One record for the pivot */
- nr_bytes += array->obj_size;
+ /* Scratchpad for in-memory sort, or one record for the pivot */
+ nr_bytes += (XFARRAY_ISORT_NR * array->obj_size);
si = kvzalloc(nr_bytes, XCHK_GFP_FLAGS);
if (!si)
@@ -491,7 +497,7 @@ xfarray_sort_terminated(
return false;
}
-/* Do we want an insertion sort? */
+/* Do we want an in-memory sort? */
static inline bool
xfarray_want_isort(
struct xfarray_sortinfo *si,
@@ -499,10 +505,10 @@ xfarray_want_isort(
xfarray_idx_t end)
{
/*
- * For array subsets smaller than 8 elements, it's slightly faster to
- * use insertion sort than quicksort's stack machine.
+ * For array subsets that fit in the scratchpad, it's much faster to
+ * use the kernel's heapsort than quicksort's stack machine.
*/
- return (end - start) < 8;
+ return (end - start) < XFARRAY_ISORT_NR;
}
/* Return the scratch space within the sortinfo structure. */
@@ -512,10 +518,8 @@ static inline void *xfarray_sortinfo_isort_scratch(struct xfarray_sortinfo *si)
}
/*
- * Perform an insertion sort on a subset of the array.
- * Though insertion sort is an O(n^2) algorithm, for small set sizes it's
- * faster than quicksort's stack machine, so we let it take over for that.
- * This ought to be replaced with something more efficient.
+ * Sort a small number of array records using scratchpad memory. The records
+ * need not be contiguous in the xfile's memory pages.
*/
STATIC int
xfarray_isort(
@@ -523,114 +527,23 @@ xfarray_isort(
xfarray_idx_t lo,
xfarray_idx_t hi)
{
- void *a = xfarray_sortinfo_isort_scratch(si);
- void *b = xfarray_scratch(si->array);
- xfarray_idx_t tmp;
- xfarray_idx_t i;
- xfarray_idx_t run;
+ void *scratch = xfarray_sortinfo_isort_scratch(si);
+ loff_t lo_pos = xfarray_pos(si->array, lo);
+ loff_t len = xfarray_pos(si->array, hi - lo + 1);
int error;
trace_xfarray_isort(si, lo, hi);
- /*
- * Move the smallest element in a[lo..hi] to a[lo]. This
- * simplifies the loop control logic below.
- */
- tmp = lo;
- error = xfarray_sort_load(si, tmp, b);
+ xfarray_sort_bump_loads(si);
+ error = xfile_obj_load(si->array->xfile, scratch, len, lo_pos);
if (error)
return error;
- for (run = lo + 1; run <= hi; run++) {
- /* if a[run] < a[tmp], tmp = run */
- error = xfarray_sort_load(si, run, a);
- if (error)
- return error;
- if (xfarray_sort_cmp(si, a, b) < 0) {
- tmp = run;
- memcpy(b, a, si->array->obj_size);
- }
- if (xfarray_sort_terminated(si, &error))
- return error;
- }
+ xfarray_sort_bump_heapsorts(si);
+ sort(scratch, hi - lo + 1, si->array->obj_size, si->cmp_fn, NULL);
- /*
- * The smallest element is a[tmp]; swap with a[lo] if tmp != lo.
- * Recall that a[tmp] is already in *b.
- */
- if (tmp != lo) {
- error = xfarray_sort_load(si, lo, a);
- if (error)
- return error;
- error = xfarray_sort_store(si, tmp, a);
- if (error)
- return error;
- error = xfarray_sort_store(si, lo, b);
- if (error)
- return error;
- }
-
- /*
- * Perform an insertion sort on a[lo+1..hi]. We already made sure
- * that the smallest value in the original range is now in a[lo],
- * so the inner loop should never underflow.
- *
- * For each a[lo+2..hi], make sure it's in the correct position
- * with respect to the elements that came before it.
- */
- for (run = lo + 2; run <= hi; run++) {
- error = xfarray_sort_load(si, run, a);
- if (error)
- return error;
-
- /*
- * Find the correct place for a[run] by walking leftwards
- * towards the start of the range until a[tmp] is no longer
- * greater than a[run].
- */
- tmp = run - 1;
- error = xfarray_sort_load(si, tmp, b);
- if (error)
- return error;
- while (xfarray_sort_cmp(si, a, b) < 0) {
- tmp--;
- error = xfarray_sort_load(si, tmp, b);
- if (error)
- return error;
-
- if (xfarray_sort_terminated(si, &error))
- return error;
- }
- tmp++;
-
- /*
- * If tmp != run, then a[tmp..run-1] are all less than a[run],
- * so right barrel roll a[tmp..run] to get this range in
- * sorted order.
- */
- if (tmp == run)
- continue;
-
- for (i = run; i >= tmp; i--) {
- error = xfarray_sort_load(si, i - 1, b);
- if (error)
- return error;
- error = xfarray_sort_store(si, i, b);
- if (error)
- return error;
-
- if (xfarray_sort_terminated(si, &error))
- return error;
- }
- error = xfarray_sort_store(si, tmp, a);
- if (error)
- return error;
-
- if (xfarray_sort_terminated(si, &error))
- return error;
- }
-
- return 0;
+ xfarray_sort_bump_stores(si);
+ return xfile_obj_store(si->array->xfile, scratch, len, lo_pos);
}
/* Return a pointer to the xfarray pivot record within the sortinfo struct. */
@@ -784,9 +697,8 @@ xfarray_qsort_push(
* current stack frame. This guarantees that we won't need more than
* log2(nr) stack space.
*
- * 4. Use insertion sort for small sets since since insertion sort is faster
- * for small, mostly sorted array segments. In the author's experience,
- * substituting insertion sort for arrays smaller than 8 elements yields
+ * 4. For small sets, load the records into the scratchpad and run heapsort on
+ * them because that is very fast. In the author's experience, this yields
* a ~10% reduction in runtime.
*/
diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h
index b0cf818c6a7f..f49c1afe24a1 100644
--- a/fs/xfs/scrub/xfarray.h
+++ b/fs/xfs/scrub/xfarray.h
@@ -59,6 +59,10 @@ int xfarray_load_next(struct xfarray *array, xfarray_idx_t *idx, void *rec);
typedef cmp_func_t xfarray_cmp_fn;
+/* Perform an in-memory heapsort for small subsets. */
+#define XFARRAY_ISORT_SHIFT (4)
+#define XFARRAY_ISORT_NR (1U << XFARRAY_ISORT_SHIFT)
+
struct xfarray_sortinfo {
struct xfarray *array;
@@ -82,6 +86,7 @@ struct xfarray_sortinfo {
uint64_t loads;
uint64_t stores;
uint64_t compares;
+ uint64_t heapsorts;
#endif
/*
@@ -100,11 +105,10 @@ struct xfarray_sortinfo {
*
* union {
*
- * If for a given subset we decide to use an insertion sort, we use the
- * scratchpad record after the xfarray and a second scratchpad record
- * here to compare items:
+ * If for a given subset we decide to use an in-memory sort, we use a
+ * block of scratchpad records here to compare items:
*
- * xfarray_rec_t scratch;
+ * xfarray_rec_t scratch[ISORT_NR];
*
* Otherwise, we want to partition the records to partition the array.
* We store the chosen pivot record here and use the xfarray scratchpad
* [PATCH 4/7] xfs: teach xfile to pass back direct-map pages to caller
2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
` (2 preceding siblings ...)
2022-12-30 22:12 ` [PATCH 7/7] xfs: improve xfarray quicksort pivot Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
2022-12-30 22:12 ` [PATCH 3/7] xfs: convert xfarray insertion sort to heapsort using scratchpad memory Darrick J. Wong
` (2 subsequent siblings)
6 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
Certain xfile array operations (such as sorting) can be sped up quite a
bit by allowing xfile users to grab a page to bulk-read the records
contained within it. Create helper methods to facilitate this.
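Roughly speaking, callers pair the two helpers like this (a hedged
sketch; the object must not cross a page boundary, or xfile_get_page
returns -ENOTBLK and the caller falls back to the regular read path):

	struct xfile_page	xfpage = { };
	void			*kaddr;
	int			error;

	error = xfile_get_page(xf, pos, len, &xfpage);
	if (error)
		return error;

	/* xfile pages are never mapped into userspace; no dcache flush. */
	kaddr = kmap_local_page(xfpage.page);
	memcpy(buf, kaddr + offset_in_page(pos), len);
	kunmap_local(kaddr);

	return xfile_put_page(xf, &xfpage);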
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/trace.h | 2 +
fs/xfs/scrub/xfile.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/xfile.h | 10 +++++
3 files changed, 120 insertions(+)
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 9de9d4f795e8..79b844c969df 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -824,6 +824,8 @@ DEFINE_EVENT(xfile_class, name, \
DEFINE_XFILE_EVENT(xfile_pread);
DEFINE_XFILE_EVENT(xfile_pwrite);
DEFINE_XFILE_EVENT(xfile_seek_data);
+DEFINE_XFILE_EVENT(xfile_get_page);
+DEFINE_XFILE_EVENT(xfile_put_page);
TRACE_EVENT(xfarray_create,
TP_PROTO(struct xfarray *xfa, unsigned long long required_capacity),
diff --git a/fs/xfs/scrub/xfile.c b/fs/xfs/scrub/xfile.c
index 43455aa78243..7090a8e12b60 100644
--- a/fs/xfs/scrub/xfile.c
+++ b/fs/xfs/scrub/xfile.c
@@ -316,3 +316,111 @@ xfile_stat(
statbuf->bytes = ks.blocks << SECTOR_SHIFT;
return 0;
}
+
+/*
+ * Grab the (locked) page for a memory object. The object cannot span a page
+ * boundary. Returns 0 (and a locked page) if successful, -ENOTBLK if we
+ * cannot grab the page, or the usual negative errno.
+ */
+int
+xfile_get_page(
+ struct xfile *xf,
+ loff_t pos,
+ unsigned int len,
+ struct xfile_page *xfpage)
+{
+ struct inode *inode = file_inode(xf->file);
+ struct address_space *mapping = inode->i_mapping;
+ const struct address_space_operations *aops = mapping->a_ops;
+ struct page *page = NULL;
+ void *fsdata = NULL;
+ loff_t key = round_down(pos, PAGE_SIZE);
+ unsigned int pflags;
+ int error;
+
+ if (inode->i_sb->s_maxbytes - pos < len)
+ return -ENOMEM;
+ if (len > PAGE_SIZE - offset_in_page(pos))
+ return -ENOTBLK;
+
+ trace_xfile_get_page(xf, pos, len);
+
+ pflags = memalloc_nofs_save();
+
+ /*
+ * We call write_begin directly here to avoid all the freezer
+ * protection lock-taking that happens in the normal path. shmem
+ * doesn't support fs freeze, but lockdep doesn't know that and will
+ * trip over that.
+ */
+ error = aops->write_begin(NULL, mapping, key, PAGE_SIZE, &page,
+ &fsdata);
+ if (error)
+ goto out_pflags;
+
+ /* We got the page, so make sure we push out EOF. */
+ if (i_size_read(inode) < pos + len)
+ i_size_write(inode, pos + len);
+
+ /*
+ * If the page isn't up to date, fill it with zeroes before we hand it
+ * to the caller and make sure the backing store will hold on to them.
+ */
+ if (!PageUptodate(page)) {
+ void *kaddr;
+
+ kaddr = kmap_local_page(page);
+ memset(kaddr, 0, PAGE_SIZE);
+ kunmap_local(kaddr);
+ SetPageUptodate(page);
+ }
+
+ /*
+ * Mark each page dirty so that the contents are written to some
+ * backing store when we drop this buffer, and take an extra reference
+ * to prevent the xfile page from being swapped or removed from the
+ * page cache by reclaim if the caller unlocks the page.
+ */
+ set_page_dirty(page);
+ get_page(page);
+
+ xfpage->page = page;
+ xfpage->fsdata = fsdata;
+ xfpage->pos = key;
+out_pflags:
+ memalloc_nofs_restore(pflags);
+ return error;
+}
+
+/*
+ * Release the (locked) page for a memory object. Returns 0 or a negative
+ * errno.
+ */
+int
+xfile_put_page(
+ struct xfile *xf,
+ struct xfile_page *xfpage)
+{
+ struct inode *inode = file_inode(xf->file);
+ struct address_space *mapping = inode->i_mapping;
+ const struct address_space_operations *aops = mapping->a_ops;
+ unsigned int pflags;
+ int ret;
+
+ trace_xfile_put_page(xf, xfpage->pos, PAGE_SIZE);
+
+ /* Give back the reference that we took in xfile_get_page. */
+ put_page(xfpage->page);
+
+ pflags = memalloc_nofs_save();
+ ret = aops->write_end(NULL, mapping, xfpage->pos, PAGE_SIZE, PAGE_SIZE,
+ xfpage->page, xfpage->fsdata);
+ memalloc_nofs_restore(pflags);
+ memset(xfpage, 0, sizeof(struct xfile_page));
+
+ if (ret < 0)
+ return ret;
+ if (ret != PAGE_SIZE)
+ return -EIO;
+ return 0;
+}
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
index b37dba1961d8..e34ab9c4aad9 100644
--- a/fs/xfs/scrub/xfile.h
+++ b/fs/xfs/scrub/xfile.h
@@ -6,6 +6,12 @@
#ifndef __XFS_SCRUB_XFILE_H__
#define __XFS_SCRUB_XFILE_H__
+struct xfile_page {
+ struct page *page;
+ void *fsdata;
+ loff_t pos;
+};
+
struct xfile {
struct file *file;
};
@@ -55,4 +61,8 @@ struct xfile_stat {
int xfile_stat(struct xfile *xf, struct xfile_stat *statbuf);
+int xfile_get_page(struct xfile *xf, loff_t offset, unsigned int len,
+ struct xfile_page *xbuf);
+int xfile_put_page(struct xfile *xf, struct xfile_page *xbuf);
+
#endif /* __XFS_SCRUB_XFILE_H__ */
* [PATCH 6/7] xfs: cache pages used for xfarray quicksort convergence
2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
2022-12-30 22:12 ` [PATCH 2/7] xfs: enable sorting of xfile-backed arrays Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
2022-12-30 22:12 ` [PATCH 7/7] xfs: improve xfarray quicksort pivot Darrick J. Wong
` (4 subsequent siblings)
6 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
After quicksort picks a pivot item for a particular subsort, it walks
the records in that subset from the outside in, rearranging them so that
every record less than the pivot comes before it, and every record
greater than the pivot comes after it. This scan has a lot of locality,
so we can speed it up quite a bit by grabbing the xfile backing page and
holding onto it as long as we possibly can. Doing so reduces the
runtime by another 5% on the author's computer.
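The heart of the optimization is a cheap check before each load,
condensed here from xfarray_sort_load_cached in the diff below (error
handling and the page-spanning fallback elided):

	pgoff_t	wantpage = idx_pos >> PAGE_SHIFT;

	/* Release the cached page only if it doesn't cover this record. */
	if (xfile_page_cached(&si->xfpage) &&
	    xfile_page_index(&si->xfpage) != wantpage)
		xfarray_sort_put_page(si);

	/* Map the wanted page once; later loads become plain memcpys. */
	if (!xfile_page_cached(&si->xfpage))
		xfarray_sort_get_page(si, wantpage << PAGE_SHIFT, PAGE_SIZE);

	memcpy(ptr, si->page_kaddr + offset_in_page(idx_pos),
			si->array->obj_size);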
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/xfarray.c | 86 ++++++++++++++++++++++++++++++++++++++++++------
fs/xfs/scrub/xfile.h | 10 ++++++
2 files changed, 86 insertions(+), 10 deletions(-)
diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
index 08479be07fda..3e232ee5e7e6 100644
--- a/fs/xfs/scrub/xfarray.c
+++ b/fs/xfs/scrub/xfarray.c
@@ -760,6 +760,66 @@ xfarray_qsort_push(
return 0;
}
+/*
+ * Load an element from the array into the first scratchpad and cache the page,
+ * if possible.
+ */
+static inline int
+xfarray_sort_load_cached(
+ struct xfarray_sortinfo *si,
+ xfarray_idx_t idx,
+ void *ptr)
+{
+ loff_t idx_pos = xfarray_pos(si->array, idx);
+ pgoff_t startpage;
+ pgoff_t endpage;
+ int error = 0;
+
+ /*
+ * If this load would split a page, release the cached page, if any,
+ * and perform a traditional read.
+ */
+ startpage = idx_pos >> PAGE_SHIFT;
+ endpage = (idx_pos + si->array->obj_size - 1) >> PAGE_SHIFT;
+ if (startpage != endpage) {
+ error = xfarray_sort_put_page(si);
+ if (error)
+ return error;
+
+ if (xfarray_sort_terminated(si, &error))
+ return error;
+
+ return xfile_obj_load(si->array->xfile, ptr,
+ si->array->obj_size, idx_pos);
+ }
+
+ /* If the cached page is not the one we want, release it. */
+ if (xfile_page_cached(&si->xfpage) &&
+ xfile_page_index(&si->xfpage) != startpage) {
+ error = xfarray_sort_put_page(si);
+ if (error)
+ return error;
+ }
+
+ /*
+ * If we don't have a cached page (and we know the load is contained
+ * in a single page) then grab it.
+ */
+ if (!xfile_page_cached(&si->xfpage)) {
+ if (xfarray_sort_terminated(si, &error))
+ return error;
+
+ error = xfarray_sort_get_page(si, startpage << PAGE_SHIFT,
+ PAGE_SIZE);
+ if (error)
+ return error;
+ }
+
+ memcpy(ptr, si->page_kaddr + offset_in_page(idx_pos),
+ si->array->obj_size);
+ return 0;
+}
+
/*
* Sort the array elements via quicksort. This implementation incorporates
* four optimizations discussed in Sedgewick:
@@ -785,6 +845,10 @@ xfarray_qsort_push(
* If a small set is contained entirely within a single xfile memory page,
* map the page directly and run heap sort directly on the xfile page
* instead of using the load/store interface. This halves the runtime.
+ *
+ * 5. This optimization is specific to the implementation. When converging lo
+ * and hi after selecting a pivot, we will try to retain the xfile memory
+ * page between load calls, which reduces run time by 50%.
*/
/*
@@ -866,19 +930,20 @@ xfarray_sort(
* Decrement hi until it finds an a[hi] less than the
* pivot value.
*/
- error = xfarray_sort_load(si, hi, scratch);
+ error = xfarray_sort_load_cached(si, hi, scratch);
if (error)
goto out_free;
while (xfarray_sort_cmp(si, scratch, pivot) >= 0 &&
lo < hi) {
- if (xfarray_sort_terminated(si, &error))
- goto out_free;
-
hi--;
- error = xfarray_sort_load(si, hi, scratch);
+ error = xfarray_sort_load_cached(si, hi,
+ scratch);
if (error)
goto out_free;
}
+ error = xfarray_sort_put_page(si);
+ if (error)
+ goto out_free;
if (xfarray_sort_terminated(si, &error))
goto out_free;
@@ -894,19 +959,20 @@ xfarray_sort(
* Increment lo until it finds an a[lo] greater than
* the pivot value.
*/
- error = xfarray_sort_load(si, lo, scratch);
+ error = xfarray_sort_load_cached(si, lo, scratch);
if (error)
goto out_free;
while (xfarray_sort_cmp(si, scratch, pivot) <= 0 &&
lo < hi) {
- if (xfarray_sort_terminated(si, &error))
- goto out_free;
-
lo++;
- error = xfarray_sort_load(si, lo, scratch);
+ error = xfarray_sort_load_cached(si, lo,
+ scratch);
if (error)
goto out_free;
}
+ error = xfarray_sort_put_page(si);
+ if (error)
+ goto out_free;
if (xfarray_sort_terminated(si, &error))
goto out_free;
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
index e34ab9c4aad9..0172bd9eeab0 100644
--- a/fs/xfs/scrub/xfile.h
+++ b/fs/xfs/scrub/xfile.h
@@ -12,6 +12,16 @@ struct xfile_page {
loff_t pos;
};
+static inline bool xfile_page_cached(const struct xfile_page *xfpage)
+{
+ return xfpage->page != NULL;
+}
+
+static inline pgoff_t xfile_page_index(const struct xfile_page *xfpage)
+{
+ return xfpage->page->index;
+}
+
struct xfile {
struct file *file;
};
* [PATCH 5/7] xfs: speed up xfarray sort by sorting xfile page contents directly
2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
` (4 preceding siblings ...)
2022-12-30 22:12 ` [PATCH 3/7] xfs: convert xfarray insertion sort to heapsort using scratchpad memory Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
2022-12-30 22:12 ` [PATCH 1/7] xfs: create a big array data structure Darrick J. Wong
6 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
If all the records in an xfarray subset live within the same memory
page, we can short-circuit even more quicksort recursion by mapping that
page into the local CPU and using the kernel's heapsort function to sort
the subset. On the author's computer, this reduces the runtime by
another 15% on a 500,000 element array.
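The eligibility test is a simple page-number comparison -- a sketch of
the xfarray_want_pagesort logic added below:

	/*
	 * The whole subset, including the end of the last record, must
	 * land in a single memory page for us to map and sort it there.
	 */
	lo_page = xfarray_pos(si->array, lo) >> PAGE_SHIFT;
	hi_page = (xfarray_pos(si->array, hi) + si->array->obj_size - 1)
			>> PAGE_SHIFT;
	if (lo_page == hi_page)
		return xfarray_pagesort(si, lo, hi);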
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/trace.h | 20 ++++++++++
fs/xfs/scrub/xfarray.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/xfarray.h | 4 ++
3 files changed, 121 insertions(+)
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 79b844c969df..2431083b9f91 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -872,6 +872,26 @@ TRACE_EVENT(xfarray_isort,
__entry->hi - __entry->lo)
);
+TRACE_EVENT(xfarray_pagesort,
+ TP_PROTO(struct xfarray_sortinfo *si, uint64_t lo, uint64_t hi),
+ TP_ARGS(si, lo, hi),
+ TP_STRUCT__entry(
+ __field(unsigned long, ino)
+ __field(unsigned long long, lo)
+ __field(unsigned long long, hi)
+ ),
+ TP_fast_assign(
+ __entry->ino = file_inode(si->array->xfile->file)->i_ino;
+ __entry->lo = lo;
+ __entry->hi = hi;
+ ),
+ TP_printk("xfino 0x%lx lo %llu hi %llu elts %llu",
+ __entry->ino,
+ __entry->lo,
+ __entry->hi,
+ __entry->hi - __entry->lo)
+);
+
TRACE_EVENT(xfarray_qsort,
TP_PROTO(struct xfarray_sortinfo *si, uint64_t lo, uint64_t hi),
TP_ARGS(si, lo, hi),
diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
index 171c40d04e6c..08479be07fda 100644
--- a/fs/xfs/scrub/xfarray.c
+++ b/fs/xfs/scrub/xfarray.c
@@ -546,6 +546,87 @@ xfarray_isort(
return xfile_obj_store(si->array->xfile, scratch, len, lo_pos);
}
+/* Grab a page for sorting records. */
+static inline int
+xfarray_sort_get_page(
+ struct xfarray_sortinfo *si,
+ loff_t pos,
+ uint64_t len)
+{
+ int error;
+
+ error = xfile_get_page(si->array->xfile, pos, len, &si->xfpage);
+ if (error)
+ return error;
+
+ /*
+ * xfile pages must never be mapped into userspace, so we skip the
+ * dcache flush when mapping the page.
+ */
+ si->page_kaddr = kmap_local_page(si->xfpage.page);
+ return 0;
+}
+
+/* Release a page we grabbed for sorting records. */
+static inline int
+xfarray_sort_put_page(
+ struct xfarray_sortinfo *si)
+{
+ if (!si->page_kaddr)
+ return 0;
+
+ kunmap_local(si->page_kaddr);
+ si->page_kaddr = NULL;
+
+ return xfile_put_page(si->array->xfile, &si->xfpage);
+}
+
+/* Decide if these records are eligible for in-page sorting. */
+static inline bool
+xfarray_want_pagesort(
+ struct xfarray_sortinfo *si,
+ xfarray_idx_t lo,
+ xfarray_idx_t hi)
+{
+ pgoff_t lo_page;
+ pgoff_t hi_page;
+ loff_t end_pos;
+
+ /* We can only map one page at a time. */
+ lo_page = xfarray_pos(si->array, lo) >> PAGE_SHIFT;
+ end_pos = xfarray_pos(si->array, hi) + si->array->obj_size - 1;
+ hi_page = end_pos >> PAGE_SHIFT;
+
+ return lo_page == hi_page;
+}
+
+/* Sort a bunch of records that all live in the same memory page. */
+STATIC int
+xfarray_pagesort(
+ struct xfarray_sortinfo *si,
+ xfarray_idx_t lo,
+ xfarray_idx_t hi)
+{
+ void *startp;
+ loff_t lo_pos = xfarray_pos(si->array, lo);
+ uint64_t len = xfarray_pos(si->array, hi - lo);
+ int error = 0;
+
+ trace_xfarray_pagesort(si, lo, hi);
+
+ xfarray_sort_bump_loads(si);
+ error = xfarray_sort_get_page(si, lo_pos, len);
+ if (error)
+ return error;
+
+ xfarray_sort_bump_heapsorts(si);
+ startp = si->page_kaddr + offset_in_page(lo_pos);
+ sort(startp, hi - lo + 1, si->array->obj_size, si->cmp_fn, NULL);
+
+ xfarray_sort_bump_stores(si);
+ return xfarray_sort_put_page(si);
+}
+
/* Return a pointer to the xfarray pivot record within the sortinfo struct. */
static inline void *xfarray_sortinfo_pivot(struct xfarray_sortinfo *si)
{
@@ -700,6 +781,10 @@ xfarray_qsort_push(
* 4. For small sets, load the records into the scratchpad and run heapsort on
* them because that is very fast. In the author's experience, this yields
* a ~10% reduction in runtime.
+ *
+ * If a small set is contained entirely within a single xfile memory page,
+ * map the page directly and run heap sort directly on the xfile page
+ * instead of using the load/store interface. This halves the runtime.
*/
/*
@@ -745,6 +830,18 @@ xfarray_sort(
continue;
}
+ /*
+ * If directly mapping the page and sorting can solve our
+ * problems, we're done.
+ */
+ if (xfarray_want_pagesort(si, lo, hi)) {
+ error = xfarray_pagesort(si, lo, hi);
+ if (error)
+ goto out_free;
+ si->stack_depth--;
+ continue;
+ }
+
/* If insertion sort can solve our problems, we're done. */
if (xfarray_want_isort(si, lo, hi)) {
error = xfarray_isort(si, lo, hi);
diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h
index f49c1afe24a1..e8a4523bf2de 100644
--- a/fs/xfs/scrub/xfarray.h
+++ b/fs/xfs/scrub/xfarray.h
@@ -81,6 +81,10 @@ struct xfarray_sortinfo {
/* XFARRAY_SORT_* flags; see below. */
unsigned int flags;
+ /* Cache a page here for faster access. */
+ struct xfile_page xfpage;
+ void *page_kaddr;
+
#ifdef DEBUG
/* Performance statistics. */
uint64_t loads;
* [PATCH 7/7] xfs: improve xfarray quicksort pivot
2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
2022-12-30 22:12 ` [PATCH 2/7] xfs: enable sorting of xfile-backed arrays Darrick J. Wong
2022-12-30 22:12 ` [PATCH 6/7] xfs: cache pages used for xfarray quicksort convergence Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
2022-12-30 22:12 ` [PATCH 4/7] xfs: teach xfile to pass back direct-map pages to caller Darrick J. Wong
` (3 subsequent siblings)
6 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
Now that we have the means to do in-memory sorts of small subsets of an
xfarray, use that to improve the quicksort pivot algorithm by reading
nine records into memory and finding the median of those. This
should prevent bad partitioning when a[lo] and a[hi] end up next to each
other in the final sort, which can happen when sorting for cntbt repair
when the free space is extremely fragmented (e.g. generic/176).
This doesn't speed up the average quicksort run by much, but it will
(hopefully) avoid the quadratic time collapse for which quicksort is
famous.
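Conceptually, the new pivot selection works as sketched below. This
condenses xfarray_qsort_pivot from the diff that follows; the real code
also records each sample's xfarray index alongside the record so that
the chosen pivot can be swapped into a[lo] afterwards:

	xfarray_idx_t	step = (hi - lo) / (XFARRAY_QSORT_PIVOT_NR - 1);

	/*
	 * Sample evenly spaced records, pinning the first and last
	 * samples to a[lo] and a[hi].
	 */
	for (i = 0; i < XFARRAY_QSORT_PIVOT_NR; i++) {
		recp = xfarray_pivot_array_rec(parray, pivot_rec_sz, i);
		error = xfarray_sort_load(si,
				min_t(xfarray_idx_t, lo + (i * step), hi),
				recp);
		if (error)
			return error;
	}

	/* Sort the samples in memory... */
	sort(parray, XFARRAY_QSORT_PIVOT_NR, pivot_rec_sz, si->cmp_fn, NULL);

	/* ...and take the median sample as the pivot. */
	memcpy(pivot, xfarray_pivot_array_rec(parray, pivot_rec_sz,
			XFARRAY_QSORT_PIVOT_NR / 2), si->array->obj_size);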
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/xfarray.c | 198 ++++++++++++++++++++++++++++++++----------------
fs/xfs/scrub/xfarray.h | 19 +++--
2 files changed, 148 insertions(+), 69 deletions(-)
diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
index 3e232ee5e7e6..ce1365144209 100644
--- a/fs/xfs/scrub/xfarray.c
+++ b/fs/xfs/scrub/xfarray.c
@@ -428,6 +428,14 @@ static inline xfarray_idx_t *xfarray_sortinfo_hi(struct xfarray_sortinfo *si)
return xfarray_sortinfo_lo(si) + si->max_stack_depth;
}
+/* Size of each element in the quicksort pivot array. */
+static inline size_t
+xfarray_pivot_rec_sz(
+ struct xfarray *array)
+{
+ return round_up(array->obj_size, 8) + sizeof(xfarray_idx_t);
+}
+
/* Allocate memory to handle the sort. */
static inline int
xfarray_sortinfo_alloc(
@@ -438,8 +446,16 @@ xfarray_sortinfo_alloc(
{
struct xfarray_sortinfo *si;
size_t nr_bytes = sizeof(struct xfarray_sortinfo);
+ size_t pivot_rec_sz = xfarray_pivot_rec_sz(array);
int max_stack_depth;
+ /*
+ * The median-of-nine pivot algorithm doesn't work if a subset has
+ * fewer than 9 items. Make sure the in-memory sort will always take
+ * over for subsets where this wouldn't be the case.
+ */
+ BUILD_BUG_ON(XFARRAY_QSORT_PIVOT_NR >= XFARRAY_ISORT_NR);
+
/*
* Tail-call recursion during the partitioning phase means that
* quicksort will never recurse more than log2(nr) times. We need one
@@ -454,8 +470,10 @@ xfarray_sortinfo_alloc(
/* Each level of quicksort uses a lo and a hi index */
nr_bytes += max_stack_depth * sizeof(xfarray_idx_t) * 2;
- /* Scratchpad for in-memory sort, or one record for the pivot */
- nr_bytes += (XFARRAY_ISORT_NR * array->obj_size);
+ /* Scratchpad for in-memory sort, or finding the pivot */
+ nr_bytes += max_t(size_t,
+ (XFARRAY_QSORT_PIVOT_NR + 1) * pivot_rec_sz,
+ XFARRAY_ISORT_NR * array->obj_size);
si = kvzalloc(nr_bytes, XCHK_GFP_FLAGS);
if (!si)
@@ -633,14 +651,43 @@ static inline void *xfarray_sortinfo_pivot(struct xfarray_sortinfo *si)
return xfarray_sortinfo_hi(si) + si->max_stack_depth;
}
+/* Return a pointer to the start of the pivot array. */
+static inline void *
+xfarray_sortinfo_pivot_array(
+ struct xfarray_sortinfo *si)
+{
+ return xfarray_sortinfo_pivot(si) + si->array->obj_size;
+}
+
+/* The xfarray record is stored at the start of each pivot array element. */
+static inline void *
+xfarray_pivot_array_rec(
+ void *pa,
+ size_t pa_recsz,
+ unsigned int pa_idx)
+{
+ return pa + (pa_recsz * pa_idx);
+}
+
+/* The xfarray index is stored at the end of each pivot array element. */
+static inline xfarray_idx_t *
+xfarray_pivot_array_idx(
+ void *pa,
+ size_t pa_recsz,
+ unsigned int pa_idx)
+{
+ return xfarray_pivot_array_rec(pa, pa_recsz, pa_idx + 1) -
+ sizeof(xfarray_idx_t);
+}
+
/*
* Find a pivot value for quicksort partitioning, swap it with a[lo], and save
* the cached pivot record for the next step.
*
- * Select the median value from a[lo], a[mid], and a[hi]. Put the median in
- * a[lo], the lowest in a[mid], and the highest in a[hi]. Using the median of
- * the three reduces the chances that we pick the worst case pivot value, since
- * it's likely that our array values are nearly sorted.
+ * Load evenly-spaced records within the given range into memory, sort them,
+ * and choose the pivot from the median record. Using multiple points will
+ * improve the quality of the pivot selection, and hopefully avoid the worst
+ * quicksort behavior, since our array values are almost always nearly sorted.
*/
STATIC int
xfarray_qsort_pivot(
@@ -648,76 +695,99 @@ xfarray_qsort_pivot(
xfarray_idx_t lo,
xfarray_idx_t hi)
{
- void *a = xfarray_sortinfo_pivot(si);
- void *b = xfarray_scratch(si->array);
- xfarray_idx_t mid = lo + ((hi - lo) / 2);
+ void *pivot = xfarray_sortinfo_pivot(si);
+ void *parray = xfarray_sortinfo_pivot_array(si);
+ void *recp;
+ xfarray_idx_t *idxp;
+ xfarray_idx_t step = (hi - lo) / (XFARRAY_QSORT_PIVOT_NR - 1);
+ size_t pivot_rec_sz = xfarray_pivot_rec_sz(si->array);
+ int i, j;
int error;
- /* if a[mid] < a[lo], swap a[mid] and a[lo]. */
- error = xfarray_sort_load(si, mid, a);
- if (error)
- return error;
- error = xfarray_sort_load(si, lo, b);
- if (error)
- return error;
- if (xfarray_sort_cmp(si, a, b) < 0) {
- error = xfarray_sort_store(si, lo, a);
- if (error)
- return error;
- error = xfarray_sort_store(si, mid, b);
- if (error)
- return error;
- }
+ ASSERT(step > 0);
- /* if a[hi] < a[mid], swap a[mid] and a[hi]. */
- error = xfarray_sort_load(si, hi, a);
- if (error)
- return error;
- error = xfarray_sort_load(si, mid, b);
- if (error)
- return error;
- if (xfarray_sort_cmp(si, a, b) < 0) {
- error = xfarray_sort_store(si, mid, a);
- if (error)
- return error;
- error = xfarray_sort_store(si, hi, b);
- if (error)
- return error;
- } else {
- goto move_front;
+ /*
+ * Load the xfarray indexes of the records we intend to sample into the
+ * pivot array.
+ */
+ idxp = xfarray_pivot_array_idx(parray, pivot_rec_sz, 0);
+ *idxp = lo;
+ for (i = 1; i < XFARRAY_QSORT_PIVOT_NR - 1; i++) {
+ idxp = xfarray_pivot_array_idx(parray, pivot_rec_sz, i);
+ *idxp = lo + (i * step);
}
+ idxp = xfarray_pivot_array_idx(parray, pivot_rec_sz,
+ XFARRAY_QSORT_PIVOT_NR - 1);
+ *idxp = hi;
- /* if a[mid] < a[lo], swap a[mid] and a[lo]. */
- error = xfarray_sort_load(si, mid, a);
- if (error)
- return error;
- error = xfarray_sort_load(si, lo, b);
- if (error)
- return error;
- if (xfarray_sort_cmp(si, a, b) < 0) {
- error = xfarray_sort_store(si, lo, a);
- if (error)
- return error;
- error = xfarray_sort_store(si, mid, b);
+ /* Load the selected xfarray records into the pivot array. */
+ for (i = 0; i < XFARRAY_QSORT_PIVOT_NR; i++) {
+ xfarray_idx_t idx;
+
+ recp = xfarray_pivot_array_rec(parray, pivot_rec_sz, i);
+ idxp = xfarray_pivot_array_idx(parray, pivot_rec_sz, i);
+
+ /* No unset records; load directly into the array. */
+ if (likely(si->array->unset_slots == 0)) {
+ error = xfarray_sort_load(si, *idxp, recp);
+ if (error)
+ return error;
+ continue;
+ }
+
+ /*
+ * Load non-null records into the scratchpad without changing
+ * the xfarray_idx_t in the pivot array.
+ */
+ idx = *idxp;
+ xfarray_sort_bump_loads(si);
+ error = xfarray_load_next(si->array, &idx, recp);
if (error)
return error;
}
-move_front:
+ xfarray_sort_bump_heapsorts(si);
+ sort(parray, XFARRAY_QSORT_PIVOT_NR, pivot_rec_sz, si->cmp_fn, NULL);
+
/*
- * Move our selected pivot to a[lo]. Recall that a == si->pivot, so
- * this leaves us with the pivot cached in the sortinfo structure.
+ * We sorted the pivot array records (which includes the xfarray
+ * indices) in xfarray record order. The median element of the pivot
+ * array contains the xfarray record that we will use as the pivot.
+ * Copy that xfarray record to the designated space.
*/
- error = xfarray_sort_load(si, lo, b);
- if (error)
- return error;
- error = xfarray_sort_load(si, mid, a);
- if (error)
- return error;
- error = xfarray_sort_store(si, mid, b);
+ recp = xfarray_pivot_array_rec(parray, pivot_rec_sz,
+ XFARRAY_QSORT_PIVOT_NR / 2);
+ memcpy(pivot, recp, si->array->obj_size);
+
+ /* If the pivot record we chose was already in a[lo] then we're done. */
+ idxp = xfarray_pivot_array_idx(parray, pivot_rec_sz,
+ XFARRAY_QSORT_PIVOT_NR / 2);
+ if (*idxp == lo)
+ return 0;
+
+ /*
+ * Find the cached copy of a[lo] in the pivot array so that we can swap
+ * a[lo] and a[pivot].
+ */
+ for (i = 0, j = -1; i < XFARRAY_QSORT_PIVOT_NR; i++) {
+ idxp = xfarray_pivot_array_idx(parray, pivot_rec_sz, i);
+ if (*idxp == lo)
+ j = i;
+ }
+ if (j < 0) {
+ ASSERT(j >= 0);
+ return -EFSCORRUPTED;
+ }
+
+ /* Swap a[lo] and a[pivot]. */
+ error = xfarray_sort_store(si, lo, pivot);
if (error)
return error;
- return xfarray_sort_store(si, lo, a);
+
+ recp = xfarray_pivot_array_rec(parray, pivot_rec_sz, j);
+ idxp = xfarray_pivot_array_idx(parray, pivot_rec_sz,
+ XFARRAY_QSORT_PIVOT_NR / 2);
+ return xfarray_sort_store(si, *idxp, recp);
}
/*
@@ -829,7 +899,7 @@ xfarray_sort_load_cached(
* particularly expensive in the kernel.
*
* 2. For arrays with records in arbitrary or user-controlled order, choose the
- * pivot element using a median-of-three decision tree. This reduces the
+ * pivot element using a median-of-nine decision tree. This reduces the
* probability of selecting a bad pivot value which causes worst case
* behavior (i.e. partition sizes of 1).
*
diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h
index e8a4523bf2de..69f0c922c98a 100644
--- a/fs/xfs/scrub/xfarray.h
+++ b/fs/xfs/scrub/xfarray.h
@@ -63,6 +63,9 @@ typedef cmp_func_t xfarray_cmp_fn;
#define XFARRAY_ISORT_SHIFT (4)
#define XFARRAY_ISORT_NR (1U << XFARRAY_ISORT_SHIFT)
+/* Evaluate this many points to find the qsort pivot. */
+#define XFARRAY_QSORT_PIVOT_NR (9)
+
struct xfarray_sortinfo {
struct xfarray *array;
@@ -92,7 +95,6 @@ struct xfarray_sortinfo {
uint64_t compares;
uint64_t heapsorts;
#endif
-
/*
* Extra bytes are allocated beyond the end of the structure to store
* quicksort information. C does not permit multiple VLAs per struct,
@@ -115,11 +117,18 @@ struct xfarray_sortinfo {
* xfarray_rec_t scratch[ISORT_NR];
*
* Otherwise, we want to partition the records to partition the array.
- * We store the chosen pivot record here and use the xfarray scratchpad
- * to rearrange the array around the pivot:
- *
- * xfarray_rec_t pivot;
+ * We store the chosen pivot record at the start of the scratchpad area
+ * and use the rest to sample some records to estimate the median.
+ * The format of the qsort_pivot array enables us to use the kernel
+ * heapsort function to place the median value in the middle.
*
+ * struct {
+ * xfarray_rec_t pivot;
+ * struct {
+ * xfarray_rec_t rec; (rounded up to 8 bytes)
+ * xfarray_idx_t idx;
+ * } qsort_pivot[QSORT_PIVOT_NR];
+ * };
* }
*/
};
* [PATCHSET v24.0 0/7] xfs: support in-memory btrees
2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 4/7] xfs: consolidate btree block freeing tracepoints Darrick J. Wong
` (6 more replies)
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
2 siblings, 7 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
Hi all,
Online repair of the reverse-mapping btrees presents some unique
challenges. To construct a new reverse mapping btree, we must scan the
entire filesystem, but we cannot afford to quiesce the entire filesystem
for the potentially lengthy scan.
For rmap btrees, therefore, we relax our requirements of totally atomic
repairs. Instead, repairs will scan all inodes, construct a new reverse
mapping dataset, format a new btree, and commit it before anyone trips
over the corruption. This is exactly the same strategy as was used in
the quotacheck and nlink scanners.
Unfortunately, the xfarray cannot perform key-based lookups and is
therefore unsuitable for supporting live updates. Luckily, we already have a
data structure that maintains an indexed rmap recordset -- the existing
rmap btree code! Hence we port the existing btree and buffer target
code to be able to create a btree using the xfile we developed earlier.
Live hooks keep the in-memory btree up to date for any resources that
have already been scanned.
This approach is not maximally memory efficient, but we can use the same
rmap code that we do everywhere else, which provides improved stability
without growing the code base even more. Note that in-memory btree
blocks are always page sized.
This patchset modifies the kernel xfs buffer cache to be capable of
using an xfile (aka a shmem file) as a backing device. It then augments
the btree code to support creating btree cursors with buffers that come
from a buftarg other than the data device (namely an xfile-backed
buftarg). For the userspace xfs buffer cache, we instead use a memfd or
an O_TMPFILE file as a backing device.
If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.
This is an extraordinary way to destroy everything. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=in-memory-btrees
---
fs/xfs/Kconfig | 8
fs/xfs/Makefile | 1
fs/xfs/libxfs/xfs_btree.c | 173 ++++++--
fs/xfs/libxfs/xfs_btree.h | 17 +
fs/xfs/libxfs/xfs_btree_mem.h | 128 ++++++
fs/xfs/libxfs/xfs_refcount_btree.c | 4
fs/xfs/libxfs/xfs_rmap_btree.c | 4
fs/xfs/scrub/bitmap.c | 28 +
fs/xfs/scrub/bitmap.h | 3
fs/xfs/scrub/scrub.c | 4
fs/xfs/scrub/scrub.h | 3
fs/xfs/scrub/trace.c | 13 +
fs/xfs/scrub/trace.h | 110 +++++
fs/xfs/scrub/xfbtree.c | 816 ++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/xfbtree.h | 57 +++
fs/xfs/scrub/xfile.c | 181 ++++++++
fs/xfs/scrub/xfile.h | 65 +++
fs/xfs/xfs_aops.c | 5
fs/xfs/xfs_bmap_util.c | 8
fs/xfs/xfs_buf.c | 234 ++++++++--
fs/xfs/xfs_buf.h | 90 ++++
fs/xfs/xfs_discard.c | 8
fs/xfs/xfs_file.c | 6
fs/xfs/xfs_health.c | 3
fs/xfs/xfs_ioctl.c | 3
fs/xfs/xfs_iomap.c | 4
fs/xfs/xfs_log.c | 4
fs/xfs/xfs_log_cil.c | 3
fs/xfs/xfs_log_recover.c | 3
fs/xfs/xfs_super.c | 4
fs/xfs/xfs_trace.c | 3
fs/xfs/xfs_trace.h | 85 ++++
fs/xfs/xfs_trans.h | 1
fs/xfs/xfs_trans_buf.c | 42 ++
34 files changed, 2011 insertions(+), 110 deletions(-)
create mode 100644 fs/xfs/libxfs/xfs_btree_mem.h
create mode 100644 fs/xfs/scrub/xfbtree.c
create mode 100644 fs/xfs/scrub/xfbtree.h
* [PATCH 1/7] xfs: dump xfiles for debugging purposes
2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
2022-12-30 22:13 ` [PATCH 4/7] xfs: consolidate btree block freeing tracepoints Darrick J. Wong
2022-12-30 22:13 ` [PATCH 3/7] xfs: support in-memory buffer cache targets Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 2/7] xfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
` (3 subsequent siblings)
6 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
Add a debug function to dump an xfile's contents to dmesg.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/xfile.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/xfile.h | 2 +
2 files changed, 100 insertions(+)
diff --git a/fs/xfs/scrub/xfile.c b/fs/xfs/scrub/xfile.c
index 7090a8e12b60..1b858b6a53c8 100644
--- a/fs/xfs/scrub/xfile.c
+++ b/fs/xfs/scrub/xfile.c
@@ -424,3 +424,101 @@ xfile_put_page(
return -EIO;
return 0;
}
+
+/* Dump an xfile to dmesg. */
+int
+xfile_dump(
+ struct xfile *xf)
+{
+ struct xfile_stat sb;
+ struct inode *inode = file_inode(xf->file);
+ struct address_space *mapping = inode->i_mapping;
+ loff_t holepos = 0;
+ loff_t datapos;
+ loff_t ret;
+ unsigned int pflags;
+ bool all_zeroes = true;
+ int error = 0;
+
+ error = xfile_stat(xf, &sb);
+ if (error)
+ return error;
+
+ printk(KERN_ALERT "xfile ino 0x%lx isize 0x%llx dump:", inode->i_ino,
+ sb.size);
+ pflags = memalloc_nofs_save();
+
+ while ((ret = vfs_llseek(xf->file, holepos, SEEK_DATA)) >= 0) {
+ datapos = rounddown_64(ret, PAGE_SIZE);
+ ret = vfs_llseek(xf->file, datapos, SEEK_HOLE);
+ if (ret < 0)
+ break;
+ holepos = min_t(loff_t, sb.size, roundup_64(ret, PAGE_SIZE));
+
+ while (datapos < holepos) {
+ struct page *page = NULL;
+ void *p, *kaddr;
+ u64 datalen = holepos - datapos;
+ unsigned int pagepos;
+ unsigned int pagelen;
+
+ cond_resched();
+
+ if (fatal_signal_pending(current)) {
+ error = -EINTR;
+ goto out_pflags;
+ }
+
+ pagelen = min_t(u64, datalen, PAGE_SIZE);
+
+ page = shmem_read_mapping_page_gfp(mapping,
+ datapos >> PAGE_SHIFT, __GFP_NOWARN);
+ if (IS_ERR(page)) {
+ error = PTR_ERR(page);
+ if (error == -EIO)
+ printk(KERN_ALERT "%.8llx: poisoned",
+ datapos);
+ else if (error != -ENOMEM)
+ goto out_pflags;
+
+ goto next_pgoff;
+ }
+
+ if (!PageUptodate(page))
+ goto next_page;
+
+ kaddr = kmap_local_page(page);
+ p = kaddr;
+
+ for (pagepos = 0; pagepos < pagelen; pagepos += 16) {
+ char prefix[16];
+ unsigned int linelen;
+
+ linelen = min_t(unsigned int, pagelen, 16);
+
+ if (!memchr_inv(p + pagepos, 0, linelen))
+ continue;
+
+ snprintf(prefix, 16, "%.8llx: ",
+ datapos + pagepos);
+
+ all_zeroes = false;
+ print_hex_dump(KERN_ALERT, prefix,
+ DUMP_PREFIX_NONE, 16, 1,
+ p + pagepos, linelen, true);
+ }
+ kunmap_local(kaddr);
+next_page:
+ put_page(page);
+next_pgoff:
+ datapos += PAGE_SIZE;
+ }
+ }
+ if (all_zeroes)
+ printk(KERN_ALERT "<all zeroes>");
+ if (ret != -ENXIO)
+ error = ret;
+out_pflags:
+ memalloc_nofs_restore(pflags);
+ return error;
+}
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
index 0172bd9eeab0..b7f046016b1b 100644
--- a/fs/xfs/scrub/xfile.h
+++ b/fs/xfs/scrub/xfile.h
@@ -75,4 +75,6 @@ int xfile_get_page(struct xfile *xf, loff_t offset, unsigned int len,
struct xfile_page *xbuf);
int xfile_put_page(struct xfile *xf, struct xfile_page *xbuf);
+int xfile_dump(struct xfile *xf);
+
#endif /* __XFS_SCRUB_XFILE_H__ */
* [PATCH 2/7] xfs: teach buftargs to maintain their own buffer hashtable
2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
` (2 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 1/7] xfs: dump xfiles for debugging purposes Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 5/7] xfs: consolidate btree block allocation tracepoints Darrick J. Wong
` (2 subsequent siblings)
6 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
Currently, cached buffers are indexed by per-AG hashtables. This works
great for the data device, but won't work for in-memory btrees. Make it
so that buftargs can index buffers too. Introduce XFS_BSTATE_CACHED as
an explicit state flag for buffers that are cached in an rhashtable,
since we can't rely on b_pag being set for buffers that are cached but
not on behalf of an AG. We'll soon be using the buffer cache for
xfiles.
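The lookup-side change amounts to choosing which hashtable (and which
lock) to probe -- condensed here from the xfs_buf_get_map hunk below:

	if (btp->bt_flags & XFS_BUFTARG_SELF_CACHED) {
		/* in-memory buftargs index their own buffers */
		hashlock = &btp->bt_hashlock;
		bufhash = &btp->bt_bufhash;
	} else {
		/* data device buffers stay in the per-AG hashtables */
		pag = xfs_perag_get(btp->bt_mount,
				xfs_daddr_to_agno(btp->bt_mount, cmap.bm_bn));
		hashlock = &pag->pag_buf_lock;
		bufhash = &pag->pag_buf_hash;
	}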
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_buf.c | 142 ++++++++++++++++++++++++++++++++++++++++--------------
fs/xfs/xfs_buf.h | 9 +++
2 files changed, 113 insertions(+), 38 deletions(-)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 2bea2c3f9ead..7dfc1db566fa 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -570,7 +570,7 @@ xfs_buf_find_lock(
static inline int
xfs_buf_lookup(
- struct xfs_perag *pag,
+ struct rhashtable *bufhash,
struct xfs_buf_map *map,
xfs_buf_flags_t flags,
struct xfs_buf **bpp)
@@ -579,7 +579,7 @@ xfs_buf_lookup(
int error;
rcu_read_lock();
- bp = rhashtable_lookup(&pag->pag_buf_hash, map, xfs_buf_hash_params);
+ bp = rhashtable_lookup(bufhash, map, xfs_buf_hash_params);
if (!bp || !atomic_inc_not_zero(&bp->b_hold)) {
rcu_read_unlock();
return -ENOENT;
@@ -605,6 +605,8 @@ static int
xfs_buf_find_insert(
struct xfs_buftarg *btp,
struct xfs_perag *pag,
+ spinlock_t *hashlock,
+ struct rhashtable *bufhash,
struct xfs_buf_map *cmap,
struct xfs_buf_map *map,
int nmaps,
@@ -632,18 +634,18 @@ xfs_buf_find_insert(
goto out_free_buf;
}
- spin_lock(&pag->pag_buf_lock);
- bp = rhashtable_lookup_get_insert_fast(&pag->pag_buf_hash,
- &new_bp->b_rhash_head, xfs_buf_hash_params);
+ spin_lock(hashlock);
+ bp = rhashtable_lookup_get_insert_fast(bufhash, &new_bp->b_rhash_head,
+ xfs_buf_hash_params);
if (IS_ERR(bp)) {
error = PTR_ERR(bp);
- spin_unlock(&pag->pag_buf_lock);
+ spin_unlock(hashlock);
goto out_free_buf;
}
if (bp) {
/* found an existing buffer */
atomic_inc(&bp->b_hold);
- spin_unlock(&pag->pag_buf_lock);
+ spin_unlock(hashlock);
error = xfs_buf_find_lock(bp, flags);
if (error)
xfs_buf_rele(bp);
@@ -654,14 +656,16 @@ xfs_buf_find_insert(
/* The new buffer keeps the perag reference until it is freed. */
new_bp->b_pag = pag;
- spin_unlock(&pag->pag_buf_lock);
+ new_bp->b_state |= XFS_BSTATE_CACHED;
+ spin_unlock(hashlock);
*bpp = new_bp;
return 0;
out_free_buf:
xfs_buf_free(new_bp);
out_drop_pag:
- xfs_perag_put(pag);
+ if (pag)
+ xfs_perag_put(pag);
return error;
}
@@ -678,6 +682,8 @@ xfs_buf_get_map(
xfs_buf_flags_t flags,
struct xfs_buf **bpp)
{
+ spinlock_t *hashlock;
+ struct rhashtable *bufhash;
struct xfs_perag *pag;
struct xfs_buf *bp = NULL;
struct xfs_buf_map cmap = { .bm_bn = map[0].bm_bn };
@@ -693,10 +699,18 @@ xfs_buf_get_map(
if (error)
return error;
- pag = xfs_perag_get(btp->bt_mount,
- xfs_daddr_to_agno(btp->bt_mount, cmap.bm_bn));
+ if (btp->bt_flags & XFS_BUFTARG_SELF_CACHED) {
+ pag = NULL;
+ hashlock = &btp->bt_hashlock;
+ bufhash = &btp->bt_bufhash;
+ } else {
+ pag = xfs_perag_get(btp->bt_mount,
+ xfs_daddr_to_agno(btp->bt_mount, cmap.bm_bn));
+ hashlock = &pag->pag_buf_lock;
+ bufhash = &pag->pag_buf_hash;
+ }
- error = xfs_buf_lookup(pag, &cmap, flags, &bp);
+ error = xfs_buf_lookup(bufhash, &cmap, flags, &bp);
if (error && error != -ENOENT)
goto out_put_perag;
@@ -708,13 +722,14 @@ xfs_buf_get_map(
goto out_put_perag;
/* xfs_buf_find_insert() consumes the perag reference. */
- error = xfs_buf_find_insert(btp, pag, &cmap, map, nmaps,
- flags, &bp);
+ error = xfs_buf_find_insert(btp, pag, hashlock, bufhash, &cmap,
+ map, nmaps, flags, &bp);
if (error)
return error;
} else {
XFS_STATS_INC(btp->bt_mount, xb_get_locked);
- xfs_perag_put(pag);
+ if (pag)
+ xfs_perag_put(pag);
}
/* We do not hold a perag reference anymore. */
@@ -742,7 +757,8 @@ xfs_buf_get_map(
return 0;
out_put_perag:
- xfs_perag_put(pag);
+ if (pag)
+ xfs_perag_put(pag);
return error;
}
@@ -996,12 +1012,14 @@ xfs_buf_rele(
struct xfs_buf *bp)
{
struct xfs_perag *pag = bp->b_pag;
+ spinlock_t *hashlock;
+ struct rhashtable *bufhash;
bool release;
bool freebuf = false;
trace_xfs_buf_rele(bp, _RET_IP_);
- if (!pag) {
+ if (!(bp->b_state & XFS_BSTATE_CACHED)) {
ASSERT(list_empty(&bp->b_lru));
if (atomic_dec_and_test(&bp->b_hold)) {
xfs_buf_ioacct_dec(bp);
@@ -1012,6 +1030,14 @@ xfs_buf_rele(
ASSERT(atomic_read(&bp->b_hold) > 0);
+ if (bp->b_target->bt_flags & XFS_BUFTARG_SELF_CACHED) {
+ hashlock = &bp->b_target->bt_hashlock;
+ bufhash = &bp->b_target->bt_bufhash;
+ } else {
+ hashlock = &pag->pag_buf_lock;
+ bufhash = &pag->pag_buf_hash;
+ }
+
/*
* We grab the b_lock here first to serialise racing xfs_buf_rele()
* calls. The pag_buf_lock being taken on the last reference only
@@ -1023,7 +1049,7 @@ xfs_buf_rele(
* leading to a use-after-free scenario.
*/
spin_lock(&bp->b_lock);
- release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
+ release = atomic_dec_and_lock(&bp->b_hold, hashlock);
if (!release) {
/*
* Drop the in-flight state if the buffer is already on the LRU
@@ -1048,7 +1074,7 @@ xfs_buf_rele(
bp->b_state &= ~XFS_BSTATE_DISPOSE;
atomic_inc(&bp->b_hold);
}
- spin_unlock(&pag->pag_buf_lock);
+ spin_unlock(hashlock);
} else {
/*
* most of the time buffers will already be removed from the
@@ -1063,10 +1089,13 @@ xfs_buf_rele(
}
ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
- rhashtable_remove_fast(&pag->pag_buf_hash, &bp->b_rhash_head,
- xfs_buf_hash_params);
- spin_unlock(&pag->pag_buf_lock);
- xfs_perag_put(pag);
+ rhashtable_remove_fast(bufhash, &bp->b_rhash_head,
+ xfs_buf_hash_params);
+ spin_unlock(hashlock);
+ if (pag)
+ xfs_perag_put(pag);
+ bp->b_state &= ~XFS_BSTATE_CACHED;
+ bp->b_pag = NULL;
freebuf = true;
}
@@ -1946,6 +1975,8 @@ xfs_free_buftarg(
ASSERT(percpu_counter_sum(&btp->bt_io_count) == 0);
percpu_counter_destroy(&btp->bt_io_count);
list_lru_destroy(&btp->bt_lru);
+ if (btp->bt_flags & XFS_BUFTARG_SELF_CACHED)
+ rhashtable_destroy(&btp->bt_bufhash);
blkdev_issue_flush(btp->bt_bdev);
invalidate_bdev(btp->bt_bdev);
@@ -1990,24 +2021,20 @@ xfs_setsize_buftarg_early(
return xfs_setsize_buftarg(btp, bdev_logical_block_size(bdev));
}
-struct xfs_buftarg *
-xfs_alloc_buftarg(
+static struct xfs_buftarg *
+__xfs_alloc_buftarg(
struct xfs_mount *mp,
- struct block_device *bdev)
+ unsigned int flags)
{
- xfs_buftarg_t *btp;
- const struct dax_holder_operations *ops = NULL;
+ struct xfs_buftarg *btp;
+ int error;
-#if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
- ops = &xfs_dax_holder_operations;
-#endif
btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
+ if (!btp)
+ return NULL;
btp->bt_mount = mp;
- btp->bt_dev = bdev->bd_dev;
- btp->bt_bdev = bdev;
- btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off,
- mp, ops);
+ btp->bt_flags = flags;
/*
* Buffer IO error rate limiting. Limit it to no more than 10 messages
@@ -2016,9 +2043,6 @@ xfs_alloc_buftarg(
ratelimit_state_init(&btp->bt_ioerror_rl, 30 * HZ,
DEFAULT_RATELIMIT_BURST);
- if (xfs_setsize_buftarg_early(btp, bdev))
- goto error_free;
-
if (list_lru_init(&btp->bt_lru))
goto error_free;
@@ -2032,8 +2056,18 @@ xfs_alloc_buftarg(
if (register_shrinker(&btp->bt_shrinker, "xfs-buf:%s",
mp->m_super->s_id))
goto error_pcpu;
+
+ if (btp->bt_flags & XFS_BUFTARG_SELF_CACHED) {
+ spin_lock_init(&btp->bt_hashlock);
+ error = rhashtable_init(&btp->bt_bufhash, &xfs_buf_hash_params);
+ if (error)
+ goto error_shrinker;
+ }
+
return btp;
+error_shrinker:
+ unregister_shrinker(&btp->bt_shrinker);
error_pcpu:
percpu_counter_destroy(&btp->bt_io_count);
error_lru:
@@ -2043,6 +2077,38 @@ xfs_alloc_buftarg(
return NULL;
}
+/* Allocate a buffer cache target for a persistent block device. */
+struct xfs_buftarg *
+xfs_alloc_buftarg(
+ struct xfs_mount *mp,
+ struct block_device *bdev)
+{
+ struct xfs_buftarg *btp;
+ const struct dax_holder_operations *ops = NULL;
+
+#if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
+ ops = &xfs_dax_holder_operations;
+#endif
+
+ btp = __xfs_alloc_buftarg(mp, 0);
+ if (!btp)
+ return NULL;
+
+ btp->bt_dev = bdev->bd_dev;
+ btp->bt_bdev = bdev;
+ btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off,
+ mp, ops);
+
+ if (xfs_setsize_buftarg_early(btp, bdev))
+ goto error_free;
+
+ return btp;
+
+error_free:
+ xfs_free_buftarg(btp);
+ return NULL;
+}
+
/*
* Cancel a delayed write list.
*
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 467ddb2e2f0d..d7bf7f657e99 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -82,6 +82,7 @@ typedef unsigned int xfs_buf_flags_t;
*/
#define XFS_BSTATE_DISPOSE (1 << 0) /* buffer being discarded */
#define XFS_BSTATE_IN_FLIGHT (1 << 1) /* I/O in flight */
+#define XFS_BSTATE_CACHED (1 << 2) /* cached buffer */
/*
* The xfs_buftarg contains 2 notions of "sector size" -
@@ -102,11 +103,16 @@ typedef struct xfs_buftarg {
struct dax_device *bt_daxdev;
u64 bt_dax_part_off;
struct xfs_mount *bt_mount;
+ unsigned int bt_flags;
unsigned int bt_meta_sectorsize;
size_t bt_meta_sectormask;
size_t bt_logical_sectorsize;
size_t bt_logical_sectormask;
+ /* self-caching buftargs */
+ spinlock_t bt_hashlock;
+ struct rhashtable bt_bufhash;
+
/* LRU control structures */
struct shrinker bt_shrinker;
struct list_lru bt_lru;
@@ -115,6 +121,9 @@ typedef struct xfs_buftarg {
struct ratelimit_state bt_ioerror_rl;
} xfs_buftarg_t;
+/* the xfs_buftarg indexes buffers via bt_buf_hash */
+#define XFS_BUFTARG_SELF_CACHED (1U << 0)
+
#define XB_PAGES 2
struct xfs_buf_map {
* [PATCH 3/7] xfs: support in-memory buffer cache targets
2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
2022-12-30 22:13 ` [PATCH 4/7] xfs: consolidate btree block freeing tracepoints Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 1/7] xfs: dump xfiles for debugging purposes Darrick J. Wong
` (4 subsequent siblings)
6 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
Allow the buffer cache to target in-memory files by connecting it to
xfiles.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
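A quick usage sketch (not part of the patch) of how a caller wires a
buffer cache to an xfile. The xfile_create/xfile_destroy signatures are
assumed from the xfile patches earlier in this deluge; everything else
comes from the hunks below:

	struct xfile		*xfile;
	struct xfs_buftarg	*btp;
	struct xfs_buf		*bp;
	int			error;

	/* Pageable tmpfs-backed storage for the staging data. */
	error = xfile_create(mp, "example", 0, &xfile);
	if (error)
		return error;

	/* Point a self-caching buffer target at the xfile. */
	error = xfs_alloc_memory_buftarg(mp, xfile, &btp);
	if (error)
		goto out_xfile;

	/* Buffer IO now dispatches to xfile_obj_load/store, not bios. */
	error = xfs_buf_get(btp, 0, 8, &bp);	/* 8 sectors == one 4k page */
	if (!error)
		xfs_buf_relse(bp);

	xfs_free_buftarg(btp);	/* does not destroy the xfile */
out_xfile:
	xfile_destroy(xfile);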
fs/xfs/Kconfig | 4 ++
fs/xfs/scrub/xfile.h | 15 +++++++++
fs/xfs/xfs_aops.c | 5 ++-
fs/xfs/xfs_bmap_util.c | 8 ++---
fs/xfs/xfs_buf.c | 80 +++++++++++++++++++++++++++++++++++++++++++---
fs/xfs/xfs_buf.h | 71 +++++++++++++++++++++++++++++++++++++++--
fs/xfs/xfs_discard.c | 8 ++---
fs/xfs/xfs_file.c | 6 ++-
fs/xfs/xfs_ioctl.c | 3 +-
fs/xfs/xfs_iomap.c | 4 +-
fs/xfs/xfs_log.c | 4 +-
fs/xfs/xfs_log_cil.c | 3 +-
fs/xfs/xfs_log_recover.c | 3 +-
fs/xfs/xfs_super.c | 4 +-
14 files changed, 188 insertions(+), 30 deletions(-)
diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 54806c2b80d4..2373324be997 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -101,6 +101,9 @@ config XFS_LIVE_HOOKS
bool
select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL
+config XFS_IN_MEMORY_FILE
+ bool
+
config XFS_ONLINE_SCRUB
bool "XFS online metadata check support"
default n
@@ -108,6 +111,7 @@ config XFS_ONLINE_SCRUB
depends on TMPFS && SHMEM
select XFS_LIVE_HOOKS
select XFS_DRAIN_INTENTS
+ select XFS_IN_MEMORY_FILE
help
If you say Y here you will be able to check metadata on a
mounted XFS filesystem. This feature is intended to reduce
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
index b7f046016b1b..99b6db838612 100644
--- a/fs/xfs/scrub/xfile.h
+++ b/fs/xfs/scrub/xfile.h
@@ -6,6 +6,8 @@
#ifndef __XFS_SCRUB_XFILE_H__
#define __XFS_SCRUB_XFILE_H__
+#ifdef CONFIG_XFS_IN_MEMORY_FILE
+
struct xfile_page {
struct page *page;
void *fsdata;
@@ -76,5 +78,18 @@ int xfile_get_page(struct xfile *xf, loff_t offset, unsigned int len,
int xfile_put_page(struct xfile *xf, struct xfile_page *xbuf);
int xfile_dump(struct xfile *xf);
+#else
+static inline int
+xfile_obj_load(struct xfile *xf, void *buf, size_t count, loff_t offset)
+{
+ return -EIO;
+}
+
+static inline int
+xfile_obj_store(struct xfile *xf, const void *buf, size_t count, loff_t offset)
+{
+ return -EIO;
+}
+#endif /* CONFIG_XFS_IN_MEMORY_FILE */
#endif /* __XFS_SCRUB_XFILE_H__ */
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 41734202796f..c3a9df0c0eab 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -562,7 +562,10 @@ xfs_iomap_swapfile_activate(
struct file *swap_file,
sector_t *span)
{
- sis->bdev = xfs_inode_buftarg(XFS_I(file_inode(swap_file)))->bt_bdev;
+ struct xfs_inode *ip = XFS_I(file_inode(swap_file));
+ struct xfs_buftarg *btp = xfs_inode_buftarg(ip);
+
+ sis->bdev = xfs_buftarg_bdev(btp);
return iomap_swapfile_activate(sis, swap_file, span,
&xfs_read_iomap_ops);
}
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 867645b74d88..e094932869f6 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -62,10 +62,10 @@ xfs_zero_extent(
xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb);
sector_t block = XFS_BB_TO_FSBT(mp, sector);
- return blkdev_issue_zeroout(target->bt_bdev,
- block << (mp->m_super->s_blocksize_bits - 9),
- count_fsb << (mp->m_super->s_blocksize_bits - 9),
- GFP_NOFS, 0);
+ return xfs_buftarg_zeroout(target,
+ block << (mp->m_super->s_blocksize_bits - 9),
+ count_fsb << (mp->m_super->s_blocksize_bits - 9),
+ GFP_NOFS, 0);
}
#ifdef CONFIG_XFS_RT
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 7dfc1db566fa..2ec8d39def9c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -21,6 +21,7 @@
#include "xfs_errortag.h"
#include "xfs_error.h"
#include "xfs_ag.h"
+#include "scrub/xfile.h"
struct kmem_cache *xfs_buf_cache;
@@ -1554,6 +1555,36 @@ xfs_buf_ioapply_map(
}
+static inline void
+xfs_buf_ioapply_in_memory(
+ struct xfs_buf *bp)
+{
+ struct xfile *xfile = bp->b_target->bt_xfile;
+ loff_t pos = BBTOB(xfs_buf_daddr(bp));
+ size_t size = BBTOB(bp->b_length);
+ int error;
+
+ atomic_inc(&bp->b_io_remaining);
+
+ if (bp->b_map_count > 1) {
+ /* We don't need or support multi-map buffers. */
+ ASSERT(0);
+ error = -EIO;
+ } else if (bp->b_flags & XBF_WRITE) {
+ error = xfile_obj_store(xfile, bp->b_addr, size, pos);
+ } else {
+ error = xfile_obj_load(xfile, bp->b_addr, size, pos);
+ }
+ if (error)
+ cmpxchg(&bp->b_io_error, 0, error);
+
+ if (!bp->b_error && xfs_buf_is_vmapped(bp) && (bp->b_flags & XBF_READ))
+ invalidate_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));
+
+ if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
+ xfs_buf_ioend(bp);
+}
+
STATIC void
_xfs_buf_ioapply(
struct xfs_buf *bp)
@@ -1611,6 +1642,11 @@ _xfs_buf_ioapply(
/* we only use the buffer cache for meta-data */
op |= REQ_META;
+ if (bp->b_target->bt_flags & XFS_BUFTARG_IN_MEMORY) {
+ xfs_buf_ioapply_in_memory(bp);
+ return;
+ }
+
/*
* Walk all the vectors issuing IO on them. Set up the initial offset
* into the buffer and the desired IO size before we start -
@@ -1978,9 +2014,11 @@ xfs_free_buftarg(
if (btp->bt_flags & XFS_BUFTARG_SELF_CACHED)
rhashtable_destroy(&btp->bt_bufhash);
- blkdev_issue_flush(btp->bt_bdev);
- invalidate_bdev(btp->bt_bdev);
- fs_put_dax(btp->bt_daxdev, btp->bt_mount);
+ if (!(btp->bt_flags & XFS_BUFTARG_IN_MEMORY)) {
+ blkdev_issue_flush(btp->bt_bdev);
+ invalidate_bdev(btp->bt_bdev);
+ fs_put_dax(btp->bt_daxdev, btp->bt_mount);
+ }
kmem_free(btp);
}
@@ -2024,12 +2062,13 @@ xfs_setsize_buftarg_early(
static struct xfs_buftarg *
__xfs_alloc_buftarg(
struct xfs_mount *mp,
- unsigned int flags)
+ unsigned int flags,
+ xfs_km_flags_t km_flags)
{
struct xfs_buftarg *btp;
int error;
- btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
+ btp = kmem_zalloc(sizeof(*btp), KM_NOFS | km_flags);
if (!btp)
return NULL;
@@ -2090,7 +2129,7 @@ xfs_alloc_buftarg(
ops = &xfs_dax_holder_operations;
#endif
- btp = __xfs_alloc_buftarg(mp, 0);
+ btp = __xfs_alloc_buftarg(mp, 0, 0);
if (!btp)
return NULL;
@@ -2109,6 +2148,35 @@ xfs_alloc_buftarg(
return NULL;
}
+#ifdef CONFIG_XFS_IN_MEMORY_FILE
+/* Allocate a buffer cache target for a memory-backed file. */
+int
+xfs_alloc_memory_buftarg(
+ struct xfs_mount *mp,
+ struct xfile *xfile,
+ struct xfs_buftarg **btpp)
+{
+ struct xfs_buftarg *btp;
+
+ btp = __xfs_alloc_buftarg(mp,
+ XFS_BUFTARG_SELF_CACHED | XFS_BUFTARG_IN_MEMORY,
+ KM_MAYFAIL);
+ if (!btp)
+ return -ENOMEM;
+
+ btp->bt_xfile = xfile;
+ btp->bt_dev = (dev_t)-1U;
+
+ btp->bt_meta_sectorsize = SECTOR_SIZE;
+ btp->bt_meta_sectormask = SECTOR_SIZE - 1;
+ btp->bt_logical_sectorsize = SECTOR_SIZE;
+ btp->bt_logical_sectormask = SECTOR_SIZE - 1;
+
+ *btpp = btp;
+ return 0;
+}
+#endif /* CONFIG_XFS_IN_MEMORY_FILE */
+
/*
* Cancel a delayed write list.
*
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index d7bf7f657e99..dcae77dabdcc 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -21,6 +21,7 @@ extern struct kmem_cache *xfs_buf_cache;
* Base types
*/
struct xfs_buf;
+struct xfile;
#define XFS_BUF_DADDR_NULL ((xfs_daddr_t) (-1LL))
@@ -99,7 +100,10 @@ typedef unsigned int xfs_buf_flags_t;
*/
typedef struct xfs_buftarg {
dev_t bt_dev;
- struct block_device *bt_bdev;
+ union {
+ struct block_device *bt_bdev;
+ struct xfile *bt_xfile;
+ };
struct dax_device *bt_daxdev;
u64 bt_dax_part_off;
struct xfs_mount *bt_mount;
@@ -124,6 +128,20 @@ typedef struct xfs_buftarg {
/* the xfs_buftarg indexes buffers via bt_bufhash */
#define XFS_BUFTARG_SELF_CACHED (1U << 0)
+/* in-memory buftarg via bt_xfile */
+#ifdef CONFIG_XFS_IN_MEMORY_FILE
+# define XFS_BUFTARG_IN_MEMORY (1U << 1)
+#else
+# define XFS_BUFTARG_IN_MEMORY (0)
+#endif
+
+static inline bool
+xfs_buftarg_in_memory(
+ struct xfs_buftarg *btp)
+{
+ return btp->bt_flags & XFS_BUFTARG_IN_MEMORY;
+}
+
#define XB_PAGES 2
struct xfs_buf_map {
@@ -372,13 +390,60 @@ xfs_buf_update_cksum(struct xfs_buf *bp, unsigned long cksum_offset)
*/
struct xfs_buftarg *xfs_alloc_buftarg(struct xfs_mount *mp,
struct block_device *bdev);
+#ifdef CONFIG_XFS_IN_MEMORY_FILE
+int xfs_alloc_memory_buftarg(struct xfs_mount *mp, struct xfile *xfile,
+ struct xfs_buftarg **btpp);
+#endif
extern void xfs_free_buftarg(struct xfs_buftarg *);
extern void xfs_buftarg_wait(struct xfs_buftarg *);
extern void xfs_buftarg_drain(struct xfs_buftarg *);
extern int xfs_setsize_buftarg(struct xfs_buftarg *, unsigned int);
-#define xfs_getsize_buftarg(buftarg) block_size((buftarg)->bt_bdev)
-#define xfs_readonly_buftarg(buftarg) bdev_read_only((buftarg)->bt_bdev)
+static inline struct block_device *
+xfs_buftarg_bdev(struct xfs_buftarg *btp)
+{
+ if (btp->bt_flags & XFS_BUFTARG_IN_MEMORY)
+ return NULL;
+ return btp->bt_bdev;
+}
+
+static inline unsigned int
+xfs_getsize_buftarg(struct xfs_buftarg *btp)
+{
+ if (btp->bt_flags & XFS_BUFTARG_IN_MEMORY)
+ return SECTOR_SIZE;
+ return block_size(btp->bt_bdev);
+}
+
+static inline bool
+xfs_readonly_buftarg(struct xfs_buftarg *btp)
+{
+ if (btp->bt_flags & XFS_BUFTARG_IN_MEMORY)
+ return false;
+ return bdev_read_only(btp->bt_bdev);
+}
+
+static inline int
+xfs_buftarg_flush(struct xfs_buftarg *btp)
+{
+ if (btp->bt_flags & XFS_BUFTARG_IN_MEMORY)
+ return 0;
+ return blkdev_issue_flush(btp->bt_bdev);
+}
+
+static inline int
+xfs_buftarg_zeroout(
+ struct xfs_buftarg *btp,
+ sector_t sector,
+ sector_t nr_sects,
+ gfp_t gfp_mask,
+ unsigned flags)
+{
+ if (btp->bt_flags & XFS_BUFTARG_IN_MEMORY)
+ return -EOPNOTSUPP;
+ return blkdev_issue_zeroout(btp->bt_bdev, sector, nr_sects, gfp_mask,
+ flags);
+}
int xfs_buf_reverify(struct xfs_buf *bp, const struct xfs_buf_ops *ops);
bool xfs_verify_magic(struct xfs_buf *bp, __be32 dmagic);
diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index 3fa6b0ab9ed6..44658cc7d3f2 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -29,7 +29,7 @@ xfs_trim_extents(
xfs_daddr_t minlen,
uint64_t *blocks_trimmed)
{
- struct block_device *bdev = mp->m_ddev_targp->bt_bdev;
+ struct block_device *bdev = xfs_buftarg_bdev(mp->m_ddev_targp);
struct xfs_btree_cur *cur;
struct xfs_buf *agbp;
struct xfs_agf *agf;
@@ -154,8 +154,8 @@ xfs_ioc_trim(
struct xfs_mount *mp,
struct fstrim_range __user *urange)
{
- unsigned int granularity =
- bdev_discard_granularity(mp->m_ddev_targp->bt_bdev);
+ struct block_device *bdev = xfs_buftarg_bdev(mp->m_ddev_targp);
+ unsigned int granularity = bdev_discard_granularity(bdev);
struct fstrim_range range;
xfs_daddr_t start, end, minlen;
xfs_agnumber_t start_agno, end_agno, agno;
@@ -164,7 +164,7 @@ xfs_ioc_trim(
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
- if (!bdev_max_discard_sectors(mp->m_ddev_targp->bt_bdev))
+ if (!bdev_max_discard_sectors(bdev))
return -EOPNOTSUPP;
/*
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 595a5bcf46b9..c4bdadd8fa71 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -164,9 +164,9 @@ xfs_file_fsync(
* inode size in case of an extending write.
*/
if (XFS_IS_REALTIME_INODE(ip))
- error = blkdev_issue_flush(mp->m_rtdev_targp->bt_bdev);
+ error = xfs_buftarg_flush(mp->m_rtdev_targp);
else if (mp->m_logdev_targp != mp->m_ddev_targp)
- error = blkdev_issue_flush(mp->m_ddev_targp->bt_bdev);
+ error = xfs_buftarg_flush(mp->m_ddev_targp);
/*
* Any inode that has dirty modifications in the log is pinned. The
@@ -189,7 +189,7 @@ xfs_file_fsync(
*/
if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
mp->m_logdev_targp == mp->m_ddev_targp) {
- err2 = blkdev_issue_flush(mp->m_ddev_targp->bt_bdev);
+ err2 = xfs_buftarg_flush(mp->m_ddev_targp);
if (err2 && !error)
error = err2;
}
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 020111f0f2a2..4b2a02a08dfa 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1762,6 +1762,7 @@ xfs_ioc_setlabel(
char __user *newlabel)
{
struct xfs_sb *sbp = &mp->m_sb;
+ struct block_device *bdev = xfs_buftarg_bdev(mp->m_ddev_targp);
char label[XFSLABEL_MAX + 1];
size_t len;
int error;
@@ -1808,7 +1809,7 @@ xfs_ioc_setlabel(
error = xfs_update_secondary_sbs(mp);
mutex_unlock(&mp->m_growlock);
- invalidate_bdev(mp->m_ddev_targp->bt_bdev);
+ invalidate_bdev(bdev);
out:
mnt_drop_write_file(filp);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index c2ba03281daf..99a7c271c353 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -129,7 +129,7 @@ xfs_bmbt_to_iomap(
if (mapping_flags & IOMAP_DAX)
iomap->dax_dev = target->bt_daxdev;
else
- iomap->bdev = target->bt_bdev;
+ iomap->bdev = xfs_buftarg_bdev(target);
iomap->flags = iomap_flags;
if (xfs_ipincount(ip) &&
@@ -154,7 +154,7 @@ xfs_hole_to_iomap(
iomap->type = IOMAP_HOLE;
iomap->offset = XFS_FSB_TO_B(ip->i_mount, offset_fsb);
iomap->length = XFS_FSB_TO_B(ip->i_mount, end_fsb - offset_fsb);
- iomap->bdev = target->bt_bdev;
+ iomap->bdev = xfs_buftarg_bdev(target);
iomap->dax_dev = target->bt_daxdev;
}
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index fc61cc024023..b32a8e57f576 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1938,7 +1938,7 @@ xlog_write_iclog(
* writeback throttle from throttling log writes behind background
* metadata writeback and causing priority inversions.
*/
- bio_init(&iclog->ic_bio, log->l_targ->bt_bdev, iclog->ic_bvec,
+ bio_init(&iclog->ic_bio, xfs_buftarg_bdev(log->l_targ), iclog->ic_bvec,
howmany(count, PAGE_SIZE),
REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE);
iclog->ic_bio.bi_iter.bi_sector = log->l_logBBstart + bno;
@@ -1959,7 +1959,7 @@ xlog_write_iclog(
* avoid shutdown re-entering this path and erroring out again.
*/
if (log->l_targ != log->l_mp->m_ddev_targp &&
- blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev)) {
+ xfs_buftarg_flush(log->l_mp->m_ddev_targp)) {
xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
return;
}
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index eccbfb99e894..12cd2874048f 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -742,7 +742,8 @@ xlog_discard_busy_extents(
trace_xfs_discard_extent(mp, busyp->agno, busyp->bno,
busyp->length);
- error = __blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
+ error = __blkdev_issue_discard(
+ xfs_buftarg_bdev(mp->m_ddev_targp),
XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno),
XFS_FSB_TO_BB(mp, busyp->length),
GFP_NOFS, &bio);
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 322eb2ee6c55..6b1f37bc3e95 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -137,7 +137,8 @@ xlog_do_io(
nbblks = round_up(nbblks, log->l_sectBBsize);
ASSERT(nbblks > 0);
- error = xfs_rw_bdev(log->l_targ->bt_bdev, log->l_logBBstart + blk_no,
+ error = xfs_rw_bdev(xfs_buftarg_bdev(log->l_targ),
+ log->l_logBBstart + blk_no,
BBTOB(nbblks), data, op);
if (error && !xlog_is_shutdown(log)) {
xfs_alert(log->l_mp,
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 020ff2d93f23..8841947bdce7 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -397,13 +397,13 @@ xfs_close_devices(
struct xfs_mount *mp)
{
if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp) {
- struct block_device *logdev = mp->m_logdev_targp->bt_bdev;
+ struct block_device *logdev = xfs_buftarg_bdev(mp->m_logdev_targp);
xfs_free_buftarg(mp->m_logdev_targp);
xfs_blkdev_put(logdev);
}
if (mp->m_rtdev_targp) {
- struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
+ struct block_device *rtdev = xfs_buftarg_bdev(mp->m_rtdev_targp);
xfs_free_buftarg(mp->m_rtdev_targp);
xfs_blkdev_put(rtdev);
* [PATCH 4/7] xfs: consolidate btree block freeing tracepoints
2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 3/7] xfs: support in-memory buffer cache targets Darrick J. Wong
` (5 subsequent siblings)
6 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
Don't waste tracepoint segment memory on per-btree block freeing
tracepoints when we can do it from the generic btree code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
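The consolidation pattern, for reference (taken from the hunks below):
two per-btree event instances go away, and one generic event keyed on
cur->bc_btnum takes their place in the generic block-freeing path:

	/* before: one DEFINE_BUSY_EVENT per btree type */
	trace_xfs_rmapbt_free_block(cur->bc_mp, pag->pag_agno, bno, 1);
	trace_xfs_refcountbt_free_block(cur->bc_mp, agno, agbno, 1);

	/* after: a single call site in xfs_btree_free_block() */
	trace_xfs_btree_free_block(cur, bp);

Every DEFINE_EVENT instance emits its own static data into the
tracepoint sections, so each event consolidated this way is a small but
permanent memory saving.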
fs/xfs/libxfs/xfs_btree.c | 2 ++
fs/xfs/libxfs/xfs_refcount_btree.c | 2 --
fs/xfs/libxfs/xfs_rmap_btree.c | 2 --
fs/xfs/xfs_trace.h | 32 ++++++++++++++++++++++++++++++--
4 files changed, 32 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 02c237984fa6..7fab2df1046f 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -414,6 +414,8 @@ xfs_btree_free_block(
{
int error;
+ trace_xfs_btree_free_block(cur, bp);
+
error = cur->bc_ops->free_block(cur, bp);
if (!error) {
xfs_trans_binval(cur->bc_tp, bp);
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 1bf991bf452f..b1d1f3bb159f 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -108,8 +108,6 @@ xfs_refcountbt_free_block(
xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp));
int error;
- trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_ag.pag->pag_agno,
- XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
be32_add_cpu(&agf->agf_refcount_blocks, -1);
xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
error = xfs_free_extent(cur->bc_tp, cur->bc_ag.pag,
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 2c90a05ca814..1421fcfcad64 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -125,8 +125,6 @@ xfs_rmapbt_free_block(
int error;
bno = xfs_daddr_to_agbno(cur->bc_mp, xfs_buf_daddr(bp));
- trace_xfs_rmapbt_free_block(cur->bc_mp, pag->pag_agno,
- bno, 1);
be32_add_cpu(&agf->agf_rmap_blocks, -1);
xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_RMAP_BLOCKS);
error = xfs_alloc_put_freelist(pag, cur->bc_tp, agbp, NULL, bno, 1);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 145808b733ce..50f4d4410976 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2515,6 +2515,36 @@ DEFINE_EVENT(xfs_btree_cur_class, name, \
DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
DEFINE_BTREE_CUR_EVENT(xfs_btree_overlapped_query_range);
+TRACE_EVENT(xfs_btree_free_block,
+ TP_PROTO(struct xfs_btree_cur *cur, struct xfs_buf *bp),
+ TP_ARGS(cur, bp),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_agnumber_t, agno)
+ __field(xfs_ino_t, ino)
+ __field(xfs_btnum_t, btnum)
+ __field(xfs_agblock_t, agbno)
+ ),
+ TP_fast_assign(
+ __entry->dev = cur->bc_mp->m_super->s_dev;
+ __entry->agno = xfs_daddr_to_agno(cur->bc_mp,
+ xfs_buf_daddr(bp));
+ if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)
+ __entry->ino = cur->bc_ino.ip->i_ino;
+ else
+ __entry->ino = 0;
+ __entry->btnum = cur->bc_btnum;
+ __entry->agbno = xfs_daddr_to_agbno(cur->bc_mp,
+ xfs_buf_daddr(bp));
+ ),
+ TP_printk("dev %d:%d btree %s agno 0x%x ino 0x%llx agbno 0x%x",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __print_symbolic(__entry->btnum, XFS_BTNUM_STRINGS),
+ __entry->agno,
+ __entry->ino,
+ __entry->agbno)
+);
+
/* deferred ops */
struct xfs_defer_pending;
@@ -2869,7 +2899,6 @@ DEFINE_RMAP_DEFERRED_EVENT(xfs_rmap_defer);
DEFINE_RMAP_DEFERRED_EVENT(xfs_rmap_deferred);
DEFINE_BUSY_EVENT(xfs_rmapbt_alloc_block);
-DEFINE_BUSY_EVENT(xfs_rmapbt_free_block);
DEFINE_RMAPBT_EVENT(xfs_rmap_update);
DEFINE_RMAPBT_EVENT(xfs_rmap_insert);
DEFINE_RMAPBT_EVENT(xfs_rmap_delete);
@@ -3228,7 +3257,6 @@ DEFINE_EVENT(xfs_refcount_triple_extent_class, name, \
/* refcount btree tracepoints */
DEFINE_BUSY_EVENT(xfs_refcountbt_alloc_block);
-DEFINE_BUSY_EVENT(xfs_refcountbt_free_block);
DEFINE_AG_BTREE_LOOKUP_EVENT(xfs_refcount_lookup);
DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_get);
DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_update);
* [PATCH 5/7] xfs: consolidate btree block allocation tracepoints
2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
` (3 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 2/7] xfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 6/7] xfs: support in-memory btrees Darrick J. Wong
2022-12-30 22:13 ` [PATCH 7/7] xfs: connect in-memory btrees to xfiles Darrick J. Wong
6 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
Don't waste tracepoint segment memory on per-btree block allocation
tracepoints when we can do it from the generic btree code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
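One behavioral note beyond the memory savings: because the new
xfs_btree_alloc_block() wrapper traces before returning, every
allocation attempt is now visible in the trace stream, not just the
successful ones:

	error = xfs_btree_alloc_block(cur, &lptr, &rptr, stat);
	/*
	 * The tracepoint has already fired here, recording the error
	 * code; agbno is logged as NULLAGBLOCK if error != 0 or if
	 * *stat == 0 (the allocator ran out of space).
	 */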
fs/xfs/libxfs/xfs_btree.c | 20 ++++++++++++---
fs/xfs/libxfs/xfs_refcount_btree.c | 2 -
fs/xfs/libxfs/xfs_rmap_btree.c | 2 -
fs/xfs/xfs_trace.h | 49 +++++++++++++++++++++++++++++++++++-
4 files changed, 64 insertions(+), 9 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 7fab2df1046f..f577c0463c6e 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2693,6 +2693,20 @@ xfs_btree_rshift(
return error;
}
+static inline int
+xfs_btree_alloc_block(
+ struct xfs_btree_cur *cur,
+ const union xfs_btree_ptr *hint_block,
+ union xfs_btree_ptr *new_block,
+ int *stat)
+{
+ int error;
+
+ error = cur->bc_ops->alloc_block(cur, hint_block, new_block, stat);
+ trace_xfs_btree_alloc_block(cur, new_block, *stat, error);
+ return error;
+}
+
/*
* Split cur/level block in half.
* Return new block number and the key to its first
@@ -2736,7 +2750,7 @@ __xfs_btree_split(
xfs_btree_buf_to_ptr(cur, lbp, &lptr);
/* Allocate the new block. If we can't do it, we're toast. Give up. */
- error = cur->bc_ops->alloc_block(cur, &lptr, &rptr, stat);
+ error = xfs_btree_alloc_block(cur, &lptr, &rptr, stat);
if (error)
goto error0;
if (*stat == 0)
@@ -3002,7 +3016,7 @@ xfs_btree_new_iroot(
pp = xfs_btree_ptr_addr(cur, 1, block);
/* Allocate the new block. If we can't do it, we're toast. Give up. */
- error = cur->bc_ops->alloc_block(cur, pp, &nptr, stat);
+ error = xfs_btree_alloc_block(cur, pp, &nptr, stat);
if (error)
goto error0;
if (*stat == 0)
@@ -3102,7 +3116,7 @@ xfs_btree_new_root(
cur->bc_ops->init_ptr_from_cur(cur, &rptr);
/* Allocate the new block. If we can't do it, we're toast. Give up. */
- error = cur->bc_ops->alloc_block(cur, &rptr, &lptr, stat);
+ error = xfs_btree_alloc_block(cur, &rptr, &lptr, stat);
if (error)
goto error0;
if (*stat == 0)
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index b1d1f3bb159f..b75005684aa2 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -77,8 +77,6 @@ xfs_refcountbt_alloc_block(
error = xfs_alloc_vextent(&args);
if (error)
goto out_error;
- trace_xfs_refcountbt_alloc_block(cur->bc_mp, cur->bc_ag.pag->pag_agno,
- args.agbno, 1);
if (args.fsbno == NULLFSBLOCK) {
*stat = 0;
return 0;
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 1421fcfcad64..5583dbe43bb5 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -94,8 +94,6 @@ xfs_rmapbt_alloc_block(
&bno, 1);
if (error)
return error;
-
- trace_xfs_rmapbt_alloc_block(cur->bc_mp, pag->pag_agno, bno, 1);
if (bno == NULLAGBLOCK) {
*stat = 0;
return 0;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 50f4d4410976..d86dd34127f2 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2515,6 +2515,53 @@ DEFINE_EVENT(xfs_btree_cur_class, name, \
DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
DEFINE_BTREE_CUR_EVENT(xfs_btree_overlapped_query_range);
+TRACE_EVENT(xfs_btree_alloc_block,
+ TP_PROTO(struct xfs_btree_cur *cur, union xfs_btree_ptr *ptr, int stat,
+ int error),
+ TP_ARGS(cur, ptr, stat, error),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_agnumber_t, agno)
+ __field(xfs_ino_t, ino)
+ __field(xfs_btnum_t, btnum)
+ __field(int, error)
+ __field(xfs_agblock_t, agbno)
+ ),
+ TP_fast_assign(
+ __entry->dev = cur->bc_mp->m_super->s_dev;
+ if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) {
+ __entry->agno = 0;
+ __entry->ino = cur->bc_ino.ip->i_ino;
+ } else {
+ __entry->agno = cur->bc_ag.pag->pag_agno;
+ __entry->ino = 0;
+ }
+ __entry->btnum = cur->bc_btnum;
+ __entry->error = error;
+ if (!error && stat) {
+ if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
+ xfs_fsblock_t fsb = be64_to_cpu(ptr->l);
+
+ __entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp,
+ fsb);
+ __entry->agbno = XFS_FSB_TO_AGBNO(cur->bc_mp,
+ fsb);
+ } else {
+ __entry->agbno = be32_to_cpu(ptr->s);
+ }
+ } else {
+ __entry->agbno = NULLAGBLOCK;
+ }
+ ),
+ TP_printk("dev %d:%d btree %s agno 0x%x ino 0x%llx agbno 0x%x error %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __print_symbolic(__entry->btnum, XFS_BTNUM_STRINGS),
+ __entry->agno,
+ __entry->ino,
+ __entry->agbno,
+ __entry->error)
+);
+
TRACE_EVENT(xfs_btree_free_block,
TP_PROTO(struct xfs_btree_cur *cur, struct xfs_buf *bp),
TP_ARGS(cur, bp),
@@ -2898,7 +2945,6 @@ DEFINE_EVENT(xfs_rmapbt_class, name, \
DEFINE_RMAP_DEFERRED_EVENT(xfs_rmap_defer);
DEFINE_RMAP_DEFERRED_EVENT(xfs_rmap_deferred);
-DEFINE_BUSY_EVENT(xfs_rmapbt_alloc_block);
DEFINE_RMAPBT_EVENT(xfs_rmap_update);
DEFINE_RMAPBT_EVENT(xfs_rmap_insert);
DEFINE_RMAPBT_EVENT(xfs_rmap_delete);
@@ -3256,7 +3302,6 @@ DEFINE_EVENT(xfs_refcount_triple_extent_class, name, \
TP_ARGS(mp, agno, i1, i2, i3))
/* refcount btree tracepoints */
-DEFINE_BUSY_EVENT(xfs_refcountbt_alloc_block);
DEFINE_AG_BTREE_LOOKUP_EVENT(xfs_refcount_lookup);
DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_get);
DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_update);
* [PATCH 6/7] xfs: support in-memory btrees
2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
` (4 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 5/7] xfs: consolidate btree block allocation tracepoints Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 7/7] xfs: connect in-memory btrees to xfiles Darrick J. Wong
6 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
Adapt the generic btree cursor code so that it can create a btree whose
buffers come from a (presumably in-memory) buftarg, with a header block
specific to in-memory btrees. We'll connect this to other parts of
online scrub in the next patches.
Note that in-memory btrees always have a block size matching the system
memory page size for efficiency reasons.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
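To make the units concrete (a worked example assuming 4k pages): every
in-memory btree block is exactly one page of the backing xfile, and the
helpers added to scrub/xfile.h convert between xfile block numbers and
512-byte sector addresses:

	/* XFB_SHIFT == PAGE_SHIFT - BBSHIFT == 12 - 9 == 3 */
	xfbtree_bbsize();	/* == xfo_to_daddr(1) == 1 << 3 == 8 sectors */
	xfo_to_daddr(3);	/* == 3 << 3 == 24; block 3 starts at sector 24 */
	xfs_daddr_to_xfo(24);	/* == (24 + 7) >> 3 == 3; rounds up going back */

The in-memory btree block size is therefore independent of the on-disk
filesystem block size.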
fs/xfs/Kconfig | 4
fs/xfs/Makefile | 1
fs/xfs/libxfs/xfs_btree.c | 151 ++++++++++++++----
fs/xfs/libxfs/xfs_btree.h | 17 ++
fs/xfs/libxfs/xfs_btree_mem.h | 87 ++++++++++
fs/xfs/scrub/xfbtree.c | 352 +++++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/xfbtree.h | 34 ++++
fs/xfs/scrub/xfile.h | 46 +++++
fs/xfs/xfs_buf.c | 10 +
fs/xfs/xfs_buf.h | 10 +
fs/xfs/xfs_health.c | 3
fs/xfs/xfs_trace.c | 3
fs/xfs/xfs_trace.h | 5 -
13 files changed, 694 insertions(+), 29 deletions(-)
create mode 100644 fs/xfs/libxfs/xfs_btree_mem.h
create mode 100644 fs/xfs/scrub/xfbtree.c
create mode 100644 fs/xfs/scrub/xfbtree.h
diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 2373324be997..612e5c458033 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -104,6 +104,9 @@ config XFS_LIVE_HOOKS
config XFS_IN_MEMORY_FILE
bool
+config XFS_IN_MEMORY_BTREE
+ bool
+
config XFS_ONLINE_SCRUB
bool "XFS online metadata check support"
default n
@@ -161,6 +164,7 @@ config XFS_ONLINE_REPAIR
bool "XFS online metadata repair support"
default n
depends on XFS_FS && XFS_ONLINE_SCRUB
+ select XFS_IN_MEMORY_BTREE
help
If you say Y here you will be able to repair metadata on a
mounted XFS filesystem. This feature is intended to reduce
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 2d756e13d441..7e1495465cec 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -195,6 +195,7 @@ xfs-y += $(addprefix scrub/, \
reap.o \
refcount_repair.o \
repair.o \
+ xfbtree.o \
)
xfs-$(CONFIG_XFS_RT) += $(addprefix scrub/, \
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index f577c0463c6e..4c5b4d26cd1b 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -28,6 +28,9 @@
#include "xfs_rmap_btree.h"
#include "xfs_refcount_btree.h"
#include "xfs_health.h"
+#include "scrub/xfile.h"
+#include "scrub/xfbtree.h"
+#include "xfs_btree_mem.h"
/*
* Btree magic numbers.
@@ -82,6 +85,9 @@ xfs_btree_check_lblock_siblings(
if (level >= 0) {
if (!xfs_btree_check_lptr(cur, sibling, level + 1))
return __this_address;
+ } else if (cur && (cur->bc_flags & XFS_BTREE_IN_MEMORY)) {
+ if (!xfbtree_verify_xfileoff(cur, sibling))
+ return __this_address;
} else {
if (!xfs_verify_fsbno(mp, sibling))
return __this_address;
@@ -109,6 +115,9 @@ xfs_btree_check_sblock_siblings(
if (level >= 0) {
if (!xfs_btree_check_sptr(cur, sibling, level + 1))
return __this_address;
+ } else if (cur && (cur->bc_flags & XFS_BTREE_IN_MEMORY)) {
+ if (!xfbtree_verify_xfileoff(cur, sibling))
+ return __this_address;
} else {
if (!xfs_verify_agbno(pag, sibling))
return __this_address;
@@ -151,7 +160,9 @@ __xfs_btree_check_lblock(
cur->bc_ops->get_maxrecs(cur, level))
return __this_address;
- if (bp)
+ if ((cur->bc_flags & XFS_BTREE_IN_MEMORY) && bp)
+ fsb = xfbtree_buf_to_xfoff(cur, bp);
+ else if (bp)
fsb = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp));
fa = xfs_btree_check_lblock_siblings(mp, cur, level, fsb,
@@ -218,8 +229,12 @@ __xfs_btree_check_sblock(
cur->bc_ops->get_maxrecs(cur, level))
return __this_address;
- if (bp)
+ if ((cur->bc_flags & XFS_BTREE_IN_MEMORY) && bp) {
+ pag = NULL;
+ agbno = xfbtree_buf_to_xfoff(cur, bp);
+ } else if (bp) {
agbno = xfs_daddr_to_agbno(mp, xfs_buf_daddr(bp));
+ }
fa = xfs_btree_check_sblock_siblings(pag, cur, level, agbno,
block->bb_u.s.bb_leftsib);
@@ -276,6 +291,8 @@ xfs_btree_check_lptr(
{
if (level <= 0)
return false;
+ if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+ return xfbtree_verify_xfileoff(cur, fsbno);
return xfs_verify_fsbno(cur->bc_mp, fsbno);
}
@@ -288,6 +305,8 @@ xfs_btree_check_sptr(
{
if (level <= 0)
return false;
+ if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+ return xfbtree_verify_xfileoff(cur, agbno);
return xfs_verify_agbno(cur->bc_ag.pag, agbno);
}
@@ -302,6 +321,9 @@ xfs_btree_check_ptr(
int index,
int level)
{
+ if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+ return xfbtree_check_ptr(cur, ptr, index, level);
+
if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
if (xfs_btree_check_lptr(cur, be64_to_cpu((&ptr->l)[index]),
level))
@@ -458,11 +480,36 @@ xfs_btree_del_cursor(
xfs_is_shutdown(cur->bc_mp) || error != 0);
if (unlikely(cur->bc_flags & XFS_BTREE_STAGING))
kmem_free(cur->bc_ops);
- if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
+ if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) &&
+ !(cur->bc_flags & XFS_BTREE_IN_MEMORY) && cur->bc_ag.pag)
xfs_perag_put(cur->bc_ag.pag);
+ if (cur->bc_flags & XFS_BTREE_IN_MEMORY) {
+ if (cur->bc_mem.pag)
+ xfs_perag_put(cur->bc_mem.pag);
+ }
kmem_cache_free(cur->bc_cache, cur);
}
+/* Return the buffer target for this btree's buffer. */
+static inline struct xfs_buftarg *
+xfs_btree_buftarg(
+ struct xfs_btree_cur *cur)
+{
+ if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+ return xfbtree_target(cur->bc_mem.xfbtree);
+ return cur->bc_mp->m_ddev_targp;
+}
+
+/* Return the block size (in units of 512b sectors) for this btree. */
+static inline unsigned int
+xfs_btree_bbsize(
+ struct xfs_btree_cur *cur)
+{
+ if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+ return xfbtree_bbsize();
+ return cur->bc_mp->m_bsize;
+}
+
/*
* Duplicate the btree cursor.
* Allocate a new one, copy the record, re-get the buffers.
@@ -500,10 +547,11 @@ xfs_btree_dup_cursor(
new->bc_levels[i].ra = cur->bc_levels[i].ra;
bp = cur->bc_levels[i].bp;
if (bp) {
- error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
- xfs_buf_daddr(bp), mp->m_bsize,
- 0, &bp,
- cur->bc_ops->buf_ops);
+ error = xfs_trans_read_buf(mp, tp,
+ xfs_btree_buftarg(cur),
+ xfs_buf_daddr(bp),
+ xfs_btree_bbsize(cur), 0, &bp,
+ cur->bc_ops->buf_ops);
if (xfs_metadata_is_sick(error))
xfs_btree_mark_sick(new);
if (error) {
@@ -944,6 +992,9 @@ xfs_btree_readahead_lblock(
xfs_fsblock_t left = be64_to_cpu(block->bb_u.l.bb_leftsib);
xfs_fsblock_t right = be64_to_cpu(block->bb_u.l.bb_rightsib);
+ if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+ return 0;
+
if ((lr & XFS_BTCUR_LEFTRA) && left != NULLFSBLOCK) {
xfs_btree_reada_bufl(cur->bc_mp, left, 1,
cur->bc_ops->buf_ops);
@@ -969,6 +1020,8 @@ xfs_btree_readahead_sblock(
xfs_agblock_t left = be32_to_cpu(block->bb_u.s.bb_leftsib);
xfs_agblock_t right = be32_to_cpu(block->bb_u.s.bb_rightsib);
+ if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+ return 0;
if ((lr & XFS_BTCUR_LEFTRA) && left != NULLAGBLOCK) {
xfs_btree_reada_bufs(cur->bc_mp, cur->bc_ag.pag->pag_agno,
@@ -1030,6 +1083,11 @@ xfs_btree_ptr_to_daddr(
if (error)
return error;
+ if (cur->bc_flags & XFS_BTREE_IN_MEMORY) {
+ *daddr = xfbtree_ptr_to_daddr(cur, ptr);
+ return 0;
+ }
+
if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
fsbno = be64_to_cpu(ptr->l);
*daddr = XFS_FSB_TO_DADDR(cur->bc_mp, fsbno);
@@ -1058,8 +1116,9 @@ xfs_btree_readahead_ptr(
if (xfs_btree_ptr_to_daddr(cur, ptr, &daddr))
return;
- xfs_buf_readahead(cur->bc_mp->m_ddev_targp, daddr,
- cur->bc_mp->m_bsize * count, cur->bc_ops->buf_ops);
+ xfs_buf_readahead(xfs_btree_buftarg(cur), daddr,
+ xfs_btree_bbsize(cur) * count,
+ cur->bc_ops->buf_ops);
}
/*
@@ -1233,7 +1292,9 @@ xfs_btree_init_block_cur(
* change in future, but is safe for current users of the generic btree
* code.
*/
- if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+ if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+ owner = xfbtree_owner(cur);
+ else if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
owner = cur->bc_ino.ip->i_ino;
else
owner = cur->bc_ag.pag->pag_agno;
@@ -1273,6 +1334,11 @@ xfs_btree_buf_to_ptr(
struct xfs_buf *bp,
union xfs_btree_ptr *ptr)
{
+ if (cur->bc_flags & XFS_BTREE_IN_MEMORY) {
+ xfbtree_buf_to_ptr(cur, bp, ptr);
+ return;
+ }
+
if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
ptr->l = cpu_to_be64(XFS_DADDR_TO_FSB(cur->bc_mp,
xfs_buf_daddr(bp)));
@@ -1317,15 +1383,14 @@ xfs_btree_get_buf_block(
struct xfs_btree_block **block,
struct xfs_buf **bpp)
{
- struct xfs_mount *mp = cur->bc_mp;
- xfs_daddr_t d;
- int error;
+ xfs_daddr_t d;
+ int error;
error = xfs_btree_ptr_to_daddr(cur, ptr, &d);
if (error)
return error;
- error = xfs_trans_get_buf(cur->bc_tp, mp->m_ddev_targp, d, mp->m_bsize,
- 0, bpp);
+ error = xfs_trans_get_buf(cur->bc_tp, xfs_btree_buftarg(cur), d,
+ xfs_btree_bbsize(cur), 0, bpp);
if (error)
return error;
@@ -1356,9 +1421,9 @@ xfs_btree_read_buf_block(
error = xfs_btree_ptr_to_daddr(cur, ptr, &d);
if (error)
return error;
- error = xfs_trans_read_buf(mp, cur->bc_tp, mp->m_ddev_targp, d,
- mp->m_bsize, flags, bpp,
- cur->bc_ops->buf_ops);
+ error = xfs_trans_read_buf(mp, cur->bc_tp, xfs_btree_buftarg(cur), d,
+ xfs_btree_bbsize(cur), flags, bpp,
+ cur->bc_ops->buf_ops);
if (xfs_metadata_is_sick(error))
xfs_btree_mark_sick(cur);
if (error)
@@ -1798,6 +1863,37 @@ xfs_btree_decrement(
return error;
}
+/*
+ * Check the btree block owner now that we have the context to know who the
+ * real owner is.
+ */
+static inline xfs_failaddr_t
+xfs_btree_check_block_owner(
+ struct xfs_btree_cur *cur,
+ struct xfs_btree_block *block)
+{
+ if (!xfs_has_crc(cur->bc_mp))
+ return NULL;
+
+ if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+ return xfbtree_check_block_owner(cur, block);
+
+ if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS)) {
+ if (be32_to_cpu(block->bb_u.s.bb_owner) !=
+ cur->bc_ag.pag->pag_agno)
+ return __this_address;
+ return NULL;
+ }
+
+ if (cur->bc_ino.flags & XFS_BTCUR_BMBT_INVALID_OWNER)
+ return NULL;
+
+ if (be64_to_cpu(block->bb_u.l.bb_owner) != cur->bc_ino.ip->i_ino)
+ return __this_address;
+
+ return NULL;
+}
+
int
xfs_btree_lookup_get_block(
struct xfs_btree_cur *cur, /* btree cursor */
@@ -1836,11 +1932,7 @@ xfs_btree_lookup_get_block(
return error;
/* Check the inode owner since the verifiers don't. */
- if (xfs_has_crc(cur->bc_mp) &&
- !(cur->bc_ino.flags & XFS_BTCUR_BMBT_INVALID_OWNER) &&
- (cur->bc_flags & XFS_BTREE_LONG_PTRS) &&
- be64_to_cpu((*blkp)->bb_u.l.bb_owner) !=
- cur->bc_ino.ip->i_ino)
+ if (xfs_btree_check_block_owner(cur, *blkp) != NULL)
goto out_bad;
/* Did we get the level we were looking for? */
@@ -4372,7 +4464,7 @@ xfs_btree_visit_block(
{
struct xfs_btree_block *block;
struct xfs_buf *bp;
- union xfs_btree_ptr rptr;
+ union xfs_btree_ptr rptr, bufptr;
int error;
/* do right sibling readahead */
@@ -4395,15 +4487,14 @@ xfs_btree_visit_block(
* return the same block without checking if the right sibling points
* back to us and creates a cyclic reference in the btree.
*/
+ xfs_btree_buf_to_ptr(cur, bp, &bufptr);
if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
- if (be64_to_cpu(rptr.l) == XFS_DADDR_TO_FSB(cur->bc_mp,
- xfs_buf_daddr(bp))) {
+ if (rptr.l == bufptr.l) {
xfs_btree_mark_sick(cur);
return -EFSCORRUPTED;
}
} else {
- if (be32_to_cpu(rptr.s) == xfs_daddr_to_agbno(cur->bc_mp,
- xfs_buf_daddr(bp))) {
+ if (rptr.s == bufptr.s) {
xfs_btree_mark_sick(cur);
return -EFSCORRUPTED;
}
@@ -4585,6 +4676,8 @@ xfs_btree_lblock_verify(
xfs_fsblock_t fsb;
xfs_failaddr_t fa;
+ ASSERT(!xfs_buftarg_in_memory(bp->b_target));
+
/* numrecs verification */
if (be16_to_cpu(block->bb_numrecs) > max_recs)
return __this_address;
@@ -4640,6 +4733,8 @@ xfs_btree_sblock_verify(
xfs_agblock_t agbno;
xfs_failaddr_t fa;
+ ASSERT(!xfs_buftarg_in_memory(bp->b_target));
+
/* numrecs verification */
if (be16_to_cpu(block->bb_numrecs) > max_recs)
return __this_address;
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 5525d3715d57..b2c08a436997 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -248,6 +248,15 @@ struct xfs_btree_cur_ino {
#define XFS_BTCUR_BMBT_INVALID_OWNER (1 << 1)
};
+/* In-memory btree information */
+struct xfbtree;
+
+struct xfs_btree_cur_mem {
+ struct xfbtree *xfbtree;
+ struct xfs_buf *head_bp;
+ struct xfs_perag *pag;
+};
+
struct xfs_btree_level {
/* buffer pointer */
struct xfs_buf *bp;
@@ -287,6 +296,7 @@ struct xfs_btree_cur
union {
struct xfs_btree_cur_ag bc_ag;
struct xfs_btree_cur_ino bc_ino;
+ struct xfs_btree_cur_mem bc_mem;
};
/* Must be at the end of the struct! */
@@ -317,6 +327,13 @@ xfs_btree_cur_sizeof(unsigned int nlevels)
*/
#define XFS_BTREE_STAGING (1<<5)
+/* btree stored in memory; not compatible with ROOT_IN_INODE */
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+# define XFS_BTREE_IN_MEMORY (1<<7)
+#else
+# define XFS_BTREE_IN_MEMORY (0)
+#endif
+
#define XFS_BTREE_NOERROR 0
#define XFS_BTREE_ERROR 1
diff --git a/fs/xfs/libxfs/xfs_btree_mem.h b/fs/xfs/libxfs/xfs_btree_mem.h
new file mode 100644
index 000000000000..6ca9ea64a9a4
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_btree_mem.h
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_BTREE_MEM_H__
+#define __XFS_BTREE_MEM_H__
+
+struct xfbtree;
+
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp);
+
+struct xfs_buftarg *xfbtree_target(struct xfbtree *xfbtree);
+int xfbtree_check_ptr(struct xfs_btree_cur *cur,
+ const union xfs_btree_ptr *ptr, int index, int level);
+xfs_daddr_t xfbtree_ptr_to_daddr(struct xfs_btree_cur *cur,
+ const union xfs_btree_ptr *ptr);
+void xfbtree_buf_to_ptr(struct xfs_btree_cur *cur, struct xfs_buf *bp,
+ union xfs_btree_ptr *ptr);
+
+unsigned int xfbtree_bbsize(void);
+
+void xfbtree_set_root(struct xfs_btree_cur *cur,
+ const union xfs_btree_ptr *ptr, int inc);
+void xfbtree_init_ptr_from_cur(struct xfs_btree_cur *cur,
+ union xfs_btree_ptr *ptr);
+struct xfs_btree_cur *xfbtree_dup_cursor(struct xfs_btree_cur *cur);
+bool xfbtree_verify_xfileoff(struct xfs_btree_cur *cur,
+ unsigned long long xfoff);
+xfs_failaddr_t xfbtree_check_block_owner(struct xfs_btree_cur *cur,
+ struct xfs_btree_block *block);
+unsigned long long xfbtree_owner(struct xfs_btree_cur *cur);
+xfs_failaddr_t xfbtree_lblock_verify(struct xfs_buf *bp, unsigned int max_recs);
+xfs_failaddr_t xfbtree_sblock_verify(struct xfs_buf *bp, unsigned int max_recs);
+unsigned long long xfbtree_buf_to_xfoff(struct xfs_btree_cur *cur,
+ struct xfs_buf *bp);
+#else
+static inline unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp)
+{
+ return 0;
+}
+
+static inline struct xfs_buftarg *
+xfbtree_target(struct xfbtree *xfbtree)
+{
+ return NULL;
+}
+
+static inline int
+xfbtree_check_ptr(struct xfs_btree_cur *cur, const union xfs_btree_ptr *ptr,
+ int index, int level)
+{
+ return 0;
+}
+
+static inline xfs_daddr_t
+xfbtree_ptr_to_daddr(struct xfs_btree_cur *cur, const union xfs_btree_ptr *ptr)
+{
+ return 0;
+}
+
+static inline void
+xfbtree_buf_to_ptr(
+ struct xfs_btree_cur *cur,
+ struct xfs_buf *bp,
+ union xfs_btree_ptr *ptr)
+{
+ memset(ptr, 0xFF, sizeof(*ptr));
+}
+
+static inline unsigned int xfbtree_bbsize(void)
+{
+ return 0;
+}
+
+#define xfbtree_set_root NULL
+#define xfbtree_init_ptr_from_cur NULL
+#define xfbtree_dup_cursor NULL
+#define xfbtree_verify_xfileoff(cur, xfoff) (false)
+#define xfbtree_check_block_owner(cur, block) NULL
+#define xfbtree_owner(cur) (0ULL)
+#define xfbtree_buf_to_xfoff(cur, bp) (-1)
+
+#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
+
+#endif /* __XFS_BTREE_MEM_H__ */
diff --git a/fs/xfs/scrub/xfbtree.c b/fs/xfs/scrub/xfbtree.c
new file mode 100644
index 000000000000..80f9ab4fec07
--- /dev/null
+++ b/fs/xfs/scrub/xfbtree.c
@@ -0,0 +1,352 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_trans.h"
+#include "xfs_btree.h"
+#include "xfs_error.h"
+#include "xfs_btree_mem.h"
+#include "xfs_ag.h"
+#include "scrub/xfile.h"
+#include "scrub/xfbtree.h"
+
+/* btree ops functions for in-memory btrees. */
+
+static xfs_failaddr_t
+xfs_btree_mem_head_verify(
+ struct xfs_buf *bp)
+{
+ struct xfs_btree_mem_head *mhead = bp->b_addr;
+ struct xfs_mount *mp = bp->b_mount;
+
+ if (!xfs_verify_magic(bp, mhead->mh_magic))
+ return __this_address;
+ if (be32_to_cpu(mhead->mh_nlevels) == 0)
+ return __this_address;
+ if (!uuid_equal(&mhead->mh_uuid, &mp->m_sb.sb_meta_uuid))
+ return __this_address;
+
+ return NULL;
+}
+
+static void
+xfs_btree_mem_head_read_verify(
+ struct xfs_buf *bp)
+{
+ xfs_failaddr_t fa = xfs_btree_mem_head_verify(bp);
+
+ if (fa)
+ xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+}
+
+static void
+xfs_btree_mem_head_write_verify(
+ struct xfs_buf *bp)
+{
+ xfs_failaddr_t fa = xfs_btree_mem_head_verify(bp);
+
+ if (fa)
+ xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+}
+
+static const struct xfs_buf_ops xfs_btree_mem_head_buf_ops = {
+ .name = "xfs_btree_mem_head",
+ .magic = { cpu_to_be32(XFS_BTREE_MEM_HEAD_MAGIC),
+ cpu_to_be32(XFS_BTREE_MEM_HEAD_MAGIC) },
+ .verify_read = xfs_btree_mem_head_read_verify,
+ .verify_write = xfs_btree_mem_head_write_verify,
+ .verify_struct = xfs_btree_mem_head_verify,
+};
+
+/* Initialize the header block for an in-memory btree. */
+static inline void
+xfs_btree_mem_head_init(
+ struct xfs_buf *head_bp,
+ unsigned long long owner,
+ xfileoff_t leaf_xfoff)
+{
+ struct xfs_btree_mem_head *mhead = head_bp->b_addr;
+ struct xfs_mount *mp = head_bp->b_mount;
+
+ mhead->mh_magic = cpu_to_be32(XFS_BTREE_MEM_HEAD_MAGIC);
+ mhead->mh_nlevels = cpu_to_be32(1);
+ mhead->mh_owner = cpu_to_be64(owner);
+ mhead->mh_root = cpu_to_be64(leaf_xfoff);
+ uuid_copy(&mhead->mh_uuid, &mp->m_sb.sb_meta_uuid);
+
+ head_bp->b_ops = &xfs_btree_mem_head_buf_ops;
+}
+
+/* Return tree height from the in-memory btree head. */
+unsigned int
+xfs_btree_mem_head_nlevels(
+ struct xfs_buf *head_bp)
+{
+ struct xfs_btree_mem_head *mhead = head_bp->b_addr;
+
+ return be32_to_cpu(mhead->mh_nlevels);
+}
+
+/* Extract the buffer target for this xfile btree. */
+struct xfs_buftarg *
+xfbtree_target(struct xfbtree *xfbtree)
+{
+ return xfbtree->target;
+}
+
+/* Is this daddr (sector offset) contained within the buffer target? */
+static inline bool
+xfbtree_verify_buftarg_xfileoff(
+ struct xfs_buftarg *btp,
+ xfileoff_t xfoff)
+{
+ xfs_daddr_t xfoff_daddr = xfo_to_daddr(xfoff);
+
+ return xfs_buftarg_verify_daddr(btp, xfoff_daddr);
+}
+
+/* Is this btree xfile offset contained within the xfile? */
+bool
+xfbtree_verify_xfileoff(
+ struct xfs_btree_cur *cur,
+ unsigned long long xfoff)
+{
+ struct xfs_buftarg *btp = xfbtree_target(cur->bc_mem.xfbtree);
+
+ return xfbtree_verify_buftarg_xfileoff(btp, xfoff);
+}
+
+/* Check if a btree pointer is reasonable. */
+int
+xfbtree_check_ptr(
+ struct xfs_btree_cur *cur,
+ const union xfs_btree_ptr *ptr,
+ int index,
+ int level)
+{
+ xfileoff_t bt_xfoff;
+ xfs_failaddr_t fa = NULL;
+
+ ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+ if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+ bt_xfoff = be64_to_cpu(ptr->l);
+ else
+ bt_xfoff = be32_to_cpu(ptr->s);
+
+ if (!xfbtree_verify_xfileoff(cur, bt_xfoff))
+ fa = __this_address;
+
+ if (fa) {
+ xfs_err(cur->bc_mp,
+"In-memory: Corrupt btree %d flags 0x%x pointer at level %d index %d fa %pS.",
+ cur->bc_btnum, cur->bc_flags, level, index,
+ fa);
+ return -EFSCORRUPTED;
+ }
+ return 0;
+}
+
+/* Convert a btree pointer to a daddr */
+xfs_daddr_t
+xfbtree_ptr_to_daddr(
+ struct xfs_btree_cur *cur,
+ const union xfs_btree_ptr *ptr)
+{
+ xfileoff_t bt_xfoff;
+
+ if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+ bt_xfoff = be64_to_cpu(ptr->l);
+ else
+ bt_xfoff = be32_to_cpu(ptr->s);
+ return xfo_to_daddr(bt_xfoff);
+}
+
+/* Set the pointer to point to this buffer. */
+void
+xfbtree_buf_to_ptr(
+ struct xfs_btree_cur *cur,
+ struct xfs_buf *bp,
+ union xfs_btree_ptr *ptr)
+{
+ xfileoff_t xfoff = xfs_daddr_to_xfo(xfs_buf_daddr(bp));
+
+ if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+ ptr->l = cpu_to_be64(xfoff);
+ else
+ ptr->s = cpu_to_be32(xfoff);
+}
+
+/* Return the in-memory btree block size, in units of 512 bytes. */
+unsigned int xfbtree_bbsize(void)
+{
+ return xfo_to_daddr(1);
+}
+
+/* Set the root of an in-memory btree. */
+void
+xfbtree_set_root(
+ struct xfs_btree_cur *cur,
+ const union xfs_btree_ptr *ptr,
+ int inc)
+{
+ struct xfs_buf *head_bp = cur->bc_mem.head_bp;
+ struct xfs_btree_mem_head *mhead = head_bp->b_addr;
+
+ ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+ if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
+ mhead->mh_root = ptr->l;
+ } else {
+ uint32_t root = be32_to_cpu(ptr->s);
+
+ mhead->mh_root = cpu_to_be64(root);
+ }
+ be32_add_cpu(&mhead->mh_nlevels, inc);
+ xfs_trans_log_buf(cur->bc_tp, head_bp, 0, sizeof(*mhead) - 1);
+}
+
+/* Initialize a pointer from the in-memory btree header. */
+void
+xfbtree_init_ptr_from_cur(
+ struct xfs_btree_cur *cur,
+ union xfs_btree_ptr *ptr)
+{
+ struct xfs_buf *head_bp = cur->bc_mem.head_bp;
+ struct xfs_btree_mem_head *mhead = head_bp->b_addr;
+
+ ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+ if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
+ ptr->l = mhead->mh_root;
+ } else {
+ uint64_t root = be64_to_cpu(mhead->mh_root);
+
+ ptr->s = cpu_to_be32(root);
+ }
+}
+
+/* Duplicate an in-memory btree cursor. */
+struct xfs_btree_cur *
+xfbtree_dup_cursor(
+ struct xfs_btree_cur *cur)
+{
+ struct xfs_btree_cur *ncur;
+
+ ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+ ncur = xfs_btree_alloc_cursor(cur->bc_mp, cur->bc_tp, cur->bc_btnum,
+ cur->bc_maxlevels, cur->bc_cache);
+ ncur->bc_flags = cur->bc_flags;
+ ncur->bc_nlevels = cur->bc_nlevels;
+ ncur->bc_statoff = cur->bc_statoff;
+ ncur->bc_ops = cur->bc_ops;
+ memcpy(&ncur->bc_mem, &cur->bc_mem, sizeof(cur->bc_mem));
+
+ if (cur->bc_mem.pag)
+ ncur->bc_mem.pag = xfs_perag_bump(cur->bc_mem.pag);
+
+ return ncur;
+}
+
+/* Check the owner of an in-memory btree block. */
+xfs_failaddr_t
+xfbtree_check_block_owner(
+ struct xfs_btree_cur *cur,
+ struct xfs_btree_block *block)
+{
+ struct xfbtree *xfbt = cur->bc_mem.xfbtree;
+
+ if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
+ if (be64_to_cpu(block->bb_u.l.bb_owner) != xfbt->owner)
+ return __this_address;
+
+ return NULL;
+ }
+
+ if (be32_to_cpu(block->bb_u.s.bb_owner) != xfbt->owner)
+ return __this_address;
+
+ return NULL;
+}
+
+/* Return the owner of this in-memory btree. */
+unsigned long long
+xfbtree_owner(
+ struct xfs_btree_cur *cur)
+{
+ return cur->bc_mem.xfbtree->owner;
+}
+
+/* Return the xfile offset (in blocks) of a btree buffer. */
+unsigned long long
+xfbtree_buf_to_xfoff(
+ struct xfs_btree_cur *cur,
+ struct xfs_buf *bp)
+{
+ ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+ return xfs_daddr_to_xfo(xfs_buf_daddr(bp));
+}
+
+/* Verify a long-format btree block. */
+xfs_failaddr_t
+xfbtree_lblock_verify(
+ struct xfs_buf *bp,
+ unsigned int max_recs)
+{
+ struct xfs_btree_block *block = XFS_BUF_TO_BLOCK(bp);
+ struct xfs_buftarg *btp = bp->b_target;
+
+ /* numrecs verification */
+ if (be16_to_cpu(block->bb_numrecs) > max_recs)
+ return __this_address;
+
+ /* sibling pointer verification */
+ if (block->bb_u.l.bb_leftsib != cpu_to_be64(NULLFSBLOCK) &&
+ !xfbtree_verify_buftarg_xfileoff(btp,
+ be64_to_cpu(block->bb_u.l.bb_leftsib)))
+ return __this_address;
+
+ if (block->bb_u.l.bb_rightsib != cpu_to_be64(NULLFSBLOCK) &&
+ !xfbtree_verify_buftarg_xfileoff(btp,
+ be64_to_cpu(block->bb_u.l.bb_rightsib)))
+ return __this_address;
+
+ return NULL;
+}
+
+/* Verify a short-format btree block. */
+xfs_failaddr_t
+xfbtree_sblock_verify(
+ struct xfs_buf *bp,
+ unsigned int max_recs)
+{
+ struct xfs_btree_block *block = XFS_BUF_TO_BLOCK(bp);
+ struct xfs_buftarg *btp = bp->b_target;
+
+ /* numrecs verification */
+ if (be16_to_cpu(block->bb_numrecs) > max_recs)
+ return __this_address;
+
+ /* sibling pointer verification */
+ if (block->bb_u.s.bb_leftsib != cpu_to_be32(NULLAGBLOCK) &&
+ !xfbtree_verify_buftarg_xfileoff(btp,
+ be32_to_cpu(block->bb_u.s.bb_leftsib)))
+ return __this_address;
+
+ if (block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK) &&
+ !xfbtree_verify_buftarg_xfileoff(btp,
+ be32_to_cpu(block->bb_u.s.bb_rightsib)))
+ return __this_address;
+
+ return NULL;
+}
diff --git a/fs/xfs/scrub/xfbtree.h b/fs/xfs/scrub/xfbtree.h
new file mode 100644
index 000000000000..b3836f21085d
--- /dev/null
+++ b/fs/xfs/scrub/xfbtree.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef XFS_SCRUB_XFBTREE_H__
+#define XFS_SCRUB_XFBTREE_H__
+
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+
+/* Root block for an in-memory btree. */
+struct xfs_btree_mem_head {
+ __be32 mh_magic;
+ __be32 mh_nlevels;
+ __be64 mh_owner;
+ __be64 mh_root;
+ uuid_t mh_uuid;
+};
+
+#define XFS_BTREE_MEM_HEAD_MAGIC 0x4341544D /* "CATM" */
+
+/* xfile-backed in-memory btrees */
+
+struct xfbtree {
+ /* buffer cache target for this in-memory btree */
+ struct xfs_buftarg *target;
+
+ /* Owner of this btree. */
+ unsigned long long owner;
+};
+
+#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
+
+#endif /* XFS_SCRUB_XFBTREE_H__ */
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
index 99b6db838612..c934e70f95e8 100644
--- a/fs/xfs/scrub/xfile.h
+++ b/fs/xfs/scrub/xfile.h
@@ -78,6 +78,47 @@ int xfile_get_page(struct xfile *xf, loff_t offset, unsigned int len,
int xfile_put_page(struct xfile *xf, struct xfile_page *xbuf);
int xfile_dump(struct xfile *xf);
+
+static inline loff_t xfile_size(struct xfile *xf)
+{
+ return i_size_read(file_inode(xf->file));
+}
+
+/* file block (aka system page size) to basic block conversions. */
+typedef unsigned long long xfileoff_t;
+#define XFB_BLOCKSIZE (PAGE_SIZE)
+#define XFB_BSHIFT (PAGE_SHIFT)
+#define XFB_SHIFT (XFB_BSHIFT - BBSHIFT)
+
+static inline loff_t xfo_to_b(xfileoff_t xfoff)
+{
+ return xfoff << XFB_BSHIFT;
+}
+
+static inline xfileoff_t b_to_xfo(loff_t pos)
+{
+ return (pos + (XFB_BLOCKSIZE - 1)) >> XFB_BSHIFT;
+}
+
+static inline xfileoff_t b_to_xfot(loff_t pos)
+{
+ return pos >> XFB_BSHIFT;
+}
+
+static inline xfs_daddr_t xfo_to_daddr(xfileoff_t xfoff)
+{
+ return xfoff << XFB_SHIFT;
+}
+
+static inline xfileoff_t xfs_daddr_to_xfo(xfs_daddr_t bb)
+{
+ return (bb + (xfo_to_daddr(1) - 1)) >> XFB_SHIFT;
+}
+
+static inline xfileoff_t xfs_daddr_to_xfot(xfs_daddr_t bb)
+{
+ return bb >> XFB_SHIFT;
+}
#else
static inline int
xfile_obj_load(struct xfile *xf, void *buf, size_t count, loff_t offset)
@@ -90,6 +131,11 @@ xfile_obj_store(struct xfile *xf, const void *buf, size_t count, loff_t offset)
{
return -EIO;
}
+
+static inline loff_t xfile_size(struct xfile *xf)
+{
+ return 0;
+}
#endif /* CONFIG_XFS_IN_MEMORY_FILE */
#endif /* __XFS_SCRUB_XFILE_H__ */
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 2ec8d39def9c..bf3b7c96f207 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2533,3 +2533,13 @@ xfs_verify_magic16(
return false;
return dmagic == bp->b_ops->magic16[idx];
}
+
+/* Return the number of sectors for a buffer target. */
+xfs_daddr_t
+xfs_buftarg_nr_sectors(
+ struct xfs_buftarg *btp)
+{
+ if (btp->bt_flags & XFS_BUFTARG_IN_MEMORY)
+ return xfile_size(btp->bt_xfile) >> SECTOR_SHIFT;
+ return bdev_nr_sectors(btp->bt_bdev);
+}
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index dcae77dabdcc..d74ce9080282 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -445,6 +445,16 @@ xfs_buftarg_zeroout(
flags);
}
+xfs_daddr_t xfs_buftarg_nr_sectors(struct xfs_buftarg *btp);
+
+static inline bool
+xfs_buftarg_verify_daddr(
+ struct xfs_buftarg *btp,
+ xfs_daddr_t daddr)
+{
+ return daddr < xfs_buftarg_nr_sectors(btp);
+}
+
int xfs_buf_reverify(struct xfs_buf *bp, const struct xfs_buf_ops *ops);
bool xfs_verify_magic(struct xfs_buf *bp, __be32 dmagic);
bool xfs_verify_magic16(struct xfs_buf *bp, __be16 dmagic);
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 74a4620d763b..6de8780b208a 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -508,6 +508,9 @@ xfs_btree_mark_sick(
{
unsigned int mask;
+ if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+ return;
+
switch (cur->bc_btnum) {
case XFS_BTNUM_BMAP:
xfs_bmap_mark_sick(cur->bc_ino.ip, cur->bc_ino.whichfork);
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 8a5dc1538aa8..2d49310fb912 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -36,6 +36,9 @@
#include "xfs_error.h"
#include <linux/iomap.h>
#include "xfs_iomap.h"
+#include "scrub/xfile.h"
+#include "scrub/xfbtree.h"
+#include "xfs_btree_mem.h"
/*
* We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d86dd34127f2..2d006bf0f9ce 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2529,7 +2529,10 @@ TRACE_EVENT(xfs_btree_alloc_block,
),
TP_fast_assign(
__entry->dev = cur->bc_mp->m_super->s_dev;
- if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) {
+ if (cur->bc_flags & XFS_BTREE_IN_MEMORY) {
+ __entry->agno = 0;
+ __entry->ino = 0;
+ } else if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) {
__entry->agno = 0;
__entry->ino = cur->bc_ino.ip->i_ino;
} else {
* [PATCH 7/7] xfs: connect in-memory btrees to xfiles
2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
` (5 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 6/7] xfs: support in-memory btrees Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
6 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, willy, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
Connect our stubbed-out in-memory btrees to an actual in-memory backing
file (aka an xfile), and add the pieces needed to track free space in
the xfile and to flush dirty xfbtree buffers on demand. Online repair
will need all of this.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
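A sketch of the free space tracking this enables. Apart from
xbitmap_take_first_set() and xbitmap_set(), the names here are
illustrative rather than the patch's actual internals:

	/* allocate: take the lowest free block recorded in the bitmap */
	error = xbitmap_take_first_set(freespace, 0, -1ULL, &bt_xfoff);
	if (error == -ENODATA) {
		/* bitmap empty: extend the xfile by one more block */
	}

	/* free: mark the block for reuse instead of punching it out */
	error = xbitmap_set(freespace, bt_xfoff, 1);

Because the backing store is an xfile, allocating and freeing xfbtree
blocks never touches the real filesystem's free space btrees.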
fs/xfs/libxfs/xfs_btree_mem.h | 41 ++++
fs/xfs/scrub/bitmap.c | 28 ++
fs/xfs/scrub/bitmap.h | 3
fs/xfs/scrub/scrub.c | 4
fs/xfs/scrub/scrub.h | 3
fs/xfs/scrub/trace.c | 13 +
fs/xfs/scrub/trace.h | 110 ++++++++++
fs/xfs/scrub/xfbtree.c | 466 +++++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/xfbtree.h | 25 ++
fs/xfs/scrub/xfile.c | 83 +++++++
fs/xfs/scrub/xfile.h | 2
fs/xfs/xfs_buf.c | 6 -
fs/xfs/xfs_trace.h | 1
fs/xfs/xfs_trans.h | 1
fs/xfs/xfs_trans_buf.c | 42 ++++
15 files changed, 825 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_btree_mem.h b/fs/xfs/libxfs/xfs_btree_mem.h
index 6ca9ea64a9a4..5e7b1f20fb5b 100644
--- a/fs/xfs/libxfs/xfs_btree_mem.h
+++ b/fs/xfs/libxfs/xfs_btree_mem.h
@@ -8,6 +8,26 @@
struct xfbtree;
+struct xfbtree_config {
+ /* Buffer ops for the btree root block */
+ const struct xfs_btree_ops *btree_ops;
+
+ /* Buffer target for the xfile backing this btree. */
+ struct xfs_buftarg *target;
+
+ /* Owner of this btree. */
+ unsigned long long owner;
+
+ /* Btree type number */
+ xfs_btnum_t btnum;
+
+ /* XFBTREE_CREATE_* flags */
+ unsigned int flags;
+};
+
+/* btree has long pointers */
+#define XFBTREE_CREATE_LONG_PTRS (1U << 0)
+
#ifdef CONFIG_XFS_IN_MEMORY_BTREE
unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp);
@@ -35,6 +55,16 @@ xfs_failaddr_t xfbtree_lblock_verify(struct xfs_buf *bp, unsigned int max_recs);
xfs_failaddr_t xfbtree_sblock_verify(struct xfs_buf *bp, unsigned int max_recs);
unsigned long long xfbtree_buf_to_xfoff(struct xfs_btree_cur *cur,
struct xfs_buf *bp);
+
+int xfbtree_get_minrecs(struct xfs_btree_cur *cur, int level);
+int xfbtree_get_maxrecs(struct xfs_btree_cur *cur, int level);
+
+int xfbtree_create(struct xfs_mount *mp, const struct xfbtree_config *cfg,
+ struct xfbtree **xfbtreep);
+int xfbtree_alloc_block(struct xfs_btree_cur *cur,
+ const union xfs_btree_ptr *start, union xfs_btree_ptr *ptr,
+ int *stat);
+int xfbtree_free_block(struct xfs_btree_cur *cur, struct xfs_buf *bp);
#else
static inline unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp)
{
@@ -77,11 +107,22 @@ static inline unsigned int xfbtree_bbsize(void)
#define xfbtree_set_root NULL
#define xfbtree_init_ptr_from_cur NULL
#define xfbtree_dup_cursor NULL
+#define xfbtree_get_minrecs NULL
+#define xfbtree_get_maxrecs NULL
+#define xfbtree_alloc_block NULL
+#define xfbtree_free_block NULL
#define xfbtree_verify_xfileoff(cur, xfoff) (false)
#define xfbtree_check_block_owner(cur, block) NULL
#define xfbtree_owner(cur) (0ULL)
#define xfbtree_buf_to_xfoff(cur, bp) (-1)
+static inline int
+xfbtree_create(struct xfs_mount *mp, const struct xfbtree_config *cfg,
+ struct xfbtree **xfbtreep)
+{
+ return -EOPNOTSUPP;
+}
+
#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
#endif /* __XFS_BTREE_MEM_H__ */
diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index f707434b1c86..c98f4c45414a 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -379,3 +379,31 @@ xbitmap_test(
*len = bn->bn_start - start;
return false;
}
+
+/*
+ * Find the first set bit in this bitmap, clear it, and return the index of
+ * that bit in @valp. Returns -ENODATA if no bits were set, or the usual
+ * negative errno.
+ */
+int
+xbitmap_take_first_set(
+ struct xbitmap *bitmap,
+ uint64_t start,
+ uint64_t last,
+ uint64_t *valp)
+{
+ struct xbitmap_node *bn;
+ uint64_t val;
+ int error;
+
+ bn = xbitmap_tree_iter_first(&bitmap->xb_root, start, last);
+ if (!bn)
+ return -ENODATA;
+
+ val = bn->bn_start;
+ error = xbitmap_clear(bitmap, bn->bn_start, 1);
+ if (error)
+ return error;
+ *valp = val;
+ return 0;
+}
diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
index 7f1b9c9c7831..1ebe1918bdb2 100644
--- a/fs/xfs/scrub/bitmap.h
+++ b/fs/xfs/scrub/bitmap.h
@@ -32,6 +32,9 @@ int xbitmap_walk(struct xbitmap *bitmap, xbitmap_walk_fn fn,
bool xbitmap_empty(struct xbitmap *bitmap);
bool xbitmap_test(struct xbitmap *bitmap, uint64_t start, uint64_t *len);
+int xbitmap_take_first_set(struct xbitmap *bitmap, uint64_t start,
+ uint64_t last, uint64_t *valp);
+
/* Bitmaps, but for type-checked for xfs_agblock_t */
struct xagb_bitmap {
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index fd116531a0d9..5bbc12649277 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -191,6 +191,10 @@ xchk_teardown(
sc->flags &= ~XCHK_HAVE_FREEZE_PROT;
mnt_drop_write_file(sc->file);
}
+ if (sc->xfile_buftarg) {
+ xfs_free_buftarg(sc->xfile_buftarg);
+ sc->xfile_buftarg = NULL;
+ }
if (sc->xfile) {
xfile_destroy(sc->xfile);
sc->xfile = NULL;
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 04afb584f504..6fe59d1a2518 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -99,6 +99,9 @@ struct xfs_scrub {
/* xfile used by the scrubbers; freed at teardown. */
struct xfile *xfile;
+ /* buffer target for the xfile; also freed at teardown. */
+ struct xfs_buftarg *xfile_buftarg;
+
/* Lock flags for @ip. */
uint ilock_flags;
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index 08e05d49e7c0..177fc4c75507 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -12,15 +12,19 @@
#include "xfs_mount.h"
#include "xfs_inode.h"
#include "xfs_btree.h"
+#include "xfs_btree_mem.h"
#include "xfs_ag.h"
#include "xfs_quota_defs.h"
#include "xfs_dir2.h"
+#include "xfs_da_format.h"
+#include "xfs_btree_mem.h"
#include "scrub/scrub.h"
#include "scrub/xfile.h"
#include "scrub/xfarray.h"
#include "scrub/iscan.h"
#include "scrub/nlinks.h"
#include "scrub/fscounters.h"
+#include "scrub/xfbtree.h"
/* Figure out which block the btree cursor was pointing to. */
static inline xfs_fsblock_t
@@ -39,6 +43,15 @@ xchk_btree_cur_fsbno(
return NULLFSBLOCK;
}
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+static inline unsigned long
+xfbtree_ino(
+ struct xfbtree *xfbt)
+{
+ return file_inode(xfbt->target->bt_xfile->file)->i_ino;
+}
+#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
+
/*
* We include this last to have the helpers above available for the trace
* event implementations.
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 14569068b6ee..05b6a6e3d0ab 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -24,6 +24,8 @@ struct xfarray_sortinfo;
struct xchk_iscan;
struct xchk_nlink;
struct xchk_fscounters;
+struct xfbtree;
+struct xfbtree_config;
/*
* ftrace's __print_symbolic requires that all enum values be wrapped in the
@@ -866,6 +868,8 @@ DEFINE_XFILE_EVENT(xfile_pwrite);
DEFINE_XFILE_EVENT(xfile_seek_data);
DEFINE_XFILE_EVENT(xfile_get_page);
DEFINE_XFILE_EVENT(xfile_put_page);
+DEFINE_XFILE_EVENT(xfile_discard);
+DEFINE_XFILE_EVENT(xfile_prealloc);
TRACE_EVENT(xfarray_create,
TP_PROTO(struct xfarray *xfa, unsigned long long required_capacity),
@@ -1971,8 +1975,114 @@ DEFINE_XREP_DQUOT_EVENT(xrep_quotacheck_dquot);
DEFINE_SCRUB_NLINKS_DIFF_EVENT(xrep_nlinks_update_inode);
DEFINE_SCRUB_NLINKS_DIFF_EVENT(xrep_nlinks_unfixable_inode);
+TRACE_EVENT(xfbtree_create,
+ TP_PROTO(struct xfs_mount *mp, const struct xfbtree_config *cfg,
+ struct xfbtree *xfbt),
+ TP_ARGS(mp, cfg, xfbt),
+ TP_STRUCT__entry(
+ __field(xfs_btnum_t, btnum)
+ __field(unsigned int, xfbtree_flags)
+ __field(unsigned long, xfino)
+ __field(unsigned int, leaf_mxr)
+ __field(unsigned int, leaf_mnr)
+ __field(unsigned int, node_mxr)
+ __field(unsigned int, node_mnr)
+ __field(unsigned long long, owner)
+ ),
+ TP_fast_assign(
+ __entry->btnum = cfg->btnum;
+ __entry->xfbtree_flags = cfg->flags;
+ __entry->xfino = xfbtree_ino(xfbt);
+ __entry->leaf_mxr = xfbt->maxrecs[0];
+ __entry->node_mxr = xfbt->maxrecs[1];
+ __entry->leaf_mnr = xfbt->minrecs[0];
+ __entry->node_mnr = xfbt->minrecs[1];
+ __entry->owner = cfg->owner;
+ ),
+ TP_printk("xfino 0x%lx btnum %s owner 0x%llx leaf_mxr %u leaf_mnr %u node_mxr %u node_mnr %u",
+ __entry->xfino,
+ __print_symbolic(__entry->btnum, XFS_BTNUM_STRINGS),
+ __entry->owner,
+ __entry->leaf_mxr,
+ __entry->leaf_mnr,
+ __entry->node_mxr,
+ __entry->node_mnr)
+);
+
+DECLARE_EVENT_CLASS(xfbtree_buf_class,
+ TP_PROTO(struct xfbtree *xfbt, struct xfs_buf *bp),
+ TP_ARGS(xfbt, bp),
+ TP_STRUCT__entry(
+ __field(unsigned long, xfino)
+ __field(xfs_daddr_t, bno)
+ __field(int, nblks)
+ __field(int, hold)
+ __field(int, pincount)
+ __field(unsigned, lockval)
+ __field(unsigned, flags)
+ ),
+ TP_fast_assign(
+ __entry->xfino = xfbtree_ino(xfbt);
+ __entry->bno = xfs_buf_daddr(bp);
+ __entry->nblks = bp->b_length;
+ __entry->hold = atomic_read(&bp->b_hold);
+ __entry->pincount = atomic_read(&bp->b_pin_count);
+ __entry->lockval = bp->b_sema.count;
+ __entry->flags = bp->b_flags;
+ ),
+ TP_printk("xfino 0x%lx daddr 0x%llx bbcount 0x%x hold %d pincount %d "
+ "lock %d flags %s",
+ __entry->xfino,
+ (unsigned long long)__entry->bno,
+ __entry->nblks,
+ __entry->hold,
+ __entry->pincount,
+ __entry->lockval,
+ __print_flags(__entry->flags, "|", XFS_BUF_FLAGS))
+)
+
+#define DEFINE_XFBTREE_BUF_EVENT(name) \
+DEFINE_EVENT(xfbtree_buf_class, name, \
+ TP_PROTO(struct xfbtree *xfbt, struct xfs_buf *bp), \
+ TP_ARGS(xfbt, bp))
+DEFINE_XFBTREE_BUF_EVENT(xfbtree_create_root_buf);
+DEFINE_XFBTREE_BUF_EVENT(xfbtree_trans_commit_buf);
+DEFINE_XFBTREE_BUF_EVENT(xfbtree_trans_cancel_buf);
+
+DECLARE_EVENT_CLASS(xfbtree_freesp_class,
+ TP_PROTO(struct xfbtree *xfbt, struct xfs_btree_cur *cur,
+ xfs_fileoff_t fileoff),
+ TP_ARGS(xfbt, cur, fileoff),
+ TP_STRUCT__entry(
+ __field(unsigned long, xfino)
+ __field(xfs_btnum_t, btnum)
+ __field(int, nlevels)
+ __field(xfs_fileoff_t, fileoff)
+ ),
+ TP_fast_assign(
+ __entry->xfino = xfbtree_ino(xfbt);
+ __entry->btnum = cur->bc_btnum;
+ __entry->nlevels = cur->bc_nlevels;
+ __entry->fileoff = fileoff;
+ ),
+ TP_printk("xfino 0x%lx btree %s nlevels %d fileoff 0x%llx",
+ __entry->xfino,
+ __print_symbolic(__entry->btnum, XFS_BTNUM_STRINGS),
+ __entry->nlevels,
+ (unsigned long long)__entry->fileoff)
+)
+
+#define DEFINE_XFBTREE_FREESP_EVENT(name) \
+DEFINE_EVENT(xfbtree_freesp_class, name, \
+ TP_PROTO(struct xfbtree *xfbt, struct xfs_btree_cur *cur, \
+ xfs_fileoff_t fileoff), \
+ TP_ARGS(xfbt, cur, fileoff))
+DEFINE_XFBTREE_FREESP_EVENT(xfbtree_alloc_block);
+DEFINE_XFBTREE_FREESP_EVENT(xfbtree_free_block);
+
#endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
+
#endif /* _TRACE_XFS_SCRUB_TRACE_H */
#undef TRACE_INCLUDE_PATH
diff --git a/fs/xfs/scrub/xfbtree.c b/fs/xfs/scrub/xfbtree.c
index 80f9ab4fec07..3eeb5110a1cc 100644
--- a/fs/xfs/scrub/xfbtree.c
+++ b/fs/xfs/scrub/xfbtree.c
@@ -9,14 +9,19 @@
#include "xfs_format.h"
#include "xfs_log_format.h"
#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
#include "xfs_mount.h"
#include "xfs_trans.h"
+#include "xfs_buf_item.h"
#include "xfs_btree.h"
#include "xfs_error.h"
#include "xfs_btree_mem.h"
#include "xfs_ag.h"
+#include "scrub/scrub.h"
#include "scrub/xfile.h"
#include "scrub/xfbtree.h"
+#include "scrub/bitmap.h"
+#include "scrub/trace.h"
/* btree ops functions for in-memory btrees. */
@@ -142,9 +147,18 @@ xfbtree_check_ptr(
else
bt_xfoff = be32_to_cpu(ptr->s);
- if (!xfbtree_verify_xfileoff(cur, bt_xfoff))
+ if (!xfbtree_verify_xfileoff(cur, bt_xfoff)) {
fa = __this_address;
+ goto done;
+ }
+ /* Can't point to the head or anything before it */
+ if (bt_xfoff < XFBTREE_INIT_LEAF_BLOCK) {
+ fa = __this_address;
+ goto done;
+ }
+
+done:
if (fa) {
xfs_err(cur->bc_mp,
"In-memory: Corrupt btree %d flags 0x%x pointer at level %d index %d fa %pS.",
@@ -350,3 +364,453 @@ xfbtree_sblock_verify(
return NULL;
}
+
+/* Close the btree xfile and release all resources. */
+void
+xfbtree_destroy(
+ struct xfbtree *xfbt)
+{
+ xbitmap_destroy(xfbt->freespace);
+ kfree(xfbt->freespace);
+ xfs_buftarg_drain(xfbt->target);
+ kfree(xfbt);
+}
+
+/* Compute the number of bytes available for records. */
+static inline unsigned int
+xfbtree_rec_bytes(
+ struct xfs_mount *mp,
+ const struct xfbtree_config *cfg)
+{
+ unsigned int blocklen = xfo_to_b(1);
+
+ if (cfg->flags & XFBTREE_CREATE_LONG_PTRS) {
+ if (xfs_has_crc(mp))
+ return blocklen - XFS_BTREE_LBLOCK_CRC_LEN;
+
+ return blocklen - XFS_BTREE_LBLOCK_LEN;
+ }
+
+ if (xfs_has_crc(mp))
+ return blocklen - XFS_BTREE_SBLOCK_CRC_LEN;
+
+ return blocklen - XFS_BTREE_SBLOCK_LEN;
+}
+
+/* Initialize an empty leaf block as the btree root. */
+STATIC int
+xfbtree_init_leaf_block(
+ struct xfs_mount *mp,
+ struct xfbtree *xfbt,
+ const struct xfbtree_config *cfg)
+{
+ struct xfs_buf *bp;
+ xfs_daddr_t daddr;
+ int error;
+ unsigned int bc_flags = 0;
+
+ if (cfg->flags & XFBTREE_CREATE_LONG_PTRS)
+ bc_flags |= XFS_BTREE_LONG_PTRS;
+
+ daddr = xfo_to_daddr(XFBTREE_INIT_LEAF_BLOCK);
+ error = xfs_buf_get(xfbt->target, daddr, xfbtree_bbsize(), &bp);
+ if (error)
+ return error;
+
+ trace_xfbtree_create_root_buf(xfbt, bp);
+
+ bp->b_ops = cfg->btree_ops->buf_ops;
+ xfs_btree_init_block_int(mp, bp->b_addr, daddr, cfg->btnum, 0, 0,
+ cfg->owner, bc_flags);
+ error = xfs_bwrite(bp);
+ xfs_buf_relse(bp);
+ if (error)
+ return error;
+
+ xfbt->xf_used++;
+ return 0;
+}
+
+/* Initialize the in-memory btree header block. */
+STATIC int
+xfbtree_init_head(
+ struct xfbtree *xfbt)
+{
+ struct xfs_buf *bp;
+ xfs_daddr_t daddr;
+ int error;
+
+ daddr = xfo_to_daddr(XFBTREE_HEAD_BLOCK);
+ error = xfs_buf_get(xfbt->target, daddr, xfbtree_bbsize(), &bp);
+ if (error)
+ return error;
+
+ xfs_btree_mem_head_init(bp, xfbt->owner, XFBTREE_INIT_LEAF_BLOCK);
+ error = xfs_bwrite(bp);
+ xfs_buf_relse(bp);
+ if (error)
+ return error;
+
+ xfbt->xf_used++;
+ return 0;
+}
+
+/* Create an xfbtree to stage an in-memory btree in an xfile-backed buftarg. */
+int
+xfbtree_create(
+ struct xfs_mount *mp,
+ const struct xfbtree_config *cfg,
+ struct xfbtree **xfbtreep)
+{
+ struct xfbtree *xfbt;
+ unsigned int blocklen = xfbtree_rec_bytes(mp, cfg);
+ unsigned int keyptr_len = cfg->btree_ops->key_len;
+ int error;
+
+ /* Requires an xfile-backed buftarg. */
+ if (!(cfg->target->bt_flags & XFS_BUFTARG_IN_MEMORY)) {
+ ASSERT(cfg->target->bt_flags & XFS_BUFTARG_IN_MEMORY);
+ return -EINVAL;
+ }
+
+ xfbt = kzalloc(sizeof(struct xfbtree), XCHK_GFP_FLAGS);
+ if (!xfbt)
+ return -ENOMEM;
+
+ /* Assign our memory file and the free space bitmap. */
+ xfbt->target = cfg->target;
+ xfbt->freespace = kmalloc(sizeof(struct xbitmap), XCHK_GFP_FLAGS);
+ if (!xfbt->freespace) {
+ error = -ENOMEM;
+ goto err_buftarg;
+ }
+ xbitmap_init(xfbt->freespace);
+
+ /* Set up min/maxrecs for this btree. */
+ if (cfg->flags & XFBTREE_CREATE_LONG_PTRS)
+ keyptr_len += sizeof(__be64);
+ else
+ keyptr_len += sizeof(__be32);
+ xfbt->maxrecs[0] = blocklen / cfg->btree_ops->rec_len;
+ xfbt->maxrecs[1] = blocklen / keyptr_len;
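+ /* Blocks must stay at least half full, per the usual btree rule. */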
+ xfbt->minrecs[0] = xfbt->maxrecs[0] / 2;
+ xfbt->minrecs[1] = xfbt->maxrecs[1] / 2;
+ xfbt->owner = cfg->owner;
+
+ /* Initialize the empty btree. */
+ error = xfbtree_init_leaf_block(mp, xfbt, cfg);
+ if (error)
+ goto err_freesp;
+
+ error = xfbtree_init_head(xfbt);
+ if (error)
+ goto err_freesp;
+
+ trace_xfbtree_create(mp, cfg, xfbt);
+
+ *xfbtreep = xfbt;
+ return 0;
+
+err_freesp:
+ xbitmap_destroy(xfbt->freespace);
+ kfree(xfbt->freespace);
+err_buftarg:
+ xfs_buftarg_drain(xfbt->target);
+ kfree(xfbt);
+ return error;
+}
+
+/* Read the in-memory btree head. */
+int
+xfbtree_head_read_buf(
+ struct xfbtree *xfbt,
+ struct xfs_trans *tp,
+ struct xfs_buf **bpp)
+{
+ struct xfs_buftarg *btp = xfbt->target;
+ struct xfs_mount *mp = btp->bt_mount;
+ struct xfs_btree_mem_head *mhead;
+ struct xfs_buf *bp;
+ xfs_daddr_t daddr;
+ int error;
+
+ daddr = xfo_to_daddr(XFBTREE_HEAD_BLOCK);
+ error = xfs_trans_read_buf(mp, tp, btp, daddr, xfbtree_bbsize(), 0,
+ &bp, &xfs_btree_mem_head_buf_ops);
+ if (error)
+ return error;
+
+ mhead = bp->b_addr;
+ if (be64_to_cpu(mhead->mh_owner) != xfbt->owner) {
+ xfs_verifier_error(bp, -EFSCORRUPTED, __this_address);
+ xfs_trans_brelse(tp, bp);
+ return -EFSCORRUPTED;
+ }
+
+ *bpp = bp;
+ return 0;
+}
+
+static inline struct xfile *xfbtree_xfile(struct xfbtree *xfbt)
+{
+ return xfbt->target->bt_xfile;
+}
+
+/* Allocate a block to our in-memory btree. */
+int
+xfbtree_alloc_block(
+ struct xfs_btree_cur *cur,
+ const union xfs_btree_ptr *start,
+ union xfs_btree_ptr *new,
+ int *stat)
+{
+ struct xfbtree *xfbt = cur->bc_mem.xfbtree;
+ xfileoff_t bt_xfoff;
+ loff_t pos;
+ int error;
+
+ ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+ /*
+ * Find the first free block in the free space bitmap and take it. If
+ * none are found, allocate a new block from the end of the used space.
+ */
+ error = xbitmap_take_first_set(xfbt->freespace, 0, -1ULL, &bt_xfoff);
+ if (error == -ENODATA) {
+ bt_xfoff = xfbt->xf_used;
+ xfbt->xf_used++;
+ } else if (error) {
+ return error;
+ }
+
+ trace_xfbtree_alloc_block(xfbt, cur, bt_xfoff);
+
+ /* Fail if the block address exceeds the maximum for short pointers. */
+ if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && bt_xfoff >= INT_MAX) {
+ *stat = 0;
+ return 0;
+ }
+
+ /* Make sure we actually can write to the block before we return it. */
+ pos = xfo_to_b(bt_xfoff);
+ error = xfile_prealloc(xfbtree_xfile(xfbt), pos, xfo_to_b(1));
+ if (error)
+ return error;
+
+ if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+ new->l = cpu_to_be64(bt_xfoff);
+ else
+ new->s = cpu_to_be32(bt_xfoff);
+
+ *stat = 1;
+ return 0;
+}
+
+/* Free a block from our in-memory btree. */
+int
+xfbtree_free_block(
+ struct xfs_btree_cur *cur,
+ struct xfs_buf *bp)
+{
+ struct xfbtree *xfbt = cur->bc_mem.xfbtree;
+ xfileoff_t bt_xfoff, bt_xflen;
+
+ ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
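+ /* Convert the buffer's daddr and length into xfile block units. */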
+ bt_xfoff = xfs_daddr_to_xfot(xfs_buf_daddr(bp));
+ bt_xflen = xfs_daddr_to_xfot(bp->b_length);
+
+ trace_xfbtree_free_block(xfbt, cur, bt_xfoff);
+
+ return xbitmap_set(xfbt->freespace, bt_xfoff, bt_xflen);
+}
+
+/* Return the minimum number of records for a btree block. */
+int
+xfbtree_get_minrecs(
+ struct xfs_btree_cur *cur,
+ int level)
+{
+ struct xfbtree *xfbt = cur->bc_mem.xfbtree;
+
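+ /* minrecs[0] describes leaf blocks; minrecs[1] describes node blocks. */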
+ return xfbt->minrecs[level != 0];
+}
+
+/* Return the maximum number of records for a btree block. */
+int
+xfbtree_get_maxrecs(
+ struct xfs_btree_cur *cur,
+ int level)
+{
+ struct xfbtree *xfbt = cur->bc_mem.xfbtree;
+
+ return xfbt->maxrecs[level != 0];
+}
+
+/* If this log item is a buffer item that came from the xfbtree, return it. */
+static inline struct xfs_buf *
+xfbtree_buf_match(
+ struct xfbtree *xfbt,
+ const struct xfs_log_item *lip)
+{
+ const struct xfs_buf_log_item *bli;
+ struct xfs_buf *bp;
+
+ if (lip->li_type != XFS_LI_BUF)
+ return NULL;
+
+ bli = container_of(lip, struct xfs_buf_log_item, bli_item);
+ bp = bli->bli_buf;
+ if (bp->b_target != xfbt->target)
+ return NULL;
+
+ return bp;
+}
+
+/*
+ * Detach this (probably dirty) xfbtree buffer from the transaction by any
+ * means necessary. Returns true if the buffer needs to be written.
+ */
+STATIC bool
+xfbtree_trans_bdetach(
+ struct xfs_trans *tp,
+ struct xfs_buf *bp)
+{
+ struct xfs_buf_log_item *bli = bp->b_log_item;
+ bool dirty;
+
+ ASSERT(bli != NULL);
+
+ dirty = bli->bli_flags & (XFS_BLI_DIRTY | XFS_BLI_ORDERED);
+
+ bli->bli_flags &= ~(XFS_BLI_DIRTY | XFS_BLI_ORDERED |
+ XFS_BLI_LOGGED | XFS_BLI_STALE);
+ clear_bit(XFS_LI_DIRTY, &bli->bli_item.li_flags);
+
+ while (bp->b_log_item != NULL)
+ xfs_trans_bdetach(tp, bp);
+
+ return dirty;
+}
+
+/*
+ * Commit changes to the incore btree immediately by writing all dirty xfbtree
+ * buffers to the backing xfile. This detaches all xfbtree buffers from the
+ * transaction, even on failure. The buffer locks are dropped between the
+ * delwri queue and submit, so the caller must synchronize btree access.
+ *
+ * Normally we'd let the buffers commit with the transaction and get written to
+ * the xfile via the log, but online repair stages ephemeral btrees in memory
+ * and uses the btree_staging functions to write new btrees to disk atomically.
+ * The in-memory btree (and its backing store) are discarded at the end of the
+ * repair phase, which means that xfbtree buffers cannot commit with the rest
+ * of a transaction.
+ *
+ * In other words, online repair only needs the transaction to collect buffer
+ * pointers and to avoid buffer deadlocks, not to guarantee consistency of
+ * updates.
+ */
+int
+xfbtree_trans_commit(
+ struct xfbtree *xfbt,
+ struct xfs_trans *tp)
+{
+ LIST_HEAD(buffer_list);
+ struct xfs_log_item *lip, *n;
+ bool corrupt = false;
+ bool tp_dirty = false;
+
+ /*
+ * For each xfbtree buffer attached to the transaction, write the dirty
+ * buffers to the xfile and release them.
+ */
+ list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
+ struct xfs_buf *bp = xfbtree_buf_match(xfbt, lip);
+ bool dirty;
+
+ if (!bp) {
+ if (test_bit(XFS_LI_DIRTY, &lip->li_flags))
+ tp_dirty |= true;
+ continue;
+ }
+
+ trace_xfbtree_trans_commit_buf(xfbt, bp);
+
+ dirty = xfbtree_trans_bdetach(tp, bp);
+ if (dirty && !corrupt) {
+ xfs_failaddr_t fa = bp->b_ops->verify_struct(bp);
+
+ /*
+ * Because this btree is ephemeral, validate the buffer
+ * structure before delwri_submit so that we can return
+ * corruption errors to the caller without shutting
+ * down the filesystem.
+ *
+ * If the buffer fails verification, log the failure
+ * but continue walking the transaction items so that
+ * we remove all ephemeral btree buffers.
+ */
+ if (fa) {
+ corrupt = true;
+ xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+ } else {
+ xfs_buf_delwri_queue_here(bp, &buffer_list);
+ }
+ }
+
+ xfs_buf_relse(bp);
+ }
+
+ /*
+ * Reset the transaction's dirty flag to reflect the dirty state of the
+ * log items that are still attached.
+ */
+ tp->t_flags = (tp->t_flags & ~XFS_TRANS_DIRTY) |
+ (tp_dirty ? XFS_TRANS_DIRTY : 0);
+
+ if (corrupt) {
+ xfs_buf_delwri_cancel(&buffer_list);
+ return -EFSCORRUPTED;
+ }
+
+ if (list_empty(&buffer_list))
+ return 0;
+
+ return xfs_buf_delwri_submit(&buffer_list);
+}
+
+/*
+ * Cancel changes to the incore btree by detaching all the xfbtree buffers.
+ * Changes are not written to the backing store. This is needed for online
+ * repair btrees, which are by nature ephemeral.
+ */
+void
+xfbtree_trans_cancel(
+ struct xfbtree *xfbt,
+ struct xfs_trans *tp)
+{
+ struct xfs_log_item *lip, *n;
+ bool tp_dirty = false;
+
+ list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
+ struct xfs_buf *bp = xfbtree_buf_match(xfbt, lip);
+
+ if (!bp) {
+ if (test_bit(XFS_LI_DIRTY, &lip->li_flags))
+ tp_dirty |= true;
+ continue;
+ }
+
+ trace_xfbtree_trans_cancel_buf(xfbt, bp);
+
+ xfbtree_trans_bdetach(tp, bp);
+ xfs_buf_relse(bp);
+ }
+
+ /*
+ * Reset the transaction's dirty flag to reflect the dirty state of the
+ * log items that are still attached.
+ */
+ tp->t_flags = (tp->t_flags & ~XFS_TRANS_DIRTY) |
+ (tp_dirty ? XFS_TRANS_DIRTY : 0);
+}
diff --git a/fs/xfs/scrub/xfbtree.h b/fs/xfs/scrub/xfbtree.h
index b3836f21085d..bdbf850bf7a1 100644
--- a/fs/xfs/scrub/xfbtree.h
+++ b/fs/xfs/scrub/xfbtree.h
@@ -22,13 +22,36 @@ struct xfs_btree_mem_head {
/* xfile-backed in-memory btrees */
struct xfbtree {
- /* buffer cache target for this in-memory btree */
+ /* buffer cache target for the xfile backing this in-memory btree */
struct xfs_buftarg *target;
+ /* Bitmap of free xfile blocks below xf_used. */
+ struct xbitmap *freespace;
+
+ /* Number of xfile blocks actually used by this xfbtree. */
+ xfileoff_t xf_used;
+
/* Owner of this btree. */
unsigned long long owner;
+
+ /* Minimum and maximum records per block. */
+ unsigned int maxrecs[2];
+ unsigned int minrecs[2];
};
+/* The head of the in-memory btree is always at block 0 */
+#define XFBTREE_HEAD_BLOCK 0
+
+/* in-memory btrees are always created with an empty leaf block at block 1 */
+#define XFBTREE_INIT_LEAF_BLOCK 1
+
+int xfbtree_head_read_buf(struct xfbtree *xfbt, struct xfs_trans *tp,
+ struct xfs_buf **bpp);
+
+void xfbtree_destroy(struct xfbtree *xfbt);
+int xfbtree_trans_commit(struct xfbtree *xfbt, struct xfs_trans *tp);
+void xfbtree_trans_cancel(struct xfbtree *xfbt, struct xfs_trans *tp);
+
#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
#endif /* XFS_SCRUB_XFBTREE_H__ */
diff --git a/fs/xfs/scrub/xfile.c b/fs/xfs/scrub/xfile.c
index 1b858b6a53c8..b1cbf80f55d7 100644
--- a/fs/xfs/scrub/xfile.c
+++ b/fs/xfs/scrub/xfile.c
@@ -285,6 +285,89 @@ xfile_pwrite(
return error;
}
+/* Discard pages backing a range of the xfile. */
+void
+xfile_discard(
+ struct xfile *xf,
+ loff_t pos,
+ u64 count)
+{
+ trace_xfile_discard(xf, pos, count);
+ shmem_truncate_range(file_inode(xf->file), pos, pos + count - 1);
+}
+
+/* Ensure that there is storage backing the given range. */
+int
+xfile_prealloc(
+ struct xfile *xf,
+ loff_t pos,
+ u64 count)
+{
+ struct inode *inode = file_inode(xf->file);
+ struct address_space *mapping = inode->i_mapping;
+ const struct address_space_operations *aops = mapping->a_ops;
+ struct page *page = NULL;
+ unsigned int pflags;
+ int error = 0;
+
+ if (count > MAX_RW_COUNT)
+ return -E2BIG;
+ if (inode->i_sb->s_maxbytes - pos < count)
+ return -EFBIG;
+
+ trace_xfile_prealloc(xf, pos, count);
+
+ pflags = memalloc_nofs_save();
+ while (count > 0) {
+ void *fsdata = NULL;
+ unsigned int len;
+ int ret;
+
+ len = min_t(ssize_t, count, PAGE_SIZE - offset_in_page(pos));
+
+ /*
+ * We call write_begin directly here to avoid all the freezer
+ * protection lock-taking that happens in the normal path.
+ * shmem doesn't support fs freeze, but lockdep doesn't know
+ * that and will trip over that.
+ */
+ error = aops->write_begin(NULL, mapping, pos, len, &page,
+ &fsdata);
+ if (error)
+ break;
+
+ /*
+ * xfile pages must never be mapped into userspace, so we skip
+ * the dcache flush. If the page is not uptodate, zero it now so
+ * that the backing store is fully allocated and later writes
+ * cannot fail for lack of space.
+ */
+ if (!PageUptodate(page)) {
+ void *kaddr = kmap_local_page(page);
+
+ memset(kaddr, 0, PAGE_SIZE);
+ SetPageUptodate(page);
+ kunmap_local(kaddr);
+ }
+
+ ret = aops->write_end(NULL, mapping, pos, len, len, page,
+ fsdata);
+ if (ret < 0) {
+ error = ret;
+ break;
+ }
+ if (ret != len) {
+ error = -EIO;
+ break;
+ }
+
+ count -= len;
+ pos += len;
+ }
+ memalloc_nofs_restore(pflags);
+
+ return error;
+}
+
/* Find the next written area in the xfile data for a given offset. */
loff_t
xfile_seek_data(
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
index c934e70f95e8..bf80bb796e83 100644
--- a/fs/xfs/scrub/xfile.h
+++ b/fs/xfs/scrub/xfile.h
@@ -64,6 +64,8 @@ xfile_obj_store(struct xfile *xf, const void *buf, size_t count, loff_t pos)
return 0;
}
+void xfile_discard(struct xfile *xf, loff_t pos, u64 count);
+int xfile_prealloc(struct xfile *xf, loff_t pos, u64 count);
loff_t xfile_seek_data(struct xfile *xf, loff_t pos);
struct xfile_stat {
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index bf3b7c96f207..410db46e7935 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2066,8 +2066,12 @@ __xfs_alloc_buftarg(
xfs_km_flags_t km_flags)
{
struct xfs_buftarg *btp;
+ gfp_t gfp = GFP_KERNEL;
int error;
+ if (km_flags & KM_MAYFAIL)
+ gfp |= __GFP_RETRY_MAYFAIL;
+
btp = kmem_zalloc(sizeof(*btp), KM_NOFS | km_flags);
if (!btp)
return NULL;
@@ -2085,7 +2089,7 @@ __xfs_alloc_buftarg(
if (list_lru_init(&btp->bt_lru))
goto error_free;
- if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL))
+ if (percpu_counter_init(&btp->bt_io_count, 0, gfp))
goto error_lru;
btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 2d006bf0f9ce..d1620ea1c70f 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -632,6 +632,7 @@ DEFINE_BUF_ITEM_EVENT(xfs_trans_read_buf);
DEFINE_BUF_ITEM_EVENT(xfs_trans_read_buf_recur);
DEFINE_BUF_ITEM_EVENT(xfs_trans_log_buf);
DEFINE_BUF_ITEM_EVENT(xfs_trans_brelse);
+DEFINE_BUF_ITEM_EVENT(xfs_trans_bdetach);
DEFINE_BUF_ITEM_EVENT(xfs_trans_bjoin);
DEFINE_BUF_ITEM_EVENT(xfs_trans_bhold);
DEFINE_BUF_ITEM_EVENT(xfs_trans_bhold_release);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index ae587101e167..a43d6465b9d4 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -219,6 +219,7 @@ struct xfs_buf *xfs_trans_getsb(struct xfs_trans *);
void xfs_trans_brelse(xfs_trans_t *, struct xfs_buf *);
void xfs_trans_bjoin(xfs_trans_t *, struct xfs_buf *);
+void xfs_trans_bdetach(struct xfs_trans *tp, struct xfs_buf *bp);
void xfs_trans_bhold(xfs_trans_t *, struct xfs_buf *);
void xfs_trans_bhold_release(xfs_trans_t *, struct xfs_buf *);
void xfs_trans_binval(xfs_trans_t *, struct xfs_buf *);
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 6549e50d852c..e28ab74af4f0 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -392,6 +392,48 @@ xfs_trans_brelse(
xfs_buf_relse(bp);
}
+/*
+ * Forcibly detach a buffer previously joined to the transaction. The caller
+ * will retain its locked reference to the buffer after this function returns.
+ * The buffer must be completely clean and must not be held to the transaction.
+ */
+void
+xfs_trans_bdetach(
+ struct xfs_trans *tp,
+ struct xfs_buf *bp)
+{
+ struct xfs_buf_log_item *bip = bp->b_log_item;
+
+ ASSERT(tp != NULL);
+ ASSERT(bp->b_transp == tp);
+ ASSERT(bip->bli_item.li_type == XFS_LI_BUF);
+ ASSERT(atomic_read(&bip->bli_refcount) > 0);
+
+ trace_xfs_trans_bdetach(bip);
+
+ /*
+ * Zero the recursion count, since we're removing this buffer from the
+ * transaction.
+ */
+ bip->bli_recur = 0;
+
+ /*
+ * The buffer must be completely clean. Specifically, it had better
+ * not be dirty, stale, logged, ordered, or held to the transaction.
+ */
+ ASSERT(!test_bit(XFS_LI_DIRTY, &bip->bli_item.li_flags));
+ ASSERT(!(bip->bli_flags & XFS_BLI_DIRTY));
+ ASSERT(!(bip->bli_flags & XFS_BLI_HOLD));
+ ASSERT(!(bip->bli_flags & XFS_BLI_LOGGED));
+ ASSERT(!(bip->bli_flags & XFS_BLI_ORDERED));
+ ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
+
+ /* Unlink the log item from the transaction and drop the log item. */
+ xfs_trans_del_item(&bip->bli_item);
+ xfs_buf_item_put(bip);
+ bp->b_transp = NULL;
+}
+
/*
* Mark the buffer as not needing to be unlocked when the buf item's
* iop_committing() routine is called. The buffer must already be locked
* [PATCHSET v24.0 00/21] xfs: atomic file updates
2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 01/21] vfs: introduce new file range exchange ioctl Darrick J. Wong
` (20 more replies)
2 siblings, 21 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
Hi all,
This series creates a new FIEXCHANGE_RANGE system call to exchange
ranges of bytes between two files atomically. This new functionality
enables data storage programs to stage and commit file updates such that
reader programs will see either the old contents or the new contents in
their entirety, with no chance of torn writes. A successful call
completion guarantees that the new contents will be seen even if the
system fails.
The ability to swap extent mappings between files in this manner is
critical to supporting online filesystem repair, which is built upon the
strategy of constructing a clean copy of a damaged structure and
committing the new structure into the metadata file atomically.
User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want to make, and exchanging the relevant ranges of the temp file
with the original file. If the updates are aligned with the file block
size, a new (since v2) flag provides for exchanging only the written
areas. Callers can arrange for the update to be rejected if the
original file has been changed.
The intent behind this new userspace functionality is to enable atomic
rewrites of arbitrary parts of individual files. For years, application
programmers wanting to ensure the atomicity of a file update had to
write the changes to a new file in the same directory, fsync the new
file, rename the new file on top of the old filename, and then fsync the
directory. People get it wrong all the time, and $fs hacks abound.
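To make that concrete, here's a rough userspace sketch of the new flow
with error handling elided; the directory argument and the helper name
are placeholders, and the flags are described in the manual page below:

#define _GNU_SOURCE	/* O_TMPFILE */
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <linux/fs.h>		/* FICLONE */
#include <linux/fiexchange.h>	/* new header from this series */

/* Atomically rewrite part of @fd; @dir must be on the same filesystem. */
static int atomic_rewrite(int fd, const char *dir, const void *buf,
			  size_t len, off_t off)
{
	struct file_xchg_range fxr;
	struct stat sb;
	int tmpfd;

	/* Snapshot the file's identity so concurrent writers are caught. */
	fstat(fd, &sb);

	tmpfd = open(dir, O_TMPFILE | O_RDWR, 0600);

	/* Share the original contents, then stage the update. */
	ioctl(tmpfd, FICLONE, fd);
	pwrite(tmpfd, buf, len, off);

	memset(&fxr, 0, sizeof(fxr));
	fxr.file1_fd = tmpfd;
	fxr.flags = FILE_XCHG_RANGE_TO_EOF | FILE_XCHG_RANGE_COMMIT;
	fxr.file2_ino = sb.st_ino;
	fxr.file2_mtime = sb.st_mtim.tv_sec;
	fxr.file2_mtime_nsec = sb.st_mtim.tv_nsec;
	fxr.file2_ctime = sb.st_ctim.tv_sec;
	fxr.file2_ctime_nsec = sb.st_ctim.tv_nsec;

	/* Fails with EBUSY if @fd changed after the fstat above. */
	return ioctl(fd, FIEXCHANGE_RANGE, &fxr);
}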
Here is the proposed manual page:
IOCTL-FIEXCHANGE_RANGE(2)  Linux Programmer's Manual  IOCTL-FIEXCHANGE_RANGE(2)
NAME
ioctl_fiexchange_range - exchange the contents of parts of two
files
SYNOPSIS
#include <sys/ioctl.h>
#include <linux/fiexchange.h>
int ioctl(int file2_fd, FIEXCHANGE_RANGE, struct
file_xchg_range *arg);
DESCRIPTION
Given a range of bytes in a first file file1_fd and a second
range of bytes in a second file file2_fd, this ioctl(2) ex‐
changes the contents of the two ranges.
Exchanges are atomic with regards to concurrent file opera‐
tions, so no userspace-level locks need to be taken to obtain
consistent results. Implementations must guarantee that read‐
ers see either the old contents or the new contents in their
entirety, even if the system fails.
The exchange parameters are conveyed in a structure of the fol‐
lowing form:
struct file_xchg_range {
__s64 file1_fd;
__s64 file1_offset;
__s64 file2_offset;
__s64 length;
__u64 flags;
__s64 file2_ino;
__s64 file2_mtime;
__s64 file2_ctime;
__s32 file2_mtime_nsec;
__s32 file2_ctime_nsec;
__u64 pad[6];
};
The field pad must be zero.
The fields file1_fd, file1_offset, and length define the first
range of bytes to be exchanged.
The fields file2_fd, file2_offset, and length define the second
range of bytes to be exchanged.
Both files must be from the same filesystem mount. If the two
file descriptors represent the same file, the byte ranges must
not overlap. Most disk-based filesystems require that the
starts of both ranges be aligned to the file block size.
If this is the case, the ends of the ranges must also be so
aligned unless the FILE_XCHG_RANGE_TO_EOF flag is set.
The field flags control the behavior of the exchange operation.
FILE_XCHG_RANGE_FILE2_FRESH
Check the freshness of file2_fd after locking the
file but before exchanging the contents. The sup‐
plied file2_ino field must match file2's inode num‐
ber, and the supplied file2_mtime, file2_mtime_nsec,
file2_ctime, and file2_ctime_nsec fields must match
the modification time and change time of file2. If
they do not match, EBUSY will be returned.
FILE_XCHG_RANGE_TO_EOF
Ignore the length parameter. All bytes in file1_fd
from file1_offset to EOF are moved to file2_fd, and
file2's size is set to (file2_offset+(file1_length-
file1_offset)). Meanwhile, all bytes in file2 from
file2_offset to EOF are moved to file1 and file1's
size is set to (file1_offset+(file2_length-
file2_offset)). This option is not compatible with
FILE_XCHG_RANGE_FULL_FILES.
FILE_XCHG_RANGE_FSYNC
Ensure that all modified in-core data in both file
ranges and all metadata updates pertaining to the
exchange operation are flushed to persistent storage
before the call returns. Opening either file de‐
scriptor with O_SYNC or O_DSYNC will have the same
effect.
FILE_XCHG_RANGE_SKIP_FILE1_HOLES
Skip sub-ranges of file1_fd that are known not to
contain data. This facility can be used to imple‐
ment atomic scatter-gather writes of any complexity
for software-defined storage targets.
FILE_XCHG_RANGE_DRY_RUN
Check the parameters and the feasibility of the op‐
eration, but do not change anything.
FILE_XCHG_RANGE_COMMIT
This flag is a combination of
FILE_XCHG_RANGE_FILE2_FRESH | FILE_XCHG_RANGE_FSYNC
and can be used to commit changes to file2_fd to
persistent storage if and only if file2 has not
changed.
FILE_XCHG_RANGE_FULL_FILES
Require that file1_offset and file2_offset are zero,
and that the length field matches the lengths of
both files. If not, EDOM will be returned. This
option is not compatible with
FILE_XCHG_RANGE_TO_EOF.
FILE_XCHG_RANGE_NONATOMIC
This flag relaxes the requirement that readers see
only the old contents or the new contents in their
entirety. If the system fails before all modified
in-core data and metadata updates are persisted to
disk, the contents of both file ranges after recov‐
ery are not defined and may be a mix of both.
Do not use this flag unless the contents of both
ranges are known to be identical and there are no
other writers.
RETURN VALUE
On error, -1 is returned, and errno is set to indicate the er‐
ror.
ERRORS
Error codes can be one of, but are not limited to, the follow‐
ing:
EBADF file1_fd is not open for reading and writing or is open
for append-only writes; or file2_fd is not open for
reading and writing or is open for append-only writes.
EBUSY The inode number and timestamps supplied do not match
file2_fd and FILE_XCHG_RANGE_FILE2_FRESH was set in
flags.
EDOM The ranges do not cover the entirety of both files, and
FILE_XCHG_RANGE_FULL_FILES was set in flags.
EINVAL The parameters are not correct for these files. This
error can also appear if either file descriptor repre‐
sents a device, FIFO, or socket. Disk filesystems gen‐
erally require the offset and length arguments to be
aligned to the fundamental block sizes of both files.
EIO An I/O error occurred.
EISDIR One of the files is a directory.
ENOMEM The kernel was unable to allocate sufficient memory to
perform the operation.
ENOSPC There is not enough free space in the filesystem to
exchange the contents safely.
EOPNOTSUPP
The filesystem does not support exchanging bytes between
the two files.
EPERM file1_fd or file2_fd are immutable.
ETXTBSY
One of the files is a swap file.
EUCLEAN
The filesystem is corrupt.
EXDEV file1_fd and file2_fd are not on the same mounted
filesystem.
CONFORMING TO
This API is Linux-specific.
USE CASES
Three use cases are imagined for this system call.
The first is a filesystem defragmenter, which copies the con‐
tents of a file into another file and wishes to exchange the
space mappings of the two files, provided that the original
file has not changed. The flags NONATOMIC and FILE2_FRESH are
recommended for this application.
The second is a data storage program that wants to commit non-
contiguous updates to a file atomically. This can be done by
creating a temporary file, calling FICLONE(2) to share the con‐
tents, and staging the updates into the temporary file. Either
of the FULL_FILES or TO_EOF flags are recommended, along with
FSYNC. Depending on the application's locking design, the
flags FILE2_FRESH or COMMIT may be applicable here. The tempo‐
rary file can be deleted or punched out afterwards.
The third is a software-defined storage host (e.g. a disk juke‐
box) which implements an atomic scatter-gather write command.
Provided the exported disk's logical block size matches the
file's allocation unit size, this can be done by creating a
temporary file and writing the data at the appropriate offsets.
Use this call with the SKIP_HOLES flag to exchange only the
blocks involved in the write command. The use of the FSYNC
flag is recommended here. The temporary file should be deleted
or punched out completely before being reused to stage another
write.
NOTES
Some filesystems may limit the amount of data or the number of
extents that can be exchanged in a single call.
SEE ALSO
ioctl(2)
Linux 2022-12-31 IOCTL-FIEXCHANGE_RANGE(2)
The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down. Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.
Note that this function is /not/ the O_DIRECT atomic file writes concept
that has also been floating around for years. This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.
As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata. The atomic file swap is
implemented as an atomic inode fork swap, which means that we can
implement online reconstruction of extended attributes and directories
by building a new one in another inode and atomically swapping the contents.
Subsequent patchsets adapt the online filesystem repair code to use
atomic extent swapping. This enables repair functions to construct a
clean copy of a directory, xattr information, symbolic links, realtime
bitmaps, and realtime summary information in a temporary inode. If this
completes successfully, the new contents can be swapped atomically into
the inode being repaired. This is essential to avoid making corruption
problems worse if the system goes down in the middle of running repair.
This patchset also ports the old XFS extent swap ioctl interface to use
the new extent swap code.
For userspace, this series also includes the userspace pieces needed to
test the new functionality, and a sample implementation of atomic file
updates.
Question: Should we really bother with fsdevel bikeshedding? Most
filesystems cannot support this functionality, so we could keep it
private to XFS for now.
If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.
This is an extraordinary way to destroy everything. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-file-updates
fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=atomic-file-updates
xfsdocs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-documentation.git/log/?h=atomic-file-updates
---
Documentation/filesystems/vfs.rst | 16
fs/ioctl.c | 27 +
fs/remap_range.c | 296 ++++++++
fs/xfs/Makefile | 3
fs/xfs/libxfs/xfs_bmap.h | 4
fs/xfs/libxfs/xfs_defer.c | 7
fs/xfs/libxfs/xfs_defer.h | 3
fs/xfs/libxfs/xfs_errortag.h | 4
fs/xfs/libxfs/xfs_format.h | 15
fs/xfs/libxfs/xfs_fs.h | 2
fs/xfs/libxfs/xfs_log_format.h | 80 ++
fs/xfs/libxfs/xfs_log_recover.h | 2
fs/xfs/libxfs/xfs_sb.c | 3
fs/xfs/libxfs/xfs_swapext.c | 1258 ++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_swapext.h | 170 +++++
fs/xfs/libxfs/xfs_symlink_remote.c | 47 +
fs/xfs/libxfs/xfs_symlink_remote.h | 1
fs/xfs/libxfs/xfs_trans_space.h | 4
fs/xfs/xfs_bmap_util.c | 620 ------------------
fs/xfs/xfs_bmap_util.h | 3
fs/xfs/xfs_error.c | 3
fs/xfs/xfs_file.c | 94 ++-
fs/xfs/xfs_inode.c | 13
fs/xfs/xfs_inode.h | 6
fs/xfs/xfs_ioctl.c | 102 +--
fs/xfs/xfs_ioctl.h | 4
fs/xfs/xfs_ioctl32.c | 11
fs/xfs/xfs_linux.h | 5
fs/xfs/xfs_log.c | 47 +
fs/xfs/xfs_log.h | 10
fs/xfs/xfs_log_priv.h | 3
fs/xfs/xfs_log_recover.c | 5
fs/xfs/xfs_mount.c | 11
fs/xfs/xfs_mount.h | 7
fs/xfs/xfs_rtalloc.c | 136 ++++
fs/xfs/xfs_rtalloc.h | 3
fs/xfs/xfs_super.c | 19 +
fs/xfs/xfs_swapext_item.c | 657 +++++++++++++++++++
fs/xfs/xfs_swapext_item.h | 56 ++
fs/xfs/xfs_symlink.c | 49 -
fs/xfs/xfs_trace.c | 2
fs/xfs/xfs_trace.h | 351 ++++++++++
fs/xfs/xfs_xattr.c | 6
fs/xfs/xfs_xchgrange.c | 964 ++++++++++++++++++++++++++++
fs/xfs/xfs_xchgrange.h | 40 +
include/linux/fs.h | 14
include/uapi/linux/fiexchange.h | 101 +++
47 files changed, 4473 insertions(+), 811 deletions(-)
create mode 100644 fs/xfs/libxfs/xfs_swapext.c
create mode 100644 fs/xfs/libxfs/xfs_swapext.h
create mode 100644 fs/xfs/xfs_swapext_item.c
create mode 100644 fs/xfs/xfs_swapext_item.h
create mode 100644 fs/xfs/xfs_xchgrange.c
create mode 100644 fs/xfs/xfs_xchgrange.h
create mode 100644 include/uapi/linux/fiexchange.h
* [PATCH 01/21] vfs: introduce new file range exchange ioctl
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 04/21] xfs: parameterize all the incompat log feature helpers Darrick J. Wong
` (19 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Introduce a new ioctl to handle swapping ranges of bytes between files.
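For illustration, a filesystem might wire the new method up roughly like
this; everything named example_* is hypothetical, the locking is elided,
and only the generic_* helpers and the file_operations hook below are
actually part of this patch:

/* Hypothetical implementation sketch; inode locking is elided. */
static int
example_xchg_file_range(
	struct file			*file1,
	struct file			*file2,
	struct file_xchg_range		*fxr)
{
	struct inode			*inode2 = file_inode(file2);
	int				error;

	/* ...lock both inodes against other modifications here... */

	/* Validate the request and flush dirty pagecache. */
	error = generic_xchg_file_range_prep(file1, file2, fxr,
			i_blocksize(inode2));
	if (error || fxr->length == 0)
		goto out_unlock;

	/* With the locks held, abort if file2 is no longer fresh. */
	error = generic_xchg_file_range_check_fresh(inode2, fxr);
	if (error)
		goto out_unlock;

	/* ...exchange mappings and update both timestamps atomically... */

	error = generic_xchg_file_range_finish(file1, file2);
out_unlock:
	/* ...unlock both inodes... */
	return error;
}

const struct file_operations example_file_operations = {
	/* ... */
	.xchg_file_range	= example_xchg_file_range,
};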
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
Documentation/filesystems/vfs.rst | 16 ++
fs/ioctl.c | 27 +++
fs/remap_range.c | 296 +++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_fs.h | 1
include/linux/fs.h | 14 ++
include/uapi/linux/fiexchange.h | 101 +++++++++++++
6 files changed, 454 insertions(+), 1 deletion(-)
create mode 100644 include/uapi/linux/fiexchange.h
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 2c15e7053113..cae6dd3a8a0b 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -1036,6 +1036,8 @@ This describes how the VFS can manipulate an open file. As of kernel
loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
+ int (*xchg_file_range)(struct file *file1, struct file *file2,
+ struct file_xchg_range *fxr);
int (*fadvise)(struct file *, loff_t, loff_t, int);
};
@@ -1154,6 +1156,20 @@ otherwise noted.
ok with the implementation shortening the request length to
satisfy alignment or EOF requirements (or any other reason).
+``xchg_file_range``
+ called by the ioctl(2) system call for FIEXCHANGE_RANGE to exchange the
+ contents of two file ranges. An implementation should exchange
+ fxr.length bytes starting at fxr.file1_offset in file1 with the same
+ number of bytes starting at fxr.file2_offset in file2. Refer to
+ the fiexchange.h header for more information. Implementations must call
+ generic_xchg_file_range_prep to prepare the two files prior to taking
+ locks; they must call generic_xchg_file_range_check_fresh once the
+ inode is locked to abort the call if file2 has changed; and they must
+ update the inode change and mod times of both files as part of the
+ metadata update. The timestamp updates must be done atomically as part
+ of the data exchange operation to ensure correctness of the freshness
+ check.
+
``fadvise``
possibly called by the fadvise64() system call.
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 80ac36aea913..bd636daf8e90 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -259,6 +259,30 @@ static long ioctl_file_clone_range(struct file *file,
args.src_length, args.dest_offset);
}
+static long ioctl_file_xchg_range(struct file *file2,
+ struct file_xchg_range __user *argp)
+{
+ struct file_xchg_range args;
+ struct fd file1;
+ int ret;
+
+ if (copy_from_user(&args, argp, sizeof(args)))
+ return -EFAULT;
+
+ file1 = fdget(args.file1_fd);
+ if (!file1.file)
+ return -EBADF;
+
+ ret = -EXDEV;
+ if (file1.file->f_path.mnt != file2->f_path.mnt)
+ goto fdput;
+
+ ret = vfs_xchg_file_range(file1.file, file2, &args);
+fdput:
+ fdput(file1);
+ return ret;
+}
+
/*
* This provides compatibility with legacy XFS pre-allocation ioctls
* which predate the fallocate syscall.
@@ -825,6 +849,9 @@ static int do_vfs_ioctl(struct file *filp, unsigned int fd,
case FIDEDUPERANGE:
return ioctl_file_dedupe_range(filp, argp);
+ case FIEXCHANGE_RANGE:
+ return ioctl_file_xchg_range(filp, argp);
+
case FIONREAD:
if (!S_ISREG(inode->i_mode))
return vfs_ioctl(filp, cmd, arg);
diff --git a/fs/remap_range.c b/fs/remap_range.c
index 41f60477bb41..469d53fb42e9 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -567,3 +567,299 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
return ret;
}
EXPORT_SYMBOL(vfs_dedupe_file_range);
+
+/* Performs necessary checks before doing a range exchange. */
+static int generic_xchg_file_range_checks(struct file *file1,
+ struct file *file2,
+ struct file_xchg_range *fxr,
+ unsigned int blocksize)
+{
+ struct inode *inode1 = file1->f_mapping->host;
+ struct inode *inode2 = file2->f_mapping->host;
+ uint64_t blkmask = blocksize - 1;
+ int64_t test_len;
+ uint64_t blen;
+ loff_t size1, size2;
+ int ret;
+
+ /* Don't touch certain kinds of inodes */
+ if (IS_IMMUTABLE(inode1) || IS_IMMUTABLE(inode2))
+ return -EPERM;
+ if (IS_SWAPFILE(inode1) || IS_SWAPFILE(inode2))
+ return -ETXTBSY;
+
+ size1 = i_size_read(inode1);
+ size2 = i_size_read(inode2);
+
+ /* Ranges cannot start after EOF. */
+ if (fxr->file1_offset > size1 || fxr->file2_offset > size2)
+ return -EINVAL;
+
+ /*
+ * If the caller asked for full files, check that the offset/length
+ * values cover all of both files.
+ */
+ if ((fxr->flags & FILE_XCHG_RANGE_FULL_FILES) &&
+ (fxr->file1_offset != 0 || fxr->file2_offset != 0 ||
+ fxr->length != size1 || fxr->length != size2))
+ return -EDOM;
+
+ /*
+ * If the caller said to exchange to EOF, we set the length of the
+ * request large enough to cover everything to the end of both files.
+ */
+ if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
+ fxr->length = max_t(int64_t, size1 - fxr->file1_offset,
+ size2 - fxr->file2_offset);
+
+ /* The start of both ranges must be aligned to an fs block. */
+ if (!IS_ALIGNED(fxr->file1_offset, blocksize) ||
+ !IS_ALIGNED(fxr->file2_offset, blocksize))
+ return -EINVAL;
+
+ /* Ensure offsets don't wrap. */
+ if (fxr->file1_offset + fxr->length < fxr->file1_offset ||
+ fxr->file2_offset + fxr->length < fxr->file2_offset)
+ return -EINVAL;
+
+ /*
+ * We require both ranges to be within EOF, unless we're exchanging
+ * to EOF. The start-of-range checks above already confirmed that both
+ * fxr->file1_offset and fxr->file2_offset are within EOF.
+ */
+ if (!(fxr->flags & FILE_XCHG_RANGE_TO_EOF) &&
+ (fxr->file1_offset + fxr->length > size1 ||
+ fxr->file2_offset + fxr->length > size2))
+ return -EINVAL;
+
+ /*
+ * Make sure we don't hit any file size limits. If we hit any size
+ * limits such that test_length was adjusted, we abort the whole
+ * operation.
+ */
+ test_len = fxr->length;
+ ret = generic_write_check_limits(file2, fxr->file2_offset, &test_len);
+ if (ret)
+ return ret;
+ ret = generic_write_check_limits(file1, fxr->file1_offset, &test_len);
+ if (ret)
+ return ret;
+ if (test_len != fxr->length)
+ return -EINVAL;
+
+ /*
+ * If the user wanted us to exchange up to the infile's EOF, round up
+ * to the next block boundary for this check. Do the same for the
+ * outfile.
+ *
+ * Otherwise, reject the range length if it's not block aligned. We
+ * already confirmed the starting offsets' block alignment.
+ */
+ if (fxr->file1_offset + fxr->length == size1)
+ blen = ALIGN(size1, blocksize) - fxr->file1_offset;
+ else if (fxr->file2_offset + fxr->length == size2)
+ blen = ALIGN(size2, blocksize) - fxr->file2_offset;
+ else if (!IS_ALIGNED(fxr->length, blocksize))
+ return -EINVAL;
+ else
+ blen = fxr->length;
+
+ /* Don't allow overlapped exchanges within the same file. */
+ if (inode1 == inode2 &&
+ fxr->file2_offset + blen > fxr->file1_offset &&
+ fxr->file1_offset + blen > fxr->file2_offset)
+ return -EINVAL;
+
+ /* If we already failed the freshness check, we're done. */
+ ret = generic_xchg_file_range_check_fresh(inode2, fxr);
+ if (ret)
+ return ret;
+
+ /*
+ * Ensure that we don't exchange a partial EOF block into the middle of
+ * another file.
+ */
+ if ((fxr->length & blkmask) == 0)
+ return 0;
+
+ blen = fxr->length;
+ if (fxr->file2_offset + blen < size2)
+ blen &= ~blkmask;
+
+ if (fxr->file1_offset + blen < size1)
+ blen &= ~blkmask;
+
+ return blen == fxr->length ? 0 : -EINVAL;
+}
+
+/*
+ * Check that the two inodes are eligible for range exchanges, the ranges make
+ * sense, and then flush all dirty data. Caller must ensure that the inodes
+ * have been locked against any other modifications.
+ */
+int generic_xchg_file_range_prep(struct file *file1, struct file *file2,
+ struct file_xchg_range *fxr,
+ unsigned int blocksize)
+{
+ struct inode *inode1 = file_inode(file1);
+ struct inode *inode2 = file_inode(file2);
+ bool same_inode = (inode1 == inode2);
+ int ret;
+
+ /* Check that we don't violate system file offset limits. */
+ ret = generic_xchg_file_range_checks(file1, file2, fxr, blocksize);
+ if (ret || fxr->length == 0)
+ return ret;
+
+ /* Wait for the completion of any pending IOs on both files */
+ inode_dio_wait(inode1);
+ if (!same_inode)
+ inode_dio_wait(inode2);
+
+ ret = filemap_write_and_wait_range(inode1->i_mapping, fxr->file1_offset,
+ fxr->file1_offset + fxr->length - 1);
+ if (ret)
+ return ret;
+
+ ret = filemap_write_and_wait_range(inode2->i_mapping, fxr->file2_offset,
+ fxr->file2_offset + fxr->length - 1);
+ if (ret)
+ return ret;
+
+ /*
+ * If the files or inodes involved require synchronous writes, amend
+ * the request to force the filesystem to flush all data and metadata
+ * to disk after the operation completes.
+ */
+ if (((file1->f_flags | file2->f_flags) & (__O_SYNC | O_DSYNC)) ||
+ IS_SYNC(inode1) || IS_SYNC(inode2))
+ fxr->flags |= FILE_XCHG_RANGE_FSYNC;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(generic_xchg_file_range_prep);
+
+/*
+ * Finish a range exchange operation, if it was successful. Caller must ensure
+ * that the inodes are still locked against any other modifications.
+ */
+int generic_xchg_file_range_finish(struct file *file1, struct file *file2)
+{
+ int ret;
+
+ ret = file_remove_privs(file1);
+ if (ret)
+ return ret;
+ if (file_inode(file1) == file_inode(file2))
+ return 0;
+
+ return file_remove_privs(file2);
+}
+EXPORT_SYMBOL_GPL(generic_xchg_file_range_finish);
+
+/*
+ * Check that both files' metadata agree with the snapshot that we took for
+ * the range exchange request.
+ *
+ * This should be called after the filesystem has locked /all/ inode metadata
+ * against modification.
+ */
+int generic_xchg_file_range_check_fresh(struct inode *inode2,
+ const struct file_xchg_range *fxr)
+{
+ /* Check that file2 hasn't otherwise been modified. */
+ if ((fxr->flags & FILE_XCHG_RANGE_FILE2_FRESH) &&
+ (fxr->file2_ino != inode2->i_ino ||
+ fxr->file2_ctime != inode2->i_ctime.tv_sec ||
+ fxr->file2_ctime_nsec != inode2->i_ctime.tv_nsec ||
+ fxr->file2_mtime != inode2->i_mtime.tv_sec ||
+ fxr->file2_mtime_nsec != inode2->i_mtime.tv_nsec))
+ return -EBUSY;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(generic_xchg_file_range_check_fresh);
+
+static inline int xchg_range_verify_area(struct file *file, loff_t pos,
+ struct file_xchg_range *fxr)
+{
+ int64_t len = fxr->length;
+
+ if (pos < 0)
+ return -EINVAL;
+
+ if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
+ len = min_t(int64_t, len, i_size_read(file_inode(file)) - pos);
+ return remap_verify_area(file, pos, len, true);
+}
+
+int do_xchg_file_range(struct file *file1, struct file *file2,
+ struct file_xchg_range *fxr)
+{
+ struct inode *inode1 = file_inode(file1);
+ struct inode *inode2 = file_inode(file2);
+ int ret;
+
+ if ((fxr->flags & ~FILE_XCHG_RANGE_ALL_FLAGS) ||
+ memchr_inv(&fxr->pad, 0, sizeof(fxr->pad)))
+ return -EINVAL;
+
+ if ((fxr->flags & FILE_XCHG_RANGE_FULL_FILES) &&
+ (fxr->flags & FILE_XCHG_RANGE_TO_EOF))
+ return -EINVAL;
+
+ /*
+ * The ioctl enforces that src and dest files are on the same mount.
+ * Practically, they only need to be on the same file system.
+ */
+ if (inode1->i_sb != inode2->i_sb)
+ return -EXDEV;
+
+ /* This only works for regular files. */
+ if (S_ISDIR(inode1->i_mode) || S_ISDIR(inode2->i_mode))
+ return -EISDIR;
+ if (!S_ISREG(inode1->i_mode) || !S_ISREG(inode2->i_mode))
+ return -EINVAL;
+
+ ret = generic_file_rw_checks(file1, file2);
+ if (ret < 0)
+ return ret;
+
+ ret = generic_file_rw_checks(file2, file1);
+ if (ret < 0)
+ return ret;
+
+ if (!file1->f_op->xchg_file_range)
+ return -EOPNOTSUPP;
+
+ ret = xchg_range_verify_area(file1, fxr->file1_offset, fxr);
+ if (ret)
+ return ret;
+
+ ret = xchg_range_verify_area(file2, fxr->file2_offset, fxr);
+ if (ret)
+ return ret;
+
+ ret = file2->f_op->xchg_file_range(file1, file2, fxr);
+ if (ret)
+ return ret;
+
+ fsnotify_modify(file1);
+ if (file2 != file1)
+ fsnotify_modify(file2);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(do_xchg_file_range);
+
+int vfs_xchg_file_range(struct file *file1, struct file *file2,
+ struct file_xchg_range *fxr)
+{
+ int ret;
+
+ file_start_write(file2);
+ ret = do_xchg_file_range(file1, file2, fxr);
+ file_end_write(file2);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfs_xchg_file_range);
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 400cf68e551e..210c17f5a16c 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -841,6 +841,7 @@ struct xfs_scrub_metadata {
#define XFS_IOC_FSGEOMETRY _IOR ('X', 126, struct xfs_fsop_geom)
#define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
#define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
+/* FIEXCHANGE_RANGE ----------- hoisted 129 */
/* XFS_IOC_GETFSUUID ---------- deprecated 140 */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 066555ad1bf8..cd86ac22c339 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -46,6 +46,7 @@
#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
+#include <uapi/linux/fiexchange.h>
struct backing_dev_info;
struct bdi_writeback;
@@ -2125,6 +2126,8 @@ struct file_operations {
loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
+ int (*xchg_file_range)(struct file *file1, struct file *file2,
+ struct file_xchg_range *fsr);
int (*fadvise)(struct file *, loff_t, loff_t, int);
int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
@@ -2205,6 +2208,10 @@ int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *count, unsigned int remap_flags);
+int generic_xchg_file_range_prep(struct file *file1, struct file *file2,
+ struct file_xchg_range *fsr,
+ unsigned int blocksize);
+int generic_xchg_file_range_finish(struct file *file1, struct file *file2);
extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
@@ -2216,7 +2223,12 @@ extern int vfs_dedupe_file_range(struct file *file,
extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
struct file *dst_file, loff_t dst_pos,
loff_t len, unsigned int remap_flags);
-
+extern int do_xchg_file_range(struct file *file1, struct file *file2,
+ struct file_xchg_range *fsr);
+extern int vfs_xchg_file_range(struct file *file1, struct file *file2,
+ struct file_xchg_range *fsr);
+extern int generic_xchg_file_range_check_fresh(struct inode *inode2,
+ const struct file_xchg_range *fsr);
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/uapi/linux/fiexchange.h b/include/uapi/linux/fiexchange.h
new file mode 100644
index 000000000000..72bc228d4141
--- /dev/null
+++ b/include/uapi/linux/fiexchange.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later WITH Linux-syscall-note */
+/*
+ * FIEXCHANGE_RANGE ioctl definitions, to facilitate exchanging parts of files.
+ *
+ * Copyright (C) 2022 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef _LINUX_FIEXCHANGE_H
+#define _LINUX_FIEXCHANGE_H
+
+#include <linux/types.h>
+
+/*
+ * Exchange part of file1 with part of the file that this ioctl is being
+ * called against (which we'll call file2). Filesystems must be able to
+ * restart and complete the operation even after the system goes down.
+ */
+struct file_xchg_range {
+ __s64 file1_fd;
+ __s64 file1_offset; /* file1 offset, bytes */
+ __s64 file2_offset; /* file2 offset, bytes */
+ __u64 length; /* bytes to exchange */
+
+ __u64 flags; /* see FILE_XCHG_RANGE_* below */
+
+ /* file2 metadata for optional freshness checks */
+ __s64 file2_ino; /* inode number */
+ __s64 file2_mtime; /* modification time */
+ __s64 file2_ctime; /* change time */
+ __s32 file2_mtime_nsec; /* mod time, nsec */
+ __s32 file2_ctime_nsec; /* change time, nsec */
+
+ __u64 pad[6]; /* must be zeroes */
+};
+
+/*
+ * Atomic exchange operations are not required. This relaxes the requirement
+ * that the filesystem must be able to complete the operation after a crash.
+ */
+#define FILE_XCHG_RANGE_NONATOMIC (1 << 0)
+
+/*
+ * Check file2's inode number, mtime, and ctime against the values
+ * provided, and return -EBUSY if there isn't an exact match.
+ */
+#define FILE_XCHG_RANGE_FILE2_FRESH (1 << 1)
+
+/*
+ * Check that file1's length is equal to file1_offset + length, and that
+ * file2's length is equal to file2_offset + length. Return -EDOM if there
+ * isn't an exact match.
+ */
+#define FILE_XCHG_RANGE_FULL_FILES (1 << 2)
+
+/*
+ * Exchange file data all the way to the ends of both files, and then exchange
+ * the file sizes. This flag can be used to replace a file's contents with a
+ * different amount of data. length will be ignored.
+ */
+#define FILE_XCHG_RANGE_TO_EOF (1 << 3)
+
+/* Flush all changes in file data and file metadata to disk before returning. */
+#define FILE_XCHG_RANGE_FSYNC (1 << 4)
+
+/* Dry run; do all the parameter verification but do not change anything. */
+#define FILE_XCHG_RANGE_DRY_RUN (1 << 5)
+
+/*
+ * Do not exchange any part of the range where file1's mapping is a hole. This
+ * can be used to emulate scatter-gather atomic writes with a temp file.
+ */
+#define FILE_XCHG_RANGE_SKIP_FILE1_HOLES (1 << 6)
+
+/*
+ * Commit the contents of file1 into file2 if file2 has the same inode number,
+ * mtime, and ctime as the arguments provided to the call. The old contents of
+ * file2 will be moved to file1.
+ *
+ * With this flag, all committed information can be retrieved even if the
+ * system crashes or is rebooted. This includes writing through or flushing a
+ * disk cache if present. The call blocks until the device reports that the
+ * commit is complete.
+ *
+ * This flag should not be combined with NONATOMIC. It can be combined with
+ * SKIP_FILE1_HOLES.
+ */
+#define FILE_XCHG_RANGE_COMMIT (FILE_XCHG_RANGE_FILE2_FRESH | \
+ FILE_XCHG_RANGE_FSYNC)
+
+#define FILE_XCHG_RANGE_ALL_FLAGS (FILE_XCHG_RANGE_NONATOMIC | \
+ FILE_XCHG_RANGE_FILE2_FRESH | \
+ FILE_XCHG_RANGE_FULL_FILES | \
+ FILE_XCHG_RANGE_TO_EOF | \
+ FILE_XCHG_RANGE_FSYNC | \
+ FILE_XCHG_RANGE_DRY_RUN | \
+ FILE_XCHG_RANGE_SKIP_FILE1_HOLES)
+
+#define FIEXCHANGE_RANGE _IOWR('X', 129, struct file_xchg_range)
+
+#endif /* _LINUX_FIEXCHANGE_H */
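For reference, here is a minimal userspace sketch of driving the new
ioctl; the helper name and the particular flag combination are
illustrative, not part of the patch:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/fiexchange.h>

    /* Swap the entire contents of two files, flushing both to disk. */
    int swap_files(int fd1, int fd2)
    {
        struct file_xchg_range fxr;

        memset(&fxr, 0, sizeof(fxr));
        fxr.file1_fd = fd1;
        fxr.file1_offset = 0;
        fxr.file2_offset = 0;
        /* length is ignored with TO_EOF; file sizes are swapped too */
        fxr.flags = FILE_XCHG_RANGE_TO_EOF | FILE_XCHG_RANGE_FSYNC;

        /* Note that the ioctl is issued against file2. */
        return ioctl(fd2, FIEXCHANGE_RANGE, &fxr);
    }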
* [PATCH 02/21] xfs: create a new helper to return a file's allocation unit
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
2022-12-30 22:13 ` [PATCH 01/21] vfs: introduce new file range exchange ioctl Darrick J. Wong
2022-12-30 22:13 ` [PATCH 04/21] xfs: parameterize all the incompat log feature helpers Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 05/21] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
` (17 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Create a new helper function to calculate the fundamental allocation
unit (i.e. the smallest unit of space we can allocate) of a file.
Things are going to get hairy with range-exchange on the realtime
device, so prepare for this now.
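To make "fundamental allocation unit" concrete, a worked example with
hypothetical geometry (the numbers are not taken from this patch):

    /*
     * Assume 4096-byte blocks and sb_rextsize == 3.
     *
     *   data-device inode: xfs_inode_alloc_unitsize(ip) == 4096
     *   realtime inode:    xfs_inode_alloc_unitsize(ip) == 3 * 4096
     *
     * 12288 is not a power of two, so alignment checks on such a
     * realtime file cannot use a bitmask and must fall back to
     * division, as xfs_is_falloc_aligned() does below.
     */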
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_file.c | 26 +++++++++-----------------
fs/xfs/xfs_inode.c | 13 +++++++++++++
fs/xfs/xfs_inode.h | 1 +
3 files changed, 23 insertions(+), 17 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c4bdadd8fa71..b382380656d7 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -44,27 +44,19 @@ xfs_is_falloc_aligned(
loff_t pos,
long long int len)
{
- struct xfs_mount *mp = ip->i_mount;
- uint64_t mask;
+ unsigned int alloc_unit = xfs_inode_alloc_unitsize(ip);
- if (XFS_IS_REALTIME_INODE(ip)) {
- if (!is_power_of_2(mp->m_sb.sb_rextsize)) {
- u64 rextbytes;
- u32 mod;
+ if (XFS_IS_REALTIME_INODE(ip) && !is_power_of_2(alloc_unit)) {
+ u32 mod;
- rextbytes = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
- div_u64_rem(pos, rextbytes, &mod);
- if (mod)
- return false;
- div_u64_rem(len, rextbytes, &mod);
- return mod == 0;
- }
- mask = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize) - 1;
- } else {
- mask = mp->m_sb.sb_blocksize - 1;
+ div_u64_rem(pos, alloc_unit, &mod);
+ if (mod)
+ return false;
+ div_u64_rem(len, alloc_unit, &mod);
+ return mod == 0;
}
- return !((pos | len) & mask);
+ return !((pos | len) & (alloc_unit - 1));
}
/*
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b082222a9061..04ceafb936bc 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3795,3 +3795,16 @@ xfs_inode_count_blocks(
xfs_bmap_count_leaves(ifp, rblocks);
*dblocks = ip->i_nblocks - *rblocks;
}
+
+/* Returns the size of fundamental allocation unit for a file, in bytes. */
+unsigned int
+xfs_inode_alloc_unitsize(
+ struct xfs_inode *ip)
+{
+ unsigned int blocks = 1;
+
+ if (XFS_IS_REALTIME_INODE(ip))
+ blocks = ip->i_mount->m_sb.sb_rextsize;
+
+ return XFS_FSB_TO_B(ip->i_mount, blocks);
+}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 926e4dd566d0..4b01d078ace2 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -577,6 +577,7 @@ void xfs_iunlock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
xfs_filblks_t *dblocks, xfs_filblks_t *rblocks);
+unsigned int xfs_inode_alloc_unitsize(struct xfs_inode *ip);
/*
* Parameters for tracking bumplink and droplink operations. The hook
* [PATCH 03/21] xfs: refactor non-power-of-two alignment checks
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (3 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 05/21] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 09/21] xfs: add a ->xchg_file_range handler Darrick J. Wong
` (15 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Create a helper function that can compute if a 64-bit number is an
integer multiple of a 32-bit number, where the 32-bit number is not
required to be an even power of two. This is needed for some new code
for the realtime device, where we can set 37k allocation units and then
have to remap them.
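A worked example with a hypothetical 37k (37888-byte) allocation unit:

    isaligned_64(75776, 37888);	/* true:  75776 == 2 * 37888 */
    isaligned_64(65536, 37888);	/* false: remainder is 27648 */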
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_file.c | 12 +++---------
fs/xfs/xfs_linux.h | 5 +++++
2 files changed, 8 insertions(+), 9 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b382380656d7..78323574021c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -46,15 +46,9 @@ xfs_is_falloc_aligned(
{
unsigned int alloc_unit = xfs_inode_alloc_unitsize(ip);
- if (XFS_IS_REALTIME_INODE(ip) && !is_power_of_2(alloc_unit)) {
- u32 mod;
-
- div_u64_rem(pos, alloc_unit, &mod);
- if (mod)
- return false;
- div_u64_rem(len, alloc_unit, &mod);
- return mod == 0;
- }
+ if (XFS_IS_REALTIME_INODE(ip) && !is_power_of_2(alloc_unit))
+ return isaligned_64(pos, alloc_unit) &&
+ isaligned_64(len, alloc_unit);
return !((pos | len) & (alloc_unit - 1));
}
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 3847719c3026..7e9bf03c80a3 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -197,6 +197,11 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
return x;
}
+static inline bool isaligned_64(uint64_t x, uint32_t y)
+{
+ return do_div(x, y) == 0;
+}
+
int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
char *data, enum req_op op);
* [PATCH 04/21] xfs: parameterize all the incompat log feature helpers
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
2022-12-30 22:13 ` [PATCH 01/21] vfs: introduce new file range exchange ioctl Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 02/21] xfs: create a new helper to return a file's allocation unit Darrick J. Wong
` (18 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
We're about to define a new XFS_SB_FEAT_INCOMPAT_LOG_ bit, which means
that callers will soon require the ability to toggle on and off
different log incompat feature bits. Parameterize the
xlog_{use,drop}_incompat_feat and xfs_sb_remove_incompat_log_features
functions so that callers can specify which feature they're trying to
use and so that we can clear individual log incompat bits as needed.
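The resulting calling convention is feature-scoped; a condensed sketch
of the xattr usage from the hunks below:

    /* Pin the xattr log incompat bit so an idle log cannot clear it. */
    xlog_use_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_XATTRS);

    /* ... perform logged xattr updates ... */

    /* Release it; the log may clear the bit once it is covered. */
    xlog_drop_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_XATTRS);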
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_format.h | 5 +++--
fs/xfs/xfs_log.c | 34 +++++++++++++++++++++++++---------
fs/xfs/xfs_log.h | 9 ++++++---
fs/xfs/xfs_log_priv.h | 2 +-
fs/xfs/xfs_log_recover.c | 3 ++-
fs/xfs/xfs_mount.c | 11 +++++------
fs/xfs/xfs_mount.h | 2 +-
fs/xfs/xfs_xattr.c | 6 +++---
8 files changed, 46 insertions(+), 26 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 5ba2dae7aa2f..817adb36cb1e 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -404,9 +404,10 @@ xfs_sb_has_incompat_log_feature(
static inline void
xfs_sb_remove_incompat_log_features(
- struct xfs_sb *sbp)
+ struct xfs_sb *sbp,
+ uint32_t feature)
{
- sbp->sb_features_log_incompat &= ~XFS_SB_FEAT_INCOMPAT_LOG_ALL;
+ sbp->sb_features_log_incompat &= ~feature;
}
static inline void
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index b32a8e57f576..a0ef09addc84 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1082,7 +1082,7 @@ xfs_log_quiesce(
* failures, though it's not fatal to have a higher log feature
* protection level than the log contents actually require.
*/
- if (xfs_clear_incompat_log_features(mp)) {
+ if (xfs_clear_incompat_log_features(mp, XFS_SB_FEAT_INCOMPAT_LOG_ALL)) {
int error;
error = xfs_sync_sb(mp, false);
@@ -1489,6 +1489,7 @@ xlog_clear_incompat(
struct xlog *log)
{
struct xfs_mount *mp = log->l_mp;
+ uint32_t incompat_mask = 0;
if (!xfs_sb_has_incompat_log_feature(&mp->m_sb,
XFS_SB_FEAT_INCOMPAT_LOG_ALL))
@@ -1497,11 +1498,16 @@ xlog_clear_incompat(
if (log->l_covered_state != XLOG_STATE_COVER_DONE2)
return;
- if (!down_write_trylock(&log->l_incompat_users))
+ if (down_write_trylock(&log->l_incompat_xattrs))
+ incompat_mask |= XFS_SB_FEAT_INCOMPAT_LOG_XATTRS;
+
+ if (!incompat_mask)
return;
- xfs_clear_incompat_log_features(mp);
- up_write(&log->l_incompat_users);
+ xfs_clear_incompat_log_features(mp, incompat_mask);
+
+ if (incompat_mask & XFS_SB_FEAT_INCOMPAT_LOG_XATTRS)
+ up_write(&log->l_incompat_xattrs);
}
/*
@@ -1618,7 +1624,7 @@ xlog_alloc_log(
}
log->l_sectBBsize = 1 << log2_size;
- init_rwsem(&log->l_incompat_users);
+ init_rwsem(&log->l_incompat_xattrs);
xlog_get_iclog_buffer_size(mp, log);
@@ -3909,15 +3915,25 @@ xfs_log_check_lsn(
*/
void
xlog_use_incompat_feat(
- struct xlog *log)
+ struct xlog *log,
+ enum xlog_incompat_feat what)
{
- down_read(&log->l_incompat_users);
+ switch (what) {
+ case XLOG_INCOMPAT_FEAT_XATTRS:
+ down_read(&log->l_incompat_xattrs);
+ break;
+ }
}
/* Notify the log that we've finished using log incompat features. */
void
xlog_drop_incompat_feat(
- struct xlog *log)
+ struct xlog *log,
+ enum xlog_incompat_feat what)
{
- up_read(&log->l_incompat_users);
+ switch (what) {
+ case XLOG_INCOMPAT_FEAT_XATTRS:
+ up_read(&log->l_incompat_xattrs);
+ break;
+ }
}
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 2728886c2963..d187f6445909 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -159,8 +159,11 @@ bool xfs_log_check_lsn(struct xfs_mount *, xfs_lsn_t);
xfs_lsn_t xlog_grant_push_threshold(struct xlog *log, int need_bytes);
bool xlog_force_shutdown(struct xlog *log, uint32_t shutdown_flags);
-void xlog_use_incompat_feat(struct xlog *log);
-void xlog_drop_incompat_feat(struct xlog *log);
-int xfs_attr_use_log_assist(struct xfs_mount *mp);
+enum xlog_incompat_feat {
+ XLOG_INCOMPAT_FEAT_XATTRS = XFS_SB_FEAT_INCOMPAT_LOG_XATTRS,
+};
+
+void xlog_use_incompat_feat(struct xlog *log, enum xlog_incompat_feat what);
+void xlog_drop_incompat_feat(struct xlog *log, enum xlog_incompat_feat what);
#endif /* __XFS_LOG_H__ */
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 1bd2963e8fbd..a13b5b6b744d 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -447,7 +447,7 @@ struct xlog {
uint32_t l_iclog_roundoff;/* padding roundoff */
/* Users of log incompat features should take a read lock. */
- struct rw_semaphore l_incompat_users;
+ struct rw_semaphore l_incompat_xattrs;
};
/*
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 6b1f37bc3e95..81ce08c23306 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3473,7 +3473,8 @@ xlog_recover_finish(
* longer anything to protect. We rely on the AIL push to write out the
* updated superblock after everything else.
*/
- if (xfs_clear_incompat_log_features(log->l_mp)) {
+ if (xfs_clear_incompat_log_features(log->l_mp,
+ XFS_SB_FEAT_INCOMPAT_LOG_ALL)) {
error = xfs_sync_sb(log->l_mp, false);
if (error < 0) {
xfs_alert(log->l_mp,
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 31f49211fdd6..54cd47882991 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1357,13 +1357,13 @@ xfs_add_incompat_log_feature(
*/
bool
xfs_clear_incompat_log_features(
- struct xfs_mount *mp)
+ struct xfs_mount *mp,
+ uint32_t features)
{
bool ret = false;
if (!xfs_has_crc(mp) ||
- !xfs_sb_has_incompat_log_feature(&mp->m_sb,
- XFS_SB_FEAT_INCOMPAT_LOG_ALL) ||
+ !xfs_sb_has_incompat_log_feature(&mp->m_sb, features) ||
xfs_is_shutdown(mp))
return false;
@@ -1375,9 +1375,8 @@ xfs_clear_incompat_log_features(
xfs_buf_lock(mp->m_sb_bp);
xfs_buf_hold(mp->m_sb_bp);
- if (xfs_sb_has_incompat_log_feature(&mp->m_sb,
- XFS_SB_FEAT_INCOMPAT_LOG_ALL)) {
- xfs_sb_remove_incompat_log_features(&mp->m_sb);
+ if (xfs_sb_has_incompat_log_feature(&mp->m_sb, features)) {
+ xfs_sb_remove_incompat_log_features(&mp->m_sb, features);
ret = true;
}
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index ec8b185d45f8..7c48a2b70f6f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -547,7 +547,7 @@ struct xfs_error_cfg * xfs_error_get_cfg(struct xfs_mount *mp,
int error_class, int error);
void xfs_force_summary_recalc(struct xfs_mount *mp);
int xfs_add_incompat_log_feature(struct xfs_mount *mp, uint32_t feature);
-bool xfs_clear_incompat_log_features(struct xfs_mount *mp);
+bool xfs_clear_incompat_log_features(struct xfs_mount *mp, uint32_t feature);
void xfs_mod_delalloc(struct xfs_mount *mp, int64_t delta);
#endif /* __XFS_MOUNT_H__ */
diff --git a/fs/xfs/xfs_xattr.c b/fs/xfs/xfs_xattr.c
index 10aa1fd39d2b..e03f199f50c7 100644
--- a/fs/xfs/xfs_xattr.c
+++ b/fs/xfs/xfs_xattr.c
@@ -37,7 +37,7 @@ xfs_attr_grab_log_assist(
* Protect ourselves from an idle log clearing the logged xattrs log
* incompat feature bit.
*/
- xlog_use_incompat_feat(mp->m_log);
+ xlog_use_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_XATTRS);
/*
* If log-assisted xattrs are already enabled, the caller can use the
@@ -57,7 +57,7 @@ xfs_attr_grab_log_assist(
return 0;
drop_incompat:
- xlog_drop_incompat_feat(mp->m_log);
+ xlog_drop_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_XATTRS);
return error;
}
@@ -65,7 +65,7 @@ static inline void
xfs_attr_rele_log_assist(
struct xfs_mount *mp)
{
- xlog_drop_incompat_feat(mp->m_log);
+ xlog_drop_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_XATTRS);
}
static inline bool
* [PATCH 05/21] xfs: create a log incompat flag for atomic extent swapping
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (2 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 02/21] xfs: create a new helper to return a file's allocation unit Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 03/21] xfs: refactor non-power-of-two alignment checks Darrick J. Wong
` (16 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Create a log incompat flag so that we only attempt to process swap
extent log items if the filesystem supports it, and a geometry flag to
advertise support if it's present.
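Userspace can then probe for the feature through the geometry ioctl; a
minimal sketch (error handling elided):

    struct xfs_fsop_geom geo = { 0 };

    if (ioctl(fd, XFS_IOC_FSGEOMETRY, &geo) == 0 &&
        (geo.flags & XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP))
        ; /* atomic extent swap is supported on this filesystem */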
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_format.h | 1 +
fs/xfs/libxfs/xfs_fs.h | 1 +
fs/xfs/libxfs/xfs_sb.c | 3 +++
fs/xfs/libxfs/xfs_swapext.h | 24 ++++++++++++++++++++++++
4 files changed, 29 insertions(+)
create mode 100644 fs/xfs/libxfs/xfs_swapext.h
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 817adb36cb1e..1424976ec955 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -391,6 +391,7 @@ xfs_sb_has_incompat_feature(
}
#define XFS_SB_FEAT_INCOMPAT_LOG_XATTRS (1 << 0) /* Delayed Attributes */
+#define XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT (1U << 31) /* file extent swap */
#define XFS_SB_FEAT_INCOMPAT_LOG_ALL \
(XFS_SB_FEAT_INCOMPAT_LOG_XATTRS)
#define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN ~XFS_SB_FEAT_INCOMPAT_LOG_ALL
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 210c17f5a16c..a39fd65e6ee0 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -239,6 +239,7 @@ typedef struct xfs_fsop_resblks {
#define XFS_FSOP_GEOM_FLAGS_BIGTIME (1 << 21) /* 64-bit nsec timestamps */
#define XFS_FSOP_GEOM_FLAGS_INOBTCNT (1 << 22) /* inobt btree counter */
#define XFS_FSOP_GEOM_FLAGS_NREXT64 (1 << 23) /* large extent counters */
+#define XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP (1U << 31) /* atomic file extent swap */
/*
* Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index b3e8ab247b28..5b6f5939fda1 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -25,6 +25,7 @@
#include "xfs_da_format.h"
#include "xfs_health.h"
#include "xfs_ag.h"
+#include "xfs_swapext.h"
/*
* Physical superblock buffer manipulations. Shared with libxfs in userspace.
@@ -1197,6 +1198,8 @@ xfs_fs_geometry(
}
if (xfs_has_large_extent_counts(mp))
geo->flags |= XFS_FSOP_GEOM_FLAGS_NREXT64;
+ if (xfs_swapext_supported(mp))
+ geo->flags |= XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP;
geo->rtsectsize = sbp->sb_blocksize;
geo->dirblocksize = xfs_dir2_dirblock_bytes(sbp);
diff --git a/fs/xfs/libxfs/xfs_swapext.h b/fs/xfs/libxfs/xfs_swapext.h
new file mode 100644
index 000000000000..316323339d76
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_swapext.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SWAPEXT_H_
+#define __XFS_SWAPEXT_H_ 1
+
+/*
+ * Decide if this filesystem supports using log items to swap file extents and
+ * restart the operation if the system fails before the operation completes.
+ *
+ * This can be done to individual file extents by using the block mapping log
+ * intent items introduced with reflink and rmap; or to entire file ranges
+ * using swapext log intent items to track the overall progress across multiple
+ * extent mappings. Realtime is not supported yet.
+ */
+static inline bool xfs_swapext_supported(struct xfs_mount *mp)
+{
+ return (xfs_has_reflink(mp) || xfs_has_rmapbt(mp)) &&
+ !xfs_has_realtime(mp);
+}
+
+#endif /* __XFS_SWAPEXT_H_ */
* [PATCH 06/21] xfs: introduce a swap-extent log intent item
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (9 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 10/21] xfs: add error injection to test swapext recovery Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 18/21] xfs: condense symbolic links after an atomic swap Darrick J. Wong
` (9 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Introduce a new intent log item to handle swapping extents.
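Roughly, the intent/done pairing follows the other XFS log intent
items; a sketch of the lifecycle, not literal code:

    /*
     * tx 1:    log an SXI recording the file range left to swap
     * tx 2..N: swap mappings; log an SXD referencing the SXI id
     *
     * If the system crashes before the final SXD is committed, log
     * recovery finds the unmatched SXI via
     * xlog_recover_sxi_commit_pass2(). In this patch ->iop_recover
     * still returns -EFSCORRUPTED; replay is wired up later.
     */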
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/Makefile | 1
fs/xfs/libxfs/xfs_log_format.h | 51 ++++++
fs/xfs/libxfs/xfs_log_recover.h | 2
fs/xfs/xfs_log_recover.c | 2
fs/xfs/xfs_super.c | 19 ++
fs/xfs/xfs_swapext_item.c | 320 +++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_swapext_item.h | 56 +++++++
7 files changed, 448 insertions(+), 3 deletions(-)
create mode 100644 fs/xfs/xfs_swapext_item.c
create mode 100644 fs/xfs/xfs_swapext_item.h
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index fc83759656c6..c5cb8cf6ffbb 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -109,6 +109,7 @@ xfs-y += xfs_log.o \
xfs_iunlink_item.o \
xfs_refcount_item.o \
xfs_rmap_item.o \
+ xfs_swapext_item.o \
xfs_log_recover.o \
xfs_trans_ail.o \
xfs_trans_buf.o
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 367f536d9881..b105a5ef6644 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -117,8 +117,9 @@ struct xfs_unmount_log_format {
#define XLOG_REG_TYPE_ATTRD_FORMAT 28
#define XLOG_REG_TYPE_ATTR_NAME 29
#define XLOG_REG_TYPE_ATTR_VALUE 30
-#define XLOG_REG_TYPE_MAX 30
-
+#define XLOG_REG_TYPE_SXI_FORMAT 31
+#define XLOG_REG_TYPE_SXD_FORMAT 32
+#define XLOG_REG_TYPE_MAX 32
/*
* Flags to log operation header
@@ -243,6 +244,8 @@ typedef struct xfs_trans_header {
#define XFS_LI_BUD 0x1245
#define XFS_LI_ATTRI 0x1246 /* attr set/remove intent*/
#define XFS_LI_ATTRD 0x1247 /* attr set/remove done */
+#define XFS_LI_SXI 0x1248 /* extent swap intent */
+#define XFS_LI_SXD 0x1249 /* extent swap done */
#define XFS_LI_TYPE_DESC \
{ XFS_LI_EFI, "XFS_LI_EFI" }, \
@@ -260,7 +263,9 @@ typedef struct xfs_trans_header {
{ XFS_LI_BUI, "XFS_LI_BUI" }, \
{ XFS_LI_BUD, "XFS_LI_BUD" }, \
{ XFS_LI_ATTRI, "XFS_LI_ATTRI" }, \
- { XFS_LI_ATTRD, "XFS_LI_ATTRD" }
+ { XFS_LI_ATTRD, "XFS_LI_ATTRD" }, \
+ { XFS_LI_SXI, "XFS_LI_SXI" }, \
+ { XFS_LI_SXD, "XFS_LI_SXD" }
/*
* Inode Log Item Format definitions.
@@ -871,6 +876,46 @@ struct xfs_bud_log_format {
uint64_t bud_bui_id; /* id of corresponding bui */
};
+/*
+ * SXI/SXD (extent swapping) log format definitions
+ */
+
+struct xfs_swap_extent {
+ uint64_t sx_inode1;
+ uint64_t sx_inode2;
+ uint64_t sx_startoff1;
+ uint64_t sx_startoff2;
+ uint64_t sx_blockcount;
+ uint64_t sx_flags;
+ int64_t sx_isize1;
+ int64_t sx_isize2;
+};
+
+#define XFS_SWAP_EXT_FLAGS (0)
+
+#define XFS_SWAP_EXT_STRINGS
+
+/* This is the structure used to lay out an sxi log item in the log. */
+struct xfs_sxi_log_format {
+ uint16_t sxi_type; /* sxi log item type */
+ uint16_t sxi_size; /* size of this item */
+ uint32_t __pad; /* must be zero */
+ uint64_t sxi_id; /* sxi identifier */
+ struct xfs_swap_extent sxi_extent; /* extent to swap */
+};
+
+/*
+ * This is the structure used to lay out an sxd log item in the log.
+ */
+struct xfs_sxd_log_format {
+ uint16_t sxd_type; /* sxd log item type */
+ uint16_t sxd_size; /* size of this item */
+ uint32_t __pad;
+ uint64_t sxd_sxi_id; /* id of corresponding sxi */
+};
+
/*
* Dquot Log format definitions.
*
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 2420865f3007..6162c93b5d38 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -74,6 +74,8 @@ extern const struct xlog_recover_item_ops xlog_cui_item_ops;
extern const struct xlog_recover_item_ops xlog_cud_item_ops;
extern const struct xlog_recover_item_ops xlog_attri_item_ops;
extern const struct xlog_recover_item_ops xlog_attrd_item_ops;
+extern const struct xlog_recover_item_ops xlog_sxi_item_ops;
+extern const struct xlog_recover_item_ops xlog_sxd_item_ops;
/*
* Macros, structures, prototypes for internal log manager use.
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 81ce08c23306..006ceff1959d 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1796,6 +1796,8 @@ static const struct xlog_recover_item_ops *xlog_recover_item_ops[] = {
&xlog_bud_item_ops,
&xlog_attri_item_ops,
&xlog_attrd_item_ops,
+ &xlog_sxi_item_ops,
+ &xlog_sxd_item_ops,
};
static const struct xlog_recover_item_ops *
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index a16d4d1b35d0..4cf26611f46f 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -42,6 +42,7 @@
#include "xfs_xattr.h"
#include "xfs_iunlink_item.h"
#include "scrub/rcbag_btree.h"
+#include "xfs_swapext_item.h"
#include <linux/magic.h>
#include <linux/fs_context.h>
@@ -2122,8 +2123,24 @@ xfs_init_caches(void)
if (!xfs_iunlink_cache)
goto out_destroy_attri_cache;
+ xfs_sxd_cache = kmem_cache_create("xfs_sxd_item",
+ sizeof(struct xfs_sxd_log_item),
+ 0, 0, NULL);
+ if (!xfs_sxd_cache)
+ goto out_destroy_iul_cache;
+
+ xfs_sxi_cache = kmem_cache_create("xfs_sxi_item",
+ sizeof(struct xfs_sxi_log_item),
+ 0, 0, NULL);
+ if (!xfs_sxi_cache)
+ goto out_destroy_sxd_cache;
+
return 0;
+ out_destroy_sxd_cache:
+ kmem_cache_destroy(xfs_sxd_cache);
+ out_destroy_iul_cache:
+ kmem_cache_destroy(xfs_iunlink_cache);
out_destroy_attri_cache:
kmem_cache_destroy(xfs_attri_cache);
out_destroy_attrd_cache:
@@ -2180,6 +2197,8 @@ xfs_destroy_caches(void)
* destroy caches.
*/
rcu_barrier();
+ kmem_cache_destroy(xfs_sxd_cache);
+ kmem_cache_destroy(xfs_sxi_cache);
kmem_cache_destroy(xfs_iunlink_cache);
kmem_cache_destroy(xfs_attri_cache);
kmem_cache_destroy(xfs_attrd_cache);
diff --git a/fs/xfs/xfs_swapext_item.c b/fs/xfs/xfs_swapext_item.c
new file mode 100644
index 000000000000..ea4a3a8de7e3
--- /dev/null
+++ b/fs/xfs/xfs_swapext_item.c
@@ -0,0 +1,320 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_shared.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_swapext_item.h"
+#include "xfs_log.h"
+#include "xfs_bmap.h"
+#include "xfs_icache.h"
+#include "xfs_trans_space.h"
+#include "xfs_error.h"
+#include "xfs_log_priv.h"
+#include "xfs_log_recover.h"
+
+struct kmem_cache *xfs_sxi_cache;
+struct kmem_cache *xfs_sxd_cache;
+
+static const struct xfs_item_ops xfs_sxi_item_ops;
+
+static inline struct xfs_sxi_log_item *SXI_ITEM(struct xfs_log_item *lip)
+{
+ return container_of(lip, struct xfs_sxi_log_item, sxi_item);
+}
+
+STATIC void
+xfs_sxi_item_free(
+ struct xfs_sxi_log_item *sxi_lip)
+{
+ kmem_free(sxi_lip->sxi_item.li_lv_shadow);
+ kmem_cache_free(xfs_sxi_cache, sxi_lip);
+}
+
+/*
+ * Freeing the SXI requires that we remove it from the AIL if it has already
+ * been placed there. However, the SXI may not yet have been placed in the AIL
+ * when called by xfs_sxi_release() from SXD processing due to the ordering of
+ * committed vs unpin operations in bulk insert operations. Hence the reference
+ * count to ensure only the last caller frees the SXI.
+ */
+STATIC void
+xfs_sxi_release(
+ struct xfs_sxi_log_item *sxi_lip)
+{
+ ASSERT(atomic_read(&sxi_lip->sxi_refcount) > 0);
+ if (atomic_dec_and_test(&sxi_lip->sxi_refcount)) {
+ xfs_trans_ail_delete(&sxi_lip->sxi_item, SHUTDOWN_LOG_IO_ERROR);
+ xfs_sxi_item_free(sxi_lip);
+ }
+}
+
+
+STATIC void
+xfs_sxi_item_size(
+ struct xfs_log_item *lip,
+ int *nvecs,
+ int *nbytes)
+{
+ *nvecs += 1;
+ *nbytes += sizeof(struct xfs_sxi_log_format);
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the given sxi log
+ * item. We use only 1 iovec, and we point that at the sxi_log_format structure
+ * embedded in the sxi item.
+ */
+STATIC void
+xfs_sxi_item_format(
+ struct xfs_log_item *lip,
+ struct xfs_log_vec *lv)
+{
+ struct xfs_sxi_log_item *sxi_lip = SXI_ITEM(lip);
+ struct xfs_log_iovec *vecp = NULL;
+
+ sxi_lip->sxi_format.sxi_type = XFS_LI_SXI;
+ sxi_lip->sxi_format.sxi_size = 1;
+
+ xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_SXI_FORMAT,
+ &sxi_lip->sxi_format,
+ sizeof(struct xfs_sxi_log_format));
+}
+
+/*
+ * The unpin operation is the last place an SXI is manipulated in the log. It
+ * is either inserted in the AIL or aborted in the event of a log I/O error. In
+ * either case, the SXI transaction has been successfully committed to make it
+ * this far. Therefore, we expect whoever committed the SXI to either construct
+ * and commit the SXD or drop the SXD's reference in the event of error. Simply
+ * drop the log's SXI reference now that the log is done with it.
+ */
+STATIC void
+xfs_sxi_item_unpin(
+ struct xfs_log_item *lip,
+ int remove)
+{
+ struct xfs_sxi_log_item *sxi_lip = SXI_ITEM(lip);
+
+ xfs_sxi_release(sxi_lip);
+}
+
+/*
+ * The SXI has been either committed or aborted if the transaction has been
+ * cancelled. If the transaction was cancelled, an SXD isn't going to be
+ * constructed and thus we free the SXI here directly.
+ */
+STATIC void
+xfs_sxi_item_release(
+ struct xfs_log_item *lip)
+{
+ xfs_sxi_release(SXI_ITEM(lip));
+}
+
+/* Allocate and initialize an sxi item. */
+STATIC struct xfs_sxi_log_item *
+xfs_sxi_init(
+ struct xfs_mount *mp)
+
+{
+ struct xfs_sxi_log_item *sxi_lip;
+
+ sxi_lip = kmem_cache_zalloc(xfs_sxi_cache, GFP_KERNEL | __GFP_NOFAIL);
+
+ xfs_log_item_init(mp, &sxi_lip->sxi_item, XFS_LI_SXI, &xfs_sxi_item_ops);
+ sxi_lip->sxi_format.sxi_id = (uintptr_t)(void *)sxi_lip;
+ atomic_set(&sxi_lip->sxi_refcount, 2);
+
+ return sxi_lip;
+}
+
+static inline struct xfs_sxd_log_item *SXD_ITEM(struct xfs_log_item *lip)
+{
+ return container_of(lip, struct xfs_sxd_log_item, sxd_item);
+}
+
+STATIC void
+xfs_sxd_item_size(
+ struct xfs_log_item *lip,
+ int *nvecs,
+ int *nbytes)
+{
+ *nvecs += 1;
+ *nbytes += sizeof(struct xfs_sxd_log_format);
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the given sxd log
+ * item. We use only 1 iovec, and we point that at the sxd_log_format structure
+ * embedded in the sxd item.
+ */
+STATIC void
+xfs_sxd_item_format(
+ struct xfs_log_item *lip,
+ struct xfs_log_vec *lv)
+{
+ struct xfs_sxd_log_item *sxd_lip = SXD_ITEM(lip);
+ struct xfs_log_iovec *vecp = NULL;
+
+ sxd_lip->sxd_format.sxd_type = XFS_LI_SXD;
+ sxd_lip->sxd_format.sxd_size = 1;
+
+ xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_SXD_FORMAT, &sxd_lip->sxd_format,
+ sizeof(struct xfs_sxd_log_format));
+}
+
+/*
+ * The SXD is either committed or aborted if the transaction is cancelled. If
+ * the transaction is cancelled, drop our reference to the SXI and free the
+ * SXD.
+ */
+STATIC void
+xfs_sxd_item_release(
+ struct xfs_log_item *lip)
+{
+ struct xfs_sxd_log_item *sxd_lip = SXD_ITEM(lip);
+
+ kmem_free(sxd_lip->sxd_item.li_lv_shadow);
+ xfs_sxi_release(sxd_lip->sxd_intent_log_item);
+ kmem_cache_free(xfs_sxd_cache, sxd_lip);
+}
+
+static struct xfs_log_item *
+xfs_sxd_item_intent(
+ struct xfs_log_item *lip)
+{
+ return &SXD_ITEM(lip)->sxd_intent_log_item->sxi_item;
+}
+
+static const struct xfs_item_ops xfs_sxd_item_ops = {
+ .flags = XFS_ITEM_RELEASE_WHEN_COMMITTED |
+ XFS_ITEM_INTENT_DONE,
+ .iop_size = xfs_sxd_item_size,
+ .iop_format = xfs_sxd_item_format,
+ .iop_release = xfs_sxd_item_release,
+ .iop_intent = xfs_sxd_item_intent,
+};
+
+/* Process a swapext update intent item that was recovered from the log. */
+STATIC int
+xfs_sxi_item_recover(
+ struct xfs_log_item *lip,
+ struct list_head *capture_list)
+{
+ return -EFSCORRUPTED;
+}
+
+STATIC bool
+xfs_sxi_item_match(
+ struct xfs_log_item *lip,
+ uint64_t intent_id)
+{
+ return SXI_ITEM(lip)->sxi_format.sxi_id == intent_id;
+}
+
+/* Relog an intent item to push the log tail forward. */
+static struct xfs_log_item *
+xfs_sxi_item_relog(
+ struct xfs_log_item *intent,
+ struct xfs_trans *tp)
+{
+ ASSERT(0);
+ return NULL;
+}
+
+static const struct xfs_item_ops xfs_sxi_item_ops = {
+ .flags = XFS_ITEM_INTENT,
+ .iop_size = xfs_sxi_item_size,
+ .iop_format = xfs_sxi_item_format,
+ .iop_unpin = xfs_sxi_item_unpin,
+ .iop_release = xfs_sxi_item_release,
+ .iop_recover = xfs_sxi_item_recover,
+ .iop_match = xfs_sxi_item_match,
+ .iop_relog = xfs_sxi_item_relog,
+};
+
+/*
+ * This routine is called to create an in-core swapext update item from the
+ * sxi format structure which was logged on disk. It allocates an in-core
+ * sxi, copies the extent information from the format structure into it, and
+ * adds the sxi to the AIL with the given LSN.
+ */
+STATIC int
+xlog_recover_sxi_commit_pass2(
+ struct xlog *log,
+ struct list_head *buffer_list,
+ struct xlog_recover_item *item,
+ xfs_lsn_t lsn)
+{
+ struct xfs_mount *mp = log->l_mp;
+ struct xfs_sxi_log_item *sxi_lip;
+ struct xfs_sxi_log_format *sxi_formatp;
+ size_t len;
+
+ sxi_formatp = item->ri_buf[0].i_addr;
+
+ if (sxi_formatp->__pad != 0) {
+ XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+ return -EFSCORRUPTED;
+ }
+
+ len = sizeof(struct xfs_sxi_log_format);
+ if (item->ri_buf[0].i_len != len) {
+ XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+ return -EFSCORRUPTED;
+ }
+
+ sxi_lip = xfs_sxi_init(mp);
+ memcpy(&sxi_lip->sxi_format, sxi_formatp, len);
+
+ xfs_trans_ail_insert(log->l_ailp, &sxi_lip->sxi_item, lsn);
+ xfs_sxi_release(sxi_lip);
+ return 0;
+}
+
+const struct xlog_recover_item_ops xlog_sxi_item_ops = {
+ .item_type = XFS_LI_SXI,
+ .commit_pass2 = xlog_recover_sxi_commit_pass2,
+};
+
+/*
+ * This routine is called when an SXD format structure is found in a committed
+ * transaction in the log. Its purpose is to cancel the corresponding SXI if it
+ * was still in the log. To do this it searches the AIL for the SXI with an id
+ * equal to that in the SXD format structure. If we find it we drop the SXD
+ * reference, which removes the SXI from the AIL and frees it.
+ */
+STATIC int
+xlog_recover_sxd_commit_pass2(
+ struct xlog *log,
+ struct list_head *buffer_list,
+ struct xlog_recover_item *item,
+ xfs_lsn_t lsn)
+{
+ struct xfs_sxd_log_format *sxd_formatp;
+
+ sxd_formatp = item->ri_buf[0].i_addr;
+ if (item->ri_buf[0].i_len != sizeof(struct xfs_sxd_log_format)) {
+ XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+ return -EFSCORRUPTED;
+ }
+
+ xlog_recover_release_intent(log, XFS_LI_SXI, sxd_formatp->sxd_sxi_id);
+ return 0;
+}
+
+const struct xlog_recover_item_ops xlog_sxd_item_ops = {
+ .item_type = XFS_LI_SXD,
+ .commit_pass2 = xlog_recover_sxd_commit_pass2,
+};
diff --git a/fs/xfs/xfs_swapext_item.h b/fs/xfs/xfs_swapext_item.h
new file mode 100644
index 000000000000..e3cb59692e50
--- /dev/null
+++ b/fs/xfs/xfs_swapext_item.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SWAPEXT_ITEM_H__
+#define __XFS_SWAPEXT_ITEM_H__
+
+/*
+ * The extent swapping intent item helps us perform atomic extent swaps between
+ * two inode forks. It does this by tracking the range of logical offsets that
+ * still need to be swapped, and relogs as progress happens.
+ *
+ * *I items should be recorded in the *first* of a series of rolled
+ * transactions, and the *D items should be recorded in the same transaction
+ * that records the associated bmbt updates.
+ *
+ * Should the system crash after the commit of the first transaction but
+ * before the commit of the final transaction in a series, log recovery will
+ * use the redo information recorded by the intent items to replay the
+ * rest of the extent swaps.
+ */
+
+/* kernel only SXI/SXD definitions */
+
+struct xfs_mount;
+struct kmem_cache;
+
+/*
+ * This is the "swapext update intent" log item. It is used to log the fact
+ * that we are swapping extents between two files. It is used in conjunction
+ * with the "swapext update done" log item described below.
+ *
+ * These log items follow the same rules as struct xfs_efi_log_item; see the
+ * comments about that structure (in xfs_extfree_item.h) for more details.
+ */
+struct xfs_sxi_log_item {
+ struct xfs_log_item sxi_item;
+ atomic_t sxi_refcount;
+ struct xfs_sxi_log_format sxi_format;
+};
+
+/*
+ * This is the "swapext update done" log item. It is used to log the fact that
+ * some extent swaps mentioned in an earlier sxi item have been performed.
+ */
+struct xfs_sxd_log_item {
+ struct xfs_log_item sxd_item;
+ struct xfs_sxi_log_item *sxd_intent_log_item;
+ struct xfs_sxd_log_format sxd_format;
+};
+
+extern struct kmem_cache *xfs_sxi_cache;
+extern struct kmem_cache *xfs_sxd_cache;
+
+#endif /* __XFS_SWAPEXT_ITEM_H__ */
* [PATCH 07/21] xfs: create deferred log items for extent swapping
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (7 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 11/21] xfs: port xfs_swap_extents_rmap to our new code Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 10/21] xfs: add error injection to test swapext recovery Darrick J. Wong
` (11 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Now that we've created the skeleton of a log intent item to track and
restart extent swap operations, add the upper level logic to commit
intent items and turn them into concrete work recorded in the log. We
use the deferred item "multihop" feature that was introduced a few
patches ago to constrain the number of active swap operations to one per
thread.
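Queueing the work is then a single deferred-op hop; this mirrors
xfs_swapext_schedule() in the patch body:

    /*
     * Queue the swap. Defer ops will call xfs_swapext_finish_one()
     * repeatedly, relogging the intent each time, until it stops
     * returning -EAGAIN.
     */
    xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_SWAPEXT, &sxi->sxi_list);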
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/Makefile | 2
fs/xfs/libxfs/xfs_bmap.h | 4
fs/xfs/libxfs/xfs_defer.c | 7
fs/xfs/libxfs/xfs_defer.h | 3
fs/xfs/libxfs/xfs_format.h | 6
fs/xfs/libxfs/xfs_log_format.h | 28 +
fs/xfs/libxfs/xfs_swapext.c | 1021 +++++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_swapext.h | 142 +++++
fs/xfs/libxfs/xfs_trans_space.h | 4
fs/xfs/xfs_swapext_item.c | 357 +++++++++++++-
fs/xfs/xfs_trace.c | 1
fs/xfs/xfs_trace.h | 215 ++++++++
fs/xfs/xfs_xchgrange.c | 65 ++
fs/xfs/xfs_xchgrange.h | 17 +
14 files changed, 1855 insertions(+), 17 deletions(-)
create mode 100644 fs/xfs/libxfs/xfs_swapext.c
create mode 100644 fs/xfs/xfs_xchgrange.c
create mode 100644 fs/xfs/xfs_xchgrange.h
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index c5cb8cf6ffbb..23b0c40620cf 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -46,6 +46,7 @@ xfs-y += $(addprefix libxfs/, \
xfs_refcount.o \
xfs_refcount_btree.o \
xfs_sb.o \
+ xfs_swapext.o \
xfs_symlink_remote.o \
xfs_trans_inode.o \
xfs_trans_resv.o \
@@ -92,6 +93,7 @@ xfs-y += xfs_aops.o \
xfs_sysfs.o \
xfs_trans.o \
xfs_xattr.o \
+ xfs_xchgrange.o \
kmem.o
# low-level transaction/log code
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index cb09a43a2872..413ec27f2f24 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -144,7 +144,7 @@ static inline int xfs_bmapi_whichfork(uint32_t bmapi_flags)
{ BMAP_COWFORK, "COW" }
/* Return true if the extent is an allocated extent, written or not. */
-static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
+static inline bool xfs_bmap_is_real_extent(const struct xfs_bmbt_irec *irec)
{
return irec->br_startblock != HOLESTARTBLOCK &&
irec->br_startblock != DELAYSTARTBLOCK &&
@@ -155,7 +155,7 @@ static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
* Return true if the extent is a real, allocated extent, or false if it is a
* delayed allocation, and unwritten extent or a hole.
*/
-static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
+static inline bool xfs_bmap_is_written_extent(const struct xfs_bmbt_irec *irec)
{
return xfs_bmap_is_real_extent(irec) &&
irec->br_state != XFS_EXT_UNWRITTEN;
diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index bcfb6a4203cd..1619b9b928db 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -26,6 +26,7 @@
#include "xfs_da_format.h"
#include "xfs_da_btree.h"
#include "xfs_attr.h"
+#include "xfs_swapext.h"
static struct kmem_cache *xfs_defer_pending_cache;
@@ -189,6 +190,7 @@ static const struct xfs_defer_op_type *defer_op_types[] = {
[XFS_DEFER_OPS_TYPE_FREE] = &xfs_extent_free_defer_type,
[XFS_DEFER_OPS_TYPE_AGFL_FREE] = &xfs_agfl_free_defer_type,
[XFS_DEFER_OPS_TYPE_ATTR] = &xfs_attr_defer_type,
+ [XFS_DEFER_OPS_TYPE_SWAPEXT] = &xfs_swapext_defer_type,
};
/*
@@ -913,6 +915,10 @@ xfs_defer_init_item_caches(void)
error = xfs_attr_intent_init_cache();
if (error)
goto err;
+ error = xfs_swapext_intent_init_cache();
+ if (error)
+ goto err;
+
return 0;
err:
xfs_defer_destroy_item_caches();
@@ -923,6 +929,7 @@ xfs_defer_init_item_caches(void)
void
xfs_defer_destroy_item_caches(void)
{
+ xfs_swapext_intent_destroy_cache();
xfs_attr_intent_destroy_cache();
xfs_extfree_intent_destroy_cache();
xfs_bmap_intent_destroy_cache();
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 114a3a4930a3..bcc48b0c75c9 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -20,6 +20,7 @@ enum xfs_defer_ops_type {
XFS_DEFER_OPS_TYPE_FREE,
XFS_DEFER_OPS_TYPE_AGFL_FREE,
XFS_DEFER_OPS_TYPE_ATTR,
+ XFS_DEFER_OPS_TYPE_SWAPEXT,
XFS_DEFER_OPS_TYPE_MAX,
};
@@ -65,7 +66,7 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
extern const struct xfs_defer_op_type xfs_attr_defer_type;
-
+extern const struct xfs_defer_op_type xfs_swapext_defer_type;
/*
* Deferred operation item relogging limits.
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 1424976ec955..bb8bff488017 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -425,6 +425,12 @@ static inline bool xfs_sb_version_haslogxattrs(struct xfs_sb *sbp)
XFS_SB_FEAT_INCOMPAT_LOG_XATTRS);
}
+static inline bool xfs_sb_version_haslogswapext(struct xfs_sb *sbp)
+{
+ return xfs_sb_is_v5(sbp) && (sbp->sb_features_log_incompat &
+ XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT);
+}
+
static inline bool
xfs_is_quota_inode(struct xfs_sb *sbp, xfs_ino_t ino)
{
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index b105a5ef6644..65a84fdefe56 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -891,9 +891,33 @@ struct xfs_swap_extent {
int64_t sx_isize2;
};
-#define XFS_SWAP_EXT_FLAGS (0)
+/* Swap extents between extended attribute forks. */
+#define XFS_SWAP_EXT_ATTR_FORK (1ULL << 0)
-#define XFS_SWAP_EXT_STRINGS
+/* Set the file sizes when finished. */
+#define XFS_SWAP_EXT_SET_SIZES (1ULL << 1)
+
+/* Do not swap any part of the range where inode1's mapping is a hole. */
+#define XFS_SWAP_EXT_SKIP_INO1_HOLES (1ULL << 2)
+
+/* Clear the reflink flag from inode1 after the operation. */
+#define XFS_SWAP_EXT_CLEAR_INO1_REFLINK (1ULL << 3)
+
+/* Clear the reflink flag from inode2 after the operation. */
+#define XFS_SWAP_EXT_CLEAR_INO2_REFLINK (1ULL << 4)
+
+#define XFS_SWAP_EXT_FLAGS (XFS_SWAP_EXT_ATTR_FORK | \
+ XFS_SWAP_EXT_SET_SIZES | \
+ XFS_SWAP_EXT_SKIP_INO1_HOLES | \
+ XFS_SWAP_EXT_CLEAR_INO1_REFLINK | \
+ XFS_SWAP_EXT_CLEAR_INO2_REFLINK)
+
+#define XFS_SWAP_EXT_STRINGS \
+ { XFS_SWAP_EXT_ATTR_FORK, "ATTRFORK" }, \
+ { XFS_SWAP_EXT_SET_SIZES, "SETSIZES" }, \
+ { XFS_SWAP_EXT_SKIP_INO1_HOLES, "SKIP_INO1_HOLES" }, \
+ { XFS_SWAP_EXT_CLEAR_INO1_REFLINK, "CLEAR_INO1_REFLINK" }, \
+ { XFS_SWAP_EXT_CLEAR_INO2_REFLINK, "CLEAR_INO2_REFLINK" }
/* This is the structure used to lay out an sxi log item in the log. */
struct xfs_sxi_log_format {
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
new file mode 100644
index 000000000000..0bc758c5cf5c
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -0,0 +1,1021 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_bmap.h"
+#include "xfs_icache.h"
+#include "xfs_quota.h"
+#include "xfs_swapext.h"
+#include "xfs_trace.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
+#include "xfs_error.h"
+#include "xfs_errortag.h"
+#include "xfs_health.h"
+
+struct kmem_cache *xfs_swapext_intent_cache;
+
+/* bmbt mappings adjacent to a pair of records. */
+struct xfs_swapext_adjacent {
+ struct xfs_bmbt_irec left1;
+ struct xfs_bmbt_irec right1;
+ struct xfs_bmbt_irec left2;
+ struct xfs_bmbt_irec right2;
+};
+
+#define ADJACENT_INIT { \
+ .left1 = { .br_startblock = HOLESTARTBLOCK }, \
+ .right1 = { .br_startblock = HOLESTARTBLOCK }, \
+ .left2 = { .br_startblock = HOLESTARTBLOCK }, \
+ .right2 = { .br_startblock = HOLESTARTBLOCK }, \
+}
+
+/* Information to help us reset reflink flag / CoW fork state after a swap. */
+
+/* Previous state of the two inodes' reflink flags. */
+#define XFS_REFLINK_STATE_IP1 (1U << 0)
+#define XFS_REFLINK_STATE_IP2 (1U << 1)
+
+/*
+ * If the reflink flag is set on either inode, make sure it has an incore CoW
+ * fork, since all reflink inodes must have them. If there's a CoW fork and it
+ * has extents in it, make sure the inodes are tagged appropriately so that
+ * speculative preallocations can be GC'd if we run low of space.
+ */
+static inline void
+xfs_swapext_ensure_cowfork(
+ struct xfs_inode *ip)
+{
+ struct xfs_ifork *cfork;
+
+ if (xfs_is_reflink_inode(ip))
+ xfs_ifork_init_cow(ip);
+
+ cfork = xfs_ifork_ptr(ip, XFS_COW_FORK);
+ if (!cfork)
+ return;
+ if (cfork->if_bytes > 0)
+ xfs_inode_set_cowblocks_tag(ip);
+ else
+ xfs_inode_clear_cowblocks_tag(ip);
+}
+
+/* Schedule an atomic extent swap. */
+void
+xfs_swapext_schedule(
+ struct xfs_trans *tp,
+ struct xfs_swapext_intent *sxi)
+{
+ trace_xfs_swapext_defer(tp->t_mountp, sxi);
+ xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_SWAPEXT, &sxi->sxi_list);
+}
+
+/*
+ * Adjust the on-disk inode size upwards if needed so that we never map extents
+ * into the file past EOF. This is crucial so that log recovery won't get
+ * confused by the sudden appearance of post-eof extents.
+ */
+STATIC void
+xfs_swapext_update_size(
+ struct xfs_trans *tp,
+ struct xfs_inode *ip,
+ struct xfs_bmbt_irec *imap,
+ xfs_fsize_t new_isize)
+{
+ struct xfs_mount *mp = tp->t_mountp;
+ xfs_fsize_t len;
+
+ if (new_isize < 0)
+ return;
+
+ len = min(XFS_FSB_TO_B(mp, imap->br_startoff + imap->br_blockcount),
+ new_isize);
+
+ if (len <= ip->i_disk_size)
+ return;
+
+ trace_xfs_swapext_update_inode_size(ip, len);
+
+ ip->i_disk_size = len;
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+static inline bool
+sxi_has_more_swap_work(const struct xfs_swapext_intent *sxi)
+{
+ return sxi->sxi_blockcount > 0;
+}
+
+static inline bool
+sxi_has_postop_work(const struct xfs_swapext_intent *sxi)
+{
+ return sxi->sxi_flags & (XFS_SWAP_EXT_CLEAR_INO1_REFLINK |
+ XFS_SWAP_EXT_CLEAR_INO2_REFLINK);
+}
+
+static inline void
+sxi_advance(
+ struct xfs_swapext_intent *sxi,
+ const struct xfs_bmbt_irec *irec)
+{
+ sxi->sxi_startoff1 += irec->br_blockcount;
+ sxi->sxi_startoff2 += irec->br_blockcount;
+ sxi->sxi_blockcount -= irec->br_blockcount;
+}
+
+/* Check all extents to make sure we can actually swap them. */
+int
+xfs_swapext_check_extents(
+ struct xfs_mount *mp,
+ const struct xfs_swapext_req *req)
+{
+ struct xfs_ifork *ifp1, *ifp2;
+
+ /* No fork? */
+ ifp1 = xfs_ifork_ptr(req->ip1, req->whichfork);
+ ifp2 = xfs_ifork_ptr(req->ip2, req->whichfork);
+ if (!ifp1 || !ifp2)
+ return -EINVAL;
+
+ /* We don't know how to swap local format forks. */
+ if (ifp1->if_format == XFS_DINODE_FMT_LOCAL ||
+ ifp2->if_format == XFS_DINODE_FMT_LOCAL)
+ return -EINVAL;
+
+ /* We don't support realtime data forks yet. */
+ if (!XFS_IS_REALTIME_INODE(req->ip1))
+ return 0;
+ if (req->whichfork == XFS_ATTR_FORK)
+ return 0;
+ return -EINVAL;
+}
+
+#ifdef CONFIG_XFS_QUOTA
+/* Log the actual updates to the quota accounting. */
+static inline void
+xfs_swapext_update_quota(
+ struct xfs_trans *tp,
+ struct xfs_swapext_intent *sxi,
+ struct xfs_bmbt_irec *irec1,
+ struct xfs_bmbt_irec *irec2)
+{
+ int64_t ip1_delta = 0, ip2_delta = 0;
+ unsigned int qflag;
+
+ qflag = XFS_IS_REALTIME_INODE(sxi->sxi_ip1) ? XFS_TRANS_DQ_RTBCOUNT :
+ XFS_TRANS_DQ_BCOUNT;
+
+ if (xfs_bmap_is_real_extent(irec1)) {
+ ip1_delta -= irec1->br_blockcount;
+ ip2_delta += irec1->br_blockcount;
+ }
+
+ if (xfs_bmap_is_real_extent(irec2)) {
+ ip1_delta += irec2->br_blockcount;
+ ip2_delta -= irec2->br_blockcount;
+ }
+
+ xfs_trans_mod_dquot_byino(tp, sxi->sxi_ip1, qflag, ip1_delta);
+ xfs_trans_mod_dquot_byino(tp, sxi->sxi_ip2, qflag, ip2_delta);
+}
+#else
+# define xfs_swapext_update_quota(tp, sxi, irec1, irec2) ((void)0)
+#endif
+
+/*
+ * Walk forward through the file ranges in @sxi until we find two different
+ * mappings to exchange. If there is work to do, return the mappings;
+ * otherwise we've reached the end of the range and sxi_blockcount will be
+ * zero.
+ *
+ * If the walk skips over a pair of mappings to the same storage, save them as
+ * the left records in @adj (if provided) so that the simulation phase can
+ * avoid an extra lookup.
+ */
+static int
+xfs_swapext_find_mappings(
+ struct xfs_swapext_intent *sxi,
+ struct xfs_bmbt_irec *irec1,
+ struct xfs_bmbt_irec *irec2,
+ struct xfs_swapext_adjacent *adj)
+{
+ int nimaps;
+ int bmap_flags;
+ int error;
+
+ bmap_flags = xfs_bmapi_aflag(xfs_swapext_whichfork(sxi));
+
+ for (; sxi_has_more_swap_work(sxi); sxi_advance(sxi, irec1)) {
+ /* Read extent from the first file */
+ nimaps = 1;
+ error = xfs_bmapi_read(sxi->sxi_ip1, sxi->sxi_startoff1,
+ sxi->sxi_blockcount, irec1, &nimaps,
+ bmap_flags);
+ if (error)
+ return error;
+ if (nimaps != 1 ||
+ irec1->br_startblock == DELAYSTARTBLOCK ||
+ irec1->br_startoff != sxi->sxi_startoff1) {
+ /*
+ * We should never get no mapping or a delalloc extent
+ * or something that doesn't match what we asked for,
+ * since the caller flushed both inodes and we hold the
+ * ILOCKs for both inodes.
+ */
+ ASSERT(0);
+ return -EINVAL;
+ }
+
+ /*
+ * If the caller told us to ignore sparse areas of file1, jump
+ * ahead to the next region.
+ */
+ if ((sxi->sxi_flags & XFS_SWAP_EXT_SKIP_INO1_HOLES) &&
+ irec1->br_startblock == HOLESTARTBLOCK) {
+ trace_xfs_swapext_extent1(sxi->sxi_ip1, irec1);
+ continue;
+ }
+
+ /* Read extent from the second file */
+ nimaps = 1;
+ error = xfs_bmapi_read(sxi->sxi_ip2, sxi->sxi_startoff2,
+ irec1->br_blockcount, irec2, &nimaps,
+ bmap_flags);
+ if (error)
+ return error;
+ if (nimaps != 1 ||
+ irec2->br_startblock == DELAYSTARTBLOCK ||
+ irec2->br_startoff != sxi->sxi_startoff2) {
+ /*
+ * We should never get no mapping or a delalloc extent
+ * or something that doesn't match what we asked for,
+ * since the caller flushed both inodes and we hold the
+ * ILOCKs for both inodes.
+ */
+ ASSERT(0);
+ return -EINVAL;
+ }
+
+ /*
+ * We can only swap as many blocks as the smaller of the two
+ * extent maps.
+ */
+ irec1->br_blockcount = min(irec1->br_blockcount,
+ irec2->br_blockcount);
+
+ trace_xfs_swapext_extent1(sxi->sxi_ip1, irec1);
+ trace_xfs_swapext_extent2(sxi->sxi_ip2, irec2);
+
+ /* We found something to swap, so return it. */
+ if (irec1->br_startblock != irec2->br_startblock)
+ return 0;
+
+ /*
+ * Two extents mapped to the same physical block must not have
+ * different states; that's filesystem corruption. Move on to
+ * the next extent if they're both holes or both the same
+ * physical extent.
+ */
+ if (irec1->br_state != irec2->br_state) {
+ xfs_bmap_mark_sick(sxi->sxi_ip1,
+ xfs_swapext_whichfork(sxi));
+ xfs_bmap_mark_sick(sxi->sxi_ip2,
+ xfs_swapext_whichfork(sxi));
+ return -EFSCORRUPTED;
+ }
+
+ /*
+ * Save the mappings if we're estimating work and skipping
+ * these identical mappings.
+ */
+ if (adj) {
+ memcpy(&adj->left1, irec1, sizeof(*irec1));
+ memcpy(&adj->left2, irec2, sizeof(*irec2));
+ }
+ }
+
+ return 0;
+}
+
+/* Exchange these two mappings. */
+static void
+xfs_swapext_exchange_mappings(
+ struct xfs_trans *tp,
+ struct xfs_swapext_intent *sxi,
+ struct xfs_bmbt_irec *irec1,
+ struct xfs_bmbt_irec *irec2)
+{
+ int whichfork = xfs_swapext_whichfork(sxi);
+
+ xfs_swapext_update_quota(tp, sxi, irec1, irec2);
+
+ /* Remove both mappings. */
+ xfs_bmap_unmap_extent(tp, sxi->sxi_ip1, whichfork, irec1);
+ xfs_bmap_unmap_extent(tp, sxi->sxi_ip2, whichfork, irec2);
+
+ /*
+ * Re-add both mappings. We swap the file offsets between the two maps
+ * and add the opposite map, which has the effect of filling the
+ * logical offsets we just unmapped, but with the physical mapping
+ * information swapped.
+ */
+ swap(irec1->br_startoff, irec2->br_startoff);
+ xfs_bmap_map_extent(tp, sxi->sxi_ip1, whichfork, irec2);
+ xfs_bmap_map_extent(tp, sxi->sxi_ip2, whichfork, irec1);
+
+ /* Make sure we're not mapping extents past EOF. */
+ if (whichfork == XFS_DATA_FORK) {
+ xfs_swapext_update_size(tp, sxi->sxi_ip1, irec2,
+ sxi->sxi_isize1);
+ xfs_swapext_update_size(tp, sxi->sxi_ip2, irec1,
+ sxi->sxi_isize2);
+ }
+
+ /*
+ * Advance our cursor and exit. The caller (either defer ops or log
+ * recovery) will log the SXD item, and if *blockcount is nonzero, it
+ * will log a new SXI item for the remainder and call us back.
+ */
+ sxi_advance(sxi, irec1);
+}
+
+static inline void
+xfs_swapext_clear_reflink(
+ struct xfs_trans *tp,
+ struct xfs_inode *ip)
+{
+ trace_xfs_reflink_unset_inode_flag(ip);
+
+ ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+/* Finish whatever work might come after a swap operation. */
+static int
+xfs_swapext_do_postop_work(
+ struct xfs_trans *tp,
+ struct xfs_swapext_intent *sxi)
+{
+ if (sxi->sxi_flags & XFS_SWAP_EXT_CLEAR_INO1_REFLINK) {
+ xfs_swapext_clear_reflink(tp, sxi->sxi_ip1);
+ sxi->sxi_flags &= ~XFS_SWAP_EXT_CLEAR_INO1_REFLINK;
+ }
+
+ if (sxi->sxi_flags & XFS_SWAP_EXT_CLEAR_INO2_REFLINK) {
+ xfs_swapext_clear_reflink(tp, sxi->sxi_ip2);
+ sxi->sxi_flags &= ~XFS_SWAP_EXT_CLEAR_INO2_REFLINK;
+ }
+
+ return 0;
+}
+
+/* Finish one extent swap, possibly log more. */
+int
+xfs_swapext_finish_one(
+ struct xfs_trans *tp,
+ struct xfs_swapext_intent *sxi)
+{
+ struct xfs_bmbt_irec irec1, irec2;
+ int error;
+
+ if (sxi_has_more_swap_work(sxi)) {
+ /*
+ * If the operation state says that some range of the files
+ * has not yet been swapped, look for extents in that range to
+ * swap. If we find some extents, swap them.
+ */
+ error = xfs_swapext_find_mappings(sxi, &irec1, &irec2, NULL);
+ if (error)
+ return error;
+
+ if (sxi_has_more_swap_work(sxi))
+ xfs_swapext_exchange_mappings(tp, sxi, &irec1, &irec2);
+
+ /*
+ * If the caller asked us to exchange the file sizes after the
+ * swap and either we just swapped the last extents in the
+ * range or we didn't find anything to swap, update the ondisk
+ * file sizes.
+ */
+ if ((sxi->sxi_flags & XFS_SWAP_EXT_SET_SIZES) &&
+ !sxi_has_more_swap_work(sxi)) {
+ sxi->sxi_ip1->i_disk_size = sxi->sxi_isize1;
+ sxi->sxi_ip2->i_disk_size = sxi->sxi_isize2;
+
+ xfs_trans_log_inode(tp, sxi->sxi_ip1, XFS_ILOG_CORE);
+ xfs_trans_log_inode(tp, sxi->sxi_ip2, XFS_ILOG_CORE);
+ }
+ } else if (sxi_has_postop_work(sxi)) {
+ /*
+ * Now that we're finished with the swap operation, complete
+ * the post-op cleanup work.
+ */
+ error = xfs_swapext_do_postop_work(tp, sxi);
+ if (error)
+ return error;
+ }
+
+ /* If we still have work to do, ask for a new transaction. */
+ if (sxi_has_more_swap_work(sxi) || sxi_has_postop_work(sxi)) {
+ trace_xfs_swapext_defer(tp->t_mountp, sxi);
+ return -EAGAIN;
+ }
+
+ /*
+ * If we reach here, we've finished all the swapping work and the post
+ * operation work. The last thing we need to do before returning to
+ * the caller is to make sure that COW forks are set up correctly.
+ */
+ if (!(sxi->sxi_flags & XFS_SWAP_EXT_ATTR_FORK)) {
+ xfs_swapext_ensure_cowfork(sxi->sxi_ip1);
+ xfs_swapext_ensure_cowfork(sxi->sxi_ip2);
+ }
+
+ return 0;
+}
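
Aside for reviewers: a single call to this function exchanges at most
one mapping, so a worked example of the relog loop may help. Suppose
the requested range covers three mappings and the caller asked for
XFS_SWAP_EXT_SET_SIZES plus reflink flag clearing:

    call 1: find and exchange mapping 1; more work -> -EAGAIN
    call 2: find and exchange mapping 2; more work -> -EAGAIN
    call 3: exchange mapping 3 and update the ondisk file sizes;
            postop work remains -> -EAGAIN
    call 4: clear the reflink flags, ensure COW forks, return 0

Each -EAGAIN return causes the caller to roll the transaction and log
a fresh SXI for the remaining work, which is what makes the whole
operation restartable after a crash.
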
+
+/*
+ * Compute the number of bmbt blocks we should reserve for each file. In the
+ * worst case, each exchange will fill a hole with a new mapping, which could
+ * result in a btree split every time we add a new leaf block.
+ */
+static inline uint64_t
+xfs_swapext_bmbt_blocks(
+ struct xfs_mount *mp,
+ const struct xfs_swapext_req *req)
+{
+ return howmany_64(req->nr_exchanges,
+ XFS_MAX_CONTIG_BMAPS_PER_BLOCK(mp)) *
+ XFS_EXTENTADD_SPACE_RES(mp, req->whichfork);
+}
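
Aside for reviewers: a quick worked example of this reservation math,
with hypothetical bmbt geometry (m_bmap_dmxr[0] = 124 and
m_bmap_dmnr[0] = 62, so XFS_MAX_CONTIG_BMAPS_PER_BLOCK = 62):

    nr_exchanges = 1000
    howmany_64(1000, 62) = 17 leaf blocks' worth of new records

so we reserve 17 * XFS_EXTENTADD_SPACE_RES(mp, whichfork) blocks, i.e.
enough to absorb a full split chain for every leaf block we might fill.
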
+
+static inline uint64_t
+xfs_swapext_rmapbt_blocks(
+ struct xfs_mount *mp,
+ const struct xfs_swapext_req *req)
+{
+ if (!xfs_has_rmapbt(mp))
+ return 0;
+ if (XFS_IS_REALTIME_INODE(req->ip1))
+ return 0;
+
+ return howmany_64(req->nr_exchanges,
+ XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp)) *
+ XFS_RMAPADD_SPACE_RES(mp);
+}
+
+/* Estimate the bmbt and rmapbt overhead required to exchange extents. */
+static int
+xfs_swapext_estimate_overhead(
+ struct xfs_swapext_req *req)
+{
+ struct xfs_mount *mp = req->ip1->i_mount;
+ xfs_filblks_t bmbt_blocks;
+ xfs_filblks_t rmapbt_blocks;
+ xfs_filblks_t resblks = req->resblks;
+
+ /*
+ * Compute the number of bmbt and rmapbt blocks we might need to handle
+ * the estimated number of exchanges.
+ */
+ bmbt_blocks = xfs_swapext_bmbt_blocks(mp, req);
+ rmapbt_blocks = xfs_swapext_rmapbt_blocks(mp, req);
+
+ trace_xfs_swapext_overhead(mp, bmbt_blocks, rmapbt_blocks);
+
+ /* Make sure the change in file block count doesn't overflow. */
+ if (check_add_overflow(req->ip1_bcount, bmbt_blocks, &req->ip1_bcount))
+ return -EFBIG;
+ if (check_add_overflow(req->ip2_bcount, bmbt_blocks, &req->ip2_bcount))
+ return -EFBIG;
+
+ /*
+ * Add together the number of blocks we need to handle btree growth,
+ * then add it to the number of blocks we need to reserve to this
+ * transaction.
+ */
+ if (check_add_overflow(resblks, bmbt_blocks, &resblks))
+ return -ENOSPC;
+ if (check_add_overflow(resblks, bmbt_blocks, &resblks))
+ return -ENOSPC;
+ if (check_add_overflow(resblks, rmapbt_blocks, &resblks))
+ return -ENOSPC;
+
+ /* Can't actually reserve more than UINT_MAX blocks. */
+ if (resblks > UINT_MAX)
+ return -ENOSPC;
+
+ req->resblks = resblks;
+ trace_xfs_swapext_final_estimate(req);
+ return 0;
+}
+
+/* Decide if we can merge two real extents. */
+static inline bool
+can_merge(
+ const struct xfs_bmbt_irec *b1,
+ const struct xfs_bmbt_irec *b2)
+{
+ /* Don't merge holes. */
+ if (b1->br_startblock == HOLESTARTBLOCK ||
+ b2->br_startblock == HOLESTARTBLOCK)
+ return false;
+
+ /* Don't merge delalloc reservations either; only real extents merge. */
+ if (!xfs_bmap_is_real_extent(b1) || !xfs_bmap_is_real_extent(b2))
+ return false;
+
+ if (b1->br_startoff + b1->br_blockcount == b2->br_startoff &&
+ b1->br_startblock + b1->br_blockcount == b2->br_startblock &&
+ b1->br_state == b2->br_state &&
+ b1->br_blockcount + b2->br_blockcount <= XFS_MAX_BMBT_EXTLEN)
+ return true;
+
+ return false;
+}
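
Aside for reviewers: a standalone sketch of the contiguity test with
invented mappings; the hole/state checks are omitted and MAX_EXTLEN
stands in for XFS_MAX_BMBT_EXTLEN:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_EXTLEN (1ULL << 21)	/* illustrative cap only */

    struct irec { unsigned long long off, block, count; };

    static bool mergeable(const struct irec *b1, const struct irec *b2)
    {
        return b1->off + b1->count == b2->off &&
               b1->block + b1->count == b2->block &&
               b1->count + b2->count <= MAX_EXTLEN;
    }

    int main(void)
    {
        struct irec a = { .off = 0, .block = 100, .count = 8 };
        struct irec b = { .off = 8, .block = 108, .count = 8 };
        struct irec c = { .off = 8, .block = 200, .count = 8 };

        printf("a+b: %d\n", mergeable(&a, &b));	/* 1: both contiguous */
        printf("a+c: %d\n", mergeable(&a, &c));	/* 0: blocks jump */
        return 0;
    }
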
+
+#define CLEFT_CONTIG 0x01
+#define CRIGHT_CONTIG 0x02
+#define CHOLE 0x04
+#define CBOTH_CONTIG (CLEFT_CONTIG | CRIGHT_CONTIG)
+
+#define NLEFT_CONTIG 0x10
+#define NRIGHT_CONTIG 0x20
+#define NHOLE 0x40
+#define NBOTH_CONTIG (NLEFT_CONTIG | NRIGHT_CONTIG)
+
+/* Estimate the effect of a single swap on extent count. */
+static inline int
+delta_nextents_step(
+ struct xfs_mount *mp,
+ const struct xfs_bmbt_irec *left,
+ const struct xfs_bmbt_irec *curr,
+ const struct xfs_bmbt_irec *new,
+ const struct xfs_bmbt_irec *right)
+{
+ bool lhole, rhole, chole, nhole;
+ unsigned int state = 0;
+ int ret = 0;
+
+ lhole = left->br_startblock == HOLESTARTBLOCK;
+ rhole = right->br_startblock == HOLESTARTBLOCK;
+ chole = curr->br_startblock == HOLESTARTBLOCK;
+ nhole = new->br_startblock == HOLESTARTBLOCK;
+
+ if (chole)
+ state |= CHOLE;
+ if (!lhole && !chole && can_merge(left, curr))
+ state |= CLEFT_CONTIG;
+ if (!rhole && !chole && can_merge(curr, right))
+ state |= CRIGHT_CONTIG;
+ if ((state & CBOTH_CONTIG) == CBOTH_CONTIG &&
+ left->br_blockcount + curr->br_blockcount +
+ right->br_blockcount > XFS_MAX_BMBT_EXTLEN)
+ state &= ~CRIGHT_CONTIG;
+
+ if (nhole)
+ state |= NHOLE;
+ if (!lhole && !nhole && can_merge(left, new))
+ state |= NLEFT_CONTIG;
+ if (!rhole && !nhole && can_merge(new, right))
+ state |= NRIGHT_CONTIG;
+ if ((state & NBOTH_CONTIG) == NBOTH_CONTIG &&
+ left->br_blockcount + new->br_blockcount +
+ right->br_blockcount > XFS_MAX_BMBT_EXTLEN)
+ state &= ~NRIGHT_CONTIG;
+
+ switch (state & (CLEFT_CONTIG | CRIGHT_CONTIG | CHOLE)) {
+ case CLEFT_CONTIG | CRIGHT_CONTIG:
+ /*
+ * left/curr/right are the same extent, so deleting curr causes
+ * 2 new extents to be created.
+ */
+ ret += 2;
+ break;
+ case 0:
+ /*
+ * curr is not contiguous with any extent, so we remove curr
+ * completely
+ */
+ ret--;
+ break;
+ case CHOLE:
+ /* hole, do nothing */
+ break;
+ case CLEFT_CONTIG:
+ case CRIGHT_CONTIG:
+ /* trim either left or right, no change */
+ break;
+ }
+
+ switch (state & (NLEFT_CONTIG | NRIGHT_CONTIG | NHOLE)) {
+ case NLEFT_CONTIG | NRIGHT_CONTIG:
+ /*
+ * left/curr/right will become the same extent, so adding
+ * curr causes the deletion of right.
+ */
+ ret--;
+ break;
+ case 0:
+ /* new is not contiguous with any extent */
+ ret++;
+ break;
+ case NHOLE:
+ /* hole, do nothing. */
+ break;
+ case NLEFT_CONTIG:
+ case NRIGHT_CONTIG:
+ /* new is absorbed into left or right, no change */
+ break;
+ }
+
+ trace_xfs_swapext_delta_nextents_step(mp, left, curr, new, right, ret,
+ state);
+ return ret;
+}
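
Aside for reviewers: a worked example of the two switches, with
hypothetical mappings. Say left and curr are both real but their
blocks don't line up, right is a hole, and the incoming new mapping is
physically contiguous with left. The first switch takes case 0
(removing curr deletes a record, so ret--); the second takes
NLEFT_CONTIG (new is absorbed into left, no change); the step returns
-1 and that fork ends up one record shorter.
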
+
+/* Make sure we don't overflow the extent counters. */
+static inline int
+ensure_delta_nextents(
+ struct xfs_swapext_req *req,
+ struct xfs_inode *ip,
+ int64_t delta)
+{
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_ifork *ifp = xfs_ifork_ptr(ip, req->whichfork);
+ xfs_extnum_t max_extents;
+ bool large_extcount;
+
+ if (delta < 0)
+ return 0;
+
+ if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_REDUCE_MAX_IEXTENTS)) {
+ if (ifp->if_nextents + delta > 10)
+ return -EFBIG;
+ }
+
+ if (req->req_flags & XFS_SWAP_REQ_NREXT64)
+ large_extcount = true;
+ else
+ large_extcount = xfs_inode_has_large_extent_counts(ip);
+
+ max_extents = xfs_iext_max_nextents(large_extcount, req->whichfork);
+ if (ifp->if_nextents + delta <= max_extents)
+ return 0;
+ if (large_extcount)
+ return -EFBIG;
+ if (!xfs_has_large_extent_counts(mp))
+ return -EFBIG;
+
+ max_extents = xfs_iext_max_nextents(true, req->whichfork);
+ if (ifp->if_nextents + delta > max_extents)
+ return -EFBIG;
+
+ req->req_flags |= XFS_SWAP_REQ_NREXT64;
+ return 0;
+}
+
+/* Find the next extent after irec. */
+static inline int
+get_next_ext(
+ struct xfs_inode *ip,
+ int bmap_flags,
+ const struct xfs_bmbt_irec *irec,
+ struct xfs_bmbt_irec *nrec)
+{
+ xfs_fileoff_t off;
+ xfs_filblks_t blockcount;
+ int nimaps = 1;
+ int error;
+
+ off = irec->br_startoff + irec->br_blockcount;
+ blockcount = XFS_MAX_FILEOFF - off;
+ error = xfs_bmapi_read(ip, off, blockcount, nrec, &nimaps, bmap_flags);
+ if (error)
+ return error;
+ if (nrec->br_startblock == DELAYSTARTBLOCK ||
+ nrec->br_startoff != off) {
+ /*
+ * If we don't get the extent we want, return a zero-length
+ * mapping, which our estimator function will pretend is a hole.
+ * We shouldn't get delalloc reservations.
+ */
+ nrec->br_startblock = HOLESTARTBLOCK;
+ }
+
+ return 0;
+}
+
+int __init
+xfs_swapext_intent_init_cache(void)
+{
+ xfs_swapext_intent_cache = kmem_cache_create("xfs_swapext_intent",
+ sizeof(struct xfs_swapext_intent),
+ 0, 0, NULL);
+
+ return xfs_swapext_intent_cache != NULL ? 0 : -ENOMEM;
+}
+
+void
+xfs_swapext_intent_destroy_cache(void)
+{
+ kmem_cache_destroy(xfs_swapext_intent_cache);
+ xfs_swapext_intent_cache = NULL;
+}
+
+/*
+ * Decide if we will swap the reflink flags between the two files after the
+ * swap. The only time we want to do this is if we're exchanging all extents
+ * under EOF and the inode reflink flags have different states.
+ */
+static inline bool
+sxi_can_exchange_reflink_flags(
+ const struct xfs_swapext_req *req,
+ unsigned int reflink_state)
+{
+ struct xfs_mount *mp = req->ip1->i_mount;
+
+ if (hweight32(reflink_state) != 1)
+ return false;
+ if (req->startoff1 != 0 || req->startoff2 != 0)
+ return false;
+ if (req->blockcount != XFS_B_TO_FSB(mp, req->ip1->i_disk_size))
+ return false;
+ if (req->blockcount != XFS_B_TO_FSB(mp, req->ip2->i_disk_size))
+ return false;
+ return true;
+}
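
Aside for reviewers: the hweight32() == 1 test means exactly one of
the two files is reflinked, which is the only case where exchanging
the flags changes anything. A userspace analog:

    #include <stdio.h>

    int main(void)
    {
        unsigned int rs = 0x1;	/* only XFS_REFLINK_STATE_IP1 set */

        /* one bit set: a full-file swap should move the flag over */
        printf("exchange flags? %d\n", __builtin_popcount(rs) == 1);
        return 0;
    }
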
+
+/* Allocate and initialize a new incore intent item from a request. */
+struct xfs_swapext_intent *
+xfs_swapext_init_intent(
+ const struct xfs_swapext_req *req,
+ unsigned int *reflink_state)
+{
+ struct xfs_swapext_intent *sxi;
+ unsigned int rs = 0;
+
+ sxi = kmem_cache_zalloc(xfs_swapext_intent_cache,
+ GFP_NOFS | __GFP_NOFAIL);
+ INIT_LIST_HEAD(&sxi->sxi_list);
+ sxi->sxi_ip1 = req->ip1;
+ sxi->sxi_ip2 = req->ip2;
+ sxi->sxi_startoff1 = req->startoff1;
+ sxi->sxi_startoff2 = req->startoff2;
+ sxi->sxi_blockcount = req->blockcount;
+ sxi->sxi_isize1 = sxi->sxi_isize2 = -1;
+
+ if (req->whichfork == XFS_ATTR_FORK)
+ sxi->sxi_flags |= XFS_SWAP_EXT_ATTR_FORK;
+
+ if (req->whichfork == XFS_DATA_FORK &&
+ (req->req_flags & XFS_SWAP_REQ_SET_SIZES)) {
+ sxi->sxi_flags |= XFS_SWAP_EXT_SET_SIZES;
+ sxi->sxi_isize1 = req->ip2->i_disk_size;
+ sxi->sxi_isize2 = req->ip1->i_disk_size;
+ }
+
+ if (req->req_flags & XFS_SWAP_REQ_SKIP_INO1_HOLES)
+ sxi->sxi_flags |= XFS_SWAP_EXT_SKIP_INO1_HOLES;
+
+ if (req->req_flags & XFS_SWAP_REQ_LOGGED)
+ sxi->sxi_op_flags |= XFS_SWAP_EXT_OP_LOGGED;
+ if (req->req_flags & XFS_SWAP_REQ_NREXT64)
+ sxi->sxi_op_flags |= XFS_SWAP_EXT_OP_NREXT64;
+
+ if (req->whichfork == XFS_DATA_FORK) {
+ /*
+ * Record the state of each inode's reflink flag before the
+ * operation.
+ */
+ if (xfs_is_reflink_inode(req->ip1))
+ rs |= XFS_REFLINK_STATE_IP1;
+ if (xfs_is_reflink_inode(req->ip2))
+ rs |= XFS_REFLINK_STATE_IP2;
+
+ /*
+ * Figure out if we're clearing the reflink flags (which
+ * effectively swaps them) after the operation.
+ */
+ if (sxi_can_exchange_reflink_flags(req, rs)) {
+ if (rs & XFS_REFLINK_STATE_IP1)
+ sxi->sxi_flags |=
+ XFS_SWAP_EXT_CLEAR_INO1_REFLINK;
+ if (rs & XFS_REFLINK_STATE_IP2)
+ sxi->sxi_flags |=
+ XFS_SWAP_EXT_CLEAR_INO2_REFLINK;
+ }
+ }
+
+ if (reflink_state)
+ *reflink_state = rs;
+ return sxi;
+}
+
+/*
+ * Estimate the number of exchange operations and the number of file blocks
+ * in each file that will be affected by the exchange operation.
+ */
+int
+xfs_swapext_estimate(
+ struct xfs_swapext_req *req)
+{
+ struct xfs_swapext_intent *sxi;
+ struct xfs_bmbt_irec irec1, irec2;
+ struct xfs_swapext_adjacent adj = ADJACENT_INIT;
+ xfs_filblks_t ip1_blocks = 0, ip2_blocks = 0;
+ int64_t d_nexts1, d_nexts2;
+ int bmap_flags;
+ int error;
+
+ ASSERT(!(req->req_flags & ~XFS_SWAP_REQ_FLAGS));
+
+ bmap_flags = xfs_bmapi_aflag(req->whichfork);
+ sxi = xfs_swapext_init_intent(req, NULL);
+
+ /*
+ * To guard against the possibility of overflowing the extent counters,
+ * we have to estimate an upper bound on the potential increase in that
+ * counter. We can split the extent at each end of the range, and for
+ * each step of the swap we can split the extent that we're working on
+ * if the extents do not align.
+ */
+ d_nexts1 = d_nexts2 = 3;
+
+ while (sxi_has_more_swap_work(sxi)) {
+ /*
+ * Walk through the file ranges until we find something to
+ * swap. Because we're simulating the swap, pass in adj to
+ * capture skipped mappings for correct estimation of bmbt
+ * record merges.
+ */
+ error = xfs_swapext_find_mappings(sxi, &irec1, &irec2, &adj);
+ if (error)
+ goto out_free;
+ if (!sxi_has_more_swap_work(sxi))
+ break;
+
+ /* Update accounting. */
+ if (xfs_bmap_is_real_extent(&irec1))
+ ip1_blocks += irec1.br_blockcount;
+ if (xfs_bmap_is_real_extent(&irec2))
+ ip2_blocks += irec2.br_blockcount;
+ req->nr_exchanges++;
+
+ /* Read the next extents from both files. */
+ error = get_next_ext(req->ip1, bmap_flags, &irec1, &adj.right1);
+ if (error)
+ goto out_free;
+
+ error = get_next_ext(req->ip2, bmap_flags, &irec2, &adj.right2);
+ if (error)
+ goto out_free;
+
+ /* Update extent count deltas. */
+ d_nexts1 += delta_nextents_step(req->ip1->i_mount,
+ &adj.left1, &irec1, &irec2, &adj.right1);
+
+ d_nexts2 += delta_nextents_step(req->ip1->i_mount,
+ &adj.left2, &irec2, &irec1, &adj.right2);
+
+ /* Now pretend we swapped the extents. */
+ if (can_merge(&adj.left2, &irec1))
+ adj.left2.br_blockcount += irec1.br_blockcount;
+ else
+ memcpy(&adj.left2, &irec1, sizeof(irec1));
+
+ if (can_merge(&adj.left1, &irec2))
+ adj.left1.br_blockcount += irec2.br_blockcount;
+ else
+ memcpy(&adj.left1, &irec2, sizeof(irec2));
+
+ sxi_advance(sxi, &irec1);
+ }
+
+ /* Account for the blocks that are being exchanged. */
+ if (XFS_IS_REALTIME_INODE(req->ip1) &&
+ req->whichfork == XFS_DATA_FORK) {
+ req->ip1_rtbcount = ip1_blocks;
+ req->ip2_rtbcount = ip2_blocks;
+ } else {
+ req->ip1_bcount = ip1_blocks;
+ req->ip2_bcount = ip2_blocks;
+ }
+
+ /*
+ * Make sure that both forks have enough slack left in their extent
+ * counters that the swap operation will not overflow.
+ */
+ trace_xfs_swapext_delta_nextents(req, d_nexts1, d_nexts2);
+ if (req->ip1 == req->ip2) {
+ error = ensure_delta_nextents(req, req->ip1,
+ d_nexts1 + d_nexts2);
+ } else {
+ error = ensure_delta_nextents(req, req->ip1, d_nexts1);
+ if (error)
+ goto out_free;
+ error = ensure_delta_nextents(req, req->ip2, d_nexts2);
+ }
+ if (error)
+ goto out_free;
+
+ trace_xfs_swapext_initial_estimate(req);
+ error = xfs_swapext_estimate_overhead(req);
+out_free:
+ kmem_cache_free(xfs_swapext_intent_cache, sxi);
+ return error;
+}
+
+static inline void
+xfs_swapext_set_reflink(
+ struct xfs_trans *tp,
+ struct xfs_inode *ip)
+{
+ trace_xfs_reflink_set_inode_flag(ip);
+
+ ip->i_diflags2 |= XFS_DIFLAG2_REFLINK;
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+/*
+ * If either file has shared blocks and we're swapping data forks, we must flag
+ * the other file as having shared blocks so that we get the shared-block rmap
+ * functions if we need to fix up the rmaps.
+ */
+void
+xfs_swapext_ensure_reflink(
+ struct xfs_trans *tp,
+ const struct xfs_swapext_intent *sxi,
+ unsigned int reflink_state)
+{
+ if ((reflink_state & XFS_REFLINK_STATE_IP1) &&
+ !xfs_is_reflink_inode(sxi->sxi_ip2))
+ xfs_swapext_set_reflink(tp, sxi->sxi_ip2);
+
+ if ((reflink_state & XFS_REFLINK_STATE_IP2) &&
+ !xfs_is_reflink_inode(sxi->sxi_ip1))
+ xfs_swapext_set_reflink(tp, sxi->sxi_ip1);
+}
+
+/* Widen the extent counts of both inodes if necessary. */
+static inline void
+xfs_swapext_upgrade_extent_counts(
+ struct xfs_trans *tp,
+ const struct xfs_swapext_intent *sxi)
+{
+ if (!(sxi->sxi_op_flags & XFS_SWAP_EXT_OP_NREXT64))
+ return;
+
+ sxi->sxi_ip1->i_diflags2 |= XFS_DIFLAG2_NREXT64;
+ xfs_trans_log_inode(tp, sxi->sxi_ip1, XFS_ILOG_CORE);
+
+ sxi->sxi_ip2->i_diflags2 |= XFS_DIFLAG2_NREXT64;
+ xfs_trans_log_inode(tp, sxi->sxi_ip2, XFS_ILOG_CORE);
+}
+
+/*
+ * Schedule a swap of a range of extents from one inode to another. If the
+ * atomic swap feature is enabled, the operation progress can be resumed even
+ * if the system goes down. The caller must commit the transaction to start
+ * the work.
+ *
+ * The caller must ensure the inodes are joined to the transaction and
+ * ILOCKd; they will still be joined to the transaction at exit.
+ */
+void
+xfs_swapext(
+ struct xfs_trans *tp,
+ const struct xfs_swapext_req *req)
+{
+ struct xfs_swapext_intent *sxi;
+ unsigned int reflink_state;
+
+ ASSERT(xfs_isilocked(req->ip1, XFS_ILOCK_EXCL));
+ ASSERT(xfs_isilocked(req->ip2, XFS_ILOCK_EXCL));
+ ASSERT(req->whichfork != XFS_COW_FORK);
+ ASSERT(!(req->req_flags & ~XFS_SWAP_REQ_FLAGS));
+ if (req->req_flags & XFS_SWAP_REQ_SET_SIZES)
+ ASSERT(req->whichfork == XFS_DATA_FORK);
+
+ if (req->blockcount == 0)
+ return;
+
+ sxi = xfs_swapext_init_intent(req, &reflink_state);
+ xfs_swapext_schedule(tp, sxi);
+ xfs_swapext_ensure_reflink(tp, sxi, reflink_state);
+ xfs_swapext_upgrade_extent_counts(tp, sxi);
+}
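
Aside for reviewers: to make the calling convention concrete, here is
a hedged sketch of how a caller might drive this. The function names
are from this patchset, but the fragment is illustrative, not a
drop-in (error unwinding is elided):

    struct xfs_swapext_req	req = {
        .ip1		= ip1,
        .ip2		= ip2,
        .whichfork	= XFS_DATA_FORK,
        .blockcount	= count_fsb,
        .req_flags	= XFS_SWAP_REQ_LOGGED,
    };
    struct xfs_trans	*tp;
    int			error;

    error = xfs_xchg_range_estimate(&req);	/* fills req.resblks */
    if (error)
        return error;
    error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, req.resblks,
            0, 0, &tp);
    if (error)
        return error;
    xfs_xchg_range_ilock(tp, ip1, ip2);	/* ILOCK + join both */
    xfs_swapext(tp, &req);		/* queue the intent */
    error = xfs_trans_commit(tp);	/* kick off the chain */
    xfs_xchg_range_iunlock(ip1, ip2);
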
diff --git a/fs/xfs/libxfs/xfs_swapext.h b/fs/xfs/libxfs/xfs_swapext.h
index 316323339d76..1987897ddc25 100644
--- a/fs/xfs/libxfs/xfs_swapext.h
+++ b/fs/xfs/libxfs/xfs_swapext.h
@@ -21,4 +21,146 @@ static inline bool xfs_swapext_supported(struct xfs_mount *mp)
!xfs_has_realtime(mp);
}
+/*
+ * In-core information about an extent swap request between ranges of two
+ * inodes.
+ */
+struct xfs_swapext_intent {
+ /* List of other incore deferred work. */
+ struct list_head sxi_list;
+
+ /* Inodes participating in the operation. */
+ struct xfs_inode *sxi_ip1;
+ struct xfs_inode *sxi_ip2;
+
+ /* File offset range information. */
+ xfs_fileoff_t sxi_startoff1;
+ xfs_fileoff_t sxi_startoff2;
+ xfs_filblks_t sxi_blockcount;
+
+ /* Set these file sizes after the operation, unless negative. */
+ xfs_fsize_t sxi_isize1;
+ xfs_fsize_t sxi_isize2;
+
+ /* XFS_SWAP_EXT_* log operation flags */
+ unsigned int sxi_flags;
+
+ /* XFS_SWAP_EXT_OP_* flags */
+ unsigned int sxi_op_flags;
+};
+
+/* Use log intent items to track and restart the entire operation. */
+#define XFS_SWAP_EXT_OP_LOGGED (1U << 0)
+
+/* Upgrade files to have large extent counts before proceeding. */
+#define XFS_SWAP_EXT_OP_NREXT64 (1U << 1)
+
+#define XFS_SWAP_EXT_OP_STRINGS \
+ { XFS_SWAP_EXT_OP_LOGGED, "LOGGED" }, \
+ { XFS_SWAP_EXT_OP_NREXT64, "NREXT64" }
+
+static inline int
+xfs_swapext_whichfork(const struct xfs_swapext_intent *sxi)
+{
+ if (sxi->sxi_flags & XFS_SWAP_EXT_ATTR_FORK)
+ return XFS_ATTR_FORK;
+ return XFS_DATA_FORK;
+}
+
+/* Parameters for a swapext request. */
+struct xfs_swapext_req {
+ /* Inodes participating in the operation. */
+ struct xfs_inode *ip1;
+ struct xfs_inode *ip2;
+
+ /* File offset range information. */
+ xfs_fileoff_t startoff1;
+ xfs_fileoff_t startoff2;
+ xfs_filblks_t blockcount;
+
+ /* Data or attr fork? */
+ int whichfork;
+
+ /* XFS_SWAP_REQ_* operation flags */
+ unsigned int req_flags;
+
+ /*
+ * Fields below this line are filled out by xfs_swapext_estimate;
+ * callers should initialize this part of the struct to zero.
+ */
+
+ /*
+ * Data device blocks to be moved out of ip1, and free space needed to
+ * handle the bmbt changes.
+ */
+ xfs_filblks_t ip1_bcount;
+
+ /*
+ * Data device blocks to be moved out of ip2, and free space needed to
+ * handle the bmbt changes.
+ */
+ xfs_filblks_t ip2_bcount;
+
+ /* rt blocks to be moved out of ip1. */
+ xfs_filblks_t ip1_rtbcount;
+
+ /* rt blocks to be moved out of ip2. */
+ xfs_filblks_t ip2_rtbcount;
+
+ /* Free space needed to handle the bmbt changes */
+ unsigned long long resblks;
+
+ /* Number of extent swaps needed to complete the operation */
+ unsigned long long nr_exchanges;
+};
+
+/* Caller has permission to use log intent items for the swapext operation. */
+#define XFS_SWAP_REQ_LOGGED (1U << 0)
+
+/* Set the file sizes when finished. */
+#define XFS_SWAP_REQ_SET_SIZES (1U << 1)
+
+/* Do not swap any part of the range where ip1's mapping is a hole. */
+#define XFS_SWAP_REQ_SKIP_INO1_HOLES (1U << 2)
+
+/* Files need to be upgraded to have large extent counts. */
+#define XFS_SWAP_REQ_NREXT64 (1U << 3)
+
+#define XFS_SWAP_REQ_FLAGS (XFS_SWAP_REQ_LOGGED | \
+ XFS_SWAP_REQ_SET_SIZES | \
+ XFS_SWAP_REQ_SKIP_INO1_HOLES | \
+ XFS_SWAP_REQ_NREXT64)
+
+#define XFS_SWAP_REQ_STRINGS \
+ { XFS_SWAP_REQ_LOGGED, "LOGGED" }, \
+ { XFS_SWAP_REQ_SET_SIZES, "SETSIZES" }, \
+ { XFS_SWAP_REQ_SKIP_INO1_HOLES, "SKIP_INO1_HOLES" }, \
+ { XFS_SWAP_REQ_NREXT64, "NREXT64" }
+
+unsigned int xfs_swapext_reflink_prep(const struct xfs_swapext_req *req);
+void xfs_swapext_reflink_finish(struct xfs_trans *tp,
+ const struct xfs_swapext_req *req, unsigned int reflink_state);
+
+int xfs_swapext_estimate(struct xfs_swapext_req *req);
+
+extern struct kmem_cache *xfs_swapext_intent_cache;
+
+int __init xfs_swapext_intent_init_cache(void);
+void xfs_swapext_intent_destroy_cache(void);
+
+struct xfs_swapext_intent *xfs_swapext_init_intent(
+ const struct xfs_swapext_req *req, unsigned int *reflink_state);
+void xfs_swapext_ensure_reflink(struct xfs_trans *tp,
+ const struct xfs_swapext_intent *sxi, unsigned int reflink_state);
+
+void xfs_swapext_schedule(struct xfs_trans *tp,
+ struct xfs_swapext_intent *sxi);
+int xfs_swapext_finish_one(struct xfs_trans *tp,
+ struct xfs_swapext_intent *sxi);
+
+int xfs_swapext_check_extents(struct xfs_mount *mp,
+ const struct xfs_swapext_req *req);
+
+void xfs_swapext(struct xfs_trans *tp, const struct xfs_swapext_req *req);
+
#endif /* __XFS_SWAPEXT_H_ */
diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
index 87b31c69a773..9640fc232c14 100644
--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -10,6 +10,10 @@
* Components of space reservations.
*/
+/* Worst case number of bmaps that can be held in a block. */
+#define XFS_MAX_CONTIG_BMAPS_PER_BLOCK(mp) \
+ (((mp)->m_bmap_dmxr[0]) - ((mp)->m_bmap_dmnr[0]))
+
/* Worst case number of rmaps that can be held in a block. */
#define XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) \
(((mp)->m_rmap_mxr[0]) - ((mp)->m_rmap_mnr[0]))
diff --git a/fs/xfs/xfs_swapext_item.c b/fs/xfs/xfs_swapext_item.c
index ea4a3a8de7e3..24331a702497 100644
--- a/fs/xfs/xfs_swapext_item.c
+++ b/fs/xfs/xfs_swapext_item.c
@@ -16,13 +16,17 @@
#include "xfs_trans.h"
#include "xfs_trans_priv.h"
#include "xfs_swapext_item.h"
+#include "xfs_swapext.h"
#include "xfs_log.h"
#include "xfs_bmap.h"
#include "xfs_icache.h"
+#include "xfs_bmap_btree.h"
#include "xfs_trans_space.h"
#include "xfs_error.h"
#include "xfs_log_priv.h"
#include "xfs_log_recover.h"
+#include "xfs_xchgrange.h"
+#include "xfs_trace.h"
struct kmem_cache *xfs_sxi_cache;
struct kmem_cache *xfs_sxd_cache;
@@ -206,13 +210,333 @@ static const struct xfs_item_ops xfs_sxd_item_ops = {
.iop_intent = xfs_sxd_item_intent,
};
+static struct xfs_sxd_log_item *
+xfs_trans_get_sxd(
+ struct xfs_trans *tp,
+ struct xfs_sxi_log_item *sxi_lip)
+{
+ struct xfs_sxd_log_item *sxd_lip;
+
+ sxd_lip = kmem_cache_zalloc(xfs_sxd_cache, GFP_KERNEL | __GFP_NOFAIL);
+ xfs_log_item_init(tp->t_mountp, &sxd_lip->sxd_item, XFS_LI_SXD,
+ &xfs_sxd_item_ops);
+ sxd_lip->sxd_intent_log_item = sxi_lip;
+ sxd_lip->sxd_format.sxd_sxi_id = sxi_lip->sxi_format.sxi_id;
+
+ xfs_trans_add_item(tp, &sxd_lip->sxd_item);
+ return sxd_lip;
+}
+
+/*
+ * Finish a swapext update and log it to the SXD. Note that the transaction
+ * is marked dirty regardless of whether the swapext update succeeds or fails,
+ * in order to support the SXI/SXD lifecycle rules.
+ */
+static int
+xfs_swapext_finish_update(
+ struct xfs_trans *tp,
+ struct xfs_log_item *done,
+ struct xfs_swapext_intent *sxi)
+{
+ int error;
+
+ error = xfs_swapext_finish_one(tp, sxi);
+
+ /*
+ * Mark the transaction dirty, even on error. This ensures the
+ * transaction is aborted, which:
+ *
+ * 1.) releases the SXI and frees the SXD
+ * 2.) shuts down the filesystem
+ */
+ tp->t_flags |= XFS_TRANS_DIRTY;
+ if (done)
+ set_bit(XFS_LI_DIRTY, &done->li_flags);
+
+ return error;
+}
+
+/* Log swapext updates in the intent item. */
+STATIC struct xfs_log_item *
+xfs_swapext_create_intent(
+ struct xfs_trans *tp,
+ struct list_head *items,
+ unsigned int count,
+ bool sort)
+{
+ struct xfs_sxi_log_item *sxi_lip;
+ struct xfs_swapext_intent *sxi;
+ struct xfs_swap_extent *sx;
+
+ ASSERT(count == 1);
+
+ sxi = list_first_entry_or_null(items, struct xfs_swapext_intent,
+ sxi_list);
+
+ /*
+ * We use the same defer ops control machinery to perform extent swaps
+ * even if we aren't using the machinery to track the operation status
+ * through log items.
+ */
+ if (!(sxi->sxi_op_flags & XFS_SWAP_EXT_OP_LOGGED))
+ return NULL;
+
+ sxi_lip = xfs_sxi_init(tp->t_mountp);
+ xfs_trans_add_item(tp, &sxi_lip->sxi_item);
+ tp->t_flags |= XFS_TRANS_DIRTY;
+ set_bit(XFS_LI_DIRTY, &sxi_lip->sxi_item.li_flags);
+
+ sx = &sxi_lip->sxi_format.sxi_extent;
+ sx->sx_inode1 = sxi->sxi_ip1->i_ino;
+ sx->sx_inode2 = sxi->sxi_ip2->i_ino;
+ sx->sx_startoff1 = sxi->sxi_startoff1;
+ sx->sx_startoff2 = sxi->sxi_startoff2;
+ sx->sx_blockcount = sxi->sxi_blockcount;
+ sx->sx_isize1 = sxi->sxi_isize1;
+ sx->sx_isize2 = sxi->sxi_isize2;
+ sx->sx_flags = sxi->sxi_flags;
+
+ return &sxi_lip->sxi_item;
+}
+
+STATIC struct xfs_log_item *
+xfs_swapext_create_done(
+ struct xfs_trans *tp,
+ struct xfs_log_item *intent,
+ unsigned int count)
+{
+ if (intent == NULL)
+ return NULL;
+ return &xfs_trans_get_sxd(tp, SXI_ITEM(intent))->sxd_item;
+}
+
+/* Process a deferred swapext update. */
+STATIC int
+xfs_swapext_finish_item(
+ struct xfs_trans *tp,
+ struct xfs_log_item *done,
+ struct list_head *item,
+ struct xfs_btree_cur **state)
+{
+ struct xfs_swapext_intent *sxi;
+ int error;
+
+ sxi = container_of(item, struct xfs_swapext_intent, sxi_list);
+
+ /*
+ * Swap one more extent between the two files. If there's still more
+ * work to do, we want to requeue ourselves after all other pending
+ * deferred operations have finished. This includes all of the dfops
+ * that we queued directly as well as any new ones created in the
+ * process of finishing the others. Doing so prevents us from queuing
+ * a large number of SXI log items in kernel memory, which in turn
+ * prevents us from pinning the tail of the log (while logging those
+ * new SXI items) until the first SXI items can be processed.
+ */
+ error = xfs_swapext_finish_update(tp, done, sxi);
+ if (error == -EAGAIN)
+ return error;
+
+ kmem_cache_free(xfs_swapext_intent_cache, sxi);
+ return error;
+}
+
+/* Abort all pending SXIs. */
+STATIC void
+xfs_swapext_abort_intent(
+ struct xfs_log_item *intent)
+{
+ xfs_sxi_release(SXI_ITEM(intent));
+}
+
+/* Cancel a deferred swapext update. */
+STATIC void
+xfs_swapext_cancel_item(
+ struct list_head *item)
+{
+ struct xfs_swapext_intent *sxi;
+
+ sxi = container_of(item, struct xfs_swapext_intent, sxi_list);
+ kmem_cache_free(xfs_swapext_intent_cache, sxi);
+}
+
+const struct xfs_defer_op_type xfs_swapext_defer_type = {
+ .max_items = 1,
+ .create_intent = xfs_swapext_create_intent,
+ .abort_intent = xfs_swapext_abort_intent,
+ .create_done = xfs_swapext_create_done,
+ .finish_item = xfs_swapext_finish_item,
+ .cancel_item = xfs_swapext_cancel_item,
+};
+
+/* Is this recovered SXI ok? */
+static inline bool
+xfs_sxi_validate(
+ struct xfs_mount *mp,
+ struct xfs_sxi_log_item *sxi_lip)
+{
+ struct xfs_swap_extent *sx = &sxi_lip->sxi_format.sxi_extent;
+
+ if (!xfs_sb_version_haslogswapext(&mp->m_sb))
+ return false;
+
+ if (sxi_lip->sxi_format.__pad != 0)
+ return false;
+
+ if (sx->sx_flags & ~XFS_SWAP_EXT_FLAGS)
+ return false;
+
+ if (!xfs_verify_ino(mp, sx->sx_inode1) ||
+ !xfs_verify_ino(mp, sx->sx_inode2))
+ return false;
+
+ if ((sx->sx_flags & XFS_SWAP_EXT_SET_SIZES) &&
+ (sx->sx_isize1 < 0 || sx->sx_isize2 < 0))
+ return false;
+
+ if (!xfs_verify_fileext(mp, sx->sx_startoff1, sx->sx_blockcount))
+ return false;
+
+ return xfs_verify_fileext(mp, sx->sx_startoff2, sx->sx_blockcount);
+}
+
+/*
+ * Use the recovered log state to create a new request, estimate resource
+ * requirements, and create a new incore intent state.
+ */
+STATIC struct xfs_swapext_intent *
+xfs_sxi_item_recover_intent(
+ struct xfs_mount *mp,
+ const struct xfs_swap_extent *sx,
+ struct xfs_swapext_req *req,
+ unsigned int *reflink_state)
+{
+ struct xfs_inode *ip1, *ip2;
+ int error;
+
+ /*
+ * Grab both inodes and set IRECOVERY to prevent trimming of post-eof
+ * extents and freeing of unlinked inodes until we're totally done
+ * processing files.
+ */
+ error = xlog_recover_iget(mp, sx->sx_inode1, &ip1);
+ if (error)
+ return ERR_PTR(error);
+ error = xlog_recover_iget(mp, sx->sx_inode2, &ip2);
+ if (error)
+ goto err_rele1;
+
+ req->ip1 = ip1;
+ req->ip2 = ip2;
+ req->startoff1 = sx->sx_startoff1;
+ req->startoff2 = sx->sx_startoff2;
+ req->blockcount = sx->sx_blockcount;
+
+ if (sx->sx_flags & XFS_SWAP_EXT_ATTR_FORK)
+ req->whichfork = XFS_ATTR_FORK;
+ else
+ req->whichfork = XFS_DATA_FORK;
+
+ if (sx->sx_flags & XFS_SWAP_EXT_SET_SIZES)
+ req->req_flags |= XFS_SWAP_REQ_SET_SIZES;
+ if (sx->sx_flags & XFS_SWAP_EXT_SKIP_INO1_HOLES)
+ req->req_flags |= XFS_SWAP_REQ_SKIP_INO1_HOLES;
+ req->req_flags |= XFS_SWAP_REQ_LOGGED;
+
+ xfs_xchg_range_ilock(NULL, ip1, ip2);
+ error = xfs_swapext_estimate(req);
+ xfs_xchg_range_iunlock(ip1, ip2);
+ if (error)
+ goto err_rele2;
+
+ return xfs_swapext_init_intent(req, reflink_state);
+
+err_rele2:
+ xfs_irele(ip2);
+err_rele1:
+ xfs_irele(ip1);
+ return ERR_PTR(error);
+}
+
/* Process a swapext update intent item that was recovered from the log. */
STATIC int
xfs_sxi_item_recover(
- struct xfs_log_item *lip,
- struct list_head *capture_list)
+ struct xfs_log_item *lip,
+ struct list_head *capture_list)
{
- return -EFSCORRUPTED;
+ struct xfs_swapext_req req = { .req_flags = 0 };
+ struct xfs_swapext_intent *sxi;
+ struct xfs_sxi_log_item *sxi_lip = SXI_ITEM(lip);
+ struct xfs_mount *mp = lip->li_log->l_mp;
+ struct xfs_swap_extent *sx = &sxi_lip->sxi_format.sxi_extent;
+ struct xfs_sxd_log_item *sxd_lip = NULL;
+ struct xfs_trans *tp;
+ struct xfs_inode *ip1, *ip2;
+ unsigned int reflink_state;
+ int error = 0;
+
+ if (!xfs_sxi_validate(mp, sxi_lip)) {
+ XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
+ &sxi_lip->sxi_format,
+ sizeof(sxi_lip->sxi_format));
+ return -EFSCORRUPTED;
+ }
+
+ sxi = xfs_sxi_item_recover_intent(mp, sx, &req, &reflink_state);
+ if (IS_ERR(sxi))
+ return PTR_ERR(sxi);
+
+ trace_xfs_swapext_recover(mp, sxi);
+
+ ip1 = sxi->sxi_ip1;
+ ip2 = sxi->sxi_ip2;
+
+ error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, req.resblks, 0, 0,
+ &tp);
+ if (error)
+ goto err_rele;
+
+ sxd_lip = xfs_trans_get_sxd(tp, sxi_lip);
+
+ xfs_xchg_range_ilock(tp, ip1, ip2);
+
+ xfs_swapext_ensure_reflink(tp, sxi, reflink_state);
+ error = xfs_swapext_finish_update(tp, &sxd_lip->sxd_item, sxi);
+ if (error == -EAGAIN) {
+ /*
+ * If there's more extent swapping to be done, we have to
+ * schedule that as a separate deferred operation to be run
+ * after we've finished replaying all of the intents we
+ * recovered from the log. Transfer ownership of the sxi to
+ * the transaction.
+ */
+ xfs_swapext_schedule(tp, sxi);
+ error = 0;
+ sxi = NULL;
+ }
+ if (error == -EFSCORRUPTED)
+ XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, sx,
+ sizeof(*sx));
+ if (error)
+ goto err_cancel;
+
+ /*
+ * Commit transaction, which frees the transaction and saves the inodes
+ * for later replay activities.
+ */
+ error = xfs_defer_ops_capture_and_commit(tp, capture_list);
+ goto err_unlock;
+
+err_cancel:
+ xfs_trans_cancel(tp);
+err_unlock:
+ xfs_xchg_range_iunlock(ip1, ip2);
+err_rele:
+ if (sxi)
+ kmem_cache_free(xfs_swapext_intent_cache, sxi);
+ xfs_irele(ip2);
+ xfs_irele(ip1);
+ return error;
}
STATIC bool
@@ -229,8 +553,21 @@ xfs_sxi_item_relog(
struct xfs_log_item *intent,
struct xfs_trans *tp)
{
- ASSERT(0);
- return NULL;
+ struct xfs_sxd_log_item *sxd_lip;
+ struct xfs_sxi_log_item *sxi_lip;
+ struct xfs_swap_extent *sx;
+
+ sx = &SXI_ITEM(intent)->sxi_format.sxi_extent;
+
+ tp->t_flags |= XFS_TRANS_DIRTY;
+ sxd_lip = xfs_trans_get_sxd(tp, SXI_ITEM(intent));
+ set_bit(XFS_LI_DIRTY, &sxd_lip->sxd_item.li_flags);
+
+ sxi_lip = xfs_sxi_init(tp->t_mountp);
+ memcpy(&sxi_lip->sxi_format.sxi_extent, sx, sizeof(*sx));
+ xfs_trans_add_item(tp, &sxi_lip->sxi_item);
+ set_bit(XFS_LI_DIRTY, &sxi_lip->sxi_item.li_flags);
+ return &sxi_lip->sxi_item;
}
static const struct xfs_item_ops xfs_sxi_item_ops = {
@@ -264,17 +601,17 @@ xlog_recover_sxi_commit_pass2(
sxi_formatp = item->ri_buf[0].i_addr;
- if (sxi_formatp->__pad != 0) {
- XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
- return -EFSCORRUPTED;
- }
-
len = sizeof(struct xfs_sxi_log_format);
if (item->ri_buf[0].i_len != len) {
XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
return -EFSCORRUPTED;
}
+ if (sxi_formatp->__pad != 0) {
+ XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+ return -EFSCORRUPTED;
+ }
+
sxi_lip = xfs_sxi_init(mp);
memcpy(&sxi_lip->sxi_format, sxi_formatp, len);
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index c9a5d8087b63..b43b973f0e10 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -40,6 +40,7 @@
#include "scrub/xfbtree.h"
#include "xfs_btree_mem.h"
#include "xfs_bmap.h"
+#include "xfs_swapext.h"
/*
* We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 15bd6b86b514..9ebaa5ffe504 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -78,6 +78,8 @@ union xfs_btree_ptr;
struct xfs_dqtrx;
struct xfs_icwalk;
struct xfs_bmap_intent;
+struct xfs_swapext_intent;
+struct xfs_swapext_req;
#define XFS_ATTR_FILTER_FLAGS \
{ XFS_ATTR_ROOT, "ROOT" }, \
@@ -2173,7 +2175,7 @@ TRACE_EVENT(xfs_dir2_leafn_moveents,
__entry->count)
);
-#define XFS_SWAPEXT_INODES \
+#define XFS_SWAP_EXT_INODES \
{ 0, "target" }, \
{ 1, "temp" }
@@ -2208,7 +2210,7 @@ DECLARE_EVENT_CLASS(xfs_swap_extent_class,
"broot size %d, forkoff 0x%x",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->ino,
- __print_symbolic(__entry->which, XFS_SWAPEXT_INODES),
+ __print_symbolic(__entry->which, XFS_SWAP_EXT_INODES),
__print_symbolic(__entry->format, XFS_INODE_FORMAT_STR),
__entry->nex,
__entry->broot_size,
@@ -3761,6 +3763,9 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap);
DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap_piece);
DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_rmap_error);
+DEFINE_INODE_IREC_EVENT(xfs_swapext_extent1);
+DEFINE_INODE_IREC_EVENT(xfs_swapext_extent2);
+DEFINE_ITRUNC_EVENT(xfs_swapext_update_inode_size);
/* fsmap traces */
DECLARE_EVENT_CLASS(xfs_fsmap_class,
@@ -4581,6 +4586,212 @@ DEFINE_PERAG_INTENTS_EVENT(xfs_perag_wait_intents);
#endif /* CONFIG_XFS_DRAIN_INTENTS */
+TRACE_EVENT(xfs_swapext_overhead,
+ TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
+ unsigned long long rmapbt_blocks),
+ TP_ARGS(mp, bmbt_blocks, rmapbt_blocks),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(unsigned long long, bmbt_blocks)
+ __field(unsigned long long, rmapbt_blocks)
+ ),
+ TP_fast_assign(
+ __entry->dev = mp->m_super->s_dev;
+ __entry->bmbt_blocks = bmbt_blocks;
+ __entry->rmapbt_blocks = rmapbt_blocks;
+ ),
+ TP_printk("dev %d:%d bmbt_blocks 0x%llx rmapbt_blocks 0x%llx",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->bmbt_blocks,
+ __entry->rmapbt_blocks)
+);
+
+DECLARE_EVENT_CLASS(xfs_swapext_estimate_class,
+ TP_PROTO(const struct xfs_swapext_req *req),
+ TP_ARGS(req),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_ino_t, ino1)
+ __field(xfs_ino_t, ino2)
+ __field(xfs_fileoff_t, startoff1)
+ __field(xfs_fileoff_t, startoff2)
+ __field(xfs_filblks_t, blockcount)
+ __field(int, whichfork)
+ __field(unsigned int, req_flags)
+ __field(xfs_filblks_t, ip1_bcount)
+ __field(xfs_filblks_t, ip2_bcount)
+ __field(xfs_filblks_t, ip1_rtbcount)
+ __field(xfs_filblks_t, ip2_rtbcount)
+ __field(unsigned long long, resblks)
+ __field(unsigned long long, nr_exchanges)
+ ),
+ TP_fast_assign(
+ __entry->dev = req->ip1->i_mount->m_super->s_dev;
+ __entry->ino1 = req->ip1->i_ino;
+ __entry->ino2 = req->ip2->i_ino;
+ __entry->startoff1 = req->startoff1;
+ __entry->startoff2 = req->startoff2;
+ __entry->blockcount = req->blockcount;
+ __entry->whichfork = req->whichfork;
+ __entry->req_flags = req->req_flags;
+ __entry->ip1_bcount = req->ip1_bcount;
+ __entry->ip2_bcount = req->ip2_bcount;
+ __entry->ip1_rtbcount = req->ip1_rtbcount;
+ __entry->ip2_rtbcount = req->ip2_rtbcount;
+ __entry->resblks = req->resblks;
+ __entry->nr_exchanges = req->nr_exchanges;
+ ),
+ TP_printk("dev %d:%d ino1 0x%llx fileoff1 0x%llx ino2 0x%llx fileoff2 0x%llx fsbcount 0x%llx flags (%s) fork %s bcount1 0x%llx rtbcount1 0x%llx bcount2 0x%llx rtbcount2 0x%llx resblks 0x%llx nr_exchanges %llu",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ino1, __entry->startoff1,
+ __entry->ino2, __entry->startoff2,
+ __entry->blockcount,
+ __print_flags(__entry->req_flags, "|", XFS_SWAP_REQ_STRINGS),
+ __print_symbolic(__entry->whichfork, XFS_WHICHFORK_STRINGS),
+ __entry->ip1_bcount,
+ __entry->ip1_rtbcount,
+ __entry->ip2_bcount,
+ __entry->ip2_rtbcount,
+ __entry->resblks,
+ __entry->nr_exchanges)
+);
+
+#define DEFINE_SWAPEXT_ESTIMATE_EVENT(name) \
+DEFINE_EVENT(xfs_swapext_estimate_class, name, \
+ TP_PROTO(const struct xfs_swapext_req *req), \
+ TP_ARGS(req))
+DEFINE_SWAPEXT_ESTIMATE_EVENT(xfs_swapext_initial_estimate);
+DEFINE_SWAPEXT_ESTIMATE_EVENT(xfs_swapext_final_estimate);
+
+DECLARE_EVENT_CLASS(xfs_swapext_intent_class,
+ TP_PROTO(struct xfs_mount *mp, const struct xfs_swapext_intent *sxi),
+ TP_ARGS(mp, sxi),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_ino_t, ino1)
+ __field(xfs_ino_t, ino2)
+ __field(unsigned int, flags)
+ __field(unsigned int, opflags)
+ __field(xfs_fileoff_t, startoff1)
+ __field(xfs_fileoff_t, startoff2)
+ __field(xfs_filblks_t, blockcount)
+ __field(xfs_fsize_t, isize1)
+ __field(xfs_fsize_t, isize2)
+ __field(xfs_fsize_t, new_isize1)
+ __field(xfs_fsize_t, new_isize2)
+ ),
+ TP_fast_assign(
+ __entry->dev = mp->m_super->s_dev;
+ __entry->ino1 = sxi->sxi_ip1->i_ino;
+ __entry->ino2 = sxi->sxi_ip2->i_ino;
+ __entry->flags = sxi->sxi_flags;
+ __entry->opflags = sxi->sxi_op_flags;
+ __entry->startoff1 = sxi->sxi_startoff1;
+ __entry->startoff2 = sxi->sxi_startoff2;
+ __entry->blockcount = sxi->sxi_blockcount;
+ __entry->isize1 = sxi->sxi_ip1->i_disk_size;
+ __entry->isize2 = sxi->sxi_ip2->i_disk_size;
+ __entry->new_isize1 = sxi->sxi_isize1;
+ __entry->new_isize2 = sxi->sxi_isize2;
+ ),
+ TP_printk("dev %d:%d ino1 0x%llx fileoff1 0x%llx ino2 0x%llx fileoff2 0x%llx fsbcount 0x%llx flags (%s) opflags (%s) isize1 0x%llx newisize1 0x%llx isize2 0x%llx newisize2 0x%llx",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ino1, __entry->startoff1,
+ __entry->ino2, __entry->startoff2,
+ __entry->blockcount,
+ __print_flags(__entry->flags, "|", XFS_SWAP_EXT_STRINGS),
+ __print_flags(__entry->opflags, "|", XFS_SWAP_EXT_OP_STRINGS),
+ __entry->isize1, __entry->new_isize1,
+ __entry->isize2, __entry->new_isize2)
+);
+
+#define DEFINE_SWAPEXT_INTENT_EVENT(name) \
+DEFINE_EVENT(xfs_swapext_intent_class, name, \
+ TP_PROTO(struct xfs_mount *mp, const struct xfs_swapext_intent *sxi), \
+ TP_ARGS(mp, sxi))
+DEFINE_SWAPEXT_INTENT_EVENT(xfs_swapext_defer);
+DEFINE_SWAPEXT_INTENT_EVENT(xfs_swapext_recover);
+
+TRACE_EVENT(xfs_swapext_delta_nextents_step,
+ TP_PROTO(struct xfs_mount *mp,
+ const struct xfs_bmbt_irec *left,
+ const struct xfs_bmbt_irec *curr,
+ const struct xfs_bmbt_irec *new,
+ const struct xfs_bmbt_irec *right,
+ int delta, unsigned int state),
+ TP_ARGS(mp, left, curr, new, right, delta, state),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_fileoff_t, loff)
+ __field(xfs_fsblock_t, lstart)
+ __field(xfs_filblks_t, lcount)
+ __field(xfs_fileoff_t, coff)
+ __field(xfs_fsblock_t, cstart)
+ __field(xfs_filblks_t, ccount)
+ __field(xfs_fileoff_t, noff)
+ __field(xfs_fsblock_t, nstart)
+ __field(xfs_filblks_t, ncount)
+ __field(xfs_fileoff_t, roff)
+ __field(xfs_fsblock_t, rstart)
+ __field(xfs_filblks_t, rcount)
+ __field(int, delta)
+ __field(unsigned int, state)
+ ),
+ TP_fast_assign(
+ __entry->dev = mp->m_super->s_dev;
+ __entry->loff = left->br_startoff;
+ __entry->lstart = left->br_startblock;
+ __entry->lcount = left->br_blockcount;
+ __entry->coff = curr->br_startoff;
+ __entry->cstart = curr->br_startblock;
+ __entry->ccount = curr->br_blockcount;
+ __entry->noff = new->br_startoff;
+ __entry->nstart = new->br_startblock;
+ __entry->ncount = new->br_blockcount;
+ __entry->roff = right->br_startoff;
+ __entry->rstart = right->br_startblock;
+ __entry->rcount = right->br_blockcount;
+ __entry->delta = delta;
+ __entry->state = state;
+ ),
+ TP_printk("dev %d:%d left 0x%llx:0x%llx:0x%llx; curr 0x%llx:0x%llx:0x%llx <- new 0x%llx:0x%llx:0x%llx; right 0x%llx:0x%llx:0x%llx delta %d state 0x%x",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->loff, __entry->lstart, __entry->lcount,
+ __entry->coff, __entry->cstart, __entry->ccount,
+ __entry->noff, __entry->nstart, __entry->ncount,
+ __entry->roff, __entry->rstart, __entry->rcount,
+ __entry->delta, __entry->state)
+);
+
+TRACE_EVENT(xfs_swapext_delta_nextents,
+ TP_PROTO(const struct xfs_swapext_req *req, int64_t d_nexts1,
+ int64_t d_nexts2),
+ TP_ARGS(req, d_nexts1, d_nexts2),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_ino_t, ino1)
+ __field(xfs_ino_t, ino2)
+ __field(xfs_extnum_t, nexts1)
+ __field(xfs_extnum_t, nexts2)
+ __field(int64_t, d_nexts1)
+ __field(int64_t, d_nexts2)
+ ),
+ TP_fast_assign(
+ __entry->dev = req->ip1->i_mount->m_super->s_dev;
+ __entry->ino1 = req->ip1->i_ino;
+ __entry->ino2 = req->ip2->i_ino;
+ __entry->nexts1 = xfs_ifork_ptr(req->ip1, req->whichfork)->if_nextents;
+ __entry->nexts2 = xfs_ifork_ptr(req->ip2, req->whichfork)->if_nextents;
+ __entry->d_nexts1 = d_nexts1;
+ __entry->d_nexts2 = d_nexts2;
+ ),
+ TP_printk("dev %d:%d ino1 0x%llx nexts %llu ino2 0x%llx nexts %llu delta1 %lld delta2 %lld",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ino1, __entry->nexts1,
+ __entry->ino2, __entry->nexts2,
+ __entry->d_nexts1, __entry->d_nexts2)
+);
+
#endif /* _TRACE_XFS_H */
#undef TRACE_INCLUDE_PATH
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
new file mode 100644
index 000000000000..0dba5078c9f7
--- /dev/null
+++ b/fs/xfs/xfs_xchgrange.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_swapext.h"
+#include "xfs_xchgrange.h"
+
+/* Lock (and optionally join) two inodes for a file range exchange. */
+void
+xfs_xchg_range_ilock(
+ struct xfs_trans *tp,
+ struct xfs_inode *ip1,
+ struct xfs_inode *ip2)
+{
+ if (ip1 != ip2)
+ xfs_lock_two_inodes(ip1, XFS_ILOCK_EXCL,
+ ip2, XFS_ILOCK_EXCL);
+ else
+ xfs_ilock(ip1, XFS_ILOCK_EXCL);
+ if (tp) {
+ xfs_trans_ijoin(tp, ip1, 0);
+ if (ip2 != ip1)
+ xfs_trans_ijoin(tp, ip2, 0);
+ }
+}
+
+/* Unlock two inodes after a file range exchange operation. */
+void
+xfs_xchg_range_iunlock(
+ struct xfs_inode *ip1,
+ struct xfs_inode *ip2)
+{
+ if (ip2 != ip1)
+ xfs_iunlock(ip2, XFS_ILOCK_EXCL);
+ xfs_iunlock(ip1, XFS_ILOCK_EXCL);
+}
+
+/*
+ * Estimate the resource requirements to exchange file contents between the two
+ * files. The caller is required to hold the IOLOCK and the MMAPLOCK and to
+ * have flushed both inodes' pagecache and active direct-ios.
+ */
+int
+xfs_xchg_range_estimate(
+ struct xfs_swapext_req *req)
+{
+ int error;
+
+ xfs_xchg_range_ilock(NULL, req->ip1, req->ip2);
+ error = xfs_swapext_estimate(req);
+ xfs_xchg_range_iunlock(req->ip1, req->ip2);
+ return error;
+}
diff --git a/fs/xfs/xfs_xchgrange.h b/fs/xfs/xfs_xchgrange.h
new file mode 100644
index 000000000000..89320a354efa
--- /dev/null
+++ b/fs/xfs/xfs_xchgrange.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_XCHGRANGE_H__
+#define __XFS_XCHGRANGE_H__
+
+struct xfs_swapext_req;
+
+void xfs_xchg_range_ilock(struct xfs_trans *tp, struct xfs_inode *ip1,
+ struct xfs_inode *ip2);
+void xfs_xchg_range_iunlock(struct xfs_inode *ip1, struct xfs_inode *ip2);
+
+int xfs_xchg_range_estimate(struct xfs_swapext_req *req);
+
+#endif /* __XFS_XCHGRANGE_H__ */
* [PATCH 08/21] xfs: enable xlog users to toggle atomic extent swapping
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (5 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 09/21] xfs: add a ->xchg_file_range handler Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 11/21] xfs: port xfs_swap_extents_rmap to our new code Darrick J. Wong
` (13 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Plumb the necessary bits into the xlog code so that higher level callers
can enable the atomic extent swapping feature and have it clear
automatically when possible.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_log.c | 13 +++++++++++++
fs/xfs/xfs_log.h | 1 +
fs/xfs/xfs_log_priv.h | 1 +
3 files changed, 15 insertions(+)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index a0ef09addc84..37e85c1bb913 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1501,11 +1501,17 @@ xlog_clear_incompat(
if (down_write_trylock(&log->l_incompat_xattrs))
incompat_mask |= XFS_SB_FEAT_INCOMPAT_LOG_XATTRS;
+ if (down_write_trylock(&log->l_incompat_swapext))
+ incompat_mask |= XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT;
+
if (!incompat_mask)
return;
xfs_clear_incompat_log_features(mp, incompat_mask);
+ if (incompat_mask & XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT)
+ up_write(&log->l_incompat_swapext);
+
if (incompat_mask & XFS_SB_FEAT_INCOMPAT_LOG_XATTRS)
up_write(&log->l_incompat_xattrs);
}
@@ -1625,6 +1631,7 @@ xlog_alloc_log(
log->l_sectBBsize = 1 << log2_size;
init_rwsem(&log->l_incompat_xattrs);
+ init_rwsem(&log->l_incompat_swapext);
xlog_get_iclog_buffer_size(mp, log);
@@ -3922,6 +3929,9 @@ xlog_use_incompat_feat(
case XLOG_INCOMPAT_FEAT_XATTRS:
down_read(&log->l_incompat_xattrs);
break;
+ case XLOG_INCOMPAT_FEAT_SWAPEXT:
+ down_read(&log->l_incompat_swapext);
+ break;
}
}
@@ -3935,5 +3945,8 @@ xlog_drop_incompat_feat(
case XLOG_INCOMPAT_FEAT_XATTRS:
up_read(&log->l_incompat_xattrs);
break;
+ case XLOG_INCOMPAT_FEAT_SWAPEXT:
+ up_read(&log->l_incompat_swapext);
+ break;
}
}
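
Aside for reviewers: feature users take these rwsems shared while log
covering takes them exclusive to clear the incompat bit, so the usage
pattern on the caller side is simply (do_logged_swapext_work being a
hypothetical stand-in for the actual swap machinery):

    xlog_use_incompat_feat(log, XLOG_INCOMPAT_FEAT_SWAPEXT);
    error = do_logged_swapext_work(...);
    xlog_drop_incompat_feat(log, XLOG_INCOMPAT_FEAT_SWAPEXT);

If no reader holds the rwsem at covering time, down_write_trylock()
succeeds and the feature bit is cleared from the superblock.
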
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index d187f6445909..30bdbf8ee25c 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -161,6 +161,7 @@ bool xlog_force_shutdown(struct xlog *log, uint32_t shutdown_flags);
enum xlog_incompat_feat {
XLOG_INCOMPAT_FEAT_XATTRS = XFS_SB_FEAT_INCOMPAT_LOG_XATTRS,
+ XLOG_INCOMPAT_FEAT_SWAPEXT = XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT
};
void xlog_use_incompat_feat(struct xlog *log, enum xlog_incompat_feat what);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index a13b5b6b744d..6cbee6996de5 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -448,6 +448,7 @@ struct xlog {
/* Users of log incompat features should take a read lock. */
struct rw_semaphore l_incompat_xattrs;
+ struct rw_semaphore l_incompat_swapext;
};
/*
* [PATCH 09/21] xfs: add a ->xchg_file_range handler
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (4 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 03/21] xfs: refactor non-power-of-two alignment checks Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 08/21] xfs: enable xlog users to toggle atomic extent swapping Darrick J. Wong
` (14 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Add a function to handle file range exchange requests from the vfs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_bmap_util.c | 1
fs/xfs/xfs_file.c | 70 +++++++++
fs/xfs/xfs_mount.h | 5 +
fs/xfs/xfs_trace.c | 1
fs/xfs/xfs_trace.h | 120 +++++++++++++++
fs/xfs/xfs_xchgrange.c | 375 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_xchgrange.h | 23 +++
7 files changed, 594 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 8621534b749b..d587015aec0e 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -28,6 +28,7 @@
#include "xfs_icache.h"
#include "xfs_iomap.h"
#include "xfs_reflink.h"
+#include "xfs_swapext.h"
/* Kernel only BMAP related definitions and functions */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 78323574021c..b4629c8aa6b7 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -24,6 +24,7 @@
#include "xfs_pnfs.h"
#include "xfs_iomap.h"
#include "xfs_reflink.h"
+#include "xfs_xchgrange.h"
#include <linux/dax.h>
#include <linux/falloc.h>
@@ -1150,6 +1151,74 @@ xfs_file_remap_range(
return remapped > 0 ? remapped : ret;
}
+STATIC int
+xfs_file_xchg_range(
+ struct file *file1,
+ struct file *file2,
+ struct file_xchg_range *fxr)
+{
+ struct inode *inode1 = file_inode(file1);
+ struct inode *inode2 = file_inode(file2);
+ struct xfs_inode *ip1 = XFS_I(inode1);
+ struct xfs_inode *ip2 = XFS_I(inode2);
+ struct xfs_mount *mp = ip1->i_mount;
+ unsigned int priv_flags = 0;
+ bool use_logging = false;
+ int error;
+
+ if (xfs_is_shutdown(mp))
+ return -EIO;
+
+ /* Update cmtime if the fd/inode don't forbid it. */
+ if (likely(!(file1->f_mode & FMODE_NOCMTIME) && !IS_NOCMTIME(inode1)))
+ priv_flags |= XFS_XCHG_RANGE_UPD_CMTIME1;
+ if (likely(!(file2->f_mode & FMODE_NOCMTIME) && !IS_NOCMTIME(inode2)))
+ priv_flags |= XFS_XCHG_RANGE_UPD_CMTIME2;
+
+ /* Lock both files against IO */
+ error = xfs_ilock2_io_mmap(ip1, ip2);
+ if (error)
+ goto out_err;
+
+ /* Prepare and then exchange file contents. */
+ error = xfs_xchg_range_prep(file1, file2, fxr);
+ if (error)
+ goto out_unlock;
+
+ /* Get permission to use log-assisted file content swaps. */
+ error = xfs_xchg_range_grab_log_assist(mp,
+ !(fxr->flags & FILE_XCHG_RANGE_NONATOMIC),
+ &use_logging);
+ if (error)
+ goto out_unlock;
+ if (use_logging)
+ priv_flags |= XFS_XCHG_RANGE_LOGGED;
+
+ error = xfs_xchg_range(ip1, ip2, fxr, priv_flags);
+ if (error)
+ goto out_drop_feat;
+
+ /*
+ * Finish the exchange by removing special file privileges like any
+ * other file write would do. This may involve turning on support for
+ * logged xattrs if either file has security capabilities, which means
+ * xfs_xchg_range_grab_log_assist before xfs_attr_grab_log_assist.
+ */
+ error = generic_xchg_file_range_finish(file1, file2);
+
+out_drop_feat:
+ if (use_logging)
+ xfs_xchg_range_rele_log_assist(mp);
+out_unlock:
+ xfs_iunlock2_io_mmap(ip1, ip2);
+out_err:
+ if (error)
+ trace_xfs_file_xchg_range_error(ip2, error, _RET_IP_);
+ return error;
+}
+
STATIC int
xfs_file_open(
struct inode *inode,
@@ -1439,6 +1508,7 @@ const struct file_operations xfs_file_operations = {
.fallocate = xfs_file_fallocate,
.fadvise = xfs_file_fadvise,
.remap_file_range = xfs_file_remap_range,
+ .xchg_file_range = xfs_file_xchg_range,
};
const struct file_operations xfs_dir_file_operations = {
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 7c48a2b70f6f..3b2601ab954d 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -399,6 +399,8 @@ __XFS_HAS_FEAT(nouuid, NOUUID)
#define XFS_OPSTATE_WARNED_SHRINK 8
/* Kernel has logged a warning about logged xattr updates being used. */
#define XFS_OPSTATE_WARNED_LARP 9
+/* Kernel has logged a warning about extent swapping being used on this fs. */
+#define XFS_OPSTATE_WARNED_SWAPEXT 10
#define __XFS_IS_OPSTATE(name, NAME) \
static inline bool xfs_is_ ## name (struct xfs_mount *mp) \
@@ -438,7 +440,8 @@ xfs_should_warn(struct xfs_mount *mp, long nr)
{ (1UL << XFS_OPSTATE_BLOCKGC_ENABLED), "blockgc" }, \
{ (1UL << XFS_OPSTATE_WARNED_SCRUB), "wscrub" }, \
{ (1UL << XFS_OPSTATE_WARNED_SHRINK), "wshrink" }, \
- { (1UL << XFS_OPSTATE_WARNED_LARP), "wlarp" }
+ { (1UL << XFS_OPSTATE_WARNED_LARP), "wlarp" }, \
+ { (1UL << XFS_OPSTATE_WARNED_SWAPEXT), "wswapext" }
/*
* Max and min values for mount-option defined I/O
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index b43b973f0e10..e38814f4380c 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -41,6 +41,7 @@
#include "xfs_btree_mem.h"
#include "xfs_bmap.h"
#include "xfs_swapext.h"
+#include "xfs_xchgrange.h"
/*
* We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 9ebaa5ffe504..6841f04ee38d 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3763,10 +3763,130 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap);
DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap_piece);
DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_rmap_error);
+
+/* swapext tracepoints */
+DEFINE_INODE_ERROR_EVENT(xfs_file_xchg_range_error);
DEFINE_INODE_IREC_EVENT(xfs_swapext_extent1);
DEFINE_INODE_IREC_EVENT(xfs_swapext_extent2);
DEFINE_ITRUNC_EVENT(xfs_swapext_update_inode_size);
+#define FIEXCHANGE_FLAGS_STRS \
+ { FILE_XCHG_RANGE_NONATOMIC, "NONATOMIC" }, \
+ { FILE_XCHG_RANGE_FILE2_FRESH, "F2_FRESH" }, \
+ { FILE_XCHG_RANGE_FULL_FILES, "FULL" }, \
+ { FILE_XCHG_RANGE_TO_EOF, "TO_EOF" }, \
+ { FILE_XCHG_RANGE_FSYNC, "FSYNC" }, \
+ { FILE_XCHG_RANGE_DRY_RUN, "DRY_RUN" }, \
+ { FILE_XCHG_RANGE_SKIP_FILE1_HOLES, "SKIP_F1_HOLES" }
+
+/* file exchange-range tracepoint class */
+DECLARE_EVENT_CLASS(xfs_xchg_range_class,
+ TP_PROTO(struct xfs_inode *ip1, const struct file_xchg_range *fxr,
+ struct xfs_inode *ip2, unsigned int xchg_flags),
+ TP_ARGS(ip1, fxr, ip2, xchg_flags),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_ino_t, ip1_ino)
+ __field(loff_t, ip1_isize)
+ __field(loff_t, ip1_disize)
+ __field(xfs_ino_t, ip2_ino)
+ __field(loff_t, ip2_isize)
+ __field(loff_t, ip2_disize)
+
+ __field(loff_t, file1_offset)
+ __field(loff_t, file2_offset)
+ __field(unsigned long long, length)
+ __field(unsigned long long, vflags)
+ __field(unsigned int, xflags)
+ ),
+ TP_fast_assign(
+ __entry->dev = VFS_I(ip1)->i_sb->s_dev;
+ __entry->ip1_ino = ip1->i_ino;
+ __entry->ip1_isize = VFS_I(ip1)->i_size;
+ __entry->ip1_disize = ip1->i_disk_size;
+ __entry->ip2_ino = ip2->i_ino;
+ __entry->ip2_isize = VFS_I(ip2)->i_size;
+ __entry->ip2_disize = ip2->i_disk_size;
+
+ __entry->file1_offset = fxr->file1_offset;
+ __entry->file2_offset = fxr->file2_offset;
+ __entry->length = fxr->length;
+ __entry->vflags = fxr->flags;
+ __entry->xflags = xchg_flags;
+ ),
+ TP_printk("dev %d:%d vfs_flags %s xchg_flags %s bytecount 0x%llx "
+ "ino1 0x%llx isize 0x%llx disize 0x%llx pos 0x%llx -> "
+ "ino2 0x%llx isize 0x%llx disize 0x%llx pos 0x%llx",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __print_flags(__entry->vflags, "|", FIEXCHANGE_FLAGS_STRS),
+ __print_flags(__entry->xflags, "|", XCHG_RANGE_FLAGS_STRS),
+ __entry->length,
+ __entry->ip1_ino,
+ __entry->ip1_isize,
+ __entry->ip1_disize,
+ __entry->file1_offset,
+ __entry->ip2_ino,
+ __entry->ip2_isize,
+ __entry->ip2_disize,
+ __entry->file2_offset)
+)
+
+#define DEFINE_XCHG_RANGE_EVENT(name) \
+DEFINE_EVENT(xfs_xchg_range_class, name, \
+ TP_PROTO(struct xfs_inode *ip1, const struct file_xchg_range *fxr, \
+ struct xfs_inode *ip2, unsigned int xchg_flags), \
+ TP_ARGS(ip1, fxr, ip2, xchg_flags))
+DEFINE_XCHG_RANGE_EVENT(xfs_xchg_range_prep);
+DEFINE_XCHG_RANGE_EVENT(xfs_xchg_range_flush);
+DEFINE_XCHG_RANGE_EVENT(xfs_xchg_range);
+
+TRACE_EVENT(xfs_xchg_range_freshness,
+ TP_PROTO(struct xfs_inode *ip2, const struct file_xchg_range *fxr),
+ TP_ARGS(ip2, fxr),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_ino_t, ip2_ino)
+ __field(long long, ip2_mtime)
+ __field(long long, ip2_ctime)
+ __field(int, ip2_mtime_nsec)
+ __field(int, ip2_ctime_nsec)
+
+ __field(xfs_ino_t, file2_ino)
+ __field(long long, file2_mtime)
+ __field(long long, file2_ctime)
+ __field(int, file2_mtime_nsec)
+ __field(int, file2_ctime_nsec)
+ ),
+ TP_fast_assign(
+ __entry->dev = VFS_I(ip2)->i_sb->s_dev;
+ __entry->ip2_ino = ip2->i_ino;
+ __entry->ip2_mtime = VFS_I(ip2)->i_mtime.tv_sec;
+ __entry->ip2_ctime = VFS_I(ip2)->i_ctime.tv_sec;
+ __entry->ip2_mtime_nsec = VFS_I(ip2)->i_mtime.tv_nsec;
+ __entry->ip2_ctime_nsec = VFS_I(ip2)->i_ctime.tv_nsec;
+
+ __entry->file2_ino = fxr->file2_ino;
+ __entry->file2_mtime = fxr->file2_mtime;
+ __entry->file2_ctime = fxr->file2_ctime;
+ __entry->file2_mtime_nsec = fxr->file2_mtime_nsec;
+ __entry->file2_ctime_nsec = fxr->file2_ctime_nsec;
+ ),
+ TP_printk("dev %d:%d "
+ "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
+ "file 0x%llx mtime %lld:%d ctime %lld:%d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ip2_ino,
+ __entry->ip2_mtime,
+ __entry->ip2_mtime_nsec,
+ __entry->ip2_ctime,
+ __entry->ip2_ctime_nsec,
+ __entry->file2_ino,
+ __entry->file2_mtime,
+ __entry->file2_mtime_nsec,
+ __entry->file2_ctime,
+ __entry->file2_ctime_nsec)
+);
+
/* fsmap traces */
DECLARE_EVENT_CLASS(xfs_fsmap_class,
TP_PROTO(struct xfs_mount *mp, u32 keydev, xfs_agnumber_t agno,
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
index 0dba5078c9f7..9966938134c0 100644
--- a/fs/xfs/xfs_xchgrange.c
+++ b/fs/xfs/xfs_xchgrange.c
@@ -13,8 +13,15 @@
#include "xfs_defer.h"
#include "xfs_inode.h"
#include "xfs_trans.h"
+#include "xfs_quota.h"
+#include "xfs_bmap_util.h"
+#include "xfs_reflink.h"
+#include "xfs_trace.h"
#include "xfs_swapext.h"
#include "xfs_xchgrange.h"
+#include "xfs_sb.h"
+#include "xfs_icache.h"
+#include "xfs_log.h"
/* Lock (and optionally join) two inodes for a file range exchange. */
void
@@ -63,3 +70,371 @@ xfs_xchg_range_estimate(
xfs_xchg_range_iunlock(req->ip1, req->ip2);
return error;
}
+
+/* Prepare two files to have their data exchanged. */
+int
+xfs_xchg_range_prep(
+ struct file *file1,
+ struct file *file2,
+ struct file_xchg_range *fxr)
+{
+ struct xfs_inode *ip1 = XFS_I(file_inode(file1));
+ struct xfs_inode *ip2 = XFS_I(file_inode(file2));
+ int error;
+
+ trace_xfs_xchg_range_prep(ip1, fxr, ip2, 0);
+
+ /* Verify both files are either real-time or non-realtime */
+ if (XFS_IS_REALTIME_INODE(ip1) != XFS_IS_REALTIME_INODE(ip2))
+ return -EINVAL;
+
+ /*
+ * The alignment checks in the VFS helpers cannot deal with allocation
+ * units that are not powers of 2. This can happen with the realtime
+ * volume if the extent size is set. Note that alignment checks are
+ * skipped if FULL_FILES is set.
+ */
+ if (!(fxr->flags & FILE_XCHG_RANGE_FULL_FILES) &&
+ !is_power_of_2(xfs_inode_alloc_unitsize(ip2)))
+ return -EOPNOTSUPP;
+
+ error = generic_xchg_file_range_prep(file1, file2, fxr,
+ xfs_inode_alloc_unitsize(ip2));
+ if (error || fxr->length == 0)
+ return error;
+
+ /* Attach dquots to both inodes before changing block maps. */
+ error = xfs_qm_dqattach(ip2);
+ if (error)
+ return error;
+ error = xfs_qm_dqattach(ip1);
+ if (error)
+ return error;
+
+ trace_xfs_xchg_range_flush(ip1, fxr, ip2, 0);
+
+ /* Flush the relevant ranges of both files. */
+ error = xfs_flush_unmap_range(ip2, fxr->file2_offset, fxr->length);
+ if (error)
+ return error;
+ error = xfs_flush_unmap_range(ip1, fxr->file1_offset, fxr->length);
+ if (error)
+ return error;
+
+ /*
+ * Cancel CoW fork preallocations for the ranges of both files. The
+ * prep function should have flushed all the dirty data, so the only
+ * extents remaining should be speculative.
+ */
+ if (xfs_inode_has_cow_data(ip1)) {
+ error = xfs_reflink_cancel_cow_range(ip1, fxr->file1_offset,
+ fxr->length, true);
+ if (error)
+ return error;
+ }
+
+ if (xfs_inode_has_cow_data(ip2)) {
+ error = xfs_reflink_cancel_cow_range(ip2, fxr->file2_offset,
+ fxr->length, true);
+ if (error)
+ return error;
+ }
+
+ return 0;
+}
+
+#define QRETRY_IP1 (0x1)
+#define QRETRY_IP2 (0x2)
+
+/*
+ * Obtain a quota reservation to make sure we don't hit EDQUOT. We can skip
+ * this if quota enforcement is disabled or if both inodes' dquots are the
+ * same. The qretry structure must be initialized to zeroes before the first
+ * call to this function.
+ */
+STATIC int
+xfs_xchg_range_reserve_quota(
+ struct xfs_trans *tp,
+ const struct xfs_swapext_req *req,
+ unsigned int *qretry)
+{
+ int64_t ddelta, rdelta;
+ int ip1_error = 0;
+ int error;
+
+ /*
+ * Don't bother with a quota reservation if we're not enforcing them
+ * or the two inodes have the same dquots.
+ */
+ if (!XFS_IS_QUOTA_ON(tp->t_mountp) || req->ip1 == req->ip2 ||
+ (req->ip1->i_udquot == req->ip2->i_udquot &&
+ req->ip1->i_gdquot == req->ip2->i_gdquot &&
+ req->ip1->i_pdquot == req->ip2->i_pdquot))
+ return 0;
+
+ *qretry = 0;
+
+ /*
+ * For each file, compute the net gain in the number of regular blocks
+ * that will be mapped into that file and reserve that much quota. The
+ * quota counts must be able to absorb at least that much space.
+ */
+ ddelta = req->ip2_bcount - req->ip1_bcount;
+ rdelta = req->ip2_rtbcount - req->ip1_rtbcount;
+ if (ddelta > 0 || rdelta > 0) {
+ error = xfs_trans_reserve_quota_nblks(tp, req->ip1,
+ ddelta > 0 ? ddelta : 0,
+ rdelta > 0 ? rdelta : 0,
+ false);
+ if (error == -EDQUOT || error == -ENOSPC) {
+ /*
+ * Save this error and see what happens if we try to
+ * reserve quota for ip2. Then report both.
+ */
+ *qretry |= QRETRY_IP1;
+ ip1_error = error;
+ error = 0;
+ }
+ if (error)
+ return error;
+ }
+ if (ddelta < 0 || rdelta < 0) {
+ error = xfs_trans_reserve_quota_nblks(tp, req->ip2,
+ ddelta < 0 ? -ddelta : 0,
+ rdelta < 0 ? -rdelta : 0,
+ false);
+ if (error == -EDQUOT || error == -ENOSPC)
+ *qretry |= QRETRY_IP2;
+ if (error)
+ return error;
+ }
+ if (ip1_error)
+ return ip1_error;
+
+ /*
+ * For each file, forcibly reserve the gross gain in mapped blocks so
+ * that we don't trip over any quota block reservation assertions.
+ * We must reserve the gross gain because the quota code subtracts from
+ * bcount the number of blocks that we unmap; it does not add that
+ * quantity back to the quota block reservation.
+ */
+ error = xfs_trans_reserve_quota_nblks(tp, req->ip1, req->ip1_bcount,
+ req->ip1_rtbcount, true);
+ if (error)
+ return error;
+
+ return xfs_trans_reserve_quota_nblks(tp, req->ip2, req->ip2_bcount,
+ req->ip2_rtbcount, true);
+}
+
+/*
+ * Get permission to use log-assisted atomic exchange of file extents.
+ *
+ * Callers must hold the IOLOCK and MMAPLOCK of both files. They must not be
+ * running any transactions or holding any ILOCKs. If @use_logging is set after a
+ * successful return, callers must call xfs_xchg_range_rele_log_assist after
+ * the exchange is completed.
+ */
+int
+xfs_xchg_range_grab_log_assist(
+ struct xfs_mount *mp,
+ bool force,
+ bool *use_logging)
+{
+ int error = 0;
+
+ /*
+ * Protect ourselves from an idle log clearing the atomic swapext
+ * log incompat feature bit.
+ */
+ xlog_use_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_SWAPEXT);
+ *use_logging = true;
+
+ /*
+ * If log-assisted swapping is already enabled, the caller can use the
+ * log assisted swap functions with the log-incompat reference we got.
+ */
+ if (xfs_sb_version_haslogswapext(&mp->m_sb))
+ return 0;
+
+ /*
+ * If the caller doesn't /require/ log-assisted swapping, drop the
+ * log-incompat feature protection and exit. The caller cannot use
+ * log assisted swapping.
+ */
+ if (!force)
+ goto drop_incompat;
+
+ /*
+ * Caller requires log-assisted swapping but the fs feature set isn't
+ * rich enough to support it. Bail out.
+ */
+ if (!xfs_swapext_supported(mp)) {
+ error = -EOPNOTSUPP;
+ goto drop_incompat;
+ }
+
+ error = xfs_add_incompat_log_feature(mp,
+ XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT);
+ if (error)
+ goto drop_incompat;
+
+ xfs_warn_mount(mp, XFS_OPSTATE_WARNED_SWAPEXT,
+ "EXPERIMENTAL atomic file range swap feature in use. Use at your own risk!");
+
+ return 0;
+drop_incompat:
+ xlog_drop_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_SWAPEXT);
+ *use_logging = false;
+ return error;
+}
+
+/* Release permission to use log-assisted extent swapping. */
+void
+xfs_xchg_range_rele_log_assist(
+ struct xfs_mount *mp)
+{
+ xlog_drop_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_SWAPEXT);
+}
+
+/* Exchange the contents of two files. */
+int
+xfs_xchg_range(
+ struct xfs_inode *ip1,
+ struct xfs_inode *ip2,
+ const struct file_xchg_range *fxr,
+ unsigned int xchg_flags)
+{
+ struct xfs_mount *mp = ip1->i_mount;
+ struct xfs_swapext_req req = {
+ .ip1 = ip1,
+ .ip2 = ip2,
+ .whichfork = XFS_DATA_FORK,
+ .startoff1 = XFS_B_TO_FSBT(mp, fxr->file1_offset),
+ .startoff2 = XFS_B_TO_FSBT(mp, fxr->file2_offset),
+ .blockcount = XFS_B_TO_FSB(mp, fxr->length),
+ };
+ struct xfs_trans *tp;
+ unsigned int qretry;
+ bool retried = false;
+ int error;
+
+ trace_xfs_xchg_range(ip1, fxr, ip2, xchg_flags);
+
+ /*
+ * This function only supports using log intent items (SXI items if
+ * atomic exchange is required, or BUI items if not) to exchange file
+ * data. The legacy whole-fork swap will be ported in a later patch.
+ */
+ if (!(xchg_flags & XFS_XCHG_RANGE_LOGGED) && !xfs_swapext_supported(mp))
+ return -EOPNOTSUPP;
+
+ if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
+ req.req_flags |= XFS_SWAP_REQ_SET_SIZES;
+ if (fxr->flags & FILE_XCHG_RANGE_SKIP_FILE1_HOLES)
+ req.req_flags |= XFS_SWAP_REQ_SKIP_INO1_HOLES;
+ if (xchg_flags & XFS_XCHG_RANGE_LOGGED)
+ req.req_flags |= XFS_SWAP_REQ_LOGGED;
+
+ error = xfs_xchg_range_estimate(&req);
+ if (error)
+ return error;
+
+retry:
+ /* Allocate the transaction, lock the inodes, and join them. */
+ error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, req.resblks, 0,
+ XFS_TRANS_RES_FDBLKS, &tp);
+ if (error)
+ return error;
+
+ xfs_xchg_range_ilock(tp, ip1, ip2);
+
+ trace_xfs_swap_extent_before(ip2, 0);
+ trace_xfs_swap_extent_before(ip1, 1);
+
+ if (fxr->flags & FILE_XCHG_RANGE_FILE2_FRESH)
+ trace_xfs_xchg_range_freshness(ip2, fxr);
+
+ /*
+ * Now that we've excluded all other inode metadata changes by taking
+ * the ILOCK, repeat the freshness check.
+ */
+ error = generic_xchg_file_range_check_fresh(VFS_I(ip2), fxr);
+ if (error)
+ goto out_trans_cancel;
+
+ error = xfs_swapext_check_extents(mp, &req);
+ if (error)
+ goto out_trans_cancel;
+
+ /*
+ * Reserve ourselves some quota if any quota type is in enforcing mode.
+ * In theory we only need enough to satisfy the change in the number
+ * of blocks between the two ranges being remapped.
+ */
+ error = xfs_xchg_range_reserve_quota(tp, &req, &qretry);
+ if ((error == -EDQUOT || error == -ENOSPC) && !retried) {
+ xfs_trans_cancel(tp);
+ xfs_xchg_range_iunlock(ip1, ip2);
+ if (qretry & QRETRY_IP1)
+ xfs_blockgc_free_quota(ip1, 0);
+ if (qretry & QRETRY_IP2)
+ xfs_blockgc_free_quota(ip2, 0);
+ retried = true;
+ goto retry;
+ }
+ if (error)
+ goto out_trans_cancel;
+
+ /* If we got this far on a dry run, all parameters are ok. */
+ if (fxr->flags & FILE_XCHG_RANGE_DRY_RUN)
+ goto out_trans_cancel;
+
+ /* Update the mtime and ctime of both files. */
+ if (xchg_flags & XFS_XCHG_RANGE_UPD_CMTIME1)
+ xfs_trans_ichgtime(tp, ip1,
+ XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
+ if (xchg_flags & XFS_XCHG_RANGE_UPD_CMTIME2)
+ xfs_trans_ichgtime(tp, ip2,
+ XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
+
+ xfs_swapext(tp, &req);
+
+ /*
+ * Force the log to persist metadata updates if the caller or the
+ * administrator requires this. The VFS prep function already flushed
+ * the relevant parts of the page cache.
+ */
+ if (xfs_has_wsync(mp) || (fxr->flags & FILE_XCHG_RANGE_FSYNC))
+ xfs_trans_set_sync(tp);
+
+ error = xfs_trans_commit(tp);
+
+ trace_xfs_swap_extent_after(ip2, 0);
+ trace_xfs_swap_extent_after(ip1, 1);
+
+ if (error)
+ goto out_unlock;
+
+ /*
+ * If the caller wanted us to exchange the contents of two complete
+ * files of unequal length, exchange the incore sizes now. This should
+ * be safe because we flushed both files' page caches, moved all the
+ * extents, and updated the ondisk sizes.
+ */
+ if (fxr->flags & FILE_XCHG_RANGE_TO_EOF) {
+ loff_t temp;
+
+ temp = i_size_read(VFS_I(ip2));
+ i_size_write(VFS_I(ip2), i_size_read(VFS_I(ip1)));
+ i_size_write(VFS_I(ip1), temp);
+ }
+
+out_unlock:
+ xfs_xchg_range_iunlock(ip1, ip2);
+ return error;
+
+out_trans_cancel:
+ xfs_trans_cancel(tp);
+ goto out_unlock;
+}
diff --git a/fs/xfs/xfs_xchgrange.h b/fs/xfs/xfs_xchgrange.h
index 89320a354efa..a0e64408784a 100644
--- a/fs/xfs/xfs_xchgrange.h
+++ b/fs/xfs/xfs_xchgrange.h
@@ -14,4 +14,27 @@ void xfs_xchg_range_iunlock(struct xfs_inode *ip1, struct xfs_inode *ip2);
int xfs_xchg_range_estimate(struct xfs_swapext_req *req);
+int xfs_xchg_range_grab_log_assist(struct xfs_mount *mp, bool force,
+ bool *use_logging);
+void xfs_xchg_range_rele_log_assist(struct xfs_mount *mp);
+
+/* Caller has permission to use log intent items for the exchange operation. */
+#define XFS_XCHG_RANGE_LOGGED (1U << 0)
+
+/* Update ip1's change and mod time. */
+#define XFS_XCHG_RANGE_UPD_CMTIME1 (1U << 1)
+
+/* Update ip2's change and mod time. */
+#define XFS_XCHG_RANGE_UPD_CMTIME2 (1U << 2)
+
+#define XCHG_RANGE_FLAGS_STRS \
+ { XFS_XCHG_RANGE_LOGGED, "LOGGED" }, \
+ { XFS_XCHG_RANGE_UPD_CMTIME1, "UPD_CMTIME1" }, \
+ { XFS_XCHG_RANGE_UPD_CMTIME2, "UPD_CMTIME2" }
+
+int xfs_xchg_range(struct xfs_inode *ip1, struct xfs_inode *ip2,
+ const struct file_xchg_range *fxr, unsigned int xchg_flags);
+int xfs_xchg_range_prep(struct file *file1, struct file *file2,
+ struct file_xchg_range *fxr);
+
#endif /* __XFS_XCHGRANGE_H__ */
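For reviewers: here's a minimal sketch of how a caller is expected to
pair the log-assist helpers with xfs_xchg_range(), following only the
contract stated in the comment on xfs_xchg_range_grab_log_assist above
(IOLOCK and MMAPLOCK held, no transactions or ILOCKs). The real call
site arrives in a later patch, so the function below is hypothetical:

STATIC int
xfs_xchg_range_caller_sketch(
	struct xfs_inode		*ip1,
	struct xfs_inode		*ip2,
	const struct file_xchg_range	*fxr)
{
	struct xfs_mount	*mp = ip1->i_mount;
	unsigned int		xchg_flags = 0;
	bool			use_logging = false;
	int			error;

	/* Atomic operation is required unless userspace said NONATOMIC. */
	error = xfs_xchg_range_grab_log_assist(mp,
			!(fxr->flags & FILE_XCHG_RANGE_NONATOMIC),
			&use_logging);
	if (error)
		return error;

	/* We hold a log-incompat reference, so log intents are allowed. */
	if (use_logging)
		xchg_flags |= XFS_XCHG_RANGE_LOGGED;

	error = xfs_xchg_range(ip1, ip2, fxr, xchg_flags);

	if (use_logging)
		xfs_xchg_range_rele_log_assist(mp);
	return error;
}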
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 10/21] xfs: add error injection to test swapext recovery
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (8 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 07/21] xfs: create deferred log items for extent swapping Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 06/21] xfs: introduce a swap-extent log intent item Darrick J. Wong
` (10 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Add an errortag so that we can test recovery of swapext log items.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_errortag.h | 4 +++-
fs/xfs/libxfs/xfs_swapext.c | 3 +++
fs/xfs/xfs_error.c | 3 +++
3 files changed, 9 insertions(+), 1 deletion(-)
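As a testing aid, here is a sketch of how a harness might arm the new
knob from userspace before running a swap, assuming the usual
/sys/fs/xfs/<device>/errortag/ layout; the device name is supplied by
the caller and error handling is minimal:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Arm swapext_finish_one so that every invocation fails and log
 * recovery must complete the swap.  The written value is the "1 in N"
 * trip frequency, so 1 means trip every time.
 */
static int arm_swapext_errortag(const char *dev)
{
	char	path[256];
	int	fd, ret;

	snprintf(path, sizeof(path),
			"/sys/fs/xfs/%s/errortag/swapext_finish_one", dev);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ret = write(fd, "1", 1) == 1 ? 0 : -1;
	close(fd);
	return ret;
}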
diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 01a9e86b3037..263d62a8d70f 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -63,7 +63,8 @@
#define XFS_ERRTAG_ATTR_LEAF_TO_NODE 41
#define XFS_ERRTAG_WB_DELAY_MS 42
#define XFS_ERRTAG_WRITE_DELAY_MS 43
-#define XFS_ERRTAG_MAX 44
+#define XFS_ERRTAG_SWAPEXT_FINISH_ONE 44
+#define XFS_ERRTAG_MAX 45
/*
* Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -111,5 +112,6 @@
#define XFS_RANDOM_ATTR_LEAF_TO_NODE 1
#define XFS_RANDOM_WB_DELAY_MS 3000
#define XFS_RANDOM_WRITE_DELAY_MS 3000
+#define XFS_RANDOM_SWAPEXT_FINISH_ONE 1
#endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 0bc758c5cf5c..227a08ac5d4b 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -426,6 +426,9 @@ xfs_swapext_finish_one(
return error;
}
+ if (XFS_TEST_ERROR(false, tp->t_mountp, XFS_ERRTAG_SWAPEXT_FINISH_ONE))
+ return -EIO;
+
/* If we still have work to do, ask for a new transaction. */
if (sxi_has_more_swap_work(sxi) || sxi_has_postop_work(sxi)) {
trace_xfs_swapext_defer(tp->t_mountp, sxi);
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index ae082808cfed..4b57a809ced5 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -62,6 +62,7 @@ static unsigned int xfs_errortag_random_default[] = {
XFS_RANDOM_ATTR_LEAF_TO_NODE,
XFS_RANDOM_WB_DELAY_MS,
XFS_RANDOM_WRITE_DELAY_MS,
+ XFS_RANDOM_SWAPEXT_FINISH_ONE,
};
struct xfs_errortag_attr {
@@ -179,6 +180,7 @@ XFS_ERRORTAG_ATTR_RW(da_leaf_split, XFS_ERRTAG_DA_LEAF_SPLIT);
XFS_ERRORTAG_ATTR_RW(attr_leaf_to_node, XFS_ERRTAG_ATTR_LEAF_TO_NODE);
XFS_ERRORTAG_ATTR_RW(wb_delay_ms, XFS_ERRTAG_WB_DELAY_MS);
XFS_ERRORTAG_ATTR_RW(write_delay_ms, XFS_ERRTAG_WRITE_DELAY_MS);
+XFS_ERRORTAG_ATTR_RW(swapext_finish_one, XFS_ERRTAG_SWAPEXT_FINISH_ONE);
static struct attribute *xfs_errortag_attrs[] = {
XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -224,6 +226,7 @@ static struct attribute *xfs_errortag_attrs[] = {
XFS_ERRORTAG_ATTR_LIST(attr_leaf_to_node),
XFS_ERRORTAG_ATTR_LIST(wb_delay_ms),
XFS_ERRORTAG_ATTR_LIST(write_delay_ms),
+ XFS_ERRORTAG_ATTR_LIST(swapext_finish_one),
NULL,
};
ATTRIBUTE_GROUPS(xfs_errortag);
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 11/21] xfs: port xfs_swap_extents_rmap to our new code
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (6 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 08/21] xfs: enable xlog users to toggle atomic extent swapping Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 07/21] xfs: create deferred log items for extent swapping Darrick J. Wong
` (12 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
The inner loop of xfs_swap_extent_rmap does the same work as
xfs_swapext_finish_one, so adapt the former to use the latter. Doing so
has the side benefit that the older code path no longer wastes its time
remapping shared extents.
This forms the basis of the non-atomic xfs_swap_range implementation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_bmap_util.c | 151 +++++-------------------------------------------
fs/xfs/xfs_trace.h | 5 --
2 files changed, 16 insertions(+), 140 deletions(-)
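For reference, each pass of the old inner loop (removed below) boils
down to four deferred bmap operations plus a transaction roll, and
xfs_swapext_finish_one performs the equivalent exchange with log intent
coverage.  A condensed sketch of one pass, using the irec/uirec names
from the old code for the trimmed mappings of the two files:

	/* Remove the mapping from each file... */
	xfs_bmap_unmap_extent(tp, tip, XFS_DATA_FORK, &uirec);
	xfs_bmap_unmap_extent(tp, ip, XFS_DATA_FORK, &irec);

	/* ...then map each extent into the other file. */
	xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, &uirec);
	xfs_bmap_map_extent(tp, tip, XFS_DATA_FORK, &irec);

	error = xfs_defer_finish(&tp);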
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index d587015aec0e..4d4696bf9b08 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1360,138 +1360,6 @@ xfs_swap_extent_flush(
return 0;
}
-/*
- * Move extents from one file to another, when rmap is enabled.
- */
-STATIC int
-xfs_swap_extent_rmap(
- struct xfs_trans **tpp,
- struct xfs_inode *ip,
- struct xfs_inode *tip)
-{
- struct xfs_trans *tp = *tpp;
- struct xfs_bmbt_irec irec;
- struct xfs_bmbt_irec uirec;
- struct xfs_bmbt_irec tirec;
- xfs_fileoff_t offset_fsb;
- xfs_fileoff_t end_fsb;
- xfs_filblks_t count_fsb;
- int error;
- xfs_filblks_t ilen;
- xfs_filblks_t rlen;
- int nimaps;
- uint64_t tip_flags2;
-
- /*
- * If the source file has shared blocks, we must flag the donor
- * file as having shared blocks so that we get the shared-block
- * rmap functions when we go to fix up the rmaps. The flags
- * will be switch for reals later.
- */
- tip_flags2 = tip->i_diflags2;
- if (ip->i_diflags2 & XFS_DIFLAG2_REFLINK)
- tip->i_diflags2 |= XFS_DIFLAG2_REFLINK;
-
- offset_fsb = 0;
- end_fsb = XFS_B_TO_FSB(ip->i_mount, i_size_read(VFS_I(ip)));
- count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
-
- while (count_fsb) {
- /* Read extent from the donor file */
- nimaps = 1;
- error = xfs_bmapi_read(tip, offset_fsb, count_fsb, &tirec,
- &nimaps, 0);
- if (error)
- goto out;
- ASSERT(nimaps == 1);
- ASSERT(tirec.br_startblock != DELAYSTARTBLOCK);
-
- trace_xfs_swap_extent_rmap_remap(tip, &tirec);
- ilen = tirec.br_blockcount;
-
- /* Unmap the old blocks in the source file. */
- while (tirec.br_blockcount) {
- ASSERT(tp->t_firstblock == NULLFSBLOCK);
- trace_xfs_swap_extent_rmap_remap_piece(tip, &tirec);
-
- /* Read extent from the source file */
- nimaps = 1;
- error = xfs_bmapi_read(ip, tirec.br_startoff,
- tirec.br_blockcount, &irec,
- &nimaps, 0);
- if (error)
- goto out;
- ASSERT(nimaps == 1);
- ASSERT(tirec.br_startoff == irec.br_startoff);
- trace_xfs_swap_extent_rmap_remap_piece(ip, &irec);
-
- /* Trim the extent. */
- uirec = tirec;
- uirec.br_blockcount = rlen = min_t(xfs_filblks_t,
- tirec.br_blockcount,
- irec.br_blockcount);
- trace_xfs_swap_extent_rmap_remap_piece(tip, &uirec);
-
- if (xfs_bmap_is_real_extent(&uirec)) {
- error = xfs_iext_count_may_overflow(ip,
- XFS_DATA_FORK,
- XFS_IEXT_SWAP_RMAP_CNT);
- if (error == -EFBIG)
- error = xfs_iext_count_upgrade(tp, ip,
- XFS_IEXT_SWAP_RMAP_CNT);
- if (error)
- goto out;
- }
-
- if (xfs_bmap_is_real_extent(&irec)) {
- error = xfs_iext_count_may_overflow(tip,
- XFS_DATA_FORK,
- XFS_IEXT_SWAP_RMAP_CNT);
- if (error == -EFBIG)
- error = xfs_iext_count_upgrade(tp, ip,
- XFS_IEXT_SWAP_RMAP_CNT);
- if (error)
- goto out;
- }
-
- /* Remove the mapping from the donor file. */
- xfs_bmap_unmap_extent(tp, tip, XFS_DATA_FORK, &uirec);
-
- /* Remove the mapping from the source file. */
- xfs_bmap_unmap_extent(tp, ip, XFS_DATA_FORK, &irec);
-
- /* Map the donor file's blocks into the source file. */
- xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, &uirec);
-
- /* Map the source file's blocks into the donor file. */
- xfs_bmap_map_extent(tp, tip, XFS_DATA_FORK, &irec);
-
- error = xfs_defer_finish(tpp);
- tp = *tpp;
- if (error)
- goto out;
-
- tirec.br_startoff += rlen;
- if (tirec.br_startblock != HOLESTARTBLOCK &&
- tirec.br_startblock != DELAYSTARTBLOCK)
- tirec.br_startblock += rlen;
- tirec.br_blockcount -= rlen;
- }
-
- /* Roll on... */
- count_fsb -= ilen;
- offset_fsb += ilen;
- }
-
- tip->i_diflags2 = tip_flags2;
- return 0;
-
-out:
- trace_xfs_swap_extent_rmap_error(ip, error, _RET_IP_);
- tip->i_diflags2 = tip_flags2;
- return error;
-}
-
/* Swap the extents of two files by swapping data forks. */
STATIC int
xfs_swap_extent_forks(
@@ -1775,13 +1643,24 @@ xfs_swap_extents(
src_log_flags = XFS_ILOG_CORE;
target_log_flags = XFS_ILOG_CORE;
- if (xfs_has_rmapbt(mp))
- error = xfs_swap_extent_rmap(&tp, ip, tip);
- else
+ if (xfs_has_rmapbt(mp)) {
+ struct xfs_swapext_req req = {
+ .ip1 = tip,
+ .ip2 = ip,
+ .whichfork = XFS_DATA_FORK,
+ .blockcount = XFS_B_TO_FSB(ip->i_mount,
+ i_size_read(VFS_I(ip))),
+ };
+
+ xfs_swapext(tp, &req);
+ error = xfs_defer_finish(&tp);
+ } else
error = xfs_swap_extent_forks(tp, ip, tip, &src_log_flags,
&target_log_flags);
- if (error)
+ if (error) {
+ trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
goto out_trans_cancel;
+ }
/* Do we have to swap reflink flags? */
if ((ip->i_diflags2 & XFS_DIFLAG2_REFLINK) ^
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 6841f04ee38d..b0ced76af3b9 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3759,13 +3759,10 @@ DEFINE_INODE_ERROR_EVENT(xfs_reflink_end_cow_error);
DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
-/* rmap swapext tracepoints */
-DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap);
-DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap_piece);
-DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_rmap_error);
/* swapext tracepoints */
DEFINE_INODE_ERROR_EVENT(xfs_file_xchg_range_error);
+DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_error);
DEFINE_INODE_IREC_EVENT(xfs_swapext_extent1);
DEFINE_INODE_IREC_EVENT(xfs_swapext_extent2);
DEFINE_ITRUNC_EVENT(xfs_swapext_update_inode_size);
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 12/21] xfs: consolidate all of the xfs_swap_extent_forks code
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (16 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 14/21] xfs: allow xfs_swap_range to use older extent swap algorithms Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 19/21] xfs: make atomic extent swapping support realtime files Darrick J. Wong
` (2 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Now that we've moved the old swapext code to use the new log-assisted
extent swap code on rmap filesystems, let's start porting the old
implementation to the new ioctl interface so that we can later reroute
the old interface through the new code.
Consolidate the reflink flag swap code and the bmbt owner change scan
code in xfs_swap_extent_forks, since both interfaces are going to need
them.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_bmap_util.c | 220 ++++++++++++++++++++++++------------------------
1 file changed, 108 insertions(+), 112 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 4d4696bf9b08..dbd95d86addb 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1360,19 +1360,61 @@ xfs_swap_extent_flush(
return 0;
}
+/*
+ * Fix up the owners of the bmbt blocks to refer to the current inode. The
+ * change owner scan attempts to order all modified buffers in the current
+ * transaction. In the event of ordered buffer failure, the offending buffer is
+ * physically logged as a fallback and the scan returns -EAGAIN. We must roll
+ * the transaction in this case to replenish the fallback log reservation and
+ * restart the scan. This process repeats until the scan completes.
+ */
+static int
+xfs_swap_change_owner(
+ struct xfs_trans **tpp,
+ struct xfs_inode *ip,
+ struct xfs_inode *tmpip)
+{
+ int error;
+ struct xfs_trans *tp = *tpp;
+
+ do {
+ error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, ip->i_ino,
+ NULL);
+ /* success or fatal error */
+ if (error != -EAGAIN)
+ break;
+
+ error = xfs_trans_roll(tpp);
+ if (error)
+ break;
+ tp = *tpp;
+
+ /*
+ * Redirty both inodes so they can relog and keep the log tail
+ * moving forward.
+ */
+ xfs_trans_ijoin(tp, ip, 0);
+ xfs_trans_ijoin(tp, tmpip, 0);
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ xfs_trans_log_inode(tp, tmpip, XFS_ILOG_CORE);
+ } while (true);
+
+ return error;
+}
+
/* Swap the extents of two files by swapping data forks. */
STATIC int
xfs_swap_extent_forks(
- struct xfs_trans *tp,
+ struct xfs_trans **tpp,
struct xfs_inode *ip,
- struct xfs_inode *tip,
- int *src_log_flags,
- int *target_log_flags)
+ struct xfs_inode *tip)
{
xfs_filblks_t aforkblks = 0;
xfs_filblks_t taforkblks = 0;
xfs_extnum_t junk;
uint64_t tmp;
+ int src_log_flags = XFS_ILOG_CORE;
+ int target_log_flags = XFS_ILOG_CORE;
int error;
/*
@@ -1380,14 +1422,14 @@ xfs_swap_extent_forks(
*/
if (xfs_inode_has_attr_fork(ip) && ip->i_af.if_nextents > 0 &&
ip->i_af.if_format != XFS_DINODE_FMT_LOCAL) {
- error = xfs_bmap_count_blocks(tp, ip, XFS_ATTR_FORK, &junk,
+ error = xfs_bmap_count_blocks(*tpp, ip, XFS_ATTR_FORK, &junk,
&aforkblks);
if (error)
return error;
}
if (xfs_inode_has_attr_fork(tip) && tip->i_af.if_nextents > 0 &&
tip->i_af.if_format != XFS_DINODE_FMT_LOCAL) {
- error = xfs_bmap_count_blocks(tp, tip, XFS_ATTR_FORK, &junk,
+ error = xfs_bmap_count_blocks(*tpp, tip, XFS_ATTR_FORK, &junk,
&taforkblks);
if (error)
return error;
@@ -1402,9 +1444,9 @@ xfs_swap_extent_forks(
*/
if (xfs_has_v3inodes(ip->i_mount)) {
if (ip->i_df.if_format == XFS_DINODE_FMT_BTREE)
- (*target_log_flags) |= XFS_ILOG_DOWNER;
+ target_log_flags |= XFS_ILOG_DOWNER;
if (tip->i_df.if_format == XFS_DINODE_FMT_BTREE)
- (*src_log_flags) |= XFS_ILOG_DOWNER;
+ src_log_flags |= XFS_ILOG_DOWNER;
}
/*
@@ -1434,71 +1476,80 @@ xfs_swap_extent_forks(
switch (ip->i_df.if_format) {
case XFS_DINODE_FMT_EXTENTS:
- (*src_log_flags) |= XFS_ILOG_DEXT;
+ src_log_flags |= XFS_ILOG_DEXT;
break;
case XFS_DINODE_FMT_BTREE:
ASSERT(!xfs_has_v3inodes(ip->i_mount) ||
- (*src_log_flags & XFS_ILOG_DOWNER));
- (*src_log_flags) |= XFS_ILOG_DBROOT;
+ (src_log_flags & XFS_ILOG_DOWNER));
+ src_log_flags |= XFS_ILOG_DBROOT;
break;
}
switch (tip->i_df.if_format) {
case XFS_DINODE_FMT_EXTENTS:
- (*target_log_flags) |= XFS_ILOG_DEXT;
+ target_log_flags |= XFS_ILOG_DEXT;
break;
case XFS_DINODE_FMT_BTREE:
- (*target_log_flags) |= XFS_ILOG_DBROOT;
+ target_log_flags |= XFS_ILOG_DBROOT;
ASSERT(!xfs_has_v3inodes(ip->i_mount) ||
- (*target_log_flags & XFS_ILOG_DOWNER));
+ (target_log_flags & XFS_ILOG_DOWNER));
break;
}
+ /* Do we have to swap reflink flags? */
+ if ((ip->i_diflags2 & XFS_DIFLAG2_REFLINK) ^
+ (tip->i_diflags2 & XFS_DIFLAG2_REFLINK)) {
+ uint64_t f;
+
+ f = ip->i_diflags2 & XFS_DIFLAG2_REFLINK;
+ ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+ ip->i_diflags2 |= tip->i_diflags2 & XFS_DIFLAG2_REFLINK;
+ tip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+ tip->i_diflags2 |= f & XFS_DIFLAG2_REFLINK;
+ }
+
+ /* Swap the cow forks. */
+ if (xfs_has_reflink(ip->i_mount)) {
+ ASSERT(!ip->i_cowfp ||
+ ip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
+ ASSERT(!tip->i_cowfp ||
+ tip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
+
+ swap(ip->i_cowfp, tip->i_cowfp);
+
+ if (ip->i_cowfp && ip->i_cowfp->if_bytes)
+ xfs_inode_set_cowblocks_tag(ip);
+ else
+ xfs_inode_clear_cowblocks_tag(ip);
+ if (tip->i_cowfp && tip->i_cowfp->if_bytes)
+ xfs_inode_set_cowblocks_tag(tip);
+ else
+ xfs_inode_clear_cowblocks_tag(tip);
+ }
+
+ xfs_trans_log_inode(*tpp, ip, src_log_flags);
+ xfs_trans_log_inode(*tpp, tip, target_log_flags);
+
+ /*
+ * The extent forks have been swapped, but crc=1,rmapbt=0 filesystems
+ * have inode number owner values in the bmbt blocks that still refer to
+ * the old inode. Scan each bmbt to fix up the owner values with the
+ * inode number of the current inode.
+ */
+ if (src_log_flags & XFS_ILOG_DOWNER) {
+ error = xfs_swap_change_owner(tpp, ip, tip);
+ if (error)
+ return error;
+ }
+ if (target_log_flags & XFS_ILOG_DOWNER) {
+ error = xfs_swap_change_owner(tpp, tip, ip);
+ if (error)
+ return error;
+ }
+
return 0;
}
-/*
- * Fix up the owners of the bmbt blocks to refer to the current inode. The
- * change owner scan attempts to order all modified buffers in the current
- * transaction. In the event of ordered buffer failure, the offending buffer is
- * physically logged as a fallback and the scan returns -EAGAIN. We must roll
- * the transaction in this case to replenish the fallback log reservation and
- * restart the scan. This process repeats until the scan completes.
- */
-static int
-xfs_swap_change_owner(
- struct xfs_trans **tpp,
- struct xfs_inode *ip,
- struct xfs_inode *tmpip)
-{
- int error;
- struct xfs_trans *tp = *tpp;
-
- do {
- error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, ip->i_ino,
- NULL);
- /* success or fatal error */
- if (error != -EAGAIN)
- break;
-
- error = xfs_trans_roll(tpp);
- if (error)
- break;
- tp = *tpp;
-
- /*
- * Redirty both inodes so they can relog and keep the log tail
- * moving forward.
- */
- xfs_trans_ijoin(tp, ip, 0);
- xfs_trans_ijoin(tp, tmpip, 0);
- xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
- xfs_trans_log_inode(tp, tmpip, XFS_ILOG_CORE);
- } while (true);
-
- return error;
-}
-
int
xfs_swap_extents(
struct xfs_inode *ip, /* target inode */
@@ -1508,9 +1559,7 @@ xfs_swap_extents(
struct xfs_mount *mp = ip->i_mount;
struct xfs_trans *tp;
struct xfs_bstat *sbp = &sxp->sx_stat;
- int src_log_flags, target_log_flags;
int error = 0;
- uint64_t f;
int resblks = 0;
unsigned int flags = 0;
@@ -1640,9 +1689,6 @@ xfs_swap_extents(
* recovery is going to see the fork as owned by the swapped inode,
* not the pre-swapped inodes.
*/
- src_log_flags = XFS_ILOG_CORE;
- target_log_flags = XFS_ILOG_CORE;
-
if (xfs_has_rmapbt(mp)) {
struct xfs_swapext_req req = {
.ip1 = tip,
@@ -1655,62 +1701,12 @@ xfs_swap_extents(
xfs_swapext(tp, &req);
error = xfs_defer_finish(&tp);
} else
- error = xfs_swap_extent_forks(tp, ip, tip, &src_log_flags,
- &target_log_flags);
+ error = xfs_swap_extent_forks(&tp, ip, tip);
if (error) {
trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
goto out_trans_cancel;
}
- /* Do we have to swap reflink flags? */
- if ((ip->i_diflags2 & XFS_DIFLAG2_REFLINK) ^
- (tip->i_diflags2 & XFS_DIFLAG2_REFLINK)) {
- f = ip->i_diflags2 & XFS_DIFLAG2_REFLINK;
- ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
- ip->i_diflags2 |= tip->i_diflags2 & XFS_DIFLAG2_REFLINK;
- tip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
- tip->i_diflags2 |= f & XFS_DIFLAG2_REFLINK;
- }
-
- /* Swap the cow forks. */
- if (xfs_has_reflink(mp)) {
- ASSERT(!ip->i_cowfp ||
- ip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
- ASSERT(!tip->i_cowfp ||
- tip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
-
- swap(ip->i_cowfp, tip->i_cowfp);
-
- if (ip->i_cowfp && ip->i_cowfp->if_bytes)
- xfs_inode_set_cowblocks_tag(ip);
- else
- xfs_inode_clear_cowblocks_tag(ip);
- if (tip->i_cowfp && tip->i_cowfp->if_bytes)
- xfs_inode_set_cowblocks_tag(tip);
- else
- xfs_inode_clear_cowblocks_tag(tip);
- }
-
- xfs_trans_log_inode(tp, ip, src_log_flags);
- xfs_trans_log_inode(tp, tip, target_log_flags);
-
- /*
- * The extent forks have been swapped, but crc=1,rmapbt=0 filesystems
- * have inode number owner values in the bmbt blocks that still refer to
- * the old inode. Scan each bmbt to fix up the owner values with the
- * inode number of the current inode.
- */
- if (src_log_flags & XFS_ILOG_DOWNER) {
- error = xfs_swap_change_owner(&tp, ip, tip);
- if (error)
- goto out_trans_cancel;
- }
- if (target_log_flags & XFS_ILOG_DOWNER) {
- error = xfs_swap_change_owner(&tp, tip, ip);
- if (error)
- goto out_trans_cancel;
- }
-
/*
* If this is a synchronous mount, make sure that the
* transaction goes to disk before returning to the user.
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 13/21] xfs: port xfs_swap_extent_forks to use xfs_swapext_req
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (11 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 18/21] xfs: condense symbolic links after an atomic swap Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 17/21] xfs: condense directories after an atomic swap Darrick J. Wong
` (7 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Port the old extent fork swapping function to take an xfs_swapext_req
as input, which aligns it with the new fiexchange interface.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_bmap_util.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index dbd95d86addb..9d6337a05544 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1406,9 +1406,10 @@ xfs_swap_change_owner(
STATIC int
xfs_swap_extent_forks(
struct xfs_trans **tpp,
- struct xfs_inode *ip,
- struct xfs_inode *tip)
+ struct xfs_swapext_req *req)
{
+ struct xfs_inode *ip = req->ip2;
+ struct xfs_inode *tip = req->ip1;
xfs_filblks_t aforkblks = 0;
xfs_filblks_t taforkblks = 0;
xfs_extnum_t junk;
@@ -1556,6 +1557,11 @@ xfs_swap_extents(
struct xfs_inode *tip, /* tmp inode */
struct xfs_swapext *sxp)
{
+ struct xfs_swapext_req req = {
+ .ip1 = tip,
+ .ip2 = ip,
+ .whichfork = XFS_DATA_FORK,
+ };
struct xfs_mount *mp = ip->i_mount;
struct xfs_trans *tp;
struct xfs_bstat *sbp = &sxp->sx_stat;
@@ -1689,19 +1695,12 @@ xfs_swap_extents(
* recovery is going to see the fork as owned by the swapped inode,
* not the pre-swapped inodes.
*/
+ req.blockcount = XFS_B_TO_FSB(ip->i_mount, i_size_read(VFS_I(ip)));
if (xfs_has_rmapbt(mp)) {
- struct xfs_swapext_req req = {
- .ip1 = tip,
- .ip2 = ip,
- .whichfork = XFS_DATA_FORK,
- .blockcount = XFS_B_TO_FSB(ip->i_mount,
- i_size_read(VFS_I(ip))),
- };
-
xfs_swapext(tp, &req);
error = xfs_defer_finish(&tp);
} else
- error = xfs_swap_extent_forks(&tp, ip, tip);
+ error = xfs_swap_extent_forks(&tp, &req);
if (error) {
trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
goto out_trans_cancel;
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 14/21] xfs: allow xfs_swap_range to use older extent swap algorithms
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (15 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 15/21] xfs: remove old swap extents implementation Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 12/21] xfs: consolidate all of the xfs_swap_extent_forks code Darrick J. Wong
` (3 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
If userspace permits non-atomic swap operations, use the older code
paths to implement the same functionality.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_bmap_util.c | 4 +-
fs/xfs/xfs_bmap_util.h | 4 ++
fs/xfs/xfs_xchgrange.c | 96 +++++++++++++++++++++++++++++++++++++++++++-----
3 files changed, 92 insertions(+), 12 deletions(-)
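For context, a userspace sketch of the defrag-style invocation that
becomes eligible for the old fork-swap path: a whole-file, non-atomic
exchange with both offsets at zero.  FIEXCHANGE_RANGE, struct
file_xchg_range, and the file1_fd field come from the uapi header added
earlier in this series; this fragment assumes both files have the same
size and elides most error handling:

#include <sys/ioctl.h>
#include <sys/stat.h>

/*
 * Swap the entire contents of fd1 and fd2, permitting the older
 * non-atomic implementations.
 */
static int swap_whole_files(int fd1, int fd2)
{
	struct file_xchg_range	fxr = { 0 };
	struct stat		st;

	if (fstat(fd2, &st))
		return -1;

	fxr.file1_fd = fd1;
	fxr.file1_offset = 0;
	fxr.file2_offset = 0;
	fxr.length = st.st_size;	/* both whole files */
	fxr.flags = FILE_XCHG_RANGE_NONATOMIC |
		    FILE_XCHG_RANGE_FULL_FILES;

	return ioctl(fd2, FIEXCHANGE_RANGE, &fxr);
}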
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 9d6337a05544..e8562c4de7eb 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1261,7 +1261,7 @@ xfs_insert_file_space(
* reject and log the attempt. basically we are putting the responsibility on
* userspace to get this right.
*/
-static int
+int
xfs_swap_extents_check_format(
struct xfs_inode *ip, /* target inode */
struct xfs_inode *tip) /* tmp inode */
@@ -1403,7 +1403,7 @@ xfs_swap_change_owner(
}
/* Swap the extents of two files by swapping data forks. */
-STATIC int
+int
xfs_swap_extent_forks(
struct xfs_trans **tpp,
struct xfs_swapext_req *req)
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 6888078f5c31..39c71da08403 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -69,6 +69,10 @@ int xfs_free_eofblocks(struct xfs_inode *ip);
int xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
struct xfs_swapext *sx);
+struct xfs_swapext_req;
+int xfs_swap_extent_forks(struct xfs_trans **tpp, struct xfs_swapext_req *req);
+int xfs_swap_extents_check_format(struct xfs_inode *ip, struct xfs_inode *tip);
+
xfs_daddr_t xfs_fsb_to_db(struct xfs_inode *ip, xfs_fsblock_t fsb);
xfs_extnum_t xfs_bmap_count_leaves(struct xfs_ifork *ifp, xfs_filblks_t *count);
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
index 9966938134c0..2b7aedc49923 100644
--- a/fs/xfs/xfs_xchgrange.c
+++ b/fs/xfs/xfs_xchgrange.c
@@ -297,6 +297,33 @@ xfs_xchg_range_rele_log_assist(
xlog_drop_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_SWAPEXT);
}
+/* Decide if we can use the old data fork exchange code. */
+static inline bool
+xfs_xchg_use_forkswap(
+ const struct file_xchg_range *fxr,
+ struct xfs_inode *ip1,
+ struct xfs_inode *ip2)
+{
+ if (!(fxr->flags & FILE_XCHG_RANGE_NONATOMIC))
+ return false;
+ if (!(fxr->flags & FILE_XCHG_RANGE_FULL_FILES))
+ return false;
+ if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
+ return false;
+ if (fxr->file1_offset != 0 || fxr->file2_offset != 0)
+ return false;
+ if (fxr->length != ip1->i_disk_size)
+ return false;
+ if (fxr->length != ip2->i_disk_size)
+ return false;
+ return true;
+}
+
+enum xchg_strategy {
+ SWAPEXT = 1, /* xfs_swapext() */
+ FORKSWAP = 2, /* exchange forks */
+};
+
/* Exchange the contents of two files. */
int
xfs_xchg_range(
@@ -316,19 +343,13 @@ xfs_xchg_range(
};
struct xfs_trans *tp;
unsigned int qretry;
+ unsigned int flags = 0;
bool retried = false;
+ enum xchg_strategy strategy;
int error;
trace_xfs_xchg_range(ip1, fxr, ip2, xchg_flags);
- /*
- * This function only supports using log intent items (SXI items if
- * atomic exchange is required, or BUI items if not) to exchange file
- * data. The legacy whole-fork swap will be ported in a later patch.
- */
- if (!(xchg_flags & XFS_XCHG_RANGE_LOGGED) && !xfs_swapext_supported(mp))
- return -EOPNOTSUPP;
-
if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
req.req_flags |= XFS_SWAP_REQ_SET_SIZES;
if (fxr->flags & FILE_XCHG_RANGE_SKIP_FILE1_HOLES)
@@ -340,10 +361,25 @@ xfs_xchg_range(
if (error)
return error;
+ /*
+ * We haven't decided which exchange strategy we want to use yet, but
+ * here we must choose if we want freed blocks during the swap to be
+ * added to the transaction block reservation (RES_FDBLKS) or freed
+ * into the global fdblocks. The legacy fork swap mechanism doesn't
+ * free any blocks, so it doesn't require it. It is also the only
+ * option that works for older filesystems.
+ *
+ * The bmap log intent items that were added with rmap and reflink can
+ * change the bmbt shape, so the intent-based swap strategies require
+ * us to set RES_FDBLKS.
+ */
+ if (xfs_has_lazysbcount(mp))
+ flags |= XFS_TRANS_RES_FDBLKS;
+
retry:
/* Allocate the transaction, lock the inodes, and join them. */
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, req.resblks, 0,
- XFS_TRANS_RES_FDBLKS, &tp);
+ flags, &tp);
if (error)
return error;
@@ -386,6 +422,40 @@ xfs_xchg_range(
if (error)
goto out_trans_cancel;
+ if ((xchg_flags & XFS_XCHG_RANGE_LOGGED) || xfs_swapext_supported(mp)) {
+ /*
+ * xfs_swapext() uses deferred bmap log intent items to swap
+ * extents between file forks. If the atomic log swap feature
+ * is enabled, it will also use swapext log intent items to
+ * restart the operation in case of failure.
+ *
+ * This means that we can use it if we previously obtained
+ * permission from the log to use log-assisted atomic extent
+ * swapping; or if the fs supports rmap or reflink and the
+ * user said NONATOMIC.
+ */
+ strategy = SWAPEXT;
+ } else if (xfs_xchg_use_forkswap(fxr, ip1, ip2)) {
+ /*
+ * Exchange the file contents by using the old bmap fork
+ * exchange code, if we're a defrag tool doing a full file
+ * swap.
+ */
+ strategy = FORKSWAP;
+
+ error = xfs_swap_extents_check_format(ip2, ip1);
+ if (error) {
+ xfs_notice(mp,
+ "%s: inode 0x%llx format is incompatible for exchanging.",
+ __func__, ip2->i_ino);
+ goto out_trans_cancel;
+ }
+ } else {
+ /* We cannot exchange the file contents. */
+ error = -EOPNOTSUPP;
+ goto out_trans_cancel;
+ }
+
/* If we got this far on a dry run, all parameters are ok. */
if (fxr->flags & FILE_XCHG_RANGE_DRY_RUN)
goto out_trans_cancel;
@@ -398,7 +468,13 @@ xfs_xchg_range(
xfs_trans_ichgtime(tp, ip2,
XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
- xfs_swapext(tp, &req);
+ if (strategy == SWAPEXT) {
+ xfs_swapext(tp, &req);
+ } else {
+ error = xfs_swap_extent_forks(&tp, &req);
+ if (error)
+ goto out_trans_cancel;
+ }
/*
* Force the log to persist metadata updates if the caller or the
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 15/21] xfs: remove old swap extents implementation
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (14 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 16/21] xfs: condense extended attributes " Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 14/21] xfs: allow xfs_swap_range to use older extent swap algorithms Darrick J. Wong
` (4 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Migrate the old XFS_IOC_SWAPEXT implementation to use our shiny new one.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_bmap_util.c | 491 ------------------------------------------------
fs/xfs/xfs_bmap_util.h | 7 -
fs/xfs/xfs_ioctl.c | 102 +++-------
fs/xfs/xfs_ioctl.h | 4
fs/xfs/xfs_ioctl32.c | 11 -
fs/xfs/xfs_xchgrange.c | 299 +++++++++++++++++++++++++++++
6 files changed, 334 insertions(+), 580 deletions(-)
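For reference, this is roughly how xfs_fsr has historically issued the
legacy swap that now routes through the new code, including the
bulkstat data behind the -EBUSY freshness check visible in the removed
code below.  Field names follow struct xfs_swapext in xfs_fs.h; the
target_statbuf and target_bstat variables are assumed to have been
filled by fstat and bulkstat beforehand, and error handling is elided:

	struct xfs_swapext	sx = { 0 };

	sx.sx_version = XFS_SX_VERSION;
	sx.sx_fdtarget = target_fd;	/* file being defragmented */
	sx.sx_fdtmp = tmp_fd;		/* freshly written temporary copy */
	sx.sx_offset = 0;		/* whole-file swaps only */
	sx.sx_length = target_statbuf.st_size;
	sx.sx_stat = target_bstat;	/* struct xfs_bstat from bulkstat */

	error = ioctl(target_fd, XFS_IOC_SWAPEXT, &sx);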
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index e8562c4de7eb..47a583a94d58 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1240,494 +1240,3 @@ xfs_insert_file_space(
xfs_iunlock(ip, XFS_ILOCK_EXCL);
return error;
}
-
-/*
- * We need to check that the format of the data fork in the temporary inode is
- * valid for the target inode before doing the swap. This is not a problem with
- * attr1 because of the fixed fork offset, but attr2 has a dynamically sized
- * data fork depending on the space the attribute fork is taking so we can get
- * invalid formats on the target inode.
- *
- * E.g. target has space for 7 extents in extent format, temp inode only has
- * space for 6. If we defragment down to 7 extents, then the tmp format is a
- * btree, but when swapped it needs to be in extent format. Hence we can't just
- * blindly swap data forks on attr2 filesystems.
- *
- * Note that we check the swap in both directions so that we don't end up with
- * a corrupt temporary inode, either.
- *
- * Note that fixing the way xfs_fsr sets up the attribute fork in the source
- * inode will prevent this situation from occurring, so all we do here is
- * reject and log the attempt. basically we are putting the responsibility on
- * userspace to get this right.
- */
-int
-xfs_swap_extents_check_format(
- struct xfs_inode *ip, /* target inode */
- struct xfs_inode *tip) /* tmp inode */
-{
- struct xfs_ifork *ifp = &ip->i_df;
- struct xfs_ifork *tifp = &tip->i_df;
-
- /* User/group/project quota ids must match if quotas are enforced. */
- if (XFS_IS_QUOTA_ON(ip->i_mount) &&
- (!uid_eq(VFS_I(ip)->i_uid, VFS_I(tip)->i_uid) ||
- !gid_eq(VFS_I(ip)->i_gid, VFS_I(tip)->i_gid) ||
- ip->i_projid != tip->i_projid))
- return -EINVAL;
-
- /* Should never get a local format */
- if (ifp->if_format == XFS_DINODE_FMT_LOCAL ||
- tifp->if_format == XFS_DINODE_FMT_LOCAL)
- return -EINVAL;
-
- /*
- * if the target inode has less extents that then temporary inode then
- * why did userspace call us?
- */
- if (ifp->if_nextents < tifp->if_nextents)
- return -EINVAL;
-
- /*
- * If we have to use the (expensive) rmap swap method, we can
- * handle any number of extents and any format.
- */
- if (xfs_has_rmapbt(ip->i_mount))
- return 0;
-
- /*
- * if the target inode is in extent form and the temp inode is in btree
- * form then we will end up with the target inode in the wrong format
- * as we already know there are less extents in the temp inode.
- */
- if (ifp->if_format == XFS_DINODE_FMT_EXTENTS &&
- tifp->if_format == XFS_DINODE_FMT_BTREE)
- return -EINVAL;
-
- /* Check temp in extent form to max in target */
- if (tifp->if_format == XFS_DINODE_FMT_EXTENTS &&
- tifp->if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK))
- return -EINVAL;
-
- /* Check target in extent form to max in temp */
- if (ifp->if_format == XFS_DINODE_FMT_EXTENTS &&
- ifp->if_nextents > XFS_IFORK_MAXEXT(tip, XFS_DATA_FORK))
- return -EINVAL;
-
- /*
- * If we are in a btree format, check that the temp root block will fit
- * in the target and that it has enough extents to be in btree format
- * in the target.
- *
- * Note that we have to be careful to allow btree->extent conversions
- * (a common defrag case) which will occur when the temp inode is in
- * extent format...
- */
- if (tifp->if_format == XFS_DINODE_FMT_BTREE) {
- if (xfs_inode_has_attr_fork(ip) &&
- XFS_BMAP_BMDR_SPACE(tifp->if_broot) > xfs_inode_fork_boff(ip))
- return -EINVAL;
- if (tifp->if_nextents <= XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK))
- return -EINVAL;
- }
-
- /* Reciprocal target->temp btree format checks */
- if (ifp->if_format == XFS_DINODE_FMT_BTREE) {
- if (xfs_inode_has_attr_fork(tip) &&
- XFS_BMAP_BMDR_SPACE(ip->i_df.if_broot) > xfs_inode_fork_boff(tip))
- return -EINVAL;
- if (ifp->if_nextents <= XFS_IFORK_MAXEXT(tip, XFS_DATA_FORK))
- return -EINVAL;
- }
-
- return 0;
-}
-
-static int
-xfs_swap_extent_flush(
- struct xfs_inode *ip)
-{
- int error;
-
- error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
- if (error)
- return error;
- truncate_pagecache_range(VFS_I(ip), 0, -1);
-
- /* Verify O_DIRECT for ftmp */
- if (VFS_I(ip)->i_mapping->nrpages)
- return -EINVAL;
- return 0;
-}
-
-/*
- * Fix up the owners of the bmbt blocks to refer to the current inode. The
- * change owner scan attempts to order all modified buffers in the current
- * transaction. In the event of ordered buffer failure, the offending buffer is
- * physically logged as a fallback and the scan returns -EAGAIN. We must roll
- * the transaction in this case to replenish the fallback log reservation and
- * restart the scan. This process repeats until the scan completes.
- */
-static int
-xfs_swap_change_owner(
- struct xfs_trans **tpp,
- struct xfs_inode *ip,
- struct xfs_inode *tmpip)
-{
- int error;
- struct xfs_trans *tp = *tpp;
-
- do {
- error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, ip->i_ino,
- NULL);
- /* success or fatal error */
- if (error != -EAGAIN)
- break;
-
- error = xfs_trans_roll(tpp);
- if (error)
- break;
- tp = *tpp;
-
- /*
- * Redirty both inodes so they can relog and keep the log tail
- * moving forward.
- */
- xfs_trans_ijoin(tp, ip, 0);
- xfs_trans_ijoin(tp, tmpip, 0);
- xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
- xfs_trans_log_inode(tp, tmpip, XFS_ILOG_CORE);
- } while (true);
-
- return error;
-}
-
-/* Swap the extents of two files by swapping data forks. */
-int
-xfs_swap_extent_forks(
- struct xfs_trans **tpp,
- struct xfs_swapext_req *req)
-{
- struct xfs_inode *ip = req->ip2;
- struct xfs_inode *tip = req->ip1;
- xfs_filblks_t aforkblks = 0;
- xfs_filblks_t taforkblks = 0;
- xfs_extnum_t junk;
- uint64_t tmp;
- int src_log_flags = XFS_ILOG_CORE;
- int target_log_flags = XFS_ILOG_CORE;
- int error;
-
- /*
- * Count the number of extended attribute blocks
- */
- if (xfs_inode_has_attr_fork(ip) && ip->i_af.if_nextents > 0 &&
- ip->i_af.if_format != XFS_DINODE_FMT_LOCAL) {
- error = xfs_bmap_count_blocks(*tpp, ip, XFS_ATTR_FORK, &junk,
- &aforkblks);
- if (error)
- return error;
- }
- if (xfs_inode_has_attr_fork(tip) && tip->i_af.if_nextents > 0 &&
- tip->i_af.if_format != XFS_DINODE_FMT_LOCAL) {
- error = xfs_bmap_count_blocks(*tpp, tip, XFS_ATTR_FORK, &junk,
- &taforkblks);
- if (error)
- return error;
- }
-
- /*
- * Btree format (v3) inodes have the inode number stamped in the bmbt
- * block headers. We can't start changing the bmbt blocks until the
- * inode owner change is logged so recovery does the right thing in the
- * event of a crash. Set the owner change log flags now and leave the
- * bmbt scan as the last step.
- */
- if (xfs_has_v3inodes(ip->i_mount)) {
- if (ip->i_df.if_format == XFS_DINODE_FMT_BTREE)
- target_log_flags |= XFS_ILOG_DOWNER;
- if (tip->i_df.if_format == XFS_DINODE_FMT_BTREE)
- src_log_flags |= XFS_ILOG_DOWNER;
- }
-
- /*
- * Swap the data forks of the inodes
- */
- swap(ip->i_df, tip->i_df);
-
- /*
- * Fix the on-disk inode values
- */
- tmp = (uint64_t)ip->i_nblocks;
- ip->i_nblocks = tip->i_nblocks - taforkblks + aforkblks;
- tip->i_nblocks = tmp + taforkblks - aforkblks;
-
- /*
- * The extents in the source inode could still contain speculative
- * preallocation beyond EOF (e.g. the file is open but not modified
- * while defrag is in progress). In that case, we need to copy over the
- * number of delalloc blocks the data fork in the source inode is
- * tracking beyond EOF so that when the fork is truncated away when the
- * temporary inode is unlinked we don't underrun the i_delayed_blks
- * counter on that inode.
- */
- ASSERT(tip->i_delayed_blks == 0);
- tip->i_delayed_blks = ip->i_delayed_blks;
- ip->i_delayed_blks = 0;
-
- switch (ip->i_df.if_format) {
- case XFS_DINODE_FMT_EXTENTS:
- src_log_flags |= XFS_ILOG_DEXT;
- break;
- case XFS_DINODE_FMT_BTREE:
- ASSERT(!xfs_has_v3inodes(ip->i_mount) ||
- (src_log_flags & XFS_ILOG_DOWNER));
- src_log_flags |= XFS_ILOG_DBROOT;
- break;
- }
-
- switch (tip->i_df.if_format) {
- case XFS_DINODE_FMT_EXTENTS:
- target_log_flags |= XFS_ILOG_DEXT;
- break;
- case XFS_DINODE_FMT_BTREE:
- target_log_flags |= XFS_ILOG_DBROOT;
- ASSERT(!xfs_has_v3inodes(ip->i_mount) ||
- (target_log_flags & XFS_ILOG_DOWNER));
- break;
- }
-
- /* Do we have to swap reflink flags? */
- if ((ip->i_diflags2 & XFS_DIFLAG2_REFLINK) ^
- (tip->i_diflags2 & XFS_DIFLAG2_REFLINK)) {
- uint64_t f;
-
- f = ip->i_diflags2 & XFS_DIFLAG2_REFLINK;
- ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
- ip->i_diflags2 |= tip->i_diflags2 & XFS_DIFLAG2_REFLINK;
- tip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
- tip->i_diflags2 |= f & XFS_DIFLAG2_REFLINK;
- }
-
- /* Swap the cow forks. */
- if (xfs_has_reflink(ip->i_mount)) {
- ASSERT(!ip->i_cowfp ||
- ip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
- ASSERT(!tip->i_cowfp ||
- tip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
-
- swap(ip->i_cowfp, tip->i_cowfp);
-
- if (ip->i_cowfp && ip->i_cowfp->if_bytes)
- xfs_inode_set_cowblocks_tag(ip);
- else
- xfs_inode_clear_cowblocks_tag(ip);
- if (tip->i_cowfp && tip->i_cowfp->if_bytes)
- xfs_inode_set_cowblocks_tag(tip);
- else
- xfs_inode_clear_cowblocks_tag(tip);
- }
-
- xfs_trans_log_inode(*tpp, ip, src_log_flags);
- xfs_trans_log_inode(*tpp, tip, target_log_flags);
-
- /*
- * The extent forks have been swapped, but crc=1,rmapbt=0 filesystems
- * have inode number owner values in the bmbt blocks that still refer to
- * the old inode. Scan each bmbt to fix up the owner values with the
- * inode number of the current inode.
- */
- if (src_log_flags & XFS_ILOG_DOWNER) {
- error = xfs_swap_change_owner(tpp, ip, tip);
- if (error)
- return error;
- }
- if (target_log_flags & XFS_ILOG_DOWNER) {
- error = xfs_swap_change_owner(tpp, tip, ip);
- if (error)
- return error;
- }
-
- return 0;
-}
-
-int
-xfs_swap_extents(
- struct xfs_inode *ip, /* target inode */
- struct xfs_inode *tip, /* tmp inode */
- struct xfs_swapext *sxp)
-{
- struct xfs_swapext_req req = {
- .ip1 = tip,
- .ip2 = ip,
- .whichfork = XFS_DATA_FORK,
- };
- struct xfs_mount *mp = ip->i_mount;
- struct xfs_trans *tp;
- struct xfs_bstat *sbp = &sxp->sx_stat;
- int error = 0;
- int resblks = 0;
- unsigned int flags = 0;
-
- /*
- * Lock the inodes against other IO, page faults and truncate to
- * begin with. Then we can ensure the inodes are flushed and have no
- * page cache safely. Once we have done this we can take the ilocks and
- * do the rest of the checks.
- */
- lock_two_nondirectories(VFS_I(ip), VFS_I(tip));
- filemap_invalidate_lock_two(VFS_I(ip)->i_mapping,
- VFS_I(tip)->i_mapping);
-
- /* Verify that both files have the same format */
- if ((VFS_I(ip)->i_mode & S_IFMT) != (VFS_I(tip)->i_mode & S_IFMT)) {
- error = -EINVAL;
- goto out_unlock;
- }
-
- /* Verify both files are either real-time or non-realtime */
- if (XFS_IS_REALTIME_INODE(ip) != XFS_IS_REALTIME_INODE(tip)) {
- error = -EINVAL;
- goto out_unlock;
- }
-
- error = xfs_qm_dqattach(ip);
- if (error)
- goto out_unlock;
-
- error = xfs_qm_dqattach(tip);
- if (error)
- goto out_unlock;
-
- error = xfs_swap_extent_flush(ip);
- if (error)
- goto out_unlock;
- error = xfs_swap_extent_flush(tip);
- if (error)
- goto out_unlock;
-
- if (xfs_inode_has_cow_data(tip)) {
- error = xfs_reflink_cancel_cow_range(tip, 0, NULLFILEOFF, true);
- if (error)
- goto out_unlock;
- }
-
- /*
- * Extent "swapping" with rmap requires a permanent reservation and
- * a block reservation because it's really just a remap operation
- * performed with log redo items!
- */
- if (xfs_has_rmapbt(mp)) {
- int w = XFS_DATA_FORK;
- uint32_t ipnext = ip->i_df.if_nextents;
- uint32_t tipnext = tip->i_df.if_nextents;
-
- /*
- * Conceptually this shouldn't affect the shape of either bmbt,
- * but since we atomically move extents one by one, we reserve
- * enough space to rebuild both trees.
- */
- resblks = XFS_SWAP_RMAP_SPACE_RES(mp, ipnext, w);
- resblks += XFS_SWAP_RMAP_SPACE_RES(mp, tipnext, w);
-
- /*
- * If either inode straddles a bmapbt block allocation boundary,
- * the rmapbt algorithm triggers repeated allocs and frees as
- * extents are remapped. This can exhaust the block reservation
- * prematurely and cause shutdown. Return freed blocks to the
- * transaction reservation to counter this behavior.
- */
- flags |= XFS_TRANS_RES_FDBLKS;
- }
- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, flags,
- &tp);
- if (error)
- goto out_unlock;
-
- /*
- * Lock and join the inodes to the tansaction so that transaction commit
- * or cancel will unlock the inodes from this point onwards.
- */
- xfs_lock_two_inodes(ip, XFS_ILOCK_EXCL, tip, XFS_ILOCK_EXCL);
- xfs_trans_ijoin(tp, ip, 0);
- xfs_trans_ijoin(tp, tip, 0);
-
-
- /* Verify all data are being swapped */
- if (sxp->sx_offset != 0 ||
- sxp->sx_length != ip->i_disk_size ||
- sxp->sx_length != tip->i_disk_size) {
- error = -EFAULT;
- goto out_trans_cancel;
- }
-
- trace_xfs_swap_extent_before(ip, 0);
- trace_xfs_swap_extent_before(tip, 1);
-
- /* check inode formats now that data is flushed */
- error = xfs_swap_extents_check_format(ip, tip);
- if (error) {
- xfs_notice(mp,
- "%s: inode 0x%llx format is incompatible for exchanging.",
- __func__, ip->i_ino);
- goto out_trans_cancel;
- }
-
- /*
- * Compare the current change & modify times with that
- * passed in. If they differ, we abort this swap.
- * This is the mechanism used to ensure the calling
- * process that the file was not changed out from
- * under it.
- */
- if ((sbp->bs_ctime.tv_sec != VFS_I(ip)->i_ctime.tv_sec) ||
- (sbp->bs_ctime.tv_nsec != VFS_I(ip)->i_ctime.tv_nsec) ||
- (sbp->bs_mtime.tv_sec != VFS_I(ip)->i_mtime.tv_sec) ||
- (sbp->bs_mtime.tv_nsec != VFS_I(ip)->i_mtime.tv_nsec)) {
- error = -EBUSY;
- goto out_trans_cancel;
- }
-
- /*
- * Note the trickiness in setting the log flags - we set the owner log
- * flag on the opposite inode (i.e. the inode we are setting the new
- * owner to be) because once we swap the forks and log that, log
- * recovery is going to see the fork as owned by the swapped inode,
- * not the pre-swapped inodes.
- */
- req.blockcount = XFS_B_TO_FSB(ip->i_mount, i_size_read(VFS_I(ip)));
- if (xfs_has_rmapbt(mp)) {
- xfs_swapext(tp, &req);
- error = xfs_defer_finish(&tp);
- } else
- error = xfs_swap_extent_forks(&tp, &req);
- if (error) {
- trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
- goto out_trans_cancel;
- }
-
- /*
- * If this is a synchronous mount, make sure that the
- * transaction goes to disk before returning to the user.
- */
- if (xfs_has_wsync(mp))
- xfs_trans_set_sync(tp);
-
- error = xfs_trans_commit(tp);
-
- trace_xfs_swap_extent_after(ip, 0);
- trace_xfs_swap_extent_after(tip, 1);
-
-out_unlock_ilock:
- xfs_iunlock(ip, XFS_ILOCK_EXCL);
- xfs_iunlock(tip, XFS_ILOCK_EXCL);
-out_unlock:
- filemap_invalidate_unlock_two(VFS_I(ip)->i_mapping,
- VFS_I(tip)->i_mapping);
- unlock_two_nondirectories(VFS_I(ip), VFS_I(tip));
- return error;
-
-out_trans_cancel:
- xfs_trans_cancel(tp);
- goto out_unlock_ilock;
-}
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 39c71da08403..8eb7166aa9d4 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -66,13 +66,6 @@ int xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
bool xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
int xfs_free_eofblocks(struct xfs_inode *ip);
-int xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
- struct xfs_swapext *sx);
-
-struct xfs_swapext_req;
-int xfs_swap_extent_forks(struct xfs_trans **tpp, struct xfs_swapext_req *req);
-int xfs_swap_extents_check_format(struct xfs_inode *ip, struct xfs_inode *tip);
-
xfs_daddr_t xfs_fsb_to_db(struct xfs_inode *ip, xfs_fsblock_t fsb);
xfs_extnum_t xfs_bmap_count_leaves(struct xfs_ifork *ifp, xfs_filblks_t *count);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 4b2a02a08dfa..85c33142c5ab 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1655,81 +1655,43 @@ xfs_ioc_scrub_metadata(
int
xfs_ioc_swapext(
- xfs_swapext_t *sxp)
+ struct xfs_swapext *sxp)
{
- xfs_inode_t *ip, *tip;
- struct fd f, tmp;
- int error = 0;
+ struct file_xchg_range fxr = { 0 };
+ struct fd fd2, fd1;
+ int error = 0;
- /* Pull information for the target fd */
- f = fdget((int)sxp->sx_fdtarget);
- if (!f.file) {
- error = -EINVAL;
- goto out;
- }
-
- if (!(f.file->f_mode & FMODE_WRITE) ||
- !(f.file->f_mode & FMODE_READ) ||
- (f.file->f_flags & O_APPEND)) {
- error = -EBADF;
- goto out_put_file;
- }
+ fd2 = fdget((int)sxp->sx_fdtarget);
+ if (!fd2.file)
+ return -EINVAL;
- tmp = fdget((int)sxp->sx_fdtmp);
- if (!tmp.file) {
+ fd1 = fdget((int)sxp->sx_fdtmp);
+ if (!fd1.file) {
error = -EINVAL;
- goto out_put_file;
+ goto dest_fdput;
}
- if (!(tmp.file->f_mode & FMODE_WRITE) ||
- !(tmp.file->f_mode & FMODE_READ) ||
- (tmp.file->f_flags & O_APPEND)) {
- error = -EBADF;
- goto out_put_tmp_file;
- }
+ fxr.file1_fd = sxp->sx_fdtmp;
+ fxr.length = sxp->sx_length;
+ fxr.flags = FILE_XCHG_RANGE_NONATOMIC | FILE_XCHG_RANGE_FILE2_FRESH |
+ FILE_XCHG_RANGE_FULL_FILES;
+ fxr.file2_ino = sxp->sx_stat.bs_ino;
+ fxr.file2_mtime = sxp->sx_stat.bs_mtime.tv_sec;
+ fxr.file2_ctime = sxp->sx_stat.bs_ctime.tv_sec;
+ fxr.file2_mtime_nsec = sxp->sx_stat.bs_mtime.tv_nsec;
+ fxr.file2_ctime_nsec = sxp->sx_stat.bs_ctime.tv_nsec;
- if (IS_SWAPFILE(file_inode(f.file)) ||
- IS_SWAPFILE(file_inode(tmp.file))) {
- error = -EINVAL;
- goto out_put_tmp_file;
- }
+ error = vfs_xchg_file_range(fd1.file, fd2.file, &fxr);
/*
- * We need to ensure that the fds passed in point to XFS inodes
- * before we cast and access them as XFS structures as we have no
- * control over what the user passes us here.
+ * The old implementation returned EFAULT if the swap range was not
+ * the entirety of both files.
*/
- if (f.file->f_op != &xfs_file_operations ||
- tmp.file->f_op != &xfs_file_operations) {
- error = -EINVAL;
- goto out_put_tmp_file;
- }
-
- ip = XFS_I(file_inode(f.file));
- tip = XFS_I(file_inode(tmp.file));
-
- if (ip->i_mount != tip->i_mount) {
- error = -EINVAL;
- goto out_put_tmp_file;
- }
-
- if (ip->i_ino == tip->i_ino) {
- error = -EINVAL;
- goto out_put_tmp_file;
- }
-
- if (xfs_is_shutdown(ip->i_mount)) {
- error = -EIO;
- goto out_put_tmp_file;
- }
-
- error = xfs_swap_extents(ip, tip, sxp);
-
- out_put_tmp_file:
- fdput(tmp);
- out_put_file:
- fdput(f);
- out:
+ if (error == -EDOM)
+ error = -EFAULT;
+ fdput(fd1);
+dest_fdput:
+ fdput(fd2);
return error;
}
@@ -1988,14 +1950,10 @@ xfs_file_ioctl(
case XFS_IOC_SWAPEXT: {
struct xfs_swapext sxp;
- if (copy_from_user(&sxp, arg, sizeof(xfs_swapext_t)))
+ if (copy_from_user(&sxp, arg, sizeof(struct xfs_swapext)))
return -EFAULT;
- error = mnt_want_write_file(filp);
- if (error)
- return error;
- error = xfs_ioc_swapext(&sxp);
- mnt_drop_write_file(filp);
- return error;
+
+ return xfs_ioc_swapext(&sxp);
}
case XFS_IOC_FSCOUNTS: {
diff --git a/fs/xfs/xfs_ioctl.h b/fs/xfs/xfs_ioctl.h
index d4abba2c13c1..e3f72d816e0e 100644
--- a/fs/xfs/xfs_ioctl.h
+++ b/fs/xfs/xfs_ioctl.h
@@ -10,9 +10,7 @@ struct xfs_bstat;
struct xfs_ibulk;
struct xfs_inogrp;
-int
-xfs_ioc_swapext(
- xfs_swapext_t *sxp);
+int xfs_ioc_swapext(struct xfs_swapext *sxp);
extern int
xfs_find_handle(
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index 2f54b701eead..885d6e58d7ec 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -425,7 +425,6 @@ xfs_file_compat_ioctl(
struct inode *inode = file_inode(filp);
struct xfs_inode *ip = XFS_I(inode);
void __user *arg = compat_ptr(p);
- int error;
trace_xfs_file_compat_ioctl(ip);
@@ -435,6 +434,7 @@ xfs_file_compat_ioctl(
return xfs_compat_ioc_fsgeometry_v1(ip->i_mount, arg);
case XFS_IOC_FSGROWFSDATA_32: {
struct xfs_growfs_data in;
+ int error;
if (xfs_compat_growfs_data_copyin(&in, arg))
return -EFAULT;
@@ -447,6 +447,7 @@ xfs_file_compat_ioctl(
}
case XFS_IOC_FSGROWFSRT_32: {
struct xfs_growfs_rt in;
+ int error;
if (xfs_compat_growfs_rt_copyin(&in, arg))
return -EFAULT;
@@ -471,12 +472,8 @@ xfs_file_compat_ioctl(
offsetof(struct xfs_swapext, sx_stat)) ||
xfs_ioctl32_bstat_copyin(&sxp.sx_stat, &sxu->sx_stat))
return -EFAULT;
- error = mnt_want_write_file(filp);
- if (error)
- return error;
- error = xfs_ioc_swapext(&sxp);
- mnt_drop_write_file(filp);
- return error;
+
+ return xfs_ioc_swapext(&sxp);
}
case XFS_IOC_FSBULKSTAT_32:
case XFS_IOC_FSBULKSTAT_SINGLE_32:
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
index 2b7aedc49923..27bb88dcf228 100644
--- a/fs/xfs/xfs_xchgrange.c
+++ b/fs/xfs/xfs_xchgrange.c
@@ -2,6 +2,11 @@
/*
* Copyright (C) 2022 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
+ *
+ * The xfs_swap_extent_* functions are:
+ * Copyright (c) 2000-2006 Silicon Graphics, Inc.
+ * Copyright (c) 2012 Red Hat, Inc.
+ * All Rights Reserved.
*/
#include "xfs.h"
#include "xfs_fs.h"
@@ -15,6 +20,7 @@
#include "xfs_trans.h"
#include "xfs_quota.h"
#include "xfs_bmap_util.h"
+#include "xfs_bmap_btree.h"
#include "xfs_reflink.h"
#include "xfs_trace.h"
#include "xfs_swapext.h"
@@ -71,6 +77,299 @@ xfs_xchg_range_estimate(
return error;
}
+/*
+ * We need to check that the format of the data fork in the temporary inode is
+ * valid for the target inode before doing the swap. This is not a problem with
+ * attr1 because of the fixed fork offset, but attr2 has a dynamically sized
+ * data fork depending on the space the attribute fork is taking so we can get
+ * invalid formats on the target inode.
+ *
+ * E.g. target has space for 7 extents in extent format, temp inode only has
+ * space for 6. If we defragment down to 7 extents, then the tmp format is a
+ * btree, but when swapped it needs to be in extent format. Hence we can't just
+ * blindly swap data forks on attr2 filesystems.
+ *
+ * Note that we check the swap in both directions so that we don't end up with
+ * a corrupt temporary inode, either.
+ *
+ * Note that fixing the way xfs_fsr sets up the attribute fork in the source
+ * inode will prevent this situation from occurring, so all we do here is
+ * reject and log the attempt. Basically, we are putting the responsibility on
+ * userspace to get this right.
+ */
+STATIC int
+xfs_swap_extents_check_format(
+ struct xfs_inode *ip, /* target inode */
+ struct xfs_inode *tip) /* tmp inode */
+{
+ struct xfs_ifork *ifp = &ip->i_df;
+ struct xfs_ifork *tifp = &tip->i_df;
+
+ /* User/group/project quota ids must match if quotas are enforced. */
+ if (XFS_IS_QUOTA_ON(ip->i_mount) &&
+ (!uid_eq(VFS_I(ip)->i_uid, VFS_I(tip)->i_uid) ||
+ !gid_eq(VFS_I(ip)->i_gid, VFS_I(tip)->i_gid) ||
+ ip->i_projid != tip->i_projid))
+ return -EINVAL;
+
+ /* Should never get a local format */
+ if (ifp->if_format == XFS_DINODE_FMT_LOCAL ||
+ tifp->if_format == XFS_DINODE_FMT_LOCAL)
+ return -EINVAL;
+
+ /*
+ * If the target inode has fewer extents than the temporary inode,
+ * then why did userspace call us?
+ */
+ if (ifp->if_nextents < tifp->if_nextents)
+ return -EINVAL;
+
+ /*
+ * If we have to use the (expensive) rmap swap method, we can
+ * handle any number of extents and any format.
+ */
+ if (xfs_has_rmapbt(ip->i_mount))
+ return 0;
+
+ /*
+ * If the target inode is in extent form and the temp inode is in btree
+ * form, then we will end up with the target inode in the wrong format,
+ * as we already know there are fewer extents in the temp inode.
+ */
+ if (ifp->if_format == XFS_DINODE_FMT_EXTENTS &&
+ tifp->if_format == XFS_DINODE_FMT_BTREE)
+ return -EINVAL;
+
+ /* Check temp in extent form to max in target */
+ if (tifp->if_format == XFS_DINODE_FMT_EXTENTS &&
+ tifp->if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK))
+ return -EINVAL;
+
+ /* Check target in extent form to max in temp */
+ if (ifp->if_format == XFS_DINODE_FMT_EXTENTS &&
+ ifp->if_nextents > XFS_IFORK_MAXEXT(tip, XFS_DATA_FORK))
+ return -EINVAL;
+
+ /*
+ * If we are in a btree format, check that the temp root block will fit
+ * in the target and that it has enough extents to be in btree format
+ * in the target.
+ *
+ * Note that we have to be careful to allow btree->extent conversions
+ * (a common defrag case) which will occur when the temp inode is in
+ * extent format...
+ */
+ if (tifp->if_format == XFS_DINODE_FMT_BTREE) {
+ if (xfs_inode_has_attr_fork(ip) &&
+ XFS_BMAP_BMDR_SPACE(tifp->if_broot) > xfs_inode_fork_boff(ip))
+ return -EINVAL;
+ if (tifp->if_nextents <= XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK))
+ return -EINVAL;
+ }
+
+ /* Reciprocal target->temp btree format checks */
+ if (ifp->if_format == XFS_DINODE_FMT_BTREE) {
+ if (xfs_inode_has_attr_fork(tip) &&
+ XFS_BMAP_BMDR_SPACE(ip->i_df.if_broot) > xfs_inode_fork_boff(tip))
+ return -EINVAL;
+ if (ifp->if_nextents <= XFS_IFORK_MAXEXT(tip, XFS_DATA_FORK))
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+/*
+ * Fix up the owners of the bmbt blocks to refer to the current inode. The
+ * change owner scan attempts to order all modified buffers in the current
+ * transaction. In the event of ordered buffer failure, the offending buffer is
+ * physically logged as a fallback and the scan returns -EAGAIN. We must roll
+ * the transaction in this case to replenish the fallback log reservation and
+ * restart the scan. This process repeats until the scan completes.
+ */
+static int
+xfs_swap_change_owner(
+ struct xfs_trans **tpp,
+ struct xfs_inode *ip,
+ struct xfs_inode *tmpip)
+{
+ int error;
+ struct xfs_trans *tp = *tpp;
+
+ do {
+ error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, ip->i_ino,
+ NULL);
+ /* success or fatal error */
+ if (error != -EAGAIN)
+ break;
+
+ error = xfs_trans_roll(tpp);
+ if (error)
+ break;
+ tp = *tpp;
+
+ /*
+ * Redirty both inodes so they can relog and keep the log tail
+ * moving forward.
+ */
+ xfs_trans_ijoin(tp, ip, 0);
+ xfs_trans_ijoin(tp, tmpip, 0);
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ xfs_trans_log_inode(tp, tmpip, XFS_ILOG_CORE);
+ } while (true);
+
+ return error;
+}
+
+/* Swap the extents of two files by swapping data forks. */
+STATIC int
+xfs_swap_extent_forks(
+ struct xfs_trans **tpp,
+ struct xfs_swapext_req *req)
+{
+ struct xfs_inode *ip = req->ip2;
+ struct xfs_inode *tip = req->ip1;
+ xfs_filblks_t aforkblks = 0;
+ xfs_filblks_t taforkblks = 0;
+ xfs_extnum_t junk;
+ uint64_t tmp;
+ int src_log_flags = XFS_ILOG_CORE;
+ int target_log_flags = XFS_ILOG_CORE;
+ int error;
+
+ /*
+ * Count the number of extended attribute blocks
+ */
+ if (xfs_inode_has_attr_fork(ip) && ip->i_af.if_nextents > 0 &&
+ ip->i_af.if_format != XFS_DINODE_FMT_LOCAL) {
+ error = xfs_bmap_count_blocks(*tpp, ip, XFS_ATTR_FORK, &junk,
+ &aforkblks);
+ if (error)
+ return error;
+ }
+ if (xfs_inode_has_attr_fork(tip) && tip->i_af.if_nextents > 0 &&
+ tip->i_af.if_format != XFS_DINODE_FMT_LOCAL) {
+ error = xfs_bmap_count_blocks(*tpp, tip, XFS_ATTR_FORK, &junk,
+ &taforkblks);
+ if (error)
+ return error;
+ }
+
+ /*
+ * Btree format (v3) inodes have the inode number stamped in the bmbt
+ * block headers. We can't start changing the bmbt blocks until the
+ * inode owner change is logged so recovery does the right thing in the
+ * event of a crash. Set the owner change log flags now and leave the
+ * bmbt scan as the last step.
+ */
+ if (xfs_has_v3inodes(ip->i_mount)) {
+ if (ip->i_df.if_format == XFS_DINODE_FMT_BTREE)
+ target_log_flags |= XFS_ILOG_DOWNER;
+ if (tip->i_df.if_format == XFS_DINODE_FMT_BTREE)
+ src_log_flags |= XFS_ILOG_DOWNER;
+ }
+
+ /*
+ * Swap the data forks of the inodes
+ */
+ swap(ip->i_df, tip->i_df);
+
+ /*
+ * Fix the on-disk inode values
+ */
+ tmp = (uint64_t)ip->i_nblocks;
+ ip->i_nblocks = tip->i_nblocks - taforkblks + aforkblks;
+ tip->i_nblocks = tmp + taforkblks - aforkblks;
+
+ /*
+ * The extents in the source inode could still contain speculative
+ * preallocation beyond EOF (e.g. the file is open but not modified
+ * while defrag is in progress). In that case, we need to copy over the
+ * number of delalloc blocks the data fork in the source inode is
+ * tracking beyond EOF so that when the fork is truncated away when the
+ * temporary inode is unlinked we don't underrun the i_delayed_blks
+ * counter on that inode.
+ */
+ ASSERT(tip->i_delayed_blks == 0);
+ tip->i_delayed_blks = ip->i_delayed_blks;
+ ip->i_delayed_blks = 0;
+
+ switch (ip->i_df.if_format) {
+ case XFS_DINODE_FMT_EXTENTS:
+ src_log_flags |= XFS_ILOG_DEXT;
+ break;
+ case XFS_DINODE_FMT_BTREE:
+ ASSERT(!xfs_has_v3inodes(ip->i_mount) ||
+ (src_log_flags & XFS_ILOG_DOWNER));
+ src_log_flags |= XFS_ILOG_DBROOT;
+ break;
+ }
+
+ switch (tip->i_df.if_format) {
+ case XFS_DINODE_FMT_EXTENTS:
+ target_log_flags |= XFS_ILOG_DEXT;
+ break;
+ case XFS_DINODE_FMT_BTREE:
+ target_log_flags |= XFS_ILOG_DBROOT;
+ ASSERT(!xfs_has_v3inodes(ip->i_mount) ||
+ (target_log_flags & XFS_ILOG_DOWNER));
+ break;
+ }
+
+ /* Do we have to swap reflink flags? */
+ if ((ip->i_diflags2 & XFS_DIFLAG2_REFLINK) ^
+ (tip->i_diflags2 & XFS_DIFLAG2_REFLINK)) {
+ uint64_t f;
+
+ f = ip->i_diflags2 & XFS_DIFLAG2_REFLINK;
+ ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+ ip->i_diflags2 |= tip->i_diflags2 & XFS_DIFLAG2_REFLINK;
+ tip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+ tip->i_diflags2 |= f & XFS_DIFLAG2_REFLINK;
+ }
+
+ /* Swap the cow forks. */
+ if (xfs_has_reflink(ip->i_mount)) {
+ ASSERT(!ip->i_cowfp ||
+ ip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
+ ASSERT(!tip->i_cowfp ||
+ tip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
+
+ swap(ip->i_cowfp, tip->i_cowfp);
+
+ if (ip->i_cowfp && ip->i_cowfp->if_bytes)
+ xfs_inode_set_cowblocks_tag(ip);
+ else
+ xfs_inode_clear_cowblocks_tag(ip);
+ if (tip->i_cowfp && tip->i_cowfp->if_bytes)
+ xfs_inode_set_cowblocks_tag(tip);
+ else
+ xfs_inode_clear_cowblocks_tag(tip);
+ }
+
+ xfs_trans_log_inode(*tpp, ip, src_log_flags);
+ xfs_trans_log_inode(*tpp, tip, target_log_flags);
+
+ /*
+ * The extent forks have been swapped, but crc=1,rmapbt=0 filesystems
+ * have inode number owner values in the bmbt blocks that still refer to
+ * the old inode. Scan each bmbt to fix up the owner values with the
+ * inode number of the current inode.
+ */
+ if (src_log_flags & XFS_ILOG_DOWNER) {
+ error = xfs_swap_change_owner(tpp, ip, tip);
+ if (error)
+ return error;
+ }
+ if (target_log_flags & XFS_ILOG_DOWNER) {
+ error = xfs_swap_change_owner(tpp, tip, ip);
+ if (error)
+ return error;
+ }
+
+ return 0;
+}
+
/* Prepare two files to have their data exchanged. */
int
xfs_xchg_range_prep(
* [PATCH 16/21] xfs: condense extended attributes after an atomic swap
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (13 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 17/21] xfs: condense directories after an atomic swap Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 15/21] xfs: remove old swap extents implementation Darrick J. Wong
` (5 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Add a new swapext flag that enables us to perform post-swap processing
on file2 once we're done swapping the extent maps. If we were swapping
the extended attributes, we want to be able to convert file2's attr fork
from block to inline format.
This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online xattr repair feature can create
salvaged attrs in a temporary file and swap the attr forks when ready.
If one file is in extents format and the other is inline, we will have to
promote both to extents format to perform the swap. After the swap, we
can try to condense the fixed file's attr fork back down to inline
format if possible.
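As a rough sketch of how a future xattr repair tool might exercise this
flag (illustrative only; the request fields come from earlier patches in
this series, and the function name is made up):

	/*
	 * Hypothetical caller sketch, not part of this patch: exchange a
	 * salvaged attr fork from a temporary file into the damaged inode
	 * and ask for post-swap condensing of the repaired fork.
	 */
	STATIC int
	example_swap_salvaged_attrs(
		struct xfs_trans	**tpp,
		struct xfs_inode	*tempip,
		struct xfs_inode	*ip)
	{
		struct xfs_swapext_req	req = {
			.ip1		= tempip,
			.ip2		= ip,
			.whichfork	= XFS_ATTR_FORK,
			.req_flags	= XFS_SWAP_REQ_CVT_INO2_SF,
		};

		/* startoff1/startoff2/blockcount must cover the fork. */
		xfs_swapext(*tpp, &req);
		return xfs_defer_finish(tpp);
	}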
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_log_format.h | 9 +++++--
fs/xfs/libxfs/xfs_swapext.c | 51 +++++++++++++++++++++++++++++++++++++++-
fs/xfs/libxfs/xfs_swapext.h | 9 +++++--
3 files changed, 64 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 65a84fdefe56..378201a70028 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -906,18 +906,23 @@ struct xfs_swap_extent {
/* Clear the reflink flag from inode2 after the operation. */
#define XFS_SWAP_EXT_CLEAR_INO2_REFLINK (1ULL << 4)
+/* Try to convert inode2 from block to short format at the end, if possible. */
+#define XFS_SWAP_EXT_CVT_INO2_SF (1ULL << 5)
+
#define XFS_SWAP_EXT_FLAGS (XFS_SWAP_EXT_ATTR_FORK | \
XFS_SWAP_EXT_SET_SIZES | \
XFS_SWAP_EXT_SKIP_INO1_HOLES | \
XFS_SWAP_EXT_CLEAR_INO1_REFLINK | \
- XFS_SWAP_EXT_CLEAR_INO2_REFLINK)
+ XFS_SWAP_EXT_CLEAR_INO2_REFLINK | \
+ XFS_SWAP_EXT_CVT_INO2_SF)
#define XFS_SWAP_EXT_STRINGS \
{ XFS_SWAP_EXT_ATTR_FORK, "ATTRFORK" }, \
{ XFS_SWAP_EXT_SET_SIZES, "SETSIZES" }, \
{ XFS_SWAP_EXT_SKIP_INO1_HOLES, "SKIP_INO1_HOLES" }, \
{ XFS_SWAP_EXT_CLEAR_INO1_REFLINK, "CLEAR_INO1_REFLINK" }, \
- { XFS_SWAP_EXT_CLEAR_INO2_REFLINK, "CLEAR_INO2_REFLINK" }
+ { XFS_SWAP_EXT_CLEAR_INO2_REFLINK, "CLEAR_INO2_REFLINK" }, \
+ { XFS_SWAP_EXT_CVT_INO2_SF, "CVT_INO2_SF" }
/* This is the structure used to lay out an sxi log item in the log. */
struct xfs_sxi_log_format {
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 227a08ac5d4b..6b5223e73692 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -23,6 +23,10 @@
#include "xfs_error.h"
#include "xfs_errortag.h"
#include "xfs_health.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_attr_leaf.h"
+#include "xfs_attr.h"
struct kmem_cache *xfs_swapext_intent_cache;
@@ -121,7 +125,8 @@ static inline bool
sxi_has_postop_work(const struct xfs_swapext_intent *sxi)
{
return sxi->sxi_flags & (XFS_SWAP_EXT_CLEAR_INO1_REFLINK |
- XFS_SWAP_EXT_CLEAR_INO2_REFLINK);
+ XFS_SWAP_EXT_CLEAR_INO2_REFLINK |
+ XFS_SWAP_EXT_CVT_INO2_SF);
}
static inline void
@@ -350,6 +355,36 @@ xfs_swapext_exchange_mappings(
sxi_advance(sxi, irec1);
}
+/* Convert inode2's leaf attr fork back to shortform, if possible. */
+STATIC int
+xfs_swapext_attr_to_sf(
+ struct xfs_trans *tp,
+ struct xfs_swapext_intent *sxi)
+{
+ struct xfs_da_args args = {
+ .dp = sxi->sxi_ip2,
+ .geo = tp->t_mountp->m_attr_geo,
+ .whichfork = XFS_ATTR_FORK,
+ .trans = tp,
+ };
+ struct xfs_buf *bp;
+ int forkoff;
+ int error;
+
+ if (!xfs_attr_is_leaf(sxi->sxi_ip2))
+ return 0;
+
+ error = xfs_attr3_leaf_read(tp, sxi->sxi_ip2, 0, &bp);
+ if (error)
+ return error;
+
+ forkoff = xfs_attr_shortform_allfit(bp, sxi->sxi_ip2);
+ if (forkoff == 0)
+ return 0;
+
+ return xfs_attr3_leaf_to_shortform(bp, &args, forkoff);
+}
+
static inline void
xfs_swapext_clear_reflink(
struct xfs_trans *tp,
@@ -367,6 +402,16 @@ xfs_swapext_do_postop_work(
struct xfs_trans *tp,
struct xfs_swapext_intent *sxi)
{
+ if (sxi->sxi_flags & XFS_SWAP_EXT_CVT_INO2_SF) {
+ int error = 0;
+
+ if (sxi->sxi_flags & XFS_SWAP_EXT_ATTR_FORK)
+ error = xfs_swapext_attr_to_sf(tp, sxi);
+ sxi->sxi_flags &= ~XFS_SWAP_EXT_CVT_INO2_SF;
+ if (error)
+ return error;
+ }
+
if (sxi->sxi_flags & XFS_SWAP_EXT_CLEAR_INO1_REFLINK) {
xfs_swapext_clear_reflink(tp, sxi->sxi_ip1);
sxi->sxi_flags &= ~XFS_SWAP_EXT_CLEAR_INO1_REFLINK;
@@ -794,6 +839,8 @@ xfs_swapext_init_intent(
if (req->req_flags & XFS_SWAP_REQ_SKIP_INO1_HOLES)
sxi->sxi_flags |= XFS_SWAP_EXT_SKIP_INO1_HOLES;
+ if (req->req_flags & XFS_SWAP_REQ_CVT_INO2_SF)
+ sxi->sxi_flags |= XFS_SWAP_EXT_CVT_INO2_SF;
if (req->req_flags & XFS_SWAP_REQ_LOGGED)
sxi->sxi_op_flags |= XFS_SWAP_EXT_OP_LOGGED;
@@ -1013,6 +1060,8 @@ xfs_swapext(
ASSERT(!(req->req_flags & ~XFS_SWAP_REQ_FLAGS));
if (req->req_flags & XFS_SWAP_REQ_SET_SIZES)
ASSERT(req->whichfork == XFS_DATA_FORK);
+ if (req->req_flags & XFS_SWAP_REQ_CVT_INO2_SF)
+ ASSERT(req->whichfork == XFS_ATTR_FORK);
if (req->blockcount == 0)
return;
diff --git a/fs/xfs/libxfs/xfs_swapext.h b/fs/xfs/libxfs/xfs_swapext.h
index 1987897ddc25..6b610fea150a 100644
--- a/fs/xfs/libxfs/xfs_swapext.h
+++ b/fs/xfs/libxfs/xfs_swapext.h
@@ -126,16 +126,21 @@ struct xfs_swapext_req {
/* Files need to be upgraded to have large extent counts. */
#define XFS_SWAP_REQ_NREXT64 (1U << 3)
+/* Try to convert inode2's fork to local format, if possible. */
+#define XFS_SWAP_REQ_CVT_INO2_SF (1U << 4)
+
#define XFS_SWAP_REQ_FLAGS (XFS_SWAP_REQ_LOGGED | \
XFS_SWAP_REQ_SET_SIZES | \
XFS_SWAP_REQ_SKIP_INO1_HOLES | \
- XFS_SWAP_REQ_NREXT64)
+ XFS_SWAP_REQ_NREXT64 | \
+ XFS_SWAP_REQ_CVT_INO2_SF)
#define XFS_SWAP_REQ_STRINGS \
{ XFS_SWAP_REQ_LOGGED, "LOGGED" }, \
{ XFS_SWAP_REQ_SET_SIZES, "SETSIZES" }, \
{ XFS_SWAP_REQ_SKIP_INO1_HOLES, "SKIP_INO1_HOLES" }, \
- { XFS_SWAP_REQ_NREXT64, "NREXT64" }
+ { XFS_SWAP_REQ_NREXT64, "NREXT64" }, \
+ { XFS_SWAP_REQ_CVT_INO2_SF, "CVT_INO2_SF" }
unsigned int xfs_swapext_reflink_prep(const struct xfs_swapext_req *req);
void xfs_swapext_reflink_finish(struct xfs_trans *tp,
* [PATCH 17/21] xfs: condense directories after an atomic swap
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (12 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 13/21] xfs: port xfs_swap_extent_forks to use xfs_swapext_req Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 16/21] xfs: condense extended attributes " Darrick J. Wong
` (6 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
The previous commit added a new swapext flag that enables us to perform
post-swap processing on file2 once we're done swapping the extent maps.
Now add this ability for directories.
This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online directory repair feature can
create salvaged dirents in a temporary directory and swap the data forks
when ready. If one file is in extents format and the other is inline,
we will have to promote both to extents format to perform the swap.
After the swap, we can try to condense the fixed directory down to
inline format if possible.
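The promotion step mentioned above is not part of this patch; as a
sketch under that assumption, a repair caller could convert a shortform
directory with the existing helper before the swap and rely on
CVT_INO2_SF to condense it afterwards:

	/*
	 * Illustrative sketch only: promote a shortform directory to
	 * block (extents) format so its data fork can be swapped.
	 * Assumes the caller holds the ILOCK and a transaction with
	 * enough block reservation; the xfs_da_args space-reservation
	 * fields are omitted for brevity.
	 */
	STATIC int
	example_promote_dir_for_swap(
		struct xfs_trans	*tp,
		struct xfs_inode	*dp)
	{
		struct xfs_da_args	args = {
			.dp		= dp,
			.geo		= tp->t_mountp->m_dir_geo,
			.whichfork	= XFS_DATA_FORK,
			.trans		= tp,
		};

		if (dp->i_df.if_format != XFS_DINODE_FMT_LOCAL)
			return 0;	/* already extents or btree */
		return xfs_dir2_sf_to_block(&args);
	}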
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_swapext.c | 44 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 43 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 6b5223e73692..a52f72a499f4 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -27,6 +27,8 @@
#include "xfs_da_btree.h"
#include "xfs_attr_leaf.h"
#include "xfs_attr.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_dir2.h"
struct kmem_cache *xfs_swapext_intent_cache;
@@ -385,6 +387,42 @@ xfs_swapext_attr_to_sf(
return xfs_attr3_leaf_to_shortform(bp, &args, forkoff);
}
+/* Convert inode2's block dir fork back to shortform, if possible. */
+STATIC int
+xfs_swapext_dir_to_sf(
+ struct xfs_trans *tp,
+ struct xfs_swapext_intent *sxi)
+{
+ struct xfs_da_args args = {
+ .dp = sxi->sxi_ip2,
+ .geo = tp->t_mountp->m_dir_geo,
+ .whichfork = XFS_DATA_FORK,
+ .trans = tp,
+ };
+ struct xfs_dir2_sf_hdr sfh;
+ struct xfs_buf *bp;
+ bool isblock;
+ int size;
+ int error;
+
+ error = xfs_dir2_isblock(&args, &isblock);
+ if (error)
+ return error;
+
+ if (!isblock)
+ return 0;
+
+ error = xfs_dir3_block_read(tp, sxi->sxi_ip2, &bp);
+ if (error)
+ return error;
+
+ size = xfs_dir2_block_sfsize(sxi->sxi_ip2, bp->b_addr, &sfh);
+ if (size > xfs_inode_data_fork_size(sxi->sxi_ip2))
+ return 0;
+
+ return xfs_dir2_block_to_sf(&args, bp, size, &sfh);
+}
+
static inline void
xfs_swapext_clear_reflink(
struct xfs_trans *tp,
@@ -407,6 +445,8 @@ xfs_swapext_do_postop_work(
if (sxi->sxi_flags & XFS_SWAP_EXT_ATTR_FORK)
error = xfs_swapext_attr_to_sf(tp, sxi);
+ else if (S_ISDIR(VFS_I(sxi->sxi_ip2)->i_mode))
+ error = xfs_swapext_dir_to_sf(tp, sxi);
sxi->sxi_flags &= ~XFS_SWAP_EXT_CVT_INO2_SF;
if (error)
return error;
@@ -1061,7 +1101,9 @@ xfs_swapext(
if (req->req_flags & XFS_SWAP_REQ_SET_SIZES)
ASSERT(req->whichfork == XFS_DATA_FORK);
if (req->req_flags & XFS_SWAP_REQ_CVT_INO2_SF)
- ASSERT(req->whichfork == XFS_ATTR_FORK);
+ ASSERT(req->whichfork == XFS_ATTR_FORK ||
+ (req->whichfork == XFS_DATA_FORK &&
+ S_ISDIR(VFS_I(req->ip2)->i_mode)));
if (req->blockcount == 0)
return;
* [PATCH 18/21] xfs: condense symbolic links after an atomic swap
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (10 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 06/21] xfs: introduce a swap-extent log intent item Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 13/21] xfs: port xfs_swap_extent_forks to use xfs_swapext_req Darrick J. Wong
` (8 subsequent siblings)
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
The previous commit added a new swapext flag that enables us to perform
post-swap processing on file2 once we're done swapping the extent maps.
Now add this ability for symlinks.
This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online symlink repair feature can
salvage the remote target in a temporary link and swap the data forks
when ready. If one file is in extents format and the other is inline,
we will have to promote both to extents format to perform the swap.
After the swap, we can try to condense the fixed symlink down to inline
format if possible.
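For scale (a hedged aside, not from the patch): on a default 512-byte
v5 inode with no attr fork, roughly 336 bytes of literal area remain
after the inode core, so most symlink targets qualify. The eligibility
test reduces to the size check this patch adds:

	/*
	 * Illustrative predicate mirroring the check in
	 * xfs_swapext_link_to_sf(): the swapped-in target can be
	 * condensed only if the string fits in the data fork's inline
	 * area.
	 */
	static inline bool
	example_symlink_can_condense(
		struct xfs_inode	*ip)
	{
		return ip->i_disk_size <= xfs_inode_data_fork_size(ip);
	}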
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_swapext.c | 48 +++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_symlink_remote.c | 47 +++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_symlink_remote.h | 1 +
fs/xfs/xfs_symlink.c | 49 ++++--------------------------------
4 files changed, 101 insertions(+), 44 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index a52f72a499f4..b27ceeb93a16 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -29,6 +29,7 @@
#include "xfs_attr.h"
#include "xfs_dir2_priv.h"
#include "xfs_dir2.h"
+#include "xfs_symlink_remote.h"
struct kmem_cache *xfs_swapext_intent_cache;
@@ -423,6 +424,48 @@ xfs_swapext_dir_to_sf(
return xfs_dir2_block_to_sf(&args, bp, size, &sfh);
}
+/* Convert inode2's remote symlink target back to shortform, if possible. */
+STATIC int
+xfs_swapext_link_to_sf(
+ struct xfs_trans *tp,
+ struct xfs_swapext_intent *sxi)
+{
+ struct xfs_inode *ip = sxi->sxi_ip2;
+ struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK);
+ char *buf;
+ int error;
+
+ if (ifp->if_format == XFS_DINODE_FMT_LOCAL ||
+ ip->i_disk_size > xfs_inode_data_fork_size(ip))
+ return 0;
+
+ /* Read the current symlink target into a buffer. */
+ buf = kmem_alloc(ip->i_disk_size + 1, KM_NOFS);
+ if (!buf) {
+ ASSERT(0);
+ return -ENOMEM;
+ }
+
+ error = xfs_symlink_remote_read(ip, buf);
+ if (error)
+ goto free;
+
+ /* Remove the blocks. */
+ error = xfs_symlink_remote_truncate(tp, ip);
+ if (error)
+ goto free;
+
+ /* Convert fork to local format and log our changes. */
+ xfs_idestroy_fork(ifp);
+ ifp->if_bytes = 0;
+ ifp->if_format = XFS_DINODE_FMT_LOCAL;
+ xfs_init_local_fork(ip, XFS_DATA_FORK, buf, ip->i_disk_size);
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
+free:
+ kmem_free(buf);
+ return error;
+}
+
static inline void
xfs_swapext_clear_reflink(
struct xfs_trans *tp,
@@ -447,6 +490,8 @@ xfs_swapext_do_postop_work(
error = xfs_swapext_attr_to_sf(tp, sxi);
else if (S_ISDIR(VFS_I(sxi->sxi_ip2)->i_mode))
error = xfs_swapext_dir_to_sf(tp, sxi);
+ else if (S_ISLNK(VFS_I(sxi->sxi_ip2)->i_mode))
+ error = xfs_swapext_link_to_sf(tp, sxi);
sxi->sxi_flags &= ~XFS_SWAP_EXT_CVT_INO2_SF;
if (error)
return error;
@@ -1103,7 +1148,8 @@ xfs_swapext(
if (req->req_flags & XFS_SWAP_REQ_CVT_INO2_SF)
ASSERT(req->whichfork == XFS_ATTR_FORK ||
(req->whichfork == XFS_DATA_FORK &&
- S_ISDIR(VFS_I(req->ip2)->i_mode)));
+ (S_ISDIR(VFS_I(req->ip2)->i_mode) ||
+ S_ISLNK(VFS_I(req->ip2)->i_mode))));
if (req->blockcount == 0)
return;
diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c
index 5261f15ea2ed..b48dcb893a2a 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.c
+++ b/fs/xfs/libxfs/xfs_symlink_remote.c
@@ -391,3 +391,50 @@ xfs_symlink_write_target(
ASSERT(pathlen == 0);
return 0;
}
+
+/* Remove all the blocks from a symlink and invalidate buffers. */
+int
+xfs_symlink_remote_truncate(
+ struct xfs_trans *tp,
+ struct xfs_inode *ip)
+{
+ struct xfs_bmbt_irec mval[XFS_SYMLINK_MAPS];
+ struct xfs_mount *mp = tp->t_mountp;
+ struct xfs_buf *bp;
+ int nmaps = XFS_SYMLINK_MAPS;
+ int done = 0;
+ int i;
+ int error;
+
+ /* Read mappings and invalidate buffers. */
+ error = xfs_bmapi_read(ip, 0, XFS_MAX_FILEOFF, mval, &nmaps, 0);
+ if (error)
+ return error;
+
+ for (i = 0; i < nmaps; i++) {
+ if (!xfs_bmap_is_real_extent(&mval[i]))
+ break;
+
+ error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
+ XFS_FSB_TO_DADDR(mp, mval[i].br_startblock),
+ XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0,
+ &bp);
+ if (error)
+ return error;
+
+ xfs_trans_binval(tp, bp);
+ }
+
+ /* Unmap the remote blocks. */
+ error = xfs_bunmapi(tp, ip, 0, XFS_MAX_FILEOFF, 0, nmaps, &done);
+ if (error)
+ return error;
+ if (!done) {
+ ASSERT(done);
+ xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
+ return -EFSCORRUPTED;
+ }
+
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_symlink_remote.h b/fs/xfs/libxfs/xfs_symlink_remote.h
index d81461c06b6b..05eb9c3937d9 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.h
+++ b/fs/xfs/libxfs/xfs_symlink_remote.h
@@ -23,5 +23,6 @@ int xfs_symlink_remote_read(struct xfs_inode *ip, char *link);
int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip,
const char *target_path, int pathlen, xfs_fsblock_t fs_blocks,
uint resblks);
+int xfs_symlink_remote_truncate(struct xfs_trans *tp, struct xfs_inode *ip);
#endif /* __XFS_SYMLINK_REMOTE_H */
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 548d9116e0c5..8cf69ca4bd7c 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -250,19 +250,12 @@ xfs_symlink(
*/
STATIC int
xfs_inactive_symlink_rmt(
- struct xfs_inode *ip)
+ struct xfs_inode *ip)
{
- struct xfs_buf *bp;
- int done;
- int error;
- int i;
- xfs_mount_t *mp;
- xfs_bmbt_irec_t mval[XFS_SYMLINK_MAPS];
- int nmaps;
- int size;
- xfs_trans_t *tp;
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_trans *tp;
+ int error;
- mp = ip->i_mount;
ASSERT(!xfs_need_iread_extents(&ip->i_df));
/*
* We're freeing a symlink that has some
@@ -286,44 +279,14 @@ xfs_inactive_symlink_rmt(
* locked for the second transaction. In the error paths we need it
* held so the cancel won't rele it, see below.
*/
- size = (int)ip->i_disk_size;
ip->i_disk_size = 0;
VFS_I(ip)->i_mode = (VFS_I(ip)->i_mode & ~S_IFMT) | S_IFREG;
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
- /*
- * Find the block(s) so we can inval and unmap them.
- */
- done = 0;
- nmaps = ARRAY_SIZE(mval);
- error = xfs_bmapi_read(ip, 0, xfs_symlink_blocks(mp, size),
- mval, &nmaps, 0);
- if (error)
- goto error_trans_cancel;
- /*
- * Invalidate the block(s). No validation is done.
- */
- for (i = 0; i < nmaps; i++) {
- error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
- XFS_FSB_TO_DADDR(mp, mval[i].br_startblock),
- XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0,
- &bp);
- if (error)
- goto error_trans_cancel;
- xfs_trans_binval(tp, bp);
- }
- /*
- * Unmap the dead block(s) to the dfops.
- */
- error = xfs_bunmapi(tp, ip, 0, size, 0, nmaps, &done);
+
+ error = xfs_symlink_remote_truncate(tp, ip);
if (error)
goto error_trans_cancel;
- ASSERT(done);
- /*
- * Commit the transaction. This first logs the EFI and the inode, then
- * rolls and commits the transaction that frees the extents.
- */
- xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
error = xfs_trans_commit(tp);
if (error) {
ASSERT(xfs_is_shutdown(mp));
* [PATCH 19/21] xfs: make atomic extent swapping support realtime files
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (17 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 12/21] xfs: consolidate all of the xfs_swap_extent_forks code Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 20/21] xfs: support non-power-of-two rtextsize with exchange-range Darrick J. Wong
2022-12-30 22:13 ` [PATCH 21/21] xfs: enable atomic swapext feature Darrick J. Wong
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Now that bmap items support the realtime device, we can add the
necessary pieces to the atomic extent swapping code to support such
things.
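The key constraint, restated as a tiny helper (a sketch using the same
div_u64_rem() arithmetic as the DEBUG-only checks in this patch; the
helper name is invented):

	/*
	 * Sketch only: an offset or length (in fs blocks) is usable for
	 * an rt swap when it divides evenly into realtime extents.
	 * sb_rextsize need not be a power of two, so use division
	 * rather than a bitmask.
	 */
	static inline bool
	example_rtext_aligned(
		struct xfs_mount	*mp,
		uint64_t		fsbcount)
	{
		uint32_t		mod;

		div_u64_rem(fsbcount, mp->m_sb.sb_rextsize, &mod);
		return mod == 0;
	}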
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_swapext.c | 109 +++++++++++++++++++++++++++++++++-
fs/xfs/libxfs/xfs_swapext.h | 5 +-
fs/xfs/xfs_bmap_util.c | 2 -
fs/xfs/xfs_file.c | 2 -
fs/xfs/xfs_inode.h | 5 ++
fs/xfs/xfs_rtalloc.c | 136 +++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_rtalloc.h | 3 +
fs/xfs/xfs_trace.h | 11 ++-
fs/xfs/xfs_xchgrange.c | 71 ++++++++++++++++++++++
fs/xfs/xfs_xchgrange.h | 2 -
10 files changed, 329 insertions(+), 17 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index b27ceeb93a16..69812594fd71 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -142,6 +142,108 @@ sxi_advance(
sxi->sxi_blockcount -= irec->br_blockcount;
}
+#ifdef DEBUG
+static inline bool
+xfs_swapext_need_rt_conversion(
+ const struct xfs_swapext_req *req)
+{
+ struct xfs_inode *ip = req->ip2;
+ struct xfs_mount *mp = ip->i_mount;
+
+ /* xattrs don't live on the rt device */
+ if (req->whichfork == XFS_ATTR_FORK)
+ return false;
+
+ /*
+ * Caller got permission to use logged swapext, so log recovery will
+ * finish the swap and not leave us with partially swapped rt extents
+ * exposed to userspace.
+ */
+ if (req->req_flags & XFS_SWAP_REQ_LOGGED)
+ return false;
+
+ /*
+ * If we can't use log intent items at all, the only supported
+ * operation is full fork swaps.
+ */
+ if (!xfs_swapext_supported(mp))
+ return false;
+
+ /* Conversion is only needed for realtime files with big rt extents */
+ return xfs_inode_has_bigrtextents(ip);
+}
+
+static inline int
+xfs_swapext_check_rt_extents(
+ struct xfs_mount *mp,
+ const struct xfs_swapext_req *req)
+{
+ struct xfs_bmbt_irec irec1, irec2;
+ xfs_fileoff_t startoff1 = req->startoff1;
+ xfs_fileoff_t startoff2 = req->startoff2;
+ xfs_filblks_t blockcount = req->blockcount;
+ uint32_t mod;
+ int nimaps;
+ int error;
+
+ if (!xfs_swapext_need_rt_conversion(req))
+ return 0;
+
+ while (blockcount > 0) {
+ /* Read extent from the first file */
+ nimaps = 1;
+ error = xfs_bmapi_read(req->ip1, startoff1, blockcount,
+ &irec1, &nimaps, 0);
+ if (error)
+ return error;
+ ASSERT(nimaps == 1);
+
+ /* Read extent from the second file */
+ nimaps = 1;
+ error = xfs_bmapi_read(req->ip2, startoff2,
+ irec1.br_blockcount, &irec2, &nimaps,
+ 0);
+ if (error)
+ return error;
+ ASSERT(nimaps == 1);
+
+ /*
+ * We can only swap as many blocks as the smaller of the two
+ * extent maps.
+ */
+ irec1.br_blockcount = min(irec1.br_blockcount,
+ irec2.br_blockcount);
+
+ /* Both mappings must be aligned to the realtime extent size. */
+ div_u64_rem(irec1.br_startoff, mp->m_sb.sb_rextsize, &mod);
+ if (mod) {
+ ASSERT(mod == 0);
+ return -EINVAL;
+ }
+
+ div_u64_rem(irec2.br_startoff, mp->m_sb.sb_rextsize, &mod);
+ if (mod) {
+ ASSERT(mod == 0);
+ return -EINVAL;
+ }
+
+ div_u64_rem(irec1.br_blockcount, mp->m_sb.sb_rextsize, &mod);
+ if (mod) {
+ ASSERT(mod == 0);
+ return -EINVAL;
+ }
+
+ startoff1 += irec1.br_blockcount;
+ startoff2 += irec1.br_blockcount;
+ blockcount -= irec1.br_blockcount;
+ }
+
+ return 0;
+}
+#else
+# define xfs_swapext_check_rt_extents(mp, req) (0)
+#endif
+
/* Check all extents to make sure we can actually swap them. */
int
xfs_swapext_check_extents(
@@ -161,12 +263,7 @@ xfs_swapext_check_extents(
ifp2->if_format == XFS_DINODE_FMT_LOCAL)
return -EINVAL;
- /* We don't support realtime data forks yet. */
- if (!XFS_IS_REALTIME_INODE(req->ip1))
- return 0;
- if (req->whichfork == XFS_ATTR_FORK)
- return 0;
- return -EINVAL;
+ return xfs_swapext_check_rt_extents(mp, req);
}
#ifdef CONFIG_XFS_QUOTA
diff --git a/fs/xfs/libxfs/xfs_swapext.h b/fs/xfs/libxfs/xfs_swapext.h
index 6b610fea150a..155add23d8e2 100644
--- a/fs/xfs/libxfs/xfs_swapext.h
+++ b/fs/xfs/libxfs/xfs_swapext.h
@@ -13,12 +13,11 @@
* This can be done to individual file extents by using the block mapping log
* intent items introduced with reflink and rmap; or to entire file ranges
* using swapext log intent items to track the overall progress across multiple
- * extent mappings. Realtime is not supported yet.
+ * extent mappings.
*/
static inline bool xfs_swapext_supported(struct xfs_mount *mp)
{
- return (xfs_has_reflink(mp) || xfs_has_rmapbt(mp)) &&
- !xfs_has_realtime(mp);
+ return xfs_has_reflink(mp) || xfs_has_rmapbt(mp);
}
/*
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 47a583a94d58..3593c0f0ce13 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -989,7 +989,7 @@ xfs_free_file_space(
endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
/* We can only free complete realtime extents. */
- if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
+ if (xfs_inode_has_bigrtextents(ip)) {
startoffset_fsb = roundup_64(startoffset_fsb,
mp->m_sb.sb_rextsize);
endoffset_fsb = rounddown_64(endoffset_fsb,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b4629c8aa6b7..87dfb05640a8 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1181,7 +1181,7 @@ xfs_file_xchg_range(
goto out_err;
/* Prepare and then exchange file contents. */
- error = xfs_xchg_range_prep(file1, file2, fxr);
+ error = xfs_xchg_range_prep(file1, file2, fxr, priv_flags);
if (error)
goto out_unlock;
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 4b01d078ace2..444c43571e31 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -287,6 +287,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
}
+static inline bool xfs_inode_has_bigrtextents(struct xfs_inode *ip)
+{
+ return XFS_IS_REALTIME_INODE(ip) && ip->i_mount->m_sb.sb_rextsize > 1;
+}
+
/*
* Return the buftarg used for data allocations on a given inode.
*/
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 790191316a32..883333036519 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -21,6 +21,7 @@
#include "xfs_sb.h"
#include "xfs_log_priv.h"
#include "xfs_health.h"
+#include "xfs_trace.h"
/*
* Read and return the summary information for a given extent size,
@@ -1461,3 +1462,138 @@ xfs_rtpick_extent(
*pick = b;
return 0;
}
+
+/*
+ * Decide if this is an unwritten extent that isn't aligned to a rt extent
+ * boundary. If it is, shorten the mapping so that we're ready to convert
+ * everything up to the next rt extent to a zeroed written extent. If not,
+ * return false.
+ */
+static inline bool
+xfs_rtfile_want_conversion(
+ struct xfs_mount *mp,
+ struct xfs_bmbt_irec *irec)
+{
+ xfs_fileoff_t rext_next;
+ uint32_t modoff, modcnt;
+
+ if (irec->br_state != XFS_EXT_UNWRITTEN)
+ return false;
+
+ div_u64_rem(irec->br_startoff, mp->m_sb.sb_rextsize, &modoff);
+ if (modoff == 0) {
+ uint64_t rexts = div_u64_rem(irec->br_blockcount,
+ mp->m_sb.sb_rextsize, &modcnt);
+
+ if (rexts > 0) {
+ /*
+ * Unwritten mapping starts at an rt extent boundary
+ * and is longer than one rt extent. Round the length
+ * down to the nearest extent but don't select it for
+ * conversion.
+ */
+ irec->br_blockcount -= modcnt;
+ modcnt = 0;
+ }
+
+ /* Unwritten mapping is perfectly aligned, do not convert. */
+ if (modcnt == 0)
+ return false;
+ }
+
+ /*
+ * Unaligned and unwritten; trim to the current rt extent and select it
+ * for conversion.
+ */
+ rext_next = (irec->br_startoff - modoff) + mp->m_sb.sb_rextsize;
+ xfs_trim_extent(irec, irec->br_startoff, rext_next - irec->br_startoff);
+ return true;
+}
+
+/*
+ * For all realtime extents backing the given range of a file, search for
+ * unwritten mappings that do not cover a full rt extent and convert them
+ * to zeroed written mappings. The goal is to end up with one mapping per rt
+ * extent so that we can perform a remapping operation. Callers must ensure
+ * that there are no dirty pages in the given range.
+ */
+int
+xfs_rtfile_convert_unwritten(
+ struct xfs_inode *ip,
+ loff_t pos,
+ uint64_t len)
+{
+ struct xfs_bmbt_irec irec;
+ struct xfs_trans *tp;
+ struct xfs_mount *mp = ip->i_mount;
+ xfs_fileoff_t off;
+ xfs_fileoff_t endoff;
+ unsigned int resblks;
+ int ret;
+
+ if (mp->m_sb.sb_rextsize == 1)
+ return 0;
+
+ off = rounddown_64(XFS_B_TO_FSBT(mp, pos), mp->m_sb.sb_rextsize);
+ endoff = roundup_64(XFS_B_TO_FSB(mp, pos + len), mp->m_sb.sb_rextsize);
+
+ trace_xfs_rtfile_convert_unwritten(ip, pos, len);
+
+ while (off < endoff) {
+ int nmap = 1;
+
+ if (fatal_signal_pending(current))
+ return -EINTR;
+
+ resblks = XFS_DIOSTRAT_SPACE_RES(mp, 1);
+ ret = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0,
+ &tp);
+ if (ret)
+ return ret;
+
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+
+ /*
+ * Read the mapping. If we find an unwritten extent that isn't
+ * aligned to an rt extent boundary...
+ */
+ ret = xfs_bmapi_read(ip, off, endoff - off, &irec, &nmap, 0);
+ if (ret)
+ goto err;
+ ASSERT(nmap == 1);
+ ASSERT(irec.br_startoff == off);
+ if (!xfs_rtfile_want_conversion(mp, &irec)) {
+ xfs_trans_cancel(tp);
+ off += irec.br_blockcount;
+ continue;
+ }
+
+ /*
+ * ...make sure this partially unwritten rt extent gets
+ * converted to a zeroed written extent that we can remap.
+ */
+ nmap = 1;
+ ret = xfs_bmapi_write(tp, ip, off, irec.br_blockcount,
+ XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO, 0, &irec,
+ &nmap);
+ if (ret)
+ goto err;
+ ASSERT(nmap == 1);
+ if (irec.br_state != XFS_EXT_NORM) {
+ ASSERT(0);
+ ret = -EIO;
+ goto err;
+ }
+ ret = xfs_trans_commit(tp);
+ if (ret)
+ return ret;
+
+ off += irec.br_blockcount;
+ }
+
+ return 0;
+err:
+ xfs_trans_cancel(tp);
+ return ret;
+}
diff --git a/fs/xfs/xfs_rtalloc.h b/fs/xfs/xfs_rtalloc.h
index 3b2f1b499a11..e440f793dd98 100644
--- a/fs/xfs/xfs_rtalloc.h
+++ b/fs/xfs/xfs_rtalloc.h
@@ -140,6 +140,8 @@ int xfs_rtalloc_extent_is_free(struct xfs_mount *mp, struct xfs_trans *tp,
xfs_rtblock_t start, xfs_extlen_t len,
bool *is_free);
int xfs_rtalloc_reinit_frextents(struct xfs_mount *mp);
+int xfs_rtfile_convert_unwritten(struct xfs_inode *ip, loff_t pos,
+ uint64_t len);
#else
# define xfs_rtallocate_extent(t,b,min,max,l,f,p,rb) (ENOSYS)
# define xfs_rtfree_extent(t,b,l) (ENOSYS)
@@ -164,6 +166,7 @@ xfs_rtmount_init(
}
# define xfs_rtmount_inodes(m) (((mp)->m_sb.sb_rblocks == 0)? 0 : (ENOSYS))
# define xfs_rtunmount_inodes(m)
+# define xfs_rtfile_convert_unwritten(ip, pos, len) (0)
#endif /* CONFIG_XFS_RT */
#endif /* __XFS_RTALLOC_H__ */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index b0ced76af3b9..0802f078a945 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1519,7 +1519,7 @@ DEFINE_IMAP_EVENT(xfs_iomap_alloc);
DEFINE_IMAP_EVENT(xfs_iomap_found);
DECLARE_EVENT_CLASS(xfs_simple_io_class,
- TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),
+ TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, u64 count),
TP_ARGS(ip, offset, count),
TP_STRUCT__entry(
__field(dev_t, dev)
@@ -1527,7 +1527,7 @@ DECLARE_EVENT_CLASS(xfs_simple_io_class,
__field(loff_t, isize)
__field(loff_t, disize)
__field(loff_t, offset)
- __field(size_t, count)
+ __field(u64, count)
),
TP_fast_assign(
__entry->dev = VFS_I(ip)->i_sb->s_dev;
@@ -1538,7 +1538,7 @@ DECLARE_EVENT_CLASS(xfs_simple_io_class,
__entry->count = count;
),
TP_printk("dev %d:%d ino 0x%llx isize 0x%llx disize 0x%llx "
- "pos 0x%llx bytecount 0x%zx",
+ "pos 0x%llx bytecount 0x%llx",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->ino,
__entry->isize,
@@ -1549,7 +1549,7 @@ DECLARE_EVENT_CLASS(xfs_simple_io_class,
#define DEFINE_SIMPLE_IO_EVENT(name) \
DEFINE_EVENT(xfs_simple_io_class, name, \
- TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count), \
+ TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, u64 count), \
TP_ARGS(ip, offset, count))
DEFINE_SIMPLE_IO_EVENT(xfs_delalloc_enospc);
DEFINE_SIMPLE_IO_EVENT(xfs_unwritten_convert);
@@ -3741,6 +3741,9 @@ TRACE_EVENT(xfs_ioctl_clone,
/* unshare tracepoints */
DEFINE_SIMPLE_IO_EVENT(xfs_reflink_unshare);
DEFINE_INODE_ERROR_EVENT(xfs_reflink_unshare_error);
+#ifdef CONFIG_XFS_RT
+DEFINE_SIMPLE_IO_EVENT(xfs_rtfile_convert_unwritten);
+#endif /* CONFIG_XFS_RT */
/* copy on write */
DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_around_shared);
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
index 27bb88dcf228..6a66d09099b0 100644
--- a/fs/xfs/xfs_xchgrange.c
+++ b/fs/xfs/xfs_xchgrange.c
@@ -28,6 +28,7 @@
#include "xfs_sb.h"
#include "xfs_icache.h"
#include "xfs_log.h"
+#include "xfs_rtalloc.h"
/* Lock (and optionally join) two inodes for a file range exchange. */
void
@@ -370,12 +371,58 @@ xfs_swap_extent_forks(
return 0;
}
+/*
+ * There may be partially written rt extents lurking in the ranges to be
+ * swapped. According to the rules for realtime files with big rt extents, we
+ * must guarantee that an outside observer (an IO thread, realistically) can
+ * never see multiple physical rt extents mapped to the same logical file rt
+ * extent. The deferred bmap log intent items that we use under the hood
+ * operate on single block mappings and not rt extents, which means we must
+ * have a strategy to ensure that log recovery after a failure won't stop in
+ * the middle of an rt extent.
+ *
+ * The preferred strategy is to use deferred extent swap log intent items to
+ * track the status of the overall swap operation so that we can complete the
+ * work during crash recovery. If that isn't possible, we fall back to
+ * requiring the selected mappings in both forks to be aligned to rt extent
+ * boundaries. As an aside, the old fork swap routine didn't have this
+ * requirement, but at an extreme cost in flexibility (full files only, and no
+ * support if rmapbt is enabled).
+ */
+static bool
+xfs_xchg_range_need_rt_conversion(
+ struct xfs_inode *ip,
+ unsigned int xchg_flags)
+{
+ struct xfs_mount *mp = ip->i_mount;
+
+ /*
+ * Caller got permission to use logged swapext, so log recovery will
+ * finish the swap and not leave us with partially swapped rt extents
+ * exposed to userspace.
+ */
+ if (xchg_flags & XFS_XCHG_RANGE_LOGGED)
+ return false;
+
+ /*
+ * If we can't use log intent items at all, the only supported
+ * operation is full fork swaps, so no conversions are needed.
+ * The range requirements are enforced by the swapext code itself.
+ */
+ if (!xfs_swapext_supported(mp))
+ return false;
+
+ /* Conversion is only needed for realtime files with big rt extents */
+ return xfs_inode_has_bigrtextents(ip);
+}
+
/* Prepare two files to have their data exchanged. */
int
xfs_xchg_range_prep(
struct file *file1,
struct file *file2,
- struct file_xchg_range *fxr)
+ struct file_xchg_range *fxr,
+ unsigned int xchg_flags)
{
struct xfs_inode *ip1 = XFS_I(file_inode(file1));
struct xfs_inode *ip2 = XFS_I(file_inode(file2));
@@ -439,6 +486,19 @@ xfs_xchg_range_prep(
return error;
}
+ /* Convert unwritten sub-extent mappings if required. */
+ if (xfs_xchg_range_need_rt_conversion(ip2, xchg_flags)) {
+ error = xfs_rtfile_convert_unwritten(ip2, fxr->file2_offset,
+ fxr->length);
+ if (error)
+ return error;
+
+ error = xfs_rtfile_convert_unwritten(ip1, fxr->file1_offset,
+ fxr->length);
+ if (error)
+ return error;
+ }
+
return 0;
}
@@ -656,6 +716,15 @@ xfs_xchg_range(
if (xchg_flags & XFS_XCHG_RANGE_LOGGED)
req.req_flags |= XFS_SWAP_REQ_LOGGED;
+ /*
+ * Round the request length up to the nearest fundamental unit of
+ * allocation. The prep function already checked that the request
+ * offsets and length in @fxr are safe to round up.
+ */
+ if (XFS_IS_REALTIME_INODE(ip2))
+ req.blockcount = roundup_64(req.blockcount,
+ mp->m_sb.sb_rextsize);
+
error = xfs_xchg_range_estimate(&req);
if (error)
return error;
diff --git a/fs/xfs/xfs_xchgrange.h b/fs/xfs/xfs_xchgrange.h
index a0e64408784a..e356fe09a40c 100644
--- a/fs/xfs/xfs_xchgrange.h
+++ b/fs/xfs/xfs_xchgrange.h
@@ -35,6 +35,6 @@ void xfs_xchg_range_rele_log_assist(struct xfs_mount *mp);
int xfs_xchg_range(struct xfs_inode *ip1, struct xfs_inode *ip2,
const struct file_xchg_range *fxr, unsigned int xchg_flags);
int xfs_xchg_range_prep(struct file *file1, struct file *file2,
- struct file_xchg_range *fxr);
+ struct file_xchg_range *fxr, unsigned int xchg_flags);
#endif /* __XFS_XCHGRANGE_H__ */
* [PATCH 20/21] xfs: support non-power-of-two rtextsize with exchange-range
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (18 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 19/21] xfs: make atomic extent swapping support realtime files Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
2022-12-30 22:13 ` [PATCH 21/21] xfs: enable atomic swapext feature Darrick J. Wong
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
The VFS exchange-range alignment checks use (fast) bitmasks to perform
block alignment checks on the exchange parameters. Unfortunately,
bitmasks require that the alignment size be a power of two. This isn't
true for realtime devices, so we have to copy-pasta the VFS checks using
long division for this to work properly.
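To make the failure mode concrete (an illustrative comparison, not code
from the patch): with an allocation unit of 3 blocks, an offset of 6
blocks is exactly two units, yet the bitmask test computes
6 & (3 - 1) == 2 and wrongly reports it as misaligned.

	/* Valid only when align is a power of two. */
	static inline bool
	aligned_pow2(uint64_t value, uint64_t align)
	{
		return (value & (align - 1)) == 0;
	}

	/* Works for any align, at the cost of a 64-bit division. */
	static inline bool
	aligned_any(uint64_t value, uint32_t align)
	{
		uint32_t	mod;

		div_u64_rem(value, align, &mod);
		return mod == 0;
	}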
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_xchgrange.c | 102 +++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 91 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
index 6a66d09099b0..ae030a6f607e 100644
--- a/fs/xfs/xfs_xchgrange.c
+++ b/fs/xfs/xfs_xchgrange.c
@@ -416,6 +416,86 @@ xfs_xchg_range_need_rt_conversion(
return xfs_inode_has_bigrtextents(ip);
}
+/*
+ * Check the alignment of an exchange request when the allocation unit size
+ * isn't a power of two. The VFS helpers use (fast) bitmask-based alignment
+ * checks, but here we have to use slow long division.
+ */
+static int
+xfs_xchg_range_check_rtalign(
+ struct xfs_inode *ip1,
+ struct xfs_inode *ip2,
+ const struct file_xchg_range *fxr)
+{
+ struct xfs_mount *mp = ip1->i_mount;
+ uint32_t rextbytes;
+ uint64_t length = fxr->length;
+ uint64_t blen;
+ loff_t size1, size2;
+
+ rextbytes = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
+ size1 = i_size_read(VFS_I(ip1));
+ size2 = i_size_read(VFS_I(ip2));
+
+ /* The start of both ranges must be aligned to a rt extent. */
+ if (!isaligned_64(fxr->file1_offset, rextbytes) ||
+ !isaligned_64(fxr->file2_offset, rextbytes))
+ return -EINVAL;
+
+ /*
+ * If the caller asked for full files, check that the offset/length
+ * values cover all of both files.
+ */
+ if ((fxr->flags & FILE_XCHG_RANGE_FULL_FILES) &&
+ (fxr->file1_offset != 0 || fxr->file2_offset != 0 ||
+ fxr->length != size1 || fxr->length != size2))
+ return -EDOM;
+
+ if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
+ length = max_t(int64_t, size1 - fxr->file1_offset,
+ size2 - fxr->file2_offset);
+
+ /*
+ * If the user wanted us to exchange up to the infile's EOF, round up
+ * to the next rt extent boundary for this check. Do the same for the
+ * outfile.
+ *
+ * Otherwise, reject the range length if it's not rt extent aligned.
+ * We already confirmed the starting offsets' rt extent block
+ * alignment.
+ */
+ if (fxr->file1_offset + length == size1)
+ blen = roundup_64(size1, rextbytes) - fxr->file1_offset;
+ else if (fxr->file2_offset + length == size2)
+ blen = roundup_64(size2, rextbytes) - fxr->file2_offset;
+ else if (!isaligned_64(length, rextbytes))
+ return -EINVAL;
+ else
+ blen = length;
+
+ /* Don't allow overlapped exchanges within the same file. */
+ if (ip1 == ip2 &&
+ fxr->file2_offset + blen > fxr->file1_offset &&
+ fxr->file1_offset + blen > fxr->file2_offset)
+ return -EINVAL;
+
+ /*
+ * Ensure that we don't exchange a partial EOF rt extent into the
+ * middle of another file.
+ */
+ if (isaligned_64(length, rextbytes))
+ return 0;
+
+ blen = length;
+ if (fxr->file2_offset + length < size2)
+ blen = rounddown_64(blen, rextbytes);
+
+ if (fxr->file1_offset + blen < size1)
+ blen = rounddown_64(blen, rextbytes);
+
+ return blen == length ? 0 : -EINVAL;
+}
+
/* Prepare two files to have their data exchanged. */
int
xfs_xchg_range_prep(
@@ -426,6 +506,7 @@ xfs_xchg_range_prep(
{
struct xfs_inode *ip1 = XFS_I(file_inode(file1));
struct xfs_inode *ip2 = XFS_I(file_inode(file2));
+ unsigned int alloc_unit = xfs_inode_alloc_unitsize(ip2);
int error;
trace_xfs_xchg_range_prep(ip1, fxr, ip2, 0);
@@ -434,18 +515,17 @@ xfs_xchg_range_prep(
if (XFS_IS_REALTIME_INODE(ip1) != XFS_IS_REALTIME_INODE(ip2))
return -EINVAL;
- /*
- * The alignment checks in the VFS helpers cannot deal with allocation
- * units that are not powers of 2. This can happen with the realtime
- * volume if the extent size is set. Note that alignment checks are
- * skipped if FULL_FILES is set.
- */
- if (!(fxr->flags & FILE_XCHG_RANGE_FULL_FILES) &&
- !is_power_of_2(xfs_inode_alloc_unitsize(ip2)))
- return -EOPNOTSUPP;
+ /* Check non-power of two alignment issues, if necessary. */
+ if (XFS_IS_REALTIME_INODE(ip2) && !is_power_of_2(alloc_unit)) {
+ error = xfs_xchg_range_check_rtalign(ip1, ip2, fxr);
+ if (error)
+ return error;
- error = generic_xchg_file_range_prep(file1, file2, fxr,
- xfs_inode_alloc_unitsize(ip2));
+ /* Do the VFS checks with the regular block alignment. */
+ alloc_unit = ip1->i_mount->m_sb.sb_blocksize;
+ }
+
+ error = generic_xchg_file_range_prep(file1, file2, fxr, alloc_unit);
if (error || fxr->length == 0)
return error;
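To make the EOF rounding in the new helper concrete, here is a
standalone sketch (userspace C with made-up sizes; roundup_u64()
merely stands in for the kernel's roundup_64()):

#include <stdint.h>

/* Round x up to the next multiple of y. */
static uint64_t roundup_u64(uint64_t x, uint64_t y)
{
        return ((x + y - 1) / y) * y;
}

/*
 * Example: rextbytes = 12288 (a 12k rt extent) and file1 is 20480
 * bytes (20k) long, so its last rt extent is only partially used.
 * Exchanging from file1_offset = 12288 to EOF gives length = 8192,
 * which is not rt extent aligned but is allowed because the range
 * runs exactly to EOF.  The overlap check then uses
 * roundup_u64(20480, 12288) - 12288 == 12288, one full rt extent.
 */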
* [PATCH 21/21] xfs: enable atomic swapext feature
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
` (19 preceding siblings ...)
2022-12-30 22:13 ` [PATCH 20/21] xfs: support non-power-of-two rtextsize with exchange-range Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
20 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api
From: Darrick J. Wong <djwong@kernel.org>
Add the atomic swapext feature to the set of log-incompat features
that we will permit.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_format.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index bb8bff488017..0c457905cce5 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -393,7 +393,8 @@ xfs_sb_has_incompat_feature(
#define XFS_SB_FEAT_INCOMPAT_LOG_XATTRS (1 << 0) /* Delayed Attributes */
#define XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT (1U << 31) /* file extent swap */
#define XFS_SB_FEAT_INCOMPAT_LOG_ALL \
- (XFS_SB_FEAT_INCOMPAT_LOG_XATTRS)
+ (XFS_SB_FEAT_INCOMPAT_LOG_XATTRS | \
+ XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT)
#define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN ~XFS_SB_FEAT_INCOMPAT_LOG_ALL
static inline bool
xfs_sb_has_incompat_log_feature(
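For context, a minimal standalone sketch of how a log-incompat bit
gates recovery (the macros mirror the hunk above, but
log_has_unknown_features() is illustrative, not the kernel's actual
helper):

#include <stdbool.h>
#include <stdint.h>

#define LOG_XATTRS   (1u << 0)   /* mirrors ..._LOG_XATTRS */
#define LOG_SWAPEXT  (1u << 31)  /* mirrors ..._LOG_SWAPEXT */
#define LOG_ALL      (LOG_XATTRS | LOG_SWAPEXT)
#define LOG_UNKNOWN  (~LOG_ALL)

/*
 * A kernel must refuse to recover a dirty log that contains intent
 * items it cannot replay.  Before this patch, the swapext bit fell
 * into the unknown mask and recovery was rejected; afterwards it is
 * a permitted feature.
 */
static bool log_has_unknown_features(uint32_t log_incompat)
{
        return (log_incompat & LOG_UNKNOWN) != 0;
}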