[PATCH v19 00/18] xfs: online repair support

public inbox for linux-xfs@vger.kernel.org

* [PATCH v19 00/18] xfs: online repair support
@ 2019-08-05  0:34 Darrick J. Wong
  2019-08-05  0:34 ` [PATCH 01/18] xfs: add a repair revalidation function pointer Darrick J. Wong
                   ` (18 more replies)
  0 siblings, 19 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:34 UTC
  To: darrick.wong; +Cc: linux-xfs

Hi all,

This is the first part of the nineteenth revision of a patchset that
adds kernel support to XFS for online metadata scrubbing and repair.
There aren't any on-disk format changes.

New for this version is a rebase against 5.3-rc2, integration with the
health reporting subsystem, and the explicit revalidation of all
metadata structures that were rebuilt.

Patch 1 lays the groundwork for scrub types specifying a revalidation
function that will check everything that the repair function might have
rebuilt.  This will be necessary for the free space and inode btree
repair functions, which rebuild both btrees at once.

Patch 2 ensures that the health reporting query code doesn't get in the
way of post-repair revalidation of all rebuilt metadata structures.

Patch 3 creates a new data structure that provides an abstraction of a
big memory array by using linked lists.  This is where we store records
for btree reconstruction.  This first implementation is memory
inefficient and consumes a /lot/ of kernel memory, but lays the
groundwork for the last patch in the set to convert the implementation
to use a (memfd) swap file, which enables us to use pageable memory
without pounding the slab cache.

Patches 4-10 implement reconstruction of the free space btrees, inode
btrees, reference count btrees, inode records, inode forks, inode block
maps, and symbolic links.

Patch 11 implements a new data structure for storing arbitrary key/value
pairs, which we're going to need to reconstruct extended attribute
forks.

Patches 12-14 clean up the block unmapping code so that we will be able
to perform a mass reset of an inode's fork.  This is a key component for
salvaging extended attributes, freeing all the attr fork blocks, and
reconstructing the extended attribute data.

Patch 15 implements extended attribute salvage operations.  There is no
redundant or secondary xattr metadata, so the best we can do is trawl
through the attr leaves looking for intact entities.

Patch 16 augments scrub to rebuild extended attributes when any of the
attr blocks are fragmented.

Patch 17 implements reconstruction of quota blocks.

Patch 18 converts both in-memory array implementations from the clunky
linked list implementation to something resembling C arrays.  The array
data are backed by a (memfd) file, which means that idle data can be
paged out to disk instead of pinning kernel memory.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-part-one

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-part-one

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-part-one


* [PATCH 01/18] xfs: add a repair revalidation function pointer
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
@ 2019-08-05  0:34 ` Darrick J. Wong
  2019-08-05  0:34 ` [PATCH 02/18] xfs: always rescan allegedly healthy per-ag metadata after repair Darrick J. Wong
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:34 UTC
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Allow repair functions to set a separate function pointer to validate
the metadata that they've rebuilt.  This prevents us from exiting from a
repair function that rebuilds both A and B without checking that both A
and B can pass a scrub test.  We'll need this for the free space and
inode btree repair strategies.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/scrub.c |    5 ++++-
 fs/xfs/scrub/scrub.h |    8 ++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)
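
For anyone who wants to see the dispatch rule in isolation, here is a
minimal userspace sketch of it.  None of this is the kernel code:
every demo_* name and the DEMO_* constants are made up for
illustration.  The only thing carried over from the patch is the rule
that a non-NULL revalidation hook is preferred once a repair has
already run.

```c
#include <stddef.h>

/* Hypothetical stand-in for the "repair already ran" flag. */
#define DEMO_ALREADY_FIXED	(1U << 0)

/* Hypothetical markers recording which structures were checked. */
#define DEMO_CHECKED_A		(1U << 0)
#define DEMO_CHECKED_B		(1U << 1)

struct demo_scrub;

struct demo_ops {
	/* Regular checker for one structure. */
	int	(*scrub)(struct demo_scrub *sc);
	/* Optional post-repair checker; NULL means "use ->scrub". */
	int	(*repair_eval)(struct demo_scrub *sc);
};

struct demo_scrub {
	unsigned int		flags;
	const struct demo_ops	*ops;
	unsigned int		checked;
};

int demo_scrub_a(struct demo_scrub *sc)
{
	sc->checked |= DEMO_CHECKED_A;
	return 0;
}

int demo_scrub_b(struct demo_scrub *sc)
{
	sc->checked |= DEMO_CHECKED_B;
	return 0;
}

/* A repair that rebuilt both A and B must recheck both of them. */
int demo_eval_both(struct demo_scrub *sc)
{
	int	error;

	error = demo_scrub_a(sc);
	if (error)
		return error;
	return demo_scrub_b(sc);
}

const struct demo_ops demo_ops_a = {
	.scrub		= demo_scrub_a,
	.repair_eval	= demo_eval_both,
};

/* Pick the checker the same way the hunk in scrub.c does. */
int demo_check(struct demo_scrub *sc)
{
	if ((sc->flags & DEMO_ALREADY_FIXED) && sc->ops->repair_eval != NULL)
		return sc->ops->repair_eval(sc);
	return sc->ops->scrub(sc);
}

/* Run one check pass and report which structures were looked at. */
unsigned int demo_run(unsigned int flags)
{
	struct demo_scrub	sc = { .flags = flags, .ops = &demo_ops_a };

	if (demo_check(&sc))
		return 0;
	return sc.checked;
}
```

Calling demo_run(0) only checks A, whereas
demo_run(DEMO_ALREADY_FIXED) checks both A and B, which is the
behavior this patch is after for repairs that rebuild two structures.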


diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 15c8c5f3f688..0f0b64d7164b 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -495,7 +495,10 @@ xfs_scrub_metadata(
 		goto out_teardown;
 
 	/* Scrub for errors. */
-	error = sc.ops->scrub(&sc);
+	if ((sc.flags & XREP_ALREADY_FIXED) && sc.ops->repair_eval != NULL)
+		error = sc.ops->repair_eval(&sc);
+	else
+		error = sc.ops->scrub(&sc);
 	if (!(sc.flags & XCHK_TRY_HARDER) && error == -EDEADLOCK) {
 		/*
 		 * Scrubbers return -EDEADLOCK to mean 'try harder'.
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index ad1ceb44a628..94a30637a127 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -27,6 +27,14 @@ struct xchk_meta_ops {
 	/* Repair or optimize the metadata. */
 	int		(*repair)(struct xfs_scrub *);
 
+	/*
+	 * Re-scrub the metadata we repaired, in case there's extra work that
+	 * we need to do to check our repair work.  If this is NULL, we'll use
+	 * the ->scrub function pointer, assuming that the regular scrub is
+	 * sufficient.
+	 */
+	int		(*repair_eval)(struct xfs_scrub *sc);
+
 	/* Decide if we even have this piece of metadata. */
 	bool		(*has)(struct xfs_sb *);
 


* [PATCH 02/18] xfs: always rescan allegedly healthy per-ag metadata after repair
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
  2019-08-05  0:34 ` [PATCH 01/18] xfs: add a repair revalidation function pointer Darrick J. Wong
@ 2019-08-05  0:34 ` Darrick J. Wong
  2019-08-05  0:35 ` [PATCH 03/18] xfs: create a big array data structure Darrick J. Wong
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:34 UTC
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

After an online repair function runs for a per-AG metadata structure,
sc->sick_mask is supposed to reflect the per-AG metadata that the repair
function fixed.  Our next move is to re-check the metadata to assess
the completeness of our repair, so we don't want the rebuilt structure
to be excluded from the rescan just because the health system previously
logged a problem with the data structure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/health.c |   10 ++++++++++
 1 file changed, 10 insertions(+)
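
The mask arithmetic is small enough to sketch on its own.  This is a
simplified userspace model, not the kernel function: the DEMO_SICK_*
bits and demo_healthy_enough() are hypothetical, and the real code
additionally checks that the scrub type belongs to the per-AG health
group before masking anything out.

```c
#include <stdbool.h>

/* Hypothetical per-AG sickness bits. */
#define DEMO_SICK_BNOBT		(1U << 0)
#define DEMO_SICK_CNTBT		(1U << 1)
#define DEMO_SICK_RMAPBT	(1U << 2)

/*
 * ag_sick:  sickness bits currently recorded for the AG
 * mask:     structures that this check wants to be healthy
 * repaired: structures the repair function just rebuilt, or zero if
 *           no repair has run
 *
 * Returns true if the check should go ahead.
 */
bool demo_healthy_enough(unsigned int ag_sick, unsigned int mask,
			 unsigned int repaired)
{
	/*
	 * The health status has not been updated since the repair, so
	 * stale sickness bits for rebuilt structures must not block the
	 * rescan.
	 */
	mask &= ~repaired;
	return (ag_sick & mask) == 0;
}
```

With no repair in flight, a recorded bnobt problem still blocks the
check; once the bnobt has been rebuilt, the same recorded problem is
ignored, but unrelated sickness (say, the rmapbt) still counts.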


diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c
index b2f602811e9d..4865b2180e22 100644
--- a/fs/xfs/scrub/health.c
+++ b/fs/xfs/scrub/health.c
@@ -220,6 +220,16 @@ xchk_ag_btree_healthy_enough(
 		return true;
 	}
 
+	/*
+	 * If we just repaired some AG metadata, sc->sick_mask will reflect all
+	 * the per-AG metadata types that were repaired.  Exclude these from
+	 * the filesystem health query because we have not yet updated the
+	 * health status and we want everything to be scanned.
+	 */
+	if ((sc->flags & XREP_ALREADY_FIXED) &&
+	    type_to_health_flag[sc->sm->sm_type].group == XHG_AG)
+		mask &= ~sc->sick_mask;
+
 	if (xfs_ag_has_sickness(pag, mask)) {
 		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_XFAIL;
 		return false;


* [PATCH 03/18] xfs: create a big array data structure
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
  2019-08-05  0:34 ` [PATCH 01/18] xfs: add a repair revalidation function pointer Darrick J. Wong
  2019-08-05  0:34 ` [PATCH 02/18] xfs: always rescan allegedly healthy per-ag metadata after repair Darrick J. Wong
@ 2019-08-05  0:35 ` Darrick J. Wong
  2019-08-05  0:35 ` [PATCH 04/18] xfs: repair free space btrees Darrick J. Wong
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:35 UTC
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a simple 'big array' data structure for storage of fixed-size
metadata records that will be used to reconstruct a btree index.  For
repair operations, the most important operations are append, iterate,
and sort; while supported, get and set are not for frequent use.

For the initial implementation we will use linked-list containers,
though a subsequent patch will restructure the backend to avoid using
pinned kernel memory.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile      |    1 
 fs/xfs/scrub/array.c |  283 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/array.h |   53 +++++++++
 3 files changed, 337 insertions(+)
 create mode 100644 fs/xfs/scrub/array.c
 create mode 100644 fs/xfs/scrub/array.h
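
As a rough illustration of the idea (not the kernel implementation),
the following userspace sketch stores fixed-size records behind list
nodes and treats an all-zeroes record as NULL.  The demo_* names are
invented here, malloc() stands in for kmem_alloc(), and the lookup
cache and sort support are left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct demo_item {
	struct demo_item	*next;
	/* record bytes follow the header */
};

struct demo_array {
	struct demo_item	*head;
	struct demo_item	**tailp;
	size_t			obj_size;
	uint64_t		nr;
};

void demo_init(struct demo_array *a, size_t obj_size)
{
	a->head = NULL;
	a->tailp = &a->head;
	a->obj_size = obj_size;
	a->nr = 0;
}

/* Append a copy of the record to the end of the array. */
int demo_append(struct demo_array *a, const void *rec)
{
	struct demo_item	*item;

	item = malloc(sizeof(*item) + a->obj_size);
	if (!item)
		return -1;
	item->next = NULL;
	memcpy(item + 1, rec, a->obj_size);
	*a->tailp = item;
	a->tailp = &item->next;
	a->nr++;
	return 0;
}

/* Walk the list to the nr'th item; O(n), hence "not for frequent use". */
static struct demo_item *demo_lookup(struct demo_array *a, uint64_t nr)
{
	struct demo_item	*item = a->head;

	while (item && nr--)
		item = item->next;
	return item;
}

int demo_get(struct demo_array *a, uint64_t nr, void *rec)
{
	struct demo_item	*item = demo_lookup(a, nr);

	if (!item)
		return -1;
	memcpy(rec, item + 1, a->obj_size);
	return 0;
}

/* Zero the record in place; the slot stays in the list. */
int demo_nullify(struct demo_array *a, uint64_t nr)
{
	struct demo_item	*item = demo_lookup(a, nr);

	if (!item)
		return -1;
	memset(item + 1, 0, a->obj_size);
	return 0;
}

/* A record is NULL if every byte of it is zero. */
bool demo_is_null(const struct demo_array *a, const void *rec)
{
	const unsigned char	*p = rec;
	size_t			i;

	for (i = 0; i < a->obj_size; i++)
		if (p[i])
			return false;
	return true;
}

/* Count the records that iteration would not skip. */
uint64_t demo_count_live(const struct demo_array *a)
{
	const struct demo_item	*item;
	uint64_t		live = 0;

	for (item = a->head; item; item = item->next)
		if (!demo_is_null(a, item + 1))
			live++;
	return live;
}

void demo_destroy(struct demo_array *a)
{
	struct demo_item	*item, *n;

	for (item = a->head; item; item = n) {
		n = item->next;
		free(item);
	}
	demo_init(a, a->obj_size);
}
```

One consequence of the all-zeroes convention is that a record type
whose legitimate values include all-zeroes cannot be stored as-is;
callers have to pick record layouts where zero never means "valid".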


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 06b68b6115bc..0ace13e94d98 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -160,6 +160,7 @@ xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota.o
 ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
+				   array.o \
 				   bitmap.o \
 				   repair.o \
 				   )
diff --git a/fs/xfs/scrub/array.c b/fs/xfs/scrub/array.c
new file mode 100644
index 000000000000..4089e595df8b
--- /dev/null
+++ b/fs/xfs/scrub/array.c
@@ -0,0 +1,283 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "scrub/array.h"
+
+/*
+ * XFS Fixed-Size Big Memory Array
+ * ===============================
+ * The big memory array uses a list to store large numbers of fixed-size
+ * records in memory.  Access to the array is performed via indexed get and set
+ * methods, and an append method is provided for convenience.  Array elements
+ * can be set to all zeroes, which means that the entry is NULL and will be
+ * skipped during iteration.
+ */
+
+struct xa_item {
+	struct list_head	list;
+	/* array item comes after here */
+};
+
+#define XA_ITEM_SIZE(sz)	(sizeof(struct xa_item) + (sz))
+
+/* Initialize a big memory array. */
+struct xfbma *
+xfbma_init(
+	size_t		obj_size)
+{
+	struct xfbma	*array;
+	int		error;
+
+	error = -ENOMEM;
+	array = kmem_alloc(sizeof(struct xfbma) + obj_size,
+			KM_NOFS | KM_MAYFAIL);
+	if (!array)
+		return ERR_PTR(error);
+
+	array->obj_size = obj_size;
+	array->nr = 0;
+	INIT_LIST_HEAD(&array->list);
+	memset(&array->cache, 0, sizeof(array->cache));
+
+	return array;
+}
+
+void
+xfbma_destroy(
+	struct xfbma	*array)
+{
+	struct xa_item	*item, *n;
+
+	list_for_each_entry_safe(item, n, &array->list, list) {
+		list_del(&item->list);
+		kmem_free(item);
+	}
+	kmem_free(array);
+}
+
+/* Find something in the cache. */
+static struct xa_item *
+xfbma_cache_lookup(
+	struct xfbma	*array,
+	uint64_t	nr)
+{
+	uint64_t	i;
+
+	for (i = 0; i < XMA_CACHE_SIZE; i++)
+		if (array->cache[i].nr == nr && array->cache[i].item)
+			return array->cache[i].item;
+	return NULL;
+}
+
+/* Invalidate the lookup cache. */
+static void
+xfbma_cache_invalidate(
+	struct xfbma	*array)
+{
+	memset(array->cache, 0, sizeof(array->cache));
+}
+
+/* Put something in the cache. */
+static void
+xfbma_cache_store(
+	struct xfbma	*array,
+	uint64_t	nr,
+	struct xa_item	*item)
+{
+	memmove(array->cache + 1, array->cache,
+			sizeof(struct xma_cache) * (XMA_CACHE_SIZE - 1));
+	array->cache[0].item = item;
+	array->cache[0].nr = nr;
+}
+
+/* Find a particular array item. */
+static struct xa_item *
+xfbma_lookup(
+	struct xfbma	*array,
+	uint64_t	nr)
+{
+	struct xa_item	*item;
+	uint64_t	i;
+
+	if (nr >= array->nr) {
+		ASSERT(0);
+		return NULL;
+	}
+
+	item = xfbma_cache_lookup(array, nr);
+	if (item)
+		return item;
+
+	i = 0;
+	list_for_each_entry(item, &array->list, list) {
+		if (i == nr) {
+			xfbma_cache_store(array, nr, item);
+			return item;
+		}
+		i++;
+	}
+	return NULL;
+}
+
+/* Get an element from the array. */
+int
+xfbma_get(
+	struct xfbma	*array,
+	uint64_t	nr,
+	void		*ptr)
+{
+	struct xa_item	*item;
+
+	item = xfbma_lookup(array, nr);
+	if (!item)
+		return -ENODATA;
+	memcpy(ptr, item + 1, array->obj_size);
+	return 0;
+}
+
+/* Put an element in the array. */
+int
+xfbma_set(
+	struct xfbma	*array,
+	uint64_t	nr,
+	void		*ptr)
+{
+	struct xa_item	*item;
+
+	item = xfbma_lookup(array, nr);
+	if (!item)
+		return -ENODATA;
+	memcpy(item + 1, ptr, array->obj_size);
+	return 0;
+}
+
+/* Is this array element NULL? */
+bool
+xfbma_is_null(
+	struct xfbma	*array,
+	void		*ptr)
+{
+	return !memchr_inv(ptr, 0, array->obj_size);
+}
+
+/* Put an element in the first NULL slot, or append it if there is none. */
+int
+xfbma_insert_anywhere(
+	struct xfbma	*array,
+	void		*ptr)
+{
+	struct xa_item	*item;
+
+	/* Find a null slot to put it in. */
+	list_for_each_entry(item, &array->list, list) {
+		if (!xfbma_is_null(array, item + 1))
+			continue;
+		memcpy(item + 1, ptr, array->obj_size);
+		return 0;
+	}
+
+	/* No null slots, just dump it on the end. */
+	return xfbma_append(array, ptr);
+}
+
+/* NULL an element in the array. */
+int
+xfbma_nullify(
+	struct xfbma	*array,
+	uint64_t	nr)
+{
+	struct xa_item	*item;
+
+	item = xfbma_lookup(array, nr);
+	if (!item)
+		return -ENODATA;
+	memset(item + 1, 0, array->obj_size);
+	return 0;
+}
+
+/* Append an element to the array. */
+int
+xfbma_append(
+	struct xfbma	*array,
+	void		*ptr)
+{
+	struct xa_item	*item;
+
+	item = kmem_alloc(XA_ITEM_SIZE(array->obj_size), KM_NOFS | KM_MAYFAIL);
+	if (!item)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&item->list);
+	memcpy(item + 1, ptr, array->obj_size);
+	list_add_tail(&item->list, &array->list);
+	array->nr++;
+	return 0;
+}
+
+/*
+ * Iterate every element in this array, freeing each element as we go.
+ * Array elements will be shifted down.
+ */
+int
+xfbma_iter_del(
+	struct xfbma	*array,
+	xfbma_iter_fn	iter_fn,
+	void		*priv)
+{
+	struct xa_item	*item, *n;
+	int		error = 0;
+
+	list_for_each_entry_safe(item, n, &array->list, list) {
+		if (xfbma_is_null(array, item + 1))
+			goto next;
+		memcpy(array + 1, item + 1, array->obj_size);
+		error = iter_fn(array + 1, priv);
+		if (error)
+			break;
+next:
+		list_del(&item->list);
+		kmem_free(item);
+		array->nr--;
+	}
+
+	xfbma_cache_invalidate(array);
+	return error;
+}
+
+/* Return length of array. */
+uint64_t
+xfbma_length(
+	struct xfbma	*array)
+{
+	return array->nr;
+}
+
+static int
+xfbma_item_cmp(
+	void			*priv,
+	struct list_head	*a,
+	struct list_head	*b)
+{
+	xfbma_cmp_fn		cmp_fn = priv;
+	struct xa_item		*ai, *bi;
+
+	ai = container_of(a, struct xa_item, list);
+	bi = container_of(b, struct xa_item, list);
+
+	return cmp_fn(ai + 1, bi + 1);
+}
+
+/* Sort everything in this array. */
+int
+xfbma_sort(
+	struct xfbma	*array,
+	xfbma_cmp_fn	cmp_fn)
+{
+	list_sort(cmp_fn, &array->list, xfbma_item_cmp);
+	return 0;
+}
diff --git a/fs/xfs/scrub/array.h b/fs/xfs/scrub/array.h
new file mode 100644
index 000000000000..607e664147b3
--- /dev/null
+++ b/fs/xfs/scrub/array.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#ifndef __XFS_SCRUB_ARRAY_H__
+#define __XFS_SCRUB_ARRAY_H__
+
+struct xa_item;
+
+struct xma_cache {
+	uint64_t	nr;
+	struct xa_item	*item;
+};
+
+#define XMA_CACHE_SIZE	(8)
+
+struct xfbma {
+	struct list_head	list;
+	size_t			obj_size;
+	uint64_t		nr;
+	struct xma_cache	cache[XMA_CACHE_SIZE];
+};
+
+struct xfbma *xfbma_init(size_t obj_size);
+void xfbma_destroy(struct xfbma *array);
+int xfbma_get(struct xfbma *array, uint64_t nr, void *ptr);
+int xfbma_set(struct xfbma *array, uint64_t nr, void *ptr);
+int xfbma_insert_anywhere(struct xfbma *array, void *ptr);
+bool xfbma_is_null(struct xfbma *array, void *ptr);
+int xfbma_nullify(struct xfbma *array, uint64_t nr);
+int xfbma_append(struct xfbma *array, void *ptr);
+uint64_t xfbma_length(struct xfbma *array);
+
+/*
+ * Iterator functions return zero for success, a negative error code to abort
+ * with an error, or XFBMA_ITERATE_ABORT to stop iterating.
+ */
+#define XFBMA_ITERATE_ABORT	(1)
+typedef int (*xfbma_iter_fn)(const void *item, void *priv);
+
+int xfbma_iter_del(struct xfbma *array, xfbma_iter_fn iter_fn, void *priv);
+
+typedef int (*xfbma_cmp_fn)(const void *a, const void *b);
+
+int xfbma_sort(struct xfbma *array, xfbma_cmp_fn cmp_fn);
+
+#define foreach_xfbma_item(array, i, rec) \
+	for ((i) = 0; (i) < xfbma_length((array)); (i)++) \
+		if (xfbma_get((array), (i), &(rec)) == 0 && \
+		    !xfbma_is_null((array), &(rec)))
+
+#endif /* __XFS_SCRUB_ARRAY_H__ */


* [PATCH 04/18] xfs: repair free space btrees
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (2 preceding siblings ...)
  2019-08-05  0:35 ` [PATCH 03/18] xfs: create a big array data structure Darrick J. Wong
@ 2019-08-05  0:35 ` Darrick J. Wong
  2019-08-05  0:35 ` [PATCH 05/18] xfs: repair inode btrees Darrick J. Wong
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:35 UTC
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Rebuild the free space btrees from the gaps in the rmap btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile             |    1 
 fs/xfs/libxfs/xfs_ag_resv.c |    2 
 fs/xfs/scrub/alloc_repair.c |  595 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.c       |    8 +
 fs/xfs/scrub/repair.h       |    8 +
 fs/xfs/scrub/scrub.c        |    6 
 fs/xfs/scrub/trace.h        |    2 
 fs/xfs/xfs_extent_busy.c    |   13 +
 fs/xfs/xfs_extent_busy.h    |    2 
 fs/xfs/xfs_mount.h          |    7 +
 10 files changed, 640 insertions(+), 4 deletions(-)
 create mode 100644 fs/xfs/scrub/alloc_repair.c
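
The gap-finding step can be sketched in userspace as follows.  This is
an illustration under assumptions, not the repair code: demo_extent
and demo_find_gaps() are hypothetical, the input is a plain array
sorted by start block (the order in which an rmapbt walk returns
records), and the OWN_AG and AGFL bookkeeping is omitted.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_extent {
	uint32_t	bno;	/* first block of the extent */
	uint32_t	len;	/* number of blocks */
};

/*
 * Find the unused space between reverse-mapping records.  The records
 * must be sorted by bno and may overlap, as they can on a reflink
 * filesystem.  Gaps are written to free_out; returns the number of
 * gaps found, or -1 if free_out is too small.
 */
int demo_find_gaps(const struct demo_extent *recs, size_t nr,
		   uint32_t ag_length, struct demo_extent *free_out,
		   size_t max_free)
{
	uint32_t	next_bno = 0;	/* next block we expect to see */
	size_t		nfree = 0;
	size_t		i;

	for (i = 0; i < nr; i++) {
		uint32_t	end = recs[i].bno + recs[i].len;

		/* A record starting past next_bno exposes a gap. */
		if (recs[i].bno > next_bno) {
			if (nfree == max_free)
				return -1;
			free_out[nfree].bno = next_bno;
			free_out[nfree].len = recs[i].bno - next_bno;
			nfree++;
		}

		/*
		 * Records can overlap, so only ever push next_bno
		 * forward, never back.
		 */
		if (end > next_bno)
			next_bno = end;
	}

	/* Space between the last record and the end of the AG. */
	if (next_bno < ag_length) {
		if (nfree == max_free)
			return -1;
		free_out[nfree].bno = next_bno;
		free_out[nfree].len = ag_length - next_bno;
		nfree++;
	}
	return (int)nfree;
}
```

Taking the maximum rather than simply assigning the end of the current
record is what keeps a short record nested inside a longer one from
producing a bogus gap.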


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 0ace13e94d98..f1a1a2a47805 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -160,6 +160,7 @@ xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota.o
 ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
+				   alloc_repair.o \
 				   array.o \
 				   bitmap.o \
 				   repair.o \
diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 87a9747f1d36..3f79958ce08e 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -381,6 +381,8 @@ xfs_ag_resv_free_extent(
 		/* fall through */
 	case XFS_AG_RESV_NONE:
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, (int64_t)len);
+		/* fall through */
+	case XFS_AG_RESV_IGNORE:
 		return;
 	}
 
diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
new file mode 100644
index 000000000000..7c98a2f76ee7
--- /dev/null
+++ b/fs/xfs/scrub/alloc_repair.c
@@ -0,0 +1,595 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_alloc.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_inode.h"
+#include "xfs_refcount.h"
+#include "xfs_extent_busy.h"
+#include "xfs_health.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+#include "scrub/array.h"
+
+/*
+ * Free Space Btree Repair
+ * =======================
+ *
+ * The reverse mappings are supposed to record all space usage for the entire
+ * AG.  Therefore, we can recalculate the free extents in an AG by looking for
+ * gaps in the physical extents recorded in the rmapbt.  On a reflink
+ * filesystem this is a little more tricky in that we have to be aware that
+ * the rmap records are allowed to overlap.
+ *
+ * We derive which blocks belonged to the old bnobt/cntbt by recording all the
+ * OWN_AG extents and subtracting out the blocks owned by all other OWN_AG
+ * metadata: the rmapbt blocks visited while iterating the reverse mappings
+ * and the AGFL blocks.
+ *
+ * Once we have both of those pieces, we can reconstruct the bnobt and cntbt
+ * by blowing out the free block state and freeing all the extents that we
+ * found.  This adds the requirement that we can't have any busy extents in
+ * the AG because the busy code cannot handle duplicate records.
+ *
+ * Note that we can only rebuild both free space btrees at the same time
+ * because the regular extent freeing infrastructure loads both btrees at the
+ * same time.
+ *
+ * We use the prefix 'xrep_abt' here because we regenerate both free space
+ * allocation btrees at the same time.
+ */
+
+struct xrep_abt_extent {
+	xfs_agblock_t		bno;
+	xfs_extlen_t		len;
+} __packed;
+
+struct xrep_abt {
+	/* Blocks owned by the rmapbt or the agfl. */
+	struct xfs_bitmap	nobtlist;
+
+	/* All OWN_AG blocks. */
+	struct xfs_bitmap	*btlist;
+
+	/* Free space extents. */
+	struct xfbma		*free_records;
+
+	struct xfs_scrub	*sc;
+
+	/*
+	 * Next block we anticipate seeing in the rmap records.  If the next
+	 * rmap record is greater than next_bno, we have found unused space.
+	 */
+	xfs_agblock_t		next_bno;
+
+	/* Number of free blocks in this AG. */
+	xfs_agblock_t		nr_blocks;
+};
+
+/* Record extents that aren't in use from gaps in the rmap records. */
+STATIC int
+xrep_abt_walk_rmap(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rec,
+	void			*priv)
+{
+	struct xrep_abt		*ra = priv;
+	struct xrep_abt_extent	rae;
+	xfs_fsblock_t		fsb;
+	int			error;
+
+	/* Record all the OWN_AG blocks... */
+	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
+		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
+				rec->rm_startblock);
+		error = xfs_bitmap_set(ra->btlist, fsb, rec->rm_blockcount);
+		if (error)
+			return error;
+	}
+
+	/* ...and all the rmapbt blocks... */
+	error = xfs_bitmap_set_btcur_path(&ra->nobtlist, cur);
+	if (error)
+		return error;
+
+	/* ...and all the free space. */
+	if (rec->rm_startblock > ra->next_bno) {
+		trace_xrep_abt_walk_rmap(cur->bc_mp, cur->bc_private.a.agno,
+				ra->next_bno, rec->rm_startblock - ra->next_bno,
+				XFS_RMAP_OWN_NULL, 0, 0);
+
+		rae.bno = ra->next_bno;
+		rae.len = rec->rm_startblock - ra->next_bno;
+		error = xfbma_append(ra->free_records, &rae);
+		if (error)
+			return error;
+		ra->nr_blocks += rae.len;
+	}
+
+	/*
+	 * rmap records can overlap on reflink filesystems, so project next_bno
+	 * as far out into the AG space as we currently know about.
+	 */
+	ra->next_bno = max_t(xfs_agblock_t, ra->next_bno,
+			rec->rm_startblock + rec->rm_blockcount);
+	return 0;
+}
+
+/* Collect an AGFL block for the not-to-release list. */
+static int
+xrep_abt_walk_agfl(
+	struct xfs_mount	*mp,
+	xfs_agblock_t		bno,
+	void			*priv)
+{
+	struct xrep_abt		*ra = priv;
+	xfs_fsblock_t		fsb;
+
+	fsb = XFS_AGB_TO_FSB(mp, ra->sc->sa.agno, bno);
+	return xfs_bitmap_set(&ra->nobtlist, fsb, 1);
+}
+
+/* Compare two free space extents. */
+static int
+xrep_abt_extent_cmp(
+	const void			*a,
+	const void			*b)
+{
+	const struct xrep_abt_extent	*ap = a;
+	const struct xrep_abt_extent	*bp = b;
+
+	if (ap->bno > bp->bno)
+		return 1;
+	else if (ap->bno < bp->bno)
+		return -1;
+	return 0;
+}
+
+/*
+ * Add a free space record back into the bnobt/cntbt.  It is assumed that the
+ * space is already accounted for in fdblocks, so we use a special per-AG
+ * reservation code to skip the fdblocks update.
+ */
+STATIC int
+xrep_abt_free_extent(
+	const void			*item,
+	void				*priv)
+{
+	struct xfs_scrub		*sc = priv;
+	const struct xrep_abt_extent	*rae = item;
+	xfs_fsblock_t			fsbno;
+	int				error;
+
+	fsbno = XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno);
+
+	error = xfs_free_extent(sc->tp, fsbno, rae->len,
+			&XFS_RMAP_OINFO_SKIP_UPDATE, XFS_AG_RESV_IGNORE);
+	if (error)
+		return error;
+	return xrep_roll_ag_trans(sc);
+}
+
+/* Find the longest free extent in the list. */
+static int
+xrep_abt_get_longest(
+	struct xfbma		*free_records,
+	struct xrep_abt_extent	*longest)
+{
+	struct xrep_abt_extent	rae;
+	uint64_t		victim = -1ULL;
+	uint64_t		i;
+
+	longest->len = 0;
+	foreach_xfbma_item(free_records, i, rae) {
+		if (rae.len > longest->len) {
+			memcpy(longest, &rae, sizeof(*longest));
+			victim = i;
+		}
+	}
+
+	if (longest->len == 0)
+		return 0;
+	return xfbma_nullify(free_records, victim);
+}
+
+/*
+ * Allocate a block from the (cached) first extent in the AG.  In theory
+ * this should never fail, since we already checked that there was enough
+ * space to handle the new btrees.
+ */
+STATIC xfs_agblock_t
+xrep_abt_alloc_block(
+	struct xfs_scrub	*sc,
+	struct xfbma		*free_records)
+{
+	struct xrep_abt_extent	ext = { 0 };
+	uint64_t		i;
+	xfs_agblock_t		agbno;
+	int			error;
+
+	/* Pull the first free space extent off the list, and... */
+	foreach_xfbma_item(free_records, i, ext) {
+		break;
+	}
+	if (ext.len == 0)
+		return NULLAGBLOCK;
+
+	/* ...take its first block. */
+	agbno = ext.bno;
+	ext.bno++;
+	ext.len--;
+	if (ext.len)
+		error = xfbma_set(free_records, i, &ext);
+	else
+		error = xfbma_nullify(free_records, i);
+	if (error)
+		return NULLAGBLOCK;
+	return agbno;
+}
+
+/*
+ * Iterate all reverse mappings to find (1) the free extents, (2) the OWN_AG
+ * extents, (3) the rmapbt blocks, and (4) the AGFL blocks.  The free space is
+ * (1) + (2) - (3) - (4).  Figure out if we have enough free space to
+ * reconstruct the free space btrees.  Caller must clean up the input lists
+ * if something goes wrong.
+ */
+STATIC int
+xrep_abt_find_freespace(
+	struct xfs_scrub	*sc,
+	struct xfbma		*free_records,
+	struct xfs_bitmap	*old_allocbt_blocks)
+{
+	struct xrep_abt		ra = {
+		.sc		= sc,
+		.free_records	= free_records,
+		.btlist		= old_allocbt_blocks,
+	};
+	struct xrep_abt_extent	rae;
+	struct xfs_btree_cur	*cur;
+	struct xfs_mount	*mp = sc->mp;
+	xfs_agblock_t		agend;
+	xfs_agblock_t		nr_blocks;
+	int			error;
+
+	xfs_bitmap_init(&ra.nobtlist);
+
+	/*
+	 * Iterate all the reverse mappings to find gaps in the physical
+	 * mappings, all the OWN_AG blocks, and all the rmapbt extents.
+	 */
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
+	error = xfs_rmap_query_all(cur, xrep_abt_walk_rmap, &ra);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, error);
+	cur = NULL;
+
+	/* Insert a record for space between the last rmap and EOAG. */
+	agend = be32_to_cpu(XFS_BUF_TO_AGF(sc->sa.agf_bp)->agf_length);
+	if (ra.next_bno < agend) {
+		rae.bno = ra.next_bno;
+		rae.len = agend - ra.next_bno;
+		error = xfbma_append(free_records, &rae);
+		if (error)
+			goto err;
+		ra.nr_blocks += rae.len;
+	}
+
+	/* Collect all the AGFL blocks. */
+	error = xfs_agfl_walk(mp, XFS_BUF_TO_AGF(sc->sa.agf_bp),
+			sc->sa.agfl_bp, xrep_abt_walk_agfl, &ra);
+	if (error)
+		goto err;
+
+	/*
+	 * Do we have enough space to rebuild both freespace btrees?  We won't
+	 * touch the AG if we've exceeded the per-AG reservation or if we don't
+	 * have enough free space to store the free space information.
+	 */
+	nr_blocks = 2 * xfs_allocbt_calc_size(mp, xfbma_length(free_records));
+	if (!xrep_ag_has_space(sc->sa.pag, 0, XFS_AG_RESV_NONE) ||
+	    ra.nr_blocks < nr_blocks) {
+		error = -ENOSPC;
+		goto err;
+	}
+
+	/* Compute the old bnobt/cntbt blocks. */
+	error = xfs_bitmap_disunion(old_allocbt_blocks, &ra.nobtlist);
+err:
+	xfs_bitmap_destroy(&ra.nobtlist);
+	if (cur)
+		xfs_btree_del_cursor(cur, error);
+	return error;
+}
+
+/*
+ * Reset the global free block counter and the per-AG counters to make it look
+ * like this AG has no free space.
+ */
+STATIC int
+xrep_abt_reset_counters(
+	struct xfs_scrub	*sc,
+	int			*log_flags)
+{
+	struct xfs_perag	*pag = sc->sa.pag;
+	struct xfs_agf		*agf;
+	xfs_agblock_t		new_btblks;
+	xfs_agblock_t		to_free;
+
+	/*
+	 * Since we're abandoning the old bnobt/cntbt, we have to decrease
+	 * fdblocks by the # of blocks in those trees.  btreeblks counts the
+	 * non-root blocks of the free space and rmap btrees.  Do this before
+	 * resetting the AGF counters.
+	 */
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+
+	/* rmap_blocks accounts root block, btreeblks doesn't */
+	new_btblks = be32_to_cpu(agf->agf_rmap_blocks) - 1;
+
+	/* btreeblks doesn't account bno/cnt root blocks */
+	to_free = pag->pagf_btreeblks + 2;
+
+	/* and don't account for the blocks we aren't freeing */
+	to_free -= new_btblks;
+
+	/*
+	 * Reset the per-AG info, both incore and ondisk.  Mark the incore
+	 * state stale in case we fail out of here.
+	 */
+	ASSERT(pag->pagf_init);
+	pag->pagf_init = 0;
+	pag->pagf_btreeblks = new_btblks;
+	pag->pagf_freeblks = 0;
+	pag->pagf_longest = 0;
+
+	agf->agf_btreeblks = cpu_to_be32(new_btblks);
+	agf->agf_freeblks = 0;
+	agf->agf_longest = 0;
+	*log_flags |= XFS_AGF_BTREEBLKS | XFS_AGF_LONGEST | XFS_AGF_FREEBLKS;
+
+	return 0;
+}
+
+/* Initialize a new free space btree root and implant into AGF. */
+STATIC int
+xrep_abt_reset_btree(
+	struct xfs_scrub	*sc,
+	xfs_btnum_t		btnum,
+	struct xfbma		*free_records)
+{
+	struct xfs_buf		*bp;
+	struct xfs_perag	*pag = sc->sa.pag;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+	const struct xfs_buf_ops *ops;
+	xfs_agblock_t		agbno;
+	int			error;
+
+	/* Allocate new root block. */
+	agbno = xrep_abt_alloc_block(sc, free_records);
+	if (agbno == NULLAGBLOCK)
+		return -ENOSPC;
+
+	switch (btnum) {
+	case XFS_BTNUM_BNOi:
+		ops = &xfs_bnobt_buf_ops;
+		break;
+	case XFS_BTNUM_CNTi:
+		ops = &xfs_cntbt_buf_ops;
+		break;
+	default:
+		ASSERT(0);
+		return -EFSCORRUPTED;
+	}
+
+	/* Initialize new tree root. */
+	error = xrep_init_btblock(sc, XFS_AGB_TO_FSB(mp, sc->sa.agno, agbno),
+			&bp, btnum, ops);
+	if (error)
+		return error;
+
+	/* Implant into AGF. */
+	agf->agf_roots[btnum] = cpu_to_be32(agbno);
+	agf->agf_levels[btnum] = cpu_to_be32(1);
+
+	/* Add rmap records for the btree roots */
+	error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno, agbno, 1,
+			&XFS_RMAP_OINFO_AG);
+	if (error)
+		return error;
+
+	/* Reset the incore state. */
+	pag->pagf_levels[btnum] = 1;
+
+	return 0;
+}
+
+/* Initialize new bnobt/cntbt roots and implant them into the AGF. */
+STATIC int
+xrep_abt_reset_btrees(
+	struct xfs_scrub	*sc,
+	struct xfbma		*free_records,
+	int			*log_flags)
+{
+	int			error;
+
+	error = xrep_abt_reset_btree(sc, XFS_BTNUM_BNOi, free_records);
+	if (error)
+		return error;
+	error = xrep_abt_reset_btree(sc, XFS_BTNUM_CNTi, free_records);
+	if (error)
+		return error;
+
+	*log_flags |= XFS_AGF_ROOTS | XFS_AGF_LEVELS;
+	return 0;
+}
+
+/*
+ * Make our new freespace btree roots permanent so that we can start freeing
+ * unused space back into the AG.
+ */
+STATIC int
+xrep_abt_commit_new(
+	struct xfs_scrub	*sc,
+	struct xfs_bitmap	*old_allocbt_blocks,
+	int			log_flags)
+{
+	int			error;
+
+	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
+
+	/* Invalidate the old freespace btree blocks and commit. */
+	error = xrep_invalidate_blocks(sc, old_allocbt_blocks);
+	if (error)
+		return error;
+	error = xrep_roll_ag_trans(sc);
+	if (error)
+		return error;
+
+	/* Now that we've succeeded, mark the incore state valid again. */
+	sc->sa.pag->pagf_init = 1;
+	return 0;
+}
+
+/* Build new free space btrees and dispose of the old ones. */
+STATIC int
+xrep_abt_rebuild_trees(
+	struct xfs_scrub	*sc,
+	struct xfbma		*free_records,
+	struct xfs_bitmap	*old_allocbt_blocks)
+{
+	struct xrep_abt_extent	rae;
+	int			error;
+
+	/*
+	 * Insert the longest free extent in case it's necessary to
+	 * refresh the AGFL with multiple blocks.  If there is no longest
+	 * extent, we had exactly the free space we needed; we're done.
+	 */
+	error = xrep_abt_get_longest(free_records, &rae);
+	if (!error && rae.len > 0) {
+		error = xrep_abt_free_extent(&rae, sc);
+		if (error)
+			return error;
+	}
+
+	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
+	error = xrep_reap_extents(sc, old_allocbt_blocks, &XFS_RMAP_OINFO_AG,
+			XFS_AG_RESV_IGNORE);
+	if (error)
+		return error;
+
+	/* Insert records into the new btrees. */
+	return xfbma_iter_del(free_records, xrep_abt_free_extent, sc);
+}
+
+/* Repair the freespace btrees for some AG. */
+int
+xrep_allocbt(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_bitmap	old_allocbt_blocks;
+	struct xfbma		*free_records;
+	struct xfs_mount	*mp = sc->mp;
+	int			log_flags = 0;
+	int			error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	/* We rebuild both data structures. */
+	sc->sick_mask = XFS_SICK_AG_BNOBT | XFS_SICK_AG_CNTBT;
+
+	xchk_perag_get(sc->mp, &sc->sa);
+
+	/*
+	 * Make sure the busy extent list is clear because we can't put
+	 * extents on there twice.
+	 */
+	if (!xfs_extent_busy_list_empty(sc->sa.pag))
+		return -EDEADLOCK;
+
+	/* Set up some storage */
+	free_records = xfbma_init(sizeof(struct xrep_abt_extent));
+	if (IS_ERR(free_records))
+		return PTR_ERR(free_records);
+
+	/* Collect the free space data and find the old btree blocks. */
+	xfs_bitmap_init(&old_allocbt_blocks);
+	error = xrep_abt_find_freespace(sc, free_records, &old_allocbt_blocks);
+	if (error)
+		goto out;
+
+	/* Make sure we got some free space. */
+	if (xfbma_length(free_records) == 0) {
+		error = -ENOSPC;
+		goto out;
+	}
+
+	/*
+	 * Sort the free extents by block number to avoid bnobt splits when we
+	 * rebuild the free space btrees.
+	 */
+	error = xfbma_sort(free_records, xrep_abt_extent_cmp);
+	if (error)
+		goto out;
+
+	/*
+	 * Blow out the old free space btrees.  This is the point at which
+	 * we are no longer able to bail out gracefully.
+	 */
+	error = xrep_abt_reset_counters(sc, &log_flags);
+	if (error)
+		goto out;
+	error = xrep_abt_reset_btrees(sc, free_records, &log_flags);
+	if (error)
+		goto out;
+	error = xrep_abt_commit_new(sc, &old_allocbt_blocks, log_flags);
+	if (error)
+		goto out;
+
+	/* Now rebuild the freespace information. */
+	error = xrep_abt_rebuild_trees(sc, free_records, &old_allocbt_blocks);
+out:
+	xfbma_destroy(free_records);
+	xfs_bitmap_destroy(&old_allocbt_blocks);
+	return error;
+}
+
+/* Make sure both btrees are ok after we've rebuilt them. */
+int
+xrep_revalidate_allocbt(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	error = xchk_bnobt(sc);
+	if (error)
+		return error;
+
+	return xchk_cntbt(sc);
+}
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 18876056e5e0..4a49a9099477 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -634,8 +634,14 @@ xchk_setup_ag_btree(
 	 * expensive operation should be performed infrequently and only
 	 * as a last resort.  Any caller that sets force_log should
 	 * document why they need to do so.
+	 *
+	 * Force everything in memory out to disk if we're repairing.
+	 * This ensures we won't get tripped up by btree blocks sitting
+	 * in memory waiting to have LSNs stamped in.  The AGF/AGI repair
+	 * routines use any available rmap data to try to find a btree
+	 * root that also passes the read verifiers.
 	 */
-	if (force_log) {
+	if (force_log || (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)) {
 		error = xchk_checkpoint_log(mp);
 		if (error)
 			return error;
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 60c61d7052a8..5a6a1cd437d7 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -52,6 +52,10 @@ int xrep_find_ag_btree_roots(struct xfs_scrub *sc, struct xfs_buf *agf_bp,
 void xrep_force_quotacheck(struct xfs_scrub *sc, uint dqtype);
 int xrep_ino_dqattach(struct xfs_scrub *sc);
 
+/* Metadata revalidators */
+
+int xrep_revalidate_allocbt(struct xfs_scrub *sc);
+
 /* Metadata repairers */
 
 int xrep_probe(struct xfs_scrub *sc);
@@ -59,6 +63,7 @@ int xrep_superblock(struct xfs_scrub *sc);
 int xrep_agf(struct xfs_scrub *sc);
 int xrep_agfl(struct xfs_scrub *sc);
 int xrep_agi(struct xfs_scrub *sc);
+int xrep_allocbt(struct xfs_scrub *sc);
 
 #else
 
@@ -79,11 +84,14 @@ xrep_calc_ag_resblks(
 	return 0;
 }
 
+#define xrep_revalidate_allocbt		(NULL)
+
 #define xrep_probe			xrep_notsupported
 #define xrep_superblock			xrep_notsupported
 #define xrep_agf			xrep_notsupported
 #define xrep_agfl			xrep_notsupported
 #define xrep_agi			xrep_notsupported
+#define xrep_allocbt			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 0f0b64d7164b..b42ac8ecdb49 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -217,13 +217,15 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_ag_allocbt,
 		.scrub	= xchk_bnobt,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_allocbt,
+		.repair_eval = xrep_revalidate_allocbt,
 	},
 	[XFS_SCRUB_TYPE_CNTBT] = {	/* cntbt */
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_ag_allocbt,
 		.scrub	= xchk_cntbt,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_allocbt,
+		.repair_eval = xrep_revalidate_allocbt,
 	},
 	[XFS_SCRUB_TYPE_INOBT] = {	/* inobt */
 		.type	= ST_PERAG,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 3362bae28b46..d43b6003a088 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -722,7 +722,7 @@ DEFINE_EVENT(xrep_rmap_class, name, \
 		 xfs_agblock_t agbno, xfs_extlen_t len, \
 		 uint64_t owner, uint64_t offset, unsigned int flags), \
 	TP_ARGS(mp, agno, agbno, len, owner, offset, flags))
-DEFINE_REPAIR_RMAP_EVENT(xrep_alloc_extent_fn);
+DEFINE_REPAIR_RMAP_EVENT(xrep_abt_walk_rmap);
 DEFINE_REPAIR_RMAP_EVENT(xrep_ialloc_extent_fn);
 DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn);
 DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn);
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index 0ed68379e551..cc83b0687b9b 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -657,3 +657,16 @@ xfs_extent_busy_ag_cmp(
 		diff = b1->bno - b2->bno;
 	return diff;
 }
+
+/* Are there any busy extents in this AG? */
+bool
+xfs_extent_busy_list_empty(
+	struct xfs_perag	*pag)
+{
+	bool			res;
+
+	spin_lock(&pag->pagb_lock);
+	res = RB_EMPTY_ROOT(&pag->pagb_tree);
+	spin_unlock(&pag->pagb_lock);
+	return res;
+}
diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
index 990ab3891971..2f8c73c712c6 100644
--- a/fs/xfs/xfs_extent_busy.h
+++ b/fs/xfs/xfs_extent_busy.h
@@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
 	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
 }
 
+bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
+
 #endif /* __XFS_EXTENT_BUSY_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 4adb6837439a..f40283df29cc 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -332,6 +332,13 @@ enum xfs_ag_resv_type {
 	XFS_AG_RESV_AGFL,
 	XFS_AG_RESV_METADATA,
 	XFS_AG_RESV_RMAPBT,
+
+	/*
+	 * Don't increase fdblocks when freeing extent.  This is a pony for
+	 * the bnobt repair functions to re-free the free space without
+	 * altering fdblocks.  If you think you need this you're wrong.
+	 */
+	XFS_AG_RESV_IGNORE,
 };
 
 struct xfs_ag_resv {


* [PATCH 05/18] xfs: repair inode btrees
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (3 preceding siblings ...)
  2019-08-05  0:35 ` [PATCH 04/18] xfs: repair free space btrees Darrick J. Wong
@ 2019-08-05  0:35 ` Darrick J. Wong
  2019-08-05  0:35 ` [PATCH 06/18] xfs: repair refcount btrees Darrick J. Wong
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:35 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Use the rmapbt to find inode chunks, query the chunks to compute
hole and free masks, and with that information rebuild the inobt
and finobt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
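A note for reviewers: the hole and free mask arithmetic described above can
be sketched as a small userspace program.  This is only an illustration
under stated assumptions, not the kernel code in this patch.  The names
sketch_irec, sketch_maskn, and the sketch_irec_* helpers are hypothetical
stand-ins, and the caller passes a bitmask of in-use inodes instead of
reading di_mode from a cluster buffer.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_INODES_PER_CHUNK		64
#define SKETCH_INODES_PER_HOLEMASK_BIT	4

/* Hypothetical stand-in for the incore inobt record under construction. */
struct sketch_irec {
	uint64_t	free_mask;	/* bit set: inode is free (or a hole) */
	uint16_t	holemask;	/* bit set: 4 inodes have no disk space */
	unsigned int	count;		/* inodes backed by disk space */
};

/* Count the set bits in a 64-bit value. */
static unsigned int sketch_popcount64(uint64_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/* Build a mask of @n bits starting at bit @i. */
static uint64_t sketch_maskn(unsigned int i, unsigned int n)
{
	uint64_t m = (n >= 64) ? ~0ULL : ((1ULL << n) - 1);

	return m << i;
}

/* Start with a record that is all holes and all free. */
static void sketch_irec_init(struct sketch_irec *r)
{
	r->free_mask = ~0ULL;
	r->holemask = 0xFFFF;
	r->count = 0;
}

/*
 * Account for one cluster of @nr_inodes inodes that starts @cluster_base
 * inodes into the chunk.  Bit N of @inuse is set if inode N of the cluster
 * is allocated (an assumption standing in for the di_mode check).
 */
static void sketch_irec_add_cluster(struct sketch_irec *r,
		unsigned int cluster_base, unsigned int nr_inodes,
		uint64_t inuse)
{
	assert(cluster_base + nr_inodes <= SKETCH_INODES_PER_CHUNK);

	r->count += nr_inodes;
	/* Punch this cluster out of the holemask. */
	r->holemask &= (uint16_t)~sketch_maskn(
			cluster_base / SKETCH_INODES_PER_HOLEMASK_BIT,
			nr_inodes / SKETCH_INODES_PER_HOLEMASK_BIT);
	/* Clear the free bit of every in-use inode. */
	r->free_mask &= ~((inuse & sketch_maskn(0, nr_inodes)) << cluster_base);
}

/* Free inodes are the free bits that are not inside holes. */
static unsigned int sketch_irec_freecount(const struct sketch_irec *r)
{
	unsigned int holes;

	holes = sketch_popcount64(r->holemask) * SKETCH_INODES_PER_HOLEMASK_BIT;
	return sketch_popcount64(r->free_mask) - holes;
}
```

For example, adding a single 16-inode cluster at offset 16 with inodes 0 and
3 of that cluster in use leaves a holemask of 0xFF0F, a count of 16, and a
freecount of 14.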
 fs/xfs/Makefile              |    1 
 fs/xfs/scrub/common.c        |    1 
 fs/xfs/scrub/ialloc_repair.c |  743 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.c        |   23 +
 fs/xfs/scrub/repair.h        |   16 +
 fs/xfs/scrub/scrub.c         |    6 
 fs/xfs/scrub/scrub.h         |    1 
 fs/xfs/scrub/trace.h         |    4 
 8 files changed, 791 insertions(+), 4 deletions(-)
 create mode 100644 fs/xfs/scrub/ialloc_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index f1a1a2a47805..3b7fdccf2818 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -163,6 +163,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   alloc_repair.o \
 				   array.o \
 				   bitmap.o \
+				   ialloc_repair.o \
 				   repair.o \
 				   )
 endif
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 4a49a9099477..abe88fa756aa 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -527,6 +527,7 @@ xchk_ag_free(
 	struct xchk_ag		*sa)
 {
 	xchk_ag_btcur_free(sa);
+	xrep_reset_perag_resv(sc);
 	if (sa->agfl_bp) {
 		xfs_trans_brelse(sc->tp, sa->agfl_bp);
 		sa->agfl_bp = NULL;
diff --git a/fs/xfs/scrub/ialloc_repair.c b/fs/xfs/scrub/ialloc_repair.c
new file mode 100644
index 000000000000..95546d100e31
--- /dev/null
+++ b/fs/xfs/scrub/ialloc_repair.c
@@ -0,0 +1,743 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_icache.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_log.h"
+#include "xfs_trans_priv.h"
+#include "xfs_error.h"
+#include "xfs_health.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+#include "scrub/array.h"
+
+/*
+ * Inode Btree Repair
+ * ==================
+ *
+ * A quick refresher of inode btrees on a v5 filesystem:
+ *
+ * - Inode records are read into memory in units of 'inode clusters'.  The
+ *   number of inodes in a cluster buffer is the smallest number of inodes
+ *   that can be allocated or freed.  Clusters are never smaller than one fs
+ *   block, though they can span multiple blocks.  The size (in fs blocks) is
+ *   computed with xfs_icluster_size_fsb().  The fs block alignment of a
+ *   cluster is computed with xfs_ialloc_cluster_alignment().
+ *
+ * - Each inode btree record can describe a single 'inode chunk'.  The chunk
+ *   size is defined to be 64 inodes.  If sparse inodes are enabled, every
+ *   inobt record must be aligned to the chunk size; if not, every record must
+ *   be aligned to the start of a cluster.  It is possible to construct an XFS
+ *   geometry where one inobt record maps to multiple inode clusters; it is
+ *   also possible to construct a geometry where multiple inobt records map to
+ *   different parts of one inode cluster.
+ *
+ * - If sparse inodes are not enabled, the smallest unit of allocation for
+ *   inode records is enough to contain one inode chunk's worth of inodes.
+ *
+ * - If sparse inodes are enabled, the holemask field will be active.  Each
+ *   bit of the holemask represents 4 potential inodes; if set, the
+ *   corresponding space does *not* contain inodes and must be left alone.
+ *   Clusters cannot be smaller than 4 inodes.  The smallest unit of allocation
+ *   of inode records is one inode cluster.
+ *
+ * So what's the rebuild algorithm?
+ *
+ * Iterate the reverse mapping records looking for OWN_INODES and OWN_INOBT
+ * records.  The OWN_INOBT records are the old inode btree blocks and will be
+ * cleared out after we've rebuilt the tree.  Each possible inode cluster
+ * within an OWN_INODES record will be read in; for each possible inobt record
+ * associated with that cluster, compute the freemask from the
+ * i_mode data in the inode chunk.  For sparse inodes the holemask will be
+ * calculated by creating the properly aligned inobt record and punching out
+ * any chunk that's missing.  Inode allocations and frees grab the AGI first,
+ * so repair protects itself from concurrent access by locking the AGI.
+ *
+ * Once we've reconstructed all the inode records, we can create new inode
+ * btree roots and reload the btrees.  We rebuild both inode trees at the same
+ * time because they have the same rmap owner and it would be more complex to
+ * figure out if the other tree isn't in need of a rebuild and which OWN_INOBT
+ * blocks it owns.  We have all the data we need to build both, so dump
+ * everything and start over.
+ *
+ * We use the prefix 'xrep_ibt' because we rebuild both inode btrees at once.
+ */
+
+struct xrep_ibt {
+	/* Record under construction. */
+	struct xfs_inobt_rec_incore	rie;
+
+	/* Reconstructed inode records. */
+	struct xfbma		*inode_records;
+
+	/* Old inode btree blocks we found in the rmap. */
+	struct xfs_bitmap	*btlist;
+
+	struct xfs_scrub	*sc;
+
+	/* Number of inodes assigned disk space. */
+	unsigned int		icount;
+
+	/* Number of inodes in use. */
+	unsigned int		iused;
+};
+
+/*
+ * Is this inode in use?  If the inode is in memory we can tell from i_mode,
+ * otherwise we have to check di_mode in the on-disk buffer.  We only care
+ * that the high (i.e. non-permission) bits of _mode are zero.  This should be
+ * safe because repair keeps all AG headers locked until the end, and any
+ * process trying to perform an inode allocation/free must lock the AGI.
+ *
+ * @cluster_ag_base is the inode offset of the cluster within the AG.
+ * @cluster_bp is the cluster buffer.
+ * @cluster_index is the inode offset within the inode cluster.
+ */
+STATIC int
+xrep_ibt_check_ifree(
+	struct xrep_ibt		*ri,
+	xfs_agino_t		cluster_ag_base,
+	struct xfs_buf		*cluster_bp,
+	unsigned int		cluster_index,
+	bool			*inuse)
+{
+	struct xfs_scrub	*sc = ri->sc;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_dinode	*dip;
+	xfs_ino_t		fsino;
+	xfs_agnumber_t		agno = ri->sc->sa.agno;
+	unsigned int		cluster_buf_base;
+	unsigned int		offset;
+	int			error;
+
+	fsino = XFS_AGINO_TO_INO(mp, agno, cluster_ag_base + cluster_index);
+
+	/* Inode uncached or half assembled, read disk buffer */
+	cluster_buf_base = XFS_INO_TO_OFFSET(mp, cluster_ag_base);
+	offset = (cluster_buf_base + cluster_index) * mp->m_sb.sb_inodesize;
+	if (offset >= BBTOB(cluster_bp->b_length))
+		return -EFSCORRUPTED;
+	dip = xfs_buf_offset(cluster_bp, offset);
+	if (be16_to_cpu(dip->di_magic) != XFS_DINODE_MAGIC)
+		return -EFSCORRUPTED;
+
+	if (dip->di_version >= 3 && be64_to_cpu(dip->di_ino) != fsino)
+		return -EFSCORRUPTED;
+
+	/* Will the in-core inode tell us if it's in use? */
+	error = xfs_icache_inode_is_allocated(mp, sc->tp, fsino, inuse);
+	if (!error)
+		return 0;
+
+	*inuse = dip->di_mode != 0;
+	return 0;
+}
+
+/*
+ * Given an extent of inodes and an inode cluster buffer, calculate the
+ * location of the corresponding inobt record (creating it if necessary),
+ * then update the parts of the holemask and freemask of that record that
+ * correspond to the inode extent we were given.
+ *
+ * @cluster_ir_startino is the AG inode number of an inobt record that we're
+ * proposing to create for this inode cluster.  If sparse inodes are enabled,
+ * we must round down to a chunk boundary to find the actual sparse record.
+ * @cluster_bp is the buffer of the inode cluster.
+ * @nr_inodes is the number of inodes to check from the cluster.
+ */
+STATIC int
+xrep_ibt_cluster_record(
+	struct xrep_ibt		*ri,
+	xfs_agino_t		cluster_ir_startino,
+	struct xfs_buf		*cluster_bp,
+	unsigned int		nr_inodes)
+{
+	struct xfs_scrub	*sc = ri->sc;
+	struct xfs_mount	*mp = sc->mp;
+	xfs_agino_t		ir_startino;
+	unsigned int		cluster_base;
+	unsigned int		cluster_index;
+	bool			inuse;
+	int			error = 0;
+
+	ir_startino = cluster_ir_startino;
+	if (xfs_sb_version_hassparseinodes(&mp->m_sb))
+		ir_startino = rounddown(ir_startino, XFS_INODES_PER_CHUNK);
+	cluster_base = cluster_ir_startino - ir_startino;
+
+	/*
+	 * If the accumulated inobt record doesn't map this cluster, add it to
+	 * the list and reset it.
+	 */
+	if (ri->rie.ir_startino != NULLAGINO &&
+	    ri->rie.ir_startino + XFS_INODES_PER_CHUNK <= ir_startino) {
+		error = xfbma_append(ri->inode_records, &ri->rie);
+		if (error)
+			return error;
+		ri->rie.ir_startino = NULLAGINO;
+	}
+
+	if (ri->rie.ir_startino == NULLAGINO) {
+		ri->rie.ir_startino = ir_startino;
+		ri->rie.ir_free = XFS_INOBT_ALL_FREE;
+		ri->rie.ir_holemask = 0xFFFF;
+		ri->rie.ir_count = 0;
+	}
+
+	/* Record the whole cluster. */
+	ri->icount += nr_inodes;
+	ri->rie.ir_count += nr_inodes;
+	ri->rie.ir_holemask &= ~xfs_inobt_maskn(
+				cluster_base / XFS_INODES_PER_HOLEMASK_BIT,
+				nr_inodes / XFS_INODES_PER_HOLEMASK_BIT);
+
+	/* Which inodes within this cluster are free? */
+	for (cluster_index = 0; cluster_index < nr_inodes; cluster_index++) {
+		error = xrep_ibt_check_ifree(ri, cluster_ir_startino,
+				cluster_bp, cluster_index, &inuse);
+		if (error)
+			return error;
+		if (!inuse)
+			continue;
+		ri->iused++;
+		ri->rie.ir_free &= ~XFS_INOBT_MASK(cluster_base +
+						   cluster_index);
+	}
+	return 0;
+}
+
+/*
+ * For each inode cluster covering the physical extent recorded by the rmapbt,
+ * we must calculate the properly aligned startino of that cluster, then
+ * iterate each cluster to fill in used and filled masks appropriately.  We
+ * then use the (startino, used, filled) information to construct the
+ * appropriate inode records.
+ */
+STATIC int
+xrep_ibt_process_cluster(
+	struct xrep_ibt		*ri,
+	xfs_agblock_t		cluster_bno)
+{
+	struct xfs_imap		imap;
+	struct xfs_dinode	*dip;
+	struct xfs_buf		*cluster_bp;
+	struct xfs_scrub	*sc = ri->sc;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
+	xfs_agino_t		cluster_ag_base;
+	xfs_agino_t		irec_index;
+	unsigned int		nr_inodes;
+	int			error;
+
+	nr_inodes = min_t(unsigned int, igeo->inodes_per_cluster,
+			XFS_INODES_PER_CHUNK);
+
+	/*
+	 * Grab the inode cluster buffer.  This is safe to do with a broken
+	 * inobt because imap_to_bp directly maps the buffer without touching
+	 * either inode btree.
+	 */
+	imap.im_blkno = XFS_AGB_TO_DADDR(mp, sc->sa.agno, cluster_bno);
+	imap.im_len = XFS_FSB_TO_BB(mp, igeo->blocks_per_cluster);
+	imap.im_boffset = 0;
+	error = xfs_imap_to_bp(mp, sc->tp, &imap, &dip, &cluster_bp, 0, 0);
+	if (error)
+		return error;
+
+	/*
+	 * Record the contents of each possible inobt record mapping this
+	 * cluster.
+	 */
+	cluster_ag_base = XFS_AGB_TO_AGINO(mp, cluster_bno);
+	for (irec_index = 0;
+	     irec_index < igeo->inodes_per_cluster;
+	     irec_index += XFS_INODES_PER_CHUNK) {
+		error = xrep_ibt_cluster_record(ri,
+				cluster_ag_base + irec_index, cluster_bp,
+				nr_inodes);
+		if (error)
+			break;
+
+	}
+
+	xfs_trans_brelse(sc->tp, cluster_bp);
+	return error;
+}
+
+/* Record extents that belong to inode btrees. */
+STATIC int
+xrep_ibt_walk_rmap(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rec,
+	void			*priv)
+{
+	struct xrep_ibt		*ri = priv;
+	struct xfs_mount	*mp = cur->bc_mp;
+	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
+	xfs_fsblock_t		fsbno;
+	xfs_agblock_t		agbno = rec->rm_startblock;
+	xfs_agblock_t		cluster_base;
+	int			error = 0;
+
+	if (xchk_should_terminate(ri->sc, &error))
+		return error;
+
+	/* Fragment of the old btrees; dispose of them later. */
+	if (rec->rm_owner == XFS_RMAP_OWN_INOBT) {
+		fsbno = XFS_AGB_TO_FSB(mp, ri->sc->sa.agno, agbno);
+		return xfs_bitmap_set(ri->btlist, fsbno, rec->rm_blockcount);
+	}
+
+	/* Skip extents that are not owned by inode chunks (OWN_INODES). */
+	if (rec->rm_owner != XFS_RMAP_OWN_INODES)
+		return 0;
+
+	/* The entire record must align to the inode cluster size. */
+	if (agbno & (igeo->blocks_per_cluster - 1) ||
+	    (agbno + rec->rm_blockcount) & (igeo->blocks_per_cluster - 1))
+		return -EFSCORRUPTED;
+
+	/*
+	 * The entire record must also adhere to the inode cluster alignment
+	 * size if sparse inodes are not enabled.
+	 */
+	if (!xfs_sb_version_hassparseinodes(&mp->m_sb) &&
+	    (agbno & (igeo->cluster_align - 1) ||
+	     (agbno + rec->rm_blockcount) & (igeo->cluster_align - 1)))
+		return -EFSCORRUPTED;
+
+	/*
+	 * On a sparse inode fs, this cluster could be part of a sparse chunk.
+	 * Sparse clusters must be aligned to sparse chunk alignment.
+	 */
+	if (xfs_sb_version_hassparseinodes(&mp->m_sb) &&
+	    (agbno & (mp->m_sb.sb_spino_align - 1) ||
+	     (agbno + rec->rm_blockcount) & (mp->m_sb.sb_spino_align - 1)))
+		return -EFSCORRUPTED;
+
+	trace_xrep_ibt_walk_rmap(mp, ri->sc->sa.agno, rec->rm_startblock,
+			rec->rm_blockcount, rec->rm_owner, rec->rm_offset,
+			rec->rm_flags);
+
+	/*
+	 * Record the free/hole masks for each inode cluster that could be
+	 * mapped by this rmap record.
+	 */
+	for (cluster_base = 0;
+	     cluster_base < rec->rm_blockcount;
+	     cluster_base += igeo->blocks_per_cluster) {
+		error = xrep_ibt_process_cluster(ri, agbno + cluster_base);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/* Insert an inode chunk record into a given btree. */
+static int
+xrep_ibt_insert_btrec(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_inobt_rec_incore	*rie,
+	unsigned int			freecount)
+{
+	int				stat;
+	int				error;
+
+	error = xfs_inobt_lookup(cur, rie->ir_startino, XFS_LOOKUP_EQ, &stat);
+	if (error)
+		return error;
+	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 0);
+	error = xfs_inobt_insert_rec(cur, rie->ir_holemask, rie->ir_count,
+			freecount, rie->ir_free, &stat);
+	if (error)
+		return error;
+	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 1);
+	return error;
+}
+
+/* Compare two ialloc extents. */
+static int
+xfs_inobt_rec_incore_cmp(
+	const void			*a,
+	const void			*b)
+{
+	const struct xfs_inobt_rec_incore	*ap = a;
+	const struct xfs_inobt_rec_incore	*bp = b;
+
+	if (ap->ir_startino > bp->ir_startino)
+		return 1;
+	else if (ap->ir_startino < bp->ir_startino)
+		return -1;
+	return 0;
+}
+
+/*
+ * Iterate all reverse mappings to find the inodes (OWN_INODES) and the inode
+ * btrees (OWN_INOBT).  Figure out if we have enough free space to reconstruct
+ * the inode btrees.  The caller must clean up the lists if anything goes
+ * wrong.
+ */
+STATIC int
+xrep_ibt_find_inodes(
+	struct xfs_scrub	*sc,
+	struct xfbma		*inode_records,
+	struct xfs_bitmap	*old_iallocbt_blocks,
+	unsigned int		*icount,
+	unsigned int		*iused)
+{
+	struct xrep_ibt		ri = {
+		.sc		= sc,
+		.inode_records	= inode_records,
+		.btlist		= old_iallocbt_blocks,
+		.rie		= { .ir_startino = NULLAGINO, },
+	};
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_btree_cur	*cur;
+	xfs_agblock_t		nr_blocks;
+	int			error;
+
+	/* Collect all reverse mappings for inode blocks. */
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
+	error = xfs_rmap_query_all(cur, xrep_ibt_walk_rmap, &ri);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, error);
+
+	/* If we have a record ready to go, add it to the array. */
+	if (ri.rie.ir_startino != NULLAGINO) {
+		error = xfbma_append(inode_records, &ri.rie);
+		if (error)
+			return error;
+	}
+
+	/* Do we have enough space to rebuild all inode trees? */
+	nr_blocks = xfs_iallocbt_calc_size(mp, xfbma_length(inode_records));
+	if (xfs_sb_version_hasfinobt(&mp->m_sb))
+		nr_blocks *= 2;
+	if (!xrep_ag_has_space(sc->sa.pag, nr_blocks, XFS_AG_RESV_NONE))
+		return -ENOSPC;
+
+	*icount = ri.icount;
+	*iused = ri.iused;
+	return 0;
+
+err:
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
+
+/* Update the AGI counters. */
+STATIC int
+xrep_ibt_reset_counters(
+	struct xfs_scrub	*sc,
+	struct xfbma		*inode_records,
+	unsigned int		icount,
+	unsigned int		iused,
+	int			*log_flags)
+{
+	struct xfs_agi		*agi;
+	struct xfs_perag	*pag = sc->sa.pag;
+	unsigned int		freecount;
+
+	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
+	freecount = icount - iused;
+
+	/* Trigger inode count recalculation */
+	xfs_force_summary_recalc(sc->mp);
+
+	/*
+	 * Reset the per-AG info, both incore and ondisk.  Mark the incore
+	 * state stale in case we fail out of here.
+	 */
+	ASSERT(pag->pagi_init);
+	pag->pagi_init = 0;
+	pag->pagi_count = icount;
+	pag->pagi_freecount = freecount;
+
+	agi->agi_count = cpu_to_be32(icount);
+	agi->agi_freecount = cpu_to_be32(freecount);
+	*log_flags |= XFS_AGI_COUNT | XFS_AGI_FREECOUNT;
+
+	return 0;
+}
+
+/* Initialize a new inode btree root and implant it into the AGI. */
+STATIC int
+xrep_ibt_reset_btree(
+	struct xfs_scrub	*sc,
+	xfs_btnum_t		btnum,
+	enum xfs_ag_resv_type	resv,
+	int			*log_flags)
+{
+	struct xfs_agi		*agi;
+	struct xfs_buf		*bp;
+	struct xfs_mount	*mp = sc->mp;
+	const struct xfs_buf_ops *ops;
+	xfs_fsblock_t		fsbno;
+	int			error;
+
+	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
+
+	switch (btnum) {
+	case XFS_BTNUM_INO:
+		ops = &xfs_inobt_buf_ops;
+		break;
+	case XFS_BTNUM_FINO:
+		ops = &xfs_finobt_buf_ops;
+		break;
+	default:
+		ASSERT(0);
+		return -EFSCORRUPTED;
+	}
+
+	/* Initialize new btree root. */
+	error = xrep_alloc_ag_block(sc, &XFS_RMAP_OINFO_INOBT, &fsbno, resv);
+	if (error)
+		return error;
+	error = xrep_init_btblock(sc, fsbno, &bp, btnum, ops);
+	if (error)
+		return error;
+
+	switch (btnum) {
+	case XFS_BTNUM_INOi:
+		agi->agi_root = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, fsbno));
+		agi->agi_level = cpu_to_be32(1);
+		*log_flags |= XFS_AGI_ROOT | XFS_AGI_LEVEL;
+		break;
+	case XFS_BTNUM_FINOi:
+		agi->agi_free_root = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, fsbno));
+		agi->agi_free_level = cpu_to_be32(1);
+		*log_flags |= XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
+		break;
+	default:
+		ASSERT(0);
+	}
+
+	return 0;
+}
+
+/* Initialize new inobt/finobt roots and implant them into the AGI. */
+STATIC int
+xrep_ibt_reset_btrees(
+	struct xfs_scrub	*sc,
+	int			*log_flags)
+{
+	enum xfs_ag_resv_type	resv;
+	int			error;
+
+	resv = XFS_AG_RESV_NONE;
+	error = xrep_ibt_reset_btree(sc, XFS_BTNUM_INO, XFS_AG_RESV_NONE,
+			log_flags);
+	if (error || !xfs_sb_version_hasfinobt(&sc->mp->m_sb))
+		return error;
+
+	/*
+	 * If we made a per-AG reservation for the finobt then we must account
+	 * the new block correctly.
+	 */
+	if (!sc->mp->m_finobt_nores)
+		resv = XFS_AG_RESV_METADATA;
+	return xrep_ibt_reset_btree(sc, XFS_BTNUM_FINO, resv, log_flags);
+}
+
+/* Insert an inode chunk record into both inode btrees. */
+static int
+xrep_ibt_insert_rec(
+	const void			*item,
+	void				*priv)
+{
+	const struct xfs_inobt_rec_incore	*rie = item;
+	struct xfs_scrub		*sc = priv;
+	struct xfs_btree_cur		*cur;
+	unsigned int			freecount;
+	unsigned int			holes;
+	int				error;
+
+	holes = hweight16(rie->ir_holemask) * XFS_INODES_PER_HOLEMASK_BIT;
+	freecount = hweight64(rie->ir_free) - holes;
+	trace_xrep_ibt_insert(sc->mp, sc->sa.agno, rie->ir_startino,
+			rie->ir_holemask, rie->ir_count, freecount,
+			rie->ir_free);
+
+	/* Insert into the inobt. */
+	cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp, sc->sa.agno,
+			XFS_BTNUM_INO);
+	error = xrep_ibt_insert_btrec(cur, rie, freecount);
+	if (error)
+		goto out_cur;
+	xfs_btree_del_cursor(cur, error);
+
+	/* Insert into the finobt if chunk has free inodes. */
+	if (xfs_sb_version_hasfinobt(&sc->mp->m_sb) && freecount != 0) {
+		cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp,
+				sc->sa.agno, XFS_BTNUM_FINO);
+		error = xrep_ibt_insert_btrec(cur, rie, freecount);
+		if (error)
+			goto out_cur;
+		xfs_btree_del_cursor(cur, error);
+	}
+
+	return xrep_roll_ag_trans(sc);
+out_cur:
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
+
+/* Build new inode btrees and dispose of the old ones. */
+STATIC int
+xrep_ibt_rebuild_trees(
+	struct xfs_scrub	*sc,
+	struct xfbma		*inode_records,
+	struct xfs_bitmap	*old_iallocbt_blocks)
+{
+	int			error;
+
+	/*
+	 * Sort the inode extents by startino to avoid btree splits when we
+	 * rebuild the inode btrees.
+	 */
+	error = xfbma_sort(inode_records, xfs_inobt_rec_incore_cmp);
+	if (error)
+		return error;
+
+	/* Free the old inode btree blocks if they're not in use. */
+	error = xrep_reap_extents(sc, old_iallocbt_blocks,
+			&XFS_RMAP_OINFO_INOBT, XFS_AG_RESV_NONE);
+	if (error)
+		return error;
+
+	/* Add all records. */
+	return xfbma_iter_del(inode_records, xrep_ibt_insert_rec, sc);
+}
+
+/*
+ * Make our new inode btree roots permanent so that we can start re-adding
+ * inode records back into the AG.
+ */
+STATIC int
+xrep_ibt_commit_new(
+	struct xfs_scrub	*sc,
+	struct xfs_bitmap	*old_iallocbt_blocks,
+	int			log_flags)
+{
+	int			error;
+
+	xfs_ialloc_log_agi(sc->tp, sc->sa.agi_bp, log_flags);
+
+	/* Invalidate the old inobt/finobt blocks and commit. */
+	error = xrep_invalidate_blocks(sc, old_iallocbt_blocks);
+	if (error)
+		return error;
+	error = xrep_roll_ag_trans(sc);
+	if (error)
+		return error;
+
+	/*
+	 * Now that we've succeeded, mark the incore state valid again.  If the
+	 * finobt is enabled, make sure we reinitialize the per-AG reservations
+	 * when we're done.
+	 */
+	sc->sa.pag->pagi_init = 1;
+	if (xfs_sb_version_hasfinobt(&sc->mp->m_sb))
+		sc->flags |= XREP_RESET_PERAG_RESV;
+	return 0;
+}
+
+/* Repair both inode btrees. */
+int
+xrep_iallocbt(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_bitmap	old_iallocbt_blocks;
+	struct xfbma		*inode_records;
+	struct xfs_mount	*mp = sc->mp;
+	unsigned int		icount = 0;
+	unsigned int		iused = 0;
+	int			log_flags = 0;
+	int			error = 0;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	xchk_perag_get(sc->mp, &sc->sa);
+
+	/* We rebuild both inode btrees. */
+	sc->sick_mask = XFS_SICK_AG_INOBT | XFS_SICK_AG_FINOBT;
+
+	/* Set up some storage */
+	inode_records = xfbma_init(sizeof(struct xfs_inobt_rec_incore));
+	if (IS_ERR(inode_records))
+		return PTR_ERR(inode_records);
+
+	/* Collect the inode data and find the old btree blocks. */
+	xfs_bitmap_init(&old_iallocbt_blocks);
+	error = xrep_ibt_find_inodes(sc, inode_records, &old_iallocbt_blocks,
+			&icount, &iused);
+	if (error)
+		goto out;
+
+	/*
+	 * Blow out the old inode btrees.  This is the point at which
+	 * we are no longer able to bail out gracefully.
+	 */
+	error = xrep_ibt_reset_counters(sc, inode_records, icount, iused,
+			&log_flags);
+	if (error)
+		goto out;
+	error = xrep_ibt_reset_btrees(sc, &log_flags);
+	if (error)
+		goto out;
+	error = xrep_ibt_commit_new(sc, &old_iallocbt_blocks, log_flags);
+	if (error)
+		goto out;
+
+	/* Now rebuild the inode information. */
+	error = xrep_ibt_rebuild_trees(sc, inode_records, &old_iallocbt_blocks);
+out:
+	xfbma_destroy(inode_records);
+	xfs_bitmap_destroy(&old_iallocbt_blocks);
+	return error;
+}
+
+/* Make sure both btrees are ok after we've rebuilt them. */
+int
+xrep_revalidate_iallocbt(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	error = xchk_inobt(sc);
+	if (error)
+		return error;
+
+	if (xfs_sb_version_hasfinobt(&sc->mp->m_sb))
+		return xchk_finobt(sc);
+
+	return 0;
+}
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 4cfeec57fb05..ad93d25602ae 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -952,3 +952,26 @@ xrep_ino_dqattach(
 
 	return error;
 }
+
+/* Reinitialize the per-AG block reservation for the AG we just fixed. */
+int
+xrep_reset_perag_resv(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	if (!(sc->flags & XREP_RESET_PERAG_RESV))
+		return 0;
+
+	ASSERT(sc->sa.pag != NULL);
+	ASSERT(sc->ops->type == ST_PERAG);
+	ASSERT(sc->tp);
+
+	sc->flags &= ~XREP_RESET_PERAG_RESV;
+	error = xfs_ag_resv_free(sc->sa.pag);
+	if (error)
+		goto out;
+	error = xfs_ag_resv_init(sc->sa.pag, sc->tp);
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 5a6a1cd437d7..21472fbf11d5 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -51,10 +51,12 @@ int xrep_find_ag_btree_roots(struct xfs_scrub *sc, struct xfs_buf *agf_bp,
 		struct xrep_find_ag_btree *btree_info, struct xfs_buf *agfl_bp);
 void xrep_force_quotacheck(struct xfs_scrub *sc, uint dqtype);
 int xrep_ino_dqattach(struct xfs_scrub *sc);
+int xrep_reset_perag_resv(struct xfs_scrub *sc);
 
 /* Metadata revalidators */
 
 int xrep_revalidate_allocbt(struct xfs_scrub *sc);
+int xrep_revalidate_iallocbt(struct xfs_scrub *sc);
 
 /* Metadata repairers */
 
@@ -64,6 +66,7 @@ int xrep_agf(struct xfs_scrub *sc);
 int xrep_agfl(struct xfs_scrub *sc);
 int xrep_agi(struct xfs_scrub *sc);
 int xrep_allocbt(struct xfs_scrub *sc);
+int xrep_iallocbt(struct xfs_scrub *sc);
 
 #else
 
@@ -84,7 +87,19 @@ xrep_calc_ag_resblks(
 	return 0;
 }
 
+static inline int
+xrep_reset_perag_resv(
+	struct xfs_scrub	*sc)
+{
+	if (!(sc->flags & XREP_RESET_PERAG_RESV))
+		return 0;
+
+	ASSERT(0);
+	return -EOPNOTSUPP;
+}
+
 #define xrep_revalidate_allocbt		(NULL)
+#define xrep_revalidate_iallocbt	(NULL)
 
 #define xrep_probe			xrep_notsupported
 #define xrep_superblock			xrep_notsupported
@@ -92,6 +107,7 @@ xrep_calc_ag_resblks(
 #define xrep_agfl			xrep_notsupported
 #define xrep_agi			xrep_notsupported
 #define xrep_allocbt			xrep_notsupported
+#define xrep_iallocbt			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index b42ac8ecdb49..6011823d0d40 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -231,14 +231,16 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_ag_iallocbt,
 		.scrub	= xchk_inobt,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_iallocbt,
+		.repair_eval = xrep_revalidate_iallocbt,
 	},
 	[XFS_SCRUB_TYPE_FINOBT] = {	/* finobt */
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_ag_iallocbt,
 		.scrub	= xchk_finobt,
 		.has	= xfs_sb_version_hasfinobt,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_iallocbt,
+		.repair_eval = xrep_revalidate_iallocbt,
 	},
 	[XFS_SCRUB_TYPE_RMAPBT] = {	/* rmapbt */
 		.type	= ST_PERAG,
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 94a30637a127..16ed1d3e1404 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -89,6 +89,7 @@ struct xfs_scrub {
 #define XCHK_TRY_HARDER		(1 << 0)  /* can't get resources, try again */
 #define XCHK_HAS_QUOTAOFFLOCK	(1 << 1)  /* we hold the quotaoff lock */
 #define XCHK_REAPING_DISABLED	(1 << 2)  /* background block reaping paused */
+#define XREP_RESET_PERAG_RESV	(1 << 30) /* must reset AG space reservation */
 #define XREP_ALREADY_FIXED	(1 << 31) /* checking our repair work */
 
 /* Metadata scrubbers */
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index d43b6003a088..cdf0dffc17d2 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -723,7 +723,7 @@ DEFINE_EVENT(xrep_rmap_class, name, \
 		 uint64_t owner, uint64_t offset, unsigned int flags), \
 	TP_ARGS(mp, agno, agbno, len, owner, offset, flags))
 DEFINE_REPAIR_RMAP_EVENT(xrep_abt_walk_rmap);
-DEFINE_REPAIR_RMAP_EVENT(xrep_ialloc_extent_fn);
+DEFINE_REPAIR_RMAP_EVENT(xrep_ibt_walk_rmap);
 DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn);
 DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn);
 
@@ -871,7 +871,7 @@ TRACE_EVENT(xrep_reset_counters,
 		  MAJOR(__entry->dev), MINOR(__entry->dev))
 )
 
-TRACE_EVENT(xrep_ialloc_insert,
+TRACE_EVENT(xrep_ibt_insert,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
 		 xfs_agino_t startino, uint16_t holemask, uint8_t count,
 		 uint8_t freecount, uint64_t freemask),

* [PATCH 06/18] xfs: repair refcount btrees
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (4 preceding siblings ...)
  2019-08-05  0:35 ` [PATCH 05/18] xfs: repair inode btrees Darrick J. Wong
@ 2019-08-05  0:35 ` Darrick J. Wong
  2019-08-05  0:35 ` [PATCH 07/18] xfs: repair inode records Darrick J. Wong
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:35 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Reconstruct the refcount data from the rmap btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
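Not part of the patch: a minimal userspace sketch of the sweep described
in the big comment at the top of refcount_repair.c, for anyone who wants
to trace it by hand.  It assumes the rmaps arrive sorted by startblock
with nonzero lengths, and it uses a small fixed-size array as the bag.
struct rmap, struct refc, BAG_MAX, next_edge(), bag_fill() and
refc_sweep() are made-up names for illustration, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BAG_MAX	64	/* hypothetical cap on overlapping rmaps */

struct rmap { uint32_t start, len; };
struct refc { uint32_t start, len, count; };

/* Lowest block where the bag size changes: next rmap start or a bag end. */
static uint32_t
next_edge(const struct rmap *bag, size_t nbag, const struct rmap *next)
{
	uint32_t	np = next ? next->start : UINT32_MAX;
	size_t		i;

	for (i = 0; i < nbag; i++)
		if (bag[i].start + bag[i].len < np)
			np = bag[i].start + bag[i].len;
	return np;
}

/* Move every rmap that starts at bno from the sorted array into the bag. */
static int
bag_fill(struct rmap *bag, size_t *nbag, const struct rmap *rm, size_t n,
	 size_t *i, uint32_t bno)
{
	while (*i < n && rm[*i].start == bno) {
		if (*nbag == BAG_MAX)
			return -1;
		bag[(*nbag)++] = rm[(*i)++];
	}
	return 0;
}

/* Returns the number of records written to out[], or -1 on overflow. */
static int
refc_sweep(const struct rmap *rm, size_t n, struct refc *out, size_t max_out)
{
	struct rmap	bag[BAG_MAX];
	size_t		nbag = 0, old, i = 0, j, nout = 0;
	uint32_t	sp, np;

	while (i < n) {
		sp = rm[i].start;
		if (bag_fill(bag, &nbag, rm, n, &i, sp))
			return -1;
		old = nbag;
		np = next_edge(bag, nbag, i < n ? &rm[i] : NULL);

		while (nbag) {
			assert(np > sp);	/* holds for sorted, nonzero-length input */

			/* Drop rmaps ending at np, then add those starting there. */
			for (j = 0; j < nbag; ) {
				if (bag[j].start + bag[j].len == np)
					bag[j] = bag[--nbag];
				else
					j++;
			}
			if (bag_fill(bag, &nbag, rm, n, &i, np))
				return -1;

			/* Overlap changed: [sp, np) had "old" owners. */
			if (nbag != old) {
				if (old > 1) {	/* refcount < 2 is not recorded */
					if (nout == max_out)
						return -1;
					out[nout++] = (struct refc){ sp, np - sp,
								     (uint32_t)old };
				}
				sp = np;
			}
			if (!nbag)
				break;
			old = nbag;
			np = next_edge(bag, nbag, i < n ? &rm[i] : NULL);
		}
	}
	return (int)nout;
}
```

Feeding it the rmaps (0,10), (2,4), (4,4), (20,5) should yield three
records, (2,2,2), (4,2,3) and (6,2,2); the single-owner ranges are
skipped, just as the patch only remembers extents with a refcount above
one.  The kernel code differs in that it pulls rmaps from a btree cursor
and keeps the bag in an xfbma, but the edge-finding logic is meant to be
the same.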
 fs/xfs/Makefile                |    1 
 fs/xfs/scrub/refcount_repair.c |  567 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h          |    2 
 fs/xfs/scrub/scrub.c           |    2 
 fs/xfs/scrub/trace.h           |   11 -
 5 files changed, 577 insertions(+), 6 deletions(-)
 create mode 100644 fs/xfs/scrub/refcount_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 3b7fdccf2818..4ac6256fe7c3 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -164,6 +164,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   array.o \
 				   bitmap.o \
 				   ialloc_repair.o \
+				   refcount_repair.o \
 				   repair.o \
 				   )
 endif
diff --git a/fs/xfs/scrub/refcount_repair.c b/fs/xfs/scrub/refcount_repair.c
new file mode 100644
index 000000000000..cbcfb96fd2e0
--- /dev/null
+++ b/fs/xfs/scrub/refcount_repair.c
@@ -0,0 +1,567 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_error.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+#include "scrub/array.h"
+
+/*
+ * Rebuilding the Reference Count Btree
+ * ====================================
+ *
+ * This algorithm is "borrowed" from xfs_repair.  Imagine the rmap
+ * entries as rectangles representing extents of physical blocks, and
+ * that the rectangles can be laid down to allow them to overlap each
+ * other; then we know that we must emit a refcnt btree entry wherever
+ * the amount of overlap changes, i.e. the emission stimulus is
+ * level-triggered:
+ *
+ *                 -    ---
+ *       --      ----- ----   ---        ------
+ * --   ----     ----------- ----     ---------
+ * -------------------------------- -----------
+ * ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^
+ * 2 1  23 21    3 43 234  2123  1 01 2  3     0
+ *
+ * For our purposes, a rmap is a tuple (startblock, len, fileoff, owner).
+ *
+ * Note that in the actual refcnt btree we don't store the refcount < 2
+ * cases because the bnobt tells us which blocks are free; single-use
+ * blocks aren't recorded in the bnobt or the refcntbt.  If the rmapbt
+ * supports storing multiple entries covering a given block we could
+ * theoretically dispense with the refcntbt and simply count rmaps, but
+ * that's inefficient in the (hot) write path, so we'll take the cost of
+ * the extra tree to save time.  Also there's no guarantee that rmap
+ * will be enabled.
+ *
+ * Given an array of rmaps sorted by physical block number, a starting
+ * physical block (sp), a bag to hold rmaps that cover sp, and the next
+ * physical block where the level changes (np), we can reconstruct the
+ * refcount btree as follows:
+ *
+ * While there are still unprocessed rmaps in the array,
+ *  - Set sp to the physical block (pblk) of the next unprocessed rmap.
+ *  - Add to the bag all rmaps in the array where startblock == sp.
+ *  - Set np to the physical block where the bag size will change.  This
+ *    is the minimum of (the pblk of the next unprocessed rmap) and
+ *    (startblock + len of each rmap in the bag).
+ *  - Record the bag size as old_bag_size.
+ *
+ *  - While the bag isn't empty,
+ *     - Remove from the bag all rmaps where startblock + len == np.
+ *     - Add to the bag all rmaps in the array where startblock == np.
+ *     - If the bag size isn't old_bag_size, store the refcount entry
+ *       (sp, np - sp, old_bag_size) in the refcnt btree.
+ *     - If the bag is empty, break out of the inner loop.
+ *     - Set old_bag_size to the bag size
+ *     - Set sp = np.
+ *     - Set np to the physical block where the bag size will change.
+ *       This is the minimum of (the pblk of the next unprocessed rmap)
+ *       and (startblock + len of each rmap in the bag).
+ *
+ * Like all the other repairers, we make a list of all the refcount
+ * records we need, then reinitialize the refcount btree root and
+ * insert all the records.
+ */
+
+/* The only parts of the rmap that we care about for computing refcounts. */
+struct xrep_refc_rmap {
+	xfs_agblock_t		startblock;
+	xfs_extlen_t		blockcount;
+} __packed;
+
+/* Smallest possible representation of a refcount extent. */
+struct xrep_refc_extent {
+	xfs_agblock_t		startblock;
+	xfs_extlen_t		blockcount;
+	xfs_nlink_t		refcount;
+} __packed;
+
+struct xrep_refc {
+	struct xfbma		*rmap_bag; /* rmaps we're tracking */
+	struct xfbma		*refcount_records;	/* refcount extents */
+	struct xfs_bitmap	*btlist;   /* old refcountbt blocks */
+	struct xfs_scrub	*sc;
+	xfs_extlen_t		btblocks;  /* # of refcountbt blocks */
+};
+
+/* Grab the next (abbreviated) rmap record from the rmapbt. */
+STATIC int
+xrep_refc_next_rrm(
+	struct xfs_btree_cur	*cur,
+	struct xrep_refc	*rr,
+	struct xrep_refc_rmap	*rrm,
+	bool			*have_rec)
+{
+	struct xfs_rmap_irec	rmap;
+	struct xfs_mount	*mp = cur->bc_mp;
+	xfs_fsblock_t		fsbno;
+	int			have_gt;
+	int			error = 0;
+
+	*have_rec = false;
+	/*
+	 * Loop through the remaining rmaps.  Remember CoW staging
+	 * extents and the refcountbt blocks from the old tree for later
+	 * disposal.  We can only share written data fork extents, so
+	 * keep looping until we find an rmap for one.
+	 */
+	do {
+		if (xchk_should_terminate(rr->sc, &error))
+			goto out_error;
+
+		error = xfs_btree_increment(cur, 0, &have_gt);
+		if (error)
+			goto out_error;
+		if (!have_gt)
+			return 0;
+
+		error = xfs_rmap_get_rec(cur, &rmap, &have_gt);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, have_gt == 1, out_error);
+
+		if (rmap.rm_owner == XFS_RMAP_OWN_COW) {
+			struct xrep_refc_extent	ext = {
+				.startblock	= rmap.rm_startblock +
+					XFS_REFC_COW_START,
+				.blockcount	= rmap.rm_blockcount,
+				.refcount	= 1,
+			};
+
+			/* Pass CoW staging extents right through. */
+			error = xfbma_append(rr->refcount_records, &ext);
+			if (error)
+				goto out_error;
+		} else if (rmap.rm_owner == XFS_RMAP_OWN_REFC) {
+			/* refcountbt block, dump it when we're done. */
+			rr->btblocks += rmap.rm_blockcount;
+			fsbno = XFS_AGB_TO_FSB(cur->bc_mp,
+					cur->bc_private.a.agno,
+					rmap.rm_startblock);
+			error = xfs_bitmap_set(rr->btlist, fsbno,
+					rmap.rm_blockcount);
+			if (error)
+				goto out_error;
+		}
+	} while (XFS_RMAP_NON_INODE_OWNER(rmap.rm_owner) ||
+		 xfs_internal_inum(mp, rmap.rm_owner) ||
+		 (rmap.rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK |
+				   XFS_RMAP_UNWRITTEN)));
+
+	rrm->startblock = rmap.rm_startblock;
+	rrm->blockcount = rmap.rm_blockcount;
+	*have_rec = true;
+	return 0;
+
+out_error:
+	return error;
+}
+
+/* Compare two btree extents. */
+static int
+xrep_refc_extent_cmp(
+	const void			*a,
+	const void			*b)
+{
+	const struct xrep_refc_extent	*ap = a;
+	const struct xrep_refc_extent	*bp = b;
+
+	if (ap->startblock > bp->startblock)
+		return 1;
+	else if (ap->startblock < bp->startblock)
+		return -1;
+	return 0;
+}
+
+/* Record a reference count extent. */
+STATIC int
+xrep_refc_remember(
+	struct xfs_scrub		*sc,
+	struct xrep_refc		*rr,
+	xfs_agblock_t			agbno,
+	xfs_extlen_t			len,
+	xfs_nlink_t			refcount)
+{
+	struct xrep_refc_extent		rre = {
+		.startblock	= agbno,
+		.blockcount	= len,
+		.refcount	= refcount,
+	};
+
+	trace_xrep_refcount_extent_fn(sc->mp, sc->sa.agno, agbno, len,
+			refcount);
+
+	return xfbma_append(rr->refcount_records, &rre);
+}
+
+#define RRM_NEXT(r)	((r).startblock + (r).blockcount)
+/*
+ * Find the next block where the refcount changes, given the next rmap we
+ * looked at and the ones we're already tracking.
+ */
+static inline xfs_agblock_t
+xrep_refc_next_edge(
+	struct xfbma		*rmap_bag,
+	struct xrep_refc_rmap	*next_rrm,
+	bool			next_valid)
+{
+	struct xrep_refc_rmap	rrm;
+	uint64_t		i;
+	xfs_agblock_t		nbno;
+
+	nbno = next_valid ? next_rrm->startblock : NULLAGBLOCK;
+	foreach_xfbma_item(rmap_bag, i, rrm)
+		nbno = min_t(xfs_agblock_t, nbno, RRM_NEXT(rrm));
+	return nbno;
+}
+
+/* Iterate all the rmap records to generate reference count data. */
+STATIC int
+xrep_refc_generate_refcounts(
+	struct xfs_scrub	*sc,
+	struct xrep_refc	*rr)
+{
+	struct xrep_refc_rmap	rrm;
+	struct xfs_btree_cur	*cur;
+	xfs_agblock_t		sbno;
+	xfs_agblock_t		cbno;
+	xfs_agblock_t		nbno;
+	size_t			old_stack_sz;
+	size_t			stack_sz = 0;
+	bool			have;
+	int			have_gt;
+	int			error;
+
+	/* Start the rmapbt cursor to the left of all records. */
+	cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.agno);
+	error = xfs_rmap_lookup_le(cur, 0, 0, 0, 0, 0, &have_gt);
+	if (error)
+		goto out;
+	ASSERT(have_gt == 0);
+
+	/* Process reverse mappings into refcount data. */
+	while (xfs_btree_has_more_records(cur)) {
+		/* Push all rmaps with pblk == sbno onto the stack */
+		error = xrep_refc_next_rrm(cur, rr, &rrm, &have);
+		if (error)
+			goto out;
+		if (!have)
+			break;
+		sbno = cbno = rrm.startblock;
+		while (have && rrm.startblock == sbno) {
+			error = xfbma_insert_anywhere(rr->rmap_bag, &rrm);
+			if (error)
+				goto out;
+			stack_sz++;
+			error = xrep_refc_next_rrm(cur, rr, &rrm, &have);
+			if (error)
+				goto out;
+		}
+		error = xfs_btree_decrement(cur, 0, &have_gt);
+		if (error)
+			goto out;
+		XFS_WANT_CORRUPTED_GOTO(sc->mp, have_gt, out);
+
+		/* Set nbno to the bno of the next refcount change */
+		nbno = xrep_refc_next_edge(rr->rmap_bag, &rrm, have);
+		if (nbno == NULLAGBLOCK) {
+			error = -EFSCORRUPTED;
+			goto out;
+		}
+
+		ASSERT(nbno > sbno);
+		old_stack_sz = stack_sz;
+
+		/* While stack isn't empty... */
+		while (stack_sz) {
+			uint64_t	i;
+
+			/* Pop all rmaps that end at nbno */
+			foreach_xfbma_item(rr->rmap_bag, i, rrm) {
+				if (RRM_NEXT(rrm) != nbno)
+					continue;
+				error = xfbma_nullify(rr->rmap_bag, i);
+				if (error)
+					goto out;
+				stack_sz--;
+			}
+
+			/* Push array items that start at nbno */
+			error = xrep_refc_next_rrm(cur, rr, &rrm, &have);
+			if (error)
+				goto out;
+			while (have && rrm.startblock == nbno) {
+				error = xfbma_insert_anywhere(rr->rmap_bag,
+						&rrm);
+				if (error)
+					goto out;
+				stack_sz++;
+				error = xrep_refc_next_rrm(cur, rr, &rrm,
+						&have);
+				if (error)
+					goto out;
+			}
+			error = xfs_btree_decrement(cur, 0, &have_gt);
+			if (error)
+				goto out;
+			XFS_WANT_CORRUPTED_GOTO(sc->mp, have_gt, out);
+
+			/* Emit refcount if necessary */
+			ASSERT(nbno > cbno);
+			if (stack_sz != old_stack_sz) {
+				if (old_stack_sz > 1) {
+					error = xrep_refc_remember(sc, rr, cbno,
+							nbno - cbno,
+							old_stack_sz);
+					if (error)
+						goto out;
+				}
+				cbno = nbno;
+			}
+
+			/* Stack empty, go find the next rmap */
+			if (stack_sz == 0)
+				break;
+			old_stack_sz = stack_sz;
+			sbno = nbno;
+
+			/* Set nbno to the bno of the next refcount change */
+			nbno = xrep_refc_next_edge(rr->rmap_bag, &rrm, have);
+			if (nbno == NULLAGBLOCK) {
+				error = -EFSCORRUPTED;
+				goto out;
+			}
+
+			ASSERT(nbno > sbno);
+		}
+	}
+
+	ASSERT(stack_sz == 0);
+out:
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
+#undef RRM_NEXT
+
+/*
+ * Generate all the reference counts for this AG and a list of the old
+ * refcount btree blocks.  Figure out if we have enough free space to
+ * reconstruct the refcount btree.  The caller must clean up the lists if
+ * anything goes wrong.
+ */
+STATIC int
+xrep_refc_find_refcounts(
+	struct xfs_scrub	*sc,
+	struct xfbma		*refcount_records,
+	struct xfs_bitmap	*old_refcountbt_blocks)
+{
+	struct xrep_refc	rr = {
+		.sc			= sc,
+		.refcount_records	= refcount_records,
+		.btlist			= old_refcountbt_blocks,
+	};
+	struct xfs_mount	*mp = sc->mp;
+	xfs_extlen_t		blocks;
+	int			error;
+
+	/* Set up some storage */
+	rr.rmap_bag = xfbma_init(sizeof(struct xrep_refc_rmap));
+	if (IS_ERR(rr.rmap_bag))
+		return PTR_ERR(rr.rmap_bag);
+
+	/* Generate all the refcount records. */
+	error = xrep_refc_generate_refcounts(sc, &rr);
+	if (error)
+		goto out;
+
+	/* Do we actually have enough space to do this? */
+	blocks = xfs_refcountbt_calc_size(mp, xfbma_length(refcount_records));
+	if (!xrep_ag_has_space(sc->sa.pag, blocks, XFS_AG_RESV_METADATA)) {
+		error = -ENOSPC;
+		goto out;
+	}
+
+out:
+	xfbma_destroy(rr.rmap_bag);
+	return error;
+}
+
+/* Initialize new refcountbt root and implant it into the AGF. */
+STATIC int
+xrep_refc_reset_btree(
+	struct xfs_scrub	*sc,
+	int			*log_flags)
+{
+	struct xfs_buf		*bp;
+	struct xfs_agf		*agf;
+	xfs_fsblock_t		btfsb;
+	int			error;
+
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+
+	/* Initialize a new refcountbt root. */
+	error = xrep_alloc_ag_block(sc, &XFS_RMAP_OINFO_REFC, &btfsb,
+			XFS_AG_RESV_METADATA);
+	if (error)
+		return error;
+	error = xrep_init_btblock(sc, btfsb, &bp, XFS_BTNUM_REFC,
+			&xfs_refcountbt_buf_ops);
+	if (error)
+		return error;
+	agf->agf_refcount_root = cpu_to_be32(XFS_FSB_TO_AGBNO(sc->mp, btfsb));
+	agf->agf_refcount_level = cpu_to_be32(1);
+	agf->agf_refcount_blocks = cpu_to_be32(1);
+	*log_flags |= XFS_AGF_REFCOUNT_BLOCKS | XFS_AGF_REFCOUNT_ROOT |
+		      XFS_AGF_REFCOUNT_LEVEL;
+
+	return 0;
+}
+
+/* Insert a single record into the refcount btree. */
+STATIC int
+xrep_refc_insert_rec(
+	const void			*item,
+	void				*priv)
+{
+	const struct xrep_refc_extent	*rre = item;
+	struct xfs_refcount_irec	refc = {
+		.rc_startblock	= rre->startblock,
+		.rc_blockcount	= rre->blockcount,
+		.rc_refcount	= rre->refcount,
+	};
+	struct xfs_scrub		*sc = priv;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_btree_cur		*cur;
+	int				have_gt;
+	int				error;
+
+	/* Insert into the refcountbt. */
+	cur = xfs_refcountbt_init_cursor(mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.agno);
+	error = xfs_refcount_lookup_eq(cur, rre->startblock, &have_gt);
+	if (error)
+		goto out;
+	XFS_WANT_CORRUPTED_GOTO(mp, have_gt == 0, out);
+	error = xfs_refcount_insert(cur, &refc, &have_gt);
+	if (error)
+		goto out;
+	XFS_WANT_CORRUPTED_GOTO(mp, have_gt == 1, out);
+	xfs_btree_del_cursor(cur, error);
+	return xrep_roll_ag_trans(sc);
+out:
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
+
+/* Build new refcount btree and dispose of the old one. */
+STATIC int
+xrep_refc_rebuild_tree(
+	struct xfs_scrub	*sc,
+	struct xfbma		*refcount_records,
+	struct xfs_bitmap	*old_refcountbt_blocks)
+{
+	int			error;
+
+	/*
+	 * Sort the refcount extents by startblock to avoid btree splits when
+	 * we rebuild the refcount btree.
+	 */
+	error = xfbma_sort(refcount_records, xrep_refc_extent_cmp);
+	if (error)
+		return error;
+
+	/* Free the old refcountbt blocks if they're not in use. */
+	error = xrep_reap_extents(sc, old_refcountbt_blocks,
+			&XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA);
+	if (error)
+		return error;
+
+	/* Add all records. */
+	return xfbma_iter_del(refcount_records, xrep_refc_insert_rec, sc);
+}
+
+/* Rebuild the refcount btree. */
+int
+xrep_refcountbt(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_bitmap	old_refcountbt_blocks;
+	struct xfbma		*refcount_records;
+	struct xfs_mount	*mp = sc->mp;
+	int			log_flags = 0;
+	int			error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	xchk_perag_get(sc->mp, &sc->sa);
+
+	/* Set up some storage */
+	refcount_records = xfbma_init(sizeof(struct xrep_refc_extent));
+	if (IS_ERR(refcount_records))
+		return PTR_ERR(refcount_records);
+
+	/* Collect all reference counts. */
+	xfs_bitmap_init(&old_refcountbt_blocks);
+	error = xrep_refc_find_refcounts(sc, refcount_records,
+			&old_refcountbt_blocks);
+	if (error)
+		goto out;
+
+	/*
+	 * Blow out the old refcount btree.  This is the point at which
+	 * we are no longer able to bail out gracefully.
+	 */
+	error = xrep_refc_reset_btree(sc, &log_flags);
+	if (error)
+		goto out;
+	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
+
+	/* Invalidate all the old refcountbt blocks in the bitmap. */
+	error = xrep_invalidate_blocks(sc, &old_refcountbt_blocks);
+	if (error)
+		goto out;
+	error = xrep_roll_ag_trans(sc);
+	if (error)
+		goto out;
+
+	/* Now rebuild the refcount information. */
+	error = xrep_refc_rebuild_tree(sc, refcount_records,
+			&old_refcountbt_blocks);
+	if (error)
+		goto out;
+	sc->flags |= XREP_RESET_PERAG_RESV;
+out:
+	xfs_bitmap_destroy(&old_refcountbt_blocks);
+	xfbma_destroy(refcount_records);
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 21472fbf11d5..f952d6739700 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -67,6 +67,7 @@ int xrep_agfl(struct xfs_scrub *sc);
 int xrep_agi(struct xfs_scrub *sc);
 int xrep_allocbt(struct xfs_scrub *sc);
 int xrep_iallocbt(struct xfs_scrub *sc);
+int xrep_refcountbt(struct xfs_scrub *sc);
 
 #else
 
@@ -108,6 +109,7 @@ xrep_reset_perag_resv(
 #define xrep_agi			xrep_notsupported
 #define xrep_allocbt			xrep_notsupported
 #define xrep_iallocbt			xrep_notsupported
+#define xrep_refcountbt			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 6011823d0d40..b104231af049 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -254,7 +254,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.setup	= xchk_setup_ag_refcountbt,
 		.scrub	= xchk_refcountbt,
 		.has	= xfs_sb_version_hasreflink,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_refcountbt,
 	},
 	[XFS_SCRUB_TYPE_INODE] = {	/* inode record */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index cdf0dffc17d2..f7e64a5cc751 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -729,8 +729,9 @@ DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn);
 
 TRACE_EVENT(xrep_refcount_extent_fn,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
-		 struct xfs_refcount_irec *irec),
-	TP_ARGS(mp, agno, irec),
+		 xfs_agblock_t startblock, xfs_extlen_t blockcount,
+		 xfs_nlink_t refcount),
+	TP_ARGS(mp, agno, startblock, blockcount, refcount),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
@@ -741,9 +742,9 @@ TRACE_EVENT(xrep_refcount_extent_fn,
 	TP_fast_assign(
 		__entry->dev = mp->m_super->s_dev;
 		__entry->agno = agno;
-		__entry->startblock = irec->rc_startblock;
-		__entry->blockcount = irec->rc_blockcount;
-		__entry->refcount = irec->rc_refcount;
+		__entry->startblock = startblock;
+		__entry->blockcount = blockcount;
+		__entry->refcount = refcount;
 	),
 	TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),

* [PATCH 07/18] xfs: repair inode records
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (5 preceding siblings ...)
  2019-08-05  0:35 ` [PATCH 06/18] xfs: repair refcount btrees Darrick J. Wong
@ 2019-08-05  0:35 ` Darrick J. Wong
  2019-08-05  0:35 ` [PATCH 08/18] xfs: zap broken inode forks Darrick J. Wong
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:35 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Try to reinitialize corrupt inodes, or clear the reflink flag
if it's not needed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile             |    1 
 fs/xfs/libxfs/xfs_format.h  |    3 
 fs/xfs/scrub/inode_repair.c |  659 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h       |    2 
 fs/xfs/scrub/scrub.c        |    2 
 5 files changed, 665 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/scrub/inode_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 4ac6256fe7c3..a4b0e79ce988 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -164,6 +164,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   array.o \
 				   bitmap.o \
 				   ialloc_repair.o \
+				   inode_repair.o \
 				   refcount_repair.o \
 				   repair.o \
 				   )
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index c968b60cee15..c24dedc11741 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -979,7 +979,8 @@ typedef enum xfs_dinode_fmt {
 #define XFS_DFORK_APTR(dip)	\
 	(XFS_DFORK_DPTR(dip) + XFS_DFORK_BOFF(dip))
 #define XFS_DFORK_PTR(dip,w)	\
-	((w) == XFS_DATA_FORK ? XFS_DFORK_DPTR(dip) : XFS_DFORK_APTR(dip))
+	((void *)((w) == XFS_DATA_FORK ? XFS_DFORK_DPTR(dip) : \
+					 XFS_DFORK_APTR(dip)))
 
 #define XFS_DFORK_FORMAT(dip,w) \
 	((w) == XFS_DATA_FORK ? \
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
new file mode 100644
index 000000000000..fd6353a907d8
--- /dev/null
+++ b/fs/xfs/scrub/inode_repair.c
@@ -0,0 +1,659 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_inode_buf.h"
+#include "xfs_inode_fork.h"
+#include "xfs_ialloc.h"
+#include "xfs_da_format.h"
+#include "xfs_reflink.h"
+#include "xfs_rmap.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_dir2.h"
+#include "xfs_quota_defs.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Inode Repair
+ *
+ * Roughly speaking, inode problems can be classified based on whether or not
+ * they trip the dinode verifiers.  If those trip, then we won't be able to
+ * _iget ourselves the inode.
+ *
+ * Therefore, the xrep_dinode_* functions fix anything that will trip the
+ * inode buffer verifier or the dinode verifier.  The xrep_inode_* functions
+ * fix things on live incore inodes.
+ */
+
+/* Make sure this buffer can pass the inode buffer verifier. */
+STATIC void
+xrep_dinode_buf(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_trans	*tp = sc->tp;
+	struct xfs_dinode	*dip;
+	xfs_agnumber_t		agno;
+	xfs_agino_t		agino;
+	int			ioff;
+	int			i;
+	int			ni;
+	bool			crc_ok;
+	bool			magic_ok;
+	bool			unlinked_ok;
+
+	ni = XFS_BB_TO_FSB(mp, bp->b_length) * mp->m_sb.sb_inopblock;
+	agno = xfs_daddr_to_agno(mp, XFS_BUF_ADDR(bp));
+	for (i = 0; i < ni; i++) {
+		ioff = i << mp->m_sb.sb_inodelog;
+		dip = xfs_buf_offset(bp, ioff);
+		agino = be32_to_cpu(dip->di_next_unlinked);
+
+		unlinked_ok = magic_ok = crc_ok = false;
+
+		if (xfs_verify_agino_or_null(sc->mp, agno, agino))
+			unlinked_ok = true;
+
+		if (dip->di_magic == cpu_to_be16(XFS_DINODE_MAGIC) &&
+		    xfs_dinode_good_version(mp, dip->di_version))
+			magic_ok = true;
+
+		if (xfs_verify_cksum((char *)dip, mp->m_sb.sb_inodesize,
+				XFS_DINODE_CRC_OFF))
+			crc_ok = true;
+
+		if (magic_ok && unlinked_ok && crc_ok)
+			continue;
+
+		if (!magic_ok) {
+			dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
+			dip->di_version = 3;
+		}
+		if (!unlinked_ok)
+			dip->di_next_unlinked = cpu_to_be32(NULLAGINO);
+		xfs_dinode_calc_crc(mp, dip);
+		xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
+		xfs_trans_log_buf(tp, bp, ioff, ioff + sizeof(*dip) - 1);
+	}
+}
+
+/* Reinitialize things that never change in an inode. */
+STATIC void
+xrep_dinode_header(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip)
+{
+	dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
+	if (!xfs_dinode_good_version(sc->mp, dip->di_version))
+		dip->di_version = 3;
+	dip->di_ino = cpu_to_be64(sc->sm->sm_ino);
+	uuid_copy(&dip->di_uuid, &sc->mp->m_sb.sb_meta_uuid);
+	dip->di_gen = cpu_to_be32(sc->sm->sm_gen);
+}
+
+/*
+ * Turn di_mode into /something/ recognizable.
+ *
+ * XXX: Ideally we'd try to read data block 0 to see if it's a directory.
+ */
+STATIC void
+xrep_dinode_mode(
+	struct xfs_dinode	*dip)
+{
+	uint16_t		mode;
+
+	mode = be16_to_cpu(dip->di_mode);
+	if (mode == 0 || xfs_mode_to_ftype(mode) != XFS_DIR3_FT_UNKNOWN)
+		return;
+
+	/* bad mode, so we set it to a file that only root can read */
+	mode = S_IFREG;
+	dip->di_mode = cpu_to_be16(mode);
+	dip->di_uid = 0;
+	dip->di_gid = 0;
+}
+
+/* Fix any conflicting flags that the verifiers complain about. */
+STATIC void
+xrep_dinode_flags(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip)
+{
+	struct xfs_mount	*mp = sc->mp;
+	uint64_t		flags2;
+	uint16_t		mode;
+	uint16_t		flags;
+
+	mode = be16_to_cpu(dip->di_mode);
+	flags = be16_to_cpu(dip->di_flags);
+	flags2 = be64_to_cpu(dip->di_flags2);
+
+	if (xfs_sb_version_hasreflink(&mp->m_sb) && S_ISREG(mode))
+		flags2 |= XFS_DIFLAG2_REFLINK;
+	else
+		flags2 &= ~(XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE);
+	if (flags & XFS_DIFLAG_REALTIME)
+		flags2 &= ~XFS_DIFLAG2_REFLINK;
+	if (flags2 & XFS_DIFLAG2_REFLINK)
+		flags2 &= ~XFS_DIFLAG2_DAX;
+	dip->di_flags = cpu_to_be16(flags);
+	dip->di_flags2 = cpu_to_be64(flags2);
+}
+
+/*
+ * Blow out symlink; now it points to the current dir.  We don't have to worry
+ * about incore state because this inode is failing the verifiers.
+ */
+STATIC void
+xrep_dinode_zap_symlink(
+	struct xfs_dinode	*dip)
+{
+	char			*p;
+
+	dip->di_format = XFS_DINODE_FMT_LOCAL;
+	dip->di_size = cpu_to_be64(1);
+	p = XFS_DFORK_PTR(dip, XFS_DATA_FORK);
+	*p = '.';
+}
+
+/*
+ * Blow out dir, make it point to the root.  In the future repair will
+ * reconstruct this directory for us.  Note that there's no in-core directory
+ * inode because the sf verifier tripped, so we don't have to worry about the
+ * dentry cache.
+ */
+STATIC void
+xrep_dinode_zap_dir(
+	struct xfs_mount		*mp,
+	struct xfs_dinode		*dip)
+{
+	const struct xfs_dir_ops	*ops;
+	struct xfs_dir2_sf_hdr		*sfp;
+	int				i8count;
+
+	dip->di_format = XFS_DINODE_FMT_LOCAL;
+	i8count = mp->m_sb.sb_rootino > XFS_DIR2_MAX_SHORT_INUM;
+	ops = xfs_dir_get_ops(mp, NULL);
+	sfp = XFS_DFORK_PTR(dip, XFS_DATA_FORK);
+	sfp->count = 0;
+	sfp->i8count = i8count;
+	ops->sf_put_parent_ino(sfp, mp->m_sb.sb_rootino);
+	dip->di_size = cpu_to_be64(xfs_dir2_sf_hdr_size(i8count));
+}
+
+/* Make sure we don't have a garbage file size. */
+STATIC void
+xrep_dinode_size(
+	struct xfs_mount	*mp,
+	struct xfs_dinode	*dip)
+{
+	uint64_t		size;
+	uint16_t		mode;
+
+	mode = be16_to_cpu(dip->di_mode);
+	size = be64_to_cpu(dip->di_size);
+	switch (mode & S_IFMT) {
+	case S_IFIFO:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFSOCK:
+		/* di_size can't be nonzero for special files */
+		dip->di_size = 0;
+		break;
+	case S_IFREG:
+		/* Regular files can't be larger than 2^63-1 bytes. */
+		dip->di_size = cpu_to_be64(size & ~(1ULL << 63));
+		break;
+	case S_IFLNK:
+		/*
+		 * Truncate ridiculously oversized symlinks.  If the size is
+		 * zero, reset it to point to the current directory.  Both of
+		 * these conditions trigger dinode verifier errors, so there
+		 * is no in-core state to reset.
+		 */
+		if (size > XFS_SYMLINK_MAXLEN)
+			dip->di_size = cpu_to_be64(XFS_SYMLINK_MAXLEN);
+		else if (size == 0)
+			xrep_dinode_zap_symlink(dip);
+		break;
+	case S_IFDIR:
+		/*
+		 * Directories can't have a size larger than 32G.  If the size
+		 * is zero, reset it to an empty directory.  Both of these
+		 * conditions trigger dinode verifier errors, so there is no
+		 * in-core state to reset.
+		 */
+		if (size > XFS_DIR2_SPACE_SIZE)
+			dip->di_size = cpu_to_be64(XFS_DIR2_SPACE_SIZE);
+		else if (size == 0)
+			xrep_dinode_zap_dir(mp, dip);
+		break;
+	}
+}
+
+/* Fix extent size hints. */
+STATIC void
+xrep_dinode_extsize_hints(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip)
+{
+	struct xfs_mount	*mp = sc->mp;
+	uint64_t		flags2;
+	uint16_t		flags;
+	uint16_t		mode;
+	xfs_failaddr_t		fa;
+
+	mode = be16_to_cpu(dip->di_mode);
+	flags = be16_to_cpu(dip->di_flags);
+	flags2 = be64_to_cpu(dip->di_flags2);
+
+	fa = xfs_inode_validate_extsize(mp, be32_to_cpu(dip->di_extsize),
+			mode, flags);
+	if (fa) {
+		dip->di_extsize = 0;
+		dip->di_flags &= ~cpu_to_be16(XFS_DIFLAG_EXTSIZE |
+					      XFS_DIFLAG_EXTSZINHERIT);
+	}
+
+	if (dip->di_version < 3)
+		return;
+
+	fa = xfs_inode_validate_cowextsize(mp, be32_to_cpu(dip->di_cowextsize),
+			mode, flags, flags2);
+	if (fa) {
+		dip->di_cowextsize = 0;
+		dip->di_flags2 &= ~cpu_to_be64(XFS_DIFLAG2_COWEXTSIZE);
+	}
+}
+
+/* Inode didn't pass verifiers, so fix the raw buffer and retry iget. */
+STATIC int
+xrep_dinode_core(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_imap		imap;
+	struct xfs_buf		*bp;
+	struct xfs_dinode	*dip;
+	xfs_ino_t		ino;
+	bool			inuse;
+	int			error;
+
+	/* Map & read inode. */
+	ino = sc->sm->sm_ino;
+	error = xfs_imap(sc->mp, sc->tp, ino, &imap, XFS_IGET_UNTRUSTED);
+	if (error)
+		return error;
+
+	error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
+			imap.im_blkno, imap.im_len, XBF_UNMAPPED, &bp, NULL);
+	if (error)
+		return error;
+
+	/* Make absolutely sure this inode isn't in core. */
+	error = xfs_icache_inode_is_allocated(sc->mp, sc->tp, ino, &inuse);
+	if (error == 0) {
+		ASSERT(0);
+		return -EFSCORRUPTED;
+	}
+
+	/* Make sure we can pass the inode buffer verifier. */
+	xrep_dinode_buf(sc, bp);
+	bp->b_ops = &xfs_inode_buf_ops;
+
+	/* Fix everything the verifier will complain about. */
+	dip = xfs_buf_offset(bp, imap.im_boffset);
+	xrep_dinode_header(sc, dip);
+	xrep_dinode_mode(dip);
+	xrep_dinode_flags(sc, dip);
+	xrep_dinode_size(sc->mp, dip);
+	xrep_dinode_extsize_hints(sc, dip);
+
+	/* Write out the inode... */
+	xfs_dinode_calc_crc(sc->mp, dip);
+	xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_DINO_BUF);
+	xfs_trans_log_buf(sc->tp, bp, imap.im_boffset,
+			imap.im_boffset + sc->mp->m_sb.sb_inodesize - 1);
+	error = xfs_trans_commit(sc->tp);
+	sc->tp = NULL;
+	if (error)
+		return error;
+
+	/* ...and reload it? */
+	error = xfs_iget(sc->mp, sc->tp, ino,
+			XFS_IGET_UNTRUSTED | XFS_IGET_DONTCACHE, 0, &sc->ip);
+	if (error)
+		return error;
+	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+	xfs_ilock(sc->ip, sc->ilock_flags);
+	error = xchk_trans_alloc(sc, 0);
+	if (error)
+		return error;
+	sc->ilock_flags |= XFS_ILOCK_EXCL;
+	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
+
+	return 0;
+}
+
+/* Fix everything xfs_dinode_verify cares about. */
+STATIC int
+xrep_dinode_problems(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	error = xrep_dinode_core(sc);
+	if (error)
+		return error;
+
+	/* We had to fix a totally busted inode, schedule quotacheck. */
+	if (XFS_IS_UQUOTA_ON(sc->mp))
+		xrep_force_quotacheck(sc, XFS_DQ_USER);
+	if (XFS_IS_GQUOTA_ON(sc->mp))
+		xrep_force_quotacheck(sc, XFS_DQ_GROUP);
+	if (XFS_IS_PQUOTA_ON(sc->mp))
+		xrep_force_quotacheck(sc, XFS_DQ_PROJ);
+
+	return 0;
+}
+
+/*
+ * Fix problems that the verifiers don't care about.  In general these are
+ * errors that don't cause problems elsewhere in the kernel that we can easily
+ * detect, so we don't check them all that rigorously.
+ */
+
+/* Make sure block and extent counts are ok. */
+STATIC int
+xrep_inode_blockcounts(
+	struct xfs_scrub	*sc)
+{
+	xfs_filblks_t		count;
+	xfs_filblks_t		acount;
+	xfs_extnum_t		nextents;
+	int			error;
+
+	/* Set data fork counters from the data fork mappings. */
+	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_DATA_FORK,
+			&nextents, &count);
+	if (error)
+		return error;
+	if (XFS_IS_REALTIME_INODE(sc->ip)) {
+		if (count >= sc->mp->m_sb.sb_rblocks)
+			return -EFSCORRUPTED;
+	} else if (!xfs_sb_version_hasreflink(&sc->mp->m_sb)) {
+		if (count >= sc->mp->m_sb.sb_dblocks)
+			return -EFSCORRUPTED;
+	}
+	sc->ip->i_d.di_nextents = nextents;
+
+	/* Set attr fork counters from the attr fork mappings. */
+	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_ATTR_FORK,
+			&nextents, &acount);
+	if (error)
+		return error;
+	if (acount >= sc->mp->m_sb.sb_dblocks)
+		return -EFSCORRUPTED;
+	if (nextents >= (uint16_t)-1U)
+		return -EFSCORRUPTED;
+	sc->ip->i_d.di_anextents = nextents;
+
+	sc->ip->i_d.di_nblocks = count + acount;
+
+	/*
+	 * If we found attr fork extents but no attr fork root, zero the
+	 * attr fork extent count so that the attr fork repair will run.
+	 */
+	if (sc->ip->i_d.di_anextents != 0 && sc->ip->i_d.di_forkoff == 0)
+		sc->ip->i_d.di_anextents = 0;
+
+	return 0;
+}
+
+/* Check for invalid uid/gid.  Note that a -1U projid is allowed. */
+STATIC void
+xrep_inode_ids(
+	struct xfs_scrub	*sc)
+{
+	if (sc->ip->i_d.di_uid == -1U) {
+		sc->ip->i_d.di_uid = 0;
+		VFS_I(sc->ip)->i_mode &= ~(S_ISUID | S_ISGID);
+		if (XFS_IS_UQUOTA_ON(sc->mp))
+			xrep_force_quotacheck(sc, XFS_DQ_USER);
+	}
+
+	if (sc->ip->i_d.di_gid == -1U) {
+		sc->ip->i_d.di_gid = 0;
+		VFS_I(sc->ip)->i_mode &= ~(S_ISUID | S_ISGID);
+		if (XFS_IS_GQUOTA_ON(sc->mp))
+			xrep_force_quotacheck(sc, XFS_DQ_GROUP);
+	}
+}
+
+/* Nanosecond counters must be less than 1 billion. */
+STATIC void
+xrep_inode_timestamps(
+	struct xfs_inode	*ip)
+{
+	if ((unsigned long)VFS_I(ip)->i_atime.tv_nsec >= NSEC_PER_SEC)
+		VFS_I(ip)->i_atime.tv_nsec = 0;
+	if ((unsigned long)VFS_I(ip)->i_mtime.tv_nsec >= NSEC_PER_SEC)
+		VFS_I(ip)->i_mtime.tv_nsec = 0;
+	if ((unsigned long)VFS_I(ip)->i_ctime.tv_nsec >= NSEC_PER_SEC)
+		VFS_I(ip)->i_ctime.tv_nsec = 0;
+	if (ip->i_d.di_version > 2 &&
+	    (unsigned long)ip->i_d.di_crtime.t_nsec >= NSEC_PER_SEC)
+		ip->i_d.di_crtime.t_nsec = 0;
+}
+
+/* Fix inode flags that don't make sense together. */
+STATIC void
+xrep_inode_flags(
+	struct xfs_scrub	*sc)
+{
+	uint16_t		mode;
+
+	mode = VFS_I(sc->ip)->i_mode;
+
+	/* Clear junk flags */
+	if (sc->ip->i_d.di_flags & ~XFS_DIFLAG_ANY)
+		sc->ip->i_d.di_flags &= XFS_DIFLAG_ANY;
+
+	/* NEWRTBM only applies to realtime bitmaps */
+	if (sc->ip->i_ino == sc->mp->m_sb.sb_rbmino)
+		sc->ip->i_d.di_flags |= XFS_DIFLAG_NEWRTBM;
+	else
+		sc->ip->i_d.di_flags &= ~XFS_DIFLAG_NEWRTBM;
+
+	/* These only make sense for directories. */
+	if (!S_ISDIR(mode))
+		sc->ip->i_d.di_flags &= ~(XFS_DIFLAG_RTINHERIT |
+					  XFS_DIFLAG_EXTSZINHERIT |
+					  XFS_DIFLAG_PROJINHERIT |
+					  XFS_DIFLAG_NOSYMLINKS);
+
+	/* These only make sense for files. */
+	if (!S_ISREG(mode))
+		sc->ip->i_d.di_flags &= ~(XFS_DIFLAG_REALTIME |
+					  XFS_DIFLAG_EXTSIZE);
+
+	/* These only make sense for non-rt files. */
+	if (sc->ip->i_d.di_flags & XFS_DIFLAG_REALTIME)
+		sc->ip->i_d.di_flags &= ~XFS_DIFLAG_FILESTREAM;
+
+	/* Immutable and append only?  Drop the append. */
+	if ((sc->ip->i_d.di_flags & XFS_DIFLAG_IMMUTABLE) &&
+	    (sc->ip->i_d.di_flags & XFS_DIFLAG_APPEND))
+		sc->ip->i_d.di_flags &= ~XFS_DIFLAG_APPEND;
+
+	if (sc->ip->i_d.di_version < 3)
+		return;
+
+	/* Clear junk flags. */
+	if (sc->ip->i_d.di_flags2 & ~XFS_DIFLAG2_ANY)
+		sc->ip->i_d.di_flags2 &= XFS_DIFLAG2_ANY;
+
+	/* No reflink flag unless we support it and it's a file. */
+	if (!xfs_sb_version_hasreflink(&sc->mp->m_sb) ||
+	    !S_ISREG(mode))
+		sc->ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+
+	/* DAX only applies to files and dirs. */
+	if (!(S_ISREG(mode) || S_ISDIR(mode)))
+		sc->ip->i_d.di_flags2 &= ~XFS_DIFLAG2_DAX;
+
+	/* No reflink files on the realtime device. */
+	if (sc->ip->i_d.di_flags & XFS_DIFLAG_REALTIME)
+		sc->ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+
+	/* No mixing reflink and DAX yet. */
+	if (sc->ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)
+		sc->ip->i_d.di_flags2 &= ~XFS_DIFLAG2_DAX;
+}
+
+/*
+ * Fix size problems with block/node format directories.  If we fail to find
+ * the extent list, just bail out and let the bmapbtd repair functions clean
+ * up that mess.
+ */
+STATIC void
+xrep_inode_blockdir_size(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	got;
+	struct xfs_ifork	*ifp;
+	xfs_fileoff_t		off;
+	int			error;
+
+	/* Find the last block before 32G; this is the dir size. */
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
+	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
+		error = xfs_iread_extents(sc->tp, sc->ip, XFS_DATA_FORK);
+		if (error)
+			return;
+	}
+
+	off = XFS_B_TO_FSB(sc->mp, XFS_DIR2_SPACE_SIZE);
+	if (!xfs_iext_lookup_extent_before(sc->ip, ifp, &off, &icur, &got)) {
+		/* zero-extents directory? */
+		return;
+	}
+
+	off = got.br_startoff + got.br_blockcount;
+	sc->ip->i_d.di_size = min_t(loff_t, XFS_DIR2_SPACE_SIZE,
+			XFS_FSB_TO_B(sc->mp, off));
+}
+
+/* Fix size problems with short format directories. */
+STATIC void
+xrep_inode_sfdir_size(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_ifork	*ifp;
+
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
+	sc->ip->i_d.di_size = ifp->if_bytes;
+}
+
+/*
+ * Fix any irregularities in an inode's size now that we can iterate extent
+ * maps and access other regular inode data.
+ */
+STATIC void
+xrep_inode_size(
+	struct xfs_scrub	*sc)
+{
+	/*
+	 * Currently we only support fixing the size of directories (local,
+	 * extents, or btree format).  Files can be any size and sizes for the
+	 * other special inode types are fixed by xrep_dinode_size.
+	 */
+	if (!S_ISDIR(VFS_I(sc->ip)->i_mode))
+		return;
+	switch (XFS_IFORK_FORMAT(sc->ip, XFS_DATA_FORK)) {
+	case XFS_DINODE_FMT_EXTENTS:
+	case XFS_DINODE_FMT_BTREE:
+		xrep_inode_blockdir_size(sc);
+		break;
+	case XFS_DINODE_FMT_LOCAL:
+		xrep_inode_sfdir_size(sc);
+		break;
+	}
+}
+
+/* Fix any irregularities in an inode that the verifiers don't catch. */
+STATIC int
+xrep_inode_problems(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	error = xrep_inode_blockcounts(sc);
+	if (error)
+		return error;
+	xrep_inode_timestamps(sc->ip);
+	xrep_inode_flags(sc);
+	xrep_inode_ids(sc);
+	xrep_inode_size(sc);
+	xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE);
+	return xfs_trans_roll_inode(&sc->tp, sc->ip);
+}
+
+/* Repair an inode's fields. */
+int
+xrep_inode(
+	struct xfs_scrub	*sc)
+{
+	int			error = 0;
+
+	/*
+	 * No inode?  That means we failed the _iget verifiers.  Repair all
+	 * the things that the inode verifiers care about, then retry _iget.
+	 */
+	if (!sc->ip) {
+		error = xrep_dinode_problems(sc);
+		if (error)
+			goto out;
+	}
+
+	/* By this point we had better have a working incore inode. */
+	ASSERT(sc->ip);
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	/* If we found corruption of any kind, try to fix it. */
+	if ((sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) ||
+	    (sc->sm->sm_flags & XFS_SCRUB_OFLAG_XCORRUPT)) {
+		error = xrep_inode_problems(sc);
+		if (error)
+			goto out;
+	}
+
+	/* See if we can clear the reflink flag. */
+	if (xfs_is_reflink_inode(sc->ip))
+		return xfs_reflink_clear_inode_flag(sc->ip, &sc->tp);
+
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index f952d6739700..dc8e27cf6c1c 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -68,6 +68,7 @@ int xrep_agi(struct xfs_scrub *sc);
 int xrep_allocbt(struct xfs_scrub *sc);
 int xrep_iallocbt(struct xfs_scrub *sc);
 int xrep_refcountbt(struct xfs_scrub *sc);
+int xrep_inode(struct xfs_scrub *sc);
 
 #else
 
@@ -110,6 +111,7 @@ xrep_reset_perag_resv(
 #define xrep_allocbt			xrep_notsupported
 #define xrep_iallocbt			xrep_notsupported
 #define xrep_refcountbt			xrep_notsupported
+#define xrep_inode			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index b104231af049..6de28006290c 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -260,7 +260,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_inode,
 		.scrub	= xchk_inode,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_inode,
 	},
 	[XFS_SCRUB_TYPE_BMBTD] = {	/* inode data fork */
 		.type	= ST_INODE,
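
For readers following the size fixups in xrep_dinode_size above, here is a
minimal userspace sketch of the per-file-type clamping rules.  It is not
the kernel code: the sk_/SK_ names are hypothetical, the mode constants
are the usual Linux octal values, the symlink limit of 1024 is an assumed
value for XFS_SYMLINK_MAXLEN, and empty_dir_size stands in for whatever
shortform header size xrep_dinode_zap_dir would compute.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-ins for the S_IF* mode bits (usual Linux octal values). */
#define SK_IFMT		0170000u
#define SK_IFIFO	0010000u
#define SK_IFCHR	0020000u
#define SK_IFDIR	0040000u
#define SK_IFBLK	0060000u
#define SK_IFREG	0100000u
#define SK_IFLNK	0120000u
#define SK_IFSOCK	0140000u

/* Assumed limits for illustration; the kernel headers are authoritative. */
#define SK_SYMLINK_MAXLEN	1024ULL		/* assumed XFS_SYMLINK_MAXLEN */
#define SK_DIR_SPACE_SIZE	(1ULL << 35)	/* 32G, per the patch comment */

/*
 * Hypothetical model of the size clamp.  Returns the size the on-disk
 * inode would carry afterwards.
 */
static uint64_t
sk_fix_size(uint32_t mode, uint64_t size, uint64_t empty_dir_size)
{
	assert(empty_dir_size > 0);

	switch (mode & SK_IFMT) {
	case SK_IFIFO:
	case SK_IFCHR:
	case SK_IFBLK:
	case SK_IFSOCK:
		/* Special files never have a size. */
		return 0;
	case SK_IFREG:
		/* Drop the top bit so the size fits in 2^63-1. */
		return size & ~(1ULL << 63);
	case SK_IFLNK:
		if (size > SK_SYMLINK_MAXLEN)
			return SK_SYMLINK_MAXLEN;
		/* An empty symlink is reset to "." (one byte). */
		return size == 0 ? 1 : size;
	case SK_IFDIR:
		if (size > SK_DIR_SPACE_SIZE)
			return SK_DIR_SPACE_SIZE;
		/* An empty directory is reset to a bare shortform header. */
		return size == 0 ? empty_dir_size : size;
	default:
		return size;
	}
}
```

The sketch returns a value instead of rewriting the buffer in place, which
keeps it testable; the patch itself also rewrites the fork contents when
it zaps a symlink or directory.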

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 08/18] xfs: zap broken inode forks
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (6 preceding siblings ...)
  2019-08-05  0:35 ` [PATCH 07/18] xfs: repair inode records Darrick J. Wong
@ 2019-08-05  0:35 ` Darrick J. Wong
  2019-08-05  0:35 ` [PATCH 09/18] xfs: repair inode block maps Darrick J. Wong
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:35 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Determine if inode fork damage is responsible for the inode being unable
to pass the ifork verifiers in xfs_iget and zap the fork contents if
this is true.  Once this is done the fork will be empty but we'll be
able to construct an in-core inode, and a subsequent call to the inode
fork repair ioctl will search the rmapbt to rebuild the records that
were in the fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
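The per-record accounting that this patch does while walking the rmapbt
can be modelled in userspace roughly as follows.  This is only a sketch
under stated assumptions: the sk_ structures and flag values are
hypothetical stand-ins for struct xfs_rmap_irec and the XFS_RMAP_* flags,
and realtime extents are left out, as they are in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits; the real XFS_RMAP_* values differ. */
#define SK_RMAP_ATTR_FORK	(1u << 0)
#define SK_RMAP_BMBT_BLOCK	(1u << 1)

/* Hypothetical stand-in for one reverse-mapping record. */
struct sk_rmap {
	uint64_t	owner;		/* inode number owning the blocks */
	uint32_t	blockcount;	/* length of the mapping in blocks */
	unsigned int	flags;
};

/* Hypothetical stand-in for struct xrep_dinode_stats (no rt fields). */
struct sk_stats {
	uint64_t	data_blocks;
	uint64_t	attr_blocks;
	uint32_t	data_extents;
	uint32_t	attr_extents;
};

/* Tally blocks and extents owned by @ino across @nr records. */
static void
sk_tally(struct sk_stats *st, uint64_t ino, const struct sk_rmap *recs,
	 size_t nr)
{
	size_t		i;

	assert(st != NULL);

	for (i = 0; i < nr; i++) {
		const struct sk_rmap	*rec = &recs[i];

		/* Skip mappings that belong to some other owner. */
		if (rec->owner != ino)
			continue;

		/*
		 * Every block counts towards the fork's block total, but
		 * bmbt blocks are btree overhead, not file extents.
		 */
		if (rec->flags & SK_RMAP_ATTR_FORK) {
			st->attr_blocks += rec->blockcount;
			if (!(rec->flags & SK_RMAP_BMBT_BLOCK))
				st->attr_extents++;
		} else {
			st->data_blocks += rec->blockcount;
			if (!(rec->flags & SK_RMAP_BMBT_BLOCK))
				st->data_extents++;
		}
	}
}
```

The kernel version gets its records one at a time from a btree query
callback, once per AG; the array here is just a convenient substitute.
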
 fs/xfs/libxfs/xfs_attr_leaf.c |   32 ++-
 fs/xfs/libxfs/xfs_attr_leaf.h |    2 
 fs/xfs/libxfs/xfs_bmap.c      |   21 ++
 fs/xfs/libxfs/xfs_bmap.h      |    2 
 fs/xfs/scrub/inode_repair.c   |  401 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 439 insertions(+), 19 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index 70eb941d02e4..70b94fe09a53 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -890,23 +890,16 @@ xfs_attr_shortform_allfit(
 	return xfs_attr_shortform_bytesfit(dp, bytes);
 }
 
-/* Verify the consistency of an inline attribute fork. */
+/* Verify the consistency of a raw inline attribute fork. */
 xfs_failaddr_t
-xfs_attr_shortform_verify(
-	struct xfs_inode		*ip)
+xfs_attr_shortform_verify_struct(
+	struct xfs_attr_shortform	*sfp,
+	size_t				size)
 {
-	struct xfs_attr_shortform	*sfp;
 	struct xfs_attr_sf_entry	*sfep;
 	struct xfs_attr_sf_entry	*next_sfep;
 	char				*endp;
-	struct xfs_ifork		*ifp;
 	int				i;
-	int				size;
-
-	ASSERT(ip->i_d.di_aformat == XFS_DINODE_FMT_LOCAL);
-	ifp = XFS_IFORK_PTR(ip, XFS_ATTR_FORK);
-	sfp = (struct xfs_attr_shortform *)ifp->if_u1.if_data;
-	size = ifp->if_bytes;
 
 	/*
 	 * Give up if the attribute is way too short.
@@ -964,6 +957,23 @@ xfs_attr_shortform_verify(
 	return NULL;
 }
 
+/* Verify the consistency of an inline attribute fork. */
+xfs_failaddr_t
+xfs_attr_shortform_verify(
+	struct xfs_inode		*ip)
+{
+	struct xfs_attr_shortform	*sfp;
+	struct xfs_ifork		*ifp;
+	int				size;
+
+	ASSERT(ip->i_d.di_aformat == XFS_DINODE_FMT_LOCAL);
+	ifp = XFS_IFORK_PTR(ip, XFS_ATTR_FORK);
+	sfp = (struct xfs_attr_shortform *)ifp->if_u1.if_data;
+	size = ifp->if_bytes;
+
+	return xfs_attr_shortform_verify_struct(sfp, size);
+}
+
 /*
  * Convert a leaf attribute list to shortform attribute list
  */
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
index 7b74e18becff..728af25a1738 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.h
+++ b/fs/xfs/libxfs/xfs_attr_leaf.h
@@ -41,6 +41,8 @@ int	xfs_attr_shortform_to_leaf(struct xfs_da_args *args,
 int	xfs_attr_shortform_remove(struct xfs_da_args *args);
 int	xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
 int	xfs_attr_shortform_bytesfit(struct xfs_inode *dp, int bytes);
+xfs_failaddr_t xfs_attr_shortform_verify_struct(struct xfs_attr_shortform *sfp,
+		size_t size);
 xfs_failaddr_t xfs_attr_shortform_verify(struct xfs_inode *ip);
 void	xfs_attr_fork_remove(struct xfs_inode *ip, struct xfs_trans *tp);
 
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 2a1bfca79938..39dbc93374dc 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6187,18 +6187,16 @@ xfs_bmap_finish_one(
 	return error;
 }
 
-/* Check that an inode's extent does not have invalid flags or bad ranges. */
+/* Check that an extent does not have invalid flags or bad ranges. */
 xfs_failaddr_t
-xfs_bmap_validate_extent(
-	struct xfs_inode	*ip,
+xfs_bmap_validate_extent_raw(
+	struct xfs_mount	*mp,
+	bool			isrt,
 	int			whichfork,
 	struct xfs_bmbt_irec	*irec)
 {
-	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fsblock_t		endfsb;
-	bool			isrt;
 
-	isrt = XFS_IS_REALTIME_INODE(ip);
 	endfsb = irec->br_startblock + irec->br_blockcount - 1;
 	if (isrt) {
 		if (!xfs_verify_rtbno(mp, irec->br_startblock))
@@ -6218,3 +6216,14 @@ xfs_bmap_validate_extent(
 		return __this_address;
 	return NULL;
 }
+
+/* Check that an inode's extent does not have invalid flags or bad ranges. */
+xfs_failaddr_t
+xfs_bmap_validate_extent(
+	struct xfs_inode	*ip,
+	int			whichfork,
+	struct xfs_bmbt_irec	*irec)
+{
+	return xfs_bmap_validate_extent_raw(ip->i_mount,
+			XFS_IS_REALTIME_INODE(ip), whichfork, irec);
+}
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 8f597f9abdbe..b857762fac55 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -271,6 +271,8 @@ static inline int xfs_bmap_fork_to_state(int whichfork)
 	}
 }
 
+xfs_failaddr_t xfs_bmap_validate_extent_raw(struct xfs_mount *mp, bool isrt,
+		int whichfork, struct xfs_bmbt_irec *irec);
 xfs_failaddr_t xfs_bmap_validate_extent(struct xfs_inode *ip, int whichfork,
 		struct xfs_bmbt_irec *irec);
 
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index fd6353a907d8..dddcb69a0601 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -22,11 +22,15 @@
 #include "xfs_ialloc.h"
 #include "xfs_da_format.h"
 #include "xfs_reflink.h"
+#include "xfs_alloc.h"
 #include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
 #include "xfs_bmap.h"
+#include "xfs_bmap_btree.h"
 #include "xfs_bmap_util.h"
 #include "xfs_dir2.h"
 #include "xfs_quota_defs.h"
+#include "xfs_attr_leaf.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
@@ -139,7 +143,8 @@ xrep_dinode_mode(
 STATIC void
 xrep_dinode_flags(
 	struct xfs_scrub	*sc,
-	struct xfs_dinode	*dip)
+	struct xfs_dinode	*dip,
+	bool			is_rt_file)
 {
 	struct xfs_mount	*mp = sc->mp;
 	uint64_t		flags2;
@@ -150,6 +155,11 @@ xrep_dinode_flags(
 	flags = be16_to_cpu(dip->di_flags);
 	flags2 = be64_to_cpu(dip->di_flags2);
 
+	if (is_rt_file)
+		flags |= XFS_DIFLAG_REALTIME;
+	else
+		flags &= ~XFS_DIFLAG_REALTIME;
+
 	if (xfs_sb_version_hasreflink(&mp->m_sb) && S_ISREG(mode))
 		flags2 |= XFS_DIFLAG2_REFLINK;
 	else
@@ -288,11 +298,392 @@ xrep_dinode_extsize_hints(
 	}
 }
 
+/* Blocks and extents associated with an inode, according to rmap records. */
+struct xrep_dinode_stats {
+	struct xfs_scrub	*sc;
+
+	/* Blocks in use on the data device by data extents or bmbt blocks. */
+	xfs_rfsblock_t		data_blocks;
+
+	/* Blocks in use on the rt device. */
+	xfs_rfsblock_t		rt_blocks;
+
+	/* Blocks in use by the attr fork. */
+	xfs_rfsblock_t		attr_blocks;
+
+	/* Number of data device extents for the data fork. */
+	xfs_extnum_t		data_extents;
+
+	/*
+	 * Number of realtime device extents for the data fork.  If
+	 * data_extents and rt_extents indicate that the data fork has extents
+	 * on both devices, we'll just back away slowly.
+	 */
+	xfs_extnum_t		rt_extents;
+
+	/* Number of (data device) extents for the attr fork. */
+	xfs_aextnum_t		attr_extents;
+};
+
+/* Count extents and blocks for an inode given an rmap. */
+STATIC int
+xrep_dinode_walk_rmap(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	struct xrep_dinode_stats	*dis = priv;
+
+	/* Is this even the right inode? */
+	if (rec->rm_owner != dis->sc->sm->sm_ino)
+		return 0;
+	if (rec->rm_flags & XFS_RMAP_ATTR_FORK) {
+		dis->attr_blocks += rec->rm_blockcount;
+		if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK))
+			dis->attr_extents++;
+	} else {
+		dis->data_blocks += rec->rm_blockcount;
+		if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK))
+			dis->data_extents++;
+	}
+	return 0;
+}
+
+/* Count extents and blocks for an inode from all AG rmap data. */
+STATIC int
+xrep_dinode_count_ag_rmaps(
+	struct xrep_dinode_stats	*dis,
+	xfs_agnumber_t			agno)
+{
+	struct xfs_btree_cur		*cur;
+	struct xfs_buf			*agf;
+	int				error;
+
+	error = xfs_alloc_read_agf(dis->sc->mp, dis->sc->tp, agno, 0, &agf);
+	if (error)
+		return error;
+
+	cur = xfs_rmapbt_init_cursor(dis->sc->mp, dis->sc->tp, agf, agno);
+	if (!cur) {
+		error = -ENOMEM;
+		goto out_agf;
+	}
+
+	error = xfs_rmap_query_all(cur, xrep_dinode_walk_rmap, dis);
+	if (error == XFS_BTREE_QUERY_RANGE_ABORT)
+		error = 0;
+
+	xfs_btree_del_cursor(cur, error);
+out_agf:
+	xfs_trans_brelse(dis->sc->tp, agf);
+	return error;
+}
+
+/* Count extents and blocks for a given inode from all rmap data. */
+STATIC int
+xrep_dinode_count_rmaps(
+	struct xrep_dinode_stats	*dis)
+{
+	xfs_agnumber_t			agno;
+	int				error;
+
+	if (!xfs_sb_version_hasrmapbt(&dis->sc->mp->m_sb) ||
+	    xfs_sb_version_hasrealtime(&dis->sc->mp->m_sb))
+		return -EOPNOTSUPP;
+
+	/* XXX: find rt blocks too */
+	if (dis->rt_extents != 0) {
+		ASSERT(0);
+		return -EOPNOTSUPP;
+	}
+
+	for (agno = 0; agno < dis->sc->mp->m_sb.sb_agcount; agno++) {
+		error = xrep_dinode_count_ag_rmaps(dis, agno);
+		if (error)
+			return error;
+	}
+
+	/* Can't have extents on both the rt and the data device. */
+	if (dis->data_extents && dis->rt_extents)
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/* Return true if this extents-format ifork looks like garbage. */
+STATIC bool
+xrep_dinode_bad_extents_fork(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip,
+	int			dfork_size,
+	int			whichfork)
+{
+	struct xfs_bmbt_irec	new;
+	struct xfs_bmbt_rec	*dp;
+	bool			isrt;
+	int			i;
+	int			nex;
+	int			fork_size;
+
+	nex = XFS_DFORK_NEXTENTS(dip, whichfork);
+	fork_size = nex * sizeof(struct xfs_bmbt_rec);
+	if (fork_size < 0 || fork_size > dfork_size)
+		return true;
+	if (whichfork == XFS_ATTR_FORK && nex > ((uint16_t)-1U))
+		return true;
+	dp = XFS_DFORK_PTR(dip, whichfork);
+
+	isrt = dip->di_flags & cpu_to_be16(XFS_DIFLAG_REALTIME);
+	for (i = 0; i < nex; i++, dp++) {
+		xfs_failaddr_t	fa;
+
+		xfs_bmbt_disk_get_all(dp, &new);
+		fa = xfs_bmap_validate_extent_raw(sc->mp, isrt, whichfork,
+				&new);
+		if (fa)
+			return true;
+	}
+
+	return false;
+}
+
+/* Return true if this btree-format ifork looks like garbage. */
+STATIC bool
+xrep_dinode_bad_btree_fork(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip,
+	int			dfork_size,
+	int			whichfork)
+{
+	struct xfs_bmdr_block	*dfp;
+	int			nrecs;
+	int			level;
+
+	if (XFS_DFORK_NEXTENTS(dip, whichfork) <=
+			dfork_size / sizeof(struct xfs_bmbt_rec))
+		return true;
+
+	dfp = XFS_DFORK_PTR(dip, whichfork);
+	nrecs = be16_to_cpu(dfp->bb_numrecs);
+	level = be16_to_cpu(dfp->bb_level);
+
+	if (nrecs == 0 || XFS_BMDR_SPACE_CALC(nrecs) > dfork_size)
+		return true;
+	if (level == 0 || level > XFS_BTREE_MAXLEVELS)
+		return true;
+	return false;
+}
+
+/*
+ * Check the data fork for things that will fail the ifork verifiers or the
+ * ifork formatters.
+ */
+STATIC bool
+xrep_dinode_check_dfork(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip,
+	uint16_t		mode)
+{
+	uint64_t		size;
+	unsigned int		fmt;
+	int			dfork_size;
+
+	fmt = XFS_DFORK_FORMAT(dip, XFS_DATA_FORK);
+	size = be64_to_cpu(dip->di_size);
+	switch (mode & S_IFMT) {
+	case S_IFIFO:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFSOCK:
+		if (fmt != XFS_DINODE_FMT_DEV)
+			return true;
+		break;
+	case S_IFREG:
+		if (fmt == XFS_DINODE_FMT_LOCAL)
+			return true;
+		/* fall through */
+	case S_IFLNK:
+	case S_IFDIR:
+		switch (fmt) {
+		case XFS_DINODE_FMT_LOCAL:
+		case XFS_DINODE_FMT_EXTENTS:
+		case XFS_DINODE_FMT_BTREE:
+			break;
+		default:
+			return true;
+		}
+		break;
+	default:
+		return true;
+	}
+	dfork_size = XFS_DFORK_SIZE(dip, sc->mp, XFS_DATA_FORK);
+	switch (fmt) {
+	case XFS_DINODE_FMT_DEV:
+		break;
+	case XFS_DINODE_FMT_LOCAL:
+		if (size > dfork_size)
+			return true;
+		break;
+	case XFS_DINODE_FMT_EXTENTS:
+		if (xrep_dinode_bad_extents_fork(sc, dip, dfork_size,
+				XFS_DATA_FORK))
+			return true;
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		if (xrep_dinode_bad_btree_fork(sc, dip, dfork_size,
+				XFS_DATA_FORK))
+			return true;
+		break;
+	default:
+		return true;
+	}
+
+	return false;
+}
+
+/* Reset the data fork to something sane. */
+STATIC void
+xrep_dinode_zap_dfork(
+	struct xfs_scrub		*sc,
+	struct xfs_dinode		*dip,
+	uint16_t			mode,
+	struct xrep_dinode_stats	*dis)
+{
+	/* Special files always get reset to DEV */
+	switch (mode & S_IFMT) {
+	case S_IFIFO:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFSOCK:
+		dip->di_format = XFS_DINODE_FMT_DEV;
+		dip->di_size = 0;
+		return;
+	}
+
+	/*
+	 * If we have data extents or this is a regular file, reset to an
+	 * empty map and hope the user will run the bmapbtd checker next.
+	 */
+	if (dis->data_extents || dis->rt_extents || S_ISREG(mode)) {
+		dip->di_format = XFS_DINODE_FMT_EXTENTS;
+		dip->di_nextents = 0;
+		return;
+	}
+
+	/* Otherwise, reset the local format to the minimum. */
+	switch (mode & S_IFMT) {
+	case S_IFLNK:
+		xrep_dinode_zap_symlink(dip);
+		break;
+	case S_IFDIR:
+		xrep_dinode_zap_dir(sc->mp, dip);
+		break;
+	}
+}
+
+/*
+ * Check the attr fork for things that will fail the ifork verifiers or the
+ * ifork formatters.
+ */
+STATIC bool
+xrep_dinode_check_afork(
+	struct xfs_scrub		*sc,
+	struct xfs_dinode		*dip)
+{
+	struct xfs_attr_shortform	*sfp;
+	int				size;
+
+	if (XFS_DFORK_BOFF(dip) == 0)
+		return dip->di_aformat != XFS_DINODE_FMT_EXTENTS ||
+		       dip->di_anextents != 0;
+
+	size = XFS_DFORK_SIZE(dip, sc->mp, XFS_ATTR_FORK);
+	switch (XFS_DFORK_FORMAT(dip, XFS_ATTR_FORK)) {
+	case XFS_DINODE_FMT_LOCAL:
+		sfp = XFS_DFORK_PTR(dip, XFS_ATTR_FORK);
+		return xfs_attr_shortform_verify_struct(sfp, size) != NULL;
+	case XFS_DINODE_FMT_EXTENTS:
+		if (xrep_dinode_bad_extents_fork(sc, dip, size, XFS_ATTR_FORK))
+			return true;
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		if (xrep_dinode_bad_btree_fork(sc, dip, size, XFS_ATTR_FORK))
+			return true;
+		break;
+	default:
+		return true;
+	}
+
+	return false;
+}
+
+/* Reset the attr fork to something sane. */
+STATIC void
+xrep_dinode_zap_afork(
+	struct xfs_scrub		*sc,
+	struct xfs_dinode		*dip,
+	struct xrep_dinode_stats	*dis)
+{
+	dip->di_aformat = XFS_DINODE_FMT_EXTENTS;
+	dip->di_anextents = 0;
+	/*
+	 * We leave a nonzero forkoff so that the bmap scrub will look for
+	 * attr rmaps.
+	 */
+	dip->di_forkoff = dis->attr_extents ? 1 : 0;
+}
+
+/*
+ * Zap the data/attr forks if we spot anything that isn't going to pass the
+ * ifork verifiers or the ifork formatters, because we need to get the inode
+ * into good enough shape that the higher level repair functions can run.
+ */
+STATIC void
+xrep_dinode_zap_forks(
+	struct xfs_scrub		*sc,
+	struct xfs_dinode		*dip,
+	struct xrep_dinode_stats	*dis)
+{
+	uint16_t			mode;
+	bool				zap_datafork = false;
+	bool				zap_attrfork = false;
+
+	mode = be16_to_cpu(dip->di_mode);
+
+	/* Inode counters don't make sense? */
+	if (be32_to_cpu(dip->di_nextents) > be64_to_cpu(dip->di_nblocks))
+		zap_datafork = true;
+	if (be16_to_cpu(dip->di_anextents) > be64_to_cpu(dip->di_nblocks))
+		zap_attrfork = true;
+	if (be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
+			be64_to_cpu(dip->di_nblocks))
+		zap_datafork = zap_attrfork = true;
+
+	if (!zap_datafork)
+		zap_datafork = xrep_dinode_check_dfork(sc, dip, mode);
+	if (!zap_attrfork)
+		zap_attrfork = xrep_dinode_check_afork(sc, dip);
+
+	/* Zap whatever's bad. */
+	if (zap_attrfork)
+		xrep_dinode_zap_afork(sc, dip, dis);
+	if (zap_datafork)
+		xrep_dinode_zap_dfork(sc, dip, mode, dis);
+	dip->di_nblocks = 0;
+	if (!zap_attrfork)
+		be64_add_cpu(&dip->di_nblocks, dis->attr_blocks);
+	if (!zap_datafork) {
+		be64_add_cpu(&dip->di_nblocks, dis->data_blocks);
+		be64_add_cpu(&dip->di_nblocks, dis->rt_blocks);
+	}
+}
+
 /* Inode didn't pass verifiers, so fix the raw buffer and retry iget. */
 STATIC int
 xrep_dinode_core(
 	struct xfs_scrub	*sc)
 {
+	struct xrep_dinode_stats	dis = { .sc = sc };
 	struct xfs_imap		imap;
 	struct xfs_buf		*bp;
 	struct xfs_dinode	*dip;
@@ -300,6 +691,11 @@ xrep_dinode_core(
 	bool			inuse;
 	int			error;
 
+	/* Figure out what this inode had mapped in both forks. */
+	error = xrep_dinode_count_rmaps(&dis);
+	if (error)
+		return error;
+
 	/* Map & read inode. */
 	ino = sc->sm->sm_ino;
 	error = xfs_imap(sc->mp, sc->tp, ino, &imap, XFS_IGET_UNTRUSTED);
@@ -326,9 +722,10 @@ xrep_dinode_core(
 	dip = xfs_buf_offset(bp, imap.im_boffset);
 	xrep_dinode_header(sc, dip);
 	xrep_dinode_mode(dip);
-	xrep_dinode_flags(sc, dip);
+	xrep_dinode_flags(sc, dip, dis.rt_extents > 0);
 	xrep_dinode_size(sc->mp, dip);
 	xrep_dinode_extsize_hints(sc, dip);
+	xrep_dinode_zap_forks(sc, dip, &dis);
 
 	/* Write out the inode... */
 	xfs_dinode_calc_crc(sc->mp, dip);

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 09/18] xfs: repair inode block maps
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (7 preceding siblings ...)
  2019-08-05  0:35 ` [PATCH 08/18] xfs: zap broken inode forks Darrick J. Wong
@ 2019-08-05  0:35 ` Darrick J. Wong
  2019-08-05  0:35 ` [PATCH 10/18] xfs: repair damaged symlinks Darrick J. Wong
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:35 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Use the reverse-mapping btree information to rebuild an inode fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile            |    1 
 fs/xfs/scrub/bmap.c        |   22 ++
 fs/xfs/scrub/bmap_repair.c |  501 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h      |    4 
 fs/xfs/scrub/scrub.c       |    4 
 fs/xfs/scrub/trace.h       |    2 
 fs/xfs/xfs_trans.c         |   54 +++++
 fs/xfs/xfs_trans.h         |    2 
 8 files changed, 587 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/scrub/bmap_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a4b0e79ce988..1aa26be0f82e 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -163,6 +163,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   alloc_repair.o \
 				   array.o \
 				   bitmap.o \
+				   bmap_repair.o \
 				   ialloc_repair.o \
 				   inode_repair.o \
 				   refcount_repair.o \
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 1bd29fdc2ab5..fdf7035925d1 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -29,6 +29,7 @@ xchk_setup_inode_bmap(
 	struct xfs_scrub	*sc,
 	struct xfs_inode	*ip)
 {
+	bool			is_repair = false;
 	int			error;
 
 	error = xchk_get_inode(sc, ip);
@@ -38,6 +39,10 @@ xchk_setup_inode_bmap(
 	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 	xfs_ilock(sc->ip, sc->ilock_flags);
 
+#ifdef CONFIG_XFS_REPAIR
+	is_repair = (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR);
+#endif
+
 	/*
 	 * We don't want any ephemeral data fork updates sitting around
 	 * while we inspect block mappings, so wait for directio to finish
@@ -45,10 +50,27 @@ xchk_setup_inode_bmap(
 	 */
 	if (S_ISREG(VFS_I(sc->ip)->i_mode) &&
 	    sc->sm->sm_type == XFS_SCRUB_TYPE_BMBTD) {
+		/* Break all our leases; we're going to mess with things. */
+		if (is_repair) {
+			error = xfs_break_layouts(VFS_I(sc->ip),
+					&sc->ilock_flags, BREAK_UNMAP);
+			if (error)
+				goto out;
+		}
+
 		inode_dio_wait(VFS_I(sc->ip));
 		error = filemap_write_and_wait(VFS_I(sc->ip)->i_mapping);
 		if (error)
 			goto out;
+
+		/* Drop the page cache if we're repairing block mappings. */
+		if (is_repair) {
+			error = invalidate_inode_pages2(
+					VFS_I(sc->ip)->i_mapping);
+			if (error)
+				goto out;
+		}
+
 	}
 
 	/* Got the inode, lock it and we're ready to go. */
diff --git a/fs/xfs/scrub/bmap_repair.c b/fs/xfs/scrub/bmap_repair.c
new file mode 100644
index 000000000000..198bce36163c
--- /dev/null
+++ b/fs/xfs/scrub/bmap_repair.c
@@ -0,0 +1,501 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_alloc.h"
+#include "xfs_rtalloc.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_quota.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+#include "scrub/array.h"
+
+/*
+ * Inode fork block mapping (BMBT) repair.
+ *
+ * Basically, we gather all the rmap records for the inode and fork we're
+ * fixing, reset the incore fork, then re-add all the records.
+ */
+
+/* Smallest possible record to represent a single contiguous physical map. */
+#define XREP_BMAP_UNWRITTEN	((uint64_t)1ULL << 63)
+struct xrep_bmap_extent {
+	/* starting offset; upper bit means unwritten */
+	xfs_fileoff_t	startoff;
+	xfs_fsblock_t	startblock;
+	xfs_filblks_t	blockcount;
+} __packed;
+
+static inline xfs_fileoff_t
+xrep_bmap_startoff(
+	const struct xrep_bmap_extent	*ext)
+{
+	return ext->startoff & ~XREP_BMAP_UNWRITTEN;
+}
+
+struct xrep_bmap {
+	/* List of new bmap records. */
+	struct xfbma		*bmap_records;
+
+	/* Old bmbt blocks */
+	struct xfs_bitmap	*btlist;
+
+	struct xfs_scrub	*sc;
+
+	/* Inode we're fixing. */
+	xfs_ino_t		ino;
+
+	/* How many blocks did we find in the other fork? */
+	xfs_rfsblock_t		otherfork_blocks;
+
+	/* How many bmbt blocks did we find for this fork? */
+	xfs_rfsblock_t		bmbt_blocks;
+
+	/* Which fork are we fixing? */
+	int			whichfork;
+};
+
+/* Record extents that belong to this inode's fork. */
+STATIC int
+xrep_bmap_walk_rmap(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rec,
+	void			*priv)
+{
+	struct xrep_bmap	*rb = priv;
+	struct xrep_bmap_extent	rbe;
+	struct xfs_mount	*mp = cur->bc_mp;
+	xfs_fsblock_t		fsbno;
+	int			error = 0;
+
+	if (xchk_should_terminate(rb->sc, &error))
+		return error;
+
+	/* Skip extents which are not owned by this inode and fork. */
+	if (rec->rm_owner != rb->ino) {
+		return 0;
+	} else if (rb->whichfork == XFS_DATA_FORK &&
+		 (rec->rm_flags & XFS_RMAP_ATTR_FORK)) {
+		rb->otherfork_blocks += rec->rm_blockcount;
+		return 0;
+	} else if (rb->whichfork == XFS_ATTR_FORK &&
+		 !(rec->rm_flags & XFS_RMAP_ATTR_FORK)) {
+		rb->otherfork_blocks += rec->rm_blockcount;
+		return 0;
+	}
+
+	/* Delete the old bmbt blocks later. */
+	if (rec->rm_flags & XFS_RMAP_BMBT_BLOCK) {
+		fsbno = XFS_AGB_TO_FSB(mp, cur->bc_private.a.agno,
+				rec->rm_startblock);
+		rb->bmbt_blocks += rec->rm_blockcount;
+		return xfs_bitmap_set(rb->btlist, fsbno, rec->rm_blockcount);
+	}
+
+	/* Remember this rmap. */
+	rbe.startoff = rec->rm_offset;
+	rbe.startblock = XFS_AGB_TO_FSB(mp, cur->bc_private.a.agno,
+			rec->rm_startblock);
+	rbe.blockcount = rec->rm_blockcount;
+	if (rec->rm_flags & XFS_RMAP_UNWRITTEN)
+		rbe.startoff |= XREP_BMAP_UNWRITTEN;
+	return xfbma_append(rb->bmap_records, &rbe);
+}
+
+/* Compare two bmap extents. */
+static int
+xrep_bmap_extent_cmp(
+	const void			*a,
+	const void			*b)
+{
+	xfs_fileoff_t			ao = xrep_bmap_startoff(a);
+	xfs_fileoff_t			bo = xrep_bmap_startoff(b);
+
+	if (ao > bo)
+		return 1;
+	else if (ao < bo)
+		return -1;
+	return 0;
+}
+
+/* Scan one AG for reverse mappings that we can turn into extent maps. */
+STATIC int
+xrep_bmap_scan_ag(
+	struct xrep_bmap	*rb,
+	xfs_agnumber_t		agno)
+{
+	struct xfs_scrub	*sc = rb->sc;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_buf		*agf_bp = NULL;
+	struct xfs_btree_cur	*cur;
+	int			error;
+
+	error = xfs_alloc_read_agf(mp, sc->tp, agno, 0, &agf_bp);
+	if (error)
+		return error;
+	if (!agf_bp)
+		return -ENOMEM;
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, agno);
+	error = xfs_rmap_query_all(cur, xrep_bmap_walk_rmap, rb);
+	if (error == XFS_BTREE_QUERY_RANGE_ABORT)
+		error = 0;
+	xfs_btree_del_cursor(cur, error);
+	xfs_trans_brelse(sc->tp, agf_bp);
+	return error;
+}
+
+struct xrep_add_bmap {
+	struct xfs_scrub	*sc;
+	int			baseflags;
+};
+
+/* Insert bmap records into an inode fork, given an rmap. */
+STATIC int
+xrep_bmap_insert_rec(
+	const void			*item,
+	void				*priv)
+{
+	const struct xrep_bmap_extent	*rbe = item;
+	struct xfs_bmbt_irec		bmap = {
+		.br_startblock	= rbe->startblock,
+		.br_startoff	= xrep_bmap_startoff(rbe),
+		.br_blockcount	= rbe->blockcount,
+	};
+	struct xrep_add_bmap		*x = priv;
+	xfs_extlen_t			extlen;
+	int				flags = x->baseflags;
+	int				error = 0;
+
+	if (rbe->startoff & XREP_BMAP_UNWRITTEN)
+		flags |= XFS_BMAPI_PREALLOC;
+	while (bmap.br_blockcount > 0) {
+		extlen = min_t(xfs_filblks_t, bmap.br_blockcount, MAXEXTLEN);
+
+		/* Re-add the extent to the fork. */
+		error = xfs_bmapi_remap(x->sc->tp, x->sc->ip, bmap.br_startoff,
+				extlen, bmap.br_startblock, flags);
+		if (error)
+			goto out;
+
+		bmap.br_startblock += extlen;
+		bmap.br_startoff += extlen;
+		bmap.br_blockcount -= extlen;
+
+		error = xfs_defer_finish(&x->sc->tp);
+		if (error)
+			goto out;
+		/* Make sure we roll the transaction. */
+		error = xfs_trans_roll_inode(&x->sc->tp, x->sc->ip);
+		if (error)
+			goto out;
+	}
+
+out:
+	return error;
+}
+
+/* Check for garbage inputs. */
+STATIC int
+xrep_bmap_check_inputs(
+	struct xfs_scrub	*sc,
+	int			whichfork)
+{
+	ASSERT(whichfork == XFS_DATA_FORK || whichfork == XFS_ATTR_FORK);
+
+	/* Don't know how to repair the other fork formats. */
+	if (XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
+	    XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_BTREE)
+		return -EOPNOTSUPP;
+
+	/*
+	 * If there's no attr fork area in the inode, there's no attr fork to
+	 * rebuild.
+	 */
+	if (whichfork == XFS_ATTR_FORK) {
+		if (!XFS_IFORK_Q(sc->ip))
+			return -ENOENT;
+		return 0;
+	}
+
+	/* Only files, symlinks, and directories get to have data forks. */
+	switch (VFS_I(sc->ip)->i_mode & S_IFMT) {
+	case S_IFREG:
+	case S_IFDIR:
+	case S_IFLNK:
+		/* ok */
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/* If we somehow have delalloc extents, forget it. */
+	if (sc->ip->i_delayed_blks)
+		return -EBUSY;
+
+	/* Don't know how to rebuild realtime data forks. */
+	if (XFS_IS_REALTIME_INODE(sc->ip))
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
+/*
+ * Collect block mappings for this fork of this inode and decide if we have
+ * enough space to rebuild.  Caller is responsible for cleaning up the list if
+ * anything goes wrong.
+ */
+STATIC int
+xrep_bmap_find_mappings(
+	struct xfs_scrub	*sc,
+	int			whichfork,
+	struct xfbma		*bmap_records,
+	struct xfs_bitmap	*old_bmbt_blocks,
+	xfs_rfsblock_t		*old_bmbt_block_count,
+	xfs_rfsblock_t		*otherfork_blocks)
+{
+	struct xrep_bmap	rb = {
+		.sc		= sc,
+		.bmap_records	= bmap_records,
+		.btlist		= old_bmbt_blocks,
+		.ino		= sc->ip->i_ino,
+		.whichfork	= whichfork,
+	};
+	xfs_agnumber_t		agno;
+	unsigned int		resblks;
+	int			error;
+
+	/* Iterate the rmaps for extents. */
+	for (agno = 0; agno < sc->mp->m_sb.sb_agcount; agno++) {
+		error = xrep_bmap_scan_ag(&rb, agno);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * Guess how many blocks we're going to need to rebuild an entire bmap
+	 * from the number of extents we found, and pump up our transaction to
+	 * have sufficient block reservation.
+	 */
+	resblks = xfs_bmbt_calc_size(sc->mp, xfbma_length(bmap_records));
+	error = xfs_trans_reserve_more(sc->tp, resblks, 0);
+	if (error)
+		return error;
+
+	*otherfork_blocks = rb.otherfork_blocks;
+	*old_bmbt_block_count = rb.bmbt_blocks;
+	return 0;
+}
+
+/* Update the inode counters. */
+STATIC int
+xrep_bmap_reset_counters(
+	struct xfs_scrub	*sc,
+	xfs_rfsblock_t		old_bmbt_block_count,
+	xfs_rfsblock_t		otherfork_blocks,
+	int			*log_flags)
+{
+	int			error;
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	/*
+	 * We're going to use the bmap routines to reconstruct a fork from rmap
+	 * records.  Those functions increment di_nblocks for us, so we need to
+	 * subtract out all the data and bmbt blocks from the fork we're about
+	 * to rebuild.  otherfork_blocks reflects all the data and bmbt blocks
+	 * for the other fork, so this assignment effectively performs the
+	 * subtraction for us.
+	 */
+	sc->ip->i_d.di_nblocks = otherfork_blocks;
+	*log_flags |= XFS_ILOG_CORE;
+
+	if (!old_bmbt_block_count)
+		return 0;
+
+	/* Release quota counts for the old bmbt blocks. */
+	error = xrep_ino_dqattach(sc);
+	if (error)
+		return error;
+	xfs_trans_mod_dquot_byino(sc->tp, sc->ip, XFS_TRANS_DQ_BCOUNT,
+			-(int64_t)old_bmbt_block_count);
+	return 0;
+}
+
+/* Initialize a new fork and implant it in the inode. */
+STATIC void
+xrep_bmap_reset_fork(
+	struct xfs_scrub	*sc,
+	int			whichfork,
+	bool			no_mappings,
+	int			*log_flags)
+{
+	/* Set us back to extents format with zero records. */
+	XFS_IFORK_FMT_SET(sc->ip, whichfork, XFS_DINODE_FMT_EXTENTS);
+	XFS_IFORK_NEXT_SET(sc->ip, whichfork, 0);
+
+	/* Reinitialize the in-core fork. */
+	if (XFS_IFORK_PTR(sc->ip, whichfork) != NULL)
+		xfs_idestroy_fork(sc->ip, whichfork);
+	if (whichfork == XFS_DATA_FORK) {
+		memset(&sc->ip->i_df, 0, sizeof(struct xfs_ifork));
+		sc->ip->i_df.if_flags |= XFS_IFEXTENTS;
+	} else if (whichfork == XFS_ATTR_FORK) {
+		if (no_mappings) {
+			sc->ip->i_afp = NULL;
+		} else {
+			sc->ip->i_afp = kmem_zone_zalloc(xfs_ifork_zone,
+					KM_SLEEP);
+			sc->ip->i_afp->if_flags |= XFS_IFEXTENTS;
+		}
+	}
+
+	/*
+	 * Now that we've reinitialized the in-memory fork and set the inode
+	 * back to extents format with zero extents, any extents that we
+	 * subsequently map into the file will reinitialize the on-disk fork
+	 * area for us.  All we have to do is log the inode core to preserve
+	 * the format and extent count fields.
+	 */
+	*log_flags |= XFS_ILOG_CORE;
+}
+
+/* Make our changes permanent so that we can start rebuilding the fork. */
+STATIC int
+xrep_bmap_commit_new(
+	struct xfs_scrub	*sc,
+	int			log_flags)
+{
+	xfs_trans_log_inode(sc->tp, sc->ip, log_flags);
+	return xfs_trans_roll_inode(&sc->tp, sc->ip);
+}
+
+/* Build new fork mappings and dispose of the old bmbt blocks. */
+STATIC int
+xrep_bmap_rebuild_tree(
+	struct xfs_scrub	*sc,
+	int			whichfork,
+	struct xfbma		*bmap_records,
+	struct xfs_bitmap	*old_bmbt_blocks)
+{
+	struct xfs_owner_info	oinfo;
+	struct xrep_add_bmap	x = {
+		.sc		= sc,
+		.baseflags	= XFS_BMAPI_NORMAP,
+	};
+	int			error;
+
+	if (whichfork == XFS_ATTR_FORK)
+		x.baseflags |= XFS_BMAPI_ATTRFORK;
+
+	/*
+	 * Sort the bmap extents by file offset to avoid btree splits when we
+	 * rebuild the bmbt.
+	 */
+	error = xfbma_sort(bmap_records, xrep_bmap_extent_cmp);
+	if (error)
+		return error;
+
+	/* Dispose of all the old bmbt blocks. */
+	xfs_rmap_ino_bmbt_owner(&oinfo, sc->ip->i_ino, whichfork);
+	error = xrep_reap_extents(sc, old_bmbt_blocks, &oinfo,
+			XFS_AG_RESV_NONE);
+	if (error)
+		return error;
+
+	/* "Remap" the extents into the fork. */
+	return xfbma_iter_del(bmap_records, xrep_bmap_insert_rec, &x);
+}
+
+/* Repair an inode fork. */
+STATIC int
+xrep_bmap(
+	struct xfs_scrub	*sc,
+	int			whichfork)
+{
+	struct xfs_bitmap	old_bmbt_blocks;
+	struct xfbma		*bmap_records;
+	xfs_rfsblock_t		old_bmbt_block_count;
+	xfs_rfsblock_t		otherfork_blocks;
+	int			log_flags = 0;
+	int			error = 0;
+
+	error = xrep_bmap_check_inputs(sc, whichfork);
+	if (error)
+		return error;
+
+	/* Set up some storage */
+	bmap_records = xfbma_init(sizeof(struct xrep_bmap_extent));
+	if (IS_ERR(bmap_records))
+		return PTR_ERR(bmap_records);
+
+	/* Collect all reverse mappings for this fork's extents. */
+	xfs_bitmap_init(&old_bmbt_blocks);
+	error = xrep_bmap_find_mappings(sc, whichfork, bmap_records,
+			&old_bmbt_blocks, &old_bmbt_block_count,
+			&otherfork_blocks);
+	if (error)
+		goto out;
+
+	/*
+	 * Blow out the in-core fork and zero the on-disk fork.  This is the
+	 * point at which we are no longer able to bail out gracefully.
+	 */
+	error = xrep_bmap_reset_counters(sc, old_bmbt_block_count,
+			otherfork_blocks, &log_flags);
+	if (error)
+		goto out;
+	xrep_bmap_reset_fork(sc, whichfork, xfbma_length(bmap_records) == 0,
+			&log_flags);
+	error = xrep_bmap_commit_new(sc, log_flags);
+	if (error)
+		goto out;
+
+	/* Now rebuild the fork extent map information. */
+	error = xrep_bmap_rebuild_tree(sc, whichfork, bmap_records,
+			&old_bmbt_blocks);
+out:
+	xfs_bitmap_destroy(&old_bmbt_blocks);
+	xfbma_destroy(bmap_records);
+	return error;
+}
+
+/* Repair an inode's data fork. */
+int
+xrep_bmap_data(
+	struct xfs_scrub	*sc)
+{
+	return xrep_bmap(sc, XFS_DATA_FORK);
+}
+
+/* Repair an inode's attr fork. */
+int
+xrep_bmap_attr(
+	struct xfs_scrub	*sc)
+{
+	return xrep_bmap(sc, XFS_ATTR_FORK);
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index dc8e27cf6c1c..79db78d69c7d 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -69,6 +69,8 @@ int xrep_allocbt(struct xfs_scrub *sc);
 int xrep_iallocbt(struct xfs_scrub *sc);
 int xrep_refcountbt(struct xfs_scrub *sc);
 int xrep_inode(struct xfs_scrub *sc);
+int xrep_bmap_data(struct xfs_scrub *sc);
+int xrep_bmap_attr(struct xfs_scrub *sc);
 
 #else
 
@@ -112,6 +114,8 @@ xrep_reset_perag_resv(
 #define xrep_iallocbt			xrep_notsupported
 #define xrep_refcountbt			xrep_notsupported
 #define xrep_inode			xrep_notsupported
+#define xrep_bmap_data			xrep_notsupported
+#define xrep_bmap_attr			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 6de28006290c..66a59c70d743 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -266,13 +266,13 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_inode_bmap,
 		.scrub	= xchk_bmap_data,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_bmap_data,
 	},
 	[XFS_SCRUB_TYPE_BMBTA] = {	/* inode attr fork */
 		.type	= ST_INODE,
 		.setup	= xchk_setup_inode_bmap,
 		.scrub	= xchk_bmap_attr,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_bmap_attr,
 	},
 	[XFS_SCRUB_TYPE_BMBTC] = {	/* inode CoW fork */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index f7e64a5cc751..1124c86b980f 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -725,7 +725,7 @@ DEFINE_EVENT(xrep_rmap_class, name, \
 DEFINE_REPAIR_RMAP_EVENT(xrep_abt_walk_rmap);
 DEFINE_REPAIR_RMAP_EVENT(xrep_ibt_walk_rmap);
 DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn);
-DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn);
+DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_walk_rmap);
 
 TRACE_EVENT(xrep_refcount_extent_fn,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index d42a68d8313b..d25fa31cd475 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -129,6 +129,60 @@ xfs_trans_dup(
 	return ntp;
 }
 
+/*
+ * Try to reserve more blocks for a transaction.  The single use case we
+ * support is for online repair -- use a transaction to gather data without
+ * fear of btree cycle deadlocks; calculate how many blocks we really need
+ * from that data; and only then start modifying data.  This can fail due to
+ * ENOSPC, so we have to be able to cancel the transaction.
+ */
+int
+xfs_trans_reserve_more(
+	struct xfs_trans	*tp,
+	uint			blocks,
+	uint			rtextents)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	bool			rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
+	int			error = 0;
+
+	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
+
+	/*
+	 * Attempt to reserve the needed disk blocks by decrementing
+	 * the number needed from the number available.  This will
+	 * fail if the count would go below zero.
+	 */
+	if (blocks > 0) {
+		error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
+		if (error)
+			return -ENOSPC;
+		tp->t_blk_res += blocks;
+	}
+
+	/*
+	 * Attempt to reserve the needed realtime extents by decrementing
+	 * the number needed from the number available.  This will
+	 * fail if the count would go below zero.
+	 */
+	if (rtextents > 0) {
+		error = xfs_mod_frextents(mp, -((int64_t)rtextents));
+		if (error) {
+			error = -ENOSPC;
+			goto out_blocks;
+		}
+		tp->t_rtx_res += rtextents;
+	}
+
+	return 0;
+out_blocks:
+	if (blocks > 0) {
+		xfs_mod_fdblocks(mp, (int64_t)blocks, rsvd);
+		tp->t_blk_res -= blocks;
+	}
+	return error;
+}
+
 /*
  * This is called to reserve free disk blocks and log space for the
  * given transaction.  This must be done before allocating any resources
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 64d7f171ebd3..982d53eb2853 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -165,6 +165,8 @@ typedef struct xfs_trans {
 int		xfs_trans_alloc(struct xfs_mount *mp, struct xfs_trans_res *resp,
 			uint blocks, uint rtextents, uint flags,
 			struct xfs_trans **tpp);
+int		xfs_trans_reserve_more(struct xfs_trans *tp, uint blocks,
+			uint rtextents);
 int		xfs_trans_alloc_empty(struct xfs_mount *mp,
 			struct xfs_trans **tpp);
 void		xfs_trans_mod_sb(xfs_trans_t *, uint, int64_t);

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 10/18] xfs: repair damaged symlinks
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (8 preceding siblings ...)
  2019-08-05  0:35 ` [PATCH 09/18] xfs: repair inode block maps Darrick J. Wong
@ 2019-08-05  0:35 ` Darrick J. Wong
  2019-08-05  0:35 ` [PATCH 11/18] xfs: create a blob array data structure Darrick J. Wong
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:35 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Repair inconsistent symbolic link data.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
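Note on the salvage policy: whatever bytes can be recovered are kept up to
the first NUL, and a target that comes back empty is replaced with ".".  A
small userspace sketch of that rule follows.  The function name and its
signature are hypothetical; the 1024-byte limit mirrors XFS_SYMLINK_MAXLEN.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mirrors XFS_SYMLINK_MAXLEN; the buffer needs one extra byte for NUL. */
#define SKETCH_SYMLINK_MAXLEN	1024

/*
 * Copy at most @src_len bytes of a possibly damaged target into @buf,
 * which must hold SKETCH_SYMLINK_MAXLEN + 1 bytes.  Copying stops at the
 * first NUL byte.  An empty result becomes ".".  Returns the length of
 * the salvaged target.
 */
static size_t sketch_salvage_target(char *buf, const char *src,
				    size_t src_len)
{
	size_t	len = 0;

	assert(buf != NULL);

	/* Never copy more than the buffer can hold. */
	if (src_len > SKETCH_SYMLINK_MAXLEN)
		src_len = SKETCH_SYMLINK_MAXLEN;

	while (src != NULL && len < src_len && src[len] != '\0') {
		buf[len] = src[len];
		len++;
	}

	/* Ensure we have /some/ contents and a terminating NUL. */
	if (len == 0)
		buf[len++] = '.';
	buf[len] = '\0';
	return len;
}
```

Clamping the length before copying is what keeps a corrupt size field from
running past the end of the salvage buffer.
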
 fs/xfs/Makefile               |    1 
 fs/xfs/scrub/repair.h         |    2 
 fs/xfs/scrub/scrub.c          |    2 
 fs/xfs/scrub/symlink.c        |    5 +
 fs/xfs/scrub/symlink_repair.c |  243 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_symlink.c          |  150 ++++++++++++++-----------
 fs/xfs/xfs_symlink.h          |    3 +
 7 files changed, 338 insertions(+), 68 deletions(-)
 create mode 100644 fs/xfs/scrub/symlink_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 1aa26be0f82e..e8459ab2b28d 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -168,6 +168,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   inode_repair.o \
 				   refcount_repair.o \
 				   repair.o \
+				   symlink_repair.o \
 				   )
 endif
 endif
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 79db78d69c7d..4ff2ef9fc13b 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -71,6 +71,7 @@ int xrep_refcountbt(struct xfs_scrub *sc);
 int xrep_inode(struct xfs_scrub *sc);
 int xrep_bmap_data(struct xfs_scrub *sc);
 int xrep_bmap_attr(struct xfs_scrub *sc);
+int xrep_symlink(struct xfs_scrub *sc);
 
 #else
 
@@ -116,6 +117,7 @@ xrep_reset_perag_resv(
 #define xrep_inode			xrep_notsupported
 #define xrep_bmap_data			xrep_notsupported
 #define xrep_bmap_attr			xrep_notsupported
+#define xrep_symlink			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 66a59c70d743..ea1154aa2225 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -296,7 +296,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_symlink,
 		.scrub	= xchk_symlink,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_symlink,
 	},
 	[XFS_SCRUB_TYPE_PARENT] = {	/* parent pointers */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/symlink.c b/fs/xfs/scrub/symlink.c
index 99c0b1234c3c..7ecf9aa68596 100644
--- a/fs/xfs/scrub/symlink.c
+++ b/fs/xfs/scrub/symlink.c
@@ -21,12 +21,15 @@ xchk_setup_symlink(
 	struct xfs_scrub	*sc,
 	struct xfs_inode	*ip)
 {
+	uint			resblks;
+
 	/* Allocate the buffer without the inode lock held. */
 	sc->buf = kmem_zalloc_large(XFS_SYMLINK_MAXLEN + 1, KM_SLEEP);
 	if (!sc->buf)
 		return -ENOMEM;
 
-	return xchk_setup_inode_contents(sc, ip, 0);
+	resblks = xfs_symlink_blocks(sc->mp, XFS_SYMLINK_MAXLEN);
+	return xchk_setup_inode_contents(sc, ip, resblks);
 }
 
 /* Symbolic links. */
diff --git a/fs/xfs/scrub/symlink_repair.c b/fs/xfs/scrub/symlink_repair.c
new file mode 100644
index 000000000000..8adb3e34d1c1
--- /dev/null
+++ b/fs/xfs/scrub/symlink_repair.c
@@ -0,0 +1,243 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_symlink.h"
+#include "xfs_bmap.h"
+#include "xfs_quota.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Symbolic Link Repair
+ * ====================
+ *
+ * There's not much we can do to repair symbolic links -- we truncate them to
+ * the first NUL byte and reinitialize the target.  Zero-length symlinks are
+ * turned into links to the current dir.
+ */
+
+/* Try to salvage the pathname from rmt blocks. */
+STATIC int
+xrep_symlink_salvage_remote(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
+	struct xfs_inode	*ip = sc->ip;
+	struct xfs_buf		*bp;
+	char			*target_buf = sc->buf;
+	xfs_failaddr_t		fa;
+	xfs_filblks_t		fsblocks;
+	xfs_daddr_t		d;
+	loff_t			len;
+	loff_t			offset;
+	unsigned int		byte_cnt;
+	bool			magic_ok;
+	bool			hdr_ok;
+	int			n;
+	int			nmaps = XFS_SYMLINK_MAPS;
+	int			error;
+
+	/* We'll only read until the buffer is full. */
+	len = min_t(loff_t, ip->i_d.di_size, XFS_SYMLINK_MAXLEN);
+	fsblocks = xfs_symlink_blocks(sc->mp, len);
+	error = xfs_bmapi_read(ip, 0, fsblocks, mval, &nmaps, 0);
+	if (error)
+		return error;
+
+	offset = 0;
+	for (n = 0; n < nmaps; n++) {
+		struct xfs_dsymlink_hdr	*dsl;
+
+		d = XFS_FSB_TO_DADDR(sc->mp, mval[n].br_startblock);
+
+		/* Read the rmt block.  We'll run the verifiers manually. */
+		error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
+				d, XFS_FSB_TO_BB(sc->mp, mval[n].br_blockcount),
+				0, &bp, NULL);
+		if (error)
+			return error;
+		bp->b_ops = &xfs_symlink_buf_ops;
+
+		/* How many bytes do we expect to get out of this buffer? */
+		byte_cnt = XFS_FSB_TO_B(sc->mp, mval[n].br_blockcount);
+		byte_cnt = XFS_SYMLINK_BUF_SPACE(sc->mp, byte_cnt);
+		byte_cnt = min_t(unsigned int, byte_cnt, len);
+
+		/*
+		 * See if the verifiers accept this block.  We're willing to
+		 * salvage if the offset/byte/ino are ok and either the
+		 * verifier passed or the magic is ok.  Anything else and we
+		 * stop dead in our tracks.
+		 */
+		fa = bp->b_ops->verify_struct(bp);
+		dsl = bp->b_addr;
+		magic_ok = dsl->sl_magic == cpu_to_be32(XFS_SYMLINK_MAGIC);
+		hdr_ok = xfs_symlink_hdr_ok(ip->i_ino, offset, byte_cnt, bp);
+		if (!hdr_ok || (fa != NULL && !magic_ok))
+			break;
+
+		memcpy(target_buf + offset, dsl + 1, byte_cnt);
+
+		len -= byte_cnt;
+		offset += byte_cnt;
+	}
+
+	/* Ensure we have a zero at the end, and /some/ contents. */
+	if (offset == 0)
+		sprintf(target_buf, ".");
+	else
+		target_buf[offset] = 0;
+	return 0;
+}
+
+/*
+ * Try to salvage an inline symlink's contents.  Empty symlinks become a link
+ * to the current directory.
+ */
+STATIC void
+xrep_symlink_salvage_inline(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_inode	*ip = sc->ip;
+	struct xfs_ifork	*ifp;
+
+	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	if (ifp->if_u1.if_data)
+		strncpy(sc->buf, ifp->if_u1.if_data, XFS_IFORK_DSIZE(ip));
+	if (strlen(sc->buf) == 0)
+		sprintf(sc->buf, ".");
+}
+
+/* Reset an inline symlink to its fresh configuration. */
+STATIC void
+xrep_symlink_truncate_inline(
+	struct xfs_inode	*ip)
+{
+	xfs_idestroy_fork(ip, XFS_DATA_FORK);
+	ip->i_d.di_format = XFS_DINODE_FMT_EXTENTS;
+	ip->i_d.di_nextents = 0;
+	memset(&ip->i_df, 0, sizeof(struct xfs_ifork));
+	ip->i_df.if_flags |= XFS_IFEXTENTS;
+}
+
+/*
+ * Salvage an inline symlink's contents and reset data fork.
+ * Returns with the inode joined to the transaction.
+ */
+STATIC int
+xrep_symlink_inline(
+	struct xfs_scrub	*sc)
+{
+	/* Salvage whatever link target information we can find. */
+	xrep_symlink_salvage_inline(sc);
+
+	/* Truncate the symlink. */
+	xrep_symlink_truncate_inline(sc->ip);
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+	return 0;
+}
+
+/*
+ * Salvage a remote symlink's contents and truncate its data fork.
+ * Returns with the inode joined to the transaction.
+ */
+STATIC int
+xrep_symlink_remote(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	/* Salvage whatever link target information we can find. */
+	error = xrep_symlink_salvage_remote(sc);
+	if (error)
+		return error;
+
+	/* Truncate the symlink. */
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+	return xfs_itruncate_extents(&sc->tp, sc->ip, XFS_DATA_FORK, 0);
+}
+
+/*
+ * Reinitialize a link target.  Caller must ensure the inode is joined to
+ * the transaction.
+ */
+STATIC int
+xrep_symlink_reinitialize(
+	struct xfs_scrub	*sc)
+{
+	xfs_fsblock_t		fs_blocks;
+	unsigned int		target_len;
+	uint			resblks;
+	int			error;
+
+	/* How many blocks do we need? */
+	target_len = strlen(sc->buf);
+	ASSERT(target_len != 0);
+	if (target_len == 0 || target_len > XFS_SYMLINK_MAXLEN)
+		return -EFSCORRUPTED;
+
+	/* Reserve quota for the new target, then write it back out. */
+	fs_blocks = xfs_symlink_blocks(sc->mp, target_len);
+	resblks = XFS_SYMLINK_SPACE_RES(sc->mp, target_len, fs_blocks);
+	error = xfs_trans_reserve_quota_nblks(sc->tp, sc->ip, resblks, 0,
+			XFS_QMOPT_RES_REGBLKS);
+	if (error)
+		return error;
+	error = xfs_symlink_write_target(sc->tp, sc->ip, sc->buf, target_len,
+			fs_blocks, resblks);
+	if (error)
+		return error;
+
+	/* Finish up any block mapping activities. */
+	return xfs_defer_finish(&sc->tp);
+}
+
+/* Repair a symbolic link. */
+int
+xrep_symlink(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_ifork	*ifp;
+	int			error;
+
+	error = xfs_qm_dqattach_locked(sc->ip, false);
+	if (error)
+		return error;
+
+	/* Salvage whatever we can of the target. */
+	*((char *)sc->buf) = 0;
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
+	if (ifp->if_flags & XFS_IFINLINE)
+		error = xrep_symlink_inline(sc);
+	else
+		error = xrep_symlink_remote(sc);
+	if (error)
+		return error;
+
+	/* Now reset the target. */
+	return xrep_symlink_reinitialize(sc);
+}
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index ed66fd2de327..d48f41e77c4e 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -141,6 +141,86 @@ xfs_readlink(
 	return error;
 }
 
+/* Write the symlink target into the inode. */
+int
+xfs_symlink_write_target(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	const char		*target_path,
+	int			pathlen,
+	xfs_fsblock_t		fs_blocks,
+	uint			resblks)
+{
+	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
+	struct xfs_mount	*mp = tp->t_mountp;
+	const char		*cur_chunk;
+	struct xfs_buf		*bp;
+	xfs_daddr_t		d;
+	int			byte_cnt;
+	int			nmaps;
+	int			offset;
+	int			n;
+	int			error;
+
+	/*
+	 * If the symlink will fit into the inode, write it inline.
+	 */
+	if (pathlen <= XFS_IFORK_DSIZE(ip)) {
+		xfs_init_local_fork(ip, XFS_DATA_FORK, target_path, pathlen);
+
+		ip->i_d.di_size = pathlen;
+		ip->i_d.di_format = XFS_DINODE_FMT_LOCAL;
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
+
+		return 0;
+	}
+
+	/* Write target to remote blocks. */
+	nmaps = XFS_SYMLINK_MAPS;
+	error = xfs_bmapi_write(tp, ip, 0, fs_blocks, XFS_BMAPI_METADATA,
+			resblks, mval, &nmaps);
+	if (error)
+		return error;
+
+	if (resblks)
+		resblks -= fs_blocks;
+	ip->i_d.di_size = pathlen;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+
+	cur_chunk = target_path;
+	offset = 0;
+	for (n = 0; n < nmaps; n++) {
+		char	*buf;
+
+		d = XFS_FSB_TO_DADDR(mp, mval[n].br_startblock);
+		byte_cnt = XFS_FSB_TO_B(mp, mval[n].br_blockcount);
+		bp = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
+				BTOBB(byte_cnt), 0);
+		if (!bp)
+			return -ENOMEM;
+		bp->b_ops = &xfs_symlink_buf_ops;
+
+		byte_cnt = XFS_SYMLINK_BUF_SPACE(mp, byte_cnt);
+		byte_cnt = min(byte_cnt, pathlen);
+
+		buf = bp->b_addr;
+		buf += xfs_symlink_hdr_set(mp, ip->i_ino, offset,
+					   byte_cnt, bp);
+
+		memcpy(buf, cur_chunk, byte_cnt);
+
+		cur_chunk += byte_cnt;
+		pathlen -= byte_cnt;
+		offset += byte_cnt;
+
+		xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SYMLINK_BUF);
+		xfs_trans_log_buf(tp, bp, 0, (buf + byte_cnt - 1) -
+						(char *)bp->b_addr);
+	}
+	ASSERT(pathlen == 0);
+	return 0;
+}
+
 int
 xfs_symlink(
 	struct xfs_inode	*dp,
@@ -155,15 +235,7 @@ xfs_symlink(
 	int			error = 0;
 	int			pathlen;
 	bool                    unlock_dp_on_error = false;
-	xfs_fileoff_t		first_fsb;
 	xfs_filblks_t		fs_blocks;
-	int			nmaps;
-	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
-	xfs_daddr_t		d;
-	const char		*cur_chunk;
-	int			byte_cnt;
-	int			n;
-	xfs_buf_t		*bp;
 	prid_t			prid;
 	struct xfs_dquot	*udqp = NULL;
 	struct xfs_dquot	*gdqp = NULL;
@@ -257,65 +329,11 @@ xfs_symlink(
 
 	if (resblks)
 		resblks -= XFS_IALLOC_SPACE_RES(mp);
-	/*
-	 * If the symlink will fit into the inode, write it inline.
-	 */
-	if (pathlen <= XFS_IFORK_DSIZE(ip)) {
-		xfs_init_local_fork(ip, XFS_DATA_FORK, target_path, pathlen);
-
-		ip->i_d.di_size = pathlen;
-		ip->i_d.di_format = XFS_DINODE_FMT_LOCAL;
-		xfs_trans_log_inode(tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
-	} else {
-		int	offset;
-
-		first_fsb = 0;
-		nmaps = XFS_SYMLINK_MAPS;
-
-		error = xfs_bmapi_write(tp, ip, first_fsb, fs_blocks,
-				  XFS_BMAPI_METADATA, resblks, mval, &nmaps);
-		if (error)
-			goto out_trans_cancel;
-
-		if (resblks)
-			resblks -= fs_blocks;
-		ip->i_d.di_size = pathlen;
-		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-
-		cur_chunk = target_path;
-		offset = 0;
-		for (n = 0; n < nmaps; n++) {
-			char	*buf;
-
-			d = XFS_FSB_TO_DADDR(mp, mval[n].br_startblock);
-			byte_cnt = XFS_FSB_TO_B(mp, mval[n].br_blockcount);
-			bp = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
-					       BTOBB(byte_cnt), 0);
-			if (!bp) {
-				error = -ENOMEM;
-				goto out_trans_cancel;
-			}
-			bp->b_ops = &xfs_symlink_buf_ops;
-
-			byte_cnt = XFS_SYMLINK_BUF_SPACE(mp, byte_cnt);
-			byte_cnt = min(byte_cnt, pathlen);
-
-			buf = bp->b_addr;
-			buf += xfs_symlink_hdr_set(mp, ip->i_ino, offset,
-						   byte_cnt, bp);
-
-			memcpy(buf, cur_chunk, byte_cnt);
 
-			cur_chunk += byte_cnt;
-			pathlen -= byte_cnt;
-			offset += byte_cnt;
-
-			xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SYMLINK_BUF);
-			xfs_trans_log_buf(tp, bp, 0, (buf + byte_cnt - 1) -
-							(char *)bp->b_addr);
-		}
-		ASSERT(pathlen == 0);
-	}
+	error = xfs_symlink_write_target(tp, ip, target_path, pathlen,
+			fs_blocks, resblks);
+	if (error)
+		goto out_trans_cancel;
 
 	/*
 	 * Create the directory entry for the symlink.
diff --git a/fs/xfs/xfs_symlink.h b/fs/xfs/xfs_symlink.h
index 9743d8c9394b..d7252f9cab41 100644
--- a/fs/xfs/xfs_symlink.h
+++ b/fs/xfs/xfs_symlink.h
@@ -12,5 +12,8 @@ int xfs_symlink(struct xfs_inode *dp, struct xfs_name *link_name,
 int xfs_readlink_bmap_ilocked(struct xfs_inode *ip, char *link);
 int xfs_readlink(struct xfs_inode *ip, char *link);
 int xfs_inactive_symlink(struct xfs_inode *ip);
+int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip,
+		const char *target_path, int pathlen, xfs_fsblock_t fs_blocks,
+		uint resblks);
 
 #endif /* __XFS_SYMLINK_H */
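
The copy loop in xfs_symlink_write_target splits the target path across
the mapped buffers, putting a small header in front of each chunk and
advancing a running offset.  A minimal userspace sketch of that pattern
follows.  It assumes fixed-size blocks rather than extent mappings, and
struct chunk_hdr and write_target_chunks are hypothetical names, not the
XFS on-disk symlink header or any kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-block header; the real XFS symlink header differs. */
struct chunk_hdr {
	unsigned int	offset;	/* byte offset of this chunk in the target */
	unsigned int	bytes;	/* payload bytes stored in this block */
};

/*
 * Copy pathlen bytes of path into consecutive blocks of blksz bytes,
 * writing a header at the front of every block that is used.  Returns
 * the number of blocks used, or -1 if the path does not fit.
 */
static int
write_target_chunks(
	char		*blocks,
	size_t		nblocks,
	size_t		blksz,
	const char	*path,
	size_t		pathlen)
{
	const char	*cur_chunk = path;
	size_t		offset = 0;
	size_t		n;

	if (blksz <= sizeof(struct chunk_hdr))
		return -1;

	for (n = 0; n < nblocks && pathlen > 0; n++) {
		struct chunk_hdr	hdr;
		char			*buf = blocks + n * blksz;
		size_t			byte_cnt = blksz - sizeof(hdr);

		/* The last chunk is usually shorter than a full block. */
		if (byte_cnt > pathlen)
			byte_cnt = pathlen;

		hdr.offset = (unsigned int)offset;
		hdr.bytes = (unsigned int)byte_cnt;
		memcpy(buf, &hdr, sizeof(hdr));
		memcpy(buf + sizeof(hdr), cur_chunk, byte_cnt);

		cur_chunk += byte_cnt;
		pathlen -= byte_cnt;
		offset += byte_cnt;
	}

	/* Mirrors the ASSERT(pathlen == 0) at the end of the kernel loop. */
	return pathlen == 0 ? (int)n : -1;
}
```

The sketch reports leftover bytes as an error instead of asserting, which
seemed the friendlier choice for a standalone example.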


* [PATCH 11/18] xfs: create a blob array data structure
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (9 preceding siblings ...)
  2019-08-05  0:35 ` [PATCH 10/18] xfs: repair damaged symlinks Darrick J. Wong
@ 2019-08-05  0:35 ` Darrick J. Wong
  2019-08-05  0:36 ` [PATCH 12/18] xfs: convert xfs_itruncate_extents_flags to use __xfs_bunmapi Darrick J. Wong
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:35 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a simple 'blob array' data structure for storage of arbitrarily
sized metadata objects that will be used to reconstruct metadata.  For
the intended usage (temporarily storing extended attribute names and
values) we only have to support storing objects and retrieving them.

This initial implementation uses linked lists to store the blobs, but a
subsequent patch will restructure the backend to avoid using high order
pinned kernel memory.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile     |    1 
 fs/xfs/scrub/blob.c |  121 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/blob.h |   23 ++++++++++
 3 files changed, 145 insertions(+)
 create mode 100644 fs/xfs/scrub/blob.c
 create mode 100644 fs/xfs/scrub/blob.h
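
For readers who want to experiment with the idea outside the kernel, here
is a minimal userspace sketch of a list-backed blob store that hands out
the header pointer as an opaque cookie.  It assumes malloc/free instead of
kmem_alloc/kmem_free and a hand-rolled circular list instead of list_head.
The blob_* names and the EINVAL return for a bad magic are hypothetical
choices for this sketch, not the interface added by this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOB_MAGIC	0xABAADDADu

/* Header placed directly in front of each stored blob. */
struct blob_key {
	struct blob_key	*prev;
	struct blob_key	*next;
	uint32_t	magic;
	uint32_t	size;
	/* blob bytes follow the header */
};

struct blob_store {
	struct blob_key	head;	/* sentinel of a circular list */
};

typedef void *blob_cookie;

static void
blob_store_init(struct blob_store *bs)
{
	bs->head.prev = bs->head.next = &bs->head;
}

/* Store a blob at the tail of the list; the header pointer is the cookie. */
static int
blob_put(struct blob_store *bs, blob_cookie *cookie, const void *ptr,
	 uint32_t size)
{
	struct blob_key	*key = malloc(sizeof(*key) + size);

	if (!key)
		return -ENOMEM;
	key->magic = BLOB_MAGIC;
	key->size = size;
	memcpy(key + 1, ptr, size);

	key->prev = bs->head.prev;
	key->next = &bs->head;
	bs->head.prev->next = key;
	bs->head.prev = key;
	*cookie = key;
	return 0;
}

/* Copy a blob out; the caller's buffer must be at least as big as it. */
static int
blob_get(blob_cookie cookie, void *ptr, uint32_t size)
{
	struct blob_key	*key = cookie;

	if (key->magic != BLOB_MAGIC)
		return -EINVAL;
	if (size < key->size)
		return -EFBIG;
	memcpy(ptr, key + 1, key->size);
	return 0;
}

/* Free every stored blob and leave the store empty. */
static void
blob_store_destroy(struct blob_store *bs)
{
	struct blob_key	*key = bs->head.next;
	struct blob_key	*n;

	while (key != &bs->head) {
		n = key->next;
		free(key);
		key = n;
	}
	blob_store_init(bs);
}
```

The magic number only catches a cookie that points at something that was
never a blob header; it is a sanity check, not protection against a stale
pointer.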


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index e8459ab2b28d..fecde2c9d2de 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -163,6 +163,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   alloc_repair.o \
 				   array.o \
 				   bitmap.o \
+				   blob.o \
 				   bmap_repair.o \
 				   ialloc_repair.o \
 				   inode_repair.o \
diff --git a/fs/xfs/scrub/blob.c b/fs/xfs/scrub/blob.c
new file mode 100644
index 000000000000..4928f0985d49
--- /dev/null
+++ b/fs/xfs/scrub/blob.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "scrub/array.h"
+#include "scrub/blob.h"
+
+/*
+ * XFS Blob Storage
+ * ================
+ * Stores and retrieves blobs using a list.  Objects are appended to
+ * the list and the pointer is returned as a magic cookie for retrieval.
+ */
+
+#define XB_KEY_MAGIC	0xABAADDAD
+struct xb_key {
+	struct list_head	list;
+	uint32_t		magic;
+	uint32_t		size;
+	/* blob comes after here */
+} __packed;
+
+#define XB_KEY_SIZE(sz)	(sizeof(struct xb_key) + (sz))
+
+/* Initialize a blob storage object. */
+struct xblob *
+xblob_init(void)
+{
+	struct xblob	*blob;
+	int		error;
+
+	error = -ENOMEM;
+	blob = kmem_alloc(sizeof(struct xblob), KM_NOFS | KM_MAYFAIL);
+	if (!blob)
+		return ERR_PTR(error);
+
+	INIT_LIST_HEAD(&blob->list);
+	return blob;
+}
+
+/* Destroy a blob storage object. */
+void
+xblob_destroy(
+	struct xblob	*blob)
+{
+	struct xb_key	*key, *n;
+
+	list_for_each_entry_safe(key, n, &blob->list, list) {
+		list_del(&key->list);
+		kmem_free(key);
+	}
+	kmem_free(blob);
+}
+
+/* Retrieve a blob. */
+int
+xblob_get(
+	struct xblob	*blob,
+	xblob_cookie	cookie,
+	void		*ptr,
+	uint32_t	size)
+{
+	struct xb_key	*key = (struct xb_key *)cookie;
+
+	if (key->magic != XB_KEY_MAGIC) {
+		ASSERT(0);
+		return -ENODATA;
+	}
+	if (size < key->size) {
+		ASSERT(0);
+		return -EFBIG;
+	}
+
+	memcpy(ptr, key + 1, key->size);
+	return 0;
+}
+
+/* Store a blob. */
+int
+xblob_put(
+	struct xblob	*blob,
+	xblob_cookie	*cookie,
+	void		*ptr,
+	uint32_t	size)
+{
+	struct xb_key	*key;
+
+	key = kmem_alloc(XB_KEY_SIZE(size), KM_NOFS | KM_MAYFAIL);
+	if (!key)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&key->list);
+	list_add_tail(&key->list, &blob->list);
+	key->magic = XB_KEY_MAGIC;
+	key->size = size;
+	memcpy(key + 1, ptr, size);
+	*cookie = (xblob_cookie)key;
+	return 0;
+}
+
+/* Free a blob. */
+int
+xblob_free(
+	struct xblob	*blob,
+	xblob_cookie	cookie)
+{
+	struct xb_key	*key = (struct xb_key *)cookie;
+
+	if (key->magic != XB_KEY_MAGIC) {
+		ASSERT(0);
+		return -ENODATA;
+	}
+	key->magic = 0;
+	list_del(&key->list);
+	kmem_free(key);
+	return 0;
+}
diff --git a/fs/xfs/scrub/blob.h b/fs/xfs/scrub/blob.h
new file mode 100644
index 000000000000..2595a15f78ac
--- /dev/null
+++ b/fs/xfs/scrub/blob.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#ifndef __XFS_SCRUB_BLOB_H__
+#define __XFS_SCRUB_BLOB_H__
+
+struct xblob {
+	struct list_head	list;
+};
+
+typedef void			*xblob_cookie;
+
+struct xblob *xblob_init(void);
+void xblob_destroy(struct xblob *blob);
+int xblob_get(struct xblob *blob, xblob_cookie cookie, void *ptr,
+		uint32_t size);
+int xblob_put(struct xblob *blob, xblob_cookie *cookie, void *ptr,
+		uint32_t size);
+int xblob_free(struct xblob *blob, xblob_cookie cookie);
+
+#endif /* __XFS_SCRUB_BLOB_H__ */


* [PATCH 12/18] xfs: convert xfs_itruncate_extents_flags to use __xfs_bunmapi
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (10 preceding siblings ...)
  2019-08-05  0:35 ` [PATCH 11/18] xfs: create a blob array data structure Darrick J. Wong
@ 2019-08-05  0:36 ` Darrick J. Wong
  2019-08-05  0:36 ` [PATCH 13/18] xfs: remove unnecessary inode-transaction roll Darrick J. Wong
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:36 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

There's no reason why we can't consume unmap_len directly, so call the
raw __xfs_bunmapi and loop until unmap_len reaches zero instead of
tracking a separate 'done' flag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)
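
The loop shape this patch moves to can be sketched in userspace as
follows: the callee takes the remaining length by pointer, does a bounded
amount of work, and writes back what is left, so the caller needs no
separate completion flag.  unmap_some and unmap_range are hypothetical
stand-ins, and the "units" here are abstract rather than extents or
blocks.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the raw unmap call: remove at most max_units
 * from the range and subtract what was removed from *remaining.
 */
static int
unmap_some(
	unsigned long long	*remaining,
	unsigned int		max_units)
{
	unsigned long long	done;

	if (max_units == 0)
		return -1;	/* no forward progress is possible */

	done = *remaining < max_units ? *remaining : max_units;
	*remaining -= done;
	return 0;
}

/* Unmap the whole range; returns the pass count or a negative error. */
static int
unmap_range(
	unsigned long long	len,
	unsigned int		max_units)
{
	int			passes = 0;
	int			error;

	while (len > 0) {
		error = unmap_some(&len, max_units);
		if (error)
			return error;
		/* The kernel loop finishes deferred work at this point. */
		passes++;
	}
	return passes;
}
```

Bounding the work per pass is what keeps each iteration small enough to
fit in one transaction in the real code; the sketch only models the
length bookkeeping.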


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 6467d5e1df2d..5fa9e49ccb87 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1516,7 +1516,6 @@ xfs_itruncate_extents_flags(
 	xfs_fileoff_t		last_block;
 	xfs_filblks_t		unmap_len;
 	int			error = 0;
-	int			done = 0;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(!atomic_read(&VFS_I(ip)->i_count) ||
@@ -1547,10 +1546,10 @@ xfs_itruncate_extents_flags(
 
 	ASSERT(first_unmap_block < last_block);
 	unmap_len = last_block - first_unmap_block + 1;
-	while (!done) {
+	while (unmap_len > 0) {
 		ASSERT(tp->t_firstblock == NULLFSBLOCK);
-		error = xfs_bunmapi(tp, ip, first_unmap_block, unmap_len, flags,
-				    XFS_ITRUNC_MAX_EXTENTS, &done);
+		error = __xfs_bunmapi(tp, ip, first_unmap_block, &unmap_len,
+				flags, XFS_ITRUNC_MAX_EXTENTS);
 		if (error)
 			goto out;
 


* [PATCH 13/18] xfs: remove unnecessary inode-transaction roll
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (11 preceding siblings ...)
  2019-08-05  0:36 ` [PATCH 12/18] xfs: convert xfs_itruncate_extents_flags to use __xfs_bunmapi Darrick J. Wong
@ 2019-08-05  0:36 ` Darrick J. Wong
  2019-08-05  0:36 ` [PATCH 14/18] xfs: create a new inode fork block unmap helper Darrick J. Wong
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:36 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Remove the transaction roll at the end of the loop in
xfs_itruncate_extents_flags.  xfs_defer_finish takes care of rolling the
transaction as needed and reattaching the inode, which means we already
start each loop with a clean transaction.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c |    4 ----
 1 file changed, 4 deletions(-)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 5fa9e49ccb87..acb9335e6306 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1560,10 +1560,6 @@ xfs_itruncate_extents_flags(
 		error = xfs_defer_finish(&tp);
 		if (error)
 			goto out;
-
-		error = xfs_trans_roll_inode(&tp, ip);
-		if (error)
-			goto out;
 	}
 
 	if (whichfork == XFS_DATA_FORK) {


* [PATCH 14/18] xfs: create a new inode fork block unmap helper
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (12 preceding siblings ...)
  2019-08-05  0:36 ` [PATCH 13/18] xfs: remove unnecessary inode-transaction roll Darrick J. Wong
@ 2019-08-05  0:36 ` Darrick J. Wong
  2019-08-05  0:36 ` [PATCH 15/18] xfs: repair extended attributes Darrick J. Wong
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:36 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a new helper, xfs_bunmapi_range, to unmap a range of blocks from
an inode's fork, and convert xfs_itruncate_extents_flags to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   43 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_bmap.h |    3 +++
 fs/xfs/xfs_inode.c       |   29 ++++-------------------------
 3 files changed, 50 insertions(+), 25 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 39dbc93374dc..5be389bfaf3f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6227,3 +6227,46 @@ xfs_bmap_validate_extent(
 	return xfs_bmap_validate_extent_raw(ip->i_mount,
 			XFS_IS_REALTIME_INODE(ip), whichfork, irec);
 }
+
+/*
+ * Used in xfs_itruncate_extents().  This is the maximum number of extents
+ * freed from a file in a single transaction.
+ */
+#define	XFS_ITRUNC_MAX_EXTENTS	2
+
+/*
+ * Unmap every extent in part of an inode's fork.  We don't do any higher level
+ * invalidation work at all.
+ */
+int
+xfs_bunmapi_range(
+	struct xfs_trans	**tpp,
+	struct xfs_inode	*ip,
+	int			whichfork,
+	xfs_fileoff_t		startoff,
+	xfs_filblks_t		unmap_len,
+	int			bunmapi_flags)
+{
+	int			error = 0;
+
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+
+	bunmapi_flags |= xfs_bmapi_aflag(whichfork);
+	while (unmap_len > 0) {
+		ASSERT((*tpp)->t_firstblock == NULLFSBLOCK);
+		error = __xfs_bunmapi(*tpp, ip, startoff, &unmap_len,
+				bunmapi_flags, XFS_ITRUNC_MAX_EXTENTS);
+		if (error)
+			goto out;
+
+		/*
+		 * Duplicate the transaction that has the permanent
+		 * reservation and commit the old transaction.
+		 */
+		error = xfs_defer_finish(tpp);
+		if (error)
+			goto out;
+	}
+out:
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index b857762fac55..cf460d3654b5 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -279,5 +279,8 @@ xfs_failaddr_t xfs_bmap_validate_extent(struct xfs_inode *ip, int whichfork,
 int	xfs_bmapi_remap(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t bno, xfs_filblks_t len, xfs_fsblock_t startblock,
 		int flags);
+int	xfs_bunmapi_range(struct xfs_trans **tpp, struct xfs_inode *ip,
+		int whichfork, xfs_fileoff_t startoff, xfs_filblks_t unmap_len,
+		int bunmapi_flags);
 
 #endif	/* __XFS_BMAP_H__ */
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index acb9335e6306..4a7b0ea22fa3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -38,12 +38,6 @@
 
 kmem_zone_t *xfs_inode_zone;
 
-/*
- * Used in xfs_itruncate_extents().  This is the maximum number of extents
- * freed from a file in a single transaction.
- */
-#define	XFS_ITRUNC_MAX_EXTENTS	2
-
 STATIC int xfs_iflush_int(struct xfs_inode *, struct xfs_buf *);
 STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);
 STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *);
@@ -1514,7 +1508,6 @@ xfs_itruncate_extents_flags(
 	struct xfs_trans	*tp = *tpp;
 	xfs_fileoff_t		first_unmap_block;
 	xfs_fileoff_t		last_block;
-	xfs_filblks_t		unmap_len;
 	int			error = 0;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
@@ -1528,8 +1521,6 @@ xfs_itruncate_extents_flags(
 
 	trace_xfs_itruncate_extents_start(ip, new_size);
 
-	flags |= xfs_bmapi_aflag(whichfork);
-
 	/*
 	 * Since it is possible for space to become allocated beyond
 	 * the end of the file (in a crash where the space is allocated
@@ -1545,22 +1536,10 @@ xfs_itruncate_extents_flags(
 		return 0;
 
 	ASSERT(first_unmap_block < last_block);
-	unmap_len = last_block - first_unmap_block + 1;
-	while (unmap_len > 0) {
-		ASSERT(tp->t_firstblock == NULLFSBLOCK);
-		error = __xfs_bunmapi(tp, ip, first_unmap_block, &unmap_len,
-				flags, XFS_ITRUNC_MAX_EXTENTS);
-		if (error)
-			goto out;
-
-		/*
-		 * Duplicate the transaction that has the permanent
-		 * reservation and commit the old transaction.
-		 */
-		error = xfs_defer_finish(&tp);
-		if (error)
-			goto out;
-	}
+	error = xfs_bunmapi_range(&tp, ip, whichfork, first_unmap_block,
+			last_block - first_unmap_block + 1, flags);
+	if (error)
+		goto out;
 
 	if (whichfork == XFS_DATA_FORK) {
 		/* Remove all pending CoW reservations. */


* [PATCH 15/18] xfs: repair extended attributes
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (13 preceding siblings ...)
  2019-08-05  0:36 ` [PATCH 14/18] xfs: create a new inode fork block unmap helper Darrick J. Wong
@ 2019-08-05  0:36 ` Darrick J. Wong
  2019-08-05  0:36 ` [PATCH 16/18] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:36 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

If the extended attributes look bad, try to sift through the rubble to
find whatever keys/values we can, zap the attr tree, and re-add the
values.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile            |    1 
 fs/xfs/scrub/attr.c        |   10 -
 fs/xfs/scrub/attr.h        |   10 +
 fs/xfs/scrub/attr_repair.c |  728 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h      |    2 
 fs/xfs/scrub/scrub.c       |    2 
 fs/xfs/scrub/scrub.h       |    3 
 7 files changed, 753 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/scrub/attr_repair.c
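
The salvage step boils down to walking a possibly corrupt block, bounds
checking every candidate record against the end of the buffer, and keeping
only the records that look sane.  A minimal userspace sketch of that
filter follows.  The three-byte record header (namelen, valuelen, flags),
REC_INCOMPLETE and salvage_block are all hypothetical and much simpler
than the real attr leaf format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REC_HDR_BYTES	3	/* namelen, valuelen, flags */
#define REC_INCOMPLETE	0x80	/* hypothetical "half-written" flag */

struct salvaged {
	const uint8_t	*name;
	const uint8_t	*value;
	uint8_t		namelen;
	uint8_t		valuelen;
};

/*
 * Examine the record at each offset in a block and save the ones that
 * are complete, named, and entirely inside the buffer.  Returns the
 * number of records written to out, which must have room for nr entries.
 */
static size_t
salvage_block(
	const uint8_t	*blk,
	size_t		blksz,
	const uint16_t	*offsets,
	size_t		nr,
	struct salvaged	*out)
{
	const uint8_t	*buf_end = blk + blksz;
	size_t		found = 0;
	size_t		i;

	for (i = 0; i < nr; i++) {
		const uint8_t	*rec;
		uint8_t		namelen, valuelen, flags;

		/* The header itself must fit inside the block. */
		if ((size_t)offsets[i] + REC_HDR_BYTES > blksz)
			continue;
		rec = blk + offsets[i];
		namelen = rec[0];
		valuelen = rec[1];
		flags = rec[2];

		/* Junk anything incomplete, unnamed, or overrunning. */
		if (flags & REC_INCOMPLETE)
			continue;
		if (namelen == 0)
			continue;
		if ((size_t)(buf_end - rec) <
		    (size_t)REC_HDR_BYTES + namelen + valuelen)
			continue;

		out[found].name = rec + REC_HDR_BYTES;
		out[found].value = rec + REC_HDR_BYTES + namelen;
		out[found].namelen = namelen;
		out[found].valuelen = valuelen;
		found++;
	}
	return found;
}
```

A bad record is skipped rather than treated as a fatal error, which is the
same spirit as the salvage helpers returning 0 when they junk an attribute.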


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index fecde2c9d2de..270a3f41fb30 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -162,6 +162,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
 				   alloc_repair.o \
 				   array.o \
+				   attr_repair.o \
 				   bitmap.o \
 				   blob.o \
 				   bmap_repair.o \
diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 1afc58bf71dd..67f42b666aad 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -38,9 +38,15 @@ xchk_setup_xattr_buf(
 	 * We need enough space to read an xattr value from the file or enough
 	 * space to hold three copies of the xattr free space bitmap.  We don't
 	 * need the buffer space for both purposes at the same time.
+	 *
+	 * If we're doing a repair, we need enough space to hold the largest
+	 * xattr value and the largest xattr name.
 	 */
 	sz = 3 * sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize);
-	sz = max_t(size_t, sz, value_size);
+	if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)
+		sz = max_t(size_t, sz, value_size + XATTR_NAME_MAX + 1);
+	else
+		sz = max_t(size_t, sz, value_size);
 
 	/*
 	 * If there's already a buffer, figure out if we need to reallocate it
@@ -184,7 +190,7 @@ xchk_xattr_listent(
  * Within a char, the lowest bit of the char represents the byte with
  * the smallest address
  */
-STATIC bool
+bool
 xchk_xattr_set_map(
 	struct xfs_scrub	*sc,
 	unsigned long		*map,
diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h
index 13a1d2e8424d..b2d758953300 100644
--- a/fs/xfs/scrub/attr.h
+++ b/fs/xfs/scrub/attr.h
@@ -37,6 +37,16 @@ xchk_xattr_valuebuf(
 	return ab->buf;
 }
 
+/* A place to store attribute names. */
+static inline unsigned char *
+xchk_xattr_namebuf(
+	struct xfs_scrub	*sc)
+{
+	struct xchk_xattr_buf	*ab = sc->buf;
+
+	return (unsigned char *)ab->buf + ab->sz - XATTR_NAME_MAX - 1;
+}
+
 /* A bitmap of space usage computed by walking an attr leaf block. */
 static inline unsigned long *
 xchk_xattr_usedmap(
diff --git a/fs/xfs/scrub/attr_repair.c b/fs/xfs/scrub/attr_repair.c
new file mode 100644
index 000000000000..b05547efc7b4
--- /dev/null
+++ b/fs/xfs/scrub/attr_repair.c
@@ -0,0 +1,728 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_dir2.h"
+#include "xfs_attr.h"
+#include "xfs_attr_leaf.h"
+#include "xfs_attr_sf.h"
+#include "xfs_attr_remote.h"
+#include "xfs_bmap.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/array.h"
+#include "scrub/blob.h"
+#include "scrub/attr.h"
+
+/*
+ * Extended Attribute Repair
+ * =========================
+ *
+ * We repair extended attributes by reading the attribute fork blocks looking
+ * for keys and values, then truncate the entire attr fork and reinsert all
+ * the attributes.  Unfortunately, there's no secondary copy of most extended
+ * attribute data, which means that if we blow up midway through there's
+ * little we can do.
+ */
+
+struct xrep_xattr_key {
+	xblob_cookie		value_cookie;
+	xblob_cookie		name_cookie;
+	uint			hash;
+	int			flags;
+	uint16_t		valuelen;
+	uint16_t		namelen;
+} __packed;
+
+struct xrep_xattr {
+	struct xfs_scrub	*sc;
+	struct xfbma		*xattr_records;
+	struct xblob		*xattr_blobs;
+
+	/* Size of the largest attribute value we're trying to salvage. */
+	size_t			max_valuelen;
+};
+
+/*
+ * Iterate each block in an attr fork extent.  The m_attr_geo fsbcount is
+ * always 1 for now, but code defensively in case this ever changes.
+ */
+#define for_each_xfs_attr_block(mp, irec, dabno) \
+	for ((dabno) = roundup((xfs_dablk_t)(irec)->br_startoff, \
+			(mp)->m_attr_geo->fsbcount); \
+	     (dabno) < (irec)->br_startoff + (irec)->br_blockcount; \
+	     (dabno) += (mp)->m_attr_geo->fsbcount)
+
+/*
+ * Decide if we want to salvage this attribute.  We don't bother with
+ * incomplete or oversized keys or values.
+ */
+STATIC bool
+xrep_xattr_want_salvage(
+	int			flags,
+	const void		*name,
+	int			namelen,
+	int			valuelen)
+{
+	if (flags & XFS_ATTR_INCOMPLETE)
+		return false;
+	if (namelen > XATTR_NAME_MAX || namelen <= 0)
+		return false;
+	if (valuelen > XATTR_SIZE_MAX || valuelen < 0)
+		return false;
+	if (!xfs_attr_namecheck(name, namelen))
+		return false;
+	return true;
+}
+
+/* Allocate an in-core record to hold xattrs while we rebuild the xattr data. */
+STATIC int
+xrep_xattr_salvage_key(
+	struct xrep_xattr	*rx,
+	int			flags,
+	unsigned char		*name,
+	int			namelen,
+	unsigned char		*value,
+	int			valuelen)
+{
+	struct xrep_xattr_key	key = {
+		.valuelen	= valuelen,
+		.flags		= flags & (XFS_ATTR_ROOT | XFS_ATTR_SECURE),
+		.namelen	= namelen,
+	};
+	int			error;
+
+	error = xblob_put(rx->xattr_blobs, &key.name_cookie, name, namelen);
+	if (error)
+		return error;
+	error = xblob_put(rx->xattr_blobs, &key.value_cookie, value, valuelen);
+	if (error)
+		return error;
+
+	key.hash = xfs_da_hashname(name, namelen);
+
+	error = xfbma_append(rx->xattr_records, &key);
+	if (error)
+		return error;
+
+	rx->max_valuelen = max_t(size_t, rx->max_valuelen, valuelen);
+	return 0;
+}
+
+/*
+ * Record a shortform extended attribute key & value for later reinsertion
+ * into the inode.
+ */
+STATIC int
+xrep_xattr_salvage_sf_attr(
+	struct xrep_xattr		*rx,
+	struct xfs_attr_sf_entry	*sfe)
+{
+	unsigned char			*value = &sfe->nameval[sfe->namelen];
+
+	if (!xrep_xattr_want_salvage(sfe->flags, sfe->nameval, sfe->namelen,
+			sfe->valuelen))
+		return 0;
+
+	return xrep_xattr_salvage_key(rx, sfe->flags, sfe->nameval,
+			sfe->namelen, value, sfe->valuelen);
+}
+
+/*
+ * Record a local format extended attribute key & value for later reinsertion
+ * into the inode.
+ */
+STATIC int
+xrep_xattr_salvage_local_attr(
+	struct xrep_xattr		*rx,
+	struct xfs_attr_leaf_entry	*ent,
+	unsigned int			nameidx,
+	const char			*buf_end,
+	struct xfs_attr_leaf_name_local	*lentry)
+{
+	unsigned char			*value;
+	unsigned long			*usedmap = xchk_xattr_usedmap(rx->sc);
+	unsigned int			valuelen;
+	unsigned int			namesize;
+
+	/*
+	 * Decode the leaf local entry format.  If something seems wrong, we
+	 * junk the attribute.
+	 */
+	valuelen = be16_to_cpu(lentry->valuelen);
+	namesize = xfs_attr_leaf_entsize_local(lentry->namelen, valuelen);
+	if ((char *)lentry + namesize > buf_end)
+		return 0;
+	if (!xrep_xattr_want_salvage(ent->flags, lentry->nameval,
+			lentry->namelen, valuelen))
+		return 0;
+	if (!xchk_xattr_set_map(rx->sc, usedmap, nameidx, namesize))
+		return 0;
+
+	/* Try to save this attribute. */
+	value = &lentry->nameval[lentry->namelen];
+	return xrep_xattr_salvage_key(rx, ent->flags, lentry->nameval,
+			lentry->namelen, value, valuelen);
+}
+
+/*
+ * Record a remote format extended attribute key & value for later reinsertion
+ * into the inode.
+ */
+STATIC int
+xrep_xattr_salvage_remote_attr(
+	struct xrep_xattr		*rx,
+	struct xfs_attr_leaf_entry	*ent,
+	unsigned int			nameidx,
+	const char			*buf_end,
+	struct xfs_attr_leaf_name_remote *rentry,
+	unsigned int			ent_idx,
+	struct xfs_buf			*leaf_bp)
+{
+	struct xfs_da_args		args = {
+		.trans	= rx->sc->tp,
+		.dp	= rx->sc->ip,
+		.index	= ent_idx,
+		.geo	= rx->sc->mp->m_attr_geo,
+	};
+	unsigned long			*usedmap = xchk_xattr_usedmap(rx->sc);
+	unsigned char			*value;
+	unsigned int			valuelen;
+	unsigned int			namesize;
+	int				error;
+
+	/*
+	 * Decode the leaf remote entry format.  If something seems wrong, we
+	 * junk the attribute.  Note that we should never find a zero-length
+	 * remote attribute value.
+	 */
+	valuelen = be32_to_cpu(rentry->valuelen);
+	namesize = xfs_attr_leaf_entsize_remote(rentry->namelen);
+	if ((char *)rentry + namesize > buf_end)
+		return 0;
+	if (valuelen == 0 ||
+	    !xrep_xattr_want_salvage(ent->flags, rentry->name, rentry->namelen,
+			valuelen))
+		return 0;
+	if (!xchk_xattr_set_map(rx->sc, usedmap, nameidx, namesize))
+		return 0;
+
+	/*
+	 * Find somewhere to save this value.  We can't use the xchk_xattr_buf
+	 * here because we're still using the memory for the attr block bitmap.
+	 */
+	value = kmem_alloc_large(valuelen, KM_MAYFAIL);
+	if (!value)
+		return -ENOMEM;
+
+	/* Look up the remote value and stash it for reconstruction. */
+	args.valuelen = valuelen;
+	args.namelen = rentry->namelen;
+	args.name = rentry->name;
+	args.value = value;
+	error = xfs_attr3_leaf_getvalue(leaf_bp, &args);
+	if (error || args.rmtblkno == 0)
+		goto err_free;
+
+	error = xfs_attr_rmtval_get(&args);
+	if (error)
+		goto err_free;
+
+	/* Try to save this attribute. */
+	error = xrep_xattr_salvage_key(rx, ent->flags, rentry->name,
+			rentry->namelen, value, valuelen);
+err_free:
+	/* remote value was garbage, junk it */
+	if (error == -EFSBADCRC || error == -EFSCORRUPTED)
+		error = 0;
+	kmem_free(value);
+	return error;
+}
+
+/* Extract every xattr key that we can from this attr fork block. */
+STATIC int
+xrep_xattr_recover_leaf(
+	struct xrep_xattr		*rx,
+	struct xfs_buf			*bp)
+{
+	struct xfs_attr3_icleaf_hdr	leafhdr;
+	struct xfs_scrub		*sc = rx->sc;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_attr_leafblock	*leaf;
+	unsigned long			*usedmap = xchk_xattr_usedmap(sc);
+	struct xfs_attr_leaf_name_local	*lentry;
+	struct xfs_attr_leaf_name_remote *rentry;
+	struct xfs_attr_leaf_entry	*ent;
+	struct xfs_attr_leaf_entry	*entries;
+	char				*buf_end;
+	size_t				off;
+	unsigned int			nameidx;
+	unsigned int			hdrsize;
+	int				i;
+	int				error = 0;
+
+	bitmap_zero(usedmap, mp->m_attr_geo->blksize);
+
+	/* Check the leaf header */
+	leaf = bp->b_addr;
+	xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, &leafhdr, leaf);
+	hdrsize = xfs_attr3_leaf_hdr_size(leaf);
+	xchk_xattr_set_map(sc, usedmap, 0, hdrsize);
+	entries = xfs_attr3_leaf_entryp(leaf);
+
+	buf_end = (char *)bp->b_addr + mp->m_attr_geo->blksize;
+	for (i = 0, ent = entries; i < leafhdr.count; ent++, i++) {
+		if (xchk_should_terminate(sc, &error))
+			break;
+
+		/* Skip key if it conflicts with something else? */
+		off = (char *)ent - (char *)leaf;
+		if (!xchk_xattr_set_map(sc, usedmap, off,
+				sizeof(xfs_attr_leaf_entry_t)))
+			continue;
+
+		/* Check the name information. */
+		nameidx = be16_to_cpu(ent->nameidx);
+		if (nameidx < leafhdr.firstused ||
+		    nameidx >= mp->m_attr_geo->blksize)
+			continue;
+
+		if (ent->flags & XFS_ATTR_LOCAL) {
+			lentry = xfs_attr3_leaf_name_local(leaf, i);
+			error = xrep_xattr_salvage_local_attr(rx, ent, nameidx,
+					buf_end, lentry);
+		} else {
+			rentry = xfs_attr3_leaf_name_remote(leaf, i);
+			error = xrep_xattr_salvage_remote_attr(rx, ent, nameidx,
+					buf_end, rentry, i, bp);
+		}
+		if (error)
+			break;
+	}
+
+	return error;
+}
+
+/* Try to recover shortform attrs. */
+STATIC int
+xrep_xattr_recover_sf(
+	struct xrep_xattr		*rx)
+{
+	struct xfs_attr_shortform	*sf;
+	struct xfs_attr_sf_entry	*sfe;
+	struct xfs_attr_sf_entry	*next;
+	struct xfs_ifork		*ifp;
+	unsigned char			*end;
+	int				i;
+	int				error = 0;
+
+	ifp = XFS_IFORK_PTR(rx->sc->ip, XFS_ATTR_FORK);
+	sf = (struct xfs_attr_shortform *)rx->sc->ip->i_afp->if_u1.if_data;
+	end = (unsigned char *)ifp->if_u1.if_data + ifp->if_bytes;
+
+	for (i = 0, sfe = &sf->list[0]; i < sf->hdr.count; i++) {
+		if (xchk_should_terminate(rx->sc, &error))
+			break;
+
+		next = XFS_ATTR_SF_NEXTENTRY(sfe);
+		if ((unsigned char *)next > end)
+			break;
+
+		/* Ok, let's save this key/value. */
+		error = xrep_xattr_salvage_sf_attr(rx, sfe);
+		if (error)
+			return error;
+
+		sfe = next;
+	}
+
+	return error;
+}
+
+/* Extract as many attribute keys and values as we can. */
+STATIC int
+xrep_xattr_recover(
+	struct xrep_xattr	*rx)
+{
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	got;
+	struct xfs_scrub	*sc = rx->sc;
+	struct xfs_ifork	*ifp;
+	struct xfs_da_blkinfo	*info;
+	struct xfs_buf		*bp;
+	xfs_dablk_t		dabno;
+	int			error = 0;
+
+	if (sc->ip->i_d.di_aformat == XFS_DINODE_FMT_LOCAL)
+		return xrep_xattr_recover_sf(rx);
+
+	/*
+	 * Set the xchk_xattr_buf to be as large as we're going to need it to
+	 * be to compute space usage bitmaps for each attr block we try to
+	 * salvage.  We don't salvage attrs whose name and value areas are
+	 * crosslinked with anything else.
+	 */
+	error = xchk_setup_xattr_buf(sc, 0, KM_MAYFAIL);
+	if (error == -ENOMEM)
+		return -EDEADLOCK;
+	if (error)
+		return error;
+
+	/* Iterate each attr block in the attr fork. */
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_ATTR_FORK);
+	for_each_xfs_iext(ifp, &icur, &got) {
+		xfs_trim_extent(&got, 0, (xfs_dablk_t)-1U);
+		if (got.br_blockcount == 0)
+			continue;
+		for_each_xfs_attr_block(sc->mp, &got, dabno) {
+			if (xchk_should_terminate(sc, &error))
+				return error;
+
+			/*
+			 * Try to read buffer.  We invalidate them in the next
+			 * step so we don't bother to set a buffer type or
+			 * ops.
+			 */
+			error = xfs_da_read_buf(sc->tp, sc->ip, dabno, -1, &bp,
+					XFS_ATTR_FORK, NULL);
+			if (error || !bp)
+				continue;
+
+			/* Screen out non-leaves & other garbage. */
+			info = bp->b_addr;
+			if (info->magic != cpu_to_be16(XFS_ATTR3_LEAF_MAGIC) ||
+			    xfs_attr3_leaf_buf_ops.verify_struct(bp) != NULL)
+				continue;
+
+			error = xrep_xattr_recover_leaf(rx, bp);
+			if (error)
+				return error;
+		}
+	}
+
+	return error;
+}
+
+/* Reset a shortform attr fork. */
+static void
+xrep_xattr_reset_attr_local(
+	struct xfs_scrub	*sc,
+	uint64_t		nr_attrs)
+{
+	struct xfs_attr_sf_hdr	*hdr;
+	struct xfs_ifork	*ifp;
+
+	/*
+	 * If the data fork isn't in btree format (or there are no attrs) then
+	 * all we need to do is zap the attr fork.
+	 */
+	if (nr_attrs == 0 || sc->ip->i_d.di_format != XFS_DINODE_FMT_BTREE) {
+		xfs_attr_fork_remove(sc->ip, sc->tp);
+		return;
+	}
+
+	/*
+	 * If the data fork is in btree format we can't change di_forkoff
+	 * because we could run afoul of the rule that forks aren't supposed to
+	 * be in btree format if there's enough space in the fork that we could
+	 * have extents format.  Instead, reinitialize the shortform fork to
+	 * have zero attributes.
+	 */
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_ATTR_FORK);
+	xfs_idata_realloc(sc->ip, (int)sizeof(*hdr) - ifp->if_bytes,
+			XFS_ATTR_FORK);
+	hdr = (struct xfs_attr_sf_hdr *)ifp->if_u1.if_data;
+	hdr->count = 0;
+	hdr->totsize = cpu_to_be16(sizeof(*hdr));
+	xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE | XFS_ILOG_ADATA);
+}
+
+/* Free all the attribute fork blocks and delete the fork. */
+STATIC int
+xrep_xattr_reset_fork(
+	struct xfs_scrub	*sc,
+	uint64_t		nr_attrs)
+{
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	got;
+	struct xfs_ifork	*ifp;
+	struct xfs_buf		*bp;
+	xfs_fileoff_t		lblk;
+	int			error;
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	if (sc->ip->i_d.di_aformat == XFS_DINODE_FMT_LOCAL) {
+		xrep_xattr_reset_attr_local(sc, nr_attrs);
+		return 0;
+	}
+
+	/* Invalidate each attr block in the attr fork. */
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_ATTR_FORK);
+	for_each_xfs_iext(ifp, &icur, &got) {
+		xfs_trim_extent(&got, 0, (xfs_dablk_t)-1U);
+		if (got.br_blockcount == 0)
+			continue;
+		for_each_xfs_attr_block(sc->mp, &got, lblk) {
+			error = xfs_da_get_buf(sc->tp, sc->ip, lblk, -1, &bp,
+					XFS_ATTR_FORK);
+			if (error || !bp)
+				continue;
+			xfs_trans_binval(sc->tp, bp);
+			error = xfs_trans_roll_inode(&sc->tp, sc->ip);
+			if (error)
+				return error;
+		}
+	}
+
+	/* Now free all the blocks. */
+	error = xfs_bunmapi_range(&sc->tp, sc->ip, XFS_ATTR_FORK, 0, -1ULL,
+			XFS_BMAPI_NODISCARD);
+	if (error)
+		return error;
+
+	/* Log the inode core to keep it moving forward in the log. */
+	xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE);
+
+	/* Reset the attribute fork - this also destroys the in-core fork */
+	xfs_attr_fork_remove(sc->ip, sc->tp);
+	return 0;
+}
+
+/*
+ * Compare two xattr keys.  Sort by namespace flags in ascending numeric
+ * order, then by hash.
+ */
+static int
+xrep_xattr_key_cmp(
+	const void			*a,
+	const void			*b)
+{
+	const struct xrep_xattr_key	*ap = a;
+	const struct xrep_xattr_key	*bp = b;
+
+	if (ap->flags > bp->flags)
+		return 1;
+	else if (ap->flags < bp->flags)
+		return -1;
+
+	if (ap->hash > bp->hash)
+		return 1;
+	else if (ap->hash < bp->hash)
+		return -1;
+	return 0;
+}
+
+/*
+ * Find all the extended attributes for this inode by scraping them out of the
+ * attribute key blocks by hand.  The caller must clean up the lists if
+ * anything goes wrong.
+ */
+STATIC int
+xrep_xattr_find_attributes(
+	struct xfs_scrub	*sc,
+	struct xfbma		*xattr_records,
+	struct xblob		*xattr_blobs)
+{
+	struct xrep_xattr	rx = {
+		.sc		= sc,
+		.xattr_records	= xattr_records,
+		.xattr_blobs	= xattr_blobs,
+	};
+	struct xfs_ifork	*ifp;
+	int			error;
+
+	error = xrep_ino_dqattach(sc);
+	if (error)
+		return error;
+
+	/* Extent map should be loaded. */
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_ATTR_FORK);
+	if (XFS_IFORK_FORMAT(sc->ip, XFS_ATTR_FORK) != XFS_DINODE_FMT_LOCAL &&
+	    !(ifp->if_flags & XFS_IFEXTENTS)) {
+		error = xfs_iread_extents(sc->tp, sc->ip, XFS_ATTR_FORK);
+		if (error)
+			return error;
+	}
+
+	/* Read every attr key and value and record them in memory. */
+	error = xrep_xattr_recover(&rx);
+	if (error)
+		return error;
+
+	/*
+	 * Reset the xchk_attr_buf to be as large as we're going to need it to
+	 * be to store each attribute name and value as we re-add them to the
+	 * file.  We must preallocate the memory here because once we start
+	 * to modify the filesystem we cannot afford an ENOMEM.
+	 */
+	error = xchk_setup_xattr_buf(sc, rx.max_valuelen, KM_MAYFAIL);
+	if (error == -ENOMEM)
+		return -EDEADLOCK;
+	if (error)
+		return error;
+
+	return 0;
+}
+
+struct xrep_add_attr {
+	struct xfs_scrub	*sc;
+	struct xfbma		*xattr_records;
+	struct xblob		*xattr_blobs;
+};
+
+/* Insert one xattr key/value. */
+STATIC int
+xrep_xattr_insert_rec(
+	const void			*item,
+	void				*priv)
+{
+	const struct xrep_xattr_key	*key = item;
+	struct xrep_add_attr		*x = priv;
+	unsigned char			*name = xchk_xattr_namebuf(x->sc);
+	unsigned char			*value = xchk_xattr_valuebuf(x->sc);
+	int				error;
+
+	/*
+	 * The attribute name is stored near the end of the in-core buffer,
+	 * though we reserve one more byte to ensure null termination.
+	 */
+	name[XATTR_NAME_MAX] = 0;
+
+	error = xblob_get(x->xattr_blobs, key->name_cookie, name, key->namelen);
+	if (error)
+		return error;
+
+	error = xblob_free(x->xattr_blobs, key->name_cookie);
+	if (error)
+		return error;
+
+	error = xblob_get(x->xattr_blobs, key->value_cookie, value,
+			key->valuelen);
+	if (error)
+		return error;
+
+	error = xblob_free(x->xattr_blobs, key->value_cookie);
+	if (error)
+		return error;
+
+	name[key->namelen] = 0;
+	value[key->valuelen] = 0;
+
+	return xfs_attr_set(x->sc->ip, name, value, key->valuelen,
+			XFS_ATTR_NSP_ONDISK_TO_ARGS(key->flags));
+}
+
+/*
+ * Insert all the attributes that we collected.
+ *
+ * Commit the repair transaction and drop the ilock because the attribute
+ * setting code needs to be able to allocate special transactions and take the
+ * ilock on its own.  Some day we'll have deferred attribute setting, at which
+ * point we'll be able to use that to replace the attributes atomically and
+ * safely.
+ */
+STATIC int
+xrep_xattr_rebuild_tree(
+	struct xfs_scrub	*sc,
+	struct xfbma		*xattr_records,
+	struct xblob		*xattr_blobs)
+{
+	struct xrep_add_attr	x = {
+		.sc		= sc,
+		.xattr_records	= xattr_records,
+		.xattr_blobs	= xattr_blobs,
+	};
+	int			error;
+
+	error = xfs_trans_commit(sc->tp);
+	sc->tp = NULL;
+	if (error)
+		return error;
+
+	xfs_iunlock(sc->ip, XFS_ILOCK_EXCL);
+	sc->ilock_flags &= ~XFS_ILOCK_EXCL;
+
+	/*
+	 * Sort the attribute keys by hash to minimize dabtree splits when we
+	 * rebuild the extended attribute information.
+	 */
+	error = xfbma_sort(xattr_records, xrep_xattr_key_cmp);
+	if (error)
+		return error;
+
+	/* Re-add every attr to the file. */
+	return xfbma_iter_del(xattr_records, xrep_xattr_insert_rec, &x);
+}
+
+/*
+ * Repair the extended attribute metadata.
+ *
+ * XXX: Remote attribute value buffers encompass the entire (up to 64k) buffer.
+ * The buffer cache in XFS can't handle aliased multiblock buffers, so this
+ * might misbehave if the attr fork is crosslinked with other filesystem
+ * metadata.
+ */
+int
+xrep_xattr(
+	struct xfs_scrub	*sc)
+{
+	struct xfbma		*xattr_records;
+	struct xblob		*xattr_blobs;
+	int			error;
+
+	if (!xfs_inode_hasattr(sc->ip))
+		return -ENOENT;
+
+	/* Set up some storage */
+	xattr_records = xfbma_init(sizeof(struct xrep_xattr_key));
+	if (IS_ERR(xattr_records))
+		return PTR_ERR(xattr_records);
+	xattr_blobs = xblob_init();
+	if (IS_ERR(xattr_blobs)) {
+		error = PTR_ERR(xattr_blobs);
+		goto out_arr;
+	}
+
+	/* Collect extended attributes by parsing raw blocks. */
+	error = xrep_xattr_find_attributes(sc, xattr_records, xattr_blobs);
+	if (error)
+		goto out;
+
+	/*
+	 * Invalidate and truncate all attribute fork extents.  This is the
+	 * point at which we are no longer able to bail out gracefully.
+	 * We commit the transaction here because xfs_attr_set allocates its
+	 * own transactions.
+	 */
+	error = xrep_xattr_reset_fork(sc, xfbma_length(xattr_records));
+	if (error)
+		goto out;
+
+	/* Now rebuild the attribute information. */
+	error = xrep_xattr_rebuild_tree(sc, xattr_records, xattr_blobs);
+out:
+	xblob_destroy(xattr_blobs);
+out_arr:
+	xfbma_destroy(xattr_records);
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 4ff2ef9fc13b..ea77ce90401d 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -72,6 +72,7 @@ int xrep_inode(struct xfs_scrub *sc);
 int xrep_bmap_data(struct xfs_scrub *sc);
 int xrep_bmap_attr(struct xfs_scrub *sc);
 int xrep_symlink(struct xfs_scrub *sc);
+int xrep_xattr(struct xfs_scrub *sc);
 
 #else
 
@@ -118,6 +119,7 @@ xrep_reset_perag_resv(
 #define xrep_bmap_data			xrep_notsupported
 #define xrep_bmap_attr			xrep_notsupported
 #define xrep_symlink			xrep_notsupported
+#define xrep_xattr			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index ea1154aa2225..0561cce37a31 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -290,7 +290,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_xattr,
 		.scrub	= xchk_xattr,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_xattr,
 	},
 	[XFS_SCRUB_TYPE_SYMLINK] = {	/* symbolic link */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 16ed1d3e1404..99c4a3021284 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -170,4 +170,7 @@ struct xchk_fscounters {
 	unsigned long long	icount_max;
 };
 
+bool xchk_xattr_set_map(struct xfs_scrub *sc, unsigned long *map,
+		unsigned int start, unsigned int len);
+
 #endif	/* __XFS_SCRUB_SCRUB_H__ */


* [PATCH 16/18] xfs: scrub should set preen if attr leaf has holes
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (14 preceding siblings ...)
  2019-08-05  0:36 ` [PATCH 15/18] xfs: repair extended attributes Darrick J. Wong
@ 2019-08-05  0:36 ` Darrick J. Wong
  2019-08-05  0:36 ` [PATCH 17/18] xfs: repair quotas Darrick J. Wong
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:36 UTC
  To: darrick.wong; +Cc: linux-xfs, Dave Chinner

From: Darrick J. Wong <darrick.wong@oracle.com>

If an attr block indicates that it could use compaction, set the preen
flag to have the attr fork rebuilt, since the attr fork rebuilder can
take care of that for us.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
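Not part of the commit: the new check is small enough to model in
userspace.  The sketch below is only an illustration under assumptions;
model_leafhdr and the MODEL_OFLAG_* values are made-up stand-ins, not the
kernel's leaf header or XFS_SCRUB_OFLAG_* definitions.  It shows that a
leaf with holes is flagged for preening rather than as corrupt, and that
the preen flag alone is enough to request a rebuild.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the scrub output flags. */
#define MODEL_OFLAG_CORRUPT	(1u << 0)
#define MODEL_OFLAG_PREEN	(1u << 1)

/* Hypothetical model of the one leaf header field that matters here. */
struct model_leafhdr {
	bool	holes;	/* the block says it could use compaction */
};

/* Mirror of the new check: holes are not corruption, only a preen hint. */
static inline uint32_t
model_check_leaf(const struct model_leafhdr *hdr, uint32_t flags)
{
	if (hdr->holes)
		flags |= MODEL_OFLAG_PREEN;
	return flags;
}

/* Either flag is enough to ask for a rebuild. */
static inline bool
model_needs_repair(uint32_t flags)
{
	return flags & (MODEL_OFLAG_CORRUPT | MODEL_OFLAG_PREEN);
}
```

A caller would feed the result of model_check_leaf() into
model_needs_repair() to decide whether to invoke the attr fork rebuilder.
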
 fs/xfs/scrub/attr.c    |    2 ++
 fs/xfs/scrub/dabtree.c |   15 +++++++++++++++
 fs/xfs/scrub/dabtree.h |    1 +
 fs/xfs/scrub/trace.h   |    1 +
 4 files changed, 19 insertions(+)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 67f42b666aad..9dcc69ecec76 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -366,6 +366,8 @@ xchk_xattr_block(
 		xchk_da_set_corrupt(ds, level);
 	if (!xchk_xattr_set_map(ds->sc, usedmap, 0, hdrsize))
 		xchk_da_set_corrupt(ds, level);
+	if (leafhdr.holes)
+		xchk_da_set_preen(ds, level);
 
 	if (ds->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
 		goto out;
diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c
index 77ff9f97bcda..8a3dfe88347f 100644
--- a/fs/xfs/scrub/dabtree.c
+++ b/fs/xfs/scrub/dabtree.c
@@ -77,6 +77,21 @@ xchk_da_set_corrupt(
 			__return_address);
 }
 
+/* Flag a da btree node in need of optimization. */
+void
+xchk_da_set_preen(
+	struct xchk_da_btree	*ds,
+	int			level)
+{
+	struct xfs_scrub	*sc = ds->sc;
+
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_PREEN;
+	trace_xchk_fblock_preen(sc, ds->dargs.whichfork,
+			xfs_dir2_da_to_db(ds->dargs.geo,
+				ds->state->path.blk[level].blkno),
+			__return_address);
+}
+
 /* Find an entry at a certain level in a da btree. */
 STATIC void *
 xchk_da_btree_entry(
diff --git a/fs/xfs/scrub/dabtree.h b/fs/xfs/scrub/dabtree.h
index cb3f0003245b..b367bf87a183 100644
--- a/fs/xfs/scrub/dabtree.h
+++ b/fs/xfs/scrub/dabtree.h
@@ -36,6 +36,7 @@ bool xchk_da_process_error(struct xchk_da_btree *ds, int level, int *error);
 
 /* Check for da btree corruption. */
 void xchk_da_set_corrupt(struct xchk_da_btree *ds, int level);
+void xchk_da_set_preen(struct xchk_da_btree *ds, int level);
 
 int xchk_da_btree_hash(struct xchk_da_btree *ds, int level, __be32 *hashp);
 int xchk_da_btree(struct xfs_scrub *sc, int whichfork,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 1124c86b980f..7eb166599a61 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -298,6 +298,7 @@ DEFINE_EVENT(xchk_fblock_error_class, name, \
 
 DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_error);
 DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_warning);
+DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_preen);
 
 TRACE_EVENT(xchk_incomplete,
 	TP_PROTO(struct xfs_scrub *sc, void *ret_ip),


* [PATCH 17/18] xfs: repair quotas
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (15 preceding siblings ...)
  2019-08-05  0:36 ` [PATCH 16/18] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
@ 2019-08-05  0:36 ` Darrick J. Wong
  2019-08-05  0:36 ` [PATCH 18/18] xfs: convert big array and blob array to use memfd backend Darrick J. Wong
  2019-08-05  7:20 ` [PATCH v19 00/18] xfs: online repair support Dave Chinner
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:36 UTC
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Fix anything that causes the quota verifiers to fail.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
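Not part of the commit: a minimal userspace model of the timer fixup done
by xrep_quota_fix_timer() in this patch.  It assumes host-endian values
and takes the current time as a parameter instead of calling
get_seconds(); model_fix_timer is a hypothetical name used only here.

```c
#include <stdint.h>

/*
 * If usage is over a nonzero soft limit but no grace timer is running,
 * start one.  Otherwise leave the timer alone.
 */
static inline uint32_t
model_fix_timer(uint64_t soft, uint64_t count, uint32_t timer,
		uint32_t now, uint32_t timelimit)
{
	if (soft && count > soft && timer == 0)
		return now + timelimit;
	return timer;
}
```

The function returns the timer value to store; a running timer and a
dquot that is under its soft limit both come back unchanged.
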
 fs/xfs/Makefile             |    1 
 fs/xfs/scrub/attr_repair.c  |    2 
 fs/xfs/scrub/common.h       |    9 +
 fs/xfs/scrub/quota.c        |    2 
 fs/xfs/scrub/quota_repair.c |  363 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.c       |   58 +++++++
 fs/xfs/scrub/repair.h       |    8 +
 fs/xfs/scrub/scrub.c        |   11 +
 8 files changed, 446 insertions(+), 8 deletions(-)
 create mode 100644 fs/xfs/scrub/quota_repair.c
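
Not part of the commit: the per-dquot fixups in xrep_quota_item() reduce
to two rules, sketched here for a single resource.  The struct and
function names are hypothetical, the values are host-endian, and fs_max
stands in for whichever filesystem-wide bound applies (data blocks,
inodes, or realtime blocks).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one resource tracked by a dquot. */
struct model_dquot {
	uint64_t	softlimit;
	uint64_t	hardlimit;
	uint64_t	count;
};

/*
 * Returns true if anything changed.  Sets *need_quotacheck when a usage
 * counter had to be capped, because the real value is then unknown.
 */
static inline bool
model_fix_dquot(struct model_dquot *dq, uint64_t fs_max,
		bool *need_quotacheck)
{
	bool	dirty = false;

	/* A soft limit above the hard limit is pulled down to match. */
	if (dq->softlimit > dq->hardlimit) {
		dq->softlimit = dq->hardlimit;
		dirty = true;
	}

	/* Usage can't exceed what the filesystem physically holds. */
	if (dq->count > fs_max) {
		dq->count = fs_max;
		*need_quotacheck = true;
		dirty = true;
	}
	return dirty;
}
```

Only the counter cap requests a quotacheck; clamping a limit just marks
the dquot dirty.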


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 270a3f41fb30..a2461621ac26 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -172,5 +172,6 @@ xfs-y				+= $(addprefix scrub/, \
 				   repair.o \
 				   symlink_repair.o \
 				   )
+xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota_repair.o
 endif
 endif
diff --git a/fs/xfs/scrub/attr_repair.c b/fs/xfs/scrub/attr_repair.c
index b05547efc7b4..1dd2064052a0 100644
--- a/fs/xfs/scrub/attr_repair.c
+++ b/fs/xfs/scrub/attr_repair.c
@@ -457,7 +457,7 @@ xrep_xattr_reset_attr_local(
 }
 
 /* Free all the attribute fork blocks and delete the fork. */
-STATIC int
+int
 xrep_xattr_reset_fork(
 	struct xfs_scrub	*sc,
 	uint64_t		nr_attrs)
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 003a772cd26c..475680576c1b 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -142,4 +142,13 @@ int xchk_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
 void xchk_stop_reaping(struct xfs_scrub *sc);
 void xchk_start_reaping(struct xfs_scrub *sc);
 
+/* Do we need to invoke the repair tool? */
+static inline bool xfs_scrub_needs_repair(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
+			       XFS_SCRUB_OFLAG_XCORRUPT |
+			       XFS_SCRUB_OFLAG_PREEN);
+}
+uint xchk_quota_to_dqtype(struct xfs_scrub *sc);
+
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
index 0a33b4421c32..9dd737aff144 100644
--- a/fs/xfs/scrub/quota.c
+++ b/fs/xfs/scrub/quota.c
@@ -18,7 +18,7 @@
 #include "scrub/common.h"
 
 /* Convert a scrub type code to a DQ flag, or return 0 if error. */
-static inline uint
+uint
 xchk_quota_to_dqtype(
 	struct xfs_scrub	*sc)
 {
diff --git a/fs/xfs/scrub/quota_repair.c b/fs/xfs/scrub/quota_repair.c
new file mode 100644
index 000000000000..5f76c4f4db1a
--- /dev/null
+++ b/fs/xfs/scrub/quota_repair.c
@@ -0,0 +1,363 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_alloc.h"
+#include "xfs_bmap.h"
+#include "xfs_quota.h"
+#include "xfs_qm.h"
+#include "xfs_dquot.h"
+#include "xfs_dquot_item.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Quota Repair
+ * ============
+ *
+ * Quota repairs are fairly simplistic; we fix everything that the dquot
+ * verifiers complain about, cap any counters or limits that make no sense,
+ * and schedule a quotacheck if we had to fix anything.  We also repair any
+ * data fork extent records that don't apply to metadata files.
+ */
+
+struct xrep_quota_info {
+	struct xfs_scrub	*sc;
+	bool			need_quotacheck;
+};
+
+/* Repair the fields in an individual quota item. */
+STATIC int
+xrep_quota_item(
+	struct xfs_dquot	*dq,
+	uint			dqtype,
+	void			*priv)
+{
+	struct xrep_quota_info	*rqi = priv;
+	struct xfs_scrub	*sc = rqi->sc;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_disk_dquot	*d = &dq->q_core;
+	unsigned long long	bsoft;
+	unsigned long long	isoft;
+	unsigned long long	rsoft;
+	unsigned long long	bhard;
+	unsigned long long	ihard;
+	unsigned long long	rhard;
+	unsigned long long	bcount;
+	unsigned long long	icount;
+	unsigned long long	rcount;
+	xfs_ino_t		fs_icount;
+	bool			dirty = false;
+	int			error;
+
+	/* Did we get the dquot type we wanted? */
+	if (dqtype != (d->d_flags & XFS_DQ_ALLTYPES)) {
+		d->d_flags = dqtype;
+		dirty = true;
+	}
+
+	if (d->d_pad0 || d->d_pad) {
+		d->d_pad0 = 0;
+		d->d_pad = 0;
+		dirty = true;
+	}
+
+	/* Check the limits. */
+	bhard = be64_to_cpu(d->d_blk_hardlimit);
+	ihard = be64_to_cpu(d->d_ino_hardlimit);
+	rhard = be64_to_cpu(d->d_rtb_hardlimit);
+
+	bsoft = be64_to_cpu(d->d_blk_softlimit);
+	isoft = be64_to_cpu(d->d_ino_softlimit);
+	rsoft = be64_to_cpu(d->d_rtb_softlimit);
+
+	if (bsoft > bhard) {
+		d->d_blk_softlimit = d->d_blk_hardlimit;
+		dirty = true;
+	}
+
+	if (isoft > ihard) {
+		d->d_ino_softlimit = d->d_ino_hardlimit;
+		dirty = true;
+	}
+
+	if (rsoft > rhard) {
+		d->d_rtb_softlimit = d->d_rtb_hardlimit;
+		dirty = true;
+	}
+
+	/* Check the resource counts. */
+	bcount = be64_to_cpu(d->d_bcount);
+	icount = be64_to_cpu(d->d_icount);
+	rcount = be64_to_cpu(d->d_rtbcount);
+	fs_icount = percpu_counter_sum(&mp->m_icount);
+
+	/*
+	 * Check that usage doesn't exceed physical limits.  However, on
+	 * a reflink filesystem we're allowed to exceed physical space
+	 * if there are no quota limits.  We don't know what the real number
+	 * is, but we can make quotacheck find out for us.
+	 */
+	if (!xfs_sb_version_hasreflink(&mp->m_sb) &&
+	    mp->m_sb.sb_dblocks < bcount) {
+		dq->q_res_bcount -= be64_to_cpu(dq->q_core.d_bcount);
+		dq->q_res_bcount += mp->m_sb.sb_dblocks;
+		d->d_bcount = cpu_to_be64(mp->m_sb.sb_dblocks);
+		rqi->need_quotacheck = true;
+		dirty = true;
+	}
+	if (icount > fs_icount) {
+		dq->q_res_icount -= be64_to_cpu(dq->q_core.d_icount);
+		dq->q_res_icount += fs_icount;
+		d->d_icount = cpu_to_be64(fs_icount);
+		rqi->need_quotacheck = true;
+		dirty = true;
+	}
+	if (rcount > mp->m_sb.sb_rblocks) {
+		dq->q_res_rtbcount -= be64_to_cpu(dq->q_core.d_rtbcount);
+		dq->q_res_rtbcount += mp->m_sb.sb_rblocks;
+		d->d_rtbcount = cpu_to_be64(mp->m_sb.sb_rblocks);
+		rqi->need_quotacheck = true;
+		dirty = true;
+	}
+
+	if (!dirty)
+		return 0;
+
+	dq->dq_flags |= XFS_DQ_DIRTY;
+	xfs_trans_dqjoin(sc->tp, dq);
+	xfs_trans_log_dquot(sc->tp, dq);
+	error = xfs_trans_roll(&sc->tp);
+	xfs_dqlock(dq);
+	return error;
+}
+
+/* Fix a quota timer so that we can pass the verifier. */
+STATIC void
+xrep_quota_fix_timer(
+	__be64			softlimit,
+	__be64			countnow,
+	__be32			*timer,
+	time_t			timelimit)
+{
+	uint64_t		soft = be64_to_cpu(softlimit);
+	uint64_t		count = be64_to_cpu(countnow);
+
+	if (soft && count > soft && *timer == 0)
+		*timer = cpu_to_be32(get_seconds() + timelimit);
+}
+
+/* Fix anything the verifiers complain about. */
+STATIC int
+xrep_quota_block(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*bp,
+	uint			dqtype,
+	xfs_dqid_t		id)
+{
+	struct xfs_dqblk	*d = (struct xfs_dqblk *)bp->b_addr;
+	struct xfs_disk_dquot	*ddq;
+	struct xfs_quotainfo	*qi = sc->mp->m_quotainfo;
+	enum xfs_blft		buftype = 0;
+	int			i;
+
+	bp->b_ops = &xfs_dquot_buf_ops;
+	for (i = 0; i < qi->qi_dqperchunk; i++) {
+		ddq = &d[i].dd_diskdq;
+
+		ddq->d_magic = cpu_to_be16(XFS_DQUOT_MAGIC);
+		ddq->d_version = XFS_DQUOT_VERSION;
+		ddq->d_flags = dqtype;
+		ddq->d_id = cpu_to_be32(id + i);
+
+		xrep_quota_fix_timer(ddq->d_blk_softlimit,
+				ddq->d_bcount, &ddq->d_btimer,
+				qi->qi_btimelimit);
+		xrep_quota_fix_timer(ddq->d_ino_softlimit,
+				ddq->d_icount, &ddq->d_itimer,
+				qi->qi_itimelimit);
+		xrep_quota_fix_timer(ddq->d_rtb_softlimit,
+				ddq->d_rtbcount, &ddq->d_rtbtimer,
+				qi->qi_rtbtimelimit);
+
+		/* We only support v5 filesystems so always set these. */
+		uuid_copy(&d[i].dd_uuid, &sc->mp->m_sb.sb_meta_uuid);
+		d[i].dd_lsn = 0;
+		xfs_update_cksum((char *)&d[i], sizeof(struct xfs_dqblk),
+				 XFS_DQUOT_CRC_OFF);
+	}
+	switch (dqtype) {
+	case XFS_DQ_USER:
+		buftype = XFS_BLFT_UDQUOT_BUF;
+		break;
+	case XFS_DQ_GROUP:
+		buftype = XFS_BLFT_GDQUOT_BUF;
+		break;
+	case XFS_DQ_PROJ:
+		buftype = XFS_BLFT_PDQUOT_BUF;
+		break;
+	}
+	xfs_trans_buf_set_type(sc->tp, bp, buftype);
+	xfs_trans_log_buf(sc->tp, bp, 0, BBTOB(bp->b_length) - 1);
+	return xfs_trans_roll(&sc->tp);
+}
+
+/* Repair quota's data fork. */
+STATIC int
+xrep_quota_data_fork(
+	struct xfs_scrub	*sc,
+	uint			dqtype)
+{
+	struct xfs_bmbt_irec	irec = { 0 };
+	struct xfs_iext_cursor	icur;
+	struct xfs_quotainfo	*qi = sc->mp->m_quotainfo;
+	struct xfs_ifork	*ifp;
+	struct xfs_buf		*bp;
+	struct xfs_dqblk	*d;
+	xfs_dqid_t		id;
+	xfs_fileoff_t		max_dqid_off;
+	xfs_fileoff_t		off;
+	xfs_fsblock_t		fsbno;
+	bool			truncate = false;
+	int			error = 0;
+
+	error = xrep_metadata_inode_forks(sc);
+	if (error)
+		goto out;
+
+	/* Check for data fork problems that apply only to quota files. */
+	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
+	for_each_xfs_iext(ifp, &icur, &irec) {
+		if (isnullstartblock(irec.br_startblock)) {
+			error = -EFSCORRUPTED;
+			goto out;
+		}
+
+		if (irec.br_startoff > max_dqid_off ||
+		    irec.br_startoff + irec.br_blockcount - 1 > max_dqid_off) {
+			truncate = true;
+			break;
+		}
+	}
+	if (truncate) {
+		error = xfs_itruncate_extents(&sc->tp, sc->ip, XFS_DATA_FORK,
+				max_dqid_off * sc->mp->m_sb.sb_blocksize);
+		if (error)
+			goto out;
+	}
+
+	/* Now go fix anything that fails the verifiers. */
+	for_each_xfs_iext(ifp, &icur, &irec) {
+		for (fsbno = irec.br_startblock, off = irec.br_startoff;
+		     fsbno < irec.br_startblock + irec.br_blockcount;
+		     fsbno += XFS_DQUOT_CLUSTER_SIZE_FSB,
+				off += XFS_DQUOT_CLUSTER_SIZE_FSB) {
+			id = off * qi->qi_dqperchunk;
+			error = xfs_trans_read_buf(sc->mp, sc->tp,
+					sc->mp->m_ddev_targp,
+					XFS_FSB_TO_DADDR(sc->mp, fsbno),
+					qi->qi_dqchunklen,
+					0, &bp, &xfs_dquot_buf_ops);
+			if (error == 0) {
+				d = (struct xfs_dqblk *)bp->b_addr;
+				if (id == be32_to_cpu(d->dd_diskdq.d_id)) {
+					xfs_trans_brelse(sc->tp, bp);
+					continue;
+				}
+				error = -EFSCORRUPTED;
+				xfs_trans_brelse(sc->tp, bp);
+			}
+			if (error != -EFSBADCRC && error != -EFSCORRUPTED)
+				goto out;
+
+			/* Failed verifier, try again. */
+			error = xfs_trans_read_buf(sc->mp, sc->tp,
+					sc->mp->m_ddev_targp,
+					XFS_FSB_TO_DADDR(sc->mp, fsbno),
+					qi->qi_dqchunklen,
+					0, &bp, NULL);
+			if (error)
+				goto out;
+
+			/*
+			 * Fix the quota block, which will roll our transaction
+			 * and release bp.
+			 */
+			error = xrep_quota_block(sc, bp, dqtype, id);
+			if (error)
+				goto out;
+		}
+	}
+
+out:
+	return error;
+}
+
+/*
+ * Go fix anything in the quota items that we could have been mad about.  Now
+ * that we've checked the quota inode data fork we have to drop ILOCK_EXCL to
+ * use the regular dquot functions.
+ */
+STATIC int
+xrep_quota_problems(
+	struct xfs_scrub	*sc,
+	uint			dqtype)
+{
+	struct xrep_quota_info	rqi;
+	int			error;
+
+	rqi.sc = sc;
+	rqi.need_quotacheck = false;
+	error = xfs_qm_dqiterate(sc->mp, dqtype, xrep_quota_item, &rqi);
+	if (error)
+		return error;
+
+	/* Make a quotacheck happen. */
+	if (rqi.need_quotacheck)
+		xrep_force_quotacheck(sc, dqtype);
+	return 0;
+}
+
+/* Repair all of a quota type's items. */
+int
+xrep_quota(
+	struct xfs_scrub	*sc)
+{
+	uint			dqtype;
+	int			error;
+
+	dqtype = xchk_quota_to_dqtype(sc);
+
+	/* Fix problematic data fork mappings. */
+	error = xrep_quota_data_fork(sc, dqtype);
+	if (error)
+		goto out;
+
+	/* Unlock quota inode; we play only with dquots from now on. */
+	xfs_iunlock(sc->ip, sc->ilock_flags);
+	sc->ilock_flags = 0;
+
+	/* Fix anything the dquot verifiers complain about. */
+	error = xrep_quota_problems(sc, dqtype);
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index ad93d25602ae..63b0e2440acf 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -24,6 +24,8 @@
 #include "xfs_extent_busy.h"
 #include "xfs_ag_resv.h"
 #include "xfs_quota.h"
+#include "xfs_attr.h"
+#include "xfs_reflink.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -975,3 +977,59 @@ xrep_reset_perag_resv(
 out:
 	return error;
 }
+
+/*
+ * Repair the attr/data forks of a metadata inode.  The metadata inode must be
+ * pointed to by sc->ip and the ILOCK must be held.
+ */
+int
+xrep_metadata_inode_forks(
+	struct xfs_scrub	*sc)
+{
+	__u32			smtype;
+	__u32			smflags;
+	int			error;
+
+	smtype = sc->sm->sm_type;
+	smflags = sc->sm->sm_flags;
+
+	/* Let's see if the forks need repair. */
+	sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT;
+	error = xchk_metadata_inode_forks(sc);
+	if (error || !xfs_scrub_needs_repair(sc->sm))
+		goto out;
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	/* Clear the reflink flag & attr forks that we shouldn't have. */
+	if (xfs_is_reflink_inode(sc->ip)) {
+		error = xfs_reflink_clear_inode_flag(sc->ip, &sc->tp);
+		if (error)
+			goto out;
+	}
+
+	if (xfs_inode_hasattr(sc->ip)) {
+		error = xrep_xattr_reset_fork(sc, 0);
+		if (error)
+			goto out;
+	}
+
+	/* Repair the data fork. */
+	sc->sm->sm_type = XFS_SCRUB_TYPE_BMBTD;
+	error = xrep_bmap_data(sc);
+	sc->sm->sm_type = smtype;
+	if (error)
+		goto out;
+
+	/* Bail out if we still need repairs. */
+	sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT;
+	error = xchk_metadata_inode_forks(sc);
+	if (error)
+		goto out;
+	if (xfs_scrub_needs_repair(sc->sm))
+		error = -EFSCORRUPTED;
+out:
+	sc->sm->sm_type = smtype;
+	sc->sm->sm_flags = smflags;
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index ea77ce90401d..334ff33031e6 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -52,6 +52,8 @@ int xrep_find_ag_btree_roots(struct xfs_scrub *sc, struct xfs_buf *agf_bp,
 void xrep_force_quotacheck(struct xfs_scrub *sc, uint dqtype);
 int xrep_ino_dqattach(struct xfs_scrub *sc);
 int xrep_reset_perag_resv(struct xfs_scrub *sc);
+int xrep_xattr_reset_fork(struct xfs_scrub *sc, uint64_t nr_attrs);
+int xrep_metadata_inode_forks(struct xfs_scrub *sc);
 
 /* Metadata revalidators */
 
@@ -73,6 +75,11 @@ int xrep_bmap_data(struct xfs_scrub *sc);
 int xrep_bmap_attr(struct xfs_scrub *sc);
 int xrep_symlink(struct xfs_scrub *sc);
 int xrep_xattr(struct xfs_scrub *sc);
+#ifdef CONFIG_XFS_QUOTA
+int xrep_quota(struct xfs_scrub *sc);
+#else
+# define xrep_quota			xrep_notsupported
+#endif /* CONFIG_XFS_QUOTA */
 
 #else
 
@@ -120,6 +127,7 @@ xrep_reset_perag_resv(
 #define xrep_bmap_attr			xrep_notsupported
 #define xrep_symlink			xrep_notsupported
 #define xrep_xattr			xrep_notsupported
+#define xrep_quota			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 0561cce37a31..3ecf1f24a20e 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -322,19 +322,19 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_FS,
 		.setup	= xchk_setup_quota,
 		.scrub	= xchk_quota,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_quota,
 	},
 	[XFS_SCRUB_TYPE_GQUOTA] = {	/* group quota */
 		.type	= ST_FS,
 		.setup	= xchk_setup_quota,
 		.scrub	= xchk_quota,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_quota,
 	},
 	[XFS_SCRUB_TYPE_PQUOTA] = {	/* project quota */
 		.type	= ST_FS,
 		.setup	= xchk_setup_quota,
 		.scrub	= xchk_quota,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_quota,
 	},
 	[XFS_SCRUB_TYPE_FSCOUNTERS] = {	/* fs summary counters */
 		.type	= ST_FS,
@@ -527,9 +527,8 @@ xfs_scrub_metadata(
 		if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR))
 			sc.sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 
-		needs_fix = (sc.sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
-						XFS_SCRUB_OFLAG_XCORRUPT |
-						XFS_SCRUB_OFLAG_PREEN));
+		needs_fix = xfs_scrub_needs_repair(sc.sm);
+
 		/*
 		 * If userspace asked for a repair but it wasn't necessary,
 		 * report that back to userspace.


* [PATCH 18/18] xfs: convert big array and blob array to use memfd backend
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (16 preceding siblings ...)
  2019-08-05  0:36 ` [PATCH 17/18] xfs: repair quotas Darrick J. Wong
@ 2019-08-05  0:36 ` Darrick J. Wong
  2019-08-05  7:20 ` [PATCH v19 00/18] xfs: online repair support Dave Chinner
  18 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2019-08-05  0:36 UTC
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

There are several problems with the initial implementations of the big
array and the blob array data structures.  First, using linked lists
imposes a two-pointer overhead on every record stored.  For blobs this
isn't serious, but for fixed-size records this increases memory
requirements by 40-60%.  Second, we're using kernel memory to store the
intermediate records.  Kernel memory cannot be paged out, which means we
run the risk of OOMing the machine when we run out of physical memory.

Therefore, replace the linked lists in both structures with memfd files.
Random access becomes much easier, memory overhead drops to a negligible
amount, and because memfd pages can be swapped, we have considerably
more flexibility for memory use.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
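Not part of the commit: the list overhead quoted above is simple
arithmetic, sketched here as a helper.  model_list_overhead_pct is a
hypothetical name, and the calculation ignores allocator rounding, so
the real overhead in the kernel may differ.

```c
#include <stddef.h>

/*
 * Percentage by which two list pointers inflate a record of obj_size
 * bytes, given the pointer size of the machine.
 */
static inline unsigned int
model_list_overhead_pct(size_t ptr_size, size_t obj_size)
{
	return (unsigned int)((100 * 2 * ptr_size) / obj_size);
}
```

With 8-byte pointers, model_list_overhead_pct(8, 40) gives 40 and
model_list_overhead_pct(8, 27) gives 59, so the 40-60% range would
correspond to records of roughly 27 to 40 bytes.
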
 fs/xfs/Makefile      |    1 
 fs/xfs/scrub/array.c |  607 +++++++++++++++++++++++++++++++++++++++-----------
 fs/xfs/scrub/array.h |   16 -
 fs/xfs/scrub/blob.c  |   94 +++++---
 fs/xfs/scrub/blob.h  |    5 
 fs/xfs/scrub/trace.h |   23 ++
 fs/xfs/scrub/xfile.c |  121 ++++++++++
 fs/xfs/scrub/xfile.h |   21 ++
 8 files changed, 708 insertions(+), 180 deletions(-)
 create mode 100644 fs/xfs/scrub/xfile.c
 create mode 100644 fs/xfs/scrub/xfile.h
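
Not part of the commit: a userspace sketch of the idea behind the
file-backed array, using a tmpfile() stream in place of the kernel memfd.
All model_* names are hypothetical, and the layout (records stored back
to back, so the offset is the index times the record size) is an
assumption made for illustration, not a description of xfbma internals.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the file-backed array. */
struct model_xfbma {
	FILE		*fp;		/* backing file, not kernel memory */
	size_t		obj_size;	/* size of each record */
	uint64_t	nr;		/* number of records stored */
};

/* Assumed layout: records are stored back to back. */
static inline long
model_offset(const struct model_xfbma *a, uint64_t idx)
{
	return (long)(idx * a->obj_size);
}

static int
model_init(struct model_xfbma *a, size_t obj_size)
{
	a->fp = tmpfile();
	if (!a->fp)
		return -1;
	a->obj_size = obj_size;
	a->nr = 0;
	return 0;
}

/* Copy a record into the file at the given index. */
static int
model_set(struct model_xfbma *a, uint64_t idx, const void *rec)
{
	if (fseek(a->fp, model_offset(a, idx), SEEK_SET))
		return -1;
	if (fwrite(rec, a->obj_size, 1, a->fp) != 1)
		return -1;
	if (idx >= a->nr)
		a->nr = idx + 1;
	return 0;
}

/* Copy a record out of the file; the caller supplies the buffer. */
static int
model_get(struct model_xfbma *a, uint64_t idx, void *rec)
{
	if (idx >= a->nr)
		return -1;
	if (fseek(a->fp, model_offset(a, idx), SEEK_SET))
		return -1;
	return fread(rec, a->obj_size, 1, a->fp) == 1 ? 0 : -1;
}

static void
model_destroy(struct model_xfbma *a)
{
	fclose(a->fp);
}
```

Every access copies through a caller-supplied buffer, which is the same
reason the patch reserves temp space at the end of struct xfbma: the
backing pages cannot be dereferenced directly.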


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a2461621ac26..4a4f8121499b 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -171,6 +171,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   refcount_repair.o \
 				   repair.o \
 				   symlink_repair.o \
+				   xfile.o \
 				   )
 xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota_repair.o
 endif
diff --git a/fs/xfs/scrub/array.c b/fs/xfs/scrub/array.c
index 4089e595df8b..1b3635a115b2 100644
--- a/fs/xfs/scrub/array.c
+++ b/fs/xfs/scrub/array.c
@@ -6,24 +6,41 @@
 #include "xfs.h"
 #include "xfs_fs.h"
 #include "xfs_shared.h"
+#include "xfs_format.h"
 #include "scrub/array.h"
+#include "scrub/scrub.h"
+#include "scrub/trace.h"
+#include "scrub/xfile.h"
 
 /*
  * XFS Fixed-Size Big Memory Array
  * ===============================
- * The big memory array uses a list to store large numbers of fixed-size
- * records in memory.  Access to the array is performed via indexed get and put
- * methods, and an append method is provided for convenience.  Array elements
- * can be set to all zeroes, which means that the entry is NULL and will be
- * skipped during iteration.
+ * The file-backed memory array uses a memfd "file" to store large numbers of
+ * fixed-size records in memory that can be paged out.  This puts less stress
+ * on the memory reclaim algorithms because memfd file pages are not pinned and
+ * can be paged out; however, array access is less direct than it would be in
+ * a regular memory array.  Access to the array is performed via indexed get
+ * and put methods, and an append method is provided for convenience.  Array
+ * elements can be set to all zeroes, which means that the entry is NULL and
+ * will be skipped during iteration.
  */
 
-struct xa_item {
-	struct list_head	list;
-	/* array item comes after here */
-};
+#define XFBMA_MAX_TEMP	(2)
 
-#define XA_ITEM_SIZE(sz)	(sizeof(struct xa_item) + (sz))
+/*
+ * Pointer to temp space.  Because we can't access the memfd data directly, we
+ * allocate a small amount of memory on the end of the xfbma to buffer array
+ * items when we need space to store values temporarily.
+ */
+static inline void *
+xfbma_temp(
+	struct xfbma	*array,
+	unsigned int	nr)
+{
+	ASSERT(nr < XFBMA_MAX_TEMP);
+
+	return ((char *)(array + 1)) + (nr * array->obj_size);
+}
 
 /* Initialize a big memory array. */
 struct xfbma *
@@ -31,97 +48,47 @@ xfbma_init(
 	size_t		obj_size)
 {
 	struct xfbma	*array;
+	struct file	*filp;
 	int		error;
 
+	filp = xfile_create("big array");
+	if (!filp)
+		return ERR_PTR(-ENOMEM);
+	if (IS_ERR(filp))
+		return ERR_CAST(filp);
+
 	error = -ENOMEM;
-	array = kmem_alloc(sizeof(struct xfbma) + obj_size,
+	array = kmem_alloc(sizeof(struct xfbma) + (XFBMA_MAX_TEMP * obj_size),
 			KM_NOFS | KM_MAYFAIL);
 	if (!array)
-		return ERR_PTR(error);
+		goto out_filp;
 
+	array->filp = filp;
 	array->obj_size = obj_size;
 	array->nr = 0;
-	INIT_LIST_HEAD(&array->list);
-	memset(&array->cache, 0, sizeof(array->cache));
-
 	return array;
+out_filp:
+	fput(filp);
+	return ERR_PTR(error);
 }
 
 void
 xfbma_destroy(
 	struct xfbma	*array)
 {
-	struct xa_item	*item, *n;
-
-	list_for_each_entry_safe(item, n, &array->list, list) {
-		list_del(&item->list);
-		kmem_free(item);
-	}
+	xfile_destroy(array->filp);
 	kmem_free(array);
 }
 
-/* Find something in the cache. */
-static struct xa_item *
-xfbma_cache_lookup(
-	struct xfbma	*array,
-	uint64_t	nr)
-{
-	uint64_t	i;
-
-	for (i = 0; i < XMA_CACHE_SIZE; i++)
-		if (array->cache[i].nr == nr && array->cache[i].item)
-			return array->cache[i].item;
-	return NULL;
-}
-
-/* Invalidate the lookup cache. */
-static void
-xfbma_cache_invalidate(
-	struct xfbma	*array)
-{
-	memset(array->cache, 0, sizeof(array->cache));
-}
-
-/* Put something in the cache. */
-static void
-xfbma_cache_store(
-	struct xfbma	*array,
-	uint64_t	nr,
-	struct xa_item	*item)
-{
-	memmove(array->cache + 1, array->cache,
-			sizeof(struct xma_cache) * (XMA_CACHE_SIZE - 1));
-	array->cache[0].item = item;
-	array->cache[0].nr = nr;
-}
-
-/* Find a particular array item. */
-static struct xa_item *
-xfbma_lookup(
+/* Compute offset of array element. */
+static inline loff_t
+xfbma_offset(
 	struct xfbma	*array,
 	uint64_t	nr)
 {
-	struct xa_item	*item;
-	uint64_t	i;
-
-	if (nr >= array->nr) {
-		ASSERT(0);
-		return NULL;
-	}
-
-	item = xfbma_cache_lookup(array, nr);
-	if (item)
-		return item;
-
-	i = 0;
-	list_for_each_entry(item, &array->list, list) {
-		if (i == nr) {
-			xfbma_cache_store(array, nr, item);
-			return item;
-		}
-		i++;
-	}
-	return NULL;
+	if (nr >= array->nr)
+		return -1;
+	return nr * array->obj_size;
 }
 
 /* Get an element from the array. */
@@ -131,13 +98,14 @@ xfbma_get(
 	uint64_t	nr,
 	void		*ptr)
 {
-	struct xa_item	*item;
+	loff_t		pos = xfbma_offset(array, nr);
 
-	item = xfbma_lookup(array, nr);
-	if (!item)
+	if (pos < 0) {
+		ASSERT(0);
 		return -ENODATA;
-	memcpy(ptr, item + 1, array->obj_size);
-	return 0;
+	}
+
+	return xfile_io(array->filp, XFILE_IO_READ, &pos, ptr, array->obj_size);
 }
 
 /* Put an element in the array. */
@@ -147,13 +115,15 @@ xfbma_set(
 	uint64_t	nr,
 	void		*ptr)
 {
-	struct xa_item	*item;
+	loff_t		pos = xfbma_offset(array, nr);
 
-	item = xfbma_lookup(array, nr);
-	if (!item)
+	if (pos < 0) {
+		ASSERT(0);
 		return -ENODATA;
-	memcpy(item + 1, ptr, array->obj_size);
-	return 0;
+	}
+
+	return xfile_io(array->filp, XFILE_IO_WRITE, &pos, ptr,
+			array->obj_size);
 }
 
 /* Is this array element NULL? */
@@ -171,14 +141,16 @@ xfbma_insert_anywhere(
 	struct xfbma	*array,
 	void		*ptr)
 {
-	struct xa_item	*item;
+	void		*temp = xfbma_temp(array, 0);
+	uint64_t	i;
+	int		error;
 
 	/* Find a null slot to put it in. */
-	list_for_each_entry(item, &array->list, list) {
-		if (!xfbma_is_null(array, item + 1))
+	for (i = 0; i < array->nr; i++) {
+		error = xfbma_get(array, i, temp);
+		if (error || !xfbma_is_null(array, temp))
 			continue;
-		memcpy(item + 1, ptr, array->obj_size);
-		return 0;
+		return xfbma_set(array, i, ptr);
 	}
 
 	/* No null slots, just dump it on the end. */
@@ -191,13 +163,17 @@ xfbma_nullify(
 	struct xfbma	*array,
 	uint64_t	nr)
 {
-	struct xa_item	*item;
+	void		*temp = xfbma_temp(array, 0);
+	loff_t		pos = xfbma_offset(array, nr);
 
-	item = xfbma_lookup(array, nr);
-	if (!item)
+	if (pos < 0) {
+		ASSERT(0);
 		return -ENODATA;
-	memset(item + 1, 0, array->obj_size);
-	return 0;
+	}
+
+	memset(temp, 0, array->obj_size);
+	return xfile_io(array->filp, XFILE_IO_WRITE, &pos, temp,
+			array->obj_size);
 }
 
 /* Append an element to the array. */
@@ -206,22 +182,25 @@ xfbma_append(
 	struct xfbma	*array,
 	void		*ptr)
 {
-	struct xa_item	*item;
+	loff_t		pos = array->obj_size * array->nr;
+	int		error;
 
-	item = kmem_alloc(XA_ITEM_SIZE(array->obj_size), KM_NOFS | KM_MAYFAIL);
-	if (!item)
-		return -ENOMEM;
+	if (pos < 0) {
+		ASSERT(0);
+		return -ENODATA;
+	}
 
-	INIT_LIST_HEAD(&item->list);
-	memcpy(item + 1, ptr, array->obj_size);
-	list_add_tail(&item->list, &array->list);
+	error = xfile_io(array->filp, XFILE_IO_WRITE, &pos, ptr,
+			array->obj_size);
+	if (error)
+		return error;
 	array->nr++;
 	return 0;
 }
 
 /*
  * Iterate every element in this array, freeing each element as we go.
- * Array elements will be shifted down.
+ * Array elements will be nulled out.
  */
 int
 xfbma_iter_del(
@@ -229,23 +208,35 @@ xfbma_iter_del(
 	xfbma_iter_fn	iter_fn,
 	void		*priv)
 {
-	struct xa_item	*item, *n;
+	void		*temp = xfbma_temp(array, 0);
+	pgoff_t		oldpagenr = 0;
+	uint64_t	max_bytes;
+	uint64_t	i;
+	loff_t		pos;
 	int		error = 0;
 
-	list_for_each_entry_safe(item, n, &array->list, list) {
-		if (xfbma_is_null(array, item + 1))
+	max_bytes = array->nr * array->obj_size;
+	for (pos = 0, i = 0; pos < max_bytes; i++) {
+		pgoff_t	pagenr;
+
+		error = xfile_io(array->filp, XFILE_IO_READ, &pos, temp,
+				array->obj_size);
+		if (error)
+			break;
+		if (xfbma_is_null(array, temp))
 			goto next;
-		memcpy(array + 1, item + 1, array->obj_size);
-		error = iter_fn(array + 1, priv);
+		error = iter_fn(temp, priv);
 		if (error)
 			break;
 next:
-		list_del(&item->list);
-		kmem_free(item);
-		array->nr--;
+		/* Release the previous page if possible. */
+		pagenr = pos >> PAGE_SHIFT;
+		if (pagenr != oldpagenr)
+			xfile_discard(array->filp, oldpagenr << PAGE_SHIFT,
+					pos - 1);
+		oldpagenr = pagenr;
 	}
 
-	xfbma_cache_invalidate(array);
 	return error;
 }
 
@@ -257,27 +248,383 @@ xfbma_length(
 	return array->nr;
 }
 
-static int
-xfbma_item_cmp(
-	void			*priv,
-	struct list_head	*a,
-	struct list_head	*b)
+/*
+ * Select the median value from a[lo], a[mid], and a[hi].  Put the median in
+ * a[lo], the lowest in a[mid], and the highest in a[hi].  Using the median of
+ * the three reduces the chances that we pick the worst case pivot value, since
+ * it's likely that our array values are nearly sorted.
+ */
+STATIC int
+xfbma_qsort_pivot(
+	struct xfbma	*array,
+	xfbma_cmp_fn	cmp_fn,
+	uint64_t	lo,
+	uint64_t	mid,
+	uint64_t	hi)
 {
-	int			(*cmp_fn)(void *a, void *b) = priv;
-	struct xa_item		*ai, *bi;
+	void		*a = xfbma_temp(array, 0);
+	void		*b = xfbma_temp(array, 1);
+	int		error;
 
-	ai = container_of(a, struct xa_item, list);
-	bi = container_of(b, struct xa_item, list);
+	/* if a[mid] < a[lo], swap a[mid] and a[lo]. */
+	error = xfbma_get(array, mid, a);
+	if (error)
+		return error;
+	error = xfbma_get(array, lo, b);
+	if (error)
+		return error;
+	if (cmp_fn(a, b) < 0) {
+		error = xfbma_set(array, lo, a);
+		if (error)
+			return error;
+		error = xfbma_set(array, mid, b);
+		if (error)
+			return error;
+	}
 
-	return cmp_fn(ai + 1, bi + 1);
+	/* if a[hi] < a[mid], swap a[mid] and a[hi]. */
+	error = xfbma_get(array, hi, a);
+	if (error)
+		return error;
+	error = xfbma_get(array, mid, b);
+	if (error)
+		return error;
+	if (cmp_fn(a, b) < 0) {
+		error = xfbma_set(array, mid, a);
+		if (error)
+			return error;
+		error = xfbma_set(array, hi, b);
+		if (error)
+			return error;
+	} else {
+		goto move_front;
+	}
+
+	/* if a[mid] < a[lo], swap a[mid] and a[lo]. */
+	error = xfbma_get(array, mid, a);
+	if (error)
+		return error;
+	error = xfbma_get(array, lo, b);
+	if (error)
+		return error;
+	if (cmp_fn(a, b) < 0) {
+		error = xfbma_set(array, lo, a);
+		if (error)
+			return error;
+		error = xfbma_set(array, mid, b);
+		if (error)
+			return error;
+	}
+move_front:
+	/* move our selected pivot to a[lo] */
+	error = xfbma_get(array, lo, b);
+	if (error)
+		return error;
+	error = xfbma_get(array, mid, a);
+	if (error)
+		return error;
+	error = xfbma_set(array, mid, b);
+	if (error)
+		return error;
+	return xfbma_set(array, lo, a);
+}
+
+/*
+ * Perform an insertion sort on a subset of the array.
+ * Though insertion sort is an O(n^2) algorithm, for small set sizes it's
+ * faster than quicksort's stack machine, so we let it take over for that.
+ */
+STATIC int
+xfbma_isort(
+	struct xfbma	*array,
+	xfbma_cmp_fn	cmp_fn,
+	uint64_t	start,
+	uint64_t	end)
+{
+	void		*a = xfbma_temp(array, 0);
+	void		*b = xfbma_temp(array, 1);
+	uint64_t	tmp;
+	uint64_t	i;
+	uint64_t	run;
+	int		error;
+
+	/*
+	 * Move the smallest element in a[start..end] to a[start].  This
+	 * simplifies the loop control logic below.
+	 */
+	tmp = start;
+	error = xfbma_get(array, tmp, b);
+	if (error)
+		return error;
+	for (run = start + 1; run <= end; run++) {
+		/* if a[run] < a[tmp], tmp = run */
+		error = xfbma_get(array, run, a);
+		if (error)
+			return error;
+		if (cmp_fn(a, b) < 0) {
+			tmp = run;
+			memcpy(b, a, array->obj_size);
+		}
+	}
+
+	/*
+	 * The smallest element is a[tmp]; swap with a[start] if tmp != start.
+	 * Recall that a[tmp] is already in *b.
+	 */
+	if (tmp != start) {
+		error = xfbma_get(array, start, a);
+		if (error)
+			return error;
+		error = xfbma_set(array, tmp, a);
+		if (error)
+			return error;
+		error = xfbma_set(array, start, b);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * Perform an insertion sort on a[start+1..end].  We already made sure
+	 * that the smallest value in the original range is now in a[start],
+	 * so the inner loop should never underflow.
+	 *
+	 * For each a[start+2..end], make sure it's in the correct position
+	 * with respect to the elements that came before it.
+	 */
+	for (run = start + 2; run <= end; run++) {
+		error = xfbma_get(array, run, a);
+		if (error)
+			return error;
+
+		/*
+		 * Find the correct place for a[run] by walking leftwards
+		 * towards the start of the range until a[tmp] is no longer
+		 * greater than a[run].
+		 */
+		tmp = run - 1;
+		error = xfbma_get(array, tmp, b);
+		if (error)
+			return error;
+		while (cmp_fn(a, b) < 0) {
+			tmp--;
+			error = xfbma_get(array, tmp, b);
+			if (error)
+				return error;
+		}
+		tmp++;
+
+		/*
+		 * If tmp != run, then a[tmp..run-1] are all greater than
+		 * a[run], so right barrel roll a[tmp..run] to get this range
+		 * in sorted order.
+		 */
+		if (tmp == run)
+			continue;
+
+		for (i = run; i > tmp; i--) {
+			error = xfbma_get(array, i - 1, b);
+			if (error)
+				return error;
+			error = xfbma_set(array, i, b);
+			if (error)
+				return error;
+		}
+		error = xfbma_set(array, tmp, a);
+		if (error)
+			return error;
+	}
+
+	return 0;
 }
 
-/* Sort everything in this array. */
+/*
+ * Sort the array elements via quicksort.  This implementation incorporates
+ * four optimizations discussed in Sedgewick:
+ *
+ * 1. Use an explicit stack of array indices to store the next array
+ *    partition to sort.  This helps us to avoid recursion in the call stack,
+ *    which is particularly expensive in the kernel.
+ *
+ * 2. Choose the pivot element using a median-of-three decision tree.  This
+ *    reduces the probability of selecting a bad pivot value which causes
+ *    worst case behavior (i.e. partition sizes of 1).  Chances are fairly good
+ *    that the list is nearly sorted, so this is important.
+ *
+ * 3. The smaller of the two sub-partitions is pushed onto the stack to start
+ *    the next level of recursion, and the larger sub-partition replaces the
+ *    current stack frame.  This guarantees that we won't need more than
+ *    log2(nr) stack space.
+ *
+ * 4. Use insertion sort for small sets since insertion sort is faster
+ *    for small, mostly sorted array segments.  In the author's experience,
+ *    substituting insertion sort for arrays smaller than 4 elements yields
+ *    a ~10% reduction in runtime.
+ */
+
+/*
+ * Due to the use of signed indices, we can only support up to 2^63 records.
+ * Files can only grow to 2^63 bytes, so this is not much of a limitation.
+ */
+#define QSORT_MAX_RECS		(1ULL << 63)
+
+/*
+ * For array subsets smaller than 4 elements, it's slightly faster to use
+ * insertion sort than quicksort's stack machine.
+ */
+#define ISORT_THRESHOLD		(4)
 int
 xfbma_sort(
 	struct xfbma	*array,
 	xfbma_cmp_fn	cmp_fn)
 {
-	list_sort(cmp_fn, &array->list, xfbma_item_cmp);
-	return 0;
+	int64_t		*stack;
+	int64_t		*beg;
+	int64_t		*end;
+	void		*pivot = xfbma_temp(array, 0);
+	void		*temp = xfbma_temp(array, 1);
+	int64_t		lo, mid, hi;
+	const int	max_stack_depth = ilog2(array->nr) + 1;
+	int		stack_depth = 0;
+	int		max_stack_used = 0;
+	int		error = 0;
+
+	if (array->nr == 0)
+		return 0;
+	if (array->nr >= QSORT_MAX_RECS)
+		return -E2BIG;
+	if (array->nr <= ISORT_THRESHOLD)
+		return xfbma_isort(array, cmp_fn, 0, array->nr - 1);
+
+	/* Allocate our pointer stacks for sorting. */
+	stack = kmem_alloc(sizeof(int64_t) * 2 * max_stack_depth,
+			KM_NOFS | KM_MAYFAIL);
+	if (!stack)
+		return -ENOMEM;
+	beg = stack;
+	end = &stack[max_stack_depth];
+
+	beg[0] = 0;
+	end[0] = array->nr;
+	while (stack_depth >= 0) {
+		lo = beg[stack_depth];
+		hi = end[stack_depth] - 1;
+
+		/* Nothing left in this partition to sort; pop stack. */
+		if (lo >= hi) {
+			stack_depth--;
+			continue;
+		}
+
+		/* Small enough for insertion sort? */
+		if (hi - lo <= ISORT_THRESHOLD) {
+			error = xfbma_isort(array, cmp_fn, lo, hi);
+			if (error)
+				goto out_free;
+			stack_depth--;
+			continue;
+		}
+
+		/* Pick a pivot, move it to a[lo] and stash it. */
+		mid = lo + ((hi - lo) / 2);
+		error = xfbma_qsort_pivot(array, cmp_fn, lo, mid, hi);
+		if (error)
+			goto out_free;
+
+		error = xfbma_get(array, lo, pivot);
+		if (error)
+			goto out_free;
+
+		/*
+		 * Rearrange a[lo..hi] such that everything smaller than the
+		 * pivot is on the left side of the range and everything larger
+		 * than the pivot is on the right side of the range.
+		 */
+		while (lo < hi) {
+			/*
+			 * Decrement hi until it finds an a[hi] less than the
+			 * pivot value.
+			 */
+			error = xfbma_get(array, hi, temp);
+			if (error)
+				goto out_free;
+			while (cmp_fn(temp, pivot) >= 0 && lo < hi) {
+				hi--;
+				error = xfbma_get(array, hi, temp);
+				if (error)
+					goto out_free;
+			}
+
+			/* Copy that item (a[hi]) to a[lo]. */
+			if (lo < hi) {
+				error = xfbma_set(array, lo++, temp);
+				if (error)
+					goto out_free;
+			}
+
+			/*
+			 * Increment lo until it finds an a[lo] greater than
+			 * the pivot value.
+			 */
+			error = xfbma_get(array, lo, temp);
+			if (error)
+				goto out_free;
+			while (cmp_fn(temp, pivot) <= 0 && lo < hi) {
+				lo++;
+				error = xfbma_get(array, lo, temp);
+				if (error)
+					goto out_free;
+			}
+
+			/* Copy that item (a[lo]) to a[hi]. */
+			if (lo < hi) {
+				error = xfbma_set(array, hi--, temp);
+				if (error)
+					goto out_free;
+			}
+		}
+
+		/*
+		 * Put our pivot value in the correct place at a[lo].  All
+		 * values between a[beg[i]] and a[lo - 1] should be less than
+		 * the pivot; and all values between a[lo + 1] and a[end[i]-1]
+		 * should be greater than the pivot.
+		 */
+		error = xfbma_set(array, lo, pivot);
+		if (error)
+			goto out_free;
+
+		/*
+		 * Set up the pointers for the next iteration.  We push onto
+		 * the stack all of the unsorted values between a[lo + 1] and
+		 * a[end[i]], and we tweak the current stack frame to point to
+		 * the unsorted values between a[beg[i]] and a[lo] so that
+		 * those values will be sorted when we pop the stack.
+		 */
+		beg[stack_depth + 1] = lo + 1;
+		end[stack_depth + 1] = end[stack_depth];
+		end[stack_depth++] = lo;
+
+		/* Check our stack usage. */
+		max_stack_used = max(max_stack_used, stack_depth);
+		ASSERT(stack_depth < max_stack_depth);
+		error = stack_depth < max_stack_depth ? 0 : -EFSCORRUPTED;
+		if (error)
+			goto out_free;
+
+		/*
+		 * Always start with the smaller of the two partitions to keep
+		 * the amount of recursion in check.
+		 */
+		if (end[stack_depth] - beg[stack_depth] >
+		    end[stack_depth - 1] - beg[stack_depth - 1]) {
+			swap(beg[stack_depth], beg[stack_depth - 1]);
+			swap(end[stack_depth], end[stack_depth - 1]);
+		}
+	}
+
+out_free:
+	kmem_free(stack);
+	trace_xfbma_sort_stats(array->nr, max_stack_depth, max_stack_used,
+			error);
+	return error;
 }
diff --git a/fs/xfs/scrub/array.h b/fs/xfs/scrub/array.h
index 607e664147b3..e002edb657f4 100644
--- a/fs/xfs/scrub/array.h
+++ b/fs/xfs/scrub/array.h
@@ -6,20 +6,10 @@
 #ifndef __XFS_SCRUB_ARRAY_H__
 #define __XFS_SCRUB_ARRAY_H__
 
-struct xma_item;
-
-struct xma_cache {
-	uint64_t	nr;
-	struct xa_item	*item;
-};
-
-#define XMA_CACHE_SIZE	(8)
-
 struct xfbma {
-	struct list_head	list;
-	size_t			obj_size;
-	uint64_t		nr;
-	struct xma_cache	cache[XMA_CACHE_SIZE];
+	struct file	*filp;
+	size_t		obj_size;
+	uint64_t	nr;
 };
 
 struct xfbma *xfbma_init(size_t obj_size);
diff --git a/fs/xfs/scrub/blob.c b/fs/xfs/scrub/blob.c
index 4928f0985d49..94912fcb1fd1 100644
--- a/fs/xfs/scrub/blob.c
+++ b/fs/xfs/scrub/blob.c
@@ -8,38 +8,48 @@
 #include "xfs_shared.h"
 #include "scrub/array.h"
 #include "scrub/blob.h"
+#include "scrub/xfile.h"
 
 /*
  * XFS Blob Storage
  * ================
- * Stores and retrieves blobs using a list.  Objects are appended to
- * the list and the pointer is returned as a magic cookie for retrieval.
+ * Stores and retrieves blobs using a memfd object.  Objects are appended to
+ * the file and the offset is returned as a magic cookie for retrieval.
  */
 
 #define XB_KEY_MAGIC	0xABAADDAD
 struct xb_key {
-	struct list_head	list;
 	uint32_t		magic;
 	uint32_t		size;
+	loff_t			offset;
 	/* blob comes after here */
 } __packed;
 
-#define XB_KEY_SIZE(sz)	(sizeof(struct xb_key) + (sz))
-
 /* Initialize a blob storage object. */
 struct xblob *
 xblob_init(void)
 {
 	struct xblob	*blob;
+	struct file	*filp;
 	int		error;
 
+	filp = xfile_create("blob storage");
+	if (!filp)
+		return ERR_PTR(-ENOMEM);
+	if (IS_ERR(filp))
+		return ERR_CAST(filp);
+
 	error = -ENOMEM;
 	blob = kmem_alloc(sizeof(struct xblob), KM_NOFS | KM_MAYFAIL);
 	if (!blob)
-		return ERR_PTR(error);
+		goto out_filp;
 
-	INIT_LIST_HEAD(&blob->list);
+	blob->filp = filp;
+	blob->last_offset = PAGE_SIZE;
 	return blob;
+out_filp:
+	fput(filp);
+	return ERR_PTR(error);
 }
 
 /* Destroy a blob storage object. */
@@ -47,12 +57,7 @@ void
 xblob_destroy(
 	struct xblob	*blob)
 {
-	struct xb_key	*key, *n;
-
-	list_for_each_entry_safe(key, n, &blob->list, list) {
-		list_del(&key->list);
-		kmem_free(key);
-	}
+	xfile_destroy(blob->filp);
 	kmem_free(blob);
 }
 
@@ -64,19 +69,24 @@ xblob_get(
 	void		*ptr,
 	uint32_t	size)
 {
-	struct xb_key	*key = (struct xb_key *)cookie;
+	struct xb_key	key;
+	loff_t		pos = cookie;
+	int		error;
+
+	error = xfile_io(blob->filp, XFILE_IO_READ, &pos, &key, sizeof(key));
+	if (error)
+		return error;
 
-	if (key->magic != XB_KEY_MAGIC) {
+	if (key.magic != XB_KEY_MAGIC || key.offset != cookie) {
 		ASSERT(0);
 		return -ENODATA;
 	}
-	if (size < key->size) {
+	if (size < key.size) {
 		ASSERT(0);
 		return -EFBIG;
 	}
 
-	memcpy(ptr, key + 1, key->size);
-	return 0;
+	return xfile_io(blob->filp, XFILE_IO_READ, &pos, ptr, key.size);
 }
 
 /* Store a blob. */
@@ -87,19 +97,28 @@ xblob_put(
 	void		*ptr,
 	uint32_t	size)
 {
-	struct xb_key	*key;
-
-	key = kmem_alloc(XB_KEY_SIZE(size), KM_NOFS | KM_MAYFAIL);
-	if (!key)
-		return -ENOMEM;
-
-	INIT_LIST_HEAD(&key->list);
-	list_add_tail(&key->list, &blob->list);
-	key->magic = XB_KEY_MAGIC;
-	key->size = size;
-	memcpy(key + 1, ptr, size);
-	*cookie = (xblob_cookie)key;
+	struct xb_key	key = {
+		.offset = blob->last_offset,
+		.magic = XB_KEY_MAGIC,
+		.size = size,
+	};
+	loff_t		pos = blob->last_offset;
+	int		error;
+
+	error = xfile_io(blob->filp, XFILE_IO_WRITE, &pos, &key, sizeof(key));
+	if (error)
+		goto out_err;
+
+	error = xfile_io(blob->filp, XFILE_IO_WRITE, &pos, ptr, size);
+	if (error)
+		goto out_err;
+
+	*cookie = blob->last_offset;
+	blob->last_offset = pos;
 	return 0;
+out_err:
+	xfile_discard(blob->filp, blob->last_offset, pos - 1);
+	return -ENOMEM;
 }
 
 /* Free a blob. */
@@ -108,14 +127,19 @@ xblob_free(
 	struct xblob	*blob,
 	xblob_cookie	cookie)
 {
-	struct xb_key	*key = (struct xb_key *)cookie;
+	struct xb_key	key;
+	loff_t		pos = cookie;
+	int		error;
+
+	error = xfile_io(blob->filp, XFILE_IO_READ, &pos, &key, sizeof(key));
+	if (error)
+		return error;
 
-	if (key->magic != XB_KEY_MAGIC) {
+	if (key.magic != XB_KEY_MAGIC || key.offset != cookie) {
 		ASSERT(0);
 		return -ENODATA;
 	}
-	key->magic = 0;
-	list_del(&key->list);
-	kmem_free(key);
+
+	xfile_discard(blob->filp, cookie, cookie + sizeof(key) + key.size - 1);
 	return 0;
 }
diff --git a/fs/xfs/scrub/blob.h b/fs/xfs/scrub/blob.h
index 2595a15f78ac..c6f6c6a2e084 100644
--- a/fs/xfs/scrub/blob.h
+++ b/fs/xfs/scrub/blob.h
@@ -7,10 +7,11 @@
 #define __XFS_SCRUB_BLOB_H__
 
 struct xblob {
-	struct list_head	list;
+	struct file	*filp;
+	loff_t		last_offset;
 };
 
-typedef void			*xblob_cookie;
+typedef loff_t		xblob_cookie;
 
 struct xblob *xblob_init(void);
 void xblob_destroy(struct xblob *blob);
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 7eb166599a61..8788030d13f6 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -906,6 +906,29 @@ TRACE_EVENT(xrep_ibt_insert,
 		  __entry->freemask)
 )
 
+TRACE_EVENT(xfbma_sort_stats,
+	TP_PROTO(uint64_t nr, unsigned int max_stack_depth,
+		 unsigned int max_stack_used, int error),
+	TP_ARGS(nr, max_stack_depth, max_stack_used, error),
+	TP_STRUCT__entry(
+		__field(uint64_t, nr)
+		__field(unsigned int, max_stack_depth)
+		__field(unsigned int, max_stack_used)
+		__field(int, error)
+	),
+	TP_fast_assign(
+		__entry->nr = nr;
+		__entry->max_stack_depth = max_stack_depth;
+		__entry->max_stack_used = max_stack_used;
+		__entry->error = error;
+	),
+	TP_printk("nr %llu max_depth %u max_used %u error %d",
+		  __entry->nr,
+		  __entry->max_stack_depth,
+		  __entry->max_stack_used,
+		  __entry->error)
+);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 #endif /* _TRACE_XFS_SCRUB_TRACE_H */
diff --git a/fs/xfs/scrub/xfile.c b/fs/xfs/scrub/xfile.c
new file mode 100644
index 000000000000..e0058e61202f
--- /dev/null
+++ b/fs/xfs/scrub/xfile.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "scrub/array.h"
+#include "scrub/scrub.h"
+#include "scrub/trace.h"
+#include "scrub/xfile.h"
+#include <linux/shmem_fs.h>
+
+/*
+ * Create a memfd to our specifications and return a file pointer.  The file
+ * is not installed in the file descriptor table (because userspace has no
+ * business accessing our internal data), which means that the caller /must/
+ * fput the file when finished.
+ */
+struct file *
+xfile_create(
+	const char	*description)
+{
+	struct file	*filp;
+
+	filp = shmem_file_setup(description, 0, 0);
+	if (IS_ERR_OR_NULL(filp))
+		return filp;
+
+	filp->f_mode |= FMODE_PREAD | FMODE_PWRITE;
+	filp->f_flags |= O_RDWR | O_LARGEFILE;
+	return filp;
+}
+
+void
+xfile_destroy(
+	struct file	*filp)
+{
+	fput(filp);
+}
+
+struct xfile_io_args {
+	struct work_struct	work;
+	struct completion	*done;
+
+	struct file		*filp;
+	void			*ptr;
+	loff_t			*pos;
+	size_t			count;
+	ssize_t			ret;
+	bool			is_read;
+};
+
+static void
+xfile_io_worker(
+	struct work_struct	*work)
+{
+	struct xfile_io_args	*args;
+	unsigned int		pflags;
+
+	args = container_of(work, struct xfile_io_args, work);
+	pflags = memalloc_nofs_save();
+
+	if (args->is_read)
+		args->ret = kernel_read(args->filp, args->ptr, args->count,
+				args->pos);
+	else
+		args->ret = kernel_write(args->filp, args->ptr, args->count,
+				args->pos);
+	complete(args->done);
+
+	memalloc_nofs_restore(pflags);
+}
+
+/*
+ * Perform a read or write IO to the file backing the array.  We can defer
+ * the work to a workqueue if the caller so desires, either to reduce stack
+ * usage or because the xfs is frozen and we want to avoid deadlocking on the
+ * page fault that might be about to happen.
+ */
+int
+xfile_io(
+	struct file	*filp,
+	unsigned int	cmd_flags,
+	loff_t		*pos,
+	void		*ptr,
+	size_t		count)
+{
+	DECLARE_COMPLETION_ONSTACK(done);
+	struct xfile_io_args	args = {
+		.filp = filp,
+		.ptr = ptr,
+		.pos = pos,
+		.count = count,
+		.done = &done,
+		.is_read = (cmd_flags & XFILE_IO_MASK) == XFILE_IO_READ,
+	};
+
+	INIT_WORK_ONSTACK(&args.work, xfile_io_worker);
+	schedule_work(&args.work);
+	wait_for_completion(&done);
+	destroy_work_on_stack(&args.work);
+
+	/*
+	 * Since we're treating this file as "memory", any IO error should be
+	 * treated as a failure to find any memory.
+	 */
+	return args.ret == count ? 0 : -ENOMEM;
+}
+
+/* Discard pages backing a range of the file. */
+void
+xfile_discard(
+	struct file	*filp,
+	loff_t		start,
+	loff_t		end)
+{
+	shmem_truncate_range(file_inode(filp), start, end);
+}
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
new file mode 100644
index 000000000000..41817bcadc43
--- /dev/null
+++ b/fs/xfs/scrub/xfile.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#ifndef __XFS_SCRUB_XFILE_H__
+#define __XFS_SCRUB_XFILE_H__
+
+struct file *xfile_create(const char *description);
+void xfile_destroy(struct file *filp);
+
+/* read or write? */
+#define XFILE_IO_READ		(0)
+#define XFILE_IO_WRITE		(1)
+#define XFILE_IO_MASK		(1 << 0)
+int xfile_io(struct file *filp, unsigned int cmd_flags, loff_t *pos,
+		void *ptr, size_t count);
+
+void xfile_discard(struct file *filp, loff_t start, loff_t end);
+
+#endif /* __XFS_SCRUB_XFILE_H__ */
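
As a rough userspace analogue of the converted xfbma code in the patch above, the sketch below stores fixed-size records in an unnamed temporary file and reaches them only through positioned reads and writes.  It is an illustration under assumptions, not the kernel implementation: tmpfile() stands in for the shmem file, pread()/pwrite() stand in for xfile_io(), and every fba_* name is hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical file-backed array of fixed-size records. */
struct fba {
	FILE		*fp;		/* unnamed temp file; stand-in for the memfd */
	size_t		obj_size;	/* size of one record, in bytes */
	uint64_t	nr;		/* number of records stored so far */
};

/* Create an empty array; returns NULL on failure. */
static struct fba *fba_init(size_t obj_size)
{
	struct fba *a = malloc(sizeof(*a));

	if (!a)
		return NULL;
	a->fp = tmpfile();
	if (!a->fp) {
		free(a);
		return NULL;
	}
	a->obj_size = obj_size;
	a->nr = 0;
	return a;
}

static void fba_destroy(struct fba *a)
{
	fclose(a->fp);	/* the temporary file is removed when closed */
	free(a);
}

/* Read or write one whole record at index nr; short IO counts as failure. */
static int fba_io(struct fba *a, uint64_t nr, void *ptr, int is_write)
{
	off_t pos = (off_t)(nr * a->obj_size);
	ssize_t ret;

	if (is_write)
		ret = pwrite(fileno(a->fp), ptr, a->obj_size, pos);
	else
		ret = pread(fileno(a->fp), ptr, a->obj_size, pos);
	return ret == (ssize_t)a->obj_size ? 0 : -ENOMEM;
}

/* Copy record nr out of the file; indexes past the end are rejected. */
static int fba_get(struct fba *a, uint64_t nr, void *ptr)
{
	if (nr >= a->nr)
		return -ENODATA;
	return fba_io(a, nr, ptr, 0);
}

/* Overwrite record nr in place; indexes past the end are rejected. */
static int fba_set(struct fba *a, uint64_t nr, void *ptr)
{
	if (nr >= a->nr)
		return -ENODATA;
	return fba_io(a, nr, ptr, 1);
}

/* Append a record; the count only grows if the write succeeded. */
static int fba_append(struct fba *a, void *ptr)
{
	int error = fba_io(a, a->nr, ptr, 1);

	if (error)
		return error;
	a->nr++;
	return 0;
}
```

A caller would fba_init(sizeof(record)), fba_append() each record, and fetch copies with fba_get().  As in the patch, the record is always copied through a caller-supplied buffer because the backing pages cannot be addressed directly.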


* Re: [PATCH v19 00/18] xfs: online repair support
  2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
                   ` (17 preceding siblings ...)
  2019-08-05  0:36 ` [PATCH 18/18] xfs: convert big array and blob array to use memfd backend Darrick J. Wong
@ 2019-08-05  7:20 ` Dave Chinner
  18 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2019-08-05  7:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Aug 04, 2019 at 05:34:43PM -0700, Darrick J. Wong wrote:
> Hi all,
> 
> This is the first part of the nineteenth revision of a patchset that
> adds to XFS kernel support for online metadata scrubbing and repair.
> There aren't any on-disk format changes.
> 
> New for this version is a rebase against 5.3-rc2, integration with the
> health reporting subsystem, and the explicit revalidation of all
> metadata structures that were rebuilt.
> 
> Patch 1 lays the groundwork for scrub types specifying a revalidation
> function that will check everything that the repair function might have
> rebuilt.  This will be necessary for the free space and inode btree
> repair functions, which rebuild both btrees at once.
> 
> Patch 2 ensures that the health reporting query code doesn't get in the
> way of post-repair revalidation of all rebuilt metadata structures.
> 
> Patch 3 creates a new data structure that provides an abstraction of a
> big memory array by using linked lists.  This is where we store records
> for btree reconstruction.  This first implementation is memory
> inefficient and consumes a /lot/ of kernel memory, but lays the
> groundwork for the last patch in the set to convert the implementation
> to use a (memfd) swap file, which enables us to use pageable memory
> without pounding the slab cache.
> 
> Patches 4-10 implement reconstruction of the free space btrees, inode
> btrees, reference count btrees, inode records, inode forks, inode block
> maps, and symbolic links.

Darrick and I had a discussion on #xfs about the btree rebuilds
mainly centered around robustness. The biggest issue I saw with the
code as it stands is that we replace the existing btree as we build
it. As a result, we go from a complete tree with a single corruption
to an empty tree with lots of external dangling references (i.e.
massive corruption!) until the rebuild finishes. Hence if we crash
while the rebuild is in progress, we risk being in a state where:

	- log recovery will abort because it trips over partial tree
	  state
	- mounting won't run because scanning the btree at mount
	  time falls off the end of the btree unexpectedly, doesn't
	  find enough free space for reservations, etc
	- mounting succeeds but then the first operations fail
	  because the tree is incomplete and the filesystem
	  immediately shuts down.

So if we crash while there is a background repair taking place on
the root filesystem, then it is very likely the system will not boot
up after the crash. :(

We came to the conclusion - independently, at the same time :) -
that we should rebuild btrees in known free space with a dangling
root node and then, once the whole new tree has been built, we
atomically swap the btree root nodes. Hence if we crash during
rebuild, we just have some dangling, unreferenced used space that a
subsequent scrub/repair/rebuild cycle will release back to the free
space pool.

That leaves the original corrupt tree in place, and hence we don't
make things any worse than they already are by trying to repair the
tree. The atomic swap of the root nodes allows failsafe transition
between the old and new trees, and the rebuild can then free the
space the old tree used. If we crash at this point, then it's just
dangling free space and a subsequent scrub/repair/rebuild cycle will
release it back to the free space pool.
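
The build-then-swap scheme described above can be illustrated with a small in-memory sketch.  Everything in it is an assumption made for illustration: struct tree, tree_build() and tree_rebuild() are hypothetical names, a sorted array stands in for a btree, and a C11 atomic pointer exchange stands in for the transactional root update that a filesystem would actually need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a btree: a sorted copy of the records. */
struct tree {
	size_t	nr;
	int	recs[];
};

/* Hypothetical stand-in for the place that holds the root pointer. */
struct tree_root {
	_Atomic(struct tree *)	root;
};

/*
 * Build a complete new tree off to the side.  Nothing references it yet,
 * so a failure here leaves no trace in the live structure.  Passing a
 * NULL record set simulates a rebuild that fails part way through.
 */
static struct tree *tree_build(const int *recs, size_t nr)
{
	struct tree *t;
	size_t i;

	if (!recs)
		return NULL;
	t = malloc(sizeof(*t) + nr * sizeof(int));
	if (!t)
		return NULL;
	t->nr = nr;
	/* Insertion-sort the records into the new tree. */
	for (i = 0; i < nr; i++) {
		size_t j = i;

		while (j > 0 && t->recs[j - 1] > recs[i]) {
			t->recs[j] = t->recs[j - 1];
			j--;
		}
		t->recs[j] = recs[i];
	}
	return t;
}

/*
 * Rebuild the tree.  The old tree stays live until the new one is complete;
 * only then is the root switched in a single step and the old tree freed.
 * Returns 0 on success, or -1 with the old root untouched.
 */
static int tree_rebuild(struct tree_root *tr, const int *recs, size_t nr)
{
	struct tree *new_tree = tree_build(recs, nr);
	struct tree *old_tree;

	if (!new_tree)
		return -1;	/* the old tree is still in place */

	old_tree = atomic_exchange(&tr->root, new_tree);
	free(old_tree);		/* release the space the old tree used */
	return 0;
}
```

The point of the sketch is the ordering: a failure before the exchange leaves the original tree exactly as it was, and a failure after it leaves only the old tree's space to be reclaimed later.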

This mechanism also works with xfs_repair - if we run xfs_repair
after a crash during online rebuild, it will still see the original
corrupt trees, find the dangling free space as well, and clean
everything up with a new tree rebuild. Which means, again, an online
rebuild failure does not make anything worse than before the rebuild
started....

Darrick thinks that this can quite easily be done simply by skipping
the root node pointer update (->set_root, IIRC) until the new tree
has been fully rebuilt. Hopefully that is the case, because an
atomic swap mechanism like this will make the repair algorithms a
lot more robust. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 20+ messages
2019-08-05  0:34 [PATCH v19 00/18] xfs: online repair support Darrick J. Wong
2019-08-05  0:34 ` [PATCH 01/18] xfs: add a repair revalidation function pointer Darrick J. Wong
2019-08-05  0:34 ` [PATCH 02/18] xfs: always rescan allegedly healthy per-ag metadata after repair Darrick J. Wong
2019-08-05  0:35 ` [PATCH 03/18] xfs: create a big array data structure Darrick J. Wong
2019-08-05  0:35 ` [PATCH 04/18] xfs: repair free space btrees Darrick J. Wong
2019-08-05  0:35 ` [PATCH 05/18] xfs: repair inode btrees Darrick J. Wong
2019-08-05  0:35 ` [PATCH 06/18] xfs: repair refcount btrees Darrick J. Wong
2019-08-05  0:35 ` [PATCH 07/18] xfs: repair inode records Darrick J. Wong
2019-08-05  0:35 ` [PATCH 08/18] xfs: zap broken inode forks Darrick J. Wong
2019-08-05  0:35 ` [PATCH 09/18] xfs: repair inode block maps Darrick J. Wong
2019-08-05  0:35 ` [PATCH 10/18] xfs: repair damaged symlinks Darrick J. Wong
2019-08-05  0:35 ` [PATCH 11/18] xfs: create a blob array data structure Darrick J. Wong
2019-08-05  0:36 ` [PATCH 12/18] xfs: convert xfs_itruncate_extents_flags to use __xfs_bunmapi Darrick J. Wong
2019-08-05  0:36 ` [PATCH 13/18] xfs: remove unnecessary inode-transaction roll Darrick J. Wong
2019-08-05  0:36 ` [PATCH 14/18] xfs: create a new inode fork block unmap helper Darrick J. Wong
2019-08-05  0:36 ` [PATCH 15/18] xfs: repair extended attributes Darrick J. Wong
2019-08-05  0:36 ` [PATCH 16/18] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
2019-08-05  0:36 ` [PATCH 17/18] xfs: repair quotas Darrick J. Wong
2019-08-05  0:36 ` [PATCH 18/18] xfs: convert big array and blob array to use memfd backend Darrick J. Wong
2019-08-05  7:20 ` [PATCH v19 00/18] xfs: online repair support Dave Chinner
