* [PATCHSET v2] xfs: random fixes for 6.10
@ 2024-06-20 23:13 Darrick J. Wong
2024-06-20 23:13 ` [PATCH 1/6] xfs: fix freeing speculative preallocations for preallocated files Darrick J. Wong
` (5 more replies)
0 siblings, 6 replies; 10+ messages in thread
From: Darrick J. Wong @ 2024-06-20 23:13 UTC (permalink / raw)
To: hch, chandanbabu, djwong; +Cc: linux-xfs
Hi all,
Here are some bugfixes for 6.10. The first two patches are from hch,
and fix some longstanding delalloc leaks that only came to light now
that we've enabled it for realtime.
The second two fixes are from me -- one fixes a bug when we run out
of space for cow preallocations when alwayscow is turned on (xfs/205),
and the other corrects overzealous inode validation that causes log
recovery failure with generic/388.
The last patch is a debugging patch to ensure that transactions never
commit corrupt inodes, buffers, or dquots.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been lightly tested with fstests. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=random-fixes-6.10
---
Commits in this patchset:
* xfs: fix freeing speculative preallocations for preallocated files
* xfs: restrict when we try to align cow fork delalloc to cowextsz hints
* xfs: allow unlinked symlinks and dirs with zero size
* xfs: verify buffer, inode, and dquot items every tx commit
* xfs: fix direction in XFS_IOC_EXCHANGE_RANGE
* xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs
---
fs/xfs/Kconfig | 12 ++++++++++++
fs/xfs/libxfs/xfs_bmap.c | 31 +++++++++++++++++++++++++++----
fs/xfs/libxfs/xfs_fs.h | 2 +-
fs/xfs/libxfs/xfs_inode_buf.c | 23 ++++++++++++++++++-----
fs/xfs/xfs.h | 4 ++++
fs/xfs/xfs_bmap_util.c | 30 ++++++++++++++++++++++--------
fs/xfs/xfs_bmap_util.h | 2 +-
fs/xfs/xfs_buf_item.c | 32 ++++++++++++++++++++++++++++++++
fs/xfs/xfs_dquot_item.c | 31 +++++++++++++++++++++++++++++++
fs/xfs/xfs_icache.c | 2 +-
fs/xfs/xfs_inode.c | 24 +++++++++++++-----------
fs/xfs/xfs_inode_item.c | 32 ++++++++++++++++++++++++++++++++
fs/xfs/xfs_iomap.c | 34 ++++++++++++----------------------
13 files changed, 206 insertions(+), 53 deletions(-)
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/6] xfs: fix freeing speculative preallocations for preallocated files
2024-06-20 23:13 [PATCHSET v2] xfs: random fixes for 6.10 Darrick J. Wong
@ 2024-06-20 23:13 ` Darrick J. Wong
2024-06-20 23:13 ` [PATCH 2/6] xfs: restrict when we try to align cow fork delalloc to cowextsz hints Darrick J. Wong
` (4 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2024-06-20 23:13 UTC (permalink / raw)
To: hch, chandanbabu, djwong; +Cc: linux-xfs
From: Christoph Hellwig <hch@lst.de>
xfs_can_free_eofblocks returns false for files that have persistent
preallocations unless the force flag is passed and there are delayed
blocks. This means it won't free delalloc reservations for files
with persistent preallocations unless the force flag is set, and it
will also free the persistent preallocations if the force flag is
set and the file happens to have delayed allocations.
Both of these are bad, so do away with the force flag and always free
only post-EOF delayed allocations for files with the XFS_DIFLAG_PREALLOC
or APPEND flags set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_bmap_util.c | 30 ++++++++++++++++++++++--------
fs/xfs/xfs_bmap_util.h | 2 +-
fs/xfs/xfs_icache.c | 2 +-
fs/xfs/xfs_inode.c | 14 ++++----------
4 files changed, 28 insertions(+), 20 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index ac2e77ebb54c..a4d9fbc21b83 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -486,13 +486,11 @@ xfs_bmap_punch_delalloc_range(
/*
* Test whether it is appropriate to check an inode for and free post EOF
- * blocks. The 'force' parameter determines whether we should also consider
- * regular files that are marked preallocated or append-only.
+ * blocks.
*/
bool
xfs_can_free_eofblocks(
- struct xfs_inode *ip,
- bool force)
+ struct xfs_inode *ip)
{
struct xfs_bmbt_irec imap;
struct xfs_mount *mp = ip->i_mount;
@@ -526,11 +524,11 @@ xfs_can_free_eofblocks(
return false;
/*
- * Do not free real preallocated or append-only files unless the file
- * has delalloc blocks and we are forced to remove them.
+ * Only free real extents for inodes with persistent preallocations or
+ * the append-only flag.
*/
if (ip->i_diflags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND))
- if (!force || ip->i_delayed_blks == 0)
+ if (ip->i_delayed_blks == 0)
return false;
/*
@@ -584,6 +582,22 @@ xfs_free_eofblocks(
/* Wait on dio to ensure i_size has settled. */
inode_dio_wait(VFS_I(ip));
+ /*
+ * For preallocated files only free delayed allocations.
+ *
+ * Note that this means we also leave speculative preallocations in
+ * place for preallocated files.
+ */
+ if (ip->i_diflags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)) {
+ if (ip->i_delayed_blks) {
+ xfs_bmap_punch_delalloc_range(ip,
+ round_up(XFS_ISIZE(ip), mp->m_sb.sb_blocksize),
+ LLONG_MAX);
+ }
+ xfs_inode_clear_eofblocks_tag(ip);
+ return 0;
+ }
+
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
if (error) {
ASSERT(xfs_is_shutdown(mp));
@@ -891,7 +905,7 @@ xfs_prepare_shift(
* Trim eofblocks to avoid shifting uninitialized post-eof preallocation
* into the accessible region of the file.
*/
- if (xfs_can_free_eofblocks(ip, true)) {
+ if (xfs_can_free_eofblocks(ip)) {
error = xfs_free_eofblocks(ip);
if (error)
return error;
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 51f84d8ff372..eb0895bfb9da 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -63,7 +63,7 @@ int xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
xfs_off_t len);
/* EOF block manipulation functions */
-bool xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
+bool xfs_can_free_eofblocks(struct xfs_inode *ip);
int xfs_free_eofblocks(struct xfs_inode *ip);
int xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0953163a2d84..9967334ea99f 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1155,7 +1155,7 @@ xfs_inode_free_eofblocks(
}
*lockflags |= XFS_IOLOCK_EXCL;
- if (xfs_can_free_eofblocks(ip, false))
+ if (xfs_can_free_eofblocks(ip))
return xfs_free_eofblocks(ip);
/* inode could be preallocated or append-only */
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 58fb7a5062e1..b699fa6ee3b6 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1595,7 +1595,7 @@ xfs_release(
if (!xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL))
return 0;
- if (xfs_can_free_eofblocks(ip, false)) {
+ if (xfs_can_free_eofblocks(ip)) {
/*
* Check if the inode is being opened, written and closed
* frequently and we have delayed allocation blocks outstanding
@@ -1856,15 +1856,13 @@ xfs_inode_needs_inactive(
/*
* This file isn't being freed, so check if there are post-eof blocks
- * to free. @force is true because we are evicting an inode from the
- * cache. Post-eof blocks must be freed, lest we end up with broken
- * free space accounting.
+ * to free.
*
* Note: don't bother with iolock here since lockdep complains about
* acquiring it in reclaim context. We have the only reference to the
* inode at this point anyways.
*/
- return xfs_can_free_eofblocks(ip, true);
+ return xfs_can_free_eofblocks(ip);
}
/*
@@ -1947,15 +1945,11 @@ xfs_inactive(
if (VFS_I(ip)->i_nlink != 0) {
/*
- * force is true because we are evicting an inode from the
- * cache. Post-eof blocks must be freed, lest we end up with
- * broken free space accounting.
- *
* Note: don't bother with iolock here since lockdep complains
* about acquiring it in reclaim context. We have the only
* reference to the inode at this point anyways.
*/
- if (xfs_can_free_eofblocks(ip, true))
+ if (xfs_can_free_eofblocks(ip))
error = xfs_free_eofblocks(ip);
goto out;
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/6] xfs: restrict when we try to align cow fork delalloc to cowextsz hints
2024-06-20 23:13 [PATCHSET v2] xfs: random fixes for 6.10 Darrick J. Wong
2024-06-20 23:13 ` [PATCH 1/6] xfs: fix freeing speculative preallocations for preallocated files Darrick J. Wong
@ 2024-06-20 23:13 ` Darrick J. Wong
2024-06-21 4:32 ` Christoph Hellwig
2024-06-20 23:14 ` [PATCH 3/6] xfs: allow unlinked symlinks and dirs with zero size Darrick J. Wong
` (3 subsequent siblings)
5 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2024-06-20 23:13 UTC (permalink / raw)
To: hch, chandanbabu, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
xfs/205 produces the following failure when always_cow is enabled:
--- a/tests/xfs/205.out 2024-02-28 16:20:24.437887970 -0800
+++ b/tests/xfs/205.out.bad 2024-06-03 21:13:40.584000000 -0700
@@ -1,4 +1,5 @@
QA output created by 205
*** one file
+ !!! disk full (expected)
*** one file, a few bytes at a time
*** done
This is the result of overly aggressive attempts to align cow fork
delalloc reservations to the CoW extent size hint. Looking at the trace
data, we're trying to append a single fsblock to the "fred" file.
Trying to create a speculative post-eof reservation fails because
there's not enough space.
We then set @prealloc_blocks to zero and try again, but the cowextsz
alignment code triggers, which expands our request for a 1-fsblock
reservation into a 39-block reservation. There's not enough space for
that, so the whole write fails with ENOSPC even though there's
sufficient space in the filesystem to allocate the single block that we
need to land the write.
There are two things wrong here -- first, we shouldn't be attempting
speculative preallocations beyond what was requested when we're low on
space. Second, if we've already computed a posteof preallocation, we
shouldn't bother trying to align that to the cowextsize hint.
Fix both of these problems by adding a flag that only enables the
expansion of the delalloc reservation to the cowextsize if we're doing a
non-extending write, and only if we're not doing an ENOSPC retry. This
requires us to move the ENOSPC retry logic to xfs_bmapi_reserve_delalloc.
I probably should have caught this six years ago when 6ca30729c206d was
being reviewed, but oh well. Update the comments to reflect what the
code does now.
Fixes: 6ca30729c206d ("xfs: bmap code cleanup")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_bmap.c | 31 +++++++++++++++++++++++++++----
fs/xfs/xfs_iomap.c | 34 ++++++++++++----------------------
2 files changed, 39 insertions(+), 26 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c101cf266bc4..6af6f744fdd6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4058,20 +4058,32 @@ xfs_bmapi_reserve_delalloc(
xfs_extlen_t indlen;
uint64_t fdblocks;
int error;
- xfs_fileoff_t aoff = off;
+ xfs_fileoff_t aoff;
+ bool use_cowextszhint =
+ whichfork == XFS_COW_FORK && !prealloc;
+retry:
/*
* Cap the alloc length. Keep track of prealloc so we know whether to
* tag the inode before we return.
*/
+ aoff = off;
alen = XFS_FILBLKS_MIN(len + prealloc, XFS_MAX_BMBT_EXTLEN);
if (!eof)
alen = XFS_FILBLKS_MIN(alen, got->br_startoff - aoff);
if (prealloc && alen >= len)
prealloc = alen - len;
- /* Figure out the extent size, adjust alen */
- if (whichfork == XFS_COW_FORK) {
+ /*
+ * If we're targetting the COW fork but aren't creating a speculative
+ * posteof preallocation, try to expand the reservation to align with
+ * the COW extent size hint if there's sufficient free space.
+ *
+ * Unlike the data fork, the CoW cancellation functions will free all
+ * the reservations at inactivation, so we don't require that every
+ * delalloc reservation have a dirty pagecache.
+ */
+ if (use_cowextszhint) {
struct xfs_bmbt_irec prev;
xfs_extlen_t extsz = xfs_get_cowextsz_hint(ip);
@@ -4090,7 +4102,7 @@ xfs_bmapi_reserve_delalloc(
*/
error = xfs_quota_reserve_blkres(ip, alen);
if (error)
- return error;
+ goto out;
/*
* Split changing sb for alen and indlen since they could be coming
@@ -4140,6 +4152,17 @@ xfs_bmapi_reserve_delalloc(
out_unreserve_quota:
if (XFS_IS_QUOTA_ON(mp))
xfs_quota_unreserve_blkres(ip, alen);
+out:
+ if (error == -ENOSPC || error == -EDQUOT) {
+ trace_xfs_delalloc_enospc(ip, off, len);
+
+ if (prealloc || use_cowextszhint) {
+ /* retry without any preallocation */
+ use_cowextszhint = false;
+ prealloc = 0;
+ goto retry;
+ }
+ }
return error;
}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 378342673925..414903885ab9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1148,33 +1148,23 @@ xfs_buffered_write_iomap_begin(
}
}
-retry:
- error = xfs_bmapi_reserve_delalloc(ip, allocfork, offset_fsb,
- end_fsb - offset_fsb, prealloc_blocks,
- allocfork == XFS_DATA_FORK ? &imap : &cmap,
- allocfork == XFS_DATA_FORK ? &icur : &ccur,
- allocfork == XFS_DATA_FORK ? eof : cow_eof);
- switch (error) {
- case 0:
- break;
- case -ENOSPC:
- case -EDQUOT:
- /* retry without any preallocation */
- trace_xfs_delalloc_enospc(ip, offset, count);
- if (prealloc_blocks) {
- prealloc_blocks = 0;
- goto retry;
- }
- fallthrough;
- default:
- goto out_unlock;
- }
-
if (allocfork == XFS_COW_FORK) {
+ error = xfs_bmapi_reserve_delalloc(ip, allocfork, offset_fsb,
+ end_fsb - offset_fsb, prealloc_blocks, &cmap,
+ &ccur, cow_eof);
+ if (error)
+ goto out_unlock;
+
trace_xfs_iomap_alloc(ip, offset, count, allocfork, &cmap);
goto found_cow;
}
+ error = xfs_bmapi_reserve_delalloc(ip, allocfork, offset_fsb,
+ end_fsb - offset_fsb, prealloc_blocks, &imap, &icur,
+ eof);
+ if (error)
+ goto out_unlock;
+
/*
* Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
* them out if the write happens to fail.
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 3/6] xfs: allow unlinked symlinks and dirs with zero size
2024-06-20 23:13 [PATCHSET v2] xfs: random fixes for 6.10 Darrick J. Wong
2024-06-20 23:13 ` [PATCH 1/6] xfs: fix freeing speculative preallocations for preallocated files Darrick J. Wong
2024-06-20 23:13 ` [PATCH 2/6] xfs: restrict when we try to align cow fork delalloc to cowextsz hints Darrick J. Wong
@ 2024-06-20 23:14 ` Darrick J. Wong
2024-06-20 23:14 ` [PATCH 4/6] xfs: verify buffer, inode, and dquot items every tx commit Darrick J. Wong
` (2 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2024-06-20 23:14 UTC (permalink / raw)
To: hch, chandanbabu, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
For a very very long time, inode inactivation has set the inode size to
zero before unmapping the extents associated with the data fork.
Unfortunately, commit 3c6f46eacd876 changed the inode verifier to
prohibit zero-length symlinks and directories. If an inode happens to
get logged in this state and the system crashes before freeing the
inode, log recovery will also fail on the broken inode.
Therefore, allow zero-size symlinks and directories as long as the link
count is zero; nobody will be able to open these files by handle so
there isn't any risk of data exposure.
Fixes: 3c6f46eacd876 ("xfs: sanity check directory inode di_size")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/libxfs/xfs_inode_buf.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index e7a7bfbe75b4..513b50da6215 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -379,10 +379,13 @@ xfs_dinode_verify_fork(
/*
* A directory small enough to fit in the inode must be stored
* in local format. The directory sf <-> extents conversion
- * code updates the directory size accordingly.
+ * code updates the directory size accordingly. Directories
+ * being truncated have zero size and are not subject to this
+ * check.
*/
if (S_ISDIR(mode)) {
- if (be64_to_cpu(dip->di_size) <= fork_size &&
+ if (dip->di_size &&
+ be64_to_cpu(dip->di_size) <= fork_size &&
fork_format != XFS_DINODE_FMT_LOCAL)
return __this_address;
}
@@ -528,9 +531,19 @@ xfs_dinode_verify(
if (mode && xfs_mode_to_ftype(mode) == XFS_DIR3_FT_UNKNOWN)
return __this_address;
- /* No zero-length symlinks/dirs. */
- if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
- return __this_address;
+ /*
+ * No zero-length symlinks/dirs unless they're unlinked and hence being
+ * inactivated.
+ */
+ if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0) {
+ if (dip->di_version > 1) {
+ if (dip->di_nlink)
+ return __this_address;
+ } else {
+ if (dip->di_onlink)
+ return __this_address;
+ }
+ }
fa = xfs_dinode_verify_nrext64(mp, dip);
if (fa)
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 4/6] xfs: verify buffer, inode, and dquot items every tx commit
2024-06-20 23:13 [PATCHSET v2] xfs: random fixes for 6.10 Darrick J. Wong
` (2 preceding siblings ...)
2024-06-20 23:14 ` [PATCH 3/6] xfs: allow unlinked symlinks and dirs with zero size Darrick J. Wong
@ 2024-06-20 23:14 ` Darrick J. Wong
2024-06-20 23:14 ` [PATCH 5/6] xfs: fix direction in XFS_IOC_EXCHANGE_RANGE Darrick J. Wong
2024-06-20 23:14 ` [PATCH 6/6] xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs Darrick J. Wong
5 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2024-06-20 23:14 UTC (permalink / raw)
To: hch, chandanbabu, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
generic/388 has an annoying tendency to fail like this during log
recovery:
XFS (sda4): Unmounting Filesystem 435fe39b-82b6-46ef-be56-819499585130
XFS (sda4): Mounting V5 Filesystem 435fe39b-82b6-46ef-be56-819499585130
XFS (sda4): Starting recovery (logdev: internal)
00000000: 49 4e 81 b6 03 02 00 00 00 00 00 07 00 00 00 07 IN..............
00000010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 10 ................
00000020: 35 9a 8b c1 3e 6e 81 00 35 9a 8b c1 3f dc b7 00 5...>n..5...?...
00000030: 35 9a 8b c1 3f dc b7 00 00 00 00 00 00 3c 86 4f 5...?........<.O
00000040: 00 00 00 00 00 00 02 f3 00 00 00 00 00 00 00 00 ................
00000050: 00 00 1f 01 00 00 00 00 00 00 00 02 b2 74 c9 0b .............t..
00000060: ff ff ff ff d7 45 73 10 00 00 00 00 00 00 00 2d .....Es........-
00000070: 00 00 07 92 00 01 fe 30 00 00 00 00 00 00 00 1a .......0........
00000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000090: 35 9a 8b c1 3b 55 0c 00 00 00 00 00 04 27 b2 d1 5...;U.......'..
000000a0: 43 5f e3 9b 82 b6 46 ef be 56 81 94 99 58 51 30 C_....F..V...XQ0
XFS (sda4): Internal error Bad dinode after recovery at line 539 of file fs/xfs/xfs_inode_item_recover.c. Caller xlog_recover_items_pass2+0x4e/0xc0 [xfs]
CPU: 0 PID: 2189311 Comm: mount Not tainted 6.9.0-rc4-djwx #rc4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x4f/0x60
xfs_corruption_error+0x90/0xa0
xlog_recover_inode_commit_pass2+0x5f1/0xb00
xlog_recover_items_pass2+0x4e/0xc0
xlog_recover_commit_trans+0x2db/0x350
xlog_recovery_process_trans+0xab/0xe0
xlog_recover_process_data+0xa7/0x130
xlog_do_recovery_pass+0x398/0x840
xlog_do_log_recovery+0x62/0xc0
xlog_do_recover+0x34/0x1d0
xlog_recover+0xe9/0x1a0
xfs_log_mount+0xff/0x260
xfs_mountfs+0x5d9/0xb60
xfs_fs_fill_super+0x76b/0xa30
get_tree_bdev+0x124/0x1d0
vfs_get_tree+0x17/0xa0
path_mount+0x72b/0xa90
__x64_sys_mount+0x112/0x150
do_syscall_64+0x49/0x100
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
XFS (sda4): Corruption detected. Unmount and run xfs_repair
XFS (sda4): Metadata corruption detected at xfs_dinode_verify.part.0+0x739/0x920 [xfs], inode 0x427b2d1
XFS (sda4): Filesystem has been shut down due to log error (0x2).
XFS (sda4): Please unmount the filesystem and rectify the problem(s).
XFS (sda4): log mount/recovery failed: error -117
XFS (sda4): log mount failed
This inode log item recovery failing the dinode verifier after
replaying the contents of the inode log item into the ondisk inode.
Looking back into what the kernel was doing at the time of the fs
shutdown, a thread was in the middle of running a series of
transactions, each of which committed changes to the inode.
At some point in the middle of that chain, an invalid (at least
according to the verifier) change was committed. Had the filesystem not
shut down in the middle of the chain, a subsequent transaction would
have corrected the invalid state and nobody would have noticed. But
that's not what happened here. Instead, the invalid inode state was
committed to the ondisk log, so log recovery tripped over it.
The actual defect here was an overzealous inode verifier, which was
fixed in a separate patch. This patch adds some transaction precommit
functions for CONFIG_XFS_DEBUG=y mode so that we can detect these kinds
of transient errors at transaction commit time, where it's much easier
to find the root cause.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/Kconfig | 12 ++++++++++++
fs/xfs/xfs.h | 4 ++++
fs/xfs/xfs_buf_item.c | 32 ++++++++++++++++++++++++++++++++
fs/xfs/xfs_dquot_item.c | 31 +++++++++++++++++++++++++++++++
fs/xfs/xfs_inode_item.c | 32 ++++++++++++++++++++++++++++++++
5 files changed, 111 insertions(+)
diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index d41edd30388b..fffd6fffdce0 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -217,6 +217,18 @@ config XFS_DEBUG
Say N unless you are an XFS developer, or you play one on TV.
+config XFS_DEBUG_EXPENSIVE
+ bool "XFS expensive debugging checks"
+ depends on XFS_FS && XFS_DEBUG
+ help
+ Say Y here to get an XFS build with expensive debugging checks
+ enabled. These checks may affect performance significantly.
+
+ Note that the resulting code will be HUGER and SLOWER, and probably
+ not useful unless you are debugging a particular problem.
+
+ Say N unless you are an XFS developer, or you play one on TV.
+
config XFS_ASSERT_FATAL
bool "XFS fatal asserts"
default y
diff --git a/fs/xfs/xfs.h b/fs/xfs/xfs.h
index f6ffb4f248f7..9355ccad9503 100644
--- a/fs/xfs/xfs.h
+++ b/fs/xfs/xfs.h
@@ -10,6 +10,10 @@
#define DEBUG 1
#endif
+#ifdef CONFIG_XFS_DEBUG_EXPENSIVE
+#define DEBUG_EXPENSIVE 1
+#endif
+
#ifdef CONFIG_XFS_ASSERT_FATAL
#define XFS_ASSERT_FATAL 1
#endif
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 43031842341a..47549cfa61cd 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -22,6 +22,7 @@
#include "xfs_trace.h"
#include "xfs_log.h"
#include "xfs_log_priv.h"
+#include "xfs_error.h"
struct kmem_cache *xfs_buf_item_cache;
@@ -781,8 +782,39 @@ xfs_buf_item_committed(
return lsn;
}
+#ifdef DEBUG_EXPENSIVE
+static int
+xfs_buf_item_precommit(
+ struct xfs_trans *tp,
+ struct xfs_log_item *lip)
+{
+ struct xfs_buf_log_item *bip = BUF_ITEM(lip);
+ struct xfs_buf *bp = bip->bli_buf;
+ struct xfs_mount *mp = bp->b_mount;
+ xfs_failaddr_t fa;
+
+ if (!bp->b_ops || !bp->b_ops->verify_struct)
+ return 0;
+ if (bip->bli_flags & XFS_BLI_STALE)
+ return 0;
+
+ fa = bp->b_ops->verify_struct(bp);
+ if (fa) {
+ xfs_buf_verifier_error(bp, -EFSCORRUPTED, bp->b_ops->name,
+ bp->b_addr, BBTOB(bp->b_length), fa);
+ xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+ ASSERT(fa == NULL);
+ }
+
+ return 0;
+}
+#else
+# define xfs_buf_item_precommit NULL
+#endif
+
static const struct xfs_item_ops xfs_buf_item_ops = {
.iop_size = xfs_buf_item_size,
+ .iop_precommit = xfs_buf_item_precommit,
.iop_format = xfs_buf_item_format,
.iop_pin = xfs_buf_item_pin,
.iop_unpin = xfs_buf_item_unpin,
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 6a1aae799cf1..7d19091215b0 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -17,6 +17,7 @@
#include "xfs_trans_priv.h"
#include "xfs_qm.h"
#include "xfs_log.h"
+#include "xfs_error.h"
static inline struct xfs_dq_logitem *DQUOT_ITEM(struct xfs_log_item *lip)
{
@@ -193,8 +194,38 @@ xfs_qm_dquot_logitem_committing(
return xfs_qm_dquot_logitem_release(lip);
}
+#ifdef DEBUG_EXPENSIVE
+static int
+xfs_qm_dquot_logitem_precommit(
+ struct xfs_trans *tp,
+ struct xfs_log_item *lip)
+{
+ struct xfs_dquot *dqp = DQUOT_ITEM(lip)->qli_dquot;
+ struct xfs_mount *mp = dqp->q_mount;
+ struct xfs_disk_dquot ddq = { };
+ xfs_failaddr_t fa;
+
+ xfs_dquot_to_disk(&ddq, dqp);
+ fa = xfs_dquot_verify(mp, &ddq, dqp->q_id);
+ if (fa) {
+ XFS_CORRUPTION_ERROR("Bad dquot during logging",
+ XFS_ERRLEVEL_LOW, mp, &ddq, sizeof(ddq));
+ xfs_alert(mp,
+ "Metadata corruption detected at %pS, dquot 0x%x",
+ fa, dqp->q_id);
+ xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+ ASSERT(fa == NULL);
+ }
+
+ return 0;
+}
+#else
+# define xfs_qm_dquot_logitem_precommit NULL
+#endif
+
static const struct xfs_item_ops xfs_dquot_item_ops = {
.iop_size = xfs_qm_dquot_logitem_size,
+ .iop_precommit = xfs_qm_dquot_logitem_precommit,
.iop_format = xfs_qm_dquot_logitem_format,
.iop_pin = xfs_qm_dquot_logitem_pin,
.iop_unpin = xfs_qm_dquot_logitem_unpin,
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index f28d653300d1..ef05cbbe116c 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -37,6 +37,36 @@ xfs_inode_item_sort(
return INODE_ITEM(lip)->ili_inode->i_ino;
}
+#ifdef DEBUG_EXPENSIVE
+static void
+xfs_inode_item_precommit_check(
+ struct xfs_inode *ip)
+{
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_dinode *dip;
+ xfs_failaddr_t fa;
+
+ dip = kzalloc(mp->m_sb.sb_inodesize, GFP_KERNEL | GFP_NOFS);
+ if (!dip) {
+ ASSERT(dip != NULL);
+ return;
+ }
+
+ xfs_inode_to_disk(ip, dip, 0);
+ xfs_dinode_calc_crc(mp, dip);
+ fa = xfs_dinode_verify(mp, ip->i_ino, dip);
+ if (fa) {
+ xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
+ sizeof(*dip), fa);
+ xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+ ASSERT(fa == NULL);
+ }
+ kfree(dip);
+}
+#else
+# define xfs_inode_item_precommit_check(ip) ((void)0)
+#endif
+
/*
* Prior to finally logging the inode, we have to ensure that all the
* per-modification inode state changes are applied. This includes VFS inode
@@ -169,6 +199,8 @@ xfs_inode_item_precommit(
iip->ili_fields |= (flags | iip->ili_last_fields);
spin_unlock(&iip->ili_lock);
+ xfs_inode_item_precommit_check(ip);
+
/*
* We are done with the log item transaction dirty state, so clear it so
* that it doesn't pollute future transactions.
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 5/6] xfs: fix direction in XFS_IOC_EXCHANGE_RANGE
2024-06-20 23:13 [PATCHSET v2] xfs: random fixes for 6.10 Darrick J. Wong
` (3 preceding siblings ...)
2024-06-20 23:14 ` [PATCH 4/6] xfs: verify buffer, inode, and dquot items every tx commit Darrick J. Wong
@ 2024-06-20 23:14 ` Darrick J. Wong
2024-06-21 4:32 ` Christoph Hellwig
2024-06-20 23:14 ` [PATCH 6/6] xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs Darrick J. Wong
5 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2024-06-20 23:14 UTC (permalink / raw)
To: hch, chandanbabu, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
The kernel reads userspace's buffer but does not write it back.
Therefore this is really an _IOW ioctl. Change this before 6.10 final
releases.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 97996cb79aaa..454b63ef7201 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -996,7 +996,7 @@ struct xfs_getparents_by_handle {
#define XFS_IOC_FSGEOMETRY _IOR ('X', 126, struct xfs_fsop_geom)
#define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
#define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
-#define XFS_IOC_EXCHANGE_RANGE _IOWR('X', 129, struct xfs_exchange_range)
+#define XFS_IOC_EXCHANGE_RANGE _IOW ('X', 129, struct xfs_exchange_range)
/* XFS_IOC_GETFSUUID ---------- deprecated 140 */
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 6/6] xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs
2024-06-20 23:13 [PATCHSET v2] xfs: random fixes for 6.10 Darrick J. Wong
` (4 preceding siblings ...)
2024-06-20 23:14 ` [PATCH 5/6] xfs: fix direction in XFS_IOC_EXCHANGE_RANGE Darrick J. Wong
@ 2024-06-20 23:14 ` Darrick J. Wong
2024-06-21 4:34 ` Christoph Hellwig
5 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2024-06-20 23:14 UTC (permalink / raw)
To: hch, chandanbabu, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
xfs_init_new_inode ignores the init_xattrs parameter for filesystems
that do not have ATTR enabled. As a result, the first init_xattrs file
to be created by the kernel will not have an attr fork created to store
acls. Storing that first acl will add ATTR to the superblock flags, so
subsequent files will be created with attr forks. The overhead of this
is so small that chances are that nobody has noticed this behavior.
However, this is disastrous on a filesystem with parent pointers because
it requires that a new linkable file /must/ have a pre-existing attr
fork, and the parent pointers code uses init_xattrs to create that fork.
The preproduction version of mkfs.xfs used to set this, but the V5 sb
verifier only requires ATTR2, not ATTR. There is no guard for
filesystems with (PARENT && !ATTR).
It turns out that I misunderstood the two flags -- ATTR means that we at
some point created an attr fork to store xattrs in a file; ATTR2
apparently means only that inodes have dynamic fork offsets or that the
filesystem was mounted with the "attr2" option.
Fixes: 2442ee15bb1e ("xfs: eager inode attr fork init needs attr feature awareness")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_inode.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b699fa6ee3b6..aa134687027c 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -42,6 +42,7 @@
#include "xfs_pnfs.h"
#include "xfs_parent.h"
#include "xfs_xattr.h"
+#include "xfs_sb.h"
struct kmem_cache *xfs_inode_cache;
@@ -870,9 +871,16 @@ xfs_init_new_inode(
* this saves us from needing to run a separate transaction to set the
* fork offset in the immediate future.
*/
- if (init_xattrs && xfs_has_attr(mp)) {
+ if (init_xattrs) {
ip->i_forkoff = xfs_default_attroffset(ip) >> 3;
xfs_ifork_init_attr(ip, XFS_DINODE_FMT_EXTENTS, 0);
+
+ if (!xfs_has_attr(mp)) {
+ spin_lock(&mp->m_sb_lock);
+ xfs_add_attr(mp);
+ spin_unlock(&mp->m_sb_lock);
+ xfs_log_sb(tp);
+ }
}
/*
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 2/6] xfs: restrict when we try to align cow fork delalloc to cowextsz hints
2024-06-20 23:13 ` [PATCH 2/6] xfs: restrict when we try to align cow fork delalloc to cowextsz hints Darrick J. Wong
@ 2024-06-21 4:32 ` Christoph Hellwig
0 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2024-06-21 4:32 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: chandanbabu, linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 5/6] xfs: fix direction in XFS_IOC_EXCHANGE_RANGE
2024-06-20 23:14 ` [PATCH 5/6] xfs: fix direction in XFS_IOC_EXCHANGE_RANGE Darrick J. Wong
@ 2024-06-21 4:32 ` Christoph Hellwig
0 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2024-06-21 4:32 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, chandanbabu, linux-xfs
On Thu, Jun 20, 2024 at 04:14:42PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> The kernel reads userspace's buffer but does not write it back.
> Therefore this is really an _IOW ioctl. Change this before 6.10 final
> releases.
Eww, yes:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 6/6] xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs
2024-06-20 23:14 ` [PATCH 6/6] xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs Darrick J. Wong
@ 2024-06-21 4:34 ` Christoph Hellwig
0 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2024-06-21 4:34 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: chandanbabu, linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2024-06-21 4:34 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-20 23:13 [PATCHSET v2] xfs: random fixes for 6.10 Darrick J. Wong
2024-06-20 23:13 ` [PATCH 1/6] xfs: fix freeing speculative preallocations for preallocated files Darrick J. Wong
2024-06-20 23:13 ` [PATCH 2/6] xfs: restrict when we try to align cow fork delalloc to cowextsz hints Darrick J. Wong
2024-06-21 4:32 ` Christoph Hellwig
2024-06-20 23:14 ` [PATCH 3/6] xfs: allow unlinked symlinks and dirs with zero size Darrick J. Wong
2024-06-20 23:14 ` [PATCH 4/6] xfs: verify buffer, inode, and dquot items every tx commit Darrick J. Wong
2024-06-20 23:14 ` [PATCH 5/6] xfs: fix direction in XFS_IOC_EXCHANGE_RANGE Darrick J. Wong
2024-06-21 4:32 ` Christoph Hellwig
2024-06-20 23:14 ` [PATCH 6/6] xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs Darrick J. Wong
2024-06-21 4:34 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox