Linux EXT4 FS development

Linux EXT4 FS development
 help / color / mirror / Atom feed

* [PATCH v3] ext4: validate EA inode i_nlink in ext4_xattr_inode_iget
From: Yun Zhou @ 2026-06-14  3:54 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel

Validate EA inode state in ext4_xattr_inode_iget() to prevent
WARN_ONCE triggers in ext4_xattr_inode_update_ref() and reject
corrupted EA inodes before they can cause further damage.

When a corrupted ext4 image has an EA inode with inconsistent i_nlink
and ref_count values, the code currently allows it through and later
hits WARN_ONCE when ref_count transitions cross the 0/1 boundary.
While this is not a security or stability issue -- it only fires on
crafted filesystem images and merely prints a call trace -- it is
better handled as an early sanity check that returns -EFSCORRUPTED,
consistent with how ext4 treats other on-disk corruption.

Since ext4_xattr_inode_iget() resolves references from active xattr
entries, the target EA inode must be in active state (i_nlink=1,
ref_count>0).  Reject any inode that does not satisfy this.

Reported-by: syzbot+76916a45d2294b551fd9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=76916a45d2294b551fd9
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v3:
 - Move check after Lustre branch to avoid false positives on Lustre EA inodes
 - Merge into single condition: i_nlink != 1 || !ref_count
 - Add make_bad_inode() before iput() to avoid truncation in active txn

v2:
 - Add ref_count validation to also catch i_nlink=1, ref_count=0 case

 fs/ext4/xattr.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 982a1f831e22..77c11e4096bb 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -464,6 +464,27 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
 		inode_unlock(inode);
 	}
 
+	/*
+	 * Since this function resolves references from active xattr entries,
+	 * the EA inode must be in active state (i_nlink=1, ref_count>0).
+	 * i_nlink > 1, i_nlink == 0 (dangling reference), or ref_count == 0
+	 * (inconsistent with an active entry) all indicate corruption.
+	 */
+	if (inode->i_nlink != 1 || !ext4_xattr_inode_get_ref(inode)) {
+		ext4_error(parent->i_sb,
+			   "EA inode %lu has unexpected i_nlink=%u ref_count=%llu",
+			   ea_ino, inode->i_nlink,
+			   ext4_xattr_inode_get_ref(inode));
+		/*
+		 * Mark rejected inode to prevent ext4_evict_inode() from
+		 * attempting truncation on a corrupted inode within an active
+		 * transaction, which could exhaust journal credits.
+		 */
+		make_bad_inode(inode);
+		iput(inode);
+		return -EFSCORRUPTED;
+	}
+
 	*ea_inode = inode;
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related

* [PATCH ext4 v3] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-13 22:18 UTC (permalink / raw)
  To: linux-ext4, Jan Kara
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, Weiming Shi, Xiang Mei

ext4_read_inline_dir() can read a dirent header past the end of its inline
buffer, triggering a slab-out-of-bounds read during getdents64():

  BUG: KASAN: slab-out-of-bounds in __ext4_check_dir_entry
  Read of size 2 at addr ffff88800f3dd23c by task exploit/148
   ...
   __ext4_check_dir_entry
   ext4_read_inline_dir
   iterate_dir

The dirent payload lives in a buffer of exactly inline_size bytes:

	dir_buf = kmalloc(inline_size, GFP_NOFS);

but iteration runs in a position space extra_offset bytes larger
(extra_size = extra_offset + inline_size) so the synthetic "." and ".."
land at their block-dir offsets. A dirent is formed at "dir_buf + pos -
extra_offset", yet the loop bound (ctx->pos < extra_size) and the
ext4_check_dir_entry() length argument both use the larger extra_size. A
ctx->pos that is valid in extra_size space but whose de lies past 
inline_size is therefore accepted, and the rescan loop's rec_len probe 
and ext4_check_dir_entry() dereference de->rec_len before the entry is
rejected.

Bound the dirent header by inline_size in both loops: break out of the
rescan loop once a minimum-size header no longer fits, reject such a
position in the main loop before forming de, and pass inline_size rather
than extra_size to ext4_check_dir_entry().

Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
---
 fs/ext4/inline.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..1266c8827cca 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1454,6 +1454,9 @@ int ext4_read_inline_dir(struct file *file,
 			/* for other entry, the real offset in
 			 * the buf has to be tuned accordingly.
 			 */
+			if (i - extra_offset + ext4_dir_rec_len(1, NULL) >
+			    inline_size)
+				break;
 			de = (struct ext4_dir_entry_2 *)
 				(dir_buf + i - extra_offset);
 			/* It's too expensive to do a full
@@ -1488,10 +1491,18 @@ int ext4_read_inline_dir(struct file *file,
 			continue;
 		}

+		/*
+		 * de lives at dir_buf + ctx->pos - extra_offset, within the
+		 * kmalloc(inline_size) buffer.  Make sure its header fits before
+		 * ext4_check_dir_entry() dereferences de->rec_len.
+		 */
+		if (ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
+		    inline_size)
+			goto out;
 		de = (struct ext4_dir_entry_2 *)
 			(dir_buf + ctx->pos - extra_offset);
 		if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
-					 extra_size, ctx->pos))
+					 inline_size, ctx->pos))
 			goto out;
 		if (le32_to_cpu(de->inode)) {
 			if (!dir_emit(ctx, de->name, de->name_len,
-- 
2.43.0

^ permalink raw reply related

* Re: [PATCH net v2] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-13 22:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Baokun Li,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi
In-Reply-To: <CAPpSM+TS=mXfKXqN65ic5xFA+S3YvJfa5BuSQ5VwEA1O19VqzA@mail.gmail.com>

On Sat, Jun 13, 2026 at 3:06 PM Xiang Mei <xmei5@asu.edu> wrote:
>
> On Wed, Jun 10, 2026 at 3:01 AM Jan Kara <jack@suse.cz> wrote:
> >
> > What does the 'net' in [PATCH net v2] mean?
> >
> > On Mon 08-06-26 18:07:39, Xiang Mei wrote:
> > > ext4_read_inline_dir() reads de->rec_len / de->name past the end of its
> > > inline buffer for a crafted or corrupted inline directory, triggering a
> > > slab-out-of-bounds read during getdents64():
> > >
> > >   BUG: KASAN: slab-out-of-bounds in filldir64 (fs/readdir.c:371)
> > >   Read of size 8 at addr ffff88800fd3da3c by task exploit/146
> > >    ...
> > >    kasan_report (mm/kasan/report.c:595)
> > >    filldir64 (fs/readdir.c:371)
> > >    iterate_dir (fs/readdir.c:110)
> > >    ...
> > >
> > > The payload is copied into a buffer of exactly inline_size bytes:
> > >
> > >       dir_buf = kmalloc(inline_size, GFP_NOFS);
> > >
> > > but iteration runs in a logical position space extra_offset bytes larger
> > > than the buffer (extra_size = extra_offset + inline_size), so the synthetic
> > > "." and ".." entries land at the offsets they would have in a block-based
> > > directory. A real dirent is formed at "dir_buf + pos - extra_offset", yet
> > > the loop bounds and the ext4_check_dir_entry() length argument are all
> > > expressed in the larger extra_size. Two reachable sites dereference a
> > > dirent before confirming its physical offset is inside the allocation:
> > >
> > > In the main loop, ctx->pos is attacker-controlled via lseek() and the entry
> > > is validated with extra_size, so ext4_check_dir_entry() accepts a dirent
> > > running up to extra_offset bytes past the allocation before its length
> > > check fires. ctx->pos is also a signed loff_t: an lseek() to a small value
> > > below extra_offset makes "ctx->pos - extra_offset" negative, so a check
> > > that only bounds the top of the buffer is bypassed by underflow and de is
> > > formed before dir_buf.
> > >
> > > In the cookie-rescan loop, entered when i_version changed since the last
> > > readdir(2), the walk restarts from the beginning with i bounded by
> > > extra_size, so as i approaches extra_size the unconditional read of
> > > de->rec_len runs past the allocation before any validation.
> > >
> > > Both are the same defect, logical extra_size space versus the physical
> > > inline_size buffer. In each loop, reject a dirent whose header would not
> > > fit within inline_size before forming de, and in the main loop also reject a
> > > position that underflows below extra_offset. Validate the main-loop entry
> > > against inline_size rather than extra_size. Entries that legitimately fill
> > > the inline data still pass.
> > >
> > > Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
> > > Reported-by: Weiming Shi <bestswngs@gmail.com>
> > > Assisted-by: Claude:claude-opus-4-8
> > > Signed-off-by: Xiang Mei <xmei5@asu.edu>
> >
> > Thanks for the analysis and the patch. See some suggestions for improvement
> > below:
> >
> > > @@ -1488,10 +1491,20 @@ int ext4_read_inline_dir(struct file *file,
> > >                       continue;
> > >               }
> > >
> > > +             /*
> > > +              * de lives at dir_buf + ctx->pos - extra_offset, so the dirent
> > > +              * header must fit within inline_size.  ctx->pos is a signed,
> > > +              * lseek()-controlled loff_t: check the lower bound first, or
> > > +              * ctx->pos < extra_offset underflows and points de before dir_buf.
> > > +              */
> > > +             if (ctx->pos < extra_offset ||
> > > +                 ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
> > > +                 inline_size)
> > > +                     goto out;
> >
> > So I don't think this is really possible. ctx->pos isn't really fully user
> > controlled. When you use seek to modify ctx->pos, ext4_dir_llseek() does
> > set info->cookie to invalid value so the next time we enter
> > ext4_read_inline_dir() we are guaranteed to revalidate the offset and reset
> > it to 0, dotdot_offset, or some value greater than extra_size.
>
> Please ignore v3. It doesn't actually fix the original bug, and the PoC
> still triggers KASAN with v3 applied.
>
> You're right that lseek() poisons info->cookie and the next
> ext4_read_inline_dir() always takes the !inode_eq_iversion() rescan path.
> But the rescan doesn't reset ctx->pos to a safe value. It walks forward:
>
> i += ext4_rec_len_from_disk(de->rec_len, extra_size);
>
> i.e. it advances i by the rec_len field of each inline dirent, which is
> attacker-controlled, and the only per-entry check is the "non-zero rec_len"
> probe (which itself reads de->rec_len). It then commits "offset = i;
> ctx->pos = offset;". So the rescan can legitimately land ctx->pos at a
> position whose dirent header already lies past inline_size - it is not
> clamped to a valid in-bounds entry.
>
> The main loop then forms de = dir_buf + ctx->pos - extra_offset and calls
> ext4_check_dir_entry(), which dereferences de->rec_len before the length
> check fires. With v3 (which only fixed the rescan probe and the size
> argument, but had no bounds check on the committed ctx->pos), this still
> reads past the buffer:
>
> BUG: KASAN: slab-out-of-bounds in __ext4_check_dir_entry
> Read of size 2 at addr ... (0 bytes to the right of a 60-byte
> kmalloc-64 region)
> __ext4_check_dir_entry
> ext4_read_inline_dir
> iterate_dir
>
> You were right about one thing: the lower-bound (ctx->pos < extra_offset,
> "de before dir_buf") case I once guarded against can't happen, since
> ctx->pos never comes in below the dotdot offsets. v4 drops that and keeps
> only the upper-bound check in the main loop, which I've confirmed stops the
> reproducer. Sorry for sending a patch I hadn't fully verified.
>
> v4 to follow.
My bad again. I didn't send out the broken v3. So the correct version
would be the corrected v3.
v3 to follow.

Xiang
>
> Xiang
>
> >
> > >               de = (struct ext4_dir_entry_2 *)
> > >                       (dir_buf + ctx->pos - extra_offset);
> > >               if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
> > > -                                      extra_size, ctx->pos))
> > > +                                      inline_size, ctx->pos))
> > >                       goto out;
> > >               if (le32_to_cpu(de->inode)) {
> > >                       if (!dir_emit(ctx, de->name, de->name_len,
> >
> > Otherwise the patch looks good.
> >
> >                                                                 Honza
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH net v2] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-13 22:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Baokun Li,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi
In-Reply-To: <p2kykhe5532mxqgnmgyrg24cmlpcz24qn2erw3lkzabjjubbyq@i2c3mfdm7mpc>

On Wed, Jun 10, 2026 at 3:01 AM Jan Kara <jack@suse.cz> wrote:
>
> What does the 'net' in [PATCH net v2] mean?
>
> On Mon 08-06-26 18:07:39, Xiang Mei wrote:
> > ext4_read_inline_dir() reads de->rec_len / de->name past the end of its
> > inline buffer for a crafted or corrupted inline directory, triggering a
> > slab-out-of-bounds read during getdents64():
> >
> >   BUG: KASAN: slab-out-of-bounds in filldir64 (fs/readdir.c:371)
> >   Read of size 8 at addr ffff88800fd3da3c by task exploit/146
> >    ...
> >    kasan_report (mm/kasan/report.c:595)
> >    filldir64 (fs/readdir.c:371)
> >    iterate_dir (fs/readdir.c:110)
> >    ...
> >
> > The payload is copied into a buffer of exactly inline_size bytes:
> >
> >       dir_buf = kmalloc(inline_size, GFP_NOFS);
> >
> > but iteration runs in a logical position space extra_offset bytes larger
> > than the buffer (extra_size = extra_offset + inline_size), so the synthetic
> > "." and ".." entries land at the offsets they would have in a block-based
> > directory. A real dirent is formed at "dir_buf + pos - extra_offset", yet
> > the loop bounds and the ext4_check_dir_entry() length argument are all
> > expressed in the larger extra_size. Two reachable sites dereference a
> > dirent before confirming its physical offset is inside the allocation:
> >
> > In the main loop, ctx->pos is attacker-controlled via lseek() and the entry
> > is validated with extra_size, so ext4_check_dir_entry() accepts a dirent
> > running up to extra_offset bytes past the allocation before its length
> > check fires. ctx->pos is also a signed loff_t: an lseek() to a small value
> > below extra_offset makes "ctx->pos - extra_offset" negative, so a check
> > that only bounds the top of the buffer is bypassed by underflow and de is
> > formed before dir_buf.
> >
> > In the cookie-rescan loop, entered when i_version changed since the last
> > readdir(2), the walk restarts from the beginning with i bounded by
> > extra_size, so as i approaches extra_size the unconditional read of
> > de->rec_len runs past the allocation before any validation.
> >
> > Both are the same defect, logical extra_size space versus the physical
> > inline_size buffer. In each loop, reject a dirent whose header would not
> > fit within inline_size before forming de, and in the main loop also reject a
> > position that underflows below extra_offset. Validate the main-loop entry
> > against inline_size rather than extra_size. Entries that legitimately fill
> > the inline data still pass.
> >
> > Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
> > Reported-by: Weiming Shi <bestswngs@gmail.com>
> > Assisted-by: Claude:claude-opus-4-8
> > Signed-off-by: Xiang Mei <xmei5@asu.edu>
>
> Thanks for the analysis and the patch. See some suggestions for improvement
> below:
>
> > @@ -1488,10 +1491,20 @@ int ext4_read_inline_dir(struct file *file,
> >                       continue;
> >               }
> >
> > +             /*
> > +              * de lives at dir_buf + ctx->pos - extra_offset, so the dirent
> > +              * header must fit within inline_size.  ctx->pos is a signed,
> > +              * lseek()-controlled loff_t: check the lower bound first, or
> > +              * ctx->pos < extra_offset underflows and points de before dir_buf.
> > +              */
> > +             if (ctx->pos < extra_offset ||
> > +                 ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
> > +                 inline_size)
> > +                     goto out;
>
> So I don't think this is really possible. ctx->pos isn't really fully user
> controlled. When you use seek to modify ctx->pos, ext4_dir_llseek() does
> set info->cookie to invalid value so the next time we enter
> ext4_read_inline_dir() we are guaranteed to revalidate the offset and reset
> it to 0, dotdot_offset, or some value greater than extra_size.

Please ignore v3. It doesn't actually fix the original bug, and the PoC
still triggers KASAN with v3 applied.

You're right that lseek() poisons info->cookie and the next
ext4_read_inline_dir() always takes the !inode_eq_iversion() rescan path.
But the rescan doesn't reset ctx->pos to a safe value. It walks forward:

i += ext4_rec_len_from_disk(de->rec_len, extra_size);

i.e. it advances i by the rec_len field of each inline dirent, which is
attacker-controlled, and the only per-entry check is the "non-zero rec_len"
probe (which itself reads de->rec_len). It then commits "offset = i;
ctx->pos = offset;". So the rescan can legitimately land ctx->pos at a
position whose dirent header already lies past inline_size - it is not
clamped to a valid in-bounds entry.

The main loop then forms de = dir_buf + ctx->pos - extra_offset and calls
ext4_check_dir_entry(), which dereferences de->rec_len before the length
check fires. With v3 (which only fixed the rescan probe and the size
argument, but had no bounds check on the committed ctx->pos), this still
reads past the buffer:

BUG: KASAN: slab-out-of-bounds in __ext4_check_dir_entry
Read of size 2 at addr ... (0 bytes to the right of a 60-byte
kmalloc-64 region)
__ext4_check_dir_entry
ext4_read_inline_dir
iterate_dir

You were right about one thing: the lower-bound (ctx->pos < extra_offset,
"de before dir_buf") case I once guarded against can't happen, since
ctx->pos never comes in below the dotdot offsets. v4 drops that and keeps
only the upper-bound check in the main loop, which I've confirmed stops the
reproducer. Sorry for sending a patch I hadn't fully verified.

v4 to follow.

Xiang

>
> >               de = (struct ext4_dir_entry_2 *)
> >                       (dir_buf + ctx->pos - extra_offset);
> >               if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
> > -                                      extra_size, ctx->pos))
> > +                                      inline_size, ctx->pos))
> >                       goto out;
> >               if (le32_to_cpu(de->inode)) {
> >                       if (!dir_emit(ctx, de->name, de->name_len,
>
> Otherwise the patch looks good.
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH net v2] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-13 21:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Baokun Li,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi
In-Reply-To: <p2kykhe5532mxqgnmgyrg24cmlpcz24qn2erw3lkzabjjubbyq@i2c3mfdm7mpc>

On Wed, Jun 10, 2026 at 3:01 AM Jan Kara <jack@suse.cz> wrote:
>
> What does the 'net' in [PATCH net v2] mean?
>
Sorry, it was a mistake. It should be ext4.
> On Mon 08-06-26 18:07:39, Xiang Mei wrote:
> > ext4_read_inline_dir() reads de->rec_len / de->name past the end of its
> > inline buffer for a crafted or corrupted inline directory, triggering a
> > slab-out-of-bounds read during getdents64():
> >
> >   BUG: KASAN: slab-out-of-bounds in filldir64 (fs/readdir.c:371)
> >   Read of size 8 at addr ffff88800fd3da3c by task exploit/146
> >    ...
> >    kasan_report (mm/kasan/report.c:595)
> >    filldir64 (fs/readdir.c:371)
> >    iterate_dir (fs/readdir.c:110)
> >    ...
> >
> > The payload is copied into a buffer of exactly inline_size bytes:
> >
> >       dir_buf = kmalloc(inline_size, GFP_NOFS);
> >
> > but iteration runs in a logical position space extra_offset bytes larger
> > than the buffer (extra_size = extra_offset + inline_size), so the synthetic
> > "." and ".." entries land at the offsets they would have in a block-based
> > directory. A real dirent is formed at "dir_buf + pos - extra_offset", yet
> > the loop bounds and the ext4_check_dir_entry() length argument are all
> > expressed in the larger extra_size. Two reachable sites dereference a
> > dirent before confirming its physical offset is inside the allocation:
> >
> > In the main loop, ctx->pos is attacker-controlled via lseek() and the entry
> > is validated with extra_size, so ext4_check_dir_entry() accepts a dirent
> > running up to extra_offset bytes past the allocation before its length
> > check fires. ctx->pos is also a signed loff_t: an lseek() to a small value
> > below extra_offset makes "ctx->pos - extra_offset" negative, so a check
> > that only bounds the top of the buffer is bypassed by underflow and de is
> > formed before dir_buf.
> >
> > In the cookie-rescan loop, entered when i_version changed since the last
> > readdir(2), the walk restarts from the beginning with i bounded by
> > extra_size, so as i approaches extra_size the unconditional read of
> > de->rec_len runs past the allocation before any validation.
> >
> > Both are the same defect, logical extra_size space versus the physical
> > inline_size buffer. In each loop, reject a dirent whose header would not
> > fit within inline_size before forming de, and in the main loop also reject a
> > position that underflows below extra_offset. Validate the main-loop entry
> > against inline_size rather than extra_size. Entries that legitimately fill
> > the inline data still pass.
> >
> > Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
> > Reported-by: Weiming Shi <bestswngs@gmail.com>
> > Assisted-by: Claude:claude-opus-4-8
> > Signed-off-by: Xiang Mei <xmei5@asu.edu>
>
> Thanks for the analysis and the patch. See some suggestions for improvement
> below:
>
> > @@ -1488,10 +1491,20 @@ int ext4_read_inline_dir(struct file *file,
> >                       continue;
> >               }
> >
> > +             /*
> > +              * de lives at dir_buf + ctx->pos - extra_offset, so the dirent
> > +              * header must fit within inline_size.  ctx->pos is a signed,
> > +              * lseek()-controlled loff_t: check the lower bound first, or
> > +              * ctx->pos < extra_offset underflows and points de before dir_buf.
> > +              */
> > +             if (ctx->pos < extra_offset ||
> > +                 ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
> > +                 inline_size)
> > +                     goto out;
>
> So I don't think this is really possible. ctx->pos isn't really fully user
> controlled. When you use seek to modify ctx->pos, ext4_dir_llseek() does
> set info->cookie to invalid value so the next time we enter
> ext4_read_inline_dir() we are guaranteed to revalidate the offset and reset
> it to 0, dotdot_offset, or some value greater than extra_size.

You're right, thanks. The rescan path rebuilds ctx->pos, so the
underflow can't happen. I'll drop this in v3.

Xiang
>
> >               de = (struct ext4_dir_entry_2 *)
> >                       (dir_buf + ctx->pos - extra_offset);
> >               if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
> > -                                      extra_size, ctx->pos))
> > +                                      inline_size, ctx->pos))
> >                       goto out;
> >               if (le32_to_cpu(de->inode)) {
> >                       if (!dir_emit(ctx, de->name, de->name_len,
>
> Otherwise the patch looks good.
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] e2fsprogs/resize2fs: allow online resizing with CONFIG_BLK_DEV_WRITE_MOUNTED disabled
From: Theodore Tso @ 2026-06-13 15:41 UTC (permalink / raw)
  To: Hanno Böck; +Cc: linux-ext4, theodore.tso
In-Reply-To: <20260613110507.67289f68@hboeck.de>

On Sat, Jun 13, 2026 at 11:05:07AM +0200, Hanno Böck wrote:
> Kernels since version 6.8 have an option CONFIG_BLK_DEV_WRITE_MOUNTED
> that, if disabled, prevents accidental writes to and data corruption of
> mounted devices.
> 
> This breaks online resizing with resize2fs, as it opens the device with
> O_RDWR. However, it appears that this is not necessary and by changing
> it to O_RDONLY, online resizing works again.
> 
> See also: https://unix.stackexchange.com/a/796881
> 
> Signed-off-by: Hanno Böck <hanno@hboeck.de>

If you try run e2fssprogs's regression test (using "make check"), with
your patch applied, you'll find:

% make -j24 ; make -j24 check

...
t_mmp_2off: disable MMP using tune2fs: ok
377 tests succeeded	15 tests failed
Tests failed: m_minrootdir r_1024_small_bg r_32to64bit_expand_full r_bigalloc_big_expand r_ext4_small_bg r_fixup_lastbg_big r_fixup_lastbg r_inline_xattr r_min_itable r_move_inode_int_extent r_move_itable r_move_itable_nostride r_move_itable_realloc r_orphan_file r_resize_inode
make[1]: *** [Makefile:403: test_post] Error 1
make[1]: Leaving directory '/build/e2fsprogs/tests'
make: *** [Makefile:431: check-recursive] Error 1

Specifically, your change breaks off-line resizes, where the file
system is not mounted.  E2fsprogs builds on FreeBSD, GNU/Hurd, MacOS,
and other operating systems which don't support online resizing, which
is a Linux-only feature.

Furthermore, even on Linux, online resizing only supports growing the
file system.  If you want to shrink the file system, this must be done
using off-line resizing, with the file system unmounted.

And as I pointed out on another e-mail thread, fixing resize2fs is not
sufficient to disable CONFIG_BLK_DEV_WRITE_MOUNTED and retain full
functionality.  Specifically, certain file system parameters can be
modified using tune2fs while the file system is mounted.  Disabling
CONFIG_BLK_DEV_WRITE_MOUNTED will break this.

The CONFIG_BLK_DEV_WRITE_MOUNTED option was added specifically to
suppress maintainers from downing in Syzbot-generated noise, which
will gleefully report "security failures" caused by being able to
modify the block device while the file system is mounted.  The syzbot
scanner runs as root, which means it's great for inflating security
reports, at the cost of overloading maintainers who often deal with
syzbot reports late at night or on weekends, or have decided to just
ignore them since the signal is drowned out by the noise.  It was
**not** intended for users to disable that feature on production
systems.

At some point, in the future, the following thing needs to happen to
before blocking writes to the block device can be enabled for real:

*) E2fprogs's tune2fs program needs to be taught to use
 EXT4_IOC_GET_TUNE_SB_PARAM and EXT4_IOC_SET_TUNE_SB_PARAM ioctls so
 that the changes that are allowed to be made while the file system is
 mounted are done via this ioctls.

*) The kernel is enhanced so that write access to block device with a
 mounted file system can be blocked on a per "struct super", so the
 file system can block access either on per file system basis, or on a
 per mount basis based on a mount option.

*) We can only enable blocking block device writes to mounted file
 systems when we are use that users have both a kernel and e2fsprogs
 with support of EXT4_IOC_[SG]ET_TUNE_SB_PARAM.

Also note that at the moment, disabling block device writes of file
systems is at this point security theatre which is only rivaled by the
metal detectors in the airport operating by the United States
Transport Security Agency.  That's because
!CONFIG_BLK_DEV_WRITE_MOUNTED can be trivially bypassed by using the
loop device.

Cheers,

					- Ted

> --- a/resize/main.c	2026-03-06 18:17:36.000000000 +0100
> +++ b/resize/main.c	2026-06-13 10:53:46.165030199 +0200
> @@ -256,7 +256,7 @@ int main (int argc, char ** argv)
>  	int		force_min_size = 0;
>  	int		print_min_size = 0;
>  	int		fd, ret;
> -	int		open_flags = O_RDWR;
> +	int		open_flags = O_RDONLY;
>  	blk64_t		new_size = 0;
>  	blk64_t		max_size = 0;
>  	blk64_t		min_size = 0;
> @@ -364,9 +364,6 @@ int main (int argc, char ** argv)
>  		len = 2 * len;
>  	}
>  
> -	if (print_min_size)
> -		open_flags = O_RDONLY;
> -
>  	fd = ext2fs_open_file(device_name, open_flags, 0);
>  	if (fd < 0) {
>  		com_err("open", errno, _("while opening %s"),
> 

^ permalink raw reply

* [PATCH v2] ext4: validate EA inode i_nlink in ext4_xattr_inode_iget
From: Yun Zhou @ 2026-06-13 14:14 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel

Validate EA inode i_nlink early in ext4_xattr_inode_iget() to convert
a WARN_ONCE in ext4_xattr_inode_update_ref() into a graceful error
return.

When a corrupted ext4 image has an EA inode with inconsistent i_nlink
and ref_count values, the code currently allows it through and later
hits WARN_ONCE when ref_count transitions cross the 0/1 boundary.
While this is not a security or stability issue -- it only fires on
crafted filesystem images and merely prints a call trace -- it is
better handled as an early sanity check that returns -EFSCORRUPTED,
consistent with how ext4 treats other on-disk corruption.

The valid states for an EA inode are:
  - i_nlink=0, ref_count=0: orphaned, pending deletion
  - i_nlink=1, ref_count>0: active, referenced

Reject any EA inode that does not match one of these states at iget
time.

Reported-by: syzbot+76916a45d2294b551fd9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=76916a45d2294b551fd9
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/xattr.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 982a1f831e22..4deb17b7bcbe 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -448,6 +448,23 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
 	}
 	ext4_xattr_inode_set_class(inode);
 
+	/*
+	 * EA inode has two valid states:
+	 *   i_nlink=0, ref_count=0: orphaned, pending deletion
+	 *   i_nlink=1, ref_count>0: active, referenced by one or more inodes
+	 * Anything else indicates on-disk corruption.
+	 */
+	if (inode->i_nlink > 1 ||
+	    (inode->i_nlink && !ext4_xattr_inode_get_ref(inode)) ||
+	    (!inode->i_nlink && ext4_xattr_inode_get_ref(inode))) {
+		ext4_error(parent->i_sb,
+			   "EA inode %lu has unexpected i_nlink=%u ref_count=%llu",
+			   ea_ino, inode->i_nlink,
+			   ext4_xattr_inode_get_ref(inode));
+		iput(inode);
+		return -EFSCORRUPTED;
+	}
+
 	/*
 	 * Check whether this is an old Lustre-style xattr inode. Lustre
 	 * implementation does not have hash validation, rather it has a
-- 
2.43.0


^ permalink raw reply related

* [PATCH] ext4: validate EA inode i_nlink in ext4_xattr_inode_iget
From: Yun Zhou @ 2026-06-13 13:29 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel

Validate EA inode i_nlink early in ext4_xattr_inode_iget() to convert
a WARN_ONCE in ext4_xattr_inode_update_ref() into a graceful error
return.

When a corrupted ext4 image has an EA inode with i_nlink set to an
invalid value (e.g. 65535), the code currently allows it through and
later hits a WARN_ONCE when ref_count reaches 0.  While this is not a
security or stability issue -- it only fires on crafted filesystem
images and merely prints a call trace -- it is better handled as an
early sanity check that returns -EFSCORRUPTED, consistent with how ext4
treats other on-disk corruption.

An EA inode should only ever have i_nlink of 0 (orphaned, pending
deletion) or 1 (active).  Reject anything above 1 at iget time.

Reported-by: syzbot+76916a45d2294b551fd9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=76916a45d2294b551fd9
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/xattr.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 982a1f831e22..2fc41a06a446 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -448,6 +448,14 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
 	}
 	ext4_xattr_inode_set_class(inode);

+	if (inode->i_nlink > 1) {
+		ext4_error(parent->i_sb,
+			   "EA inode %lu has unexpected i_nlink=%u",
+			   ea_ino, inode->i_nlink);
+		iput(inode);
+		return -EFSCORRUPTED;
+	}
+
 	/*
 	 * Check whether this is an old Lustre-style xattr inode. Lustre
 	 * implementation does not have hash validation, rather it has a
-- 
2.43.0

^ permalink raw reply related

* [PATCH] e2fsprogs/resize2fs: allow online resizing with CONFIG_BLK_DEV_WRITE_MOUNTED disabled
From: Hanno Böck @ 2026-06-13  9:05 UTC (permalink / raw)
  To: linux-ext4; +Cc: theodore.tso

Kernels since version 6.8 have an option CONFIG_BLK_DEV_WRITE_MOUNTED
that, if disabled, prevents accidental writes to and data corruption of
mounted devices.

This breaks online resizing with resize2fs, as it opens the device with
O_RDWR. However, it appears that this is not necessary and by changing
it to O_RDONLY, online resizing works again.

See also: https://unix.stackexchange.com/a/796881

Signed-off-by: Hanno Böck <hanno@hboeck.de>
--- a/resize/main.c	2026-03-06 18:17:36.000000000 +0100
+++ b/resize/main.c	2026-06-13 10:53:46.165030199 +0200
@@ -256,7 +256,7 @@ int main (int argc, char ** argv)
 	int		force_min_size = 0;
 	int		print_min_size = 0;
 	int		fd, ret;
-	int		open_flags = O_RDWR;
+	int		open_flags = O_RDONLY;
 	blk64_t		new_size = 0;
 	blk64_t		max_size = 0;
 	blk64_t		min_size = 0;
@@ -364,9 +364,6 @@ int main (int argc, char ** argv)
 		len = 2 * len;
 	}
 
-	if (print_min_size)
-		open_flags = O_RDONLY;
-
 	fd = ext2fs_open_file(device_name, open_flags, 0);
 	if (fd < 0) {
 		com_err("open", errno, _("while opening %s"),

^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Carlos Maiolino @ 2026-06-12 13:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, brauner, linux-block, linux-fsdevel, linux-ext4,
	linux-xfs, Hannes Reinecke, Martin K. Petersen, Jens Axboe
In-Reply-To: <20260612052831.GA9010@lst.de>

On Fri, Jun 12, 2026 at 07:28:31AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 11, 2026 at 05:47:07PM +0200, Carlos Maiolino wrote:
> > On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> > > On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > > > It's entirely possible a device supports byte aligned addresses. The
> > > > block layer just doesn't let a driver report that. So either it really
> > > > was successful because you found a bug that skips the alignment checks,
> > > > or your device silently corrupted your payload.
> > 
> > I tried this on different hardware, I find it hard to say all those
> > devices were corrupting the payload.
> 
> I think in the other thread we agreed that we are currently missing
> the alignment check for fast-path bios not hitting the splitting code,
> so maybe that is something you see.  Additionally we're missing the
> checks for purely bio based drivers not calling the splitting helper
> at all, but I don't think that applies here.
> 
> > > > Anyway, my earlier suggestion should work. Ming thinks it may go to far,
> > > > though, in not taking the optimization when it was possible. So here's
> > > > an alternative suggestion that should get things working as expected:
> > > 
> > > The fix below looks like it is addressing a real bug.  I'm not sure if
> > > Carlos is hitting it, but we were missing the alignment checks for
> > > single-bvec fast path bios so far indeed.
> > 
> > You left context out so I'm assuming by the fix you meant Keith's patch.
> 
> Yes.

The fix indeed seems to fix the behavior I'm seeing. Keith could you Cc
me if you end up sending an official version?

^ permalink raw reply

* [PATCH v3] ext4: defer iput() in ext4_xattr_block_set() to avoid deadlock with writepages
From: Yun Zhou @ 2026-06-12 13:19 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel
In-Reply-To: <20260612095846.1024470-1-yun.zhou@windriver.com>

ext4_xattr_block_set() calls iput() on ea_inode while its callers hold
xattr_sem.  If this iput() drops the last reference, it can trigger
write_inode_now() -> ext4_writepages() -> s_writepages_rwsem, which
violates the lock ordering since ext4_writepages() already establishes
s_writepages_rwsem -> jbd2_handle ordering:

  CPU0 (writeback worker)            CPU1 (file create)
  ----                               ----
  ext4_writepages()
    s_writepages_rwsem (read)        ext4_create()
    ext4_do_writepages()               __ext4_new_inode()
      ext4_journal_start()               [holds jbd2 handle]
        wait_transaction_locked()        ext4_xattr_set_handle()
        [WAIT for jbd2_handle]             xattr_sem (write)

  CPU2 (xattr set or isize expand)
  ----
  ext4_xattr_set_handle() or ext4_try_to_expand_extra_isize()
    xattr_sem (write)
    ext4_xattr_block_set()
      iput(ea_inode)
        write_inode_now()
          ext4_writepages()
            s_writepages_rwsem (read) [DEADLOCK]

This forms a circular dependency on lock classes:

  s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem

Fix by deferring iput() calls inside ext4_xattr_block_set() via the
existing ext4_xattr_inode_array mechanism.  The array is threaded
through the call chain and freed by callers after releasing xattr_sem.

Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v3: Address AI review feedback on v2:
  - Check ext4_expand_inode_array() return value; fallback to
    direct iput() on ENOMEM to prevent inode leak.
  - Make ext4_xattr_set_handle() take an optional ea_inode_array
    output parameter so callers can free after ext4_journal_stop(),
	avoiding the jbd2_handle vs s_writepages_rwsem AB-BA.
  - Pass ea_inode_array directly to ext4_xattr_release_block()
    instead of using a local array freed under xattr_sem.
  - Move ext4_xattr_inode_array_free() after ext4_journal_stop()

v2: Defer iput() in ext4_xattr_block_set() via ea_inode_array,
	freed after xattr_sem is released. Fixes the root cause.

v1: Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() to skip expand
	on inodes being deleted. Only fixes the syzbot reproducer, not
	the underlying lock ordering violation.

 fs/ext4/acl.c            |  2 +-
 fs/ext4/crypto.c         |  4 ++--
 fs/ext4/inode.c          | 13 ++++++----
 fs/ext4/xattr.c          | 51 ++++++++++++++++++++++++++--------------
 fs/ext4/xattr.h          |  6 +++--
 fs/ext4/xattr_security.c |  3 ++-
 6 files changed, 51 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index 3bffe862f954..21de8276b558 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -215,7 +215,7 @@ __ext4_set_acl(handle_t *handle, struct inode *inode, int type,
 	}
 
 	error = ext4_xattr_set_handle(handle, inode, name_index, "",
-				      value, size, xattr_flags);
+				      value, size, xattr_flags, NULL);
 
 	kfree(value);
 	if (!error)
diff --git a/fs/ext4/crypto.c b/fs/ext4/crypto.c
index f41f320f4437..bca760751c1d 100644
--- a/fs/ext4/crypto.c
+++ b/fs/ext4/crypto.c
@@ -173,7 +173,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
 		res = ext4_xattr_set_handle(handle, inode,
 					    EXT4_XATTR_INDEX_ENCRYPTION,
 					    EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
-					    ctx, len, XATTR_CREATE);
+					    ctx, len, XATTR_CREATE, NULL);
 		if (!res) {
 			ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
 			ext4_clear_inode_state(inode,
@@ -202,7 +202,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
 
 	res = ext4_xattr_set_handle(handle, inode, EXT4_XATTR_INDEX_ENCRYPTION,
 				    EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
-				    ctx, len, 0);
+				    ctx, len, 0, NULL);
 	if (!res) {
 		ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
 		/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cd7588a3fa45..2cf68d27e896 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6408,7 +6408,8 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
 static int __ext4_expand_extra_isize(struct inode *inode,
 				     unsigned int new_extra_isize,
 				     struct ext4_iloc *iloc,
-				     handle_t *handle, int *no_expand)
+				     handle_t *handle, int *no_expand,
+				     struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_inode *raw_inode;
 	struct ext4_xattr_ibody_header *header;
@@ -6453,7 +6454,7 @@ static int __ext4_expand_extra_isize(struct inode *inode,
 
 	/* try to expand with EAs present */
 	error = ext4_expand_extra_isize_ea(inode, new_extra_isize,
-					   raw_inode, handle);
+					   raw_inode, handle, ea_inode_array);
 	if (error) {
 		/*
 		 * Inode size expansion failed; don't try again
@@ -6475,6 +6476,7 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 {
 	int no_expand;
 	int error;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
 		return -EOVERFLOW;
@@ -6496,8 +6498,9 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 		return -EBUSY;
 
 	error = __ext4_expand_extra_isize(inode, new_extra_isize, &iloc,
-					  handle, &no_expand);
+					  handle, &no_expand, &ea_inode_array);
 	ext4_write_unlock_xattr(inode, &no_expand);
+	ext4_xattr_inode_array_free(ea_inode_array);
 
 	return error;
 }
@@ -6509,6 +6512,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 	handle_t *handle;
 	int no_expand;
 	int error, rc;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND)) {
 		brelse(iloc->bh);
@@ -6534,7 +6538,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 	}
 
 	error = __ext4_expand_extra_isize(inode, new_extra_isize, iloc,
-					  handle, &no_expand);
+					  handle, &no_expand, &ea_inode_array);
 
 	rc = ext4_mark_iloc_dirty(handle, inode, iloc);
 	if (!error)
@@ -6543,6 +6547,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 out_unlock:
 	ext4_write_unlock_xattr(inode, &no_expand);
 	ext4_journal_stop(handle);
+	ext4_xattr_inode_array_free(ea_inode_array);
 	return error;
 }
 
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index e91af66db7a7..fa9a16c86fd8 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1906,7 +1906,8 @@ ext4_xattr_block_find(struct inode *inode, struct ext4_xattr_info *i,
 static int
 ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 		     struct ext4_xattr_info *i,
-		     struct ext4_xattr_block_find *bs)
+		     struct ext4_xattr_block_find *bs,
+		     struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct super_block *sb = inode->i_sb;
 	struct buffer_head *new_bh = NULL;
@@ -2158,7 +2159,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 					ext4_warning_inode(ea_inode,
 							   "dec ref error=%d",
 							   error);
-				iput(ea_inode);
+				if (ext4_expand_inode_array(ea_inode_array, ea_inode))
+					iput(ea_inode);
 				ea_inode = NULL;
 			}
 
@@ -2190,12 +2192,9 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 
 	/* Drop the previous xattr block. */
 	if (bs->bh && bs->bh != new_bh) {
-		struct ext4_xattr_inode_array *ea_inode_array = NULL;
-
 		ext4_xattr_release_block(handle, inode, bs->bh,
-					 &ea_inode_array,
+					 ea_inode_array,
 					 0 /* extra_credits */);
-		ext4_xattr_inode_array_free(ea_inode_array);
 	}
 	error = 0;
 
@@ -2211,7 +2210,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 			ext4_xattr_inode_free_quota(inode, ea_inode,
 						    i_size_read(ea_inode));
 		}
-		iput(ea_inode);
+		if (ext4_expand_inode_array(ea_inode_array, ea_inode))
+			iput(ea_inode);
 	}
 	if (ce)
 		mb_cache_entry_put(ea_block_cache, ce);
@@ -2356,7 +2356,7 @@ static struct buffer_head *ext4_xattr_get_block(struct inode *inode)
 int
 ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		      const char *name, const void *value, size_t value_len,
-		      int flags)
+		      int flags, struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_info i = {
 		.name_index = name_index,
@@ -2371,6 +2371,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 	struct ext4_xattr_block_find bs = {
 		.s = { .not_found = -ENODATA, },
 	};
+	struct ext4_xattr_inode_array *local_array = NULL;
 	int no_expand;
 	int error;
 
@@ -2379,6 +2380,9 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 	if (strlen(name) > 255)
 		return -ERANGE;
 
+	if (!ea_inode_array)
+		ea_inode_array = &local_array;
+
 	ext4_write_lock_xattr(inode, &no_expand);
 
 	/* Check journal credits under write lock. */
@@ -2438,7 +2442,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		if (!is.s.not_found)
 			error = ext4_xattr_ibody_set(handle, inode, &i, &is);
 		else if (!bs.s.not_found)
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 	} else {
 		error = 0;
 		/* Xattr value did not change? Save us some work and bail out */
@@ -2455,7 +2460,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		error = ext4_xattr_ibody_set(handle, inode, &i, &is);
 		if (!error && !bs.s.not_found) {
 			i.value = NULL;
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 		} else if (error == -ENOSPC) {
 			if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
 				brelse(bs.bh);
@@ -2464,7 +2470,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 				if (error)
 					goto cleanup;
 			}
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 			if (!error && !is.s.not_found) {
 				i.value = NULL;
 				error = ext4_xattr_ibody_set(handle, inode, &i,
@@ -2503,6 +2510,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 	brelse(is.iloc.bh);
 	brelse(bs.bh);
 	ext4_write_unlock_xattr(inode, &no_expand);
+	ext4_xattr_inode_array_free(local_array);
 	return error;
 }
 
@@ -2547,6 +2555,7 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
 {
 	handle_t *handle;
 	struct super_block *sb = inode->i_sb;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 	int error, retries = 0;
 	int credits;
 
@@ -2567,10 +2576,13 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
 		int error2;
 
 		error = ext4_xattr_set_handle(handle, inode, name_index, name,
-					      value, value_len, flags);
+					      value, value_len, flags,
+					      &ea_inode_array);
 		ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR,
 					handle);
 		error2 = ext4_journal_stop(handle);
+		ext4_xattr_inode_array_free(ea_inode_array);
+		ea_inode_array = NULL;
 		if (error == -ENOSPC &&
 		    ext4_should_retry_alloc(sb, &retries))
 			goto retry;
@@ -2612,7 +2624,8 @@ static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,
  */
 static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 				    struct ext4_inode *raw_inode,
-				    struct ext4_xattr_entry *entry)
+				    struct ext4_xattr_entry *entry,
+				    struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_find *is = NULL;
 	struct ext4_xattr_block_find *bs = NULL;
@@ -2676,7 +2689,7 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 		goto out;
 
 	/* Move ea entry from the inode into the block */
-	error = ext4_xattr_block_set(handle, inode, &i, bs);
+	error = ext4_xattr_block_set(handle, inode, &i, bs, ea_inode_array);
 	if (error)
 		goto out;
 
@@ -2702,7 +2715,8 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
 				       struct ext4_inode *raw_inode,
 				       int isize_diff, size_t ifree,
-				       size_t bfree, int *total_ino)
+				       size_t bfree, int *total_ino,
+				       struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header = IHDR(inode, raw_inode);
 	struct ext4_xattr_entry *small_entry;
@@ -2752,7 +2766,7 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
 			total_size += EXT4_XATTR_SIZE(
 					      le32_to_cpu(entry->e_value_size));
 		error = ext4_xattr_move_to_block(handle, inode, raw_inode,
-						 entry);
+						 entry, ea_inode_array);
 		if (error)
 			return error;
 
@@ -2769,7 +2783,8 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
  * Returns 0 on success or negative error number on failure.
  */
 int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
-			       struct ext4_inode *raw_inode, handle_t *handle)
+			       struct ext4_inode *raw_inode, handle_t *handle,
+			       struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -2841,7 +2856,7 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 
 	error = ext4_xattr_make_inode_space(handle, inode, raw_inode,
 					    isize_diff, ifree, bfree,
-					    &total_ino);
+					    &total_ino, ea_inode_array);
 	if (error) {
 		if (error == -ENOSPC && !tried_min_extra_isize &&
 		    s_min_extra_isize) {
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 1fedf44d4fb6..9c3f1a96895d 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -179,7 +179,8 @@ extern ssize_t ext4_listxattr(struct dentry *, char *, size_t);
 
 extern int ext4_xattr_get(struct inode *, int, const char *, void *, size_t);
 extern int ext4_xattr_set(struct inode *, int, const char *, const void *, size_t, int);
-extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int);
+extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *,
+		const void *, size_t, int, struct ext4_xattr_inode_array **);
 extern int ext4_xattr_set_credits(struct inode *inode, size_t value_len,
 				  bool is_create, int *credits);
 extern int __ext4_xattr_set_credits(struct super_block *sb, struct inode *inode,
@@ -192,7 +193,8 @@ extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);
 
 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
-			    struct ext4_inode *raw_inode, handle_t *handle);
+			    struct ext4_inode *raw_inode, handle_t *handle,
+			    struct ext4_xattr_inode_array **ea_inode_array);
 extern void ext4_evict_ea_inode(struct inode *inode);
 
 extern const struct xattr_handler * const ext4_xattr_handlers[];
diff --git a/fs/ext4/xattr_security.c b/fs/ext4/xattr_security.c
index 776cf11d24ca..6b7ab6e703ad 100644
--- a/fs/ext4/xattr_security.c
+++ b/fs/ext4/xattr_security.c
@@ -44,7 +44,8 @@ ext4_initxattrs(struct inode *inode, const struct xattr *xattr_array,
 		err = ext4_xattr_set_handle(handle, inode,
 					    EXT4_XATTR_INDEX_SECURITY,
 					    xattr->name, xattr->value,
-					    xattr->value_len, XATTR_CREATE);
+					    xattr->value_len, XATTR_CREATE,
+					    NULL);
 		if (err < 0)
 			break;
 	}
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
From: Jan Kara @ 2026-06-12 12:55 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, ojaswin,
	ritesh.list, peng_wang
In-Reply-To: <20260611163441.2431805-3-libaokun@linux.alibaba.com>

On Fri 12-06-26 00:34:41, Baokun Li wrote:
> For unaligned DIO writes, the previous ext4_overwrite_io() required the
> entire range to fall within a single written extent.  This was overly
> conservative: the DIO layer only performs partial block zeroing for the
> head and tail blocks when they are partially covered by the write.
> Middle blocks that are fully covered are written as whole blocks
> without any zeroing, so they are safe regardless of extent state.
> 
> Therefore exclusive lock is only required when partial block zeroing
> will actually happen:
>  - The head partial block (if any) lands on a hole or unwritten extent.
>  - The tail partial block (if any) lands on a hole or unwritten extent.
> 
> Middle full-cover blocks can be in any state (hole, unwritten, or
> written) - block allocation under shared lock is safe per the previous
> patch's analysis (inode_dio_begin + i_data_sem protection).
> 
> Replace ext4_overwrite_io() with ext4_dio_needs_zeroing(), which
> directly answers the question driving the lock decision.  It uses at
> most two ext4_map_blocks() calls: one for the head partial block (also
> catching the case where it spans through the tail), and one for the
> tail partial block if not already covered.
> 
> This enables shared lock for previously-rejected scenarios such as:
>  - Unaligned write spanning written extent + mid-range hole + written
>    extent at the tail.
>  - Unaligned write where the partial blocks land on written extents but
>    the middle has unwritten extents.
> 
> Performance:
> 
> Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
> Filesystem: ext4 default mkfs
> 
> Unaligned DIO writes (14336 bytes at +512 within each 16K stripe).
> Each stripe is laid out as [written][unwritten][unwritten][written],
> so the head and tail partial blocks land on written extents but the
> middle is unwritten.  Metric: IOPS.
> 
>   JOBS      Before        After    speedup
>   ----    --------    ---------    -------
>      1      15,547       17,381      1.12x
>      2      15,910       34,172      2.15x
>      4      15,014       57,567      3.83x
>      8      15,022       81,947      5.46x
>     16      14,586       99,126      6.80x
>     32      14,047       92,519      6.59x
> 
> Wall time at JOBS=32: 149.3s (Before) -> 22.7s (After), 6.58x faster.
> 
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/file.c | 108 +++++++++++++++++++++++++++++++++----------------
>  1 file changed, 73 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 6f3886465ce3..aa926e641739 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -213,31 +213,60 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
>  	return false;
>  }
>  
> -/* Is IO overwriting allocated or initialized blocks? */
> -static bool ext4_overwrite_io(struct inode *inode,
> -			      loff_t pos, loff_t len, bool *unwritten)
> +/*
> + * Does an unaligned DIO write require partial block zeroing?
> + *
> + * Partial block zeroing is performed only for the head and tail blocks
> + * when they are partially covered by the write and the underlying extent
> + * is a hole or unwritten. Middle blocks (fully covered by the write)
> + * are written as whole blocks without zeroing.
> + *
> + * When zeroing is required, two concurrent unaligned DIO writes to the
> + * same partial block can race and corrupt each other's data, so the
> + * caller must take the exclusive i_rwsem and drain in-flight DIO. When
> + * zeroing is not required, shared lock is safe -- block allocation and
> + * unwritten conversion for middle blocks are protected by i_data_sem
> + * and inode_dio_begin().
> + */
> +static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
>  {
>  	struct ext4_map_blocks map;
>  	unsigned int blkbits = inode->i_blkbits;
> -	int err, blklen;
> +	unsigned long blockmask = inode->i_sb->s_blocksize - 1;
> +	bool head_partial, tail_partial;
> +	ext4_lblk_t head_lblk, tail_lblk;
> +	int err;
>  
>  	if (pos + len > i_size_read(inode))
> -		return false;
> +		return true;
>  
> -	map.m_lblk = pos >> blkbits;
> -	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
> -	blklen = map.m_len;
> +	head_partial = (pos & blockmask) != 0;
> +	tail_partial = ((pos + len) & blockmask) != 0;
> +	head_lblk = pos >> blkbits;
> +	tail_lblk = (pos + len - 1) >> blkbits;
> +
> +	/* Check the head partial block. */
> +	if (head_partial) {
> +		map.m_lblk = head_lblk;
> +		map.m_len = tail_lblk - head_lblk + 1;
> +		err = ext4_map_blocks(NULL, inode, &map, 0);
> +		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
> +			return true;
> +		/* If this mapping already covers the tail block, we're done. */
> +		if (!tail_partial || map.m_lblk + err > tail_lblk)
> +			return false;
> +	}
>  
> -	err = ext4_map_blocks(NULL, inode, &map, 0);
> -	if (err != blklen)
> -		return false;
> -	/*
> -	 * 'err==len' means that all of the blocks have been preallocated,
> -	 * regardless of whether they have been initialized or not. We need to
> -	 * check m_flags to distinguish the unwritten extents.
> -	 */
> -	*unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
> -	return true;
> +	/* Check the tail partial block. */
> +	if (tail_partial) {
> +		map.m_lblk = tail_lblk;
> +		map.m_len = 1;
> +		err = ext4_map_blocks(NULL, inode, &map, 0);
> +		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
> +			return true;
> +	}
> +
> +	return false;
>  }
>  
>  static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
> @@ -446,9 +475,10 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
>   *    i_data_sem serializes concurrent extent tree modifications.
>   *
>   * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
> - *    only safe for pure written-extent overwrites. Unwritten extents or
> - *    holes require exclusive lock because concurrent partial block zeroing
> - *    in the DIO layer could corrupt data.
> + *    safe unless the DIO layer needs to perform partial block zeroing --
> + *    i.e. the head or tail partial block sits on a hole or unwritten
> + *    extent. In that case upgrade to the exclusive lock and drain
> + *    in-flight DIO to avoid races with concurrent partial block zeroing.
>   */
>  static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  				     bool *ilock_shared, bool *extend,
> @@ -459,7 +489,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	loff_t offset;
>  	size_t count;
>  	ssize_t ret;
> -	bool overwrite = true, unaligned_io, unwritten = false;
> +	bool needs_zeroing = false;
>  
>  restart:
>  	ret = ext4_generic_write_checks(iocb, from);
> @@ -469,21 +499,22 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	offset = iocb->ki_pos;
>  	count = ret;
>  
> -	unaligned_io = ext4_unaligned_io(inode, from, offset);
>  	*extend = ext4_extending_io(inode, offset, count);
>  
>  	/*
> -	 * For unaligned writes we need to know the extent state to determine
> -	 * whether shared lock is safe. For aligned writes we skip this check
> -	 * entirely since allocation under shared lock is safe.
> +	 * For unaligned writes, check whether partial block zeroing will be
> +	 * needed. If so, exclusive lock is required to serialize against
> +	 * concurrent DIO that could race with the zeroing.
> +	 *
> +	 * For aligned writes we skip this check entirely since allocation
> +	 * under shared lock is safe.
>  	 */
> -	if (unaligned_io)
> -		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
> +	if (ext4_unaligned_io(inode, from, offset))
> +		needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count);
>  
>  	/* Determine whether we need to upgrade to an exclusive lock. */
>  	if (*ilock_shared &&
> -	    ((!IS_NOSEC(inode) || *extend ||
> -	     (unaligned_io && (!overwrite || unwritten))))) {
> +	    (!IS_NOSEC(inode) || *extend || needs_zeroing)) {
>  		if (iocb->ki_flags & IOCB_NOWAIT) {
>  			ret = -EAGAIN;
>  			goto out;
> @@ -497,16 +528,23 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	/*
>  	 * Now that locking is settled, determine dio flags and exclusivity
>  	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
> -	 * behavior already. The inode lock is already held exclusive if the
> -	 * write is unaligned non-overwrite or extending, so drain all
> -	 * outstanding dio and set the force wait dio flag.
> +	 * behavior already. When holding the exclusive lock for a write that
> +	 * needs partial block zeroing or is extending the file, we must wait
> +	 * for the I/O to complete synchronously:
> +	 *
> +	 *  - needs_zeroing: drain in-flight DIO whose end_io could race with
> +	 *    our partial block zeroing, and force synchronous completion so we
> +	 *    don't leave in-flight zeroing bios for the next writer to drain.
> +	 *
> +	 *  - extend: the caller must update i_disksize after I/O completion,
> +	 *    which requires the data to be on disk first.
>  	 */
> -	if (!*ilock_shared && (unaligned_io || *extend)) {
> +	if (!*ilock_shared && (needs_zeroing || *extend)) {
>  		if (iocb->ki_flags & IOCB_NOWAIT) {
>  			ret = -EAGAIN;
>  			goto out;
>  		}
> -		if (unaligned_io && (!overwrite || unwritten))
> +		if (needs_zeroing)
>  			inode_dio_wait(inode);
>  		*dio_flags = IOMAP_DIO_FORCE_WAIT;
>  	}
> -- 
> 2.43.7
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 1/2] ext4: skip overwrite check for aligned non-extending DIO writes
From: Jan Kara @ 2026-06-12 12:46 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, ojaswin,
	ritesh.list, peng_wang
In-Reply-To: <20260611163441.2431805-2-libaokun@linux.alibaba.com>

On Fri 12-06-26 00:34:40, Baokun Li wrote:
> Currently, ext4_dio_write_checks() calls ext4_overwrite_io() to
> determine if a write is a pure overwrite, and upgrades to exclusive
> i_rwsem if not. However, ext4_overwrite_io() uses a single
> ext4_map_blocks() call which only returns the first contiguous extent of
> the same type. A write spanning multiple pre-allocated extents (e.g.
> written + unwritten, or two physically discontiguous written extents)
> produces a false negative, forcing an unnecessary exclusive lock upgrade.
> 
> After commit 5d87c7fca2c1 ("ext4: avoid starting handle when dio
> writing an unwritten extent") and commit 012924f0eeef ("ext4: remove
> useless ext4_iomap_overwrite_ops"), ext4_iomap_begin()'s fast path
> accepts both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN without starting a
> journal transaction. The iomap iteration naturally handles multi-extent
> ranges: each call returns the mapping for the current segment, and
> unwritten-to-written conversion is deferred to ext4_dio_write_end_io().
> This means the common case of mixed written/unwritten extents never
> reaches ext4_iomap_alloc() at all.
> 
> Even for the less common case where the range contains a hole and
> ext4_iomap_alloc() is needed, exclusive i_rwsem is still unnecessary for
> aligned non-extending writes:
> 
>  - truncate/punch_hole are kept out: they require exclusive i_rwsem
>    (blocked by our shared lock during allocation), and inode_dio_begin()
>    keeps their inode_dio_wait() blocked until in-flight bios complete.
>  - i_data_sem write-lock inside ext4_map_blocks() serializes concurrent
>    extent tree modifications (parallel writers to the same hole).
>  - The journal handle is per-thread and does not require i_rwsem
>    exclusion.
>  - i_disksize and orphan list are not involved in non-extending writes.
> 
> Skip the ext4_overwrite_io() check entirely for aligned writes by
> initializing overwrite to true and only calling ext4_overwrite_io() for
> unaligned writes. Unaligned writes still need the extent state check
> because concurrent partial block zeroing in the DIO layer requires
> exclusive serialization unless the range is a pure written-extent
> overwrite.
> 
> Performance:
> 
> Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
> Filesystem: ext4 default mkfs
> 
> Aligned 8K DIO writes spanning written+unwritten extent boundaries.
> Each thread writes its own 1G region sequentially; the file is rebuilt
> between runs so every block is written exactly once. Metric: IOPS.
> 
>   JOBS      Before        After    speedup
>   ----    --------    ---------    -------
>      1      42,322       43,329      1.02x
>      2      68,516       70,677      1.03x
>      4      62,489       97,072      1.55x
>      8      58,701      110,819      1.89x
>     16      58,569      116,392      1.99x
>     32      60,860      117,244      1.93x
> 
> Wall time at JOBS=32: 69.2s (Before) -> 35.4s (After), 1.96x faster.
> 
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Uh, that's a significant change. I have to say I feel slightly uneasy :). But
I don't see a hole in your justification and the patch looks good. Nice
find and feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/file.c | 52 +++++++++++++++++++++++++++++---------------------
>  1 file changed, 30 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index eb1a323962b1..6f3886465ce3 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -428,16 +428,27 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
>   * condition requires an exclusive inode lock. If yes, then we restart the
>   * whole operation by releasing the shared lock and acquiring exclusive lock.
>   *
> - * - For unaligned_io we never take shared lock as it may cause data corruption
> - *   when two unaligned IO tries to modify the same block e.g. while zeroing.
> + * The decision is layered, evaluated in this order:
>   *
> - * - For extending writes case we don't take the shared lock, since it requires
> - *   updating inode i_disksize and/or orphan handling with exclusive lock.
> + * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
> + *    to the exclusive lock -- the security update itself requires it,
> + *    regardless of whether the write extends the file or is aligned.
>   *
> - * - shared locking will only be true mostly with overwrites, including
> - *   initialized blocks and unwritten blocks.
> + * 2. If the write extends i_size or i_disksize, upgrade to the exclusive
> + *    lock to safely update i_disksize and the orphan list, regardless of
> + *    alignment.
>   *
> - * - Otherwise we will switch to exclusive i_rwsem lock.
> + * 3. Otherwise, for aligned non-extending writes, shared lock is always
> + *    sufficient regardless of extent state (written, unwritten, or hole).
> + *    truncate/punch_hole cannot run while we hold the shared i_rwsem
> + *    (they need it exclusively); after we release it, inode_dio_begin()
> + *    keeps their inode_dio_wait() blocked until in-flight bios complete.
> + *    i_data_sem serializes concurrent extent tree modifications.
> + *
> + * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
> + *    only safe for pure written-extent overwrites. Unwritten extents or
> + *    holes require exclusive lock because concurrent partial block zeroing
> + *    in the DIO layer could corrupt data.
>   */
>  static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  				     bool *ilock_shared, bool *extend,
> @@ -448,7 +459,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	loff_t offset;
>  	size_t count;
>  	ssize_t ret;
> -	bool overwrite, unaligned_io, unwritten;
> +	bool overwrite = true, unaligned_io, unwritten = false;
>  
>  restart:
>  	ret = ext4_generic_write_checks(iocb, from);
> @@ -460,22 +471,19 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  
>  	unaligned_io = ext4_unaligned_io(inode, from, offset);
>  	*extend = ext4_extending_io(inode, offset, count);
> -	overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
>  
>  	/*
> -	 * Determine whether we need to upgrade to an exclusive lock. This is
> -	 * required to change security info in file_modified(), for extending
> -	 * I/O, any form of non-overwrite I/O, and unaligned I/O to unwritten
> -	 * extents (as partial block zeroing may be required).
> -	 *
> -	 * Note that unaligned writes are allowed under shared lock so long as
> -	 * they are pure overwrites. Otherwise, concurrent unaligned writes risk
> -	 * data corruption due to partial block zeroing in the dio layer, and so
> -	 * the I/O must occur exclusively.
> +	 * For unaligned writes we need to know the extent state to determine
> +	 * whether shared lock is safe. For aligned writes we skip this check
> +	 * entirely since allocation under shared lock is safe.
>  	 */
> +	if (unaligned_io)
> +		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
> +
> +	/* Determine whether we need to upgrade to an exclusive lock. */
>  	if (*ilock_shared &&
> -	    ((!IS_NOSEC(inode) || *extend || !overwrite ||
> -	     (unaligned_io && unwritten)))) {
> +	    ((!IS_NOSEC(inode) || *extend ||
> +	     (unaligned_io && (!overwrite || unwritten))))) {
>  		if (iocb->ki_flags & IOCB_NOWAIT) {
>  			ret = -EAGAIN;
>  			goto out;
> @@ -490,8 +498,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	 * Now that locking is settled, determine dio flags and exclusivity
>  	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
>  	 * behavior already. The inode lock is already held exclusive if the
> -	 * write is non-overwrite or extending, so drain all outstanding dio and
> -	 * set the force wait dio flag.
> +	 * write is unaligned non-overwrite or extending, so drain all
> +	 * outstanding dio and set the force wait dio flag.
>  	 */
>  	if (!*ilock_shared && (unaligned_io || *extend)) {
>  		if (iocb->ki_flags & IOCB_NOWAIT) {
> -- 
> 2.43.7
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* [PATCH v2] ext4: defer iput() in ext4_xattr_block_set() to avoid deadlock with writepages
From: Yun Zhou @ 2026-06-12  9:58 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel
In-Reply-To: <20260611124555.1541195-1-yun.zhou@windriver.com>

ext4_xattr_block_set() calls iput() on ea_inode while its callers hold
xattr_sem.  If this iput() drops the last reference, it can trigger
write_inode_now() -> ext4_writepages() -> s_writepages_rwsem, which
violates the lock ordering since ext4_writepages() already establishes
s_writepages_rwsem -> jbd2_handle ordering:

  CPU0 (writeback worker)            CPU1 (file create)
  ----                               ----
  ext4_writepages()
    s_writepages_rwsem (read)        ext4_create()
    ext4_do_writepages()               __ext4_new_inode()
      ext4_journal_start()               [holds jbd2 handle]
        wait_transaction_locked()        ext4_xattr_set_handle()
        [WAIT for jbd2_handle]             xattr_sem (write)

  CPU2 (xattr set or isize expand)
  ----
  ext4_xattr_set_handle() or ext4_try_to_expand_extra_isize()
    xattr_sem (write)
    ext4_xattr_block_set()
      iput(ea_inode)
        write_inode_now()
          ext4_writepages()
            s_writepages_rwsem (read) [DEADLOCK]

This forms a circular dependency on lock classes:

  s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem

Fix by deferring iput() calls inside ext4_xattr_block_set() via the
existing ext4_xattr_inode_array mechanism.  The array is threaded
through the call chain and freed by callers after releasing xattr_sem.

Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v2: Defer iput() in ext4_xattr_block_set() via ea_inode_array,
	freed after xattr_sem is released. Fixes the root cause.

v1: Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() to skip expand
	on inodes being deleted. Only fixes the syzbot reproducer, not
	the underlying lock ordering violation.

 fs/ext4/inode.c | 15 +++++++++++----
 fs/ext4/xattr.c | 40 +++++++++++++++++++++++++---------------
 fs/ext4/xattr.h |  3 ++-
 3 files changed, 38 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cd7588a3fa45..c6448a9eb1e7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6408,7 +6408,8 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
 static int __ext4_expand_extra_isize(struct inode *inode,
 				     unsigned int new_extra_isize,
 				     struct ext4_iloc *iloc,
-				     handle_t *handle, int *no_expand)
+				     handle_t *handle, int *no_expand,
+				     struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_inode *raw_inode;
 	struct ext4_xattr_ibody_header *header;
@@ -6453,7 +6454,7 @@ static int __ext4_expand_extra_isize(struct inode *inode,
 
 	/* try to expand with EAs present */
 	error = ext4_expand_extra_isize_ea(inode, new_extra_isize,
-					   raw_inode, handle);
+					   raw_inode, handle, ea_inode_array);
 	if (error) {
 		/*
 		 * Inode size expansion failed; don't try again
@@ -6475,6 +6476,7 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 {
 	int no_expand;
 	int error;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
 		return -EOVERFLOW;
@@ -6496,8 +6498,10 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 		return -EBUSY;
 
 	error = __ext4_expand_extra_isize(inode, new_extra_isize, &iloc,
-					  handle, &no_expand);
+					  handle, &no_expand,
+					  &ea_inode_array);
 	ext4_write_unlock_xattr(inode, &no_expand);
+	ext4_xattr_inode_array_free(ea_inode_array);
 
 	return error;
 }
@@ -6509,6 +6513,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 	handle_t *handle;
 	int no_expand;
 	int error, rc;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND)) {
 		brelse(iloc->bh);
@@ -6534,7 +6539,8 @@ int ext4_expand_extra_isize(struct inode *inode,
 	}
 
 	error = __ext4_expand_extra_isize(inode, new_extra_isize, iloc,
-					  handle, &no_expand);
+					  handle, &no_expand,
+					  &ea_inode_array);
 
 	rc = ext4_mark_iloc_dirty(handle, inode, iloc);
 	if (!error)
@@ -6542,6 +6548,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 
 out_unlock:
 	ext4_write_unlock_xattr(inode, &no_expand);
+	ext4_xattr_inode_array_free(ea_inode_array);
 	ext4_journal_stop(handle);
 	return error;
 }
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index e91af66db7a7..bf8424927383 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1906,7 +1906,8 @@ ext4_xattr_block_find(struct inode *inode, struct ext4_xattr_info *i,
 static int
 ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 		     struct ext4_xattr_info *i,
-		     struct ext4_xattr_block_find *bs)
+		     struct ext4_xattr_block_find *bs,
+		     struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct super_block *sb = inode->i_sb;
 	struct buffer_head *new_bh = NULL;
@@ -2158,7 +2159,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 					ext4_warning_inode(ea_inode,
 							   "dec ref error=%d",
 							   error);
-				iput(ea_inode);
+				ext4_expand_inode_array(ea_inode_array,
+							ea_inode);
 				ea_inode = NULL;
 			}
 
@@ -2190,12 +2192,12 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 
 	/* Drop the previous xattr block. */
 	if (bs->bh && bs->bh != new_bh) {
-		struct ext4_xattr_inode_array *ea_inode_array = NULL;
+		struct ext4_xattr_inode_array *old_ea_inode_array = NULL;
 
 		ext4_xattr_release_block(handle, inode, bs->bh,
-					 &ea_inode_array,
+					 &old_ea_inode_array,
 					 0 /* extra_credits */);
-		ext4_xattr_inode_array_free(ea_inode_array);
+		ext4_xattr_inode_array_free(old_ea_inode_array);
 	}
 	error = 0;
 
@@ -2211,7 +2213,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 			ext4_xattr_inode_free_quota(inode, ea_inode,
 						    i_size_read(ea_inode));
 		}
-		iput(ea_inode);
+		ext4_expand_inode_array(ea_inode_array, ea_inode);
 	}
 	if (ce)
 		mb_cache_entry_put(ea_block_cache, ce);
@@ -2371,6 +2373,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 	struct ext4_xattr_block_find bs = {
 		.s = { .not_found = -ENODATA, },
 	};
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 	int no_expand;
 	int error;
 
@@ -2438,7 +2441,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		if (!is.s.not_found)
 			error = ext4_xattr_ibody_set(handle, inode, &i, &is);
 		else if (!bs.s.not_found)
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     &ea_inode_array);
 	} else {
 		error = 0;
 		/* Xattr value did not change? Save us some work and bail out */
@@ -2455,7 +2459,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		error = ext4_xattr_ibody_set(handle, inode, &i, &is);
 		if (!error && !bs.s.not_found) {
 			i.value = NULL;
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     &ea_inode_array);
 		} else if (error == -ENOSPC) {
 			if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
 				brelse(bs.bh);
@@ -2464,7 +2469,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 				if (error)
 					goto cleanup;
 			}
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     &ea_inode_array);
 			if (!error && !is.s.not_found) {
 				i.value = NULL;
 				error = ext4_xattr_ibody_set(handle, inode, &i,
@@ -2503,6 +2509,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 	brelse(is.iloc.bh);
 	brelse(bs.bh);
 	ext4_write_unlock_xattr(inode, &no_expand);
+	ext4_xattr_inode_array_free(ea_inode_array);
 	return error;
 }
 
@@ -2612,7 +2619,8 @@ static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,
  */
 static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 				    struct ext4_inode *raw_inode,
-				    struct ext4_xattr_entry *entry)
+				    struct ext4_xattr_entry *entry,
+				    struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_find *is = NULL;
 	struct ext4_xattr_block_find *bs = NULL;
@@ -2676,7 +2684,7 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 		goto out;
 
 	/* Move ea entry from the inode into the block */
-	error = ext4_xattr_block_set(handle, inode, &i, bs);
+	error = ext4_xattr_block_set(handle, inode, &i, bs, ea_inode_array);
 	if (error)
 		goto out;
 
@@ -2702,7 +2710,8 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
 				       struct ext4_inode *raw_inode,
 				       int isize_diff, size_t ifree,
-				       size_t bfree, int *total_ino)
+				       size_t bfree, int *total_ino,
+				       struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header = IHDR(inode, raw_inode);
 	struct ext4_xattr_entry *small_entry;
@@ -2752,7 +2761,7 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
 			total_size += EXT4_XATTR_SIZE(
 					      le32_to_cpu(entry->e_value_size));
 		error = ext4_xattr_move_to_block(handle, inode, raw_inode,
-						 entry);
+						 entry, ea_inode_array);
 		if (error)
 			return error;
 
@@ -2769,7 +2778,8 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
  * Returns 0 on success or negative error number on failure.
  */
 int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
-			       struct ext4_inode *raw_inode, handle_t *handle)
+			       struct ext4_inode *raw_inode, handle_t *handle,
+			       struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -2841,7 +2851,7 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 
 	error = ext4_xattr_make_inode_space(handle, inode, raw_inode,
 					    isize_diff, ifree, bfree,
-					    &total_ino);
+					    &total_ino, ea_inode_array);
 	if (error) {
 		if (error == -ENOSPC && !tried_min_extra_isize &&
 		    s_min_extra_isize) {
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 1fedf44d4fb6..02a172515193 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -192,7 +192,8 @@ extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);
 
 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
-			    struct ext4_inode *raw_inode, handle_t *handle);
+			    struct ext4_inode *raw_inode, handle_t *handle,
+			    struct ext4_xattr_inode_array **ea_inode_array);
 extern void ext4_evict_ea_inode(struct inode *inode);
 
 extern const struct xattr_handler * const ext4_xattr_handlers[];
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Christoph Hellwig @ 2026-06-12  5:28 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, Keith Busch, brauner, linux-block,
	linux-fsdevel, linux-ext4, linux-xfs, Hannes Reinecke,
	Martin K. Petersen, Jens Axboe
In-Reply-To: <airX6BmMQ14Rvjcb@nidhogg.toxiclabs.cc>

On Thu, Jun 11, 2026 at 05:47:07PM +0200, Carlos Maiolino wrote:
> On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> > On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > > It's entirely possible a device supports byte aligned addresses. The
> > > block layer just doesn't let a driver report that. So either it really
> > > was successful because you found a bug that skips the alignment checks,
> > > or your device silently corrupted your payload.
> 
> I tried this on different hardware, I find it hard to say all those
> devices were corrupting the payload.

I think in the other thread we agreed that we are currently missing
the alignment check for fast-path bios not hitting the splitting code,
so maybe that is something you see.  Additionally we're missing the
checks for purely bio based drivers not calling the splitting helper
at all, but I don't think that applies here.

> > > Anyway, my earlier suggestion should work. Ming thinks it may go to far,
> > > though, in not taking the optimization when it was possible. So here's
> > > an alternative suggestion that should get things working as expected:
> > 
> > The fix below looks like it is addressing a real bug.  I'm not sure if
> > Carlos is hitting it, but we were missing the alignment checks for
> > single-bvec fast path bios so far indeed.
> 
> You left context out so I'm assuming by the fix you meant Keith's patch.

Yes.


^ permalink raw reply

* [PATCH v3] ext4: fix circular lock dependency in ext4_ext_migrate
From: Yun Zhou @ 2026-06-12  0:53 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel
In-Reply-To: <20260609084007.3432061-1-yun.zhou@windriver.com>

Move iput(tmp_inode) after ext4_writepages_up_write() to avoid a
circular lock dependency between s_writepages_rwsem and sb_internal
(freeze protection).

The deadlock scenario:

  CPU0 (EXT4_IOC_MIGRATE)        CPU1 (orphan cleanup during mount)
  ----                           ----
  ext4_ext_migrate()
    ext4_writepages_down_write()
      s_writepages_rwsem (write)
                                 ext4_evict_inode()
                                   sb_start_intwrite()   [sb_internal]
                                   ...
                                     ext4_writepages()
                                       s_writepages_rwsem (read) [BLOCKED]
    iput(tmp_inode)
      ext4_evict_inode()
        sb_start_intwrite()         [BLOCKED]

The tmp_inode is a temporary inode with nlink=0 created solely for
building the extent tree.  Its eviction does not require
s_writepages_rwsem protection, so deferring iput() until after
releasing the rwsem is safe.

Reported-by: syzbot+212e8f62790f8e0bc63b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=212e8f62790f8e0bc63b
Fixes: cb85f4d23f79 ("ext4: fix race between writepages and enabling EXT4_EXTENTS_FL")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
v3: fixes Reported-by tag and Closes tag.

v2: remove redundant null pointer check for iput(tmp_inode).

 fs/ext4/migrate.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 477d43d7e294..5d60ef10fe11 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -464,6 +464,7 @@ int ext4_ext_migrate(struct inode *inode)
 	if (IS_ERR(tmp_inode)) {
 		retval = PTR_ERR(tmp_inode);
 		ext4_journal_stop(handle);
+		tmp_inode = NULL;
 		goto out_unlock;
 	}
 	/*
@@ -591,9 +592,9 @@ int ext4_ext_migrate(struct inode *inode)
 	ext4_journal_stop(handle);
 out_tmp_inode:
 	unlock_new_inode(tmp_inode);
-	iput(tmp_inode);
 out_unlock:
 	ext4_writepages_up_write(inode->i_sb, alloc_ctx);
+	iput(tmp_inode);
 	return retval;
 }
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH 1/2] ext4: skip overwrite check for aligned non-extending DIO writes
From: Baokun Li @ 2026-06-11 16:34 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <20260611163441.2431805-1-libaokun@linux.alibaba.com>

Currently, ext4_dio_write_checks() calls ext4_overwrite_io() to
determine if a write is a pure overwrite, and upgrades to exclusive
i_rwsem if not. However, ext4_overwrite_io() uses a single
ext4_map_blocks() call which only returns the first contiguous extent of
the same type. A write spanning multiple pre-allocated extents (e.g.
written + unwritten, or two physically discontiguous written extents)
produces a false negative, forcing an unnecessary exclusive lock upgrade.

After commit 5d87c7fca2c1 ("ext4: avoid starting handle when dio
writing an unwritten extent") and commit 012924f0eeef ("ext4: remove
useless ext4_iomap_overwrite_ops"), ext4_iomap_begin()'s fast path
accepts both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN without starting a
journal transaction. The iomap iteration naturally handles multi-extent
ranges: each call returns the mapping for the current segment, and
unwritten-to-written conversion is deferred to ext4_dio_write_end_io().
This means the common case of mixed written/unwritten extents never
reaches ext4_iomap_alloc() at all.

Even for the less common case where the range contains a hole and
ext4_iomap_alloc() is needed, exclusive i_rwsem is still unnecessary for
aligned non-extending writes:

 - truncate/punch_hole are kept out: they require exclusive i_rwsem
   (blocked by our shared lock during allocation), and inode_dio_begin()
   keeps their inode_dio_wait() blocked until in-flight bios complete.
 - i_data_sem write-lock inside ext4_map_blocks() serializes concurrent
   extent tree modifications (parallel writers to the same hole).
 - The journal handle is per-thread and does not require i_rwsem
   exclusion.
 - i_disksize and orphan list are not involved in non-extending writes.

Skip the ext4_overwrite_io() check entirely for aligned writes by
initializing overwrite to true and only calling ext4_overwrite_io() for
unaligned writes. Unaligned writes still need the extent state check
because concurrent partial block zeroing in the DIO layer requires
exclusive serialization unless the range is a pure written-extent
overwrite.

Performance:

Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs

Aligned 8K DIO writes spanning written+unwritten extent boundaries.
Each thread writes its own 1G region sequentially; the file is rebuilt
between runs so every block is written exactly once. Metric: IOPS.

  JOBS      Before        After    speedup
  ----    --------    ---------    -------
     1      42,322       43,329      1.02x
     2      68,516       70,677      1.03x
     4      62,489       97,072      1.55x
     8      58,701      110,819      1.89x
    16      58,569      116,392      1.99x
    32      60,860      117,244      1.93x

Wall time at JOBS=32: 69.2s (Before) -> 35.4s (After), 1.96x faster.

Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/file.c | 52 +++++++++++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..6f3886465ce3 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -428,16 +428,27 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
  * condition requires an exclusive inode lock. If yes, then we restart the
  * whole operation by releasing the shared lock and acquiring exclusive lock.
  *
- * - For unaligned_io we never take shared lock as it may cause data corruption
- *   when two unaligned IO tries to modify the same block e.g. while zeroing.
+ * The decision is layered, evaluated in this order:
  *
- * - For extending writes case we don't take the shared lock, since it requires
- *   updating inode i_disksize and/or orphan handling with exclusive lock.
+ * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
+ *    to the exclusive lock -- the security update itself requires it,
+ *    regardless of whether the write extends the file or is aligned.
  *
- * - shared locking will only be true mostly with overwrites, including
- *   initialized blocks and unwritten blocks.
+ * 2. If the write extends i_size or i_disksize, upgrade to the exclusive
+ *    lock to safely update i_disksize and the orphan list, regardless of
+ *    alignment.
  *
- * - Otherwise we will switch to exclusive i_rwsem lock.
+ * 3. Otherwise, for aligned non-extending writes, shared lock is always
+ *    sufficient regardless of extent state (written, unwritten, or hole).
+ *    truncate/punch_hole cannot run while we hold the shared i_rwsem
+ *    (they need it exclusively); after we release it, inode_dio_begin()
+ *    keeps their inode_dio_wait() blocked until in-flight bios complete.
+ *    i_data_sem serializes concurrent extent tree modifications.
+ *
+ * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
+ *    only safe for pure written-extent overwrites. Unwritten extents or
+ *    holes require exclusive lock because concurrent partial block zeroing
+ *    in the DIO layer could corrupt data.
  */
 static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 				     bool *ilock_shared, bool *extend,
@@ -448,7 +459,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	loff_t offset;
 	size_t count;
 	ssize_t ret;
-	bool overwrite, unaligned_io, unwritten;
+	bool overwrite = true, unaligned_io, unwritten = false;
 
 restart:
 	ret = ext4_generic_write_checks(iocb, from);
@@ -460,22 +471,19 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 
 	unaligned_io = ext4_unaligned_io(inode, from, offset);
 	*extend = ext4_extending_io(inode, offset, count);
-	overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
 
 	/*
-	 * Determine whether we need to upgrade to an exclusive lock. This is
-	 * required to change security info in file_modified(), for extending
-	 * I/O, any form of non-overwrite I/O, and unaligned I/O to unwritten
-	 * extents (as partial block zeroing may be required).
-	 *
-	 * Note that unaligned writes are allowed under shared lock so long as
-	 * they are pure overwrites. Otherwise, concurrent unaligned writes risk
-	 * data corruption due to partial block zeroing in the dio layer, and so
-	 * the I/O must occur exclusively.
+	 * For unaligned writes we need to know the extent state to determine
+	 * whether shared lock is safe. For aligned writes we skip this check
+	 * entirely since allocation under shared lock is safe.
 	 */
+	if (unaligned_io)
+		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
+
+	/* Determine whether we need to upgrade to an exclusive lock. */
 	if (*ilock_shared &&
-	    ((!IS_NOSEC(inode) || *extend || !overwrite ||
-	     (unaligned_io && unwritten)))) {
+	    ((!IS_NOSEC(inode) || *extend ||
+	     (unaligned_io && (!overwrite || unwritten))))) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
 			ret = -EAGAIN;
 			goto out;
@@ -490,8 +498,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	 * Now that locking is settled, determine dio flags and exclusivity
 	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
 	 * behavior already. The inode lock is already held exclusive if the
-	 * write is non-overwrite or extending, so drain all outstanding dio and
-	 * set the force wait dio flag.
+	 * write is unaligned non-overwrite or extending, so drain all
+	 * outstanding dio and set the force wait dio flag.
 	 */
 	if (!*ilock_shared && (unaligned_io || *extend)) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
-- 
2.43.7


^ permalink raw reply related

* [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
From: Baokun Li @ 2026-06-11 16:34 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <20260611163441.2431805-1-libaokun@linux.alibaba.com>

For unaligned DIO writes, the previous ext4_overwrite_io() required the
entire range to fall within a single written extent.  This was overly
conservative: the DIO layer only performs partial block zeroing for the
head and tail blocks when they are partially covered by the write.
Middle blocks that are fully covered are written as whole blocks
without any zeroing, so they are safe regardless of extent state.

Therefore exclusive lock is only required when partial block zeroing
will actually happen:
 - The head partial block (if any) lands on a hole or unwritten extent.
 - The tail partial block (if any) lands on a hole or unwritten extent.

Middle full-cover blocks can be in any state (hole, unwritten, or
written) - block allocation under shared lock is safe per the previous
patch's analysis (inode_dio_begin + i_data_sem protection).

Replace ext4_overwrite_io() with ext4_dio_needs_zeroing(), which
directly answers the question driving the lock decision.  It uses at
most two ext4_map_blocks() calls: one for the head partial block (also
catching the case where it spans through the tail), and one for the
tail partial block if not already covered.

This enables shared lock for previously-rejected scenarios such as:
 - Unaligned write spanning written extent + mid-range hole + written
   extent at the tail.
 - Unaligned write where the partial blocks land on written extents but
   the middle has unwritten extents.

Performance:

Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs

Unaligned DIO writes (14336 bytes at +512 within each 16K stripe).
Each stripe is laid out as [written][unwritten][unwritten][written],
so the head and tail partial blocks land on written extents but the
middle is unwritten.  Metric: IOPS.

  JOBS      Before        After    speedup
  ----    --------    ---------    -------
     1      15,547       17,381      1.12x
     2      15,910       34,172      2.15x
     4      15,014       57,567      3.83x
     8      15,022       81,947      5.46x
    16      14,586       99,126      6.80x
    32      14,047       92,519      6.59x

Wall time at JOBS=32: 149.3s (Before) -> 22.7s (After), 6.58x faster.

Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/file.c | 108 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 73 insertions(+), 35 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6f3886465ce3..aa926e641739 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -213,31 +213,60 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
 	return false;
 }
 
-/* Is IO overwriting allocated or initialized blocks? */
-static bool ext4_overwrite_io(struct inode *inode,
-			      loff_t pos, loff_t len, bool *unwritten)
+/*
+ * Does an unaligned DIO write require partial block zeroing?
+ *
+ * Partial block zeroing is performed only for the head and tail blocks
+ * when they are partially covered by the write and the underlying extent
+ * is a hole or unwritten. Middle blocks (fully covered by the write)
+ * are written as whole blocks without zeroing.
+ *
+ * When zeroing is required, two concurrent unaligned DIO writes to the
+ * same partial block can race and corrupt each other's data, so the
+ * caller must take the exclusive i_rwsem and drain in-flight DIO. When
+ * zeroing is not required, shared lock is safe -- block allocation and
+ * unwritten conversion for middle blocks are protected by i_data_sem
+ * and inode_dio_begin().
+ */
+static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
 {
 	struct ext4_map_blocks map;
 	unsigned int blkbits = inode->i_blkbits;
-	int err, blklen;
+	unsigned long blockmask = inode->i_sb->s_blocksize - 1;
+	bool head_partial, tail_partial;
+	ext4_lblk_t head_lblk, tail_lblk;
+	int err;
 
 	if (pos + len > i_size_read(inode))
-		return false;
+		return true;
 
-	map.m_lblk = pos >> blkbits;
-	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
-	blklen = map.m_len;
+	head_partial = (pos & blockmask) != 0;
+	tail_partial = ((pos + len) & blockmask) != 0;
+	head_lblk = pos >> blkbits;
+	tail_lblk = (pos + len - 1) >> blkbits;
+
+	/* Check the head partial block. */
+	if (head_partial) {
+		map.m_lblk = head_lblk;
+		map.m_len = tail_lblk - head_lblk + 1;
+		err = ext4_map_blocks(NULL, inode, &map, 0);
+		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
+			return true;
+		/* If this mapping already covers the tail block, we're done. */
+		if (!tail_partial || map.m_lblk + err > tail_lblk)
+			return false;
+	}
 
-	err = ext4_map_blocks(NULL, inode, &map, 0);
-	if (err != blklen)
-		return false;
-	/*
-	 * 'err==len' means that all of the blocks have been preallocated,
-	 * regardless of whether they have been initialized or not. We need to
-	 * check m_flags to distinguish the unwritten extents.
-	 */
-	*unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
-	return true;
+	/* Check the tail partial block. */
+	if (tail_partial) {
+		map.m_lblk = tail_lblk;
+		map.m_len = 1;
+		err = ext4_map_blocks(NULL, inode, &map, 0);
+		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
+			return true;
+	}
+
+	return false;
 }
 
 static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
@@ -446,9 +475,10 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
  *    i_data_sem serializes concurrent extent tree modifications.
  *
  * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
- *    only safe for pure written-extent overwrites. Unwritten extents or
- *    holes require exclusive lock because concurrent partial block zeroing
- *    in the DIO layer could corrupt data.
+ *    safe unless the DIO layer needs to perform partial block zeroing --
+ *    i.e. the head or tail partial block sits on a hole or unwritten
+ *    extent. In that case upgrade to the exclusive lock and drain
+ *    in-flight DIO to avoid races with concurrent partial block zeroing.
  */
 static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 				     bool *ilock_shared, bool *extend,
@@ -459,7 +489,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	loff_t offset;
 	size_t count;
 	ssize_t ret;
-	bool overwrite = true, unaligned_io, unwritten = false;
+	bool needs_zeroing = false;
 
 restart:
 	ret = ext4_generic_write_checks(iocb, from);
@@ -469,21 +499,22 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	offset = iocb->ki_pos;
 	count = ret;
 
-	unaligned_io = ext4_unaligned_io(inode, from, offset);
 	*extend = ext4_extending_io(inode, offset, count);
 
 	/*
-	 * For unaligned writes we need to know the extent state to determine
-	 * whether shared lock is safe. For aligned writes we skip this check
-	 * entirely since allocation under shared lock is safe.
+	 * For unaligned writes, check whether partial block zeroing will be
+	 * needed. If so, exclusive lock is required to serialize against
+	 * concurrent DIO that could race with the zeroing.
+	 *
+	 * For aligned writes we skip this check entirely since allocation
+	 * under shared lock is safe.
 	 */
-	if (unaligned_io)
-		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
+	if (ext4_unaligned_io(inode, from, offset))
+		needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count);
 
 	/* Determine whether we need to upgrade to an exclusive lock. */
 	if (*ilock_shared &&
-	    ((!IS_NOSEC(inode) || *extend ||
-	     (unaligned_io && (!overwrite || unwritten))))) {
+	    (!IS_NOSEC(inode) || *extend || needs_zeroing)) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
 			ret = -EAGAIN;
 			goto out;
@@ -497,16 +528,23 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	/*
 	 * Now that locking is settled, determine dio flags and exclusivity
 	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
-	 * behavior already. The inode lock is already held exclusive if the
-	 * write is unaligned non-overwrite or extending, so drain all
-	 * outstanding dio and set the force wait dio flag.
+	 * behavior already. When holding the exclusive lock for a write that
+	 * needs partial block zeroing or is extending the file, we must wait
+	 * for the I/O to complete synchronously:
+	 *
+	 *  - needs_zeroing: drain in-flight DIO whose end_io could race with
+	 *    our partial block zeroing, and force synchronous completion so we
+	 *    don't leave in-flight zeroing bios for the next writer to drain.
+	 *
+	 *  - extend: the caller must update i_disksize after I/O completion,
+	 *    which requires the data to be on disk first.
 	 */
-	if (!*ilock_shared && (unaligned_io || *extend)) {
+	if (!*ilock_shared && (needs_zeroing || *extend)) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
 			ret = -EAGAIN;
 			goto out;
 		}
-		if (unaligned_io && (!overwrite || unwritten))
+		if (needs_zeroing)
 			inode_dio_wait(inode);
 		*dio_flags = IOMAP_DIO_FORCE_WAIT;
 	}
-- 
2.43.7


^ permalink raw reply related

* [PATCH 0/2] ext4: allow more DIO writes under shared i_rwsem
From: Baokun Li @ 2026-06-11 16:34 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang

Hi all,

This series relaxes the i_rwsem requirements of ext4_dio_write_iter()
so that more direct I/O writes can proceed under the shared lock.

It continues the work started by Peng Wang's RFC [1]; I'm taking
over this effort going forward.

ext4_dio_write_checks() currently calls ext4_overwrite_io() to decide
whether the shared lock is sufficient. Its single ext4_map_blocks()
lookup only sees the first contiguous extent of the same type, which
forces the exclusive lock for two cases that are actually safe under
the shared lock (see individual patches for the full safety
argument):

  1. Aligned writes spanning multiple already-allocated extents (e.g.
     written + unwritten, or two discontiguous written extents).

  2. Unaligned writes whose head/tail partial blocks land on written
     extents but the fully-covered middle blocks include hole or
     unwritten extents.

Patch 1 skips the ext4_overwrite_io() pre-check entirely for aligned
non-extending writes, letting them proceed under the shared lock
regardless of extent state.

Patch 2 replaces ext4_overwrite_io() with ext4_dio_needs_zeroing(),
which directly answers the question driving the lock decision. It
checks only the head and tail partial blocks (at most two
ext4_map_blocks() calls), and ignores the state of middle blocks.

Testing
=======

"kvm-xfstests -c ext4/all -g auto" passes with no new failures.

Performance
===========

Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs

Test 1: aligned 8K DIO writes spanning written+unwritten extent
boundaries. Each thread writes its own 1G region sequentially; the
file is rebuilt between runs so every block is written exactly once.
Metric: IOPS.

  JOBS         base    +patch 1    +patch 1+2    speedup
  ----    ---------    --------    ----------    -------
     1       42,322      43,329        43,087      1.02x
     2       68,516      70,677        66,958      1.03x
     4       62,489      97,072       101,468      1.62x
     8       58,701     110,819       113,679      1.94x
    16       58,569     116,392       115,272      1.97x
    32       60,860     117,244       119,621      1.97x

Wall time at JOBS=32: 69.2s (base) -> 35.4s (patched), 1.96x faster.

Test 2: unaligned DIO writes (14336 bytes at +512 within each 16K
stripe). Each stripe is laid out as [written][unwritten][unwritten]
[written], so the head and tail partial blocks land on written
extents but the middle is unwritten. Metric: IOPS.

  JOBS         base    +patch 1    +patch 1+2    speedup
  ----    ---------    --------    ----------    -------
     1       15,547      15,975        17,381      1.12x
     2       15,910      14,808        34,172      2.15x
     4       15,014      14,828        57,567      3.83x
     8       15,022      14,648        81,947      5.46x
    16       14,586      14,262        99,126      6.80x
    32       14,047      13,809        92,519      6.59x

Wall time at JOBS=32: 149.3s (base) -> 22.7s (patched), 6.58x faster.

In test 2, patch 1 alone has no effect (slight noise) because patch 1
only touches the aligned write path. Patch 2 introduces
ext4_dio_needs_zeroing() which precisely identifies when partial
block zeroing is required, allowing the shared lock for the much
larger set of unaligned writes that don't actually trigger zeroing.

Comments and questions are, as always, welcome.

Thanks,
Baokun

[1]: https://patch.msgid.link/20260607124935.6168-1-peng_wang@linux.alibaba.com

Baokun Li (2):
  ext4: skip overwrite check for aligned non-extending DIO writes
  ext4: base unaligned DIO lock decision on partial block zeroing

 fs/ext4/file.c | 132 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 89 insertions(+), 43 deletions(-)

-- 
2.43.7

^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Carlos Maiolino @ 2026-06-11 15:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, brauner, linux-block, linux-fsdevel, linux-ext4,
	linux-xfs, Hannes Reinecke, Martin K. Petersen, Jens Axboe
In-Reply-To: <20260611133833.GA14645@lst.de>

On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > It's entirely possible a device supports byte aligned addresses. The
> > block layer just doesn't let a driver report that. So either it really
> > was successful because you found a bug that skips the alignment checks,
> > or your device silently corrupted your payload.

I tried this on different hardware, I find it hard to say all those
devices were corrupting the payload.

> > 
> > Anyway, my earlier suggestion should work. Ming thinks it may go to far,
> > though, in not taking the optimization when it was possible. So here's
> > an alternative suggestion that should get things working as expected:
> 
> The fix below looks like it is addressing a real bug.  I'm not sure if
> Carlos is hitting it, but we were missing the alignment checks for
> single-bvec fast path bios so far indeed.

You left context out so I'm assuming by the fix you meant Keith's patch.
I can give it a spin and see if it fixes the behavior I'm talking
about. Give me some time as I have a bunch of stuff to do tonight so
likely I will only manage to try this tomorrow.

^ permalink raw reply

* [tytso-ext4:dev] BUILD SUCCESS c143957520c6c9b5cd72e0de8b52b814f0c576fe
From: kernel test robot @ 2026-06-11 14:52 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
branch HEAD: c143957520c6c9b5cd72e0de8b52b814f0c576fe  ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT

elapsed time: 772m

configs tested: 195
configs skipped: 2

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha                             allnoconfig    gcc-16.1.0
alpha                            allyesconfig    gcc-16.1.0
alpha                               defconfig    gcc-16.1.0
arc                              allmodconfig    clang-23
arc                              allmodconfig    gcc-16.1.0
arc                               allnoconfig    gcc-16.1.0
arc                              allyesconfig    clang-23
arc                              allyesconfig    gcc-16.1.0
arc                                 defconfig    gcc-16.1.0
arc                        nsim_700_defconfig    gcc-16.1.0
arc                            randconfig-001    gcc-14.3.0
arc                   randconfig-001-20260611    gcc-14.3.0
arc                            randconfig-002    gcc-14.3.0
arc                   randconfig-002-20260611    gcc-14.3.0
arm                               allnoconfig    clang-23
arm                               allnoconfig    gcc-16.1.0
arm                              allyesconfig    clang-23
arm                              allyesconfig    gcc-16.1.0
arm                                 defconfig    gcc-16.1.0
arm                          pxa910_defconfig    gcc-16.1.0
arm                            randconfig-001    gcc-14.3.0
arm                   randconfig-001-20260611    gcc-14.3.0
arm                            randconfig-002    gcc-14.3.0
arm                   randconfig-002-20260611    gcc-14.3.0
arm                            randconfig-003    gcc-14.3.0
arm                   randconfig-003-20260611    gcc-14.3.0
arm                            randconfig-004    gcc-14.3.0
arm                   randconfig-004-20260611    gcc-14.3.0
arm64                            allmodconfig    clang-23
arm64                             allnoconfig    gcc-16.1.0
arm64                               defconfig    gcc-16.1.0
csky                             allmodconfig    gcc-16.1.0
csky                              allnoconfig    gcc-16.1.0
csky                                defconfig    gcc-16.1.0
hexagon                          allmodconfig    clang-23
hexagon                          allmodconfig    gcc-16.1.0
hexagon                           allnoconfig    clang-23
hexagon                           allnoconfig    gcc-16.1.0
hexagon                             defconfig    gcc-16.1.0
hexagon               randconfig-001-20260611    clang-16
hexagon               randconfig-002-20260611    clang-16
i386                             allmodconfig    gcc-14
i386                              allnoconfig    gcc-14
i386                              allnoconfig    gcc-16.1.0
i386                             allyesconfig    gcc-14
i386                 buildonly-randconfig-001    clang-22
i386        buildonly-randconfig-001-20260611    clang-22
i386                 buildonly-randconfig-002    clang-22
i386        buildonly-randconfig-002-20260611    clang-22
i386                 buildonly-randconfig-003    clang-22
i386        buildonly-randconfig-003-20260611    clang-22
i386                 buildonly-randconfig-004    clang-22
i386        buildonly-randconfig-004-20260611    clang-22
i386                 buildonly-randconfig-005    clang-22
i386        buildonly-randconfig-005-20260611    clang-22
i386                 buildonly-randconfig-006    clang-22
i386        buildonly-randconfig-006-20260611    clang-22
i386                                defconfig    gcc-16.1.0
i386                  randconfig-001-20260611    gcc-14
i386                  randconfig-002-20260611    gcc-14
i386                  randconfig-003-20260611    gcc-14
i386                  randconfig-004-20260611    gcc-14
i386                  randconfig-005-20260611    gcc-14
i386                  randconfig-006-20260611    gcc-14
i386                  randconfig-007-20260611    gcc-14
i386                  randconfig-011-20260611    gcc-14
i386                  randconfig-012-20260611    gcc-14
i386                  randconfig-013-20260611    gcc-14
i386                  randconfig-014-20260611    gcc-14
i386                  randconfig-015-20260611    gcc-14
i386                  randconfig-016-20260611    gcc-14
i386                  randconfig-017-20260611    gcc-14
loongarch                        allmodconfig    clang-19
loongarch                        allmodconfig    clang-23
loongarch                         allnoconfig    clang-20
loongarch                         allnoconfig    gcc-16.1.0
loongarch                           defconfig    clang-23
loongarch             randconfig-001-20260611    clang-16
loongarch             randconfig-002-20260611    clang-16
m68k                             allmodconfig    gcc-16.1.0
m68k                              allnoconfig    gcc-16.1.0
m68k                             allyesconfig    clang-23
m68k                             allyesconfig    gcc-16.1.0
m68k                                defconfig    clang-23
microblaze                        allnoconfig    gcc-16.1.0
microblaze                       allyesconfig    gcc-16.1.0
microblaze                          defconfig    clang-23
mips                             allmodconfig    gcc-16.1.0
mips                              allnoconfig    gcc-16.1.0
mips                             allyesconfig    gcc-16.1.0
nios2                            allmodconfig    clang-20
nios2                            allmodconfig    gcc-11.5.0
nios2                             allnoconfig    clang-23
nios2                             allnoconfig    gcc-11.5.0
nios2                               defconfig    clang-23
nios2                 randconfig-001-20260611    clang-16
nios2                 randconfig-002-20260611    clang-16
openrisc                         allmodconfig    clang-20
openrisc                         allmodconfig    gcc-16.1.0
openrisc                          allnoconfig    clang-23
openrisc                          allnoconfig    gcc-16.1.0
openrisc                            defconfig    gcc-16.1.0
parisc                           allmodconfig    gcc-16.1.0
parisc                            allnoconfig    clang-23
parisc                            allnoconfig    gcc-16.1.0
parisc                           allyesconfig    clang-23
parisc                           allyesconfig    gcc-16.1.0
parisc                              defconfig    gcc-16.1.0
parisc64                            defconfig    clang-23
powerpc                          allmodconfig    gcc-16.1.0
powerpc                           allnoconfig    clang-23
powerpc                           allnoconfig    gcc-16.1.0
powerpc                     tqm8540_defconfig    gcc-16.1.0
riscv                            allmodconfig    clang-23
riscv                             allnoconfig    clang-23
riscv                             allnoconfig    gcc-16.1.0
riscv                            allyesconfig    clang-23
riscv                               defconfig    gcc-16.1.0
riscv                 randconfig-001-20260611    gcc-12.5.0
riscv                 randconfig-002-20260611    gcc-12.5.0
s390                             allmodconfig    clang-23
s390                              allnoconfig    clang-23
s390                             allyesconfig    gcc-16.1.0
s390                                defconfig    gcc-16.1.0
s390                  randconfig-001-20260611    gcc-12.5.0
s390                  randconfig-002-20260611    gcc-12.5.0
sh                               allmodconfig    gcc-16.1.0
sh                                allnoconfig    clang-23
sh                                allnoconfig    gcc-16.1.0
sh                               allyesconfig    clang-23
sh                               allyesconfig    gcc-16.1.0
sh                                  defconfig    gcc-14
sh                    randconfig-001-20260611    gcc-12.5.0
sh                    randconfig-002-20260611    gcc-12.5.0
sparc                             allnoconfig    clang-23
sparc                             allnoconfig    gcc-16.1.0
sparc                               defconfig    gcc-16.1.0
sparc                 randconfig-001-20260611    gcc-15.2.0
sparc                 randconfig-002-20260611    gcc-15.2.0
sparc64                          allmodconfig    clang-20
sparc64                             defconfig    gcc-14
sparc64               randconfig-001-20260611    gcc-15.2.0
sparc64               randconfig-002-20260611    gcc-15.2.0
um                               allmodconfig    clang-23
um                                allnoconfig    clang-16
um                                allnoconfig    clang-23
um                               allyesconfig    gcc-14
um                               allyesconfig    gcc-16.1.0
um                                  defconfig    gcc-14
um                             i386_defconfig    gcc-14
um                    randconfig-001-20260611    gcc-15.2.0
um                    randconfig-002-20260611    gcc-15.2.0
um                           x86_64_defconfig    gcc-14
x86_64                           allmodconfig    clang-22
x86_64                            allnoconfig    clang-22
x86_64                            allnoconfig    clang-23
x86_64                           allyesconfig    clang-22
x86_64      buildonly-randconfig-001-20260611    gcc-14
x86_64      buildonly-randconfig-002-20260611    gcc-14
x86_64      buildonly-randconfig-003-20260611    gcc-14
x86_64      buildonly-randconfig-004-20260611    gcc-14
x86_64      buildonly-randconfig-005-20260611    gcc-14
x86_64      buildonly-randconfig-006-20260611    gcc-14
x86_64                              defconfig    gcc-14
x86_64                                  kexec    clang-22
x86_64                randconfig-001-20260611    gcc-14
x86_64                randconfig-002-20260611    gcc-14
x86_64                randconfig-003-20260611    gcc-14
x86_64                randconfig-004-20260611    gcc-14
x86_64                randconfig-005-20260611    gcc-14
x86_64                randconfig-006-20260611    gcc-14
x86_64                randconfig-011-20260611    gcc-14
x86_64                randconfig-012-20260611    gcc-14
x86_64                randconfig-013-20260611    gcc-14
x86_64                randconfig-014-20260611    gcc-14
x86_64                randconfig-015-20260611    gcc-14
x86_64                randconfig-016-20260611    gcc-14
x86_64                randconfig-071-20260611    clang-22
x86_64                randconfig-072-20260611    clang-22
x86_64                randconfig-073-20260611    clang-22
x86_64                randconfig-074-20260611    clang-22
x86_64                randconfig-075-20260611    clang-22
x86_64                randconfig-076-20260611    clang-22
x86_64                               rhel-9.4    clang-22
x86_64                           rhel-9.4-bpf    gcc-14
x86_64                          rhel-9.4-func    clang-22
x86_64                    rhel-9.4-kselftests    clang-22
x86_64                         rhel-9.4-kunit    gcc-14
x86_64                           rhel-9.4-ltp    gcc-14
x86_64                          rhel-9.4-rust    clang-22
xtensa                            allnoconfig    clang-23
xtensa                            allnoconfig    gcc-16.1.0
xtensa                           allyesconfig    clang-20
xtensa                randconfig-001-20260611    gcc-15.2.0
xtensa                randconfig-002-20260611    gcc-15.2.0

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH] ext4: skip extra isize expansion on inode eviction to avoid deadlock
From: Jan Kara @ 2026-06-11 14:00 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, linux-ext4, linux-kernel
In-Reply-To: <20260611124555.1541195-1-yun.zhou@windriver.com>

On Thu 11-06-26 20:45:55, Yun Zhou wrote:
> Expanding extra isize on an inode that is being evicted is pointless
> since the inode is about to be deleted.  Skip it by setting
> EXT4_STATE_NO_EXPAND before calling ext4_mark_inode_dirty() in the
> eviction path.
> 
> This also breaks a circular lock dependency reported by lockdep during
> orphan cleanup at mount time:
> 
>   CPU0 (writeback worker)            CPU1 (open)
>   ----                               ----
>   ext4_writepages()
>     s_writepages_rwsem (read)        ext4_create()
>     ext4_do_writepages()               __ext4_new_inode()
>       ext4_journal_start()               [holds jbd2 handle]
>         wait_transaction_locked()        ext4_xattr_set_handle()
>         [WAIT for jbd2_handle]             xattr_sem (write)
> 
>   CPU2 (mount / orphan cleanup)
>   ----
>   ext4_evict_inode()
>     __ext4_mark_inode_dirty()
>       ext4_try_to_expand_extra_isize()
>         xattr_sem (write)
>         ext4_expand_extra_isize_ea()
>           ext4_xattr_block_set()
>             iput(ea_inode)
>               write_inode_now()
>                 ext4_writepages()
>                   s_writepages_rwsem (read)
>                   [WAIT for s_writepages_rwsem -- if blocked by write lock holder]
> 
> This forms a circular dependency on lock classes:
> 
>   s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem
> 
> The iput() inside ext4_xattr_block_set() triggers write_inode_now()
> because SB_ACTIVE is not yet set during mount, so iput_final() cannot
> cache the inode in the LRU and must flush it synchronously.
> 
> Setting EXT4_STATE_NO_EXPAND prevents ext4_try_to_expand_extra_isize()
> from executing, which eliminates the xattr_sem --> s_writepages_rwsem
> edge and breaks the cycle.
> 
> Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
> Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>

Thanks for the patch! So I have no problem with setting EXT4_STATE_NO_EXPAND
in ext4_evict_inode() as you correctly point out expansion is pointless in
that case. But your patch actually doesn't fix the real problem, it only
deals with the particular syzbot reproducer. The real problem is that
ext4_xattr_block_set() which is run inside a transaction can end up
acquiring s_writepages_rwsem which violates the lock ordering rules. So
this is the problem that really needs to be fixed.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Christoph Hellwig @ 2026-06-11 13:38 UTC (permalink / raw)
  To: Keith Busch
  Cc: Carlos Maiolino, Christoph Hellwig, brauner, linux-block,
	linux-fsdevel, linux-ext4, linux-xfs, Hannes Reinecke,
	Martin K. Petersen, Jens Axboe
In-Reply-To: <aiqwy5DfHI79KXuZ@kbusch-mbp>

On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> It's entirely possible a device supports byte aligned addresses. The
> block layer just doesn't let a driver report that. So either it really
> was successful because you found a bug that skips the alignment checks,
> or your device silently corrupted your payload.
> 
> Anyway, my earlier suggestion should work. Ming thinks it may go to far,
> though, in not taking the optimization when it was possible. So here's
> an alternative suggestion that should get things working as expected:

The fix below looks like it is addressing a real bug.  I'm not sure if
Carlos is hitting it, but we were missing the alignment checks for
single-bvec fast path bios so far indeed.


^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Keith Busch @ 2026-06-11 12:57 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, brauner, linux-block, linux-fsdevel,
	linux-ext4, linux-xfs, Hannes Reinecke, Martin K. Petersen,
	Jens Axboe
In-Reply-To: <aiqBvF93P4NjfaDR@nidhogg.toxiclabs.cc>

On Thu, Jun 11, 2026 at 12:05:22PM +0200, Carlos Maiolino wrote:
> The passed in address 0x1003af80001 is one byte misaligned and shouldn't
> (at least in theory) ever be accepted no? Or am I missing something
> else?

It's entirely possible a device supports byte aligned addresses. The
block layer just doesn't let a driver report that. So either it really
was successful because you found a bug that skips the alignment checks,
or your device silently corrupted your payload.

Anyway, my earlier suggestion should work. Ming thinks it may go to far,
though, in not taking the optimization when it was possible. So here's
an alternative suggestion that should get things working as expected:

---
diff --git a/block/blk.h b/block/blk.h
index 1a2d9101bba04..4c31762d6fb5f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -404,6 +404,9 @@ static inline bool bio_may_need_split(struct bio *bio,
        bv = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
        if (bio->bi_iter.bi_size > bv->bv_len - bio->bi_iter.bi_bvec_done)
                return true;
+
+       if ((bv->bv_offset | bv->bv_len) & lim->dma_alignment)
+               return true;
        return bv->bv_len + bv->bv_offset > lim->max_fast_segment_size;
 }
 
-- 

^ permalink raw reply related

* [PATCH] ext4: skip extra isize expansion on inode eviction to avoid deadlock
From: Yun Zhou @ 2026-06-11 12:45 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel

Expanding extra isize on an inode that is being evicted is pointless
since the inode is about to be deleted.  Skip it by setting
EXT4_STATE_NO_EXPAND before calling ext4_mark_inode_dirty() in the
eviction path.

This also breaks a circular lock dependency reported by lockdep during
orphan cleanup at mount time:

  CPU0 (writeback worker)            CPU1 (open)
  ----                               ----
  ext4_writepages()
    s_writepages_rwsem (read)        ext4_create()
    ext4_do_writepages()               __ext4_new_inode()
      ext4_journal_start()               [holds jbd2 handle]
        wait_transaction_locked()        ext4_xattr_set_handle()
        [WAIT for jbd2_handle]             xattr_sem (write)

  CPU2 (mount / orphan cleanup)
  ----
  ext4_evict_inode()
    __ext4_mark_inode_dirty()
      ext4_try_to_expand_extra_isize()
        xattr_sem (write)
        ext4_expand_extra_isize_ea()
          ext4_xattr_block_set()
            iput(ea_inode)
              write_inode_now()
                ext4_writepages()
                  s_writepages_rwsem (read)
                  [WAIT for s_writepages_rwsem -- if blocked by write lock holder]

This forms a circular dependency on lock classes:

  s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem

The iput() inside ext4_xattr_block_set() triggers write_inode_now()
because SB_ACTIVE is not yet set during mount, so iput_final() cannot
cache the inode in the LRU and must flush it synchronously.

Setting EXT4_STATE_NO_EXPAND prevents ext4_try_to_expand_extra_isize()
from executing, which eliminates the xattr_sem --> s_writepages_rwsem
edge and breaks the cycle.

Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/inode.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cd7588a3fa45..cbfd1d1282e6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -264,6 +264,12 @@ void ext4_evict_inode(struct inode *inode)
 	if (ext4_inode_is_fast_symlink(inode))
 		memset(EXT4_I(inode)->i_data, 0, sizeof(EXT4_I(inode)->i_data));
 	inode->i_size = 0;
+	/*
+	 * Skip extra isize expansion on inodes being deleted -- it is
+	 * pointless and can trigger a circular lock dependency:
+	 *   xattr_sem -> ext4_xattr_block_set -> iput -> s_writepages_rwsem
+	 */
+	ext4_set_inode_state(inode, EXT4_STATE_NO_EXPAND);
 	err = ext4_mark_inode_dirty(handle, inode);
 	if (err) {
 		ext4_warning(inode->i_sb,
-- 
2.43.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox