* [PATCH v1] ext4: add mb_stats_clear for mballoc statistics
From: Baolin Liu @ 2026-04-14 10:02 UTC
To: tytso, adilger.kernel
Cc: liubaolin12138, linux-ext4, linux-kernel, wangguanyu, Baolin Liu
From: Baolin Liu <liubaolin@kylinos.cn>
Add a write-only mb_stats_clear sysfs knob to reset ext4 mballoc
runtime statistics. This makes it easier to inspect allocator
activity for a specific workload instead of using counters
accumulated since mount.
Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
---
fs/ext4/ext4.h | 1 +
fs/ext4/mballoc.c | 31 +++++++++++++++++++++++++++++++
fs/ext4/sysfs.c | 24 ++++++++++++++++++++++++
3 files changed, 56 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7617e2d454ea..3a32e1a515dd 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2995,6 +2995,7 @@ int ext4_fc_record_regions(struct super_block *sb, int ino,
extern const struct seq_operations ext4_mb_seq_groups_ops;
extern const struct seq_operations ext4_mb_seq_structs_summary_ops;
extern int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset);
+extern void ext4_mb_stats_clear(struct ext4_sb_info *sbi);
extern int ext4_mb_init(struct super_block *);
extern void ext4_mb_release(struct super_block *);
extern ext4_fsblk_t ext4_mb_new_blocks(handle_t *,
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index bb58eafb87bc..382c91586b26 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3219,6 +3219,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
}
seq_printf(seq, "\treqs: %u\n", atomic_read(&sbi->s_bal_reqs));
seq_printf(seq, "\tsuccess: %u\n", atomic_read(&sbi->s_bal_success));
+ seq_printf(seq, "\tblocks_allocated: %u\n",
+ atomic_read(&sbi->s_bal_allocated));
seq_printf(seq, "\tgroups_scanned: %u\n",
atomic_read(&sbi->s_bal_groups_scanned));
@@ -4721,6 +4723,35 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
trace_ext4_mballoc_prealloc(ac);
}
+void ext4_mb_stats_clear(struct ext4_sb_info *sbi)
+{
+ int i;
+
+ atomic_set(&sbi->s_bal_reqs, 0);
+ atomic_set(&sbi->s_bal_success, 0);
+ atomic_set(&sbi->s_bal_allocated, 0);
+ atomic_set(&sbi->s_bal_groups_scanned, 0);
+
+ for (i = 0; i < EXT4_MB_NUM_CRS; i++) {
+ atomic64_set(&sbi->s_bal_cX_hits[i], 0);
+ atomic64_set(&sbi->s_bal_cX_groups_considered[i], 0);
+ atomic_set(&sbi->s_bal_cX_ex_scanned[i], 0);
+ atomic64_set(&sbi->s_bal_cX_failed[i], 0);
+ }
+
+ atomic_set(&sbi->s_bal_ex_scanned, 0);
+ atomic_set(&sbi->s_bal_goals, 0);
+ atomic_set(&sbi->s_bal_stream_goals, 0);
+ atomic_set(&sbi->s_bal_len_goals, 0);
+ atomic_set(&sbi->s_bal_2orders, 0);
+ atomic_set(&sbi->s_bal_breaks, 0);
+ atomic_set(&sbi->s_mb_lost_chunks, 0);
+ atomic_set(&sbi->s_mb_buddies_generated, 0);
+ atomic64_set(&sbi->s_mb_generation_time, 0);
+ atomic_set(&sbi->s_mb_preallocated, 0);
+ atomic_set(&sbi->s_mb_discarded, 0);
+}
+
/*
* Called on failure; free up any blocks from the inode PA for this
* context. We don't need this for MB_GROUP_PA because we only change
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 923b375e017f..a5bd88a99f22 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -41,6 +41,7 @@ typedef enum {
attr_pointer_atomic,
attr_journal_task,
attr_err_report_sec,
+ attr_mb_stats_clear,
} attr_id_t;
typedef enum {
@@ -161,6 +162,25 @@ static ssize_t err_report_sec_store(struct ext4_sb_info *sbi,
return count;
}
+static ssize_t mb_stats_clear_store(struct ext4_sb_info *sbi,
+ const char *buf, size_t count)
+{
+ int val;
+ int ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ ret = kstrtoint(skip_spaces(buf), 0, &val);
+ if (ret)
+ return ret;
+ if (val != 1)
+ return -EINVAL;
+
+ ext4_mb_stats_clear(sbi);
+ return count;
+}
+
static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
{
if (!sbi->s_journal)
@@ -251,6 +271,7 @@ EXT4_ATTR_OFFSET(mb_best_avail_max_trim_order, 0644, mb_order,
EXT4_ATTR_OFFSET(err_report_sec, 0644, err_report_sec, ext4_sb_info, s_err_report_sec);
EXT4_RW_ATTR_SBI_UI(inode_goal, s_inode_goal);
EXT4_RW_ATTR_SBI_UI(mb_stats, s_mb_stats);
+EXT4_ATTR(mb_stats_clear, 0200, mb_stats_clear);
EXT4_RW_ATTR_SBI_UI(mb_max_to_scan, s_mb_max_to_scan);
EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
@@ -301,6 +322,7 @@ static struct attribute *ext4_attrs[] = {
ATTR_LIST(inode_readahead_blks),
ATTR_LIST(inode_goal),
ATTR_LIST(mb_stats),
+ ATTR_LIST(mb_stats_clear),
ATTR_LIST(mb_max_to_scan),
ATTR_LIST(mb_min_to_scan),
ATTR_LIST(mb_order2_req),
@@ -561,6 +583,8 @@ static ssize_t ext4_attr_store(struct kobject *kobj,
return trigger_test_error(sbi, buf, len);
case attr_err_report_sec:
return err_report_sec_store(sbi, buf, len);
+ case attr_mb_stats_clear:
+ return mb_stats_clear_store(sbi, buf, len);
default:
return ext4_generic_attr_store(a, sbi, buf, len);
}
--
2.51.0
* Re: [PATCH v1] ext4: add mb_stats_clear for mballoc statistics
From: liubaolin @ 2026-04-14 10:07 UTC
To: tytso, adilger.kernel; +Cc: linux-ext4, linux-kernel, wangguanyu, Baolin Liu
In-Reply-To: <20260414100212.95209-1-liubaolin12138@163.com>
> Dear all,
> I have sent a small ext4 patch to add a manual reset capability for the mballoc statistics, and I would like to add some background on the motivation.
>
> The idea came mainly from XFS stats_clear.
> ext4 already exports mballoc runtime statistics through /proc/fs/ext4/<dev>/mb_stats,
> but these counters keep accumulating from mount time, which makes it inconvenient when trying to observe allocator behavior for a single test run.
>
> This patch adds a write-only sysfs node, /sys/fs/ext4/<dev>/mb_stats_clear, so that writing 1 to it resets the ext4 mballoc runtime statistics.
> It also adds sbi->s_bal_allocated to /proc/fs/ext4/<dev>/mb_stats,
> so that the proc output matches the mballoc summary printed at unmount time and the set of counters covered by mb_stats_clear is more complete.
>
> The main goal is to make it easier to observe allocator activity for a specific test run instead of relying on counters accumulated since mount.
> With this in place, the counters can be cleared before starting a test, and the resulting mb_stats output reflects only the activity generated by that test.
>
> The counters being cleared are runtime mballoc statistics used for /proc/fs/ext4/<dev>/mb_stats reporting and for the mballoc summary printed at unmount time.
> I did not find any cases where these fields are read back to drive ext4 behavior, so the reset only affects statistics reporting.
>
> For validation, /sys/fs/ext4/<dev>/mb_stats can be enabled first,
> then a file operation test can be run so that the relevant values in /proc/fs/ext4/<dev>/mb_stats become non-zero.
> After writing 1 to /sys/fs/ext4/<dev>/mb_stats_clear, those values should return to 0.
> Running another file operation test afterward should make those values increase again.
>
> Best regards,
> Baolin Liu
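The input handling described above (only a write of 1 clears the counters) can be sketched in userspace C. This is a minimal approximation, not the kernel code: `mb_stats_clear_parse()` is a hypothetical name, and `strtol()` stands in for the `kstrtoint(skip_spaces(buf), 0, &val)` call in the patch, so edge cases may differ from the kernel helpers.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Userspace approximation of the check in mb_stats_clear_store():
 * skip leading whitespace, parse an integer with base auto-detection,
 * tolerate one trailing newline, and accept only the value 1.
 * Returns 0 when the counters would be cleared, a negative errno otherwise.
 * (Hypothetical helper; the name is not part of the patch.)
 */
static int mb_stats_clear_parse(const char *buf)
{
	char *end;
	long val;

	while (isspace((unsigned char)*buf))
		buf++;
	if (*buf == '\0')
		return -EINVAL;

	errno = 0;
	val = strtol(buf, &end, 0);
	if (end == buf)
		return -EINVAL;		/* no digits at all */
	if (errno == ERANGE || val < INT_MIN || val > INT_MAX)
		return -ERANGE;		/* does not fit in an int */
	if (*end == '\n')
		end++;			/* a single trailing newline is allowed */
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */

	return val == 1 ? 0 : -EINVAL;
}
```

With this sketch, `mb_stats_clear_parse("1\n")` returns 0, while `"0\n"`, `"2"` and `"abc"` are all rejected with `-EINVAL`, matching the behaviour the validation steps above rely on.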
* [RFC PATCH] iomap: add fast read path for small direct I/O
From: Fengnan Chang @ 2026-04-14 12:26 UTC
To: brauner, djwong, linux-xfs, linux-fsdevel, linux-ext4
Cc: lidiangang, Fengnan Chang
When running 4K random read workloads on high-performance Gen5 NVMe
SSDs, the software overhead in the iomap direct I/O path
(__iomap_dio_rw) becomes a significant bottleneck.
Using io_uring with poll mode for a 4K randread test on a raw block
device:
taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1
-n1 -P1 /dev/nvme10n1
Result: ~3.2M IOPS
Running the exact same workload on ext4 and XFS:
taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1
-n1 -P1 /mnt/testfile
Result: ~1.9M IOPS
Profiling the ext4 workload reveals that a significant portion of CPU
time is spent on memory allocation and the iomap state machine
iteration:
5.33% [kernel] [k] __iomap_dio_rw
3.26% [kernel] [k] iomap_iter
2.37% [kernel] [k] iomap_dio_bio_iter
2.35% [kernel] [k] kfree
1.33% [kernel] [k] iomap_dio_complete
I attempted several incremental optimizations in the __iomap_dio_rw()
path to close the gap:
1. Allocating the `bio` and `struct iomap_dio` together to avoid a
separate kmalloc. However, because `struct iomap_dio` is relatively
large and the main path is complex, this yielded almost no
performance improvement.
2. Reducing unnecessary state resets in the iomap state machine (e.g.,
skipping `iomap_iter_reset_iomap` where safe). This provided a ~5%
IOPS boost, which is helpful but still falls far short of closing
the gap with the raw block device.
Since optimizing the heavy generic path did not yield the desired
results for this specific, highly-demanding Gen5 SSD scenario, this
RFC patch introduces a dedicated asynchronous fast path.
The fast path is triggered when the request satisfies:
- Asynchronous READ request only for now.
- I/O size is <= inode blocksize (fits in a single block, no splits).
- Aligned to the block device's logical block size.
- No bounce buffering, fscrypt, or fsverity involved.
- No custom `iomap_dio_ops` (dops) registered by the filesystem.
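The conditions listed above can be sketched as a single userspace predicate. `struct fast_read_req` and `fast_read_eligible()` are hypothetical stand-ins for the kernel state consulted by `iomap_dio_fast_read_supported()` in the patch. The end-of-file check is taken from the patch body rather than from the list, and the logical block size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified request description; not a kernel structure. */
struct fast_read_req {
	bool is_read;			/* READ request */
	bool is_async;			/* not a synchronous kiocb */
	bool needs_bounce;		/* bounce buffering requested */
	bool encrypted_or_verity;	/* fscrypt or fsverity active */
	bool has_dio_ops;		/* filesystem registered iomap_dio_ops */
	uint64_t pos;			/* file offset of the read */
	size_t count;			/* length of the read in bytes */
	uint64_t i_size;		/* current file size */
	unsigned int fs_block_size;	/* filesystem block size */
	unsigned int lbs;		/* device logical block size, power of two */
};

static bool fast_read_eligible(const struct fast_read_req *r)
{
	if (!r->is_read || !r->is_async)
		return false;
	if (r->needs_bounce || r->encrypted_or_verity || r->has_dio_ops)
		return false;
	/* Small I/O only: at most one filesystem block, at least one sector. */
	if (r->count > r->fs_block_size || r->count < r->lbs)
		return false;
	/* Reads crossing EOF are left to the generic path. */
	if (r->pos + r->count > r->i_size)
		return false;
	/* Offset and length must both be multiples of the logical block size. */
	return ((r->pos | r->count) & (r->lbs - 1)) == 0;
}
```

OR-ing the offset and the length before masking tests both alignments in one step, which is the same trick the patch uses.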
By using a dedicated bio_set (`iomap_dio_fast_read_pool`) to embed a
much smaller completion state (`struct iomap_dio_fast_read`) directly
in the bio's front padding, we completely eliminate kmalloc/kfree and
drastically shorten the execution path.
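The front-padding idea can be illustrated outside the kernel. In the sketch below, `struct fake_bio`, `struct fast_read_state`, `fake_bio_alloc()` and `fake_bio_free()` are hypothetical userspace stand-ins; only the layout trick (state placed directly in front of the bio and recovered with `container_of`) mirrors what the patch does with `bioset_init()` and `offsetof(struct iomap_dio_fast_read, bio)`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct bio; the real structure is far larger. */
struct fake_bio {
	int status;
};

/* Per-I/O completion state, with the bio as the last member. */
struct fast_read_state {
	size_t size;
	int should_dirty;
	struct fake_bio bio;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* One allocation covers the front padding (our state) and the bio itself. */
static struct fake_bio *fake_bio_alloc(size_t front_pad)
{
	char *mem = calloc(1, front_pad + sizeof(struct fake_bio));

	return mem ? (struct fake_bio *)(mem + front_pad) : NULL;
}

/* Free the whole allocation, starting from the front padding. */
static void fake_bio_free(struct fake_bio *bio, size_t front_pad)
{
	free((char *)bio - front_pad);
}
```

A caller would pass `offsetof(struct fast_read_state, bio)` as `front_pad` and then use `container_of(bio, struct fast_read_state, bio)` to reach the state, so no second allocation is needed per I/O.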
After this optimization, the heavy generic functions disappear from the
profile, replaced by a single streamlined execution path:
4.83% [kernel] [k] iomap_dio_fast_read_async.isra.31
With this patch, 4K random read IOPS on ext4 increases from 1.9M to
2.3M.
I am aware that adding a completely separate fast path introduces
duplicate code and may result in iomap_begin being called twice, which
is likely unacceptable for merging in its current form.
However, I am submitting this patch to validate whether this
optimization direction is correct and worth pursuing. I would appreciate
feedback on how to better integrate these ideas into the main iomap
execution path.
Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
---
fs/iomap/direct-io.c | 275 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 275 insertions(+)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index e911daedff65a..e4183f7c2f962 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -5,10 +5,14 @@
*/
#include <linux/blk-crypto.h>
#include <linux/fscrypt.h>
+#include <linux/fsverity.h>
#include <linux/pagemap.h>
#include <linux/iomap.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/fserror.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
#include "internal.h"
#include "trace.h"
@@ -880,12 +884,231 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
}
EXPORT_SYMBOL_GPL(__iomap_dio_rw);
+static bool iomap_dio_fast_read_enabled = true;
+
+struct iomap_dio_fast_read {
+ struct kiocb *iocb;
+ size_t size;
+ bool should_dirty;
+ struct work_struct work;
+ struct bio bio ____cacheline_aligned_in_smp;
+};
+
+static struct bio_set iomap_dio_fast_read_pool;
+
+static void iomap_dio_fast_read_complete_work(struct work_struct *work)
+{
+ struct iomap_dio_fast_read *fr =
+ container_of(work, struct iomap_dio_fast_read, work);
+ struct kiocb *iocb = fr->iocb;
+ struct inode *inode = file_inode(iocb->ki_filp);
+ bool should_dirty = fr->should_dirty;
+ struct bio *bio = &fr->bio;
+ ssize_t ret;
+
+ WRITE_ONCE(iocb->private, NULL);
+
+ if (likely(!bio->bi_status)) {
+ ret = fr->size;
+ iocb->ki_pos += ret;
+ } else {
+ ret = blk_status_to_errno(bio->bi_status);
+ fserror_report_io(inode, FSERR_DIRECTIO_READ, iocb->ki_pos,
+ fr->size, ret, GFP_NOFS);
+ }
+
+ if (should_dirty) {
+ bio_check_pages_dirty(bio);
+ } else {
+ bio_release_pages(bio, false);
+ bio_put(bio);
+ }
+
+ inode_dio_end(inode);
+
+ trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
+ iocb->ki_complete(iocb, ret);
+}
+
+static void iomap_dio_fast_read_end_io(struct bio *bio)
+{
+ struct iomap_dio_fast_read *fr = bio->bi_private;
+ struct kiocb *iocb = fr->iocb;
+
+ if (unlikely(bio->bi_status)) {
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ INIT_WORK(&fr->work, iomap_dio_fast_read_complete_work);
+ queue_work(inode->i_sb->s_dio_done_wq, &fr->work);
+ return;
+ }
+
+ iomap_dio_fast_read_complete_work(&fr->work);
+}
+
+static inline bool iomap_dio_fast_read_supported(struct kiocb *iocb,
+ struct iov_iter *iter,
+ unsigned int dio_flags,
+ size_t done_before)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ size_t count = iov_iter_count(iter);
+ unsigned int alignment;
+
+ if (!iomap_dio_fast_read_enabled)
+ return false;
+ if (iov_iter_rw(iter) != READ)
+ return false;
+
+ /*
+ * Fast read is an optimization for small IO. Filter out large IO early
+ * as it's the most common case to fail for typical direct IO workloads.
+ */
+ if (count > inode->i_sb->s_blocksize)
+ return false;
+
+ if (is_sync_kiocb(iocb) || done_before)
+ return false;
+ if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_BOUNCE))
+ return false;
+ if (iocb->ki_pos + count > i_size_read(inode))
+ return false;
+ if (IS_ENCRYPTED(inode) || fsverity_active(inode))
+ return false;
+
+ if (count < bdev_logical_block_size(inode->i_sb->s_bdev))
+ return false;
+
+ if (dio_flags & IOMAP_DIO_FSBLOCK_ALIGNED)
+ alignment = i_blocksize(inode);
+ else
+ alignment = bdev_logical_block_size(inode->i_sb->s_bdev);
+
+ if ((iocb->ki_pos | count) & (alignment - 1))
+ return false;
+
+ return true;
+}
+
+static ssize_t iomap_dio_fast_read_async(struct kiocb *iocb,
+ struct iov_iter *iter,
+ const struct iomap_ops *ops,
+ void *private)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ size_t count = iov_iter_count(iter);
+ int nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
+ bool should_dirty = user_backed_iter(iter);
+ struct iomap_dio_fast_read *fr;
+ struct iomap_iter iomi = {
+ .inode = inode,
+ .pos = iocb->ki_pos,
+ .len = count,
+ .flags = IOMAP_DIRECT,
+ .private = private,
+ };
+ struct bio *bio;
+ ssize_t ret;
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ iomi.flags |= IOMAP_NOWAIT;
+
+ ret = kiocb_write_and_wait(iocb, count);
+ if (ret)
+ return ret;
+
+ inode_dio_begin(inode);
+
+ ret = ops->iomap_begin(inode, iomi.pos, count, iomi.flags,
+ &iomi.iomap, &iomi.srcmap);
+ if (ret) {
+ inode_dio_end(inode);
+ return ret;
+ }
+
+ if (iomi.iomap.type != IOMAP_MAPPED ||
+ iomi.iomap.offset > iomi.pos ||
+ iomi.iomap.offset + iomi.iomap.length < iomi.pos + count ||
+ (iomi.iomap.flags & IOMAP_F_ANON_WRITE)) {
+ ret = -EAGAIN;
+ goto out_iomap_end;
+ }
+
+ if (!inode->i_sb->s_dio_done_wq) {
+ ret = sb_init_dio_done_wq(inode->i_sb);
+ if (ret < 0)
+ goto out_iomap_end;
+ }
+
+ trace_iomap_dio_rw_begin(iocb, iter, 0, 0);
+
+ bio = bio_alloc_bioset(iomi.iomap.bdev, nr_pages,
+ REQ_OP_READ | REQ_SYNC | REQ_IDLE,
+ GFP_KERNEL, &iomap_dio_fast_read_pool);
+ fr = container_of(bio, struct iomap_dio_fast_read, bio);
+ fr->iocb = iocb;
+ fr->should_dirty = should_dirty;
+
+ bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
+ bio->bi_ioprio = iocb->ki_ioprio;
+ bio->bi_private = fr;
+ bio->bi_end_io = iomap_dio_fast_read_end_io;
+
+ ret = bio_iov_iter_get_pages(bio, iter,
+ bdev_logical_block_size(iomi.iomap.bdev) - 1);
+ if (unlikely(ret)) {
+ bio_put(bio);
+ goto out_iomap_end;
+ }
+
+ if (bio->bi_iter.bi_size != count) {
+ iov_iter_revert(iter, bio->bi_iter.bi_size);
+ bio_release_pages(bio, false);
+ bio_put(bio);
+ ret = -EAGAIN;
+ goto out_iomap_end;
+ }
+
+ fr->size = bio->bi_iter.bi_size;
+
+ if (should_dirty)
+ bio_set_pages_dirty(bio);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ bio->bi_opf |= REQ_NOWAIT;
+ if (iocb->ki_flags & IOCB_HIPRI) {
+ bio->bi_opf |= REQ_POLLED;
+ bio_set_polled(bio, iocb);
+ WRITE_ONCE(iocb->private, bio);
+ }
+ submit_bio(bio);
+
+ if (ops->iomap_end)
+ ops->iomap_end(inode, iomi.pos, count, count, iomi.flags,
+ &iomi.iomap);
+ return -EIOCBQUEUED;
+
+out_iomap_end:
+ if (ops->iomap_end)
+ ops->iomap_end(inode, iomi.pos, count, 0, iomi.flags,
+ &iomi.iomap);
+ inode_dio_end(inode);
+ return ret;
+}
+
ssize_t
iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
unsigned int dio_flags, void *private, size_t done_before)
{
struct iomap_dio *dio;
+ ssize_t ret;
+
+ if (!dops && iomap_dio_fast_read_supported(iocb, iter, dio_flags, done_before)) {
+ ret = iomap_dio_fast_read_async(iocb, iter, ops, private);
+ if (ret != -EAGAIN)
+ return ret;
+ }
dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, private,
done_before);
@@ -894,3 +1117,55 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
return iomap_dio_complete(dio);
}
EXPORT_SYMBOL_GPL(iomap_dio_rw);
+
+static ssize_t fast_read_enable_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", iomap_dio_fast_read_enabled);
+}
+
+static ssize_t fast_read_enable_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ bool enable;
+ int ret;
+
+ ret = kstrtobool(buf, &enable);
+ if (ret)
+ return ret;
+
+ iomap_dio_fast_read_enabled = enable;
+ return count;
+}
+
+static struct kobj_attribute fast_read_enable_attr =
+ __ATTR(fast_read_enable, 0644, fast_read_enable_show, fast_read_enable_store);
+
+static struct kobject *iomap_kobj;
+
+static int __init iomap_dio_sysfs_init(void)
+{
+ int ret;
+
+ ret = bioset_init(&iomap_dio_fast_read_pool, 4,
+ offsetof(struct iomap_dio_fast_read, bio),
+ BIOSET_NEED_BVECS | BIOSET_PERCPU_CACHE);
+ if (ret)
+ return ret;
+
+ iomap_kobj = kobject_create_and_add("iomap", fs_kobj);
+ if (!iomap_kobj) {
+ bioset_exit(&iomap_dio_fast_read_pool);
+ return -ENOMEM;
+ }
+
+ if (sysfs_create_file(iomap_kobj, &fast_read_enable_attr.attr)) {
+ kobject_put(iomap_kobj);
+ bioset_exit(&iomap_dio_fast_read_pool);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+fs_initcall(iomap_dio_sysfs_init);
--
2.39.5 (Apple Git-154)
* Re: [patch 31/38] parisc: Select ARCH_HAS_RANDOM_ENTROPY
From: Helge Deller @ 2026-04-14 12:41 UTC
To: Thomas Gleixner, LKML
Cc: linux-parisc, Arnd Bergmann, x86, Lu Baolu, iommu,
Michael Grzeschik, netdev, linux-wireless, Herbert Xu,
linux-crypto, Vlastimil Babka, linux-mm, David Woodhouse,
Bernie Thompson, linux-fbdev, Theodore Tso, linux-ext4,
Andrew Morton, Uladzislau Rezki, Marco Elver, Dmitry Vyukov,
kasan-dev, Andrey Ryabinin, Thomas Sailer, linux-hams,
Jason A. Donenfeld, Richard Henderson, linux-alpha, Russell King,
linux-arm-kernel, Catalin Marinas, Huacai Chen, loongarch,
Geert Uytterhoeven, linux-m68k, Dinh Nguyen, Jonas Bonn,
linux-openrisc, Michael Ellerman, linuxppc-dev, Paul Walmsley,
linux-riscv, Heiko Carstens, linux-s390, David S. Miller,
sparclinux
In-Reply-To: <20260410120319.658485572@kernel.org>
On 4/10/26 14:21, Thomas Gleixner wrote:
> The only remaining non-architecture usage of get_cycles() is to provide
> random_get_entropy().
>
> Switch parisc over to the new scheme of selecting ARCH_HAS_RANDOM_ENTROPY
> and providing random_get_entropy() in asm/random.h.
>
> Add 'asm/timex.h' includes to the relevant files, so the global include can
> be removed once all architectures are converted over.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Cc: Helge Deller <deller@gmx.de>
> Cc: linux-parisc@vger.kernel.org
> ---
> arch/parisc/Kconfig | 1 +
> arch/parisc/include/asm/random.h | 12 ++++++++++++
> arch/parisc/include/asm/timex.h | 6 ------
> arch/parisc/kernel/processor.c | 1 +
> arch/parisc/kernel/time.c | 1 +
> 5 files changed, 15 insertions(+), 6 deletions(-)
I tested this series on parisc.
Works as expected.
Tested-by: Helge Deller <deller@gmx.de>
Thanks!
Helge
* Re: [PATCH] jbd2: validate transaction state before dropping from journal
From: Jan Kara @ 2026-04-14 12:46 UTC
To: Milos Nikic; +Cc: jack, tytso, linux-ext4, linux-kernel
In-Reply-To: <20260413180824.126739-1-nikic.milos@gmail.com>
On Mon 13-04-26 11:08:24, Milos Nikic wrote:
> Currently, __jbd2_journal_drop_transaction() unlinks the transaction
> from the journal's checkpoint lists and only then proceeds to validate the
> transaction's internal state using a series of J_ASSERTs.
>
> There is no need to 'mutate before validate'. If we are going to halt the
> system that makes manipulating corrupted pointers in memory irrelevant.
>
> Move the state validation block above the pointer manipulation. This
> ensures the transaction is entirely valid before modifying the journal's
> internal lists, modernizing the function's logic and paving the way
> for future graceful degradation of these assertions.
Either you have a poetry gift or you should tell your AI agent to keep the
tone more to the point :). Anyway I think this is really just a pointless
churn as it doesn't really matter whether we crash the kernel one way or
another...
Honza
>
> Signed-off-by: Milos Nikic <nikic.milos@gmail.com>
> ---
> fs/jbd2/checkpoint.c | 18 +++++++++---------
> 1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 1508e2f54462..c82b6bedd27b 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -703,6 +703,15 @@ void __jbd2_journal_drop_transaction(journal_t *journal, transaction_t *transact
> {
> assert_spin_locked(&journal->j_list_lock);
>
> + J_ASSERT(transaction->t_state == T_FINISHED);
> + J_ASSERT(transaction->t_buffers == NULL);
> + J_ASSERT(transaction->t_forget == NULL);
> + J_ASSERT(transaction->t_shadow_list == NULL);
> + J_ASSERT(transaction->t_checkpoint_list == NULL);
> + J_ASSERT(atomic_read(&transaction->t_updates) == 0);
> + J_ASSERT(journal->j_committing_transaction != transaction);
> + J_ASSERT(journal->j_running_transaction != transaction);
> +
> journal->j_shrink_transaction = NULL;
> if (transaction->t_cpnext) {
> transaction->t_cpnext->t_cpprev = transaction->t_cpprev;
> @@ -714,15 +723,6 @@ void __jbd2_journal_drop_transaction(journal_t *journal, transaction_t *transact
> journal->j_checkpoint_transactions = NULL;
> }
>
> - J_ASSERT(transaction->t_state == T_FINISHED);
> - J_ASSERT(transaction->t_buffers == NULL);
> - J_ASSERT(transaction->t_forget == NULL);
> - J_ASSERT(transaction->t_shadow_list == NULL);
> - J_ASSERT(transaction->t_checkpoint_list == NULL);
> - J_ASSERT(atomic_read(&transaction->t_updates) == 0);
> - J_ASSERT(journal->j_committing_transaction != transaction);
> - J_ASSERT(journal->j_running_transaction != transaction);
> -
> trace_jbd2_drop_transaction(journal, transaction);
>
> jbd2_debug(1, "Dropping transaction %d, all done\n", transaction->t_tid);
> --
> 2.53.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] jbd2: enforce power-of-two default revoke hash size at compile time
From: Jan Kara @ 2026-04-14 12:59 UTC
To: Milos Nikic; +Cc: jack, tytso, linux-ext4, linux-kernel
In-Reply-To: <20260413212724.127035-1-nikic.milos@gmail.com>
On Mon 13-04-26 14:27:24, Milos Nikic wrote:
> The jbd2 revoke table relies on bitwise AND operations for fast hash
> indexing, which requires the hash table size to be a strict power of two.
>
> Currently, this requirement is only enforced at runtime via a J_ASSERT
> in jbd2_journal_init_revoke(). While this successfully catches invalid
> dynamic allocations, it means a developer accidentally modifying the
> hardcoded JOURNAL_REVOKE_DEFAULT_HASH macro will experience a system
> panic upon mounting the filesystem during testing.
>
> Add a BUILD_BUG_ON() in journal_init_common() to validate the default
> macro at compile time. This acts as an immediate, zero-overhead
> safeguard, preventing compilation entirely if the default hash size is
> mathematically invalid.
>
> Signed-off-by: Milos Nikic <nikic.milos@gmail.com>
Eh, if you modify JOURNAL_REVOKE_DEFAULT_HASH you should better know what
you are doing and if you mess up, then the kernel failing with assertion
isn't that difficult to diagnose. So sorry I don't think this "cleanup" is
useful either.
Honza
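For readers following the thread, the mask-based indexing and the compile-time check described in the quoted commit message can be sketched in plain C. This is an illustration only: `REVOKE_HASH_SIZE`, `IS_POWER_OF_2()` and `hash_slot()` are hypothetical names, the value 256 is chosen for the example, and the code is not taken from the jbd2 sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table size used only for this illustration. */
#define REVOKE_HASH_SIZE 256u

/* True when exactly one bit is set, i.e. n is a power of two. */
#define IS_POWER_OF_2(n) ((n) != 0 && (((n) & ((n) - 1)) == 0))

/* Fails the build, rather than the mount, if the size is invalid. */
_Static_assert(IS_POWER_OF_2(REVOKE_HASH_SIZE),
	       "hash size must be a power of two");

/* With a power-of-two size, masking is equivalent to block % size. */
static unsigned int hash_slot(uint64_t block)
{
	return (unsigned int)(block & (REVOKE_HASH_SIZE - 1));
}
```

Changing `REVOKE_HASH_SIZE` to a value such as 100 makes the `_Static_assert` reject the build, which is the effect the proposed `BUILD_BUG_ON()` aims for. The runtime `J_ASSERT` mentioned in the reply catches the same mistake at mount time instead.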
> ---
> fs/jbd2/journal.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 4f397fcdb13c..62b36a2fc4e2 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1565,6 +1565,7 @@ static journal_t *journal_init_common(struct block_device *bdev,
> /* The journal is marked for error until we succeed with recovery! */
> journal->j_flags = JBD2_ABORT;
>
> + BUILD_BUG_ON(!is_power_of_2(JOURNAL_REVOKE_DEFAULT_HASH));
> /* Set up a default-sized revoke table for the new mount. */
> err = jbd2_journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
> if (err)
> --
> 2.53.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v7 03/22] ovl: use core fsverity ensure info interface
From: Andrey Albershteyn @ 2026-04-14 13:53 UTC
To: Christoph Hellwig
Cc: Andrey Albershteyn, linux-xfs, fsverity, linux-fsdevel, ebiggers,
linux-ext4, linux-f2fs-devel, linux-btrfs, djwong
In-Reply-To: <20260414081301.GB11138@lst.de>
On 2026-04-14 10:13:01, Christoph Hellwig wrote:
> On Thu, Apr 09, 2026 at 03:13:35PM +0200, Andrey Albershteyn wrote:
> > - if (!fsverity_active(inode) && IS_VERITY(inode)) {
> > - /*
> > - * If this inode was not yet opened, the verity info hasn't been
> > - * loaded yet, so we need to do that here to force it into memory.
> > - */
> > - filp = kernel_file_open(datapath, O_RDONLY, current_cred());
> > - if (IS_ERR(filp))
> > - return PTR_ERR(filp);
> > - fput(filp);
> > - }
> > + if (fsverity_active(inode))
> > + fsverity_ensure_verity_info(inode);
>
> fsverity_ensure_verity_info already is a no-op for !fsverity_active,
> so the check could be removed.
I don't think it is. For non-fsverity inodes it will try to call into
descriptor reading callback and fail.
> Also we should probably propagate the
> error return from fsverity_ensure_verity_info here.
oh right, thanks!
--
- Andrey
* Re: [PATCH v7 03/22] ovl: use core fsverity ensure info interface
From: Christoph Hellwig @ 2026-04-15 5:23 UTC
To: Andrey Albershteyn
Cc: Christoph Hellwig, Andrey Albershteyn, linux-xfs, fsverity,
linux-fsdevel, ebiggers, linux-ext4, linux-f2fs-devel,
linux-btrfs, djwong
In-Reply-To: <u3szxuhjrv7vyxwyrepuflwhzeucss7xj3cxj73mnpm5kal2da@jck24ig4oxxa>
On Tue, Apr 14, 2026 at 03:53:36PM +0200, Andrey Albershteyn wrote:
> > > + if (fsverity_active(inode))
> > > + fsverity_ensure_verity_info(inode);
> >
> > fsverity_ensure_verity_info already is a no-op for !fsverity_active,
> > so the check could be removed.
>
> I don't think it is. For non-fsverity inodes it will try to call into
> descriptor reading callback and fail.
You're right, sorry for the noise.
* Re: [patch 32/38] powerpc/spufs: Use mftb() directly
From: Christophe Leroy (CS GROUP) @ 2026-04-15 6:38 UTC
To: Thomas Gleixner, LKML
Cc: Michael Ellerman, linuxppc-dev, Arnd Bergmann, x86, Lu Baolu,
iommu, Michael Grzeschik, netdev, linux-wireless, Herbert Xu,
linux-crypto, Vlastimil Babka, linux-mm, David Woodhouse,
Bernie Thompson, linux-fbdev, Theodore Tso, linux-ext4,
Andrew Morton, Uladzislau Rezki, Marco Elver, Dmitry Vyukov,
kasan-dev, Andrey Ryabinin, Thomas Sailer, linux-hams,
Jason A. Donenfeld, Richard Henderson, linux-alpha, Russell King,
linux-arm-kernel, Catalin Marinas, Huacai Chen, loongarch,
Geert Uytterhoeven, linux-m68k, Dinh Nguyen, Jonas Bonn,
linux-openrisc, Helge Deller, linux-parisc, Paul Walmsley,
linux-riscv, Heiko Carstens, linux-s390, David S. Miller,
sparclinux
In-Reply-To: <20260410120319.723429844@kernel.org>
Le 10/04/2026 à 14:21, Thomas Gleixner a écrit :
> There is no reason to indirect via get_cycles(), which is about to be
> removed.
>
> Use mftb() directly.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
> ---
> arch/powerpc/platforms/cell/spufs/switch.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> --- a/arch/powerpc/platforms/cell/spufs/switch.c
> +++ b/arch/powerpc/platforms/cell/spufs/switch.c
> @@ -34,6 +34,7 @@
> #include <asm/spu_priv1.h>
> #include <asm/spu_csa.h>
> #include <asm/mmu_context.h>
> +#include <asm/time.h>
>
> #include "spufs.h"
>
> @@ -279,7 +280,7 @@ static inline void save_timebase(struct
> * Read PPE Timebase High and Timebase low registers
> * and save in CSA. TBD.
> */
> - csa->suspend_time = get_cycles();
> + csa->suspend_time = mftb();
> }
>
> static inline void remove_other_spu_access(struct spu_state *csa,
> @@ -1261,7 +1262,7 @@ static inline void setup_decr(struct spu
> * in LSCSA.
> */
> if (csa->priv2.mfc_control_RW & MFC_CNTL_DECREMENTER_RUNNING) {
> - cycles_t resume_time = get_cycles();
> + cycles_t resume_time = mftb();
> cycles_t delta_time = resume_time - csa->suspend_time;
>
> csa->lscsa->decr_status.slot[0] = SPU_DECR_STATUS_RUNNING;
>
>
^ permalink raw reply
* Re: [patch 05/38] treewide: Remove CLOCK_TICK_RATE
From: Christophe Leroy (CS GROUP) @ 2026-04-15 6:40 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Arnd Bergmann, x86, Lu Baolu, iommu, Michael Grzeschik, netdev,
linux-wireless, Herbert Xu, linux-crypto, Vlastimil Babka,
linux-mm, David Woodhouse, Bernie Thompson, linux-fbdev,
Theodore Tso, linux-ext4, Andrew Morton, Uladzislau Rezki,
Marco Elver, Dmitry Vyukov, kasan-dev, Andrey Ryabinin,
Thomas Sailer, linux-hams, Jason A. Donenfeld, Richard Henderson,
linux-alpha, Russell King, linux-arm-kernel, Catalin Marinas,
Huacai Chen, loongarch, Geert Uytterhoeven, linux-m68k,
Dinh Nguyen, Jonas Bonn, linux-openrisc, Helge Deller,
linux-parisc, Michael Ellerman, linuxppc-dev, Paul Walmsley,
linux-riscv, Heiko Carstens, linux-s390, David S. Miller,
sparclinux
In-Reply-To: <20260410120317.910770161@kernel.org>
Le 10/04/2026 à 14:18, Thomas Gleixner a écrit :
> This has been scheduled for removal more than a decade ago and the comments
> related to it have been dutifully ignored. The last dependencies are gone.
>
> Remove it along with various now empty asm/timex.h files.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
For powerpc:
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
> ---
> arch/alpha/include/asm/timex.h | 4 ----
> arch/arc/include/asm/timex.h | 15 ---------------
> arch/arm/mach-omap1/Kconfig | 2 +-
> arch/hexagon/include/asm/timex.h | 3 ---
> arch/m68k/include/asm/timex.h | 15 ---------------
> arch/microblaze/include/asm/timex.h | 13 -------------
> arch/mips/include/asm/timex.h | 8 --------
> arch/openrisc/include/asm/timex.h | 3 ---
> arch/parisc/include/asm/timex.h | 2 --
> arch/powerpc/include/asm/timex.h | 2 --
> arch/s390/include/asm/timex.h | 2 --
> arch/sh/include/asm/timex.h | 24 ------------------------
> arch/sparc/include/asm/timex.h | 2 +-
> arch/sparc/include/asm/timex_32.h | 14 --------------
> arch/sparc/include/asm/timex_64.h | 2 --
> arch/um/include/asm/timex.h | 9 ---------
> arch/x86/include/asm/timex.h | 3 ---
> 17 files changed, 2 insertions(+), 121 deletions(-)
>
> --- a/arch/alpha/include/asm/timex.h
> +++ b/arch/alpha/include/asm/timex.h
> @@ -7,10 +7,6 @@
> #ifndef _ASMALPHA_TIMEX_H
> #define _ASMALPHA_TIMEX_H
>
> -/* With only one or two oddballs, we use the RTC as the ticker, selecting
> - the 32.768kHz reference clock, which nicely divides down to our HZ. */
> -#define CLOCK_TICK_RATE 32768
> -
> /*
> * Standard way to access the cycle counter.
> * Currently only used on SMP for scheduling.
> --- a/arch/arc/include/asm/timex.h
> +++ /dev/null
> @@ -1,15 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-only */
> -/*
> - * Copyright (C) 2004, 2007-2010, 2011-2012 Synopsys, Inc. (www.synopsys.com)
> - */
> -
> -#ifndef _ASM_ARC_TIMEX_H
> -#define _ASM_ARC_TIMEX_H
> -
> -#define CLOCK_TICK_RATE 80000000 /* slated to be removed */
> -
> -#include <asm-generic/timex.h>
> -
> -/* XXX: get_cycles() to be implemented with RTSC insn */
> -
> -#endif /* _ASM_ARC_TIMEX_H */
> --- a/arch/arm/mach-omap1/Kconfig
> +++ b/arch/arm/mach-omap1/Kconfig
> @@ -74,7 +74,7 @@ config OMAP_32K_TIMER
> currently only available for OMAP16XX, 24XX, 34XX, OMAP4/5 and DRA7XX.
>
> On OMAP2PLUS this value is only used for CONFIG_HZ and
> - CLOCK_TICK_RATE compile time calculation.
> + timer frequency compile time calculation.
> The actual timer selection is done in the board file
> through the (DT_)MACHINE_START structure.
>
> --- a/arch/hexagon/include/asm/timex.h
> +++ b/arch/hexagon/include/asm/timex.h
> @@ -9,9 +9,6 @@
> #include <asm-generic/timex.h>
> #include <asm/hexagon_vm.h>
>
> -/* Using TCX0 as our clock. CLOCK_TICK_RATE scheduled to be removed. */
> -#define CLOCK_TICK_RATE 19200
> -
> #define ARCH_HAS_READ_CURRENT_TIMER
>
> static inline int read_current_timer(unsigned long *timer_val)
> --- a/arch/m68k/include/asm/timex.h
> +++ b/arch/m68k/include/asm/timex.h
> @@ -7,21 +7,6 @@
> #ifndef _ASMm68K_TIMEX_H
> #define _ASMm68K_TIMEX_H
>
> -#ifdef CONFIG_COLDFIRE
> -/*
> - * CLOCK_TICK_RATE should give the underlying frequency of the tick timer
> - * to make ntp work best. For Coldfires, that's the main clock.
> - */
> -#include <asm/coldfire.h>
> -#define CLOCK_TICK_RATE MCF_CLK
> -#else
> -/*
> - * This default CLOCK_TICK_RATE is probably wrong for many 68k boards
> - * Users of those boards will need to check and modify accordingly
> - */
> -#define CLOCK_TICK_RATE 1193180 /* Underlying HZ */
> -#endif
> -
> typedef unsigned long cycles_t;
>
> static inline cycles_t get_cycles(void)
> --- a/arch/microblaze/include/asm/timex.h
> +++ /dev/null
> @@ -1,13 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -/*
> - * Copyright (C) 2006 Atmark Techno, Inc.
> - */
> -
> -#ifndef _ASM_MICROBLAZE_TIMEX_H
> -#define _ASM_MICROBLAZE_TIMEX_H
> -
> -#include <asm-generic/timex.h>
> -
> -#define CLOCK_TICK_RATE 1000 /* Timer input freq. */
> -
> -#endif /* _ASM_TIMEX_H */
> --- a/arch/mips/include/asm/timex.h
> +++ b/arch/mips/include/asm/timex.h
> @@ -19,14 +19,6 @@
> #include <asm/cpu-type.h>
>
> /*
> - * This is the clock rate of the i8253 PIT. A MIPS system may not have
> - * a PIT by the symbol is used all over the kernel including some APIs.
> - * So keeping it defined to the number for the PIT is the only sane thing
> - * for now.
> - */
> -#define CLOCK_TICK_RATE 1193182
> -
> -/*
> * Standard way to access the cycle counter.
> * Currently only used on SMP for scheduling.
> *
> --- a/arch/openrisc/include/asm/timex.h
> +++ b/arch/openrisc/include/asm/timex.h
> @@ -25,9 +25,6 @@ static inline cycles_t get_cycles(void)
> }
> #define get_cycles get_cycles
>
> -/* This isn't really used any more */
> -#define CLOCK_TICK_RATE 1000
> -
> #define ARCH_HAS_READ_CURRENT_TIMER
>
> #endif
> --- a/arch/parisc/include/asm/timex.h
> +++ b/arch/parisc/include/asm/timex.h
> @@ -9,8 +9,6 @@
>
> #include <asm/special_insns.h>
>
> -#define CLOCK_TICK_RATE 1193180 /* Underlying HZ */
> -
> typedef unsigned long cycles_t;
>
> static inline cycles_t get_cycles(void)
> --- a/arch/powerpc/include/asm/timex.h
> +++ b/arch/powerpc/include/asm/timex.h
> @@ -11,8 +11,6 @@
> #include <asm/cputable.h>
> #include <asm/vdso/timebase.h>
>
> -#define CLOCK_TICK_RATE 1024000 /* Underlying HZ */
> -
> typedef unsigned long cycles_t;
>
> static inline cycles_t get_cycles(void)
> --- a/arch/s390/include/asm/timex.h
> +++ b/arch/s390/include/asm/timex.h
> @@ -177,8 +177,6 @@ static inline void local_tick_enable(uns
> set_clock_comparator(get_lowcore()->clock_comparator);
> }
>
> -#define CLOCK_TICK_RATE 1193180 /* Underlying HZ */
> -
> typedef unsigned long cycles_t;
>
> static __always_inline unsigned long get_tod_clock(void)
> --- a/arch/sh/include/asm/timex.h
> +++ /dev/null
> @@ -1,24 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -/*
> - * linux/include/asm-sh/timex.h
> - *
> - * sh architecture timex specifications
> - */
> -#ifndef __ASM_SH_TIMEX_H
> -#define __ASM_SH_TIMEX_H
> -
> -/*
> - * Only parts using the legacy CPG code for their clock framework
> - * implementation need to define their own Pclk value. If provided, this
> - * can be used for accurately setting CLOCK_TICK_RATE, otherwise we
> - * simply fall back on the i8253 PIT value.
> - */
> -#ifdef CONFIG_SH_PCLK_FREQ
> -#define CLOCK_TICK_RATE (CONFIG_SH_PCLK_FREQ / 4) /* Underlying HZ */
> -#else
> -#define CLOCK_TICK_RATE 1193180
> -#endif
> -
> -#include <asm-generic/timex.h>
> -
> -#endif /* __ASM_SH_TIMEX_H */
> --- a/arch/sparc/include/asm/timex.h
> +++ b/arch/sparc/include/asm/timex.h
> @@ -4,6 +4,6 @@
> #if defined(__sparc__) && defined(__arch64__)
> #include <asm/timex_64.h>
> #else
> -#include <asm/timex_32.h>
> +#include <asm-generic/timex.h>
> #endif
> #endif
> --- a/arch/sparc/include/asm/timex_32.h
> +++ /dev/null
> @@ -1,14 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -/*
> - * linux/include/asm/timex.h
> - *
> - * sparc architecture timex specifications
> - */
> -#ifndef _ASMsparc_TIMEX_H
> -#define _ASMsparc_TIMEX_H
> -
> -#define CLOCK_TICK_RATE 1193180 /* Underlying HZ */
> -
> -#include <asm-generic/timex.h>
> -
> -#endif
> --- a/arch/sparc/include/asm/timex_64.h
> +++ b/arch/sparc/include/asm/timex_64.h
> @@ -9,8 +9,6 @@
>
> #include <asm/timer.h>
>
> -#define CLOCK_TICK_RATE 1193180 /* Underlying HZ */
> -
> /* Getting on the cycle counter on sparc64. */
> typedef unsigned long cycles_t;
> #define get_cycles() tick_ops->get_tick()
> --- a/arch/um/include/asm/timex.h
> +++ /dev/null
> @@ -1,9 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -#ifndef __UM_TIMEX_H
> -#define __UM_TIMEX_H
> -
> -#define CLOCK_TICK_RATE (HZ)
> -
> -#include <asm-generic/timex.h>
> -
> -#endif
> --- a/arch/x86/include/asm/timex.h
> +++ b/arch/x86/include/asm/timex.h
> @@ -14,9 +14,6 @@ static inline unsigned long random_get_e
> }
> #define random_get_entropy random_get_entropy
>
> -/* Assume we use the PIT time source for the clock tick */
> -#define CLOCK_TICK_RATE PIT_TICK_RATE
> -
> #define ARCH_HAS_READ_CURRENT_TIMER
>
> #endif /* _ASM_X86_TIMEX_H */
>
>
^ permalink raw reply
* Re: [patch 07/38] treewide: Consolidate cycles_t
From: Christophe Leroy (CS GROUP) @ 2026-04-15 6:43 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Arnd Bergmann, x86, Lu Baolu, iommu, Michael Grzeschik, netdev,
linux-wireless, Herbert Xu, linux-crypto, Vlastimil Babka,
linux-mm, David Woodhouse, Bernie Thompson, linux-fbdev,
Theodore Tso, linux-ext4, Andrew Morton, Uladzislau Rezki,
Marco Elver, Dmitry Vyukov, kasan-dev, Andrey Ryabinin,
Thomas Sailer, linux-hams, Jason A. Donenfeld, Richard Henderson,
linux-alpha, Russell King, linux-arm-kernel, Catalin Marinas,
Huacai Chen, loongarch, Geert Uytterhoeven, linux-m68k,
Dinh Nguyen, Jonas Bonn, linux-openrisc, Helge Deller,
linux-parisc, Michael Ellerman, linuxppc-dev, Paul Walmsley,
linux-riscv, Heiko Carstens, linux-s390, David S. Miller,
sparclinux
In-Reply-To: <20260410120318.045532623@kernel.org>
Le 10/04/2026 à 14:19, Thomas Gleixner a écrit :
> Most architectures define cycles_t as unsigned long except:
>
> - x86 requires it to be 64-bit independent of the 32-bit/64-bit build.
>
> - parisc and mips define it as unsigned int
>
> parisc has no real reason to do so as there are only a few usage sites
> which either expand it to a 64-bit value or utilize only the lower
> 32bits.
>
> mips has no real requirement either.
>
> Move the typedef to types.h and provide a config switch to enforce the
> 64-bit type for x86.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> ---
> arch/Kconfig | 4 ++++
> arch/alpha/include/asm/timex.h | 3 ---
> arch/arm/include/asm/timex.h | 1 -
> arch/loongarch/include/asm/timex.h | 2 --
> arch/m68k/include/asm/timex.h | 2 --
> arch/mips/include/asm/timex.h | 2 --
> arch/nios2/include/asm/timex.h | 2 --
> arch/parisc/include/asm/timex.h | 2 --
> arch/powerpc/include/asm/timex.h | 4 +---
> arch/riscv/include/asm/timex.h | 2 --
> arch/s390/include/asm/timex.h | 2 --
> arch/sparc/include/asm/timex_64.h | 1 -
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/tsc.h | 2 --
> include/asm-generic/timex.h | 1 -
> include/linux/types.h | 6 ++++++
> 16 files changed, 12 insertions(+), 25 deletions(-)
>
> --- a/arch/powerpc/include/asm/timex.h
> +++ b/arch/powerpc/include/asm/timex.h
> @@ -11,9 +11,7 @@
> #include <asm/cputable.h>
> #include <asm/vdso/timebase.h>
>
> -typedef unsigned long cycles_t;
> -
> -static inline cycles_t get_cycles(void)
> +ostatic inline cycles_t get_cycles(void)
What is 'ostatic' ?
> {
> return mftb();
> }
^ permalink raw reply
* Re: [patch 33/38] powerpc: Select ARCH_HAS_RANDOM_ENTROPY
From: Christophe Leroy (CS GROUP) @ 2026-04-15 6:47 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Michael Ellerman, linuxppc-dev, Arnd Bergmann, x86, Lu Baolu,
iommu, Michael Grzeschik, netdev, linux-wireless, Herbert Xu,
linux-crypto, Vlastimil Babka, linux-mm, David Woodhouse,
Bernie Thompson, linux-fbdev, Theodore Tso, linux-ext4,
Andrew Morton, Uladzislau Rezki, Marco Elver, Dmitry Vyukov,
kasan-dev, Andrey Ryabinin, Thomas Sailer, linux-hams,
Jason A. Donenfeld, Richard Henderson, linux-alpha, Russell King,
linux-arm-kernel, Catalin Marinas, Huacai Chen, loongarch,
Geert Uytterhoeven, linux-m68k, Dinh Nguyen, Jonas Bonn,
linux-openrisc, Helge Deller, linux-parisc, Paul Walmsley,
linux-riscv, Heiko Carstens, linux-s390, David S. Miller,
sparclinux
In-Reply-To: <20260410120319.789114053@kernel.org>
Le 10/04/2026 à 14:21, Thomas Gleixner a écrit :
> The only remaining usage of get_cycles() is to provide random_get_entropy().
>
> Switch powerpc over to the new scheme of selecting ARCH_HAS_RANDOM_ENTROPY
> and providing random_get_entropy() in asm/random.h.
>
> Remove asm/timex.h as it has no functionality anymore.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
> ---
> arch/powerpc/Kconfig | 1 +
> arch/powerpc/include/asm/random.h | 13 +++++++++++++
> arch/powerpc/include/asm/timex.h | 21 ---------------------
> 3 files changed, 14 insertions(+), 21 deletions(-)
>
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -150,6 +150,7 @@ config PPC
> select ARCH_HAS_PREEMPT_LAZY
> select ARCH_HAS_PTDUMP
> select ARCH_HAS_PTE_SPECIAL
> + select ARCH_HAS_RANDOM_ENTROPY
> select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
> select ARCH_HAS_SET_MEMORY
> select ARCH_HAS_STRICT_KERNEL_RWX if (PPC_BOOK3S || PPC_8xx) && !HIBERNATION
> --- /dev/null
> +++ b/arch/powerpc/include/asm/random.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_POWERPC_RANDOM_H
> +#define _ASM_POWERPC_RANDOM_H
> +
> +#include <asm/cputable.h>
> +#include <asm/vdso/timebase.h>
> +
> +static inline unsigned long random_get_entropy(void)
> +{
> + return mftb();
> +}
> +
> +#endif /* _ASM_POWERPC_RANDOM_H */
> --- a/arch/powerpc/include/asm/timex.h
> +++ b/arch/powerpc/include/asm/timex.h
> @@ -1,21 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -#ifndef _ASM_POWERPC_TIMEX_H
> -#define _ASM_POWERPC_TIMEX_H
> -
> -#ifdef __KERNEL__
> -
> -/*
> - * PowerPC architecture timex specifications
> - */
> -
> -#include <asm/cputable.h>
> -#include <asm/vdso/timebase.h>
> -
> -ostatic inline cycles_t get_cycles(void)
> -{
> - return mftb();
> -}
> -#define get_cycles get_cycles
> -
> -#endif /* __KERNEL__ */
> -#endif /* _ASM_POWERPC_TIMEX_H */
>
>
^ permalink raw reply
* Re: [RFC PATCH] iomap: add fast read path for small direct I/O
From: Christoph Hellwig @ 2026-04-15 7:14 UTC (permalink / raw)
To: Fengnan Chang
Cc: brauner, djwong, linux-xfs, linux-fsdevel, linux-ext4, lidiangang,
Fengnan Chang
In-Reply-To: <20260414122647.15686-1-changfengnan@bytedance.com>
On Tue, Apr 14, 2026 at 08:26:47PM +0800, Fengnan Chang wrote:
> 1. Allocating the `bio` and `struct iomap_dio` together to avoid a
> separate kmalloc. However, because `struct iomap_dio` is relatively
> large and the main path is complex, this yielded almost no
> performance improvement.
One interesting bit here would be a slab for struct iomap_dio, and
the previously discussed per-cpu allocation of some form.
> 2. Reducing unnecessary state resets in the iomap state machine (e.g.,
> skipping `iomap_iter_reset_iomap` where safe). This provided a ~5%
> IOPS boost, which is helpful but still falls far short of closing
> the gap with the raw block device.
But it already is a major improvement, and one that would apply outside
of narrow special cases. So I'd really like to see that patch.
> The fast path is triggered when the request satisfies:
> - Asynchronous READ request only for now.
I think you really should handle synchronous reads as well.
> - I/O size is <= inode blocksize (fits in a single block, no splits).
Makes sense, and I suspect this is the main source of speedups.
> - Aligned to the block device's logical block size.
All direct I/O requires this.
> - No bounce buffering, fscrypt, or fsverity involved.
> - No custom `iomap_dio_ops` (dops) registered by the filesystem.
I'm really curious what difference this makes. It removes a few
branches, but should not have much of an effect while limiting the
applicability a lot.
> After this optimization, the heavy generic functions disappear from the
> profile, replaced by a single streamlined execution path:
> 4.83% [kernel] [k] iomap_dio_fast_read_async.isra.31
>
> With this patch, 4K random read IOPS on ext4 increases from 1.9M to
> 2.3M.
That is still a lot slower than the block device path. A big part of
it should be the extent lookup and locking associated with it, but
I'd expect things to be a bit better. Do you have XFS version as well?
> However, I am submitting this patch to validate whether this
> optimization direction is correct and worth pursuing. I would appreciate
> feedback on how to better integrate these ideas into the main iomap
> execution path.
I think a <= block size fast path makes a lot of sense, just like we
have a simple version on the block device, but it needs more work.
> +struct iomap_dio_fast_read {
> + struct kiocb *iocb;
> + size_t size;
> + bool should_dirty;
> + struct work_struct work;
> + struct bio bio ____cacheline_aligned_in_smp;
Does the cache line alignment matter here? If yes, can you explain why
in a comment?
> +static struct bio_set iomap_dio_fast_read_pool;
In general I'd prefer to stick to simple as in the block device version
instead of fast.
> +static void iomap_dio_fast_read_complete_work(struct work_struct *work)
> +{
> + struct iomap_dio_fast_read *fr =
> + container_of(work, struct iomap_dio_fast_read, work);
> + struct kiocb *iocb = fr->iocb;
> + struct inode *inode = file_inode(iocb->ki_filp);
> + bool should_dirty = fr->should_dirty;
> + struct bio *bio = &fr->bio;
> + ssize_t ret;
> +
> + WRITE_ONCE(iocb->private, NULL);
> +
> + if (likely(!bio->bi_status)) {
> + ret = fr->size;
> + iocb->ki_pos += ret;
> + } else {
> + ret = blk_status_to_errno(bio->bi_status);
> + fserror_report_io(inode, FSERR_DIRECTIO_READ, iocb->ki_pos,
> + fr->size, ret, GFP_NOFS);
> + }
> +
> + if (should_dirty) {
> + bio_check_pages_dirty(bio);
> + } else {
> + bio_release_pages(bio, false);
> + bio_put(bio);
> + }
> +
> + inode_dio_end(inode);
> +
> + trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
> + iocb->ki_complete(iocb, ret);
This is a lot of duplicated code. Can we somehow share it by passing
more arguments or embedding the simple context into the bigger one?
> +static inline bool iomap_dio_fast_read_supported(struct kiocb *iocb,
> + struct iov_iter *iter,
> + unsigned int dio_flags,
> + size_t done_before)
Please stick to two-tab indents for prototype continuations, which is
both more readable and easier to modify later.
> + if (count < bdev_logical_block_size(inode->i_sb->s_bdev))
> + return false;
Sub-sector reads (unlike writes) don't require any special handling, so
I don't see why they are excluded.
> + if (dio_flags & IOMAP_DIO_FSBLOCK_ALIGNED)
> + alignment = i_blocksize(inode);
> + else
> + alignment = bdev_logical_block_size(inode->i_sb->s_bdev);
> +
> + if ((iocb->ki_pos | count) & (alignment - 1))
> + return false;
Factor this into a helper?
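For illustration only, such a helper could look roughly like the untested
userspace sketch below. The name dio_range_aligned() and its signature are
assumptions made up here, not something taken from the patch or from the
iomap code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustrative name and signature): returns true if
 * both the file position and the length are multiples of 'alignment'.
 * 'alignment' must be a power of two, as block sizes are.
 */
static inline bool dio_range_aligned(uint64_t pos, size_t count,
		unsigned int alignment)
{
	/* OR-ing pos and count lets one mask test cover both values. */
	return ((pos | count) & ((uint64_t)alignment - 1)) == 0;
}
```

The caller would pick the alignment once (fs block size or logical block
size) and could then pass the same value on to the page pinning code.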
> + inode_dio_begin(inode);
> +
> + ret = ops->iomap_begin(inode, iomi.pos, count, iomi.flags,
> + &iomi.iomap, &iomi.srcmap);
> + if (ret) {
> + inode_dio_end(inode);
> + return ret;
> + }
If we can I'd much prefer avoiding the open coded iomap_begin
invocation, as that is a real maintenance burden.
> +
> + if (iomi.iomap.type != IOMAP_MAPPED ||
> + iomi.iomap.offset > iomi.pos ||
> + iomi.iomap.offset + iomi.iomap.length < iomi.pos + count ||
> + (iomi.iomap.flags & IOMAP_F_ANON_WRITE)) {
IOMAP_F_ANON_WRITE (as the name implies) only applies to writes.
> + ret = -EAGAIN;
-EAGAIN is a bad status code, as we already use it to indicate that a
non-blocking read would block.
> + ret = bio_iov_iter_get_pages(bio, iter,
> + bdev_logical_block_size(iomi.iomap.bdev) - 1);
Overly long line. Also this needs to use the calculated alignment
value.
> + if (unlikely(ret)) {
> + bio_put(bio);
> + goto out_iomap_end;
> + }
> +
> + if (bio->bi_iter.bi_size != count) {
> + iov_iter_revert(iter, bio->bi_iter.bi_size);
> + bio_release_pages(bio, false);
> + bio_put(bio);
> + ret = -EAGAIN;
> + goto out_iomap_end;
> + }
Share the bio_put with a new goto label, and maybe also move all
the other cleanup code out of the main path into a label?
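As a rough, untested userspace sketch of that shape: struct fake_bio,
fake_get_pages(), simple_read_setup() and SIMPLE_DIO_FALLBACK are all
stand-ins invented for this example, and the fallback value is only a
placeholder, not a suggestion for the real return code:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "not handled here, use the generic path" return value. */
#define SIMPLE_DIO_FALLBACK	1

/* Stand-in for a bio: only tracks what the cleanup labels must undo. */
struct fake_bio {
	size_t size;		/* bytes covered by pinned pages */
	bool pages_pinned;
	int refs;
};

/* Stand-in for the page pinning step: pins 'avail' bytes or fails. */
static int fake_get_pages(struct fake_bio *bio, size_t avail, int err)
{
	if (err)
		return err;
	bio->size = avail;
	bio->pages_pinned = true;
	return 0;
}

static int simple_read_setup(struct fake_bio *bio, size_t count, size_t avail,
		int get_pages_err)
{
	int ret;

	bio->size = 0;
	bio->pages_pinned = false;
	bio->refs = 1;			/* "allocation" holds the only reference */

	ret = fake_get_pages(bio, avail, get_pages_err);
	if (ret)
		goto out_bio_put;

	if (bio->size != count) {
		ret = SIMPLE_DIO_FALLBACK;
		goto out_release_pages;
	}

	return 0;			/* main path: nothing to undo */

out_release_pages:
	bio->pages_pinned = false;	/* bio_release_pages() in the real code */
out_bio_put:
	bio->refs--;			/* bio_put() in the real code */
	return ret;
}
```

The labels fall through in reverse order of setup, so each failure point
only has to name how far it got. The iov_iter_revert() from the patch
would sit under the first label as well.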
> + if (!dops && iomap_dio_fast_read_supported(iocb, iter, dio_flags, done_before)) {
Overly long line. But we should not make the fast path conditional
on an option anyway.
^ permalink raw reply
* [PATCH v2 2/4] ext4: skip cursor node in ext4_orphan_del()
From: Ye Bin @ 2026-04-15 10:55 UTC (permalink / raw)
To: tytso, adilger.kernel, linux-ext4; +Cc: jack
In-Reply-To: <20260415105505.342358-1-yebin@huaweicloud.com>
From: Ye Bin <yebin10@huawei.com>
Prepare for displaying orphan_list information. Temporary cursor nodes
may be inserted into the orphan list while it is traversed and displayed,
so ext4_orphan_del() needs to skip them when looking up the previous node.
Signed-off-by: Ye Bin <yebin10@huawei.com>
---
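Note (not part of the patch): the skipping logic can be pictured with the
minimal userspace sketch below. It assumes a plain circular doubly linked
list instead of the kernel's list_head, and struct node, is_cursor() and
prev_real_node() are illustrative names only. As in the patch, a node with
inode number 0 is treated as a cursor.

```c
#include <stdbool.h>

struct node {
	struct node *prev, *next;
	unsigned long ino;	/* 0 marks a cursor node */
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
	head->ino = 0;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static bool is_cursor(const struct node *n)
{
	return n->ino == 0;
}

/*
 * Walk backwards from 'pos' and return the first node that is not a
 * cursor, or 'head' if only cursors (or nothing) precede 'pos'.
 */
static struct node *prev_real_node(struct node *pos, struct node *head)
{
	struct node *p;

	for (p = pos->prev; p != head; p = p->prev) {
		if (!is_cursor(p))
			return p;
	}
	return head;
}
```

With the list head -> A -> cursor -> cursor -> B, the previous real node
of B is A, and the previous real node of A is the list head itself.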
fs/ext4/orphan.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/orphan.c b/fs/ext4/orphan.c
index f7e7f77e021e..a6bffe67ef75 100644
--- a/fs/ext4/orphan.c
+++ b/fs/ext4/orphan.c
@@ -220,6 +220,23 @@ static int ext4_orphan_file_del(handle_t *handle, struct inode *inode)
return ret;
}
+static inline bool ext4_is_cursor(struct inode *inode)
+{
+ return (inode->i_ino == 0);
+}
+
+static inline struct list_head *ext4_orphan_prev_node(
+ struct ext4_inode_info *pos,
+ struct list_head *head)
+{
+ list_for_each_entry_continue_reverse(pos, head, i_orphan) {
+ if (likely(!ext4_is_cursor(&pos->vfs_inode)))
+ return &pos->i_orphan;
+ }
+
+ return head;
+}
+
/*
* ext4_orphan_del() removes an unlinked or truncated inode from the list
* of such inodes stored on disk, because it is finally being cleaned up.
@@ -253,7 +270,8 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
mutex_lock(&sbi->s_orphan_lock);
ext4_debug("remove inode %llu from orphan list\n", inode->i_ino);
- prev = ei->i_orphan.prev;
+ prev = ext4_orphan_prev_node(ei, &sbi->s_orphan);
+
list_del_init(&ei->i_orphan);
/* If we're on an error path, we may not have a valid
--
2.34.1
^ permalink raw reply related
* [PATCH v2 1/4] ext4: register 'orphan_list' procfs
From: Ye Bin @ 2026-04-15 10:55 UTC (permalink / raw)
To: tytso, adilger.kernel, linux-ext4; +Cc: jack
In-Reply-To: <20260415105505.342358-1-yebin@huaweicloud.com>
From: Ye Bin <yebin10@huawei.com>
Register a '/proc/fs/ext4/XXX/orphan_list' procfs file that shows the
inode orphan list of an ext4 file system.
In production environments, df and du can report inconsistent usage,
sometimes because the kernel itself still holds a deleted file. Such files
are difficult to find, and it is often impractical to debug this on a live
production system. So add the "orphan_list" procfs file to quickly query
files that have been deleted but are still held open.
Signed-off-by: Ye Bin <yebin10@huawei.com>
---
fs/ext4/ext4.h | 1 +
fs/ext4/orphan.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/sysfs.c | 2 ++
3 files changed, 80 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0cf68f85dfd1..ccb0fd1e63e7 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3875,6 +3875,7 @@ extern void ext4_stop_mmpd(struct ext4_sb_info *sbi);
extern const struct fsverity_operations ext4_verityops;
/* orphan.c */
+extern const struct proc_ops ext4_orphan_proc_ops;
extern int ext4_orphan_add(handle_t *, struct inode *);
extern int ext4_orphan_del(handle_t *, struct inode *);
extern void ext4_orphan_cleanup(struct super_block *sb,
diff --git a/fs/ext4/orphan.c b/fs/ext4/orphan.c
index 64ea47624233..f7e7f77e021e 100644
--- a/fs/ext4/orphan.c
+++ b/fs/ext4/orphan.c
@@ -4,6 +4,8 @@
#include <linux/fs.h>
#include <linux/quotaops.h>
#include <linux/buffer_head.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
#include "ext4.h"
#include "ext4_jbd2.h"
@@ -657,3 +659,78 @@ int ext4_orphan_file_empty(struct super_block *sb)
return 0;
return 1;
}
+
+struct ext4_proc_orphan {
+ struct ext4_inode_info cursor;
+};
+
+static void *ext4_orphan_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ return NULL;
+}
+
+static void *ext4_orphan_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ return NULL;
+}
+
+static int ext4_orphan_seq_show(struct seq_file *seq, void *v)
+{
+ return 0;
+}
+
+static void ext4_orphan_seq_stop(struct seq_file *seq, void *v)
+{
+}
+
+const struct seq_operations ext4_orphan_seq_ops = {
+ .start = ext4_orphan_seq_start,
+ .next = ext4_orphan_seq_next,
+ .stop = ext4_orphan_seq_stop,
+ .show = ext4_orphan_seq_show,
+};
+
+static int ext4_seq_orphan_open(struct inode *inode, struct file *file)
+{
+ int rc;
+ struct seq_file *m;
+ struct ext4_proc_orphan *private;
+
+ rc = seq_open_private(file, &ext4_orphan_seq_ops,
+ sizeof(struct ext4_proc_orphan));
+ if (!rc) {
+ m = file->private_data;
+ private = m->private;
+ INIT_LIST_HEAD(&private->cursor.i_orphan);
+ private->cursor.vfs_inode.i_ino = 0;
+ }
+
+ return rc;
+}
+
+static int ext4_seq_orphan_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *seq = file->private_data;
+ struct ext4_proc_orphan *s = seq->private;
+ struct ext4_sb_info *sbi = EXT4_SB(pde_data(inode));
+
+ /*
+ * The function close_pdeo() is called when deleting the procfs
+ * in ext4_unregister_sysfs(), and this function is used to remove
+ * the entry from the 'pde->pde_openers' list. Therefore, when the
+ * file is closed, proc_reg_release() will not call close_pdeo()
+ * again because it cannot find the node on the 'pde->pde_openers'
+ * list. This prevents the UAF issue from occurring.
+ */
+ mutex_lock(&sbi->s_orphan_lock);
+ list_del(&s->cursor.i_orphan);
+ mutex_unlock(&sbi->s_orphan_lock);
+
+ return seq_release_private(inode, file);
+}
+
+const struct proc_ops ext4_orphan_proc_ops = {
+ .proc_open = ext4_seq_orphan_open,
+ .proc_read = seq_read,
+ .proc_release = ext4_seq_orphan_release,
+};
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 923b375e017f..b40a934e30c9 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -639,6 +639,8 @@ int ext4_register_sysfs(struct super_block *sb)
ext4_seq_mb_stats_show, sb);
proc_create_seq_data("mb_structs_summary", 0444, sbi->s_proc,
&ext4_mb_seq_structs_summary_ops, sb);
+ proc_create_data("orphan_list", 0400, sbi->s_proc,
+ &ext4_orphan_proc_ops, sb);
}
return 0;
}
--
2.34.1
^ permalink raw reply related
* [PATCH v2 0/4] show orphan file inode detail info
From: Ye Bin @ 2026-04-15 10:55 UTC (permalink / raw)
To: tytso, adilger.kernel, linux-ext4; +Cc: jack
From: Ye Bin <yebin10@huawei.com>
Diffs v2 vs v1:
(1) Fix sashiko review issues:
https://sashiko.dev/#/patchset/20260403082507.1882703-1-yebin%40huaweicloud.com
(2) Change "orphan_list" file mode from 0444 to 0400;
(3) The display format of the "orphan_list" file is modified according
to Andreas' suggestions.
Fault injection tests have been conducted to address the issues raised
in the sashiko review. There is no UAF issue in ext4_seq_orphan_release();
the reason is explained in the code comments. In addition to the fault
injection tests, we also ran a stress test that reads
/proc/fs/ext4/XX/orphan_list while orphan nodes are concurrently added and
removed, and no issues have been found so far.
In actual production environments, the issue of inconsistency between
df and du is frequently encountered. In many cases, the cause of the
problem can be identified through the use of lsof. However, when
overlayfs is combined with project quota configuration, the issue becomes
more complex and troublesome to diagnose. First, to determine the project
ID, one needs to obtain orphaned nodes using `fsck.ext4 -fn /dev/xx`, and
then retrieve file information through `debugfs`. However, the file names
cannot always be obtained, and it is often unclear which files they are.
To identify which files these are, one would need to use crash for online
debugging or use kprobe to gather information incrementally. However, some
customers in production environments do not agree to upload any tools, and
online debugging might impact the business. There are also scenarios where
files are opened in kernel mode, which do not generate file descriptors (fds),
making it impossible to identify which files were deleted but still have
references through lsof. This patchset adds a procfs interface to query
information about orphaned nodes, which can assist in the analysis and
localization of such issues.
Ye Bin (4):
ext4: register 'orphan_list' procfs
ext4: skip cursor node in ext4_orphan_del()
ext4: show inode orphan list detail information
ext4: show orphan file inode detail info
fs/ext4/ext4.h | 1 +
fs/ext4/orphan.c | 326 ++++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/sysfs.c | 2 +
3 files changed, 328 insertions(+), 1 deletion(-)
--
2.34.1
^ permalink raw reply
* [PATCH v2 3/4] ext4: show inode orphan list detail information
From: Ye Bin @ 2026-04-15 10:55 UTC (permalink / raw)
To: tytso, adilger.kernel, linux-ext4; +Cc: jack
In-Reply-To: <20260415105505.342358-1-yebin@huaweicloud.com>
From: Ye Bin <yebin10@huawei.com>
Some inodes are added to the orphan list because of truncation, while
others are added because of deletion. Therefore, print the following
information for each inode: inode number/i_nlink/i_size/i_blocks/projid/file
path. This information makes it possible to quickly identify files that
have been deleted but are still being referenced.
Signed-off-by: Ye Bin <yebin10@huawei.com>
---
fs/ext4/orphan.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 126 insertions(+), 2 deletions(-)
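Note: a minimal userspace sketch, assuming the line format produced by the
seq_printf() call in this patch, of how a tool might parse one orphan_list
entry. The struct orphan_entry type and the parse_orphan_line() helper are
hypothetical names and are not part of the patch. The sketch also assumes the
path contains no raw double quote, since the escape set passed to
seq_dentry() covers only space, tab, newline and backslash.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record type; the field names are illustrative only. */
struct orphan_entry {
	unsigned long long ino;
	unsigned int nlink;
	unsigned long long size;
	unsigned long long blocks;
	unsigned int projid;
	char path[256];
};

/*
 * Parse one line in the format emitted by ext4_orphan_seq_show().
 * Returns 0 on success, -1 if the line does not match the format.
 */
static int parse_orphan_line(const char *line, struct orphan_entry *e)
{
	int n;

	assert(line && e);
	n = sscanf(line,
		   "ino: %llu, link: %u, size: %llu, blocks: %llu, proj: %u, path: \"%255[^\"]\"",
		   &e->ino, &e->nlink, &e->size, &e->blocks,
		   &e->projid, e->path);
	/* All six fields must have been converted. */
	return n == 6 ? 0 : -1;
}
```

Under these assumptions, an entry with link equal to 0 would correspond to a
file that was deleted while still referenced, and a non-zero link to an inode
placed on the list for another reason such as truncation.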
diff --git a/fs/ext4/orphan.c b/fs/ext4/orphan.c
index a6bffe67ef75..4d6f8c9edaeb 100644
--- a/fs/ext4/orphan.c
+++ b/fs/ext4/orphan.c
@@ -682,23 +682,147 @@ struct ext4_proc_orphan {
struct ext4_inode_info cursor;
};
-static void *ext4_orphan_seq_start(struct seq_file *seq, loff_t *pos)
+static struct inode *ext4_list_next(struct ext4_proc_orphan *s,
+ struct list_head *head,
+ struct list_head *p)
{
+ list_for_each_continue(p, head) {
+ struct ext4_inode_info *ei;
+ struct inode *inode;
+
+ ei = list_entry(p, typeof(*ei), i_orphan);
+ inode = &ei->vfs_inode;
+
+ /*
+ * It is safe to insert a cursor into the orphan list
+ * because ext4_orphan_del() will skip cursor. When the
+ * orphan list is processed in ext4_put_super(),
+ * ext4_seq_orphan_release() must have already been called,
+ * so the cursor must have already been removed from the
+ * orphan list. Therefore, there will be no access to a
+ * stale cursor.
+ */
+ list_move(&s->cursor.i_orphan, &ei->i_orphan);
+
+ /*
+ * Because the cursor has moved to the node after the
+ * current node, the traversal cannot continue from the
+ * current node. Instead, the traversal should continue
+ * from the cursor.
+ */
+ p = &s->cursor.i_orphan;
+
+ if (ext4_is_cursor(inode))
+ continue;
+
+ if (!igrab(inode))
+ continue;
+
+ return inode;
+ }
+
return NULL;
}
+static void *ext4_orphan_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ struct ext4_proc_orphan *s = seq->private;
+ struct super_block *sb = pde_data(file_inode(seq->file));
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct list_head *prev;
+
+ mutex_lock(&sbi->s_orphan_lock);
+
+ if (!*pos) {
+ prev = &sbi->s_orphan;
+ } else {
+ prev = &s->cursor.i_orphan;
+ if (list_empty(prev))
+ return NULL;
+ }
+
+ return ext4_list_next(s, &sbi->s_orphan, prev);
+}
+
static void *ext4_orphan_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
- return NULL;
+ struct super_block *sb = pde_data(file_inode(seq->file));
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_proc_orphan *s = seq->private;
+ struct inode *inode = v;
+
+ ++*pos;
+
+ /*
+ * To prevent the deadlock caused by orphan node deletion when the
+ * last inode reference count is released, the inode reference
+ * count needs to be released in the unlocked state.
+ */
+ mutex_unlock(&sbi->s_orphan_lock);
+ iput(inode);
+ mutex_lock(&sbi->s_orphan_lock);
+
+ return ext4_list_next(s, &sbi->s_orphan, &s->cursor.i_orphan);
+}
+
+static void ext4_show_filename(struct seq_file *seq, struct inode *inode)
+{
+ struct dentry *dentry;
+
+ dentry = d_find_alias(inode);
+ if (!dentry)
+ dentry = d_find_any_alias(inode);
+
+ if (dentry)
+ seq_dentry(seq, dentry, " \t\n\\");
+ else
+ seq_puts(seq, "unknown");
+
+ seq_puts(seq, "\"\n");
+
+ /*
+ * Since igrab() has already been called in ext4_list_next(), the
+ * inode will not be released here, so there will be no deadlock.
+ */
+ dput(dentry);
}
static int ext4_orphan_seq_show(struct seq_file *seq, void *v)
{
+ struct inode *inode = v;
+
+ /*
+ * Print the original data without differentiating namespaces.
+ */
+ seq_printf(seq, "ino: %llu, link: %u, size: %llu, blocks: %llu, proj: %u, path: \"",
+ inode->i_ino, inode->i_nlink,
+ i_size_read(inode), inode->i_blocks,
+ __kprojid_val(EXT4_I(inode)->i_projid));
+
+ ext4_show_filename(seq, inode);
+
return 0;
}
static void ext4_orphan_seq_stop(struct seq_file *seq, void *v)
{
+ struct super_block *sb = pde_data(file_inode(seq->file));
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_proc_orphan *s = seq->private;
+ struct inode *inode = v;
+
+ /*
+ * stop() is called when the cache is full, so the traversal
+ * position needs to be moved back to the front of the current
+ * inode.
+ */
+ if (v)
+ list_move_tail(&s->cursor.i_orphan,
+ &EXT4_I(inode)->i_orphan);
+
+ mutex_unlock(&sbi->s_orphan_lock);
+
+ iput(inode);
}
const struct seq_operations ext4_orphan_seq_ops = {
--
2.34.1
* [PATCH v2 4/4] ext4: show orphan file inode detail info
From: Ye Bin @ 2026-04-15 10:55 UTC (permalink / raw)
To: tytso, adilger.kernel, linux-ext4; +Cc: jack
In-Reply-To: <20260415105505.342358-1-yebin@huaweicloud.com>
From: Ye Bin <yebin10@huawei.com>
Support showing inode information from the orphan file.
Signed-off-by: Ye Bin <yebin10@huawei.com>
---
fs/ext4/orphan.c | 179 ++++++++++++++++++++++++++++++++++++++---------
1 file changed, 146 insertions(+), 33 deletions(-)
diff --git a/fs/ext4/orphan.c b/fs/ext4/orphan.c
index 4d6f8c9edaeb..715d04e386d0 100644
--- a/fs/ext4/orphan.c
+++ b/fs/ext4/orphan.c
@@ -680,6 +680,11 @@ int ext4_orphan_file_empty(struct super_block *sb)
struct ext4_proc_orphan {
struct ext4_inode_info cursor;
+ struct ext4_orphan_info *oi;
+ int inodes_per_ob;
+ int block_idx;
+ int offset;
+ bool orphan_file;
};
static struct inode *ext4_list_next(struct ext4_proc_orphan *s,
@@ -724,24 +729,94 @@ static struct inode *ext4_list_next(struct ext4_proc_orphan *s,
return NULL;
}
+static struct inode *ext4_orphan_file_next(struct ext4_proc_orphan *s,
+ struct super_block *sb)
+{
+ struct inode *inode = NULL;
+ struct ext4_orphan_info *oi = s->oi;
+
+ for (; s->block_idx < oi->of_blocks; s->block_idx++) {
+ int idx = s->block_idx;
+ struct ext4_orphan_block *binfo = &oi->of_binfo[idx];
+ __le32 *bdata = (__le32 *)(binfo->ob_bh->b_data);
+
+ if (atomic_read(&binfo->ob_free_entries) ==
+ s->inodes_per_ob) {
+ s->offset = 0;
+ continue;
+ }
+ for (; s->offset < s->inodes_per_ob; s->offset++) {
+ u64 ino = le32_to_cpu(bdata[s->offset]);
+
+ if (!ino)
+ continue;
+ /*
+ * Orphan nodes in the running state are those
+ * inodes that are still in use, so here we get
+ * them from the cache if available.
+ */
+ inode = ilookup(sb, ino);
+ if (!inode)
+ continue;
+
+ if (!ext4_test_inode_state(inode,
+ EXT4_STATE_ORPHAN_FILE)) {
+ iput(inode);
+ continue;
+ }
+
+ s->offset++;
+ if (s->offset == s->inodes_per_ob) {
+ s->offset = 0;
+ s->block_idx++;
+ }
+ return inode;
+ }
+
+ s->offset = 0;
+ }
+
+ return NULL;
+}
+
static void *ext4_orphan_seq_start(struct seq_file *seq, loff_t *pos)
{
struct ext4_proc_orphan *s = seq->private;
struct super_block *sb = pde_data(file_inode(seq->file));
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct list_head *prev;
+ void *ret;
- mutex_lock(&sbi->s_orphan_lock);
+ if (!s->orphan_file) {
+ mutex_lock(&sbi->s_orphan_lock);
+ if (!*pos)
+ prev = &sbi->s_orphan;
+ else
+ prev = &s->cursor.i_orphan;
- if (!*pos) {
- prev = &sbi->s_orphan;
- } else {
- prev = &s->cursor.i_orphan;
- if (list_empty(prev))
+ /*
+ * Here, the code checks whether the linked list is empty
+ * because when the orphan_file feature is supported, the
+ * cursor is removed from the linked list after the orphan
+ * list is traversed. If the orphan_file feature is not
+ * enabled, calling ext4_orphan_seq_start() again would
+ * cause an infinite loop.
+ */
+ if (!list_empty(prev)) {
+ ret = ext4_list_next(s, &sbi->s_orphan, prev);
+ if (ret)
+ return ret;
+ }
+
+ if (!s->oi)
return NULL;
+
+ list_del_init(&s->cursor.i_orphan);
+ mutex_unlock(&sbi->s_orphan_lock);
+ s->orphan_file = true;
}
- return ext4_list_next(s, &sbi->s_orphan, prev);
+ return ext4_orphan_file_next(s, sb);
}
static void *ext4_orphan_seq_next(struct seq_file *seq, void *v, loff_t *pos)
@@ -750,19 +825,36 @@ static void *ext4_orphan_seq_next(struct seq_file *seq, void *v, loff_t *pos)
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_proc_orphan *s = seq->private;
struct inode *inode = v;
+ void *ret;
++*pos;
- /*
- * To prevent the deadlock caused by orphan node deletion when the
- * last inode reference count is released, the inode reference
- * count needs to be released in the unlocked state.
- */
- mutex_unlock(&sbi->s_orphan_lock);
- iput(inode);
- mutex_lock(&sbi->s_orphan_lock);
+ if (!s->orphan_file) {
+ /*
+ * To prevent the deadlock caused by orphan node deletion
+ * when the last inode reference count is released, the
+ * inode reference count needs to be released in the
+ * unlocked state.
+ */
+ mutex_unlock(&sbi->s_orphan_lock);
+ iput(inode);
+ mutex_lock(&sbi->s_orphan_lock);
+
+ ret = ext4_list_next(s, &sbi->s_orphan,
+ &s->cursor.i_orphan);
+ if (ret)
+ return ret;
+ if (!s->oi)
+ return NULL;
+ list_del_init(&s->cursor.i_orphan);
+ mutex_unlock(&sbi->s_orphan_lock);
+ s->orphan_file = true;
+ } else {
+ iput(inode);
+ }
+
- return ext4_list_next(s, &sbi->s_orphan, &s->cursor.i_orphan);
+ return ext4_orphan_file_next(s, sb);
}
static void ext4_show_filename(struct seq_file *seq, struct inode *inode)
@@ -811,16 +903,28 @@ static void ext4_orphan_seq_stop(struct seq_file *seq, void *v)
struct ext4_proc_orphan *s = seq->private;
struct inode *inode = v;
- /*
- * stop() is called when the cache is full, so the traversal
- * position needs to be moved back to the front of the current
- * inode.
- */
- if (v)
- list_move_tail(&s->cursor.i_orphan,
- &EXT4_I(inode)->i_orphan);
+ if (!s->orphan_file) {
+ /*
+ * stop() is called when the cache is full, so the
+ * traversal position needs to be moved back to the
+ * front of the current inode. If stop() is called due to
+ * EOF, remove the cursor from the orphan list.
+ */
+ if (inode)
+ list_move_tail(&s->cursor.i_orphan,
+ &EXT4_I(inode)->i_orphan);
+ else
+ list_del_init(&s->cursor.i_orphan);
- mutex_unlock(&sbi->s_orphan_lock);
+ mutex_unlock(&sbi->s_orphan_lock);
+ } else if (inode) {
+ if (s->offset) {
+ s->offset--;
+ } else if (s->block_idx) {
+ s->block_idx--;
+ s->offset = s->inodes_per_ob - 1;
+ }
+ }
iput(inode);
}
@@ -836,15 +940,21 @@ static int ext4_seq_orphan_open(struct inode *inode, struct file *file)
{
int rc;
struct seq_file *m;
- struct ext4_proc_orphan *private;
+ struct ext4_proc_orphan *s;
rc = seq_open_private(file, &ext4_orphan_seq_ops,
sizeof(struct ext4_proc_orphan));
if (!rc) {
+ struct super_block *sb = pde_data(file_inode(file));
m = file->private_data;
- private = m->private;
- INIT_LIST_HEAD(&private->cursor.i_orphan);
- private->cursor.vfs_inode.i_ino = 0;
+ s = m->private;
+ INIT_LIST_HEAD(&s->cursor.i_orphan);
+ s->cursor.vfs_inode.i_ino = 0;
+ s->orphan_file = 0;
+ if (ext4_has_feature_orphan_file(sb)) {
+ s->oi = &EXT4_SB(sb)->s_orphan_info;
+ s->inodes_per_ob = ext4_inodes_per_orphan_block(sb);
+ }
}
return rc;
@@ -862,11 +972,14 @@ static int ext4_seq_orphan_release(struct inode *inode, struct file *file)
* the entry from the 'pde->pde_openers' list. Therefore, when the
* file is closed, proc_reg_release() will not call close_pdeo()
* again because it cannot find the node on the 'pde->pde_openers'
- * list. This prevents the UAF issue from occurring.
+ * list. This prevents the UAF issue from occurring. The cursor
+ * may already have been removed in stop().
*/
- mutex_lock(&sbi->s_orphan_lock);
- list_del(&s->cursor.i_orphan);
- mutex_unlock(&sbi->s_orphan_lock);
+ if (!s->orphan_file) {
+ mutex_lock(&sbi->s_orphan_lock);
+ list_del(&s->cursor.i_orphan);
+ mutex_unlock(&sbi->s_orphan_lock);
+ }
return seq_release_private(inode, file);
}
--
2.34.1
* Re: [PATCH v2 0/4] show orphan file inode detail info
From: Jan Kara @ 2026-04-15 17:59 UTC (permalink / raw)
To: Ye Bin; +Cc: tytso, adilger.kernel, linux-ext4, jack
In-Reply-To: <20260415105505.342358-1-yebin@huaweicloud.com>
Hello!
On Wed 15-04-26 18:55:01, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
>
> Diffs v2 vs v1:
> (1) Fix sashiko review issues:
> https://sashiko.dev/#/patchset/20260403082507.1882703-1-yebin%40huaweicloud.com
> (2) Change "orphan_list" file mode from 0444 to 0400;
> (3) The display format of the "orphan_list" file is modified according
> to Andreas' suggestions.
> Fault injection tests have been conducted to address the issues raised
> in the sashiko review. There is no UAF issue in the ext4_seq_orphan_release()
> function. The reason for this has already been explained in the code comments.
> In addition to the fault injection tests, we also performed a stress test by
> observing the /proc/fs/ext4/XX/orphan_list and the concurrent processes of
> adding and removing orphan nodes, and no issues were found so far.
>
>
> In actual production environments, the issue of inconsistency between
> df and du is frequently encountered. In many cases, the cause of the
> problem can be identified through the use of lsof. However, when
> overlayfs is combined with project quota configuration, the issue becomes
> more complex and troublesome to diagnose. First, to determine the project
> ID, one needs to obtain orphaned nodes using `fsck.ext4 -fn /dev/xx`, and
> then retrieve file information through `debugfs`. However, the file names
> cannot always be obtained, and it is often unclear which files they are.
> To identify which files these are, one would need to use crash for online
> debugging or use kprobe to gather information incrementally. However, some
> customers in production environments do not agree to upload any tools, and
> online debugging might impact the business. There are also scenarios where
> files are opened in kernel mode, which do not generate file descriptors (fds),
> making it impossible to identify which files were deleted but still have
> references through lsof. This patchset adds a procfs interface to query
> information about orphaned nodes, which can assist in the analysis and
> localization of such issues.
Ye, did you read my comments on v1 of the patchset [1]? I didn't see
any reply from you. I don't think this is a good way to expose a
filesystem's orphan information, for the reasons I outlined in that email.
Honza
[1] https://lore.kernel.org/all/n4sccudy5avcgnkdhc27rzofzoprxqtwhfrlmsh3yyrj6vbc6d@mmu73gmtawkq/
>
> Ye Bin (4):
> ext4: register 'orphan_list' procfs
> ext4: skip cursor node in ext4_orphan_del()
> ext4: show inode orphan list detail information
> ext4: show orphan file inode detail info
>
> fs/ext4/ext4.h | 1 +
> fs/ext4/orphan.c | 326 ++++++++++++++++++++++++++++++++++++++++++++++-
> fs/ext4/sysfs.c | 2 +
> 3 files changed, 328 insertions(+), 1 deletion(-)
>
> --
> 2.34.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 4/6] generic/765: Ignore mkfs warning
From: Ojaswin Mujoo @ 2026-04-15 18:52 UTC (permalink / raw)
To: Theodore Tso
Cc: Darrick J. Wong, Zorro Lang, fstests, fdmanana, ritesh.list,
naohiro.aota, wqu, Disha Goel, linux-ext4
In-Reply-To: <20260413204215.GA5461@macsyma-wired.lan>
On Mon, Apr 13, 2026 at 04:42:15PM -0400, Theodore Tso wrote:
> > > > > > The output can get corrupted with warnings like below because clustersize
> > > > > > more than 16xbs is experimental:
> > > > > >
> > > > > > + 16 times the block size is considered experimental
> > > > > >
> > > > > > Hence pipe these to seqres.full to avoid false negatives.
>
> You could also suppress the warnings using the -q option, for example:
>
> mke2fs -Fq -t ext4 -O bigalloc,quota -b 4096 -C 131072 /tmp/foo.img 4G
>
> > > Further, mke2fs has multiple instances where we print warnings to stderr,
> > > should we go and fix all of them as well?
> >
> > "stderr" meaning "standard error", I'd say that errors are anything that
> > prohibits the format from completing, and only errors should go there.
>
> Sure, I'll accept those changes. But adding -q will allow the test to
> pass using older versions of e2fsprogs, while still allowing stderr to
> go out the expected output.
Okay cool, I'll send the patches.
thanks,
ojaswin
>
> - Ted
* Re: [RFC PATCH] iomap: add fast read path for small direct I/O
From: Ojaswin Mujoo @ 2026-04-15 19:06 UTC (permalink / raw)
To: Fengnan Chang
Cc: brauner, djwong, linux-xfs, linux-fsdevel, linux-ext4, lidiangang,
Fengnan Chang
In-Reply-To: <20260414122647.15686-1-changfengnan@bytedance.com>
On Tue, Apr 14, 2026 at 08:26:47PM +0800, Fengnan Chang wrote:
> When running 4K random read workloads on high-performance Gen5 NVMe
> SSDs, the software overhead in the iomap direct I/O path
> (__iomap_dio_rw) becomes a significant bottleneck.
>
> Using io_uring with poll mode for a 4K randread test on a raw block
> device:
> taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1
> -n1 -P1 /dev/nvme10n1
> Result: ~3.2M IOPS
>
> Running the exact same workload on ext4 and XFS:
> taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1
> -n1 -P1 /mnt/testfile
> Result: ~1.9M IOPS
Hi Fengnan, interesting optimization!
Which test suite are you using here for the io_uring tests?
>
> Profiling the ext4 workload reveals that a significant portion of CPU
> time is spent on memory allocation and the iomap state machine
> iteration:
> 5.33% [kernel] [k] __iomap_dio_rw
> 3.26% [kernel] [k] iomap_iter
> 2.37% [kernel] [k] iomap_dio_bio_iter
> 2.35% [kernel] [k] kfree
> 1.33% [kernel] [k] iomap_dio_complete
Hmm, a read usually takes only a shared inode lock, and the extent lookup
is shared as well, so ideally we should not be blocking much there. Can you
share a more detailed perf report? I'd be interested to see where in
iomap_iter() you are seeing the regression.
>
> I attempted several incremental optimizations in the __iomap_dio_rw()
> path to close the gap:
> 1. Allocating the `bio` and `struct iomap_dio` together to avoid a
> separate kmalloc. However, because `struct iomap_dio` is relatively
> large and the main path is complex, this yielded almost no
> performance improvement.
> 2. Reducing unnecessary state resets in the iomap state machine (e.g.,
> skipping `iomap_iter_reset_iomap` where safe). This provided a ~5%
> IOPS boost, which is helpful but still falls far short of closing
> the gap with the raw block device.
>
<...>
>
> +static bool iomap_dio_fast_read_enabled = true;
> +
> +struct iomap_dio_fast_read {
> + struct kiocb *iocb;
> + size_t size;
> + bool should_dirty;
> + struct work_struct work;
> + struct bio bio ____cacheline_aligned_in_smp;
As Christoph pointed out, were you seeing any performance loss without
the cacheline alignment? Architectures like powerpc have a 128-byte
cacheline, and we could end up wasting significant space here.
> +};
> +
> +static struct bio_set iomap_dio_fast_read_pool;
> +
> +static void iomap_dio_fast_read_complete_work(struct work_struct *work)
> +{
<...>
> +
> +static inline bool iomap_dio_fast_read_supported(struct kiocb *iocb,
> + struct iov_iter *iter,
> + unsigned int dio_flags,
> + size_t done_before)
> +{
> + struct inode *inode = file_inode(iocb->ki_filp);
> + size_t count = iov_iter_count(iter);
> + unsigned int alignment;
> +
> + if (!iomap_dio_fast_read_enabled)
> + return false;
> + if (iov_iter_rw(iter) != READ)
> + return false;
> +
> + /*
> + * Fast read is an optimization for small IO. Filter out large IO early
> + * as it's the most common case to fail for typical direct IO workloads.
> + */
> + if (count > inode->i_sb->s_blocksize)
> + return false;
> +
> + if (is_sync_kiocb(iocb) || done_before)
Did you try this for sync reads as well? I think we should be seeing
similar benefits with sync reads too. Further, if the fast path helps us
reduce the critical section under inode lock, it could be a good win for
mixed read write workloads.
> + return false;
> + if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_BOUNCE))
> + return false;
> + if (iocb->ki_pos + count > i_size_read(inode))
> + return false;
> + if (IS_ENCRYPTED(inode) || fsverity_active(inode))
> + return false;
> +
> + if (count < bdev_logical_block_size(inode->i_sb->s_bdev))
> + return false;
> +
> + if (dio_flags & IOMAP_DIO_FSBLOCK_ALIGNED)
> + alignment = i_blocksize(inode);
> + else
> + alignment = bdev_logical_block_size(inode->i_sb->s_bdev);
> +
> + if ((iocb->ki_pos | count) & (alignment - 1))
> + return false;
> +
> + return true;
> +}
> +
> +static ssize_t iomap_dio_fast_read_async(struct kiocb *iocb,
<...>
> +static ssize_t fast_read_enable_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + bool enable;
> + int ret;
> +
> + ret = kstrtobool(buf, &enable);
> + if (ret)
> + return ret;
> +
> + iomap_dio_fast_read_enabled = enable;
> + return count;
> +}
> +
> +static struct kobj_attribute fast_read_enable_attr =
> + __ATTR(fast_read_enable, 0644, fast_read_enable_show, fast_read_enable_store);
> +
> +static struct kobject *iomap_kobj;
> +
> +static int __init iomap_dio_sysfs_init(void)
Since we do more than sysfs work here, maybe we can have a more generic
name like iomap_dio_init(void) or iomap_dio_fast/simple_read_init().
> +{
> + int ret;
> +
> + ret = bioset_init(&iomap_dio_fast_read_pool, 4,
> + offsetof(struct iomap_dio_fast_read, bio),
> + BIOSET_NEED_BVECS | BIOSET_PERCPU_CACHE);
> + if (ret)
> + return ret;
> +
> + iomap_kobj = kobject_create_and_add("iomap", fs_kobj);
> + if (!iomap_kobj) {
> + bioset_exit(&iomap_dio_fast_read_pool);
> + return -ENOMEM;
> + }
> +
> + if (sysfs_create_file(iomap_kobj, &fast_read_enable_attr.attr)) {
> + kobject_put(iomap_kobj);
> + bioset_exit(&iomap_dio_fast_read_pool);
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +fs_initcall(iomap_dio_sysfs_init);
> --
Regards,
ojaswin
> 2.39.5 (Apple Git-154)
>
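To illustrate the space concern raised above about
____cacheline_aligned_in_smp, here is a small userspace sketch. It assumes
C11 _Alignas as a stand-in for the kernel attribute, and every type, size
and name in it (demo_bio, DEMO_FAST_READ, PAD_BEFORE_BIO) is hypothetical
rather than taken from the kernel. It shows how the padding in front of an
aligned member grows with the cacheline size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio; the size is illustrative only. */
struct demo_bio { char payload[104]; };

/* Same member order as the quoted struct, with a configurable alignment. */
#define DEMO_FAST_READ(name, align)			\
	struct name {					\
		void *iocb;				\
		size_t size;				\
		_Bool should_dirty;			\
		char work[32];				\
		_Alignas(align) struct demo_bio bio;	\
	}

DEMO_FAST_READ(fast_read_64, 64);	/* e.g. a 64-byte cacheline */
DEMO_FAST_READ(fast_read_128, 128);	/* e.g. a 128-byte cacheline */

/* Bytes of padding inserted between the end of 'work' and 'bio'. */
#define PAD_BEFORE_BIO(type) \
	(offsetof(type, bio) - \
	 (offsetof(type, work) + sizeof(((type *)0)->work)))
```

In this model the bio member starts at offset 64 or 128, so the 128-byte
variant carries 64 more bytes of padding in front of it, and the overall
structure grows from 192 to 256 bytes.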
* Re: [PATCH v1] ext4: add mb_stats_clear for mballoc statistics
From: Ojaswin Mujoo @ 2026-04-15 19:26 UTC (permalink / raw)
To: Baolin Liu
Cc: tytso, adilger.kernel, linux-ext4, linux-kernel, wangguanyu,
Baolin Liu
In-Reply-To: <20260414100212.95209-1-liubaolin12138@163.com>
On Tue, Apr 14, 2026 at 06:02:11PM +0800, Baolin Liu wrote:
> From: Baolin Liu <liubaolin@kylinos.cn>
>
> Add a write-only mb_stats_clear sysfs knob to reset ext4 mballoc
> runtime statistics. This makes it easier to inspect allocator
> activity for a specific workload instead of using counters
> accumulated since mount.
>
> Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
The patch looks good to me, Baolin. We just need to document this in
the Documentation/ABI/testing/sysfs-fs-ext4 file so that users know what
the knob does and that the only value accepted on write is 1.
Regards,
ojaswin
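For reference, a rough userspace model of the input check performed by
mb_stats_clear_store() in the patch: skip leading whitespace, parse an
integer, and accept only the value 1. The function name
mb_stats_clear_check() is hypothetical, and strtol() is used here as an
approximation of kstrtoint(); the two are similar but not identical (for
example, this sketch returns -EINVAL for every failure, including overflow).

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Returns 0 if a write of 'buf' would be accepted and clear the
 * statistics, -EINVAL otherwise.
 */
static int mb_stats_clear_check(const char *buf)
{
	char *end;
	long val;

	assert(buf);
	/* Mirror skip_spaces() in the patch. */
	while (isspace((unsigned char)*buf))
		buf++;

	errno = 0;
	val = strtol(buf, &end, 0);	/* base 0: auto-detect, as in the patch */
	if (end == buf || errno == ERANGE)
		return -EINVAL;
	if (*end == '\n')		/* tolerate the newline that echo appends */
		end++;
	if (*end != '\0')
		return -EINVAL;

	return val == 1 ? 0 : -EINVAL;
}
```

Under this model, "1" with or without a trailing newline is accepted, while
"0", "2" and non-numeric input are rejected.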
* Re: [PATCH v2 2/4] ext4: skip cursor node in ext4_orphan_del()
From: Darrick J. Wong @ 2026-04-15 23:56 UTC (permalink / raw)
To: Ye Bin; +Cc: tytso, adilger.kernel, linux-ext4, jack
In-Reply-To: <20260415105505.342358-3-yebin@huaweicloud.com>
On Wed, Apr 15, 2026 at 06:55:03PM +0800, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
>
> This patch is prepared for displaying orphan_list information. Because
> temporary nodes may be inserted when the orphan_list is traversed and
> displayed, these temporary nodes need to be skipped.
>
> Signed-off-by: Ye Bin <yebin10@huawei.com>
> ---
> fs/ext4/orphan.c | 20 +++++++++++++++++++-
> 1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/orphan.c b/fs/ext4/orphan.c
> index f7e7f77e021e..a6bffe67ef75 100644
> --- a/fs/ext4/orphan.c
> +++ b/fs/ext4/orphan.c
> @@ -220,6 +220,23 @@ static int ext4_orphan_file_del(handle_t *handle, struct inode *inode)
> return ret;
> }
>
> +static inline bool ext4_is_cursor(struct inode *inode)
> +{
> + return (inode->i_ino == 0);
> +}
> +
> +static inline struct list_head *ext4_orphan_prev_node(
> + struct ext4_inode_info *pos,
> + struct list_head *head)
> +{
> + list_for_each_entry_continue_reverse(pos, head, i_orphan) {
> + if (likely(!ext4_is_cursor(&pos->vfs_inode)))
> + return &pos->i_orphan;
> + }
> +
> + return head;
Waitaminute, you inject the procfs file's cursor into the orphan list
with a phony ext4_inode_info?? That sounds like a landmine waiting to
go off the next time someone writes code that traverses the list, or
even wants to check that it's non-empty.
--D
> +}
> +
> /*
> * ext4_orphan_del() removes an unlinked or truncated inode from the list
> * of such inodes stored on disk, because it is finally being cleaned up.
> @@ -253,7 +270,8 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
> mutex_lock(&sbi->s_orphan_lock);
> ext4_debug("remove inode %llu from orphan list\n", inode->i_ino);
>
> - prev = ei->i_orphan.prev;
> + prev = ext4_orphan_prev_node(ei, &sbi->s_orphan);
> +
> list_del_init(&ei->i_orphan);
>
> /* If we're on an error path, we may not have a valid
> --
> 2.34.1
>
>
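The concern about a cursor living in the orphan list can be illustrated with
a small userspace sketch. It uses a minimal doubly linked list as a stand-in
for the kernel's struct list_head, and marks a cursor with ino == 0 as
ext4_is_cursor() does in the quoted patch. All names here (struct node,
naive_empty(), empty_ignoring_cursors()) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for a list node embedded in an inode. */
struct node {
	struct node *prev, *next;
	unsigned long ino;	/* 0 marks a cursor (or the list head) */
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
	head->ino = 0;
}

static void list_add_tail_node(struct node *n, struct node *head)
{
	assert(n && head);
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Naive emptiness check: a lone cursor makes the list look non-empty. */
static bool naive_empty(const struct node *head)
{
	return head->next == head;
}

/* Cursor-aware check: every traverser has to skip ino == 0 nodes. */
static bool empty_ignoring_cursors(const struct node *head)
{
	const struct node *p;

	for (p = head->next; p != head; p = p->next)
		if (p->ino != 0)
			return false;
	return true;
}
```

With only a cursor on the list, naive_empty() reports false even though no
real orphan is present, which is the kind of surprise any later list user
would have to guard against.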
* Re: [PATCH v1] ext4: add mb_stats_clear for mballoc statistics
From: Andreas Dilger @ 2026-04-16 1:14 UTC (permalink / raw)
To: Baolin Liu; +Cc: tytso, linux-ext4, linux-kernel, wangguanyu, Baolin Liu
In-Reply-To: <20260414100212.95209-1-liubaolin12138@163.com>
On Apr 14, 2026, at 04:02, Baolin Liu <liubaolin12138@163.com> wrote:
>
> From: Baolin Liu <liubaolin@kylinos.cn>
>
> Add a write-only mb_stats_clear sysfs knob to reset ext4 mballoc
> runtime statistics. This makes it easier to inspect allocator
> activity for a specific workload instead of using counters
> accumulated since mount.
Rather than having a read-only "mb_stats" procfs file and a separate
write-only "mb_stats_clear" sysfs file to clear "mb_stats", IMHO it
would be more obvious to write directly to "/proc/fs/ext4/DEV/mb_stats"
file to clear it. Writing "0" would be logical to zero out the stats.
Cheers, Andreas
>
> Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
> ---
> fs/ext4/ext4.h | 1 +
> fs/ext4/mballoc.c | 31 +++++++++++++++++++++++++++++++
> fs/ext4/sysfs.c | 24 ++++++++++++++++++++++++
> 3 files changed, 56 insertions(+)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 7617e2d454ea..3a32e1a515dd 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2995,6 +2995,7 @@ int ext4_fc_record_regions(struct super_block *sb, int ino,
> extern const struct seq_operations ext4_mb_seq_groups_ops;
> extern const struct seq_operations ext4_mb_seq_structs_summary_ops;
> extern int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset);
> +extern void ext4_mb_stats_clear(struct ext4_sb_info *sbi);
> extern int ext4_mb_init(struct super_block *);
> extern void ext4_mb_release(struct super_block *);
> extern ext4_fsblk_t ext4_mb_new_blocks(handle_t *,
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index bb58eafb87bc..382c91586b26 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -3219,6 +3219,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
> }
> seq_printf(seq, "\treqs: %u\n", atomic_read(&sbi->s_bal_reqs));
> seq_printf(seq, "\tsuccess: %u\n", atomic_read(&sbi->s_bal_success));
> + seq_printf(seq, "\tblocks_allocated: %u\n",
> + atomic_read(&sbi->s_bal_allocated));
>
> seq_printf(seq, "\tgroups_scanned: %u\n",
> atomic_read(&sbi->s_bal_groups_scanned));
> @@ -4721,6 +4723,35 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
> trace_ext4_mballoc_prealloc(ac);
> }
>
> +void ext4_mb_stats_clear(struct ext4_sb_info *sbi)
> +{
> + int i;
> +
> + atomic_set(&sbi->s_bal_reqs, 0);
> + atomic_set(&sbi->s_bal_success, 0);
> + atomic_set(&sbi->s_bal_allocated, 0);
> + atomic_set(&sbi->s_bal_groups_scanned, 0);
> +
> + for (i = 0; i < EXT4_MB_NUM_CRS; i++) {
> + atomic64_set(&sbi->s_bal_cX_hits[i], 0);
> + atomic64_set(&sbi->s_bal_cX_groups_considered[i], 0);
> + atomic_set(&sbi->s_bal_cX_ex_scanned[i], 0);
> + atomic64_set(&sbi->s_bal_cX_failed[i], 0);
> + }
> +
> + atomic_set(&sbi->s_bal_ex_scanned, 0);
> + atomic_set(&sbi->s_bal_goals, 0);
> + atomic_set(&sbi->s_bal_stream_goals, 0);
> + atomic_set(&sbi->s_bal_len_goals, 0);
> + atomic_set(&sbi->s_bal_2orders, 0);
> + atomic_set(&sbi->s_bal_breaks, 0);
> + atomic_set(&sbi->s_mb_lost_chunks, 0);
> + atomic_set(&sbi->s_mb_buddies_generated, 0);
> + atomic64_set(&sbi->s_mb_generation_time, 0);
> + atomic_set(&sbi->s_mb_preallocated, 0);
> + atomic_set(&sbi->s_mb_discarded, 0);
> +}
> +
> /*
> * Called on failure; free up any blocks from the inode PA for this
> * context. We don't need this for MB_GROUP_PA because we only change
> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> index 923b375e017f..a5bd88a99f22 100644
> --- a/fs/ext4/sysfs.c
> +++ b/fs/ext4/sysfs.c
> @@ -41,6 +41,7 @@ typedef enum {
> attr_pointer_atomic,
> attr_journal_task,
> attr_err_report_sec,
> + attr_mb_stats_clear,
> } attr_id_t;
>
> typedef enum {
> @@ -161,6 +162,25 @@ static ssize_t err_report_sec_store(struct ext4_sb_info *sbi,
> return count;
> }
>
> +static ssize_t mb_stats_clear_store(struct ext4_sb_info *sbi,
> + const char *buf, size_t count)
> +{
> + int val;
> + int ret;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + ret = kstrtoint(skip_spaces(buf), 0, &val);
> + if (ret)
> + return ret;
> + if (val != 1)
> + return -EINVAL;
> +
> + ext4_mb_stats_clear(sbi);
> + return count;
> +}
> +
> static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
> {
> if (!sbi->s_journal)
> @@ -251,6 +271,7 @@ EXT4_ATTR_OFFSET(mb_best_avail_max_trim_order, 0644, mb_order,
> EXT4_ATTR_OFFSET(err_report_sec, 0644, err_report_sec, ext4_sb_info, s_err_report_sec);
> EXT4_RW_ATTR_SBI_UI(inode_goal, s_inode_goal);
> EXT4_RW_ATTR_SBI_UI(mb_stats, s_mb_stats);
> +EXT4_ATTR(mb_stats_clear, 0200, mb_stats_clear);
> EXT4_RW_ATTR_SBI_UI(mb_max_to_scan, s_mb_max_to_scan);
> EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
> EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
> @@ -301,6 +322,7 @@ static struct attribute *ext4_attrs[] = {
> ATTR_LIST(inode_readahead_blks),
> ATTR_LIST(inode_goal),
> ATTR_LIST(mb_stats),
> + ATTR_LIST(mb_stats_clear),
> ATTR_LIST(mb_max_to_scan),
> ATTR_LIST(mb_min_to_scan),
> ATTR_LIST(mb_order2_req),
> @@ -561,6 +583,8 @@ static ssize_t ext4_attr_store(struct kobject *kobj,
> return trigger_test_error(sbi, buf, len);
> case attr_err_report_sec:
> return err_report_sec_store(sbi, buf, len);
> + case attr_mb_stats_clear:
> + return mb_stats_clear_store(sbi, buf, len);
> default:
> return ext4_generic_attr_store(a, sbi, buf, len);
> }
> --
> 2.51.0
>
Cheers, Andreas
* Re: [PATCH] jbd2: enforce power-of-two default revoke hash size at compile time
From: Andreas Dilger @ 2026-04-16 1:35 UTC (permalink / raw)
To: Jan Kara; +Cc: Milos Nikic, tytso, linux-ext4, linux-kernel
In-Reply-To: <cjtirtqozud4keiu2fqi6sj3o4moimloti7celc4y44eejsn2y@i2tz72chmm34>
On Apr 14, 2026, at 06:59, Jan Kara <jack@suse.cz> wrote:
>
> On Mon 13-04-26 14:27:24, Milos Nikic wrote:
>> The jbd2 revoke table relies on bitwise AND operations for fast hash
>> indexing, which requires the hash table size to be a strict power of two.
>>
>> Currently, this requirement is only enforced at runtime via a J_ASSERT
>> in jbd2_journal_init_revoke(). While this successfully catches invalid
>> dynamic allocations, it means a developer accidentally modifying the
>> hardcoded JOURNAL_REVOKE_DEFAULT_HASH macro will experience a system
>> panic upon mounting the filesystem during testing.
>>
>> Add a BUILD_BUG_ON() in journal_init_common() to validate the default
>> macro at compile time. This acts as an immediate, zero-overhead
>> safeguard, preventing compilation entirely if the default hash size is
>> mathematically invalid.
>>
>> Signed-off-by: Milos Nikic <nikic.milos@gmail.com>
>
> Eh, if you modify JOURNAL_REVOKE_DEFAULT_HASH you should better know what
> you are doing and if you mess up, then the kernel failing with assertion
> isn't that difficult to diagnose. So sorry I don't think this "cleanup" is
> useful either.
Jan,
this is a BUILD_BUG_ON() so it won't cause any runtime assertion.
Cheers, Andreas
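As a userspace illustration of the compile-time versus runtime distinction,
here is a sketch that uses C11 _Static_assert as an analogue of
BUILD_BUG_ON(). It assumes mask-based indexing as described in the commit
message; DEMO_REVOKE_HASH_SIZE, IS_POWER_OF_2() and hash_index() are
hypothetical names, and the value 256 is only illustrative.

```c
#include <assert.h>

/* Illustrative value; not claimed to match JOURNAL_REVOKE_DEFAULT_HASH. */
#define DEMO_REVOKE_HASH_SIZE 256

/* Power-of-two test written as a macro so it is a constant expression. */
#define IS_POWER_OF_2(n) ((n) != 0 && (((n) & ((n) - 1)) == 0))

/* Fails the build, not the running program, if the size is invalid. */
_Static_assert(IS_POWER_OF_2(DEMO_REVOKE_HASH_SIZE),
	       "hash size must be a power of two");

/* With a power-of-two size, masking gives the same result as modulo. */
static unsigned int hash_index(unsigned long long block, unsigned int size)
{
	assert(IS_POWER_OF_2(size));	/* runtime check, like J_ASSERT */
	return (unsigned int)(block & (size - 1));
}
```

Changing DEMO_REVOKE_HASH_SIZE to a value such as 100 would stop the
compiler at the _Static_assert line, before any code could run.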
> Honza
>
>> ---
>> fs/jbd2/journal.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
>> index 4f397fcdb13c..62b36a2fc4e2 100644
>> --- a/fs/jbd2/journal.c
>> +++ b/fs/jbd2/journal.c
>> @@ -1565,6 +1565,7 @@ static journal_t *journal_init_common(struct
>> /* The journal is marked for error until we succeed with recovery! */
>> journal->j_flags = JBD2_ABORT;
>>
>> + BUILD_BUG_ON(!is_power_of_2(JOURNAL_REVOKE_DEFAULT_HASH));
>> /* Set up a default-sized revoke table for the new mount. */
>> err = jbd2_journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
>> if (err)
>> --
>> 2.53.0
>>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>
Cheers, Andreas