From mboxrd@z Thu Jan  1 00:00:00 1970
From: Fengnan Chang
To: brauner@kernel.org, djwong@kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org
Cc: lidiangang@bytedance.com, Fengnan Chang
Subject: [RFC PATCH] iomap: add fast read path for small direct I/O
Date: Tue, 14 Apr 2026 20:26:47 +0800
Message-Id: <20260414122647.15686-1-changfengnan@bytedance.com>
X-Mailer: git-send-email 2.39.5 (Apple Git-154)
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

When running 4K random read workloads on high-performance Gen5 NVMe
SSDs, the software overhead in the iomap direct I/O path
(__iomap_dio_rw) becomes a significant bottleneck.

Using io_uring with poll mode for a 4K randread test on a raw block
device:

  taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 /dev/nvme10n1

Result: ~3.2M IOPS

Running the exact same workload on ext4 and XFS:

  taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 /mnt/testfile

Result: ~1.9M IOPS

Profiling the ext4 workload reveals that a significant portion of CPU
time is spent on memory allocation and the iomap state machine
iteration:

  5.33%  [kernel]  [k] __iomap_dio_rw
  3.26%  [kernel]  [k] iomap_iter
  2.37%  [kernel]  [k] iomap_dio_bio_iter
  2.35%  [kernel]  [k] kfree
  1.33%  [kernel]  [k] iomap_dio_complete

I attempted several incremental optimizations in the __iomap_dio_rw()
path to close the gap:

1. Allocating the `bio` and `struct iomap_dio` together to avoid a
   separate kmalloc.
   However, because `struct iomap_dio` is relatively large and the main
   path is complex, this yielded almost no performance improvement.

2. Reducing unnecessary state resets in the iomap state machine (e.g.,
   skipping `iomap_iter_reset_iomap` where safe). This provided a ~5%
   IOPS boost, which is helpful but still falls far short of closing
   the gap with the raw block device.

Since optimizing the heavy generic path did not yield the desired
results for this specific, highly demanding Gen5 SSD scenario, this
RFC patch introduces a dedicated asynchronous fast path. The fast path
is triggered only when the request satisfies all of the following:

- Asynchronous READ requests only, for now.
- I/O size <= inode block size (fits in a single block, no splits).
- Aligned to the block device's logical block size.
- No bounce buffering, fscrypt, or fsverity involved.
- No custom `iomap_dio_ops` (dops) registered by the filesystem.

By using a dedicated bio_set (`iomap_dio_fast_read_pool`) to embed a
much smaller completion state (`struct iomap_dio_fast_read`) directly
in the bio's front padding, we completely eliminate the kmalloc/kfree
pair and drastically shorten the execution path. After this
optimization, the heavy generic functions disappear from the profile,
replaced by a single streamlined execution path:

  4.83%  [kernel]  [k] iomap_dio_fast_read_async.isra.31

With this patch, 4K random read IOPS on ext4 increases from 1.9M to
2.3M.

I am aware that adding a completely separate fast path introduces
duplicate code and may result in iomap_begin being called twice; this
is likely unacceptable for merging in its current form. However, I am
submitting this patch to validate whether this optimization direction
is correct and worth pursuing. I would appreciate feedback on how to
better integrate these ideas into the main iomap execution path.
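For reviewers, the position/length alignment rule above reduces to a
single mask test because logical block sizes are powers of two. A
minimal userspace sketch of that predicate (the function name
`fast_read_aligned` is illustrative, not part of the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the eligibility check used by the fast path: pos and count
 * both qualify only if each is a multiple of the alignment. OR-ing them
 * and masking with (alignment - 1) tests both in one branch, which is
 * why the alignment must be a power of two (as logical block sizes are).
 */
static bool fast_read_aligned(uint64_t pos, uint64_t count,
			      uint64_t alignment)
{
	return ((pos | count) & (alignment - 1)) == 0;
}
```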
Signed-off-by: Fengnan Chang
---
 fs/iomap/direct-io.c | 275 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 275 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index e911daedff65a..e4183f7c2f962 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -5,10 +5,14 @@
  */
 #include
 #include
+#include
 #include
 #include
 #include
 #include
+#include
+#include
+#include
 #include "internal.h"
 #include "trace.h"
@@ -880,12 +884,231 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 }
 EXPORT_SYMBOL_GPL(__iomap_dio_rw);
 
+static bool iomap_dio_fast_read_enabled = true;
+
+struct iomap_dio_fast_read {
+	struct kiocb *iocb;
+	size_t size;
+	bool should_dirty;
+	struct work_struct work;
+	struct bio bio ____cacheline_aligned_in_smp;
+};
+
+static struct bio_set iomap_dio_fast_read_pool;
+
+static void iomap_dio_fast_read_complete_work(struct work_struct *work)
+{
+	struct iomap_dio_fast_read *fr =
+		container_of(work, struct iomap_dio_fast_read, work);
+	struct kiocb *iocb = fr->iocb;
+	struct inode *inode = file_inode(iocb->ki_filp);
+	bool should_dirty = fr->should_dirty;
+	struct bio *bio = &fr->bio;
+	ssize_t ret;
+
+	WRITE_ONCE(iocb->private, NULL);
+
+	if (likely(!bio->bi_status)) {
+		ret = fr->size;
+		iocb->ki_pos += ret;
+	} else {
+		ret = blk_status_to_errno(bio->bi_status);
+		fserror_report_io(inode, FSERR_DIRECTIO_READ, iocb->ki_pos,
+				  fr->size, ret, GFP_NOFS);
+	}
+
+	if (should_dirty) {
+		bio_check_pages_dirty(bio);
+	} else {
+		bio_release_pages(bio, false);
+		bio_put(bio);
+	}
+
+	inode_dio_end(inode);
+
+	trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
+	iocb->ki_complete(iocb, ret);
+}
+
+static void iomap_dio_fast_read_end_io(struct bio *bio)
+{
+	struct iomap_dio_fast_read *fr = bio->bi_private;
+	struct kiocb *iocb = fr->iocb;
+
+	if (unlikely(bio->bi_status)) {
+		struct inode *inode = file_inode(iocb->ki_filp);
+
+		INIT_WORK(&fr->work, iomap_dio_fast_read_complete_work);
+		queue_work(inode->i_sb->s_dio_done_wq, &fr->work);
+		return;
+	}
+
+	iomap_dio_fast_read_complete_work(&fr->work);
+}
+
+static inline bool iomap_dio_fast_read_supported(struct kiocb *iocb,
+						 struct iov_iter *iter,
+						 unsigned int dio_flags,
+						 size_t done_before)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	size_t count = iov_iter_count(iter);
+	unsigned int alignment;
+
+	if (!iomap_dio_fast_read_enabled)
+		return false;
+	if (iov_iter_rw(iter) != READ)
+		return false;
+
+	/*
+	 * Fast read is an optimization for small IO. Filter out large IO early
+	 * as it's the most common case to fail for typical direct IO workloads.
+	 */
+	if (count > inode->i_sb->s_blocksize)
+		return false;
+
+	if (is_sync_kiocb(iocb) || done_before)
+		return false;
+	if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_BOUNCE))
+		return false;
+	if (iocb->ki_pos + count > i_size_read(inode))
+		return false;
+	if (IS_ENCRYPTED(inode) || fsverity_active(inode))
+		return false;
+
+	if (count < bdev_logical_block_size(inode->i_sb->s_bdev))
+		return false;
+
+	if (dio_flags & IOMAP_DIO_FSBLOCK_ALIGNED)
+		alignment = i_blocksize(inode);
+	else
+		alignment = bdev_logical_block_size(inode->i_sb->s_bdev);
+
+	if ((iocb->ki_pos | count) & (alignment - 1))
+		return false;
+
+	return true;
+}
+
+static ssize_t iomap_dio_fast_read_async(struct kiocb *iocb,
+					 struct iov_iter *iter,
+					 const struct iomap_ops *ops,
+					 void *private)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	size_t count = iov_iter_count(iter);
+	int nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
+	bool should_dirty = user_backed_iter(iter);
+	struct iomap_dio_fast_read *fr;
+	struct iomap_iter iomi = {
+		.inode = inode,
+		.pos = iocb->ki_pos,
+		.len = count,
+		.flags = IOMAP_DIRECT,
+		.private = private,
+	};
+	struct bio *bio;
+	ssize_t ret;
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		iomi.flags |= IOMAP_NOWAIT;
+
+	ret = kiocb_write_and_wait(iocb, count);
+	if (ret)
+		return ret;
+
+	inode_dio_begin(inode);
+
+	ret = ops->iomap_begin(inode, iomi.pos, count, iomi.flags,
+			       &iomi.iomap, &iomi.srcmap);
+	if (ret) {
+		inode_dio_end(inode);
+		return ret;
+	}
+
+	if (iomi.iomap.type != IOMAP_MAPPED ||
+	    iomi.iomap.offset > iomi.pos ||
+	    iomi.iomap.offset + iomi.iomap.length < iomi.pos + count ||
+	    (iomi.iomap.flags & IOMAP_F_ANON_WRITE)) {
+		ret = -EAGAIN;
+		goto out_iomap_end;
+	}
+
+	if (!inode->i_sb->s_dio_done_wq) {
+		ret = sb_init_dio_done_wq(inode->i_sb);
+		if (ret < 0)
+			goto out_iomap_end;
+	}
+
+	trace_iomap_dio_rw_begin(iocb, iter, 0, 0);
+
+	bio = bio_alloc_bioset(iomi.iomap.bdev, nr_pages,
+			       REQ_OP_READ | REQ_SYNC | REQ_IDLE,
+			       GFP_KERNEL, &iomap_dio_fast_read_pool);
+	fr = container_of(bio, struct iomap_dio_fast_read, bio);
+	fr->iocb = iocb;
+	fr->should_dirty = should_dirty;
+
+	bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
+	bio->bi_ioprio = iocb->ki_ioprio;
+	bio->bi_private = fr;
+	bio->bi_end_io = iomap_dio_fast_read_end_io;
+
+	ret = bio_iov_iter_get_pages(bio, iter,
+			bdev_logical_block_size(iomi.iomap.bdev) - 1);
+	if (unlikely(ret)) {
+		bio_put(bio);
+		goto out_iomap_end;
+	}
+
+	if (bio->bi_iter.bi_size != count) {
+		iov_iter_revert(iter, bio->bi_iter.bi_size);
+		bio_release_pages(bio, false);
+		bio_put(bio);
+		ret = -EAGAIN;
+		goto out_iomap_end;
+	}
+
+	fr->size = bio->bi_iter.bi_size;
+
+	if (should_dirty)
+		bio_set_pages_dirty(bio);
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		bio->bi_opf |= REQ_NOWAIT;
+	if (iocb->ki_flags & IOCB_HIPRI) {
+		bio->bi_opf |= REQ_POLLED;
+		bio_set_polled(bio, iocb);
+		WRITE_ONCE(iocb->private, bio);
+	}
+	submit_bio(bio);
+
+	if (ops->iomap_end)
+		ops->iomap_end(inode, iomi.pos, count, count, iomi.flags,
+			       &iomi.iomap);
+	return -EIOCBQUEUED;
+
+out_iomap_end:
+	if (ops->iomap_end)
+		ops->iomap_end(inode, iomi.pos, count, 0, iomi.flags,
+			       &iomi.iomap);
+	inode_dio_end(inode);
+	return ret;
+}
+
 ssize_t
 iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags, void *private, size_t done_before)
 {
 	struct iomap_dio *dio;
+	ssize_t ret;
+
+	if (!dops && iomap_dio_fast_read_supported(iocb, iter, dio_flags, done_before)) {
+		ret = iomap_dio_fast_read_async(iocb, iter, ops, private);
+		if (ret != -EAGAIN)
+			return ret;
+	}
 
 	dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, private,
 			     done_before);
@@ -894,3 +1117,55 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	return iomap_dio_complete(dio);
 }
 EXPORT_SYMBOL_GPL(iomap_dio_rw);
+
+static ssize_t fast_read_enable_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", iomap_dio_fast_read_enabled);
+}
+
+static ssize_t fast_read_enable_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	bool enable;
+	int ret;
+
+	ret = kstrtobool(buf, &enable);
+	if (ret)
+		return ret;
+
+	iomap_dio_fast_read_enabled = enable;
+	return count;
+}
+
+static struct kobj_attribute fast_read_enable_attr =
+	__ATTR(fast_read_enable, 0644, fast_read_enable_show,
+	       fast_read_enable_store);
+
+static struct kobject *iomap_kobj;
+
+static int __init iomap_dio_sysfs_init(void)
+{
+	int ret;
+
+	ret = bioset_init(&iomap_dio_fast_read_pool, 4,
+			  offsetof(struct iomap_dio_fast_read, bio),
+			  BIOSET_NEED_BVECS | BIOSET_PERCPU_CACHE);
+	if (ret)
+		return ret;
+
+	iomap_kobj = kobject_create_and_add("iomap", fs_kobj);
+	if (!iomap_kobj) {
+		bioset_exit(&iomap_dio_fast_read_pool);
+		return -ENOMEM;
+	}
+
+	if (sysfs_create_file(iomap_kobj, &fast_read_enable_attr.attr)) {
+		kobject_put(iomap_kobj);
+		bioset_exit(&iomap_dio_fast_read_pool);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+fs_initcall(iomap_dio_sysfs_init);
-- 
2.39.5 (Apple Git-154)
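As background on the front-padding trick the patch relies on: bioset
front padding places the per-I/O completion state immediately before
the embedded bio, so one mempool allocation serves both, and the state
is recovered from the bio pointer with container_of(). A standalone
userspace sketch of the layout and recovery (struct and function names
here are illustrative, not the patch's):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace re-definition of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct fake_bio {
	int status;		/* stand-in for bio->bi_status */
};

/*
 * Mirrors the shape of struct iomap_dio_fast_read: the completion
 * state sits in front, the "bio" is the last member, so the bioset's
 * front_pad (offsetof(..., bio)) covers exactly the state fields.
 */
struct fast_read_state {
	long size;		/* bytes transferred, filled at completion */
	struct fake_bio bio;	/* embedded bio, handed to the block layer */
};

/* What the end_io handler does: recover the state from the bio pointer. */
static struct fast_read_state *state_from_bio(struct fake_bio *bio)
{
	return container_of(bio, struct fast_read_state, bio);
}
```

The same arithmetic is what makes `container_of(bio, struct
iomap_dio_fast_read, bio)` in the patch valid after
`bio_alloc_bioset()` returns a bio carved out of the front-padded pool.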