From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 21 Apr 2026 09:59:37 +1000
From: Dave Chinner
To: changfengnan
Cc: Ojaswin Mujoo, Fengnan Chang, brauner@kernel.org, djwong@kernel.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-ext4@vger.kernel.org, lidiangang@bytedance.com
Subject: Re: [RFC PATCH] iomap: add fast read path for small direct I/O
Message-ID:
References: <20260414122647.15686-1-changfengnan@bytedance.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:

On Thu, Apr 16, 2026 at 11:22:08AM +0800, changfengnan wrote:
> This is a 4k randread test with QD 512 in io_uring poll mode.
> If you use fio, it looks roughly like this, but ./t/io_uring can
> get higher IOPS.
> fio \
>   --name=io_uring_test \
>   --ioengine=io_uring \
>   --filename=/mnt/testfile \
>   --direct=1 \
>   --rw=randread \
>   --bs=4096 \
>   --iodepth=512 \
>   --iodepth_batch_submit=32 \
>   --iodepth_batch_complete_min=32 \
>   --hipri=1 \
>   --fixedbufs=1 \
>   --registerfiles=1 \
>   --nonvectored=1 \
>   --sqthread_poll=1

Ok, given the way fio works, the iodepth batching will result in
the code submitting repeated batches of 32 read IO submissions in a
single 'syscall'. If you change the size of this batch, how does it
change the performance of both the vanilla and patched IO paths?

i.e. does this optimisation provide a benefit over a range of IO
submission patterns, or is it only evident when the CPU is running
an io_uring microbenchmark and userspace is doing no real work on
the IO buffers being submitted?

Also, 'fixedbufs=1' leads me to believe that this is using the same
set of buffer memory for all IOs, and hence we've probably got a
cache-hot data set here.
Hence: is userspace reading the buffers at IO completion (i.e.
emulating the application actually consuming the data that is being
read from the disk), or are they remaining untouched by userspace
and immediately reused for the next IO submission batch?

> > > Profiling the ext4 workload reveals that a significant portion of CPU
> > > time is spent on memory allocation and the iomap state machine
> > > iteration:
> > >   5.33%  [kernel]  [k] __iomap_dio_rw
> > >   3.26%  [kernel]  [k] iomap_iter
> > >   2.37%  [kernel]  [k] iomap_dio_bio_iter
> > >   2.35%  [kernel]  [k] kfree
> > >   1.33%  [kernel]  [k] iomap_dio_complete
> >
> > Hmm, reads are usually under a shared lock for the inode as well
> > as the extent lookup, so we should ideally not be blocking too
> > much there. Can you share a more detailed perf report? I'd be
> > interested to see where in iomap_iter() you are seeing the
> > regression.
> Are the flame graph images enough? I've attached them.
> ext4_poll_7.svg is without this patch, iomap_fast.svg is with this patch.

I've had a look at them, and the biggest change in CPU usage is
that bio_alloc_bioset() disappears from the graph. In the vanilla
kernel, that accounts for 6.05% of the CPU samples. Let's put this
in a table:

function                    vanilla  patched   saved
-------------------------   -------  -------  ------
ext4_file_read_iter           54.75    46.85   -7.90
iomap_dio_rw                  49.21    40.69   -8.52
----
bio_alloc_bioset               6.05     1.77   -4.28
iomap_dio_bio_iter            25.44
iomap_iter                    15.02
iomap_dio_fast_read_async              39.82
(subtotals)                   46.51    41.59   -4.99
----
bio_alloc_bioset               6.05     1.77   -4.28
bio_init                       4.52     0.00   -4.52

More than 50% of the difference in CPU usage between the two code
paths comes from bio_init() overhead alone. That makes no sense to
me. The fast path still requires bios to be allocated and have
bio_init() called on them, and we are doing many more of those
calls every second. Why is this overhead not showing up in the fast
path profile -at all-?
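For what it's worth, the 'saved' column in that table is just the
per-symbol delta between the two flame graphs. A throwaway sketch of
computing it mechanically (the file names vanilla.txt/patched.txt and
the simplified "percent symbol" dump format are made up here; the
numbers are the ones quoted in the table, not fresh measurements):

```shell
# Write two stand-in per-symbol CPU% dumps, one per kernel.
# Real input would come from the flame graphs / perf report output.
cat > vanilla.txt <<'EOF'
54.75 ext4_file_read_iter
49.21 iomap_dio_rw
6.05 bio_alloc_bioset
4.52 bio_init
EOF
cat > patched.txt <<'EOF'
46.85 ext4_file_read_iter
40.69 iomap_dio_rw
1.77 bio_alloc_bioset
0.00 bio_init
EOF

# First pass loads the vanilla percentages keyed by symbol name;
# second pass prints vanilla, patched and the delta for each symbol.
awk 'NR==FNR { v[$2] = $1; next }
     { printf "%-25s %7.2f %7.2f %7.2f\n", $2, v[$2], $1, $1 - v[$2] }' \
    vanilla.txt patched.txt
```

That reproduces the -4.28 / -4.52 deltas for bio_alloc_bioset() and
bio_init() above, and makes it easy to re-run the comparison if new
profiles (e.g. XFS ones) get posted.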
> > > I attempted several incremental optimizations in the __iomap_dio_rw()
> > > path to close the gap:
> > > 1. Allocating the `bio` and `struct iomap_dio` together to avoid a
> > >    separate kmalloc. However, because `struct iomap_dio` is relatively
> > >    large and the main path is complex, this yielded almost no
> > >    performance improvement.

Yet this is exactly what you do in the fast path. Why did it not
provide any improvement for the existing code when it is an implied
beneficial optimisation for the new fast path? I'm clearly missing
something here.

I'm trying to work out why the profiles show what they do, but
there are differences between them that do not make obvious sense
to me.

It would also be useful to have XFS profiles, because XFS has a
larger CPU cache footprint than ext4. If what the profiles are
showing is a result of CPU cache residency artifacts, then we'll
see different profile (and, potentially, performance) artifacts
with XFS...

-Dave.
-- 
Dave Chinner
dgc@kernel.org