* Re: [PATCH] the dm-loop target
[not found] <7d6ae2c9-df8e-50d0-7ad6-b787cb3cfab4@redhat.com>
@ 2025-03-03 13:59 ` Christoph Hellwig
[not found] ` <CAM23VxprhJgOPfhxQf6QNWzHd6+-ZwbjSo-oMHCD2WDQiKntMg@mail.gmail.com>
2025-03-03 16:16 ` Mikulas Patocka
0 siblings, 2 replies; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-03 13:59 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Jooyung Han, Alasdair Kergon, Mike Snitzer, zkabelac, dm-devel,
linux-block, linux-fsdevel
On Mon, Mar 03, 2025 at 11:24:27AM +0100, Mikulas Patocka wrote:
> This is the dm-loop target - a replacement for the regular loop driver
> with better performance. The dm-loop target builds a map of the file in
> the constructor and it just remaps bios according to this map.
Using ->bmap is broken and a no-go for new code. If you have any real
performance issues with the loop driver document and report them so that
they can be fixed instead of working around them by duplicating the code
(and in this case making the new code completely broken).
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
[not found] ` <CAM23VxprhJgOPfhxQf6QNWzHd6+-ZwbjSo-oMHCD2WDQiKntMg@mail.gmail.com>
@ 2025-03-03 15:13 ` Christoph Hellwig
2025-03-03 15:22 ` Matthew Wilcox
[not found] ` <CAM23VxprSduDDK8qvLVkUt9WWmLMPFjhqKB8X4e6gw7Wv-6R2w@mail.gmail.com>
0 siblings, 2 replies; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-03 15:13 UTC (permalink / raw)
To: Heinz Mauelshagen
Cc: Christoph Hellwig, Mikulas Patocka, Jooyung Han, Alasdair Kergon,
Mike Snitzer, zkabelac, dm-devel, linux-block, linux-fsdevel
On Mon, Mar 03, 2025 at 03:57:23PM +0100, Heinz Mauelshagen wrote:
> dm-loop avoids the file system abstraction, thus gains the resulting
> performance advantage versus the loop driver.
How?
> dm-loop obviously requires full provisioning, thus sparse files are being
> detected and error handled.
>
> Additionally, block relocation on CoW filesystems has to be prevented.
> dm-loop uses S_SWAPFILE to do so but that flag's limited to swap semantics
> and is overloaded as such.
>
> Should we avoid the flag and enforce use of non-CoW filesystems for backing
> or checks on non-CoW files (inode->i_flags)?
No, ->bmap is fundamentally flawed. No new code should be using it, and
we need to move the places still using it (most notably swap and the md
bitmap code) off it. It can't deal with any kind of changes to the file
system topology and is a risk to data corruption because if used in the
I/O path it bypasses the file system entirely.. If it wasn't we'd use it
in the loop driver.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-03 15:13 ` Christoph Hellwig
@ 2025-03-03 15:22 ` Matthew Wilcox
2025-03-03 15:31 ` Christoph Hellwig
[not found] ` <CAM23VxprSduDDK8qvLVkUt9WWmLMPFjhqKB8X4e6gw7Wv-6R2w@mail.gmail.com>
1 sibling, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2025-03-03 15:22 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Heinz Mauelshagen, Mikulas Patocka, Jooyung Han, Alasdair Kergon,
Mike Snitzer, zkabelac, dm-devel, linux-block, linux-fsdevel
On Mon, Mar 03, 2025 at 07:13:49AM -0800, Christoph Hellwig wrote:
> No, ->bmap is fundamentally flawed. No new code should be using it, and
> we need to move the places still using it (most notably swap and the md
> bitmap code) off it. It can't deal with any kind of changes to the file
> system topology and is a risk to data corruption because if used in the
> I/O path it bypasses the file system entirely.. If it wasn't we'd use it
> in the loop driver.
Funnily, I was looking at the md bitmap code too. It's the last part of
the kernel using buffer heads with not-a-pagecache page, so it's the
only caller of alloc_page_buffers() remaining.
I think it should just use the page cache to read/write the file data,
but I haven't looked into it in detail.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-03 15:22 ` Matthew Wilcox
@ 2025-03-03 15:31 ` Christoph Hellwig
0 siblings, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-03 15:31 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Heinz Mauelshagen, Mikulas Patocka,
Jooyung Han, Alasdair Kergon, Mike Snitzer, zkabelac, dm-devel,
linux-block, linux-fsdevel
On Mon, Mar 03, 2025 at 03:22:48PM +0000, Matthew Wilcox wrote:
> Funnily, I was looking at the md bitmap code too. It's the last part of
> the kernel using buffer heads with not-a-pagecache page, so it's the
> only caller of alloc_page_buffers() remaining.
>
> I think it should just use the page cache to read/write the file data,
> but I haven't looked into it in detail.
The md bitmap code is awful as it abuses it's own buffer heads in a
way that is non-coherent with the underlying fs. It should just be
using the vfs read/write helpers for in-kernel direct I/O with a
scoped nofs context.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-03 13:59 ` [PATCH] the dm-loop target Christoph Hellwig
[not found] ` <CAM23VxprhJgOPfhxQf6QNWzHd6+-ZwbjSo-oMHCD2WDQiKntMg@mail.gmail.com>
@ 2025-03-03 16:16 ` Mikulas Patocka
2025-03-03 17:24 ` Christoph Hellwig
1 sibling, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2025-03-03 16:16 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen,
zkabelac, dm-devel, linux-block, linux-fsdevel
On Mon, 3 Mar 2025, Christoph Hellwig wrote:
> On Mon, Mar 03, 2025 at 11:24:27AM +0100, Mikulas Patocka wrote:
> > This is the dm-loop target - a replacement for the regular loop driver
> > with better performance. The dm-loop target builds a map of the file in
> > the constructor and it just remaps bios according to this map.
>
> Using ->bmap is broken and a no-go for new code.
What should I use instead of bmap? Is fiemap exported for use in the
kernel?
ext4_bmap flushes the journal if journaled data are used. Is there some
equivalent function that would provide the same guarantee w.r.t. journaled
data?
> If you have any real
> performance issues with the loop driver document and report them so that
> they can be fixed instead of working around them by duplicating the code
> (and in this case making the new code completely broken).
Would Jens Axboe agree to merge the dm-loop logic into the existing loop
driver?
Dm-loop is significantly faster than the regular loop:
# modprobe brd rd_size=1048576
# dd if=/dev/zero of=/dev/ram0 bs=1048576
# mkfs.ext4 /dev/ram0
# mount -t ext4 /dev/ram0 /mnt/test
# dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512
dm-loop (on /mnt/test/test):
# dmsetup create loop --table '0 1048576 loop /mnt/test/test 0'
# fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=16
--ioengine=psync --iodepth=1 --group_reporting=1
--filename=/dev/mapper/loop --name=job --rw=rw
READ: bw=2428MiB/s (2546MB/s), 2428MiB/s-2428MiB/s (2546MB/s-2546MB/s), io=23.7GiB (25.5GB), run=10001-10001msec
WRITE: bw=2429MiB/s (2547MB/s), 2429MiB/s-2429MiB/s (2547MB/s-2547MB/s), io=23.7GiB (25.5GB), run=10001-10001msec
regular loop (on /mnt/test/test):
# losetup /dev/loop0 /mnt/test/test
# fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=16
--ioengine=psync --iodepth=1 --group_reporting=1 --filename=/dev/loop0
--name=job --rw=rw
READ: bw=326MiB/s (342MB/s), 326MiB/s-326MiB/s (342MB/s-342MB/s), io=3259MiB (3417MB), run=10003-10003msec
WRITE: bw=326MiB/s (342MB/s), 326MiB/s-326MiB/s (342MB/s-342MB/s), io=3260MiB (3418MB), run=10003-10003msec
dm-loop is even slightly faster than running fio directly on the regular
file:
# fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=16
--ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test/test
--name=job --rw=rw
READ: bw=2005MiB/s (2103MB/s), 2005MiB/s-2005MiB/s (2103MB/s-2103MB/s), io=19.6GiB (21.0GB), run=10002-10002msec
WRITE: bw=2007MiB/s (2104MB/s), 2007MiB/s-2007MiB/s (2104MB/s-2104MB/s), io=19.6GiB (21.0GB), run=10002-10002msec
Mikulas
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-03 16:16 ` Mikulas Patocka
@ 2025-03-03 17:24 ` Christoph Hellwig
2025-03-03 21:03 ` Mikulas Patocka
0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-03 17:24 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon,
Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block,
linux-fsdevel
On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote:
> What should I use instead of bmap? Is fiemap exported for use in the
> kernel?
You can't do an ahead of time mapping. It's a broken concept.
> Would Jens Axboe agree to merge the dm-loop logic into the existing loop
> driver?
What logic?
> Dm-loop is significantly faster than the regular loop:
>
> # modprobe brd rd_size=1048576
> # dd if=/dev/zero of=/dev/ram0 bs=1048576
> # mkfs.ext4 /dev/ram0
> # mount -t ext4 /dev/ram0 /mnt/test
> # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512
All of this needs to be in a commit log. Also note that the above:
a) does not use direct I/O which any sane loop user should
b) is not on a file but a block device, rendering the use of a loop
device entirely pointless.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
[not found] ` <CAM23VxprSduDDK8qvLVkUt9WWmLMPFjhqKB8X4e6gw7Wv-6R2w@mail.gmail.com>
@ 2025-03-03 17:24 ` Christoph Hellwig
[not found] ` <CAM23Vxoxyrf9nwJd1Xe8uncAPiyK8yaNZNsugwX8p=qo1n6yVg@mail.gmail.com>
0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-03 17:24 UTC (permalink / raw)
To: Heinz Mauelshagen
Cc: Christoph Hellwig, Mikulas Patocka, Jooyung Han, Alasdair Kergon,
Mike Snitzer, zkabelac, dm-devel, linux-block, linux-fsdevel
On Mon, Mar 03, 2025 at 05:06:16PM +0100, Heinz Mauelshagen wrote:
> On Mon, Mar 3, 2025 at 4:24 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> > On Mon, Mar 03, 2025 at 03:57:23PM +0100, Heinz Mauelshagen wrote:
> > > dm-loop avoids the file system abstraction, thus gains the resulting
> > > performance advantage versus the loop driver.
> >
> > How?
> >
>
> It uses a predefined map and thus avoids VFS -> FS write path overhead
> completelly.
But you simply can't use a "pre-defined map". It's fundamentally
unsafe.
> The inode fiemap() operation would avoid issues relative to outdated bmap().
No, using that for I/O is even worse.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-03 17:24 ` Christoph Hellwig
@ 2025-03-03 21:03 ` Mikulas Patocka
2025-03-04 2:13 ` Dave Chinner
2025-03-04 13:49 ` Christoph Hellwig
0 siblings, 2 replies; 45+ messages in thread
From: Mikulas Patocka @ 2025-03-03 21:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer,
Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel
On Mon, 3 Mar 2025, Christoph Hellwig wrote:
> On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote:
> > What should I use instead of bmap? Is fiemap exported for use in the
> > kernel?
>
> You can't do an ahead of time mapping. It's a broken concept.
Swapfile does ahead of time mapping. And I just looked at what swapfile
does and copied the logic into dm-loop. If swapfile is not broken, how
could dm-loop be broken?
> > Would Jens Axboe agree to merge the dm-loop logic into the existing loop
> > driver?
>
> What logic?
The ahead-of-time mapping.
> > Dm-loop is significantly faster than the regular loop:
> >
> > # modprobe brd rd_size=1048576
> > # dd if=/dev/zero of=/dev/ram0 bs=1048576
> > # mkfs.ext4 /dev/ram0
> > # mount -t ext4 /dev/ram0 /mnt/test
> > # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512
>
> All of this needs to be in a commit log. Also note that the above:
>
> a) does not use direct I/O which any sane loop user should
> b) is not on a file but a block device, rendering the use of a loop
> device entirely pointless.
With "losetup --direct-io=on /dev/loop0 /mnt/test/test", it is even slower
than without:
READ: bw=217MiB/s (227MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s), io=2170MiB (2275MB), run=10003-10003msec
WRITE: bw=217MiB/s (227MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s), io=2169MiB (2274MB), run=10003-10003msec
with --direct-io=off
READ: bw=398MiB/s (417MB/s), 398MiB/s-398MiB/s (417MB/s-417MB/s), io=3978MiB (4171MB), run=10003-10003msec
WRITE: bw=398MiB/s (417MB/s), 398MiB/s-398MiB/s (417MB/s-417MB/s), io=3981MiB (4175MB), run=10003-10003msec
Mikulas
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-03 21:03 ` Mikulas Patocka
@ 2025-03-04 2:13 ` Dave Chinner
2025-03-04 11:18 ` Mikulas Patocka
2025-03-04 13:49 ` Christoph Hellwig
1 sibling, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2025-03-04 2:13 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon,
Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block,
linux-fsdevel
On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote:
>
>
> On Mon, 3 Mar 2025, Christoph Hellwig wrote:
>
> > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote:
> > > What should I use instead of bmap? Is fiemap exported for use in the
> > > kernel?
> >
> > You can't do an ahead of time mapping. It's a broken concept.
>
> Swapfile does ahead of time mapping. And I just looked at what swapfile
> does and copied the logic into dm-loop. If swapfile is not broken, how
> could dm-loop be broken?
Swap files cannot be accessed/modified by user code once the
swapfile is activated. See all the IS_SWAPFILE() checked throughout
the VFS and filesystem code.
Swap files must be fully allocated (i.e. not sparse), nor contan
shared extents. This is required so that writes to the swapfile do
not require block allocation which would change the mapping...
Hence we explicitly prevent modification of the underlying file
mapping once a swapfile is owned and mapped by the kernel as a
swapfile.
That's not how loop devices/image files work - we actually rely on
them being:
a) sparse; and
b) the mapping being mutable via direct access to the loop file
whilst there is an active mounted filesystem on that loop file.
and so every IO needs to be mapped through the filesystem at
submission time.
The reason for a) is obvious: we don't need to allocate space for
the filesystem so it's effectively thin provisioned. Also, fstrim on
the mounted loop device can punch out unused space in the mounted
filesytsem.
The reason for b) is less obvious: snapshots via file cloning,
deduplication via extent sharing.
The clone operaiton is an atomic modification of the underlying file
mapping, which then triggers COW on future writes to those mappings,
which causes the mapping to the change at write IO time.
IOWs, the whole concept that there is a "static mapping" for a loop
device image file for the life of the image file is fundamentally
flawed.
> > > Dm-loop is significantly faster than the regular loop:
> > >
> > > # modprobe brd rd_size=1048576
> > > # dd if=/dev/zero of=/dev/ram0 bs=1048576
> > > # mkfs.ext4 /dev/ram0
> > > # mount -t ext4 /dev/ram0 /mnt/test
> > > # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512
Urk. ram disks are terrible for IO benchmarking. The IO is
synchronous (i.e. always completes in the submitter context) and
performance is -always CPU bound- due to the requirement for all
data copying to be marshalled through the CPU.
Please benchmark performance on NVMe SSDs - it will give a much more
accurate deomonstration of the performance differences we'll see in
real world usage of loop device functionality...
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-04 2:13 ` Dave Chinner
@ 2025-03-04 11:18 ` Mikulas Patocka
2025-03-04 13:50 ` Christoph Hellwig
2025-03-05 0:01 ` Dave Chinner
0 siblings, 2 replies; 45+ messages in thread
From: Mikulas Patocka @ 2025-03-04 11:18 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon,
Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block,
linux-fsdevel
On Tue, 4 Mar 2025, Dave Chinner wrote:
> On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote:
> >
> >
> > On Mon, 3 Mar 2025, Christoph Hellwig wrote:
> >
> > > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote:
> > > > What should I use instead of bmap? Is fiemap exported for use in the
> > > > kernel?
> > >
> > > You can't do an ahead of time mapping. It's a broken concept.
> >
> > Swapfile does ahead of time mapping. And I just looked at what swapfile
> > does and copied the logic into dm-loop. If swapfile is not broken, how
> > could dm-loop be broken?
>
> Swap files cannot be accessed/modified by user code once the
> swapfile is activated. See all the IS_SWAPFILE() checked throughout
> the VFS and filesystem code.
>
> Swap files must be fully allocated (i.e. not sparse), nor contan
> shared extents. This is required so that writes to the swapfile do
> not require block allocation which would change the mapping...
>
> Hence we explicitly prevent modification of the underlying file
> mapping once a swapfile is owned and mapped by the kernel as a
> swapfile.
>
> That's not how loop devices/image files work - we actually rely on
> them being:
>
> a) sparse; and
> b) the mapping being mutable via direct access to the loop file
> whilst there is an active mounted filesystem on that loop file.
>
> and so every IO needs to be mapped through the filesystem at
> submission time.
>
> The reason for a) is obvious: we don't need to allocate space for
> the filesystem so it's effectively thin provisioned. Also, fstrim on
> the mounted loop device can punch out unused space in the mounted
> filesytsem.
>
> The reason for b) is less obvious: snapshots via file cloning,
> deduplication via extent sharing.
>
> The clone operaiton is an atomic modification of the underlying file
> mapping, which then triggers COW on future writes to those mappings,
> which causes the mapping to the change at write IO time.
>
> IOWs, the whole concept that there is a "static mapping" for a loop
> device image file for the life of the image file is fundamentally
> flawed.
I'm not trying to break existing loop.
But some users don't use COW filesystems, some users use fully provisioned
files, some users don't need to write to a file when it is being mapped -
and for them dm-loop would be viable alternative because of better
performance.
The Android people concluded that loop is too slow and rather than using
loop they want to map a file using a table with dm-linear targets over the
image of the host filesystem. So, they are already doing what dm-loop is
doing.
Mikulas
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-03 21:03 ` Mikulas Patocka
2025-03-04 2:13 ` Dave Chinner
@ 2025-03-04 13:49 ` Christoph Hellwig
[not found] ` <CAM23Vxr=fKy-0L1R5P-5h6A95acKT_d=CC1E+TAzAs8v6q9gHw@mail.gmail.com>
2025-03-12 13:26 ` Kent Overstreet
1 sibling, 2 replies; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-04 13:49 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon,
Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block,
linux-fsdevel
On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote:
> Swapfile does ahead of time mapping.
It does. But it is:
a) protected against modification by the S_SWAPFILE flag and checked
for full allocation first
b) something we want to get rid of because even with the above it is
rather problematic
> And I just looked at what swapfile
> does and copied the logic into dm-loop. If swapfile is not broken, how
> could dm-loop be broken?
As said above, swapfile works around the brokenness in ways that you
can't. And just blindly copying old code without understanding it is
never a good idea.
>
> > > Would Jens Axboe agree to merge the dm-loop logic into the existing loop
> > > driver?
> >
> > What logic?
>
> The ahead-of-time mapping.
As said multiple times you can't do that. The block mapping is
file system private information.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-04 11:18 ` Mikulas Patocka
@ 2025-03-04 13:50 ` Christoph Hellwig
2025-03-05 0:01 ` Dave Chinner
1 sibling, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-04 13:50 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel
On Tue, Mar 04, 2025 at 12:18:04PM +0100, Mikulas Patocka wrote:
> I'm not trying to break existing loop.
You are f***cking breaking file system semantics. Stop it now.
> The Android people concluded that loop is too slow and rather than using
> loop they want to map a file using a table with dm-linear targets over the
> image of the host filesystem. So, they are already doing what dm-loop is
> doing.
I've never seen bug reports from " The Android people". But maybe they
just need to stop pointlessly using loop devices?
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
[not found] ` <CAM23Vxoxyrf9nwJd1Xe8uncAPiyK8yaNZNsugwX8p=qo1n6yVg@mail.gmail.com>
@ 2025-03-04 13:52 ` Christoph Hellwig
0 siblings, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-04 13:52 UTC (permalink / raw)
To: Heinz Mauelshagen
Cc: Christoph Hellwig, Mikulas Patocka, Jooyung Han, Alasdair Kergon,
Mike Snitzer, zkabelac, dm-devel, linux-block, linux-fsdevel
On Mon, Mar 03, 2025 at 06:34:45PM +0100, Heinz Mauelshagen wrote:
> It is, unless you ensure the pre-allocated and fully provisioned file to be
> immutable.
Which you can't.
> > > The inode fiemap() operation would avoid issues relative to outdated
> > bmap().
> >
> > No, using that for I/O is even worse.
> >
>
> Are you saying it imposes more overhead than the VFS APIs?
No, but the reporting in fiemap is even less defined than the already
overly lose bmap. fiemap for btrfs for example doesn't report actual
block numbers.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
[not found] ` <CAM23Vxr=fKy-0L1R5P-5h6A95acKT_d=CC1E+TAzAs8v6q9gHw@mail.gmail.com>
@ 2025-03-04 16:04 ` Christoph Hellwig
[not found] ` <CAM23VxqJX46DCpCiH5qxPpDLtMVg87Ba8sx55aQ4hvt-XaHzuQ@mail.gmail.com>
0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-04 16:04 UTC (permalink / raw)
To: Heinz Mauelshagen
Cc: Christoph Hellwig, Mikulas Patocka, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, zkabelac, dm-devel, linux-block,
linux-fsdevel
On Tue, Mar 04, 2025 at 03:50:36PM +0100, Heinz Mauelshagen wrote:
> Seems to have a loop target to add respective mapped device semantics, the
> way to go is to design it using VFS API to get it upstream.
>
> Won't help the Android use case requested by Google thus they'll have to
> stick to their multi-segment linear mapped approach to achieve their
> performance goals unless they want to support dm-loop OOT to make their
> life easier on immutable file backings.
What are "the performance goals" and what is in the way of achieving
them? This thread has massively blown up but no one has even tried to
explain what the thing tries to address.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
[not found] ` <CAM23VxqJX46DCpCiH5qxPpDLtMVg87Ba8sx55aQ4hvt-XaHzuQ@mail.gmail.com>
@ 2025-03-04 17:17 ` Christoph Hellwig
0 siblings, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-04 17:17 UTC (permalink / raw)
To: Heinz Mauelshagen
Cc: Christoph Hellwig, Mikulas Patocka, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, zkabelac, dm-devel, linux-block,
linux-fsdevel
On Tue, Mar 04, 2025 at 06:01:04PM +0100, Heinz Mauelshagen wrote:
> > As Mikulas shared earlier:
> > The Android people concluded that loop is too slow and rather than using
> > loop they want to map a file using a table with dm-linear targets over the
> > image of the host filesystem. So, they are already doing what dm-loop is
> > doing.
Which again is completely handwavy. What is the workload. What are the
issues with the workload? Where is the time spent making it slower.
e.g. the extent mapping code in this patch is a lot less efficient than
the btree based extent lookup in XFS. If it is faster it is not looking
up the extents but something else, and you need to figure out what it
is.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-04 11:18 ` Mikulas Patocka
2025-03-04 13:50 ` Christoph Hellwig
@ 2025-03-05 0:01 ` Dave Chinner
2025-03-07 15:21 ` Mikulas Patocka
1 sibling, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2025-03-05 0:01 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon,
Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block,
linux-fsdevel
On Tue, Mar 04, 2025 at 12:18:04PM +0100, Mikulas Patocka wrote:
>
>
> On Tue, 4 Mar 2025, Dave Chinner wrote:
>
> > On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote:
> > >
> > >
> > > On Mon, 3 Mar 2025, Christoph Hellwig wrote:
> > >
> > > > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote:
> > > > > What should I use instead of bmap? Is fiemap exported for use in the
> > > > > kernel?
> > > >
> > > > You can't do an ahead of time mapping. It's a broken concept.
> > >
> > > Swapfile does ahead of time mapping. And I just looked at what swapfile
> > > does and copied the logic into dm-loop. If swapfile is not broken, how
> > > could dm-loop be broken?
> >
> > Swap files cannot be accessed/modified by user code once the
> > swapfile is activated. See all the IS_SWAPFILE() checked throughout
> > the VFS and filesystem code.
> >
> > Swap files must be fully allocated (i.e. not sparse), nor contan
> > shared extents. This is required so that writes to the swapfile do
> > not require block allocation which would change the mapping...
> >
> > Hence we explicitly prevent modification of the underlying file
> > mapping once a swapfile is owned and mapped by the kernel as a
> > swapfile.
> >
> > That's not how loop devices/image files work - we actually rely on
> > them being:
> >
> > a) sparse; and
> > b) the mapping being mutable via direct access to the loop file
> > whilst there is an active mounted filesystem on that loop file.
> >
> > and so every IO needs to be mapped through the filesystem at
> > submission time.
> >
> > The reason for a) is obvious: we don't need to allocate space for
> > the filesystem so it's effectively thin provisioned. Also, fstrim on
> > the mounted loop device can punch out unused space in the mounted
> > filesytsem.
> >
> > The reason for b) is less obvious: snapshots via file cloning,
> > deduplication via extent sharing.
> >
> > The clone operaiton is an atomic modification of the underlying file
> > mapping, which then triggers COW on future writes to those mappings,
> > which causes the mapping to the change at write IO time.
> >
> > IOWs, the whole concept that there is a "static mapping" for a loop
> > device image file for the life of the image file is fundamentally
> > flawed.
>
> I'm not trying to break existing loop.
I didn't say you were. I said the concept that dm-loop is based on
is fundamentally flawed and that your benchmark setup does not
reflect real world usage of loop devices.
> But some users don't use COW filesystems, some users use fully provisioned
> files, some users don't need to write to a file when it is being mapped -
> and for them dm-loop would be viable alternative because of better
> performance.
Nothing has changed since 2008 when this "fast file mapping" thing
was first proposed and dm-loop made it's first appearance in this
thread:
https://lore.kernel.org/linux-fsdevel/20080109085231.GE6650@kernel.dk/
Let me quote Christoph's response to Jen's proposed static mapping
for the loop device patch back in 2008:
| And the way this is done is simply broken. It means you have to get
| rid of things like delayed or unwritten hands beforehand, it'll be
| a complete pain for COW or non-block backed filesystems.
|
| The right way to do this is to allow direct I/O from kernel sources
| where the filesystem is in-charge of submitting the actual I/O after
| the pages are handed to it. I think Peter Zijlstra has been looking
| into something like that for swap over nfs.
Jens also said this about dm-loop in that thread:
} Why oh why does dm always insist to reinvent everything? That's bad
} enough in itself, but on top of that most of the extra stuff ends up
} being essentially unmaintained.
}
} If we instead improve loop, everyone wins.
}
} Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a bit
} outside your own sandbox.
You - personally - were also told directly by Jens back then that
dm-loop's approach simply does not work for filesystems that move
blocks around. i.e. it isn't a viable appraoch.
Nothing has changed - it still isn't a viable approach for loopback
devices for the same reasons it wasnt' viable in 2008.
> The Android people concluded that loop is too slow and rather than using
> loop they want to map a file using a table with dm-linear targets over the
> image of the host filesystem. So, they are already doing what dm-loop is
> doing.
I don't care if a downstream kernel is doing something stupid with
their kernels.
Where are the bug reports about the loop device being slow and the
analysis that indicates that it is unfixable?
The fact is that AIO+DIO through filesystems like XFS performs
generally within 1-2% of the underlying block device capabilities.
Hence if there's a problem with loop device performance, it isn't in
the backing file IO submission path.
Find out why loop device AIO+DIO is slow for the workload you are
testing and fix that. This way everyone who already uses loop
devices benefits (as Jens said in 2008), and the Android folk can
get rid of their hacky mapping setup....
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-05 0:01 ` Dave Chinner
@ 2025-03-07 15:21 ` Mikulas Patocka
2025-03-08 3:49 ` Darrick J. Wong
` (2 more replies)
0 siblings, 3 replies; 45+ messages in thread
From: Mikulas Patocka @ 2025-03-07 15:21 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon,
Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block,
linux-fsdevel
> I didn't say you were. I said the concept that dm-loop is based on
> is fundamentally flawed and that your benchmark setup does not
> reflect real world usage of loop devices.
> Where are the bug reports about the loop device being slow and the
> analysis that indicates that it is unfixable?
So, I did benchmarks on an enterprise nvme drive (SAMSUNG
MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
--direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
synchronous I/O:
fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
raw block device:
READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
ext4/loop/ext4:
READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
xfs/loop/xfs:
READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
ext4/dm-loop/ext4:
READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
xfs/dm-loop/xfs:
READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
asynchronous I/O:
fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
raw block device:
READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
ext4/loop/ext4:
READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec
WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec
xfs/loop/xfs:
READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec
WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec
ext4/dm-loop/ext4:
READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
xfs/dm-loop/xfs:
READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
Mikulas
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-07 15:21 ` Mikulas Patocka
@ 2025-03-08 3:49 ` Darrick J. Wong
2025-03-08 20:45 ` Mikulas Patocka
2025-03-09 0:05 ` Ming Lei
2025-03-09 0:16 ` Ming Lei
2 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2025-03-08 3:49 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel
On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > I didn't say you were. I said the concept that dm-loop is based on
> > is fundamentally flawed and that your benchmark setup does not
> > reflect real world usage of loop devices.
>
> > Where are the bug reports about the loop device being slow and the
> > analysis that indicates that it is unfixable?
>
> So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
Are you running the loop device in directio mode? The default is to use
buffered io, which wastes pagecache /and/ sometimes trips dirty limits
throttling. The loopdev tests in fstests get noticeably faster if I
force directio mode.
--D
> synchronous I/O:
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> raw block device:
> READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> ext4/loop/ext4:
> READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> xfs/loop/xfs:
> READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> ext4/dm-loop/ext4:
> READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> xfs/dm-loop/xfs:
> READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
>
> asynchronous I/O:
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> raw block device:
> READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> ext4/loop/ext4:
> READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec
> WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec
> xfs/loop/xfs:
> READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec
> WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec
> ext4/dm-loop/ext4:
> READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> xfs/dm-loop/xfs:
> READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
>
> Mikulas
>
>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-08 3:49 ` Darrick J. Wong
@ 2025-03-08 20:45 ` Mikulas Patocka
0 siblings, 0 replies; 45+ messages in thread
From: Mikulas Patocka @ 2025-03-08 20:45 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel
On Fri, 7 Mar 2025, Darrick J. Wong wrote:
> On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > > I didn't say you were. I said the concept that dm-loop is based on
> > > is fundamentally flawed and that your benchmark setup does not
> > > reflect real world usage of loop devices.
> >
> > > Where are the bug reports about the loop device being slow and the
> > > analysis that indicates that it is unfixable?
> >
> > So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
>
> Are you running the loop device in directio mode? The default is to use
> buffered io, which wastes pagecache /and/ sometimes trips dirty limits
> throttling. The loopdev tests in fstests get noticeably faster if I
> force directio mode.
Yes, I am. I set it up with:
losetup --direct-io=on /dev/loop0 /mnt/test/l
mount -t xfs /dev/loop0 /mnt/test2
I double checked it and I got the same results.
Mikulas
> --D
>
> > synchronous I/O:
> > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > raw block device:
> > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> > ext4/loop/ext4:
> > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> > xfs/loop/xfs:
> > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> > ext4/dm-loop/ext4:
> > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> > xfs/dm-loop/xfs:
> > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
> >
> > asynchronous I/O:
> > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > raw block device:
> > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > ext4/loop/ext4:
> > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec
> > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec
> > xfs/loop/xfs:
> > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec
> > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec
> > ext4/dm-loop/ext4:
> > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > xfs/dm-loop/xfs:
> > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> >
> > Mikulas
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-07 15:21 ` Mikulas Patocka
2025-03-08 3:49 ` Darrick J. Wong
@ 2025-03-09 0:05 ` Ming Lei
2025-03-10 11:18 ` Mikulas Patocka
2025-03-09 0:16 ` Ming Lei
2 siblings, 1 reply; 45+ messages in thread
From: Ming Lei @ 2025-03-09 0:05 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel
On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > I didn't say you were. I said the concept that dm-loop is based on
> > is fundamentally flawed and that your benchmark setup does not
> > reflect real world usage of loop devices.
>
> > Where are the bug reports about the loop device being slow and the
> > analysis that indicates that it is unfixable?
>
> So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
>
> synchronous I/O:
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> raw block device:
> READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> ext4/loop/ext4:
> READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> xfs/loop/xfs:
> READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> ext4/dm-loop/ext4:
> READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> xfs/dm-loop/xfs:
> READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
>
> asynchronous I/O:
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> raw block device:
> READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> ext4/loop/ext4:
> READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec
> WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec
> xfs/loop/xfs:
> READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec
> WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec
> ext4/dm-loop/ext4:
> READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> xfs/dm-loop/xfs:
> READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
Hi Mikulas,
Please try the following patchset:
https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@redhat.com/
which tries to handle IO in current context directly via NOWAIT, and
supports MQ for loop since 12 io jobs are applied in your test. With this
change, I can observe similar perf data on raw block device and loop/xfs
over mq-virtio-scsi & nvme in my test VM.
1) try single queue first by `modprobe loop`
2) then try MQ by 'modprobe loop nr_hw_queues=4'
If it still doesn't work, please provide fio log for both `raw block
device` and 'loop/xfs', which may provide some clue for the big perf
gap.
Thanks,
Ming
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-07 15:21 ` Mikulas Patocka
2025-03-08 3:49 ` Darrick J. Wong
2025-03-09 0:05 ` Ming Lei
@ 2025-03-09 0:16 ` Ming Lei
2025-03-10 11:20 ` Mikulas Patocka
2 siblings, 1 reply; 45+ messages in thread
From: Ming Lei @ 2025-03-09 0:16 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel
On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > I didn't say you were. I said the concept that dm-loop is based on
> > is fundamentally flawed and that your benchmark setup does not
> > reflect real world usage of loop devices.
>
> > Where are the bug reports about the loop device being slow and the
> > analysis that indicates that it is unfixable?
>
> So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
>
> synchronous I/O:
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> raw block device:
> READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> ext4/loop/ext4:
> READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> xfs/loop/xfs:
> READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> ext4/dm-loop/ext4:
> READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> xfs/dm-loop/xfs:
> READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
>
> asynchronous I/O:
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> raw block device:
> READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
BTW, raw device is supposed to be xfs or ext4 over raw block device, right?
Otherwise, please provide test data for this case, then it becomes one fair
comparison because there should be lock contention for FS write IOs on same file.
Thanks,
Ming
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-09 0:05 ` Ming Lei
@ 2025-03-10 11:18 ` Mikulas Patocka
2025-03-11 1:27 ` Dave Chinner
0 siblings, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2025-03-10 11:18 UTC (permalink / raw)
To: Ming Lei
Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel
On Sun, 9 Mar 2025, Ming Lei wrote:
> On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > > I didn't say you were. I said the concept that dm-loop is based on
> > > is fundamentally flawed and that your benchmark setup does not
> > > reflect real world usage of loop devices.
> >
> > > Where are the bug reports about the loop device being slow and the
> > > analysis that indicates that it is unfixable?
> >
> > So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
> >
> > synchronous I/O:
> > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > raw block device:
> > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> > ext4/loop/ext4:
> > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> > xfs/loop/xfs:
> > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> > ext4/dm-loop/ext4:
> > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> > xfs/dm-loop/xfs:
> > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
> >
> > asynchronous I/O:
> > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > raw block device:
> > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > ext4/loop/ext4:
> > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec
> > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec
> > xfs/loop/xfs:
> > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec
> > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec
> > ext4/dm-loop/ext4:
> > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > xfs/dm-loop/xfs:
> > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
>
> Hi Mikulas,
>
> Please try the following patchset:
>
> https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@redhat.com/
>
> which tries to handle IO in current context directly via NOWAIT, and
> supports MQ for loop since 12 io jobs are applied in your test. With this
> change, I can observe similar perf data on raw block device and loop/xfs
> over mq-virtio-scsi & nvme in my test VM.
Yes - with these patches, it is much better.
> 1) try single queue first by `modprobe loop`
fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
xfs/loop/xfs
READ: bw=302MiB/s (317MB/s), 302MiB/s-302MiB/s (317MB/s-317MB/s), io=3024MiB (3170MB), run=10001-10001msec
WRITE: bw=303MiB/s (317MB/s), 303MiB/s-303MiB/s (317MB/s-317MB/s), io=3026MiB (3173MB), run=10001-10001msec
fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
xfs/loop/xfs
READ: bw=1055MiB/s (1106MB/s), 1055MiB/s-1055MiB/s (1106MB/s-1106MB/s), io=10.3GiB (11.1GB), run=10001-10001msec
WRITE: bw=1056MiB/s (1107MB/s), 1056MiB/s-1056MiB/s (1107MB/s-1107MB/s), io=10.3GiB (11.1GB), run=10001-10001msec
> 2) then try MQ by 'modprobe loop nr_hw_queues=4'
fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
xfs/loop/xfs
READ: bw=352MiB/s (370MB/s), 352MiB/s-352MiB/s (370MB/s-370MB/s), io=3525MiB (3696MB), run=10001-10001msec
WRITE: bw=353MiB/s (370MB/s), 353MiB/s-353MiB/s (370MB/s-370MB/s), io=3527MiB (3698MB), run=10001-10001msec
fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
xfs/loop/xfs
READ: bw=1181MiB/s (1239MB/s), 1181MiB/s-1181MiB/s (1239MB/s-1239MB/s), io=11.5GiB (12.4GB), run=10001-10001msec
WRITE: bw=1183MiB/s (1240MB/s), 1183MiB/s-1183MiB/s (1240MB/s-1240MB/s), io=11.5GiB (12.4GB), run=10001-10001msec
> If it still doesn't work, please provide fio log for both `raw block
> device` and 'loop/xfs', which may provide some clue for the big perf
> gap.
>
>
>
> Thanks,
> Ming
Mikulas
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-09 0:16 ` Ming Lei
@ 2025-03-10 11:20 ` Mikulas Patocka
0 siblings, 0 replies; 45+ messages in thread
From: Mikulas Patocka @ 2025-03-10 11:20 UTC (permalink / raw)
To: Ming Lei
Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel
On Sun, 9 Mar 2025, Ming Lei wrote:
> On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > > I didn't say you were. I said the concept that dm-loop is based on
> > > is fundamentally flawed and that your benchmark setup does not
> > > reflect real world usage of loop devices.
> >
> > > Where are the bug reports about the loop device being slow and the
> > > analysis that indicates that it is unfixable?
> >
> > So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
> >
> > synchronous I/O:
> > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > raw block device:
> > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> > ext4/loop/ext4:
> > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> > xfs/loop/xfs:
> > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> > ext4/dm-loop/ext4:
> > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> > xfs/dm-loop/xfs:
> > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
> >
> > asynchronous I/O:
> > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > raw block device:
> > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
>
> BTW, raw device is supposed to be xfs or ext4 over raw block device, right?
No - raw device means fio directly on /dev/nvme0n1
> Otherwise, please provide test data for this case, then it becomes one fair
> comparison because there should be lock contention for FS write IOs on same file.
>
> Thanks,
> Ming
Mikulas
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-10 11:18 ` Mikulas Patocka
@ 2025-03-11 1:27 ` Dave Chinner
2025-03-11 10:43 ` Ming Lei
0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2025-03-11 1:27 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Ming Lei, Christoph Hellwig, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel
On Mon, Mar 10, 2025 at 12:18:51PM +0100, Mikulas Patocka wrote:
>
>
> On Sun, 9 Mar 2025, Ming Lei wrote:
>
> > On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > > > I didn't say you were. I said the concept that dm-loop is based on
> > > > is fundamentally flawed and that your benchmark setup does not
> > > > reflect real world usage of loop devices.
> > >
> > > > Where are the bug reports about the loop device being slow and the
> > > > analysis that indicates that it is unfixable?
> > >
> > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> > > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> > > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
> > >
> > > synchronous I/O:
> > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > > raw block device:
> > > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> > > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> > > ext4/loop/ext4:
> > > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> > > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> > > xfs/loop/xfs:
> > > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> > > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> > > ext4/dm-loop/ext4:
> > > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> > > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> > > xfs/dm-loop/xfs:
> > > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> > > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
> > >
> > > asynchronous I/O:
> > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > > raw block device:
> > > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > > ext4/loop/ext4:
> > > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec
> > > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec
> > > xfs/loop/xfs:
> > > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec
> > > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec
> > > ext4/dm-loop/ext4:
> > > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > > xfs/dm-loop/xfs:
> > > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> > > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> >
> > Hi Mikulas,
> >
> > Please try the following patchset:
> >
> > https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@redhat.com/
> >
> > which tries to handle IO in current context directly via NOWAIT, and
> > supports MQ for loop since 12 io jobs are applied in your test. With this
> > change, I can observe similar perf data on raw block device and loop/xfs
> > over mq-virtio-scsi & nvme in my test VM.
I'm not sure RWF_NOWAIT is a workable solution.
Why?
IO submission is queued to a different thread context because to
avoid a potential deadlock. That is, we are operating here in the
writeback context of another filesystem, and so we cannot do things
like depend on memory allocation making forwards progress for IO
submission. RWF_NOWAIT is not a guarantee that memory allocation
will not occur in the IO submission path - it is implemented as best
effort non-blocking behaviour.
Further, if we have stacked loop devices (e.g.
xfs-loop-ext4-loop-btrfs-loop-xfs) we can will be stacking
RWF_NOWAIT IO submission contexts through multiple filesystems. This
is not a filesystem IO path I want to support - being in the middle
of such a stack creates all sorts of subtle constraints on behaviour
that otherwise wouldn't exist. We actually do this sort of multi-fs
stacking in fstests, so it's not a made up scenario.
I'm also concerned that RWF_NOWAIT submission is not an optimisation
at all for the common sparse/COW image file case, because in this
case RWF_NOWAIT failing with EAGAIN is just as common (if not
moreso) than it succeeding.
i.e. in this case, RWF_NOWAIT writes will fail with -EAGAIN very
frequently, so all that time spent doing IO submission is wasted
time.
Further, because allocation on write requires an exclusive lock and
it is held for some time, this will affect read performance from the
backing device as well. i.e. block mapping during a read while a
write is doing allocation will also return EAGAIN for RWF_NOWAIT.
This will push the read off to the background worker thread to be
serviced and so that will go much slower than a RWF_NOWAIT read that
hits the backing file between writes doing allocation. i.e. read
latency is going to become much, much more variable.
Hence I suspect RWF_NOWAIT is simply hiding the underlying issue
by providing this benchmark with a "pure overwrite" fast path that
avoids the overhead of queueing work through loop_queue_work()....
Can you run these same loop dev tests using a sparse image file and
a sparse fio test file so that the fio benchmark measures the impact
of loop device block allocation on the test? I suspect the results
with the RWF_NOWAIT patch will be somewhat different to the fully
allocated case...
>
> Yes - with these patches, it is much better.
>
> > 1) try single queue first by `modprobe loop`
>
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> xfs/loop/xfs
> READ: bw=302MiB/s (317MB/s), 302MiB/s-302MiB/s (317MB/s-317MB/s), io=3024MiB (3170MB), run=10001-10001msec
> WRITE: bw=303MiB/s (317MB/s), 303MiB/s-303MiB/s (317MB/s-317MB/s), io=3026MiB (3173MB), run=10001-10001msec
>
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> xfs/loop/xfs
> READ: bw=1055MiB/s (1106MB/s), 1055MiB/s-1055MiB/s (1106MB/s-1106MB/s), io=10.3GiB (11.1GB), run=10001-10001msec
> WRITE: bw=1056MiB/s (1107MB/s), 1056MiB/s-1056MiB/s (1107MB/s-1107MB/s), io=10.3GiB (11.1GB), run=10001-10001msec
Yup, this sort of difference in performance simply from bypassing
loop_queue_work() implies the problem is the single threaded loop
device queue implementation needs to be fixed.
loop_queue_work()
{
....
spin_lock_irq(&lo->lo_work_lock);
....
} else {
work = &lo->rootcg_work;
cmd_list = &lo->rootcg_cmd_list;
}
list_add_tail(&cmd->list_entry, cmd_list);
queue_work(lo->workqueue, work);
spin_unlock_irq(&lo->lo_work_lock);
}
Not only does every IO that is queued takes this queue lock, there
is only one work instance for the loop device. Therefore there is
only one kworker process per control group that does IO submission
for the loop device. And that kworker thread also takes the work
lock to do dequeue as well.
That serialised queue with a single IO dispatcher thread looks to be
the performance bottleneck to me. We could get rid of the lock by
using a llist for this multi-producer/single consumer cmd list
pattern, though I suspect we can get rid of the list entirely...
i.e. we have a work queue that can run a
thousand concurrent works for this loop device, but the request
queue is depth limited to 128 requests. hence we can have a full
set of requests in flight and not run out of submission worker
concurrency. There's no need to isolate IO from different cgroups in
this situation - they will not get blocked behind IO submission
from a different cgroup that is throttled...
i.e. the cmd->list_entry list_head could be replaced with a struct
work_struct and that whole cmd list management and cgroup scheduling
thing could be replaced with a single call to
queue_work(cmd->io_work). i.e. the single point that all IO
submission is directed through goes away completely.
-Dave
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-11 1:27 ` Dave Chinner
@ 2025-03-11 10:43 ` Ming Lei
2025-03-12 2:34 ` Dave Chinner
0 siblings, 1 reply; 45+ messages in thread
From: Ming Lei @ 2025-03-11 10:43 UTC (permalink / raw)
To: Dave Chinner
Cc: Mikulas Patocka, Christoph Hellwig, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel, io-uring
On Tue, Mar 11, 2025 at 12:27:23PM +1100, Dave Chinner wrote:
> On Mon, Mar 10, 2025 at 12:18:51PM +0100, Mikulas Patocka wrote:
> >
> >
> > On Sun, 9 Mar 2025, Ming Lei wrote:
> >
> > > On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > > > > I didn't say you were. I said the concept that dm-loop is based on
> > > > > is fundamentally flawed and that your benchmark setup does not
> > > > > reflect real world usage of loop devices.
> > > >
> > > > > Where are the bug reports about the loop device being slow and the
> > > > > analysis that indicates that it is unfixable?
> > > >
> > > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> > > > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> > > > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
> > > >
> > > > synchronous I/O:
> > > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > > > raw block device:
> > > > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> > > > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> > > > ext4/loop/ext4:
> > > > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> > > > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> > > > xfs/loop/xfs:
> > > > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> > > > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> > > > ext4/dm-loop/ext4:
> > > > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> > > > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> > > > xfs/dm-loop/xfs:
> > > > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> > > > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
> > > >
> > > > asynchronous I/O:
> > > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > > > raw block device:
> > > > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > > > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > > > ext4/loop/ext4:
> > > > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec
> > > > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec
> > > > xfs/loop/xfs:
> > > > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec
> > > > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec
> > > > ext4/dm-loop/ext4:
> > > > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > > > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > > > xfs/dm-loop/xfs:
> > > > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> > > > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> > >
> > > Hi Mikulas,
> > >
> > > Please try the following patchset:
> > >
> > > https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@redhat.com/
> > >
> > > which tries to handle IO in current context directly via NOWAIT, and
> > > supports MQ for loop since 12 io jobs are applied in your test. With this
> > > change, I can observe similar perf data on raw block device and loop/xfs
> > > over mq-virtio-scsi & nvme in my test VM.
>
> I'm not sure RWF_NOWAIT is a workable solution.
>
> Why?
It is just the sane implementation of Mikulas's static mapping
approach: no need to move to workqueue if the mapping is immutable
or sort of.
Also it matches with io_uring's FS read/write implementation:
- try to submit IO with NOWAIT first
- then fallback to io-wq in case of -EAGAIN
It isn't perfect, sometime it may be slower than running on io-wq
directly.
But is there any better way for covering everything?
I guess no, because FS can't tell us when the IO can be submitted
successfully via NOWAIT, and we can't know if it may succeed without
trying. And basically that is what the interface is designed.
>
> IO submission is queued to a different thread context because to
> avoid a potential deadlock. That is, we are operating here in the
> writeback context of another filesystem, and so we cannot do things
> like depend on memory allocation making forwards progress for IO
> submission. RWF_NOWAIT is not a guarantee that memory allocation
> will not occur in the IO submission path - it is implemented as best
> effort non-blocking behaviour.
Yes, that is why BLK_MQ_F_BLOCKING is added.
>
> Further, if we have stacked loop devices (e.g.
> xfs-loop-ext4-loop-btrfs-loop-xfs) we can will be stacking
> RWF_NOWAIT IO submission contexts through multiple filesystems. This
> is not a filesystem IO path I want to support - being in the middle
> of such a stack creates all sorts of subtle constraints on behaviour
> that otherwise wouldn't exist. We actually do this sort of multi-fs
> stacking in fstests, so it's not a made up scenario.
>
> I'm also concerned that RWF_NOWAIT submission is not an optimisation
> at all for the common sparse/COW image file case, because in this
> case RWF_NOWAIT failing with EAGAIN is just as common (if not
> moreso) than it succeeding.
>
> i.e. in this case, RWF_NOWAIT writes will fail with -EAGAIN very
> frequently, so all that time spent doing IO submission is wasted
> time.
Right.
I saw that when I wrote ublk/zoned in which every write needs to
allocate space. It is workaround by preallocating space for one fixed
range or the whole zone.
>
> Further, because allocation on write requires an exclusive lock and
> it is held for some time, this will affect read performance from the
> backing device as well. i.e. block mapping during a read while a
> write is doing allocation will also return EAGAIN for RWF_NOWAIT.
But that can't be avoided without using NOWAIT, and read is blocked
when write(WAIT) is in-progress.
> This will push the read off to the background worker thread to be
> serviced and so that will go much slower than a RWF_NOWAIT read that
> hits the backing file between writes doing allocation. i.e. read
> latency is going to become much, much more variable.
>
> Hence I suspect RWF_NOWAIT is simply hiding the underlying issue
> by providing this benchmark with a "pure overwrite" fast path that
> avoids the overhead of queueing work through loop_queue_work()....
>
> Can you run these same loop dev tests using a sparse image file and
> a sparse fio test file so that the fio benchmark measures the impact
> of loop device block allocation on the test? I suspect the results
> with the RWF_NOWAIT patch will be somewhat different to the fully
> allocated case...
Yes, it will be slower, and io_uring FS IO application is in
the same situation, and usually application doesn't have such
knowledge if RWF_NOWAIT can succeed.
However, usually meta IO is much less compared with normal IO, so most
times it will be a win to try NOWAIT first.
>
> >
> > Yes - with these patches, it is much better.
> >
> > > 1) try single queue first by `modprobe loop`
> >
> > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > xfs/loop/xfs
> > READ: bw=302MiB/s (317MB/s), 302MiB/s-302MiB/s (317MB/s-317MB/s), io=3024MiB (3170MB), run=10001-10001msec
> > WRITE: bw=303MiB/s (317MB/s), 303MiB/s-303MiB/s (317MB/s-317MB/s), io=3026MiB (3173MB), run=10001-10001msec
> >
> > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > xfs/loop/xfs
> > READ: bw=1055MiB/s (1106MB/s), 1055MiB/s-1055MiB/s (1106MB/s-1106MB/s), io=10.3GiB (11.1GB), run=10001-10001msec
> > WRITE: bw=1056MiB/s (1107MB/s), 1056MiB/s-1056MiB/s (1107MB/s-1107MB/s), io=10.3GiB (11.1GB), run=10001-10001msec
>
> Yup, this sort of difference in performance simply from bypassing
> loop_queue_work() implies the problem is the single threaded loop
> device queue implementation needs to be fixed.
>
> loop_queue_work()
> {
> ....
> spin_lock_irq(&lo->lo_work_lock);
> ....
>
> } else {
> work = &lo->rootcg_work;
> cmd_list = &lo->rootcg_cmd_list;
> }
> list_add_tail(&cmd->list_entry, cmd_list);
> queue_work(lo->workqueue, work);
> spin_unlock_irq(&lo->lo_work_lock);
> }
>
> Not only does every IO that is queued takes this queue lock, there
> is only one work instance for the loop device. Therefore there is
> only one kworker process per control group that does IO submission
> for the loop device. And that kworker thread also takes the work
> lock to do dequeue as well.
>
> That serialised queue with a single IO dispatcher thread looks to be
> the performance bottleneck to me. We could get rid of the lock by
> using a llist for this multi-producer/single consumer cmd list
> pattern, though I suspect we can get rid of the list entirely...
>
> i.e. we have a work queue that can run a
> thousand concurrent works for this loop device, but the request
> queue is depth limited to 128 requests. hence we can have a full
> set of requests in flight and not run out of submission worker
> concurrency. There's no need to isolate IO from different cgroups in
> this situation - they will not get blocked behind IO submission
> from a different cgroup that is throttled...
>
> i.e. the cmd->list_entry list_head could be replaced with a struct
> work_struct and that whole cmd list management and cgroup scheduling
> thing could be replaced with a single call to
> queue_work(cmd->io_work). i.e. the single point that all IO
Then there will be many FS write contention, :-)
> submission is directed through goes away completely.
It has been shown many times that AIO submitted from single or much less
contexts is much more efficient than running IO concurrently from multiple
contexts, especially for sequential IO.
Please see the recent example of zloop vs. ublk/zoned:
https://lore.kernel.org/linux-block/d5e59531-c19b-4332-8f47-b380ab9678be@kernel.org/
When zloop takes single dispatcher just like the in-tree loop, sequential
WRITE performs much better than initial version of ublk/zoned, which just
handles every IO in its own io-wq(one time NOWAIT & -EAGAIN and one time fallback
to io-wq). I tried to submit FS WRITE via io-wq directly and it becomes what
your suggested, the improvement is small, and still much worse than zloop's
single dispatcher.
Later, when I switch to pre-allocate space for each zone or one fixed range,
each write is submitted with NOWAIT successfully, then the sequential write perf
is improved much:
https://lore.kernel.org/linux-block/Z6QrceGGAJl_X_BM@fedora/
Thanks,
Ming
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-11 10:43 ` Ming Lei
@ 2025-03-12 2:34 ` Dave Chinner
2025-03-12 6:24 ` Christoph Hellwig
2025-03-12 8:26 ` Ming Lei
0 siblings, 2 replies; 45+ messages in thread
From: Dave Chinner @ 2025-03-12 2:34 UTC (permalink / raw)
To: Ming Lei
Cc: Mikulas Patocka, Christoph Hellwig, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel, io-uring
On Tue, Mar 11, 2025 at 06:43:22PM +0800, Ming Lei wrote:
> On Tue, Mar 11, 2025 at 12:27:23PM +1100, Dave Chinner wrote:
> > On Mon, Mar 10, 2025 at 12:18:51PM +0100, Mikulas Patocka wrote:
> > > On Sun, 9 Mar 2025, Ming Lei wrote:
> > > > Please try the following patchset:
> > > >
> > > > https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@redhat.com/
> > > >
> > > > which tries to handle IO in current context directly via NOWAIT, and
> > > > supports MQ for loop since 12 io jobs are applied in your test. With this
> > > > change, I can observe similar perf data on raw block device and loop/xfs
> > > > over mq-virtio-scsi & nvme in my test VM.
> >
> > I'm not sure RWF_NOWAIT is a workable solution.
> >
> > Why?
>
> It is just the sane implementation of Mikulas's static mapping
> approach: no need to move to workqueue if the mapping is immutable
> or sort of.
>
> Also it matches with io_uring's FS read/write implementation:
>
> - try to submit IO with NOWAIT first
> - then fallback to io-wq in case of -EAGAIN
No, it doesn't match what io_uring does. yes, the NOWAIT bit does,
but the work queue implementation is in the loop device is
completely different to the way io_uring dispatches work.
That is, io_uring dispatches one IO per wroker thread context so
they can all run in parallel down through the filesystem. The loop
device has a single worker thread so it -serialises- IO submission
to the filesystem.
i.e. blocking a single IO submission in the loop worker blocks all
IO submission, whereas io_uring submits all IO indepedently so they
only block if serialisation occurs further down the stack.
> It isn't perfect, sometime it may be slower than running on io-wq
> directly.
>
> But is there any better way for covering everything?
Yes - fix the loop queue workers.
> I guess no, because FS can't tell us when the IO can be submitted
> successfully via NOWAIT, and we can't know if it may succeed without
> trying. And basically that is what the interface is designed.
Wrong.
Speculative non-blocking IO like NOWAIT is the wrong optimisation to
make for workloads that are very likely to block in the IO path. It
just adds overhead without adding any improvement in performance.
Getting rid of the serialised IO submission problems that the loop
device current has will benefit *all* workloads that use the loop
device, not just those that are fully allocated. Yes, it won't quite
show the same performance as NOWAIT in that case, but it still
should give 90-95% of native performance for the static file case.
And it should also improve all the other cases, too, because now
they will only serialise when the backing file needs IO operations to
serialise (i.e. during allocation).
> > IO submission is queued to a different thread context because to
> > avoid a potential deadlock. That is, we are operating here in the
> > writeback context of another filesystem, and so we cannot do things
> > like depend on memory allocation making forwards progress for IO
> > submission. RWF_NOWAIT is not a guarantee that memory allocation
> > will not occur in the IO submission path - it is implemented as best
> > effort non-blocking behaviour.
>
> Yes, that is why BLK_MQ_F_BLOCKING is added.
>
> >
> > Further, if we have stacked loop devices (e.g.
> > xfs-loop-ext4-loop-btrfs-loop-xfs) we can will be stacking
> > RWF_NOWAIT IO submission contexts through multiple filesystems. This
> > is not a filesystem IO path I want to support - being in the middle
> > of such a stack creates all sorts of subtle constraints on behaviour
> > that otherwise wouldn't exist. We actually do this sort of multi-fs
> > stacking in fstests, so it's not a made up scenario.
> >
> > I'm also concerned that RWF_NOWAIT submission is not an optimisation
> > at all for the common sparse/COW image file case, because in this
> > case RWF_NOWAIT failing with EAGAIN is just as common (if not
> > moreso) than it succeeding.
> >
> > i.e. in this case, RWF_NOWAIT writes will fail with -EAGAIN very
> > frequently, so all that time spent doing IO submission is wasted
> > time.
>
> Right.
>
> I saw that when I wrote ublk/zoned in which every write needs to
> allocate space. It is workaround by preallocating space for one fixed
> range or the whole zone.
*cough*
You do realise that fallocate() serialises all IO on the file? i.e.
not only does it block all IO submission, mmap IO and other metadata
operations, it also waits for all IO in flight to complete and it
doesn't allow any IO to restart until the preallocation is complete.
i.e. preallocation creates a huge IO submission latency bubble in
the IO path. Hence preallocation is something that should be avoided
at runtime if at all possible.
If you have problems with allocation overhead during IO, then we
have things like extent size hints that allow the per-IO allocation
to be made much larger than the IO itself. This effectively does
preallocation during IO without the IO pipeline bubbles that
preallocation requires to function correctly....
> > Further, because allocation on write requires an exclusive lock and
> > it is held for some time, this will affect read performance from the
> > backing device as well. i.e. block mapping during a read while a
> > write is doing allocation will also return EAGAIN for RWF_NOWAIT.
>
> But that can't be avoided without using NOWAIT, and read is blocked
> when write(WAIT) is in-progress.
If the loop worker dispatches each IO in it's own task context, then
we don't care if a read IO blocks on some other write in progress.
It doesn't get in the way of any other IO submission...
> > This will push the read off to the background worker thread to be
> > serviced and so that will go much slower than a RWF_NOWAIT read that
> > hits the backing file between writes doing allocation. i.e. read
> > latency is going to become much, much more variable.
> >
> > Hence I suspect RWF_NOWAIT is simply hiding the underlying issue
> > by providing this benchmark with a "pure overwrite" fast path that
> > avoids the overhead of queueing work through loop_queue_work()....
> >
> > Can you run these same loop dev tests using a sparse image file and
> > a sparse fio test file so that the fio benchmark measures the impact
> > of loop device block allocation on the test? I suspect the results
> > with the RWF_NOWAIT patch will be somewhat different to the fully
> > allocated case...
>
> Yes, it will be slower, and io_uring FS IO application is in
> the same situation, and usually application doesn't have such
> knowledge if RWF_NOWAIT can succeed.
And that's my point: nothing above the filesystem will have - or
can have - any knowledge of whether NOWAIT IO will succeed or
not.
Therefore we have to use our brains to analyse what the typical
behaviour of the filesystem will be for a given situation to
determine how best to optimise IO submission.
> However, usually meta IO is much less compared with normal IO, so most
> times it will be a win to try NOWAIT first.
Data vs metadata from the upper filesystem doesn't even enter into
the equation here.
Filesystems like XFS, bcachefs and btrfs all dynamically create,
move and remove metadata, so metadata writes are as much of an
allocation problem for sparse loop backing files as write-once user
data IO is.
> > Yup, this sort of difference in performance simply from bypassing
> > loop_queue_work() implies the problem is the single threaded loop
> > device queue implementation needs to be fixed.
> >
> > loop_queue_work()
> > {
> > ....
> > spin_lock_irq(&lo->lo_work_lock);
> > ....
> >
> > } else {
> > work = &lo->rootcg_work;
> > cmd_list = &lo->rootcg_cmd_list;
> > }
> > list_add_tail(&cmd->list_entry, cmd_list);
> > queue_work(lo->workqueue, work);
> > spin_unlock_irq(&lo->lo_work_lock);
> > }
> >
> > Not only does every IO that is queued takes this queue lock, there
> > is only one work instance for the loop device. Therefore there is
> > only one kworker process per control group that does IO submission
> > for the loop device. And that kworker thread also takes the work
> > lock to do dequeue as well.
> >
> > That serialised queue with a single IO dispatcher thread looks to be
> > the performance bottleneck to me. We could get rid of the lock by
> > using a llist for this multi-producer/single consumer cmd list
> > pattern, though I suspect we can get rid of the list entirely...
> >
> > i.e. we have a work queue that can run a
> > thousand concurrent works for this loop device, but the request
> > queue is depth limited to 128 requests. hence we can have a full
> > set of requests in flight and not run out of submission worker
> > concurrency. There's no need to isolate IO from different cgroups in
> > this situation - they will not get blocked behind IO submission
> > from a different cgroup that is throttled...
> >
> > i.e. the cmd->list_entry list_head could be replaced with a struct
> > work_struct and that whole cmd list management and cgroup scheduling
> > thing could be replaced with a single call to
> > queue_work(cmd->io_work). i.e. the single point that all IO
>
> Then there will be many FS write contention, :-)
Did you think about how much parallelism your NOWAIT patches are
directing at the loop device backing file before making that
comment?
The fio test that was run had 12 jobs running, there is at least
12-way IO concurrency hitting the backing file directly via the
NOWAIT path.
Yup, the NOWAIT path exposes the filesystem directly to userspace
write IO concurrency. Yet the filesystem IO scaled to near raw
device speed....
It should be obvious that adding more IO concurrency to loop
device IO submission isn't a problem for the XFS IO path. And if it
proves to be an issue, then that's an XFS problem to solve, not a
generic device scalability issue.
> > submission is directed through goes away completely.
>
> It has been shown many times that AIO submitted from single or much less
> contexts is much more efficient than running IO concurrently from multiple
> contexts, especially for sequential IO.
Yes, I know that, and I'm not disputing that this is the most
optimal way to submit IO *when it doesn't require blocking to
dispatch*.
However, the fact is this benefit only exists when all IO can be
submitted in a non-blocking manner.
As soon as blocking occurs in IO submission, then AIO submission
either becomes synchronous or is aborted with EAGAIN (i.e. NOWAIT
io).
If we use blocking submission behaviour, then performance tanks.
This is what the loop device does right now.
If we use non-blocking submission and get -EAGAIN, then we still
need to use the slightly less efficient method of per-IO task
contexts to scale out blocking IO submission.
With that in mind, we must accept that loop device IO *frequently
needs to block* and so non-blocking IO is the wrong behaviour to
optimise *exclusively* for.
Hence we need to first make the loop device scale with per-IO task
contexts as this will improve both IO that can be dispatched without
blocking as well as IO that will block during dispatch. If this
does not bring performance up to acceptible levels, then other
optimisations can then be considered.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-12 2:34 ` Dave Chinner
@ 2025-03-12 6:24 ` Christoph Hellwig
2025-03-12 8:26 ` Ming Lei
1 sibling, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-12 6:24 UTC (permalink / raw)
To: Dave Chinner
Cc: Ming Lei, Mikulas Patocka, Christoph Hellwig, Jens Axboe,
Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen,
zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring
On Wed, Mar 12, 2025 at 01:34:02PM +1100, Dave Chinner wrote:
> Wrong.
>
> Speculative non-blocking IO like NOWAIT is the wrong optimisation to
> make for workloads that are very likely to block in the IO path. It
> just adds overhead without adding any improvement in performance.
Note that I suspect that most or at least many loop workloads are
read-heavy. And at least for reads NOWAIT makes perfect sense.
> Getting rid of the serialised IO submission problems that the loop
> device current has will benefit *all* workloads that use the loop
> device, not just those that are fully allocated. Yes, it won't quite
> show the same performance as NOWAIT in that case, but it still
> should give 90-95% of native performance for the static file case.
> And it should also improve all the other cases, too, because now
> they will only serialise when the backing file needs IO operations to
> serialise (i.e. during allocation).
And I agree that this should be a first step.
> *cough*
The whole ublk-zoned is a bit of a bullshit thing where Ming wrote
up something that barely works to block inclusion of the zloop driver
we really need for zoned xfs testing. Please don't take it serious.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-12 2:34 ` Dave Chinner
2025-03-12 6:24 ` Christoph Hellwig
@ 2025-03-12 8:26 ` Ming Lei
2025-03-13 1:36 ` Ming Lei
2025-03-13 16:36 ` Mikulas Patocka
1 sibling, 2 replies; 45+ messages in thread
From: Ming Lei @ 2025-03-12 8:26 UTC (permalink / raw)
To: Dave Chinner
Cc: Mikulas Patocka, Christoph Hellwig, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel, io-uring
[-- Attachment #1: Type: text/plain, Size: 9748 bytes --]
On Wed, Mar 12, 2025 at 01:34:02PM +1100, Dave Chinner wrote:
> On Tue, Mar 11, 2025 at 06:43:22PM +0800, Ming Lei wrote:
> > On Tue, Mar 11, 2025 at 12:27:23PM +1100, Dave Chinner wrote:
> > > On Mon, Mar 10, 2025 at 12:18:51PM +0100, Mikulas Patocka wrote:
> > > > On Sun, 9 Mar 2025, Ming Lei wrote:
> > > > > Please try the following patchset:
> > > > >
> > > > > https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@redhat.com/
> > > > >
> > > > > which tries to handle IO in current context directly via NOWAIT, and
> > > > > supports MQ for loop since 12 io jobs are applied in your test. With this
> > > > > change, I can observe similar perf data on raw block device and loop/xfs
> > > > > over mq-virtio-scsi & nvme in my test VM.
> > >
> > > I'm not sure RWF_NOWAIT is a workable solution.
> > >
> > > Why?
> >
> > It is just the sane implementation of Mikulas's static mapping
> > approach: no need to move to workqueue if the mapping is immutable
> > or sort of.
> >
> > Also it matches with io_uring's FS read/write implementation:
> >
> > - try to submit IO with NOWAIT first
> > - then fallback to io-wq in case of -EAGAIN
>
> No, it doesn't match what io_uring does. yes, the NOWAIT bit does,
> but the work queue implementation is in the loop device is
> completely different to the way io_uring dispatches work.
>
> That is, io_uring dispatches one IO per wroker thread context so
> they can all run in parallel down through the filesystem. The loop
> device has a single worker thread so it -serialises- IO submission
> to the filesystem.
>
> i.e. blocking a single IO submission in the loop worker blocks all
> IO submission, whereas io_uring submits all IO indepedently so they
> only block if serialisation occurs further down the stack.
block layer/storage has many optimization for batching handling, if IOs
are submitted from many contexts:
- this batch handling optimization is gone
- IO is re-ordered from underlying hardware viewpoint
- more contention from FS write lock, because loop has single back file.
That is why the single task context is taken from the beginning of loop aio,
and it performs pretty well for sequential IO workloads, as I shown
in the zloop example.
>
> > It isn't perfect, sometime it may be slower than running on io-wq
> > directly.
> >
> > But is there any better way for covering everything?
>
> Yes - fix the loop queue workers.
What you suggested is threaded aio by submitting IO concurrently from
different task context, this way is not the most efficient one, otherwise
modern language won't invent async/.await.
In my test VM, by running Mikulas's fio script on loop/nvme by the attached
threaded_aio patch:
NOWAIT with MQ 4 : 70K iops(read), 70K iops(write), cpu util: 40%
threaded_aio with MQ 4 : 64k iops(read), 64K iops(write), cpu util: 52%
in tree loop(SQ) : 58K iops(read), 58K iops(write)
Mikulas, please feel free to run your tests with threaded_aio:
modprobe loop nr_hw_queues=4 threaded_aio=1
by applying the attached the patch over the loop patchset.
The performance gap could be more obvious in fast hardware.
>
> > I guess no, because FS can't tell us when the IO can be submitted
> > successfully via NOWAIT, and we can't know if it may succeed without
> > trying. And basically that is what the interface is designed.
>
> Wrong.
>
> Speculative non-blocking IO like NOWAIT is the wrong optimisation to
> make for workloads that are very likely to block in the IO path. It
> just adds overhead without adding any improvement in performance.
Mikulas's ->bmap() implies that there isn't block in IO path too.
And many io_uring application shouldn't have idea if the IO will block too,
but they still use NOWAIT...
>
> Getting rid of the serialised IO submission problems that the loop
> device current has will benefit *all* workloads that use the loop
> device, not just those that are fully allocated. Yes, it won't quite
> show the same performance as NOWAIT in that case, but it still
> should give 90-95% of native performance for the static file case.
> And it should also improve all the other cases, too, because now
> they will only serialise when the backing file needs IO operations to
> serialise (i.e. during allocation).
The above test shows that concurrent submission isn't better than NOWAIT,
and my ublk/zoned examples showed the point too.
>
> > > IO submission is queued to a different thread context because to
> > > avoid a potential deadlock. That is, we are operating here in the
> > > writeback context of another filesystem, and so we cannot do things
> > > like depend on memory allocation making forwards progress for IO
> > > submission. RWF_NOWAIT is not a guarantee that memory allocation
> > > will not occur in the IO submission path - it is implemented as best
> > > effort non-blocking behaviour.
> >
> > Yes, that is why BLK_MQ_F_BLOCKING is added.
> >
> > >
> > > Further, if we have stacked loop devices (e.g.
> > > xfs-loop-ext4-loop-btrfs-loop-xfs) we can will be stacking
> > > RWF_NOWAIT IO submission contexts through multiple filesystems. This
> > > is not a filesystem IO path I want to support - being in the middle
> > > of such a stack creates all sorts of subtle constraints on behaviour
> > > that otherwise wouldn't exist. We actually do this sort of multi-fs
> > > stacking in fstests, so it's not a made up scenario.
> > >
> > > I'm also concerned that RWF_NOWAIT submission is not an optimisation
> > > at all for the common sparse/COW image file case, because in this
> > > case RWF_NOWAIT failing with EAGAIN is just as common (if not
> > > moreso) than it succeeding.
> > >
> > > i.e. in this case, RWF_NOWAIT writes will fail with -EAGAIN very
> > > frequently, so all that time spent doing IO submission is wasted
> > > time.
> >
> > Right.
> >
> > I saw that when I wrote ublk/zoned in which every write needs to
> > allocate space. It is workaround by preallocating space for one fixed
> > range or the whole zone.
>
> *cough*
>
> You do realise that fallocate() serialises all IO on the file? i.e.
> not only does it block all IO submission, mmap IO and other metadata
> operations, it also waits for all IO in flight to complete and it
> doesn't allow any IO to restart until the preallocation is complete.
>
It isn't one issue for zloop & ublk/zoned, each zone is emulated by one
single file, and the zone is usual the parallelism unit.
> i.e. preallocation creates a huge IO submission latency bubble in
> the IO path. Hence preallocation is something that should be avoided
> at runtime if at all possible.
>
> If you have problems with allocation overhead during IO, then we
> have things like extent size hints that allow the per-IO allocation
> to be made much larger than the IO itself. This effectively does
> preallocation during IO without the IO pipeline bubbles that
> preallocation requires to function correctly....
OK, I will try it later in ublk/zoned and see if the pre-allocation can
be replaced by extent size hints, is it available for all popular FS?
>
> > > Further, because allocation on write requires an exclusive lock and
> > > it is held for some time, this will affect read performance from the
> > > backing device as well. i.e. block mapping during a read while a
> > > write is doing allocation will also return EAGAIN for RWF_NOWAIT.
> >
> > But that can't be avoided without using NOWAIT, and read is blocked
> > when write(WAIT) is in-progress.
>
> If the loop worker dispatches each IO in it's own task context, then
> we don't care if a read IO blocks on some other write in progress.
> It doesn't get in the way of any other IO submission...
>
> > > This will push the read off to the background worker thread to be
> > > serviced and so that will go much slower than a RWF_NOWAIT read that
> > > hits the backing file between writes doing allocation. i.e. read
> > > latency is going to become much, much more variable.
> > >
> > > Hence I suspect RWF_NOWAIT is simply hiding the underlying issue
> > > by providing this benchmark with a "pure overwrite" fast path that
> > > avoids the overhead of queueing work through loop_queue_work()....
> > >
> > > Can you run these same loop dev tests using a sparse image file and
> > > a sparse fio test file so that the fio benchmark measures the impact
> > > of loop device block allocation on the test? I suspect the results
> > > with the RWF_NOWAIT patch will be somewhat different to the fully
> > > allocated case...
> >
> > Yes, it will be slower, and io_uring FS IO application is in
> > the same situation, and usually application doesn't have such
> > knowledge if RWF_NOWAIT can succeed.
>
> And that's my point: nothing above the filesystem will have - or
> can have - any knowledge of whether NOWAIT IO will succeed or
> not.
>
> Therefore we have to use our brains to analyse what the typical
> behaviour of the filesystem will be for a given situation to
> determine how best to optimise IO submission.
Indeed, I think FS typical behaviour for loop case is valuable for
optimizing loop driver, but care to provide more details if you have?
>
> > However, usually meta IO is much less compared with normal IO, so most
> > times it will be a win to try NOWAIT first.
>
> Data vs metadata from the upper filesystem doesn't even enter into
> the equation here.
>
> Filesystems like XFS, bcachefs and btrfs all dynamically create,
> move and remove metadata, so metadata writes are as much of an
> allocation problem for sparse loop backing files as write-once user
> data IO is.
For typical loop workloads, I still see much less metadata io, and that
is why improvement from NOWAIT is big in Mikulas's test.
Thanks,
Ming
[-- Attachment #2: 0001-loop-debug-add-threaded-aio.patch --]
[-- Type: text/plain, Size: 3163 bytes --]
From b3c78b04ac6b80e171271a833c0221b8ab3b2b00 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Wed, 12 Mar 2025 07:33:55 +0000
Subject: [PATCH] loop: debug: add threaded aio
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/block/loop.c | 47 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 46 insertions(+), 1 deletion(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 393f38573169..113e29b3ac80 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -86,6 +86,8 @@ struct loop_cmd {
struct bio_vec *bvec;
struct cgroup_subsys_state *blkcg_css;
struct cgroup_subsys_state *memcg_css;
+
+ struct work_struct work;
};
#define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
@@ -98,6 +100,8 @@ static DEFINE_IDR(loop_index_idr);
static DEFINE_MUTEX(loop_ctl_mutex);
static DEFINE_MUTEX(loop_validate_mutex);
+static unsigned int threaded_aio = 0;
+
/**
* loop_global_lock_killable() - take locks for safe loop_validate_file() test
*
@@ -508,10 +512,17 @@ static int lo_rw_aio_prep(struct loop_device *lo, struct loop_cmd *cmd,
static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, loff_t pos)
{
unsigned int nr_bvec = lo_cmd_nr_bvec(cmd);
- int ret;
+ int ret = 0;
+
+ if (threaded_aio) {
+ ret = lo_rw_aio_prep(lo, cmd, nr_bvec);
+ if (ret < 0)
+ goto exit;
+ }
cmd->iocb.ki_flags &= ~IOCB_NOWAIT;
ret = lo_submit_rw_aio(lo, cmd, pos, nr_bvec);
+exit:
if (ret != -EIOCBQUEUED)
lo_rw_aio_complete(&cmd->iocb, ret);
return 0;
@@ -1941,10 +1952,39 @@ static const struct kernel_param_ops loop_nr_hw_q_param_ops = {
device_param_cb(nr_hw_queues, &loop_nr_hw_q_param_ops, &nr_hw_queues, 0444);
MODULE_PARM_DESC(nr_hw_queues, "number of hardware queues. Default: " __stringify(LOOP_DEFAULT_NR_HW_Q));
+module_param(threaded_aio, uint, 0644);
+MODULE_PARM_DESC(threaded_aio, "threaded aio");
+
MODULE_DESCRIPTION("Loopback device support");
MODULE_LICENSE("GPL");
MODULE_ALIAS_BLOCKDEV_MAJOR(LOOP_MAJOR);
+static void loop_aio_workfn(struct work_struct *work)
+{
+ struct loop_cmd *cmd = container_of(work, struct loop_cmd, work);
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+ struct loop_device *lo = rq->q->queuedata;
+ int ret = do_req_filebacked(lo, rq);
+
+ if (!cmd->use_aio || ret) {
+ if (ret == -EOPNOTSUPP)
+ cmd->ret = ret;
+ else
+ cmd->ret = ret ? -EIO : 0;
+ if (likely(!blk_should_fake_timeout(rq->q)))
+ blk_mq_complete_request(rq);
+ }
+}
+
+static void loop_queue_rq_threaded_aio(struct loop_cmd *cmd)
+{
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+ struct loop_device *lo = rq->q->queuedata;
+
+ INIT_WORK(&cmd->work, loop_aio_workfn);
+ queue_work(lo->workqueue, &cmd->work);
+}
+
static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
@@ -1968,6 +2008,11 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
break;
}
+ if (threaded_aio) {
+ loop_queue_rq_threaded_aio(cmd);
+ return BLK_STS_OK;
+ }
+
if (cmd->use_aio) {
loff_t pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
int ret = lo_rw_aio_nowait(lo, cmd, pos);
--
2.47.1
^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-04 13:49 ` Christoph Hellwig
[not found] ` <CAM23Vxr=fKy-0L1R5P-5h6A95acKT_d=CC1E+TAzAs8v6q9gHw@mail.gmail.com>
@ 2025-03-12 13:26 ` Kent Overstreet
2025-03-12 14:20 ` Christoph Hellwig
1 sibling, 1 reply; 45+ messages in thread
From: Kent Overstreet @ 2025-03-12 13:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon,
Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block,
linux-fsdevel
On Tue, Mar 04, 2025 at 05:49:51AM -0800, Christoph Hellwig wrote:
> On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote:
> > Swapfile does ahead of time mapping.
>
> It does. But it is:
>
> a) protected against modification by the S_SWAPFILE flag and checked
> for full allocation first
> b) something we want to get rid of because even with the above it is
> rather problematic
>
> > And I just looked at what swapfile
> > does and copied the logic into dm-loop. If swapfile is not broken, how
> > could dm-loop be broken?
>
> As said above, swapfile works around the brokenness in ways that you
> can't. And just blindly copying old code without understanding it is
> never a good idea.
>
> >
> > > > Would Jens Axboe agree to merge the dm-loop logic into the existing loop
> > > > driver?
> > >
> > > What logic?
> >
> > The ahead-of-time mapping.
>
> As said multiple times you can't do that. The block mapping is
> file system private information.
We might be able to provide an API to _request_ a stable mapping to a
file - with proper synchronization, of course.
I don't recall anyone ever trying that, it'd replace all the weird
IF_SWAPFILE() hacks and be a safe way to do these kinds of performance
optimizations.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-12 13:26 ` Kent Overstreet
@ 2025-03-12 14:20 ` Christoph Hellwig
2025-03-12 16:09 ` Kent Overstreet
2025-03-13 16:21 ` Mikulas Patocka
0 siblings, 2 replies; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-12 14:20 UTC (permalink / raw)
To: Kent Overstreet
Cc: Christoph Hellwig, Mikulas Patocka, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel
On Wed, Mar 12, 2025 at 09:26:55AM -0400, Kent Overstreet wrote:
> We might be able to provide an API to _request_ a stable mapping to a
> file - with proper synchronization, of course.
We already have that with the pNFS layouts.
> I don't recall anyone ever trying that, it'd replace all the weird
> IF_SWAPFILE() hacks and be a safe way to do these kinds of performance
> optimizations.
IS_SWAPFILE isn't going way, as can't allow other writers to it.
Also asides from the that the layouts are fairly complex.
The right way ahead for swap is to literally just treat it as a slightly
special case of direct I/o that is allowed to IS_SWAPFILE files. We
can safely do writeback to file backed folios under memory pressure,
so we can also go through the normal file system path.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-12 14:20 ` Christoph Hellwig
@ 2025-03-12 16:09 ` Kent Overstreet
2025-03-13 12:44 ` Christoph Hellwig
2025-03-13 16:21 ` Mikulas Patocka
1 sibling, 1 reply; 45+ messages in thread
From: Kent Overstreet @ 2025-03-12 16:09 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon,
Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block,
linux-fsdevel
On Wed, Mar 12, 2025 at 07:20:11AM -0700, Christoph Hellwig wrote:
> On Wed, Mar 12, 2025 at 09:26:55AM -0400, Kent Overstreet wrote:
> > We might be able to provide an API to _request_ a stable mapping to a
> > file - with proper synchronization, of course.
>
> We already have that with the pNFS layouts.
Good point, maybe Mikulas wants to look at that if they care about
squeezing every iops...
>
> > I don't recall anyone ever trying that, it'd replace all the weird
> > IF_SWAPFILE() hacks and be a safe way to do these kinds of performance
> > optimizations.
>
> IS_SWAPFILE isn't going way, as can't allow other writers to it.
> Also asides from the that the layouts are fairly complex.
>
> The right way ahead for swap is to literally just treat it as a slightly
> special case of direct I/o that is allowed to IS_SWAPFILE files. We
> can safely do writeback to file backed folios under memory pressure,
> so we can also go through the normal file system path.
Yeah, and that gets us e.g. encrypted swap on bcachefs
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-12 8:26 ` Ming Lei
@ 2025-03-13 1:36 ` Ming Lei
2025-03-13 16:36 ` Mikulas Patocka
1 sibling, 0 replies; 45+ messages in thread
From: Ming Lei @ 2025-03-13 1:36 UTC (permalink / raw)
To: Dave Chinner
Cc: Mikulas Patocka, Christoph Hellwig, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel, io-uring
On Wed, Mar 12, 2025 at 04:27:12PM +0800, Ming Lei wrote:
> On Wed, Mar 12, 2025 at 01:34:02PM +1100, Dave Chinner wrote:
...
>
> block layer/storage has many optimization for batching handling, if IOs
> are submitted from many contexts:
>
> - this batch handling optimization is gone
>
> - IO is re-ordered from underlying hardware viewpoint
>
> - more contention from FS write lock, because loop has single back file.
>
> That is why the single task context is taken from the beginning of loop aio,
> and it performs pretty well for sequential IO workloads, as I shown
> in the zloop example.
>
> >
> > > It isn't perfect, sometime it may be slower than running on io-wq
> > > directly.
> > >
> > > But is there any better way for covering everything?
> >
> > Yes - fix the loop queue workers.
>
> What you suggested is threaded aio by submitting IO concurrently from
> different task context, this way is not the most efficient one, otherwise
> modern language won't invent async/.await.
>
> In my test VM, by running Mikulas's fio script on loop/nvme by the attached
> threaded_aio patch:
>
> NOWAIT with MQ 4 : 70K iops(read), 70K iops(write), cpu util: 40%
> threaded_aio with MQ 4 : 64k iops(read), 64K iops(write), cpu util: 52%
> in tree loop(SQ) : 58K iops(read), 58K iops(write)
>
> Mikulas, please feel free to run your tests with threaded_aio:
>
> modprobe loop nr_hw_queues=4 threaded_aio=1
>
> by applying the attached the patch over the loop patchset.
>
> The performance gap could be more obvious in fast hardware.
For the normal single job sequential WRITE workload, on same test VM, still
loop over /dev/nvme0n1, and running fio over loop directly:
fio --direct=1 --bs=4k --runtime=40 --time_based --numjobs=1 --ioengine=libaio \
--iodepth=16 --group_reporting=1 --filename=/dev/loop0 -name=job --rw=write
threaded_aio(SQ) : 81k iops(write), cpu util: 20%
in tree loop(SQ) : 100K iops(write), cpu util: 7%
Thanks,
Ming
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-12 16:09 ` Kent Overstreet
@ 2025-03-13 12:44 ` Christoph Hellwig
0 siblings, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-13 12:44 UTC (permalink / raw)
To: Kent Overstreet
Cc: Christoph Hellwig, Mikulas Patocka, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel
On Wed, Mar 12, 2025 at 12:09:41PM -0400, Kent Overstreet wrote:
> > The right way ahead for swap is to literally just treat it as a slightly
> > special case of direct I/o that is allowed to IS_SWAPFILE files. We
> > can safely do writeback to file backed folios under memory pressure,
> > so we can also go through the normal file system path.
>
> Yeah, and that gets us e.g. encrypted swap on bcachefs
You can already do this pretty easily, take a look at nfs_swap_rw.
It just would be nice to lift the boilerplate code to common code
and make this the default.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-12 14:20 ` Christoph Hellwig
2025-03-12 16:09 ` Kent Overstreet
@ 2025-03-13 16:21 ` Mikulas Patocka
2025-03-13 16:33 ` Kent Overstreet
1 sibling, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2025-03-13 16:21 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kent Overstreet, Jens Axboe, Jooyung Han, Alasdair Kergon,
Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block,
linux-fsdevel
> IS_SWAPFILE isn't going way, as can't allow other writers to it.
> Also asides from the that the layouts are fairly complex.
>
> The right way ahead for swap is to literally just treat it as a slightly
> special case of direct I/o that is allowed to IS_SWAPFILE files. We
> can safely do writeback to file backed folios under memory pressure,
> so we can also go through the normal file system path.
But that is prone to low-memory-deadlock because the filesystems allocate
buffer memory when they are mapping logical blocks to physical blocks. You
would need to have a mempool in the filesystems, so that they can make
forward progress even if there is no memory free - and that would
complicate them.
GFP_NOIO won't help if the memory is completely exhausted. Only mempool
would help in this situation.
Mikulas
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-13 16:21 ` Mikulas Patocka
@ 2025-03-13 16:33 ` Kent Overstreet
0 siblings, 0 replies; 45+ messages in thread
From: Kent Overstreet @ 2025-03-13 16:33 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon,
Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block,
linux-fsdevel
On Thu, Mar 13, 2025 at 05:21:36PM +0100, Mikulas Patocka wrote:
> > IS_SWAPFILE isn't going way, as can't allow other writers to it.
> > Also asides from the that the layouts are fairly complex.
> >
> > The right way ahead for swap is to literally just treat it as a slightly
> > special case of direct I/o that is allowed to IS_SWAPFILE files. We
> > can safely do writeback to file backed folios under memory pressure,
> > so we can also go through the normal file system path.
>
> But that is prone to low-memory-deadlock because the filesystems allocate
> buffer memory when they are mapping logical blocks to physical blocks. You
> would need to have a mempool in the filesystems, so that they can make
> forward progress even if there is no memory free - and that would
> complicate them.
I can't speak for everone else, but bcachefs has those mempools.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-12 8:26 ` Ming Lei
2025-03-13 1:36 ` Ming Lei
@ 2025-03-13 16:36 ` Mikulas Patocka
2025-03-18 4:27 ` Dave Chinner
1 sibling, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2025-03-13 16:36 UTC (permalink / raw)
To: Ming Lei
Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel, io-uring
On Wed, 12 Mar 2025, Ming Lei wrote:
> > > It isn't perfect, sometime it may be slower than running on io-wq
> > > directly.
> > >
> > > But is there any better way for covering everything?
> >
> > Yes - fix the loop queue workers.
>
> What you suggested is threaded aio by submitting IO concurrently from
> different task context, this way is not the most efficient one, otherwise
> modern language won't invent async/.await.
>
> In my test VM, by running Mikulas's fio script on loop/nvme by the attached
> threaded_aio patch:
>
> NOWAIT with MQ 4 : 70K iops(read), 70K iops(write), cpu util: 40%
> threaded_aio with MQ 4 : 64k iops(read), 64K iops(write), cpu util: 52%
> in tree loop(SQ) : 58K iops(read), 58K iops(write)
>
> Mikulas, please feel free to run your tests with threaded_aio:
>
> modprobe loop nr_hw_queues=4 threaded_aio=1
>
> by applying the attached the patch over the loop patchset.
>
> The performance gap could be more obvious in fast hardware.
With "threaded_aio=1":
Sync io
fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
xfs/loop/xfs
READ: bw=300MiB/s (315MB/s), 300MiB/s-300MiB/s (315MB/s-315MB/s), io=3001MiB (3147MB), run=10001-10001msec
WRITE: bw=300MiB/s (315MB/s), 300MiB/s-300MiB/s (315MB/s-315MB/s), io=3004MiB (3149MB), run=10001-10001msec
Async io
fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
xfs/loop/xfs
READ: bw=869MiB/s (911MB/s), 869MiB/s-869MiB/s (911MB/s-911MB/s), io=8694MiB (9116MB), run=10002-10002msec
WRITE: bw=870MiB/s (913MB/s), 870MiB/s-870MiB/s (913MB/s-913MB/s), io=8706MiB (9129MB), run=10002-10002msec
Without "threaded_aio=1":
Sync io
fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
xfs/loop/xfs
READ: bw=348MiB/s (365MB/s), 348MiB/s-348MiB/s (365MB/s-365MB/s), io=3481MiB (3650MB), run=10001-10001msec
WRITE: bw=348MiB/s (365MB/s), 348MiB/s-348MiB/s (365MB/s-365MB/s), io=3484MiB (3653MB), run=10001-10001msec
Async io
fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
xfs/loop/xfs
READ: bw=1186MiB/s (1244MB/s), 1186MiB/s-1186MiB/s (1244MB/s-1244MB/s), io=11.6GiB (12.4GB), run=10001-10001msec
WRITE: bw=1187MiB/s (1245MB/s), 1187MiB/s-1187MiB/s (1245MB/s-1245MB/s), io=11.6GiB (12.5GB), run=10001-10001msec
Mikulas
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-13 16:36 ` Mikulas Patocka
@ 2025-03-18 4:27 ` Dave Chinner
2025-03-18 7:57 ` Christoph Hellwig
0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2025-03-18 4:27 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Ming Lei, Christoph Hellwig, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel, io-uring
On Thu, Mar 13, 2025 at 05:36:53PM +0100, Mikulas Patocka wrote:
> On Wed, 12 Mar 2025, Ming Lei wrote:
>
> > > > It isn't perfect, sometime it may be slower than running on io-wq
> > > > directly.
> > > >
> > > > But is there any better way for covering everything?
> > >
> > > Yes - fix the loop queue workers.
> >
> > What you suggested is threaded aio by submitting IO concurrently from
> > different task context, this way is not the most efficient one, otherwise
> > modern language won't invent async/.await.
> >
> > In my test VM, by running Mikulas's fio script on loop/nvme by the attached
> > threaded_aio patch:
> >
> > NOWAIT with MQ 4 : 70K iops(read), 70K iops(write), cpu util: 40%
> > threaded_aio with MQ 4 : 64k iops(read), 64K iops(write), cpu util: 52%
> > in tree loop(SQ) : 58K iops(read), 58K iops(write)
> >
> > Mikulas, please feel free to run your tests with threaded_aio:
> >
> > modprobe loop nr_hw_queues=4 threaded_aio=1
> >
> > by applying the attached the patch over the loop patchset.
> >
> > The performance gap could be more obvious in fast hardware.
>
> With "threaded_aio=1":
>
> Sync io
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> xfs/loop/xfs
> READ: bw=300MiB/s (315MB/s), 300MiB/s-300MiB/s (315MB/s-315MB/s), io=3001MiB (3147MB), run=10001-10001msec
> WRITE: bw=300MiB/s (315MB/s), 300MiB/s-300MiB/s (315MB/s-315MB/s), io=3004MiB (3149MB), run=10001-10001msec
>
> Async io
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> xfs/loop/xfs
> READ: bw=869MiB/s (911MB/s), 869MiB/s-869MiB/s (911MB/s-911MB/s), io=8694MiB (9116MB), run=10002-10002msec
> WRITE: bw=870MiB/s (913MB/s), 870MiB/s-870MiB/s (913MB/s-913MB/s), io=8706MiB (9129MB), run=10002-10002msec
The original numbers for the xfs/loop/xfs performance were 220MiB/s
(sync) and 276MiB/s (async), so this is actually a very big step
forward compared to the existing code.
Yes, it's not quite as fast as the NOWAIT case for pure overwrites -
348MB/s (sync) and 1186MB/s (async), but we predicted (and expected)
that this would be the case.
However, this is still testing the static file, pure overwrite case
only, so there is never any IO that blocks during submission. When
IO will block (because there are allocating writes in progress)
performance in the NOWAIT case will trend back towards the original
performance levels because the single loop queue blocking submission
will still be the limiting factor for all IO that needs to block.
IOWs, these results show that to get decent, consistent performance
out of the loop device we need threaded blocking submission so users
do not have to care about optimising individual loop device
instances for the layout of their image files.
Yes, NOWAIT may then add an incremental performance improvement on
top for optimal layout cases, but I'm still not yet convinced that
it is a generally applicable loop device optimisation that everyone
wants to always enable due to the potential for 100% NOWAIT
submission failure on any given loop device.....
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-18 4:27 ` Dave Chinner
@ 2025-03-18 7:57 ` Christoph Hellwig
2025-03-18 9:34 ` Ming Lei
0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-18 7:57 UTC (permalink / raw)
To: Dave Chinner
Cc: Mikulas Patocka, Ming Lei, Christoph Hellwig, Jens Axboe,
Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen,
zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring
On Tue, Mar 18, 2025 at 03:27:48PM +1100, Dave Chinner wrote:
> Yes, NOWAIT may then add an incremental performance improvement on
> top for optimal layout cases, but I'm still not yet convinced that
> it is a generally applicable loop device optimisation that everyone
> wants to always enable due to the potential for 100% NOWAIT
> submission failure on any given loop device.....
Yes, I think this is a really good first step:
1) switch loop to use a per-command work_item unconditionally, which also
has the nice effect that it cleans up the horrible mess of the
per-blkcg workers. (note that this is what the nvmet file backend has
always done with good result)
2) look into NOWAIT submission, especially for reads this should be
a clear winner and probaby done unconditionally. For writes it
might be a bit of a tradeoff if we expect the writes to allocate
a lot, so we might want some kind of tunable for it.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-18 7:57 ` Christoph Hellwig
@ 2025-03-18 9:34 ` Ming Lei
2025-03-20 7:08 ` Christoph Hellwig
0 siblings, 1 reply; 45+ messages in thread
From: Ming Lei @ 2025-03-18 9:34 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, Mikulas Patocka, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel, io-uring
On Tue, Mar 18, 2025 at 12:57:17AM -0700, Christoph Hellwig wrote:
> On Tue, Mar 18, 2025 at 03:27:48PM +1100, Dave Chinner wrote:
> > Yes, NOWAIT may then add an incremental performance improvement on
> > top for optimal layout cases, but I'm still not yet convinced that
> > it is a generally applicable loop device optimisation that everyone
> > wants to always enable due to the potential for 100% NOWAIT
> > submission failure on any given loop device.....
NOWAIT failure can be avoided actually:
https://lore.kernel.org/linux-block/20250314021148.3081954-6-ming.lei@redhat.com/
>
> Yes, I think this is a really good first step:
>
> 1) switch loop to use a per-command work_item unconditionally, which also
> has the nice effect that it cleans up the horrible mess of the
> per-blkcg workers. (note that this is what the nvmet file backend has
It could be worse to take per-command work, because IO handling crosses
all system wq worker contexts.
> always done with good result)
per-command work does burn lots of CPU unnecessarily, it isn't good for
use case of container, and it can not perform as well as NOWAIT.
> 2) look into NOWAIT submission, especially for reads this should be
> a clear winner and probaby done unconditionally. For writes it
> might be a bit of a tradeoff if we expect the writes to allocate
> a lot, so we might want some kind of tunable for it.
It is a winner for over-write too.
WRITE with allocation can be kept to submit from wq context, see my
patchset V2.
Thanks,
Ming
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-18 9:34 ` Ming Lei
@ 2025-03-20 7:08 ` Christoph Hellwig
2025-03-20 7:41 ` Ming Lei
0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-20 7:08 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Dave Chinner, Mikulas Patocka, Jens Axboe,
Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen,
zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring
On Tue, Mar 18, 2025 at 05:34:28PM +0800, Ming Lei wrote:
> On Tue, Mar 18, 2025 at 12:57:17AM -0700, Christoph Hellwig wrote:
> > On Tue, Mar 18, 2025 at 03:27:48PM +1100, Dave Chinner wrote:
> > > Yes, NOWAIT may then add an incremental performance improvement on
> > > top for optimal layout cases, but I'm still not yet convinced that
> > > it is a generally applicable loop device optimisation that everyone
> > > wants to always enable due to the potential for 100% NOWAIT
> > > submission failure on any given loop device.....
>
> NOWAIT failure can be avoided actually:
>
> https://lore.kernel.org/linux-block/20250314021148.3081954-6-ming.lei@redhat.com/
That's a very complex set of heuristics which doesn't match up
with other uses of it.
>
> >
> > Yes, I think this is a really good first step:
> >
> > 1) switch loop to use a per-command work_item unconditionally, which also
> > has the nice effect that it cleans up the horrible mess of the
> > per-blkcg workers. (note that this is what the nvmet file backend has
>
> It could be worse to take per-command work, because IO handling crosses
> all system wq worker contexts.
So do other workloads with pretty good success.
>
> > always done with good result)
>
> per-command work does burn lots of CPU unnecessarily, it isn't good for
> use case of container
That does not match my observations in say nvmet. But if you have
numbers please share them.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-20 7:08 ` Christoph Hellwig
@ 2025-03-20 7:41 ` Ming Lei
2025-03-20 14:22 ` Christoph Hellwig
2025-03-25 10:15 ` Dave Chinner
0 siblings, 2 replies; 45+ messages in thread
From: Ming Lei @ 2025-03-20 7:41 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, Mikulas Patocka, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel, io-uring
On Thu, Mar 20, 2025 at 12:08:19AM -0700, Christoph Hellwig wrote:
> On Tue, Mar 18, 2025 at 05:34:28PM +0800, Ming Lei wrote:
> > On Tue, Mar 18, 2025 at 12:57:17AM -0700, Christoph Hellwig wrote:
> > > On Tue, Mar 18, 2025 at 03:27:48PM +1100, Dave Chinner wrote:
> > > > Yes, NOWAIT may then add an incremental performance improvement on
> > > > top for optimal layout cases, but I'm still not yet convinced that
> > > > it is a generally applicable loop device optimisation that everyone
> > > > wants to always enable due to the potential for 100% NOWAIT
> > > > submission failure on any given loop device.....
> >
> > NOWAIT failure can be avoided actually:
> >
> > https://lore.kernel.org/linux-block/20250314021148.3081954-6-ming.lei@redhat.com/
>
> That's a very complex set of heuristics which doesn't match up
> with other uses of it.
I'd suggest you to point them out in the patch review.
>
> >
> > >
> > > Yes, I think this is a really good first step:
> > >
> > > 1) switch loop to use a per-command work_item unconditionally, which also
> > > has the nice effect that it cleans up the horrible mess of the
> > > per-blkcg workers. (note that this is what the nvmet file backend has
> >
> > It could be worse to take per-command work, because IO handling crosses
> > all system wq worker contexts.
>
> So do other workloads with pretty good success.
>
> >
> > > always done with good result)
> >
> > per-command work does burn lots of CPU unnecessarily, it isn't good for
> > use case of container
>
> That does not match my observations in say nvmet. But if you have
> numbers please share them.
Please see the result I posted:
https://lore.kernel.org/linux-block/Z9FFTiuMC8WD6qMH@fedora/
Thanks,
Ming
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-20 7:41 ` Ming Lei
@ 2025-03-20 14:22 ` Christoph Hellwig
2025-03-20 14:36 ` Ming Lei
2025-03-25 10:15 ` Dave Chinner
1 sibling, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-03-20 14:22 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Dave Chinner, Mikulas Patocka, Jens Axboe,
Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen,
zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring
On Thu, Mar 20, 2025 at 03:41:58PM +0800, Ming Lei wrote:
> > That does not match my observations in say nvmet. But if you have
> > numbers please share them.
>
> Please see the result I posted:
>
> https://lore.kernel.org/linux-block/Z9FFTiuMC8WD6qMH@fedora/
That shows it improves numbers and not that it doens't.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-20 14:22 ` Christoph Hellwig
@ 2025-03-20 14:36 ` Ming Lei
0 siblings, 0 replies; 45+ messages in thread
From: Ming Lei @ 2025-03-20 14:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, Mikulas Patocka, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel, io-uring
On Thu, Mar 20, 2025 at 10:22 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Mar 20, 2025 at 03:41:58PM +0800, Ming Lei wrote:
> > > That does not match my observations in say nvmet. But if you have
> > > numbers please share them.
> >
> > Please see the result I posted:
> >
> > https://lore.kernel.org/linux-block/Z9FFTiuMC8WD6qMH@fedora/
>
> That shows it improves numbers and not that it doens't.
Fine, then please look at the result in the following reply:
https://lore.kernel.org/linux-block/Z9I2lm31KOQ784nb@fedora/
Thanks,
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-20 7:41 ` Ming Lei
2025-03-20 14:22 ` Christoph Hellwig
@ 2025-03-25 10:15 ` Dave Chinner
2025-03-25 12:23 ` Ming Lei
1 sibling, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2025-03-25 10:15 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Mikulas Patocka, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel, io-uring
On Thu, Mar 20, 2025 at 03:41:58PM +0800, Ming Lei wrote:
> On Thu, Mar 20, 2025 at 12:08:19AM -0700, Christoph Hellwig wrote:
> > On Tue, Mar 18, 2025 at 05:34:28PM +0800, Ming Lei wrote:
> > > On Tue, Mar 18, 2025 at 12:57:17AM -0700, Christoph Hellwig wrote:
> > > > On Tue, Mar 18, 2025 at 03:27:48PM +1100, Dave Chinner wrote:
> > > > > Yes, NOWAIT may then add an incremental performance improvement on
> > > > > top for optimal layout cases, but I'm still not yet convinced that
> > > > > it is a generally applicable loop device optimisation that everyone
> > > > > wants to always enable due to the potential for 100% NOWAIT
> > > > > submission failure on any given loop device.....
> > >
> > > NOWAIT failure can be avoided actually:
> > >
> > > https://lore.kernel.org/linux-block/20250314021148.3081954-6-ming.lei@redhat.com/
> >
> > That's a very complex set of heuristics which doesn't match up
> > with other uses of it.
>
> I'd suggest you to point them out in the patch review.
Until you pointed them out here, I didn't know these patches
existed.
Please cc linux-fsdevel on any loop device changes you are
proposing, Ming. It is as much a filesystem driver as it is a block
device, and it changes need review from both sides of the fence.
> > > > Yes, I think this is a really good first step:
> > > >
> > > > 1) switch loop to use a per-command work_item unconditionally, which also
> > > > has the nice effect that it cleans up the horrible mess of the
> > > > per-blkcg workers. (note that this is what the nvmet file backend has
> > >
> > > It could be worse to take per-command work, because IO handling crosses
> > > all system wq worker contexts.
> >
> > So do other workloads with pretty good success.
> >
> > >
> > > > always done with good result)
> > >
> > > per-command work does burn lots of CPU unnecessarily, it isn't good for
> > > use case of container
> >
> > That does not match my observations in say nvmet. But if you have
> > numbers please share them.
>
> Please see the result I posted:
>
> https://lore.kernel.org/linux-block/Z9FFTiuMC8WD6qMH@fedora/
You are arguing in circles about how we need to optimise for static
file layouts.
Please listen to the filesystem people when they tell you that
static file layouts are a -secondary- optimisation target for loop
devices.
The primary optimisation target is the modification that makes all
types of IO perform better in production, not just the one use case
that overwrite-specific IO benchmarks exercise.
If you want me to test your changes, I have a very loop device heavy
workload here - it currently creates about 300 *sparse* loop devices
totalling about 1.2TB of capacity, then does all sorts of IO to them
through both the loop devices themselves and filesystems created on
top of the loop devices. It typically generates 4-5GB/s of IO
through the loop devices to the backing filesystem and it's physical
storage.
Speeding up or slowing down IO submission through the loop devices
has direct impact on the speed of the workload. Using buffered IO
through the loop device right now is about 25% faster than using
aio+dio for the loop because there is some amount of re-read and
re-write in the filesystem IO patterns. That said, AIO+DIO should be
much faster than it is, hence my interest in making all the AIO+DIO
IO submission independent of potential blocking operations.
Hence if you have patch sets that improve loop device performance,
then you need to make sure filesystem people like myself see those
patch series so they can be tested and reviewed in a timely manner.
That means you need to cc loop device patches to linux-fsdevel....
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target
2025-03-25 10:15 ` Dave Chinner
@ 2025-03-25 12:23 ` Ming Lei
0 siblings, 0 replies; 45+ messages in thread
From: Ming Lei @ 2025-03-25 12:23 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Mikulas Patocka, Jens Axboe, Jooyung Han,
Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac,
dm-devel, linux-block, linux-fsdevel, io-uring
On Tue, Mar 25, 2025 at 09:15:26PM +1100, Dave Chinner wrote:
> On Thu, Mar 20, 2025 at 03:41:58PM +0800, Ming Lei wrote:
> > On Thu, Mar 20, 2025 at 12:08:19AM -0700, Christoph Hellwig wrote:
> > > On Tue, Mar 18, 2025 at 05:34:28PM +0800, Ming Lei wrote:
> > > > On Tue, Mar 18, 2025 at 12:57:17AM -0700, Christoph Hellwig wrote:
> > > > > On Tue, Mar 18, 2025 at 03:27:48PM +1100, Dave Chinner wrote:
> > > > > > Yes, NOWAIT may then add an incremental performance improvement on
> > > > > > top for optimal layout cases, but I'm still not yet convinced that
> > > > > > it is a generally applicable loop device optimisation that everyone
> > > > > > wants to always enable due to the potential for 100% NOWAIT
> > > > > > submission failure on any given loop device.....
> > > >
> > > > NOWAIT failure can be avoided actually:
> > > >
> > > > https://lore.kernel.org/linux-block/20250314021148.3081954-6-ming.lei@redhat.com/
> > >
> > > That's a very complex set of heuristics which doesn't match up
> > > with other uses of it.
> >
> > I'd suggest you to point them out in the patch review.
>
> Until you pointed them out here, I didn't know these patches
> existed.
>
> Please cc linux-fsdevel on any loop device changes you are
> proposing, Ming. It is as much a filesystem driver as it is a block
> device, and it changes need review from both sides of the fence.
Please see the patchset:
https://lore.kernel.org/linux-block/20250322012617.354222-1-ming.lei@redhat.com/
>
> > > > > Yes, I think this is a really good first step:
> > > > >
> > > > > 1) switch loop to use a per-command work_item unconditionally, which also
> > > > > has the nice effect that it cleans up the horrible mess of the
> > > > > per-blkcg workers. (note that this is what the nvmet file backend has
> > > >
> > > > It could be worse to take per-command work, because IO handling crosses
> > > > all system wq worker contexts.
> > >
> > > So do other workloads with pretty good success.
> > >
> > > >
> > > > > always done with good result)
> > > >
> > > > per-command work does burn lots of CPU unnecessarily, it isn't good for
> > > > use case of container
> > >
> > > That does not match my observations in say nvmet. But if you have
> > > numbers please share them.
> >
> > Please see the result I posted:
> >
> > https://lore.kernel.org/linux-block/Z9FFTiuMC8WD6qMH@fedora/
>
> You are arguing in circles about how we need to optimise for static
> file layouts.
>
> Please listen to the filesystem people when they tell you that
> static file layouts are a -secondary- optimisation target for loop
> devices.
>
> The primary optimisation target is the modification that makes all
> types of IO perform better in production, not just the one use case
> that overwrite-specific IO benchmarks exercise.
>
> If you want me to test your changes, I have a very loop device heavy
> workload here - it currently creates about 300 *sparse* loop devices
> totalling about 1.2TB of capacity, then does all sorts of IO to them
> through both the loop devices themselves and filesystems created on
> top of the loop devices. It typically generates 4-5GB/s of IO
> through the loop devices to the backing filesystem and it's physical
> storage.
The patchset does cover the sparse backfile, and I also provide one test
case in which one completely sparse file is used, and make sure that
there isn't regression in this case.
This patchset is supposed to address Mikulas's case of stable FS mapping,
meantime without introducing regression on other cases, such as
the sparse backing file.
>
> Speeding up or slowing down IO submission through the loop devices
> has direct impact on the speed of the workload. Using buffered IO
> through the loop device right now is about 25% faster than using
> aio+dio for the loop because there is some amount of re-read and
> re-write in the filesystem IO patterns. That said, AIO+DIO should be
> much faster than it is, hence my interest in making all the AIO+DIO
> IO submission independent of potential blocking operations.
>
> Hence if you have patch sets that improve loop device performance,
> then you need to make sure filesystem people like myself see those
> patch series so they can be tested and reviewed in a timely manner.
> That means you need to cc loop device patches to linux-fsdevel....
OK, will Cc you and linux-fsdevel in future loop patch submission.
Thanks,
Ming
^ permalink raw reply [flat|nested] 45+ messages in thread
end of thread, other threads:[~2025-03-25 12:24 UTC | newest]
Thread overview: 45+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <7d6ae2c9-df8e-50d0-7ad6-b787cb3cfab4@redhat.com>
2025-03-03 13:59 ` [PATCH] the dm-loop target Christoph Hellwig
[not found] ` <CAM23VxprhJgOPfhxQf6QNWzHd6+-ZwbjSo-oMHCD2WDQiKntMg@mail.gmail.com>
2025-03-03 15:13 ` Christoph Hellwig
2025-03-03 15:22 ` Matthew Wilcox
2025-03-03 15:31 ` Christoph Hellwig
[not found] ` <CAM23VxprSduDDK8qvLVkUt9WWmLMPFjhqKB8X4e6gw7Wv-6R2w@mail.gmail.com>
2025-03-03 17:24 ` Christoph Hellwig
[not found] ` <CAM23Vxoxyrf9nwJd1Xe8uncAPiyK8yaNZNsugwX8p=qo1n6yVg@mail.gmail.com>
2025-03-04 13:52 ` Christoph Hellwig
2025-03-03 16:16 ` Mikulas Patocka
2025-03-03 17:24 ` Christoph Hellwig
2025-03-03 21:03 ` Mikulas Patocka
2025-03-04 2:13 ` Dave Chinner
2025-03-04 11:18 ` Mikulas Patocka
2025-03-04 13:50 ` Christoph Hellwig
2025-03-05 0:01 ` Dave Chinner
2025-03-07 15:21 ` Mikulas Patocka
2025-03-08 3:49 ` Darrick J. Wong
2025-03-08 20:45 ` Mikulas Patocka
2025-03-09 0:05 ` Ming Lei
2025-03-10 11:18 ` Mikulas Patocka
2025-03-11 1:27 ` Dave Chinner
2025-03-11 10:43 ` Ming Lei
2025-03-12 2:34 ` Dave Chinner
2025-03-12 6:24 ` Christoph Hellwig
2025-03-12 8:26 ` Ming Lei
2025-03-13 1:36 ` Ming Lei
2025-03-13 16:36 ` Mikulas Patocka
2025-03-18 4:27 ` Dave Chinner
2025-03-18 7:57 ` Christoph Hellwig
2025-03-18 9:34 ` Ming Lei
2025-03-20 7:08 ` Christoph Hellwig
2025-03-20 7:41 ` Ming Lei
2025-03-20 14:22 ` Christoph Hellwig
2025-03-20 14:36 ` Ming Lei
2025-03-25 10:15 ` Dave Chinner
2025-03-25 12:23 ` Ming Lei
2025-03-09 0:16 ` Ming Lei
2025-03-10 11:20 ` Mikulas Patocka
2025-03-04 13:49 ` Christoph Hellwig
[not found] ` <CAM23Vxr=fKy-0L1R5P-5h6A95acKT_d=CC1E+TAzAs8v6q9gHw@mail.gmail.com>
2025-03-04 16:04 ` Christoph Hellwig
[not found] ` <CAM23VxqJX46DCpCiH5qxPpDLtMVg87Ba8sx55aQ4hvt-XaHzuQ@mail.gmail.com>
2025-03-04 17:17 ` Christoph Hellwig
2025-03-12 13:26 ` Kent Overstreet
2025-03-12 14:20 ` Christoph Hellwig
2025-03-12 16:09 ` Kent Overstreet
2025-03-13 12:44 ` Christoph Hellwig
2025-03-13 16:21 ` Mikulas Patocka
2025-03-13 16:33 ` Kent Overstreet
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).