* Re: [PATCH] the dm-loop target [not found] <7d6ae2c9-df8e-50d0-7ad6-b787cb3cfab4@redhat.com> @ 2025-03-03 13:59 ` Christoph Hellwig [not found] ` <CAM23VxprhJgOPfhxQf6QNWzHd6+-ZwbjSo-oMHCD2WDQiKntMg@mail.gmail.com> 2025-03-03 16:16 ` Mikulas Patocka 0 siblings, 2 replies; 45+ messages in thread From: Christoph Hellwig @ 2025-03-03 13:59 UTC (permalink / raw) To: Mikulas Patocka Cc: Jooyung Han, Alasdair Kergon, Mike Snitzer, zkabelac, dm-devel, linux-block, linux-fsdevel On Mon, Mar 03, 2025 at 11:24:27AM +0100, Mikulas Patocka wrote: > This is the dm-loop target - a replacement for the regular loop driver > with better performance. The dm-loop target builds a map of the file in > the constructor and it just remaps bios according to this map. Using ->bmap is broken and a no-go for new code. If you have any real performance issues with the loop driver, document and report them so that they can be fixed instead of working around them by duplicating the code (and in this case making the new code completely broken). ^ permalink raw reply [flat|nested] 45+ messages in thread
[parent not found: <CAM23VxprhJgOPfhxQf6QNWzHd6+-ZwbjSo-oMHCD2WDQiKntMg@mail.gmail.com>]
* Re: [PATCH] the dm-loop target [not found] ` <CAM23VxprhJgOPfhxQf6QNWzHd6+-ZwbjSo-oMHCD2WDQiKntMg@mail.gmail.com> @ 2025-03-03 15:13 ` Christoph Hellwig 2025-03-03 15:22 ` Matthew Wilcox [not found] ` <CAM23VxprSduDDK8qvLVkUt9WWmLMPFjhqKB8X4e6gw7Wv-6R2w@mail.gmail.com> 0 siblings, 2 replies; 45+ messages in thread From: Christoph Hellwig @ 2025-03-03 15:13 UTC (permalink / raw) To: Heinz Mauelshagen Cc: Christoph Hellwig, Mikulas Patocka, Jooyung Han, Alasdair Kergon, Mike Snitzer, zkabelac, dm-devel, linux-block, linux-fsdevel On Mon, Mar 03, 2025 at 03:57:23PM +0100, Heinz Mauelshagen wrote: > dm-loop avoids the file system abstraction, thus gains the resulting > performance advantage versus the loop driver. How? > dm-loop obviously requires full provisioning, thus sparse files are being > detected and error handled. > > Additionally, block relocation on CoW filesystems has to be prevented. > dm-loop uses S_SWAPFILE to do so but that flag's limited to swap semantics > and is overloaded as such. > > Should we avoid the flag and enforce use of non-CoW filesystems for backing > or checks on non-CoW files (inode->i_flags)? No, ->bmap is fundamentally flawed. No new code should be using it, and we need to move the places still using it (most notably swap and the md bitmap code) off it. It can't deal with any kind of changes to the file system topology and is a risk of data corruption because if used in the I/O path it bypasses the file system entirely. If it wasn't, we'd use it in the loop driver. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-03 15:13 ` Christoph Hellwig @ 2025-03-03 15:22 ` Matthew Wilcox 2025-03-03 15:31 ` Christoph Hellwig [not found] ` <CAM23VxprSduDDK8qvLVkUt9WWmLMPFjhqKB8X4e6gw7Wv-6R2w@mail.gmail.com> 1 sibling, 1 reply; 45+ messages in thread From: Matthew Wilcox @ 2025-03-03 15:22 UTC (permalink / raw) To: Christoph Hellwig Cc: Heinz Mauelshagen, Mikulas Patocka, Jooyung Han, Alasdair Kergon, Mike Snitzer, zkabelac, dm-devel, linux-block, linux-fsdevel On Mon, Mar 03, 2025 at 07:13:49AM -0800, Christoph Hellwig wrote: > No, ->bmap is fundamentally flawed. No new code should be using it, and > we need to move the places still using it (most notably swap and the md > bitmap code) off it. It can't deal with any kind of changes to the file > system topology and is a risk to data corruption because if used in the > I/O path it bypasses the file system entirely.. If it wasn't we'd use it > in the loop driver. Funnily, I was looking at the md bitmap code too. It's the last part of the kernel using buffer heads with not-a-pagecache page, so it's the only caller of alloc_page_buffers() remaining. I think it should just use the page cache to read/write the file data, but I haven't looked into it in detail. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-03 15:22 ` Matthew Wilcox @ 2025-03-03 15:31 ` Christoph Hellwig 0 siblings, 0 replies; 45+ messages in thread From: Christoph Hellwig @ 2025-03-03 15:31 UTC (permalink / raw) To: Matthew Wilcox Cc: Christoph Hellwig, Heinz Mauelshagen, Mikulas Patocka, Jooyung Han, Alasdair Kergon, Mike Snitzer, zkabelac, dm-devel, linux-block, linux-fsdevel On Mon, Mar 03, 2025 at 03:22:48PM +0000, Matthew Wilcox wrote: > Funnily, I was looking at the md bitmap code too. It's the last part of > the kernel using buffer heads with not-a-pagecache page, so it's the > only caller of alloc_page_buffers() remaining. > > I think it should just use the page cache to read/write the file data, > but I haven't looked into it in detail. The md bitmap code is awful as it abuses its own buffer heads in a way that is non-coherent with the underlying fs. It should just be using the vfs read/write helpers for in-kernel direct I/O with a scoped nofs context. ^ permalink raw reply [flat|nested] 45+ messages in thread
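As a rough illustration of the "VFS read/write helpers with a scoped nofs context" approach (a sketch only: kernel_read() and memalloc_nofs_save() are existing kernel helpers, but the function below and the way md would actually wire it up are assumptions, not the md code):

    /* Sketch: read bitmap data through the VFS instead of private buffer
     * heads.  Needs <linux/fs.h> and <linux/sched/mm.h>.  The scoped nofs
     * context keeps reclaim from recursing into filesystems while the read
     * path allocates memory; direct I/O is assumed to come from the file
     * having been opened O_DIRECT inside the kernel. */
    static ssize_t md_bitmap_read_vfs(struct file *file, void *buf,
                                      size_t len, loff_t pos)
    {
            unsigned int nofs_flags = memalloc_nofs_save();
            ssize_t ret = kernel_read(file, buf, len, &pos);

            memalloc_nofs_restore(nofs_flags);
            return ret;
    }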
[parent not found: <CAM23VxprSduDDK8qvLVkUt9WWmLMPFjhqKB8X4e6gw7Wv-6R2w@mail.gmail.com>]
* Re: [PATCH] the dm-loop target [not found] ` <CAM23VxprSduDDK8qvLVkUt9WWmLMPFjhqKB8X4e6gw7Wv-6R2w@mail.gmail.com> @ 2025-03-03 17:24 ` Christoph Hellwig [not found] ` <CAM23Vxoxyrf9nwJd1Xe8uncAPiyK8yaNZNsugwX8p=qo1n6yVg@mail.gmail.com> 0 siblings, 1 reply; 45+ messages in thread From: Christoph Hellwig @ 2025-03-03 17:24 UTC (permalink / raw) To: Heinz Mauelshagen Cc: Christoph Hellwig, Mikulas Patocka, Jooyung Han, Alasdair Kergon, Mike Snitzer, zkabelac, dm-devel, linux-block, linux-fsdevel On Mon, Mar 03, 2025 at 05:06:16PM +0100, Heinz Mauelshagen wrote: > On Mon, Mar 3, 2025 at 4:24 PM Christoph Hellwig <hch@infradead.org> wrote: > > > On Mon, Mar 03, 2025 at 03:57:23PM +0100, Heinz Mauelshagen wrote: > > > dm-loop avoids the file system abstraction, thus gains the resulting > > > performance advantage versus the loop driver. > > > > How? > > > > It uses a predefined map and thus avoids VFS -> FS write path overhead > completely. But you simply can't use a "pre-defined map". It's fundamentally unsafe. > The inode fiemap() operation would avoid issues relative to outdated bmap(). No, using that for I/O is even worse. ^ permalink raw reply [flat|nested] 45+ messages in thread
[parent not found: <CAM23Vxoxyrf9nwJd1Xe8uncAPiyK8yaNZNsugwX8p=qo1n6yVg@mail.gmail.com>]
* Re: [PATCH] the dm-loop target [not found] ` <CAM23Vxoxyrf9nwJd1Xe8uncAPiyK8yaNZNsugwX8p=qo1n6yVg@mail.gmail.com> @ 2025-03-04 13:52 ` Christoph Hellwig 0 siblings, 0 replies; 45+ messages in thread From: Christoph Hellwig @ 2025-03-04 13:52 UTC (permalink / raw) To: Heinz Mauelshagen Cc: Christoph Hellwig, Mikulas Patocka, Jooyung Han, Alasdair Kergon, Mike Snitzer, zkabelac, dm-devel, linux-block, linux-fsdevel On Mon, Mar 03, 2025 at 06:34:45PM +0100, Heinz Mauelshagen wrote: > It is, unless you ensure the pre-allocated and fully provisioned file to be > immutable. Which you can't. > > > The inode fiemap() operation would avoid issues relative to outdated > > bmap(). > > > > No, using that for I/O is even worse. > > > > Are you saying it imposes more overhead than the VFS APIs? No, but the reporting in fiemap is even less defined than the already overly loose bmap. fiemap for btrfs, for example, doesn't report actual block numbers. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-03 13:59 ` [PATCH] the dm-loop target Christoph Hellwig [not found] ` <CAM23VxprhJgOPfhxQf6QNWzHd6+-ZwbjSo-oMHCD2WDQiKntMg@mail.gmail.com> @ 2025-03-03 16:16 ` Mikulas Patocka 2025-03-03 17:24 ` Christoph Hellwig 1 sibling, 1 reply; 45+ messages in thread From: Mikulas Patocka @ 2025-03-03 16:16 UTC (permalink / raw) To: Christoph Hellwig, Jens Axboe Cc: Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Mon, 3 Mar 2025, Christoph Hellwig wrote: > On Mon, Mar 03, 2025 at 11:24:27AM +0100, Mikulas Patocka wrote: > > This is the dm-loop target - a replacement for the regular loop driver > > with better performance. The dm-loop target builds a map of the file in > > the constructor and it just remaps bios according to this map. > > Using ->bmap is broken and a no-go for new code. What should I use instead of bmap? Is fiemap exported for use in the kernel? ext4_bmap flushes the journal if journaled data are used. Is there some equivalent function that would provide the same guarantee w.r.t. journaled data? > If you have any real > performance issues with the loop driver document and report them so that > they can be fixed instead of working around them by duplicating the code > (and in this case making the new code completely broken). Would Jens Axboe agree to merge the dm-loop logic into the existing loop driver? Dm-loop is significantly faster than the regular loop: # modprobe brd rd_size=1048576 # dd if=/dev/zero of=/dev/ram0 bs=1048576 # mkfs.ext4 /dev/ram0 # mount -t ext4 /dev/ram0 /mnt/test # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512 dm-loop (on /mnt/test/test): # dmsetup create loop --table '0 1048576 loop /mnt/test/test 0' # fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=16 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/dev/mapper/loop --name=job --rw=rw READ: bw=2428MiB/s (2546MB/s), 2428MiB/s-2428MiB/s (2546MB/s-2546MB/s), io=23.7GiB (25.5GB), run=10001-10001msec WRITE: bw=2429MiB/s (2547MB/s), 2429MiB/s-2429MiB/s (2547MB/s-2547MB/s), io=23.7GiB (25.5GB), run=10001-10001msec regular loop (on /mnt/test/test): # losetup /dev/loop0 /mnt/test/test # fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=16 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/dev/loop0 --name=job --rw=rw READ: bw=326MiB/s (342MB/s), 326MiB/s-326MiB/s (342MB/s-342MB/s), io=3259MiB (3417MB), run=10003-10003msec WRITE: bw=326MiB/s (342MB/s), 326MiB/s-326MiB/s (342MB/s-342MB/s), io=3260MiB (3418MB), run=10003-10003msec dm-loop is even slightly faster than running fio directly on the regular file: # fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=16 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test/test --name=job --rw=rw READ: bw=2005MiB/s (2103MB/s), 2005MiB/s-2005MiB/s (2103MB/s-2103MB/s), io=19.6GiB (21.0GB), run=10002-10002msec WRITE: bw=2007MiB/s (2104MB/s), 2007MiB/s-2007MiB/s (2104MB/s-2104MB/s), io=19.6GiB (21.0GB), run=10002-10002msec Mikulas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-03 16:16 ` Mikulas Patocka @ 2025-03-03 17:24 ` Christoph Hellwig 2025-03-03 21:03 ` Mikulas Patocka 0 siblings, 1 reply; 45+ messages in thread From: Christoph Hellwig @ 2025-03-03 17:24 UTC (permalink / raw) To: Mikulas Patocka Cc: Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > What should I use instead of bmap? Is fiemap exported for use in the > kernel? You can't do an ahead of time mapping. It's a broken concept. > Would Jens Axboe agree to merge the dm-loop logic into the existing loop > driver? What logic? > Dm-loop is significantly faster than the regular loop: > > # modprobe brd rd_size=1048576 > # dd if=/dev/zero of=/dev/ram0 bs=1048576 > # mkfs.ext4 /dev/ram0 > # mount -t ext4 /dev/ram0 /mnt/test > # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512 All of this needs to be in a commit log. Also note that the above: a) does not use direct I/O which any sane loop user should b) is not on a file but a block device, rendering the use of a loop device entirely pointless. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-03 17:24 ` Christoph Hellwig @ 2025-03-03 21:03 ` Mikulas Patocka 2025-03-04 2:13 ` Dave Chinner 2025-03-04 13:49 ` Christoph Hellwig 0 siblings, 2 replies; 45+ messages in thread From: Mikulas Patocka @ 2025-03-03 21:03 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Mon, 3 Mar 2025, Christoph Hellwig wrote: > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > > What should I use instead of bmap? Is fiemap exported for use in the > > kernel? > > You can't do an ahead of time mapping. It's a broken concept. Swapfile does ahead of time mapping. And I just looked at what swapfile does and copied the logic into dm-loop. If swapfile is not broken, how could dm-loop be broken? > > Would Jens Axboe agree to merge the dm-loop logic into the existing loop > > driver? > > What logic? The ahead-of-time mapping. > > Dm-loop is significantly faster than the regular loop: > > > > # modprobe brd rd_size=1048576 > > # dd if=/dev/zero of=/dev/ram0 bs=1048576 > > # mkfs.ext4 /dev/ram0 > > # mount -t ext4 /dev/ram0 /mnt/test > > # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512 > > All of this needs to be in a commit log. Also note that the above: > > a) does not use direct I/O which any sane loop user should > b) is not on a file but a block device, rendering the use of a loop > device entirely pointless. With "losetup --direct-io=on /dev/loop0 /mnt/test/test", it is even slower than without: READ: bw=217MiB/s (227MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s), io=2170MiB (2275MB), run=10003-10003msec WRITE: bw=217MiB/s (227MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s), io=2169MiB (2274MB), run=10003-10003msec with --direct-io=off READ: bw=398MiB/s (417MB/s), 398MiB/s-398MiB/s (417MB/s-417MB/s), io=3978MiB (4171MB), run=10003-10003msec WRITE: bw=398MiB/s (417MB/s), 398MiB/s-398MiB/s (417MB/s-417MB/s), io=3981MiB (4175MB), run=10003-10003msec Mikulas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-03 21:03 ` Mikulas Patocka @ 2025-03-04 2:13 ` Dave Chinner 2025-03-04 11:18 ` Mikulas Patocka 0 siblings, 1 reply; 45+ messages in thread From: Dave Chinner @ 2025-03-04 2:13 UTC (permalink / raw) To: Mikulas Patocka Cc: Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote: > > > On Mon, 3 Mar 2025, Christoph Hellwig wrote: > > > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > > > What should I use instead of bmap? Is fiemap exported for use in the > > > kernel? > > > > You can't do an ahead of time mapping. It's a broken concept. > > Swapfile does ahead of time mapping. And I just looked at what swapfile > does and copied the logic into dm-loop. If swapfile is not broken, how > could dm-loop be broken? Swap files cannot be accessed/modified by user code once the swapfile is activated. See all the IS_SWAPFILE() checks throughout the VFS and filesystem code. Swap files must be fully allocated (i.e. not sparse), nor contain shared extents. This is required so that writes to the swapfile do not require block allocation which would change the mapping... Hence we explicitly prevent modification of the underlying file mapping once a swapfile is owned and mapped by the kernel as a swapfile. That's not how loop devices/image files work - we actually rely on them being: a) sparse; and b) the mapping being mutable via direct access to the loop file whilst there is an active mounted filesystem on that loop file. and so every IO needs to be mapped through the filesystem at submission time. The reason for a) is obvious: we don't need to allocate space for the filesystem so it's effectively thin provisioned. Also, fstrim on the mounted loop device can punch out unused space in the mounted filesystem. The reason for b) is less obvious: snapshots via file cloning, deduplication via extent sharing. The clone operation is an atomic modification of the underlying file mapping, which then triggers COW on future writes to those mappings, which causes the mapping to change at write IO time. IOWs, the whole concept that there is a "static mapping" for a loop device image file for the life of the image file is fundamentally flawed. > > > Dm-loop is significantly faster than the regular loop: > > > > > > # modprobe brd rd_size=1048576 > > > # dd if=/dev/zero of=/dev/ram0 bs=1048576 > > > # mkfs.ext4 /dev/ram0 > > > # mount -t ext4 /dev/ram0 /mnt/test > > > # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512 Urk. ram disks are terrible for IO benchmarking. The IO is synchronous (i.e. always completes in the submitter context) and performance is -always CPU bound- due to the requirement for all data copying to be marshalled through the CPU. Please benchmark performance on NVMe SSDs - it will give a much more accurate demonstration of the performance differences we'll see in real world usage of loop device functionality... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 45+ messages in thread
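The swapfile pinning described above is enforced by explicit checks in the VFS; the pattern looks roughly like the following (a simplified sketch of the idea, not a specific kernel call site):

    /* Sketch of the guard used on layout-changing operations (fallocate,
     * truncate, clone, ...): an active swapfile's mapping may not change.
     * Loop image files have no equivalent flag, so their mapping can be
     * modified while a loop device is using them. */
    static int may_change_file_layout(struct inode *inode)
    {
            if (IS_SWAPFILE(inode))
                    return -ETXTBSY;
            return 0;
    }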
* Re: [PATCH] the dm-loop target 2025-03-04 2:13 ` Dave Chinner @ 2025-03-04 11:18 ` Mikulas Patocka 2025-03-04 13:50 ` Christoph Hellwig 2025-03-05 0:01 ` Dave Chinner 0 siblings, 2 replies; 45+ messages in thread From: Mikulas Patocka @ 2025-03-04 11:18 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Tue, 4 Mar 2025, Dave Chinner wrote: > On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote: > > > > > > On Mon, 3 Mar 2025, Christoph Hellwig wrote: > > > > > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > > > > What should I use instead of bmap? Is fiemap exported for use in the > > > > kernel? > > > > > > You can't do an ahead of time mapping. It's a broken concept. > > > > Swapfile does ahead of time mapping. And I just looked at what swapfile > > does and copied the logic into dm-loop. If swapfile is not broken, how > > could dm-loop be broken? > > Swap files cannot be accessed/modified by user code once the > swapfile is activated. See all the IS_SWAPFILE() checked throughout > the VFS and filesystem code. > > Swap files must be fully allocated (i.e. not sparse), nor contan > shared extents. This is required so that writes to the swapfile do > not require block allocation which would change the mapping... > > Hence we explicitly prevent modification of the underlying file > mapping once a swapfile is owned and mapped by the kernel as a > swapfile. > > That's not how loop devices/image files work - we actually rely on > them being: > > a) sparse; and > b) the mapping being mutable via direct access to the loop file > whilst there is an active mounted filesystem on that loop file. > > and so every IO needs to be mapped through the filesystem at > submission time. > > The reason for a) is obvious: we don't need to allocate space for > the filesystem so it's effectively thin provisioned. Also, fstrim on > the mounted loop device can punch out unused space in the mounted > filesytsem. > > The reason for b) is less obvious: snapshots via file cloning, > deduplication via extent sharing. > > The clone operaiton is an atomic modification of the underlying file > mapping, which then triggers COW on future writes to those mappings, > which causes the mapping to the change at write IO time. > > IOWs, the whole concept that there is a "static mapping" for a loop > device image file for the life of the image file is fundamentally > flawed. I'm not trying to break existing loop. But some users don't use COW filesystems, some users use fully provisioned files, some users don't need to write to a file when it is being mapped - and for them dm-loop would be viable alternative because of better performance. The Android people concluded that loop is too slow and rather than using loop they want to map a file using a table with dm-linear targets over the image of the host filesystem. So, they are already doing what dm-loop is doing. Mikulas ^ permalink raw reply [flat|nested] 45+ messages in thread
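For context, the Android-style setup mentioned above amounts to building an ordinary dm-linear table from the image file's extent list, along these lines (purely illustrative; the device name, extent offsets and lengths are invented numbers):

    # Hypothetical example: expose two contiguous extents of an image file
    # that lives on /dev/sda2 as a single linear device.  The sector
    # numbers are made up; in practice they would come from the file's
    # extent map.
    dmsetup create android-img <<'EOF'
    0 2097152 linear /dev/sda2 34816
    2097152 2097152 linear /dev/sda2 4229120
    EOF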
* Re: [PATCH] the dm-loop target 2025-03-04 11:18 ` Mikulas Patocka @ 2025-03-04 13:50 ` Christoph Hellwig 2025-03-05 0:01 ` Dave Chinner 1 sibling, 0 replies; 45+ messages in thread From: Christoph Hellwig @ 2025-03-04 13:50 UTC (permalink / raw) To: Mikulas Patocka Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Tue, Mar 04, 2025 at 12:18:04PM +0100, Mikulas Patocka wrote: > I'm not trying to break existing loop. You are f***cking breaking file system semantics. Stop it now. > The Android people concluded that loop is too slow and rather than using > loop they want to map a file using a table with dm-linear targets over the > image of the host filesystem. So, they are already doing what dm-loop is > doing. I've never seen bug reports from " The Android people". But maybe they just need to stop pointlessly using loop devices? ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-04 11:18 ` Mikulas Patocka 2025-03-04 13:50 ` Christoph Hellwig @ 2025-03-05 0:01 ` Dave Chinner 2025-03-07 15:21 ` Mikulas Patocka 1 sibling, 1 reply; 45+ messages in thread From: Dave Chinner @ 2025-03-05 0:01 UTC (permalink / raw) To: Mikulas Patocka Cc: Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Tue, Mar 04, 2025 at 12:18:04PM +0100, Mikulas Patocka wrote: > > > On Tue, 4 Mar 2025, Dave Chinner wrote: > > > On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote: > > > > > > > > > On Mon, 3 Mar 2025, Christoph Hellwig wrote: > > > > > > > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > > > > > What should I use instead of bmap? Is fiemap exported for use in the > > > > > kernel? > > > > > > > > You can't do an ahead of time mapping. It's a broken concept. > > > > > > Swapfile does ahead of time mapping. And I just looked at what swapfile > > > does and copied the logic into dm-loop. If swapfile is not broken, how > > > could dm-loop be broken? > > > > Swap files cannot be accessed/modified by user code once the > > swapfile is activated. See all the IS_SWAPFILE() checked throughout > > the VFS and filesystem code. > > > > Swap files must be fully allocated (i.e. not sparse), nor contan > > shared extents. This is required so that writes to the swapfile do > > not require block allocation which would change the mapping... > > > > Hence we explicitly prevent modification of the underlying file > > mapping once a swapfile is owned and mapped by the kernel as a > > swapfile. > > > > That's not how loop devices/image files work - we actually rely on > > them being: > > > > a) sparse; and > > b) the mapping being mutable via direct access to the loop file > > whilst there is an active mounted filesystem on that loop file. > > > > and so every IO needs to be mapped through the filesystem at > > submission time. > > > > The reason for a) is obvious: we don't need to allocate space for > > the filesystem so it's effectively thin provisioned. Also, fstrim on > > the mounted loop device can punch out unused space in the mounted > > filesytsem. > > > > The reason for b) is less obvious: snapshots via file cloning, > > deduplication via extent sharing. > > > > The clone operaiton is an atomic modification of the underlying file > > mapping, which then triggers COW on future writes to those mappings, > > which causes the mapping to the change at write IO time. > > > > IOWs, the whole concept that there is a "static mapping" for a loop > > device image file for the life of the image file is fundamentally > > flawed. > > I'm not trying to break existing loop. I didn't say you were. I said the concept that dm-loop is based on is fundamentally flawed and that your benchmark setup does not reflect real world usage of loop devices. > But some users don't use COW filesystems, some users use fully provisioned > files, some users don't need to write to a file when it is being mapped - > and for them dm-loop would be viable alternative because of better > performance. Nothing has changed since 2008 when this "fast file mapping" thing was first proposed and dm-loop made it's first appearance in this thread: https://lore.kernel.org/linux-fsdevel/20080109085231.GE6650@kernel.dk/ Let me quote Christoph's response to Jen's proposed static mapping for the loop device patch back in 2008: | And the way this is done is simply broken. 
It means you have to get | rid of things like delayed or unwritten extents beforehand, it'll be | a complete pain for COW or non-block backed filesystems. | | The right way to do this is to allow direct I/O from kernel sources | where the filesystem is in-charge of submitting the actual I/O after | the pages are handed to it. I think Peter Zijlstra has been looking | into something like that for swap over nfs. Jens also said this about dm-loop in that thread: } Why oh why does dm always insist to reinvent everything? That's bad } enough in itself, but on top of that most of the extra stuff ends up } being essentially unmaintained. } } If we instead improve loop, everyone wins. } } Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a bit } outside your own sandbox. You - personally - were also told directly by Jens back then that dm-loop's approach simply does not work for filesystems that move blocks around. i.e. it isn't a viable approach. Nothing has changed - it still isn't a viable approach for loopback devices for the same reasons it wasn't viable in 2008. > The Android people concluded that loop is too slow and rather than using > loop they want to map a file using a table with dm-linear targets over the > image of the host filesystem. So, they are already doing what dm-loop is > doing. I don't care if a downstream kernel is doing something stupid with their kernels. Where are the bug reports about the loop device being slow and the analysis that indicates that it is unfixable? The fact is that AIO+DIO through filesystems like XFS performs generally within 1-2% of the underlying block device capabilities. Hence if there's a problem with loop device performance, it isn't in the backing file IO submission path. Find out why loop device AIO+DIO is slow for the workload you are testing and fix that. This way everyone who already uses loop devices benefits (as Jens said in 2008), and the Android folk can get rid of their hacky mapping setup.... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 45+ messages in thread
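For reference, "the filesystem is in charge of submitting the actual I/O" is essentially what the loop driver's direct I/O path does today: wrap the request's pages in an iov_iter and call the backing file's ->write_iter/->read_iter. A much-simplified sketch of that idea (not the actual drivers/block/loop.c code; the function name is invented):

    /* Sketch: direct I/O write from a kernel source.  The filesystem does
     * the block mapping at submission time, so COW, delayed allocation and
     * extent changes are all handled by the fs, not by the block driver. */
    static ssize_t backing_file_dio_write(struct file *file, struct kiocb *iocb,
                                          struct bio_vec *bvec, unsigned int nr,
                                          size_t len, loff_t pos)
    {
            struct iov_iter iter;

            iov_iter_bvec(&iter, ITER_SOURCE, bvec, nr, len);
            init_sync_kiocb(iocb, file);
            iocb->ki_pos = pos;
            iocb->ki_flags |= IOCB_DIRECT;      /* bypass the page cache */
            /* the real driver also sets iocb->ki_complete so the write
             * completes asynchronously */
            return call_write_iter(file, iocb, &iter);
    }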
* Re: [PATCH] the dm-loop target 2025-03-05 0:01 ` Dave Chinner @ 2025-03-07 15:21 ` Mikulas Patocka 2025-03-08 3:49 ` Darrick J. Wong ` (2 more replies) 0 siblings, 3 replies; 45+ messages in thread From: Mikulas Patocka @ 2025-03-07 15:21 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel > I didn't say you were. I said the concept that dm-loop is based on > is fundamentally flawed and that your benchmark setup does not > reflect real world usage of loop devices. > Where are the bug reports about the loop device being slow and the > analysis that indicates that it is unfixable? So, I did benchmarks on an enterprise nvme drive (SAMSUNG MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow. synchronous I/O: fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw raw block device: READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec ext4/loop/ext4: READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec xfs/loop/xfs: READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec ext4/dm-loop/ext4: READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec xfs/dm-loop/xfs: READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec asynchronous I/O: fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw raw block device: READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec ext4/loop/ext4: READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec xfs/loop/xfs: READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec ext4/dm-loop/ext4: READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec xfs/dm-loop/xfs: READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec WRITE: bw=1210MiB/s (1269MB/s), 
1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec Mikulas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-07 15:21 ` Mikulas Patocka @ 2025-03-08 3:49 ` Darrick J. Wong 2025-03-08 20:45 ` Mikulas Patocka 2025-03-09 0:05 ` Ming Lei 2025-03-09 0:16 ` Ming Lei 2 siblings, 1 reply; 45+ messages in thread From: Darrick J. Wong @ 2025-03-08 3:49 UTC (permalink / raw) To: Mikulas Patocka Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote: > > I didn't say you were. I said the concept that dm-loop is based on > > is fundamentally flawed and that your benchmark setup does not > > reflect real world usage of loop devices. > > > Where are the bug reports about the loop device being slow and the > > analysis that indicates that it is unfixable? > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow. Are you running the loop device in directio mode? The default is to use buffered io, which wastes pagecache /and/ sometimes trips dirty limits throttling. The loopdev tests in fstests get noticeably faster if I force directio mode. --D > synchronous I/O: > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > raw block device: > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec > ext4/loop/ext4: > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec > xfs/loop/xfs: > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec > ext4/dm-loop/ext4: > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec > xfs/dm-loop/xfs: > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec > > asynchronous I/O: > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > raw block device: > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > ext4/loop/ext4: > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec > xfs/loop/xfs: > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), 
run=10002-10002msec > ext4/dm-loop/ext4: > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > xfs/dm-loop/xfs: > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec > > Mikulas > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-08 3:49 ` Darrick J. Wong @ 2025-03-08 20:45 ` Mikulas Patocka 0 siblings, 0 replies; 45+ messages in thread From: Mikulas Patocka @ 2025-03-08 20:45 UTC (permalink / raw) To: Darrick J. Wong Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Fri, 7 Mar 2025, Darrick J. Wong wrote: > On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote: > > > I didn't say you were. I said the concept that dm-loop is based on > > > is fundamentally flawed and that your benchmark setup does not > > > reflect real world usage of loop devices. > > > > > Where are the bug reports about the loop device being slow and the > > > analysis that indicates that it is unfixable? > > > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG > > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup > > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow. > > Are you running the loop device in directio mode? The default is to use > buffered io, which wastes pagecache /and/ sometimes trips dirty limits > throttling. The loopdev tests in fstests get noticeably faster if I > force directio mode. Yes, I am. I set it up with: losetup --direct-io=on /dev/loop0 /mnt/test/l mount -t xfs /dev/loop0 /mnt/test2 I double checked it and I got the same results. Mikulas > --D > > > synchronous I/O: > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > > raw block device: > > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec > > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec > > ext4/loop/ext4: > > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec > > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec > > xfs/loop/xfs: > > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec > > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec > > ext4/dm-loop/ext4: > > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec > > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec > > xfs/dm-loop/xfs: > > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec > > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec > > > > asynchronous I/O: > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > > raw block device: > > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > > ext4/loop/ext4: > > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec > > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec > > 
xfs/loop/xfs: > > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec > > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec > > ext4/dm-loop/ext4: > > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > > xfs/dm-loop/xfs: > > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec > > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec > > > > Mikulas ^ permalink raw reply [flat|nested] 45+ messages in thread
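One way to confirm that the direct I/O flag really took effect on a loop device (an aside on methodology; the DIO column of losetup and the sysfs attribute are the places to look, and exact output formatting varies by util-linux version):

    # losetup --list --output NAME,BACK-FILE,DIO
    # cat /sys/block/loop0/loop/dio       # prints 1 when direct I/O is in use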
* Re: [PATCH] the dm-loop target 2025-03-07 15:21 ` Mikulas Patocka 2025-03-08 3:49 ` Darrick J. Wong @ 2025-03-09 0:05 ` Ming Lei 2025-03-10 11:18 ` Mikulas Patocka 2025-03-09 0:16 ` Ming Lei 2 siblings, 1 reply; 45+ messages in thread From: Ming Lei @ 2025-03-09 0:05 UTC (permalink / raw) To: Mikulas Patocka Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote: > > I didn't say you were. I said the concept that dm-loop is based on > > is fundamentally flawed and that your benchmark setup does not > > reflect real world usage of loop devices. > > > Where are the bug reports about the loop device being slow and the > > analysis that indicates that it is unfixable? > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow. > > synchronous I/O: > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > raw block device: > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec > ext4/loop/ext4: > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec > xfs/loop/xfs: > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec > ext4/dm-loop/ext4: > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec > xfs/dm-loop/xfs: > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec > > asynchronous I/O: > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > raw block device: > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > ext4/loop/ext4: > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec > xfs/loop/xfs: > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec > ext4/dm-loop/ext4: > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB 
(12.5GB), run=10002-10002msec > xfs/dm-loop/xfs: > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec Hi Mikulas, Please try the following patchset: https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@redhat.com/ which tries to handle IO in current context directly via NOWAIT, and supports MQ for loop since 12 io jobs are applied in your test. With this change, I can observe similar perf data on raw block device and loop/xfs over mq-virtio-scsi & nvme in my test VM. 1) try single queue first by `modprobe loop` 2) then try MQ by 'modprobe loop nr_hw_queues=4' If it still doesn't work, please provide fio log for both `raw block device` and 'loop/xfs', which may provide some clue for the big perf gap. Thanks, Ming ^ permalink raw reply [flat|nested] 45+ messages in thread
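In outline, the series follows the same NOWAIT-first pattern io_uring uses for file I/O: try to issue the read/write in the caller's context with IOCB_NOWAIT and only bounce to a worker when that fails. A rough sketch of the shape of it (lo_rw_aio_nowait() is an invented helper name for illustration, not the actual patch):

    /* Sketch: submit in the current context when the fs can do it without
     * blocking; otherwise fall back to the existing workqueue path. */
    static blk_status_t loop_submit_cmd(struct loop_device *lo, struct loop_cmd *cmd)
    {
            int ret = lo_rw_aio_nowait(lo, cmd);        /* uses IOCB_NOWAIT */

            if (ret != -EAGAIN && ret != -EOPNOTSUPP)
                    return errno_to_blk_status(ret);

            loop_queue_work(lo, cmd);           /* may block, runs in a worker */
            return BLK_STS_OK;
    }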
* Re: [PATCH] the dm-loop target 2025-03-09 0:05 ` Ming Lei @ 2025-03-10 11:18 ` Mikulas Patocka 2025-03-11 1:27 ` Dave Chinner 0 siblings, 1 reply; 45+ messages in thread From: Mikulas Patocka @ 2025-03-10 11:18 UTC (permalink / raw) To: Ming Lei Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Sun, 9 Mar 2025, Ming Lei wrote: > On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote: > > > I didn't say you were. I said the concept that dm-loop is based on > > > is fundamentally flawed and that your benchmark setup does not > > > reflect real world usage of loop devices. > > > > > Where are the bug reports about the loop device being slow and the > > > analysis that indicates that it is unfixable? > > > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG > > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup > > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow. > > > > synchronous I/O: > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > > raw block device: > > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec > > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec > > ext4/loop/ext4: > > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec > > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec > > xfs/loop/xfs: > > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec > > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec > > ext4/dm-loop/ext4: > > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec > > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec > > xfs/dm-loop/xfs: > > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec > > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec > > > > asynchronous I/O: > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > > raw block device: > > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > > ext4/loop/ext4: > > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec > > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec > > xfs/loop/xfs: > > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec > > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec > > ext4/dm-loop/ext4: > > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > > WRITE: bw=1190MiB/s (1248MB/s), 
1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > > xfs/dm-loop/xfs: > > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec > > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec > > Hi Mikulas, > > Please try the following patchset: > > https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@redhat.com/ > > which tries to handle IO in current context directly via NOWAIT, and > supports MQ for loop since 12 io jobs are applied in your test. With this > change, I can observe similar perf data on raw block device and loop/xfs > over mq-virtio-scsi & nvme in my test VM. Yes - with these patches, it is much better. > 1) try single queue first by `modprobe loop` fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw xfs/loop/xfs READ: bw=302MiB/s (317MB/s), 302MiB/s-302MiB/s (317MB/s-317MB/s), io=3024MiB (3170MB), run=10001-10001msec WRITE: bw=303MiB/s (317MB/s), 303MiB/s-303MiB/s (317MB/s-317MB/s), io=3026MiB (3173MB), run=10001-10001msec fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw xfs/loop/xfs READ: bw=1055MiB/s (1106MB/s), 1055MiB/s-1055MiB/s (1106MB/s-1106MB/s), io=10.3GiB (11.1GB), run=10001-10001msec WRITE: bw=1056MiB/s (1107MB/s), 1056MiB/s-1056MiB/s (1107MB/s-1107MB/s), io=10.3GiB (11.1GB), run=10001-10001msec > 2) then try MQ by 'modprobe loop nr_hw_queues=4' fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw xfs/loop/xfs READ: bw=352MiB/s (370MB/s), 352MiB/s-352MiB/s (370MB/s-370MB/s), io=3525MiB (3696MB), run=10001-10001msec WRITE: bw=353MiB/s (370MB/s), 353MiB/s-353MiB/s (370MB/s-370MB/s), io=3527MiB (3698MB), run=10001-10001msec fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw xfs/loop/xfs READ: bw=1181MiB/s (1239MB/s), 1181MiB/s-1181MiB/s (1239MB/s-1239MB/s), io=11.5GiB (12.4GB), run=10001-10001msec WRITE: bw=1183MiB/s (1240MB/s), 1183MiB/s-1183MiB/s (1240MB/s-1240MB/s), io=11.5GiB (12.4GB), run=10001-10001msec > If it still doesn't work, please provide fio log for both `raw block > device` and 'loop/xfs', which may provide some clue for the big perf > gap. > > > > Thanks, > Ming Mikulas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-10 11:18 ` Mikulas Patocka @ 2025-03-11 1:27 ` Dave Chinner 2025-03-11 10:43 ` Ming Lei 0 siblings, 1 reply; 45+ messages in thread From: Dave Chinner @ 2025-03-11 1:27 UTC (permalink / raw) To: Mikulas Patocka Cc: Ming Lei, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Mon, Mar 10, 2025 at 12:18:51PM +0100, Mikulas Patocka wrote: > > > On Sun, 9 Mar 2025, Ming Lei wrote: > > > On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote: > > > > I didn't say you were. I said the concept that dm-loop is based on > > > > is fundamentally flawed and that your benchmark setup does not > > > > reflect real world usage of loop devices. > > > > > > > Where are the bug reports about the loop device being slow and the > > > > analysis that indicates that it is unfixable? > > > > > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG > > > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup > > > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow. > > > > > > synchronous I/O: > > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > > > raw block device: > > > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec > > > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec > > > ext4/loop/ext4: > > > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec > > > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec > > > xfs/loop/xfs: > > > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec > > > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec > > > ext4/dm-loop/ext4: > > > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec > > > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec > > > xfs/dm-loop/xfs: > > > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec > > > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec > > > > > > asynchronous I/O: > > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > > > raw block device: > > > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > > > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > > > ext4/loop/ext4: > > > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec > > > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec > > > xfs/loop/xfs: > > > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec > > > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec > > > 
ext4/dm-loop/ext4: > > > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > > > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > > > xfs/dm-loop/xfs: > > > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec > > > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec > > > > Hi Mikulas, > > > > Please try the following patchset: > > > > https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@redhat.com/ > > > > which tries to handle IO in current context directly via NOWAIT, and > > supports MQ for loop since 12 io jobs are applied in your test. With this > > change, I can observe similar perf data on raw block device and loop/xfs > > over mq-virtio-scsi & nvme in my test VM. I'm not sure RWF_NOWAIT is a workable solution. Why? IO submission is queued to a different thread context because to avoid a potential deadlock. That is, we are operating here in the writeback context of another filesystem, and so we cannot do things like depend on memory allocation making forwards progress for IO submission. RWF_NOWAIT is not a guarantee that memory allocation will not occur in the IO submission path - it is implemented as best effort non-blocking behaviour. Further, if we have stacked loop devices (e.g. xfs-loop-ext4-loop-btrfs-loop-xfs) we can will be stacking RWF_NOWAIT IO submission contexts through multiple filesystems. This is not a filesystem IO path I want to support - being in the middle of such a stack creates all sorts of subtle constraints on behaviour that otherwise wouldn't exist. We actually do this sort of multi-fs stacking in fstests, so it's not a made up scenario. I'm also concerned that RWF_NOWAIT submission is not an optimisation at all for the common sparse/COW image file case, because in this case RWF_NOWAIT failing with EAGAIN is just as common (if not moreso) than it succeeding. i.e. in this case, RWF_NOWAIT writes will fail with -EAGAIN very frequently, so all that time spent doing IO submission is wasted time. Further, because allocation on write requires an exclusive lock and it is held for some time, this will affect read performance from the backing device as well. i.e. block mapping during a read while a write is doing allocation will also return EAGAIN for RWF_NOWAIT. This will push the read off to the background worker thread to be serviced and so that will go much slower than a RWF_NOWAIT read that hits the backing file between writes doing allocation. i.e. read latency is going to become much, much more variable. Hence I suspect RWF_NOWAIT is simply hiding the underlying issue by providing this benchmark with a "pure overwrite" fast path that avoids the overhead of queueing work through loop_queue_work().... Can you run these same loop dev tests using a sparse image file and a sparse fio test file so that the fio benchmark measures the impact of loop device block allocation on the test? I suspect the results with the RWF_NOWAIT patch will be somewhat different to the fully allocated case... > > Yes - with these patches, it is much better. 
> > > 1) try single queue first by `modprobe loop` > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > xfs/loop/xfs > READ: bw=302MiB/s (317MB/s), 302MiB/s-302MiB/s (317MB/s-317MB/s), io=3024MiB (3170MB), run=10001-10001msec > WRITE: bw=303MiB/s (317MB/s), 303MiB/s-303MiB/s (317MB/s-317MB/s), io=3026MiB (3173MB), run=10001-10001msec > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > xfs/loop/xfs > READ: bw=1055MiB/s (1106MB/s), 1055MiB/s-1055MiB/s (1106MB/s-1106MB/s), io=10.3GiB (11.1GB), run=10001-10001msec > WRITE: bw=1056MiB/s (1107MB/s), 1056MiB/s-1056MiB/s (1107MB/s-1107MB/s), io=10.3GiB (11.1GB), run=10001-10001msec Yup, this sort of difference in performance simply from bypassing loop_queue_work() implies the problem is the single threaded loop device queue implementation needs to be fixed. loop_queue_work() { .... spin_lock_irq(&lo->lo_work_lock); .... } else { work = &lo->rootcg_work; cmd_list = &lo->rootcg_cmd_list; } list_add_tail(&cmd->list_entry, cmd_list); queue_work(lo->workqueue, work); spin_unlock_irq(&lo->lo_work_lock); } Not only does every IO that is queued takes this queue lock, there is only one work instance for the loop device. Therefore there is only one kworker process per control group that does IO submission for the loop device. And that kworker thread also takes the work lock to do dequeue as well. That serialised queue with a single IO dispatcher thread looks to be the performance bottleneck to me. We could get rid of the lock by using a llist for this multi-producer/single consumer cmd list pattern, though I suspect we can get rid of the list entirely... i.e. we have a work queue that can run a thousand concurrent works for this loop device, but the request queue is depth limited to 128 requests. hence we can have a full set of requests in flight and not run out of submission worker concurrency. There's no need to isolate IO from different cgroups in this situation - they will not get blocked behind IO submission from a different cgroup that is throttled... i.e. the cmd->list_entry list_head could be replaced with a struct work_struct and that whole cmd list management and cgroup scheduling thing could be replaced with a single call to queue_work(cmd->io_work). i.e. the single point that all IO submission is directed through goes away completely. -Dave -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 45+ messages in thread
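In code form, that suggestion could look something like this (an illustrative sketch only; loop_handle_cmd_work() is an invented worker function name and error handling is omitted):

    /* Sketch: per-command work item instead of a shared, locked cmd list
     * drained by a single worker. */
    struct loop_cmd {
            struct work_struct io_work;     /* replaces cmd->list_entry */
            /* ... existing fields ... */
    };

    static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
                                      const struct blk_mq_queue_data *bd)
    {
            struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
            struct loop_device *lo = bd->rq->q->queuedata;

            blk_mq_start_request(bd->rq);
            INIT_WORK(&cmd->io_work, loop_handle_cmd_work);
            queue_work(lo->workqueue, &cmd->io_work);   /* no shared lock or list */
            return BLK_STS_OK;
    }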
* Re: [PATCH] the dm-loop target 2025-03-11 1:27 ` Dave Chinner @ 2025-03-11 10:43 ` Ming Lei 2025-03-12 2:34 ` Dave Chinner 0 siblings, 1 reply; 45+ messages in thread From: Ming Lei @ 2025-03-11 10:43 UTC (permalink / raw) To: Dave Chinner Cc: Mikulas Patocka, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring On Tue, Mar 11, 2025 at 12:27:23PM +1100, Dave Chinner wrote: > On Mon, Mar 10, 2025 at 12:18:51PM +0100, Mikulas Patocka wrote: > > > > > > On Sun, 9 Mar 2025, Ming Lei wrote: > > > > > On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote: > > > > > I didn't say you were. I said the concept that dm-loop is based on > > > > > is fundamentally flawed and that your benchmark setup does not > > > > > reflect real world usage of loop devices. > > > > > > > > > Where are the bug reports about the loop device being slow and the > > > > > analysis that indicates that it is unfixable? > > > > > > > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG > > > > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup > > > > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow. > > > > > > > > synchronous I/O: > > > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > > > > raw block device: > > > > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec > > > > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec > > > > ext4/loop/ext4: > > > > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec > > > > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec > > > > xfs/loop/xfs: > > > > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec > > > > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec > > > > ext4/dm-loop/ext4: > > > > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec > > > > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec > > > > xfs/dm-loop/xfs: > > > > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec > > > > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec > > > > > > > > asynchronous I/O: > > > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > > > > raw block device: > > > > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > > > > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > > > > ext4/loop/ext4: > > > > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec > > > > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec > > > > xfs/loop/xfs: > > > > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB 
(2896MB), run=10002-10002msec > > > > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec > > > > ext4/dm-loop/ext4: > > > > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > > > > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > > > > xfs/dm-loop/xfs: > > > > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec > > > > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec > > > > > > Hi Mikulas, > > > > > > Please try the following patchset: > > > > > > https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@redhat.com/ > > > > > > which tries to handle IO in current context directly via NOWAIT, and > > > supports MQ for loop since 12 io jobs are applied in your test. With this > > > change, I can observe similar perf data on raw block device and loop/xfs > > > over mq-virtio-scsi & nvme in my test VM. > > I'm not sure RWF_NOWAIT is a workable solution. > > Why? It is just the sane implementation of Mikulas's static mapping approach: no need to move to workqueue if the mapping is immutable or sort of. Also it matches with io_uring's FS read/write implementation: - try to submit IO with NOWAIT first - then fallback to io-wq in case of -EAGAIN It isn't perfect, sometime it may be slower than running on io-wq directly. But is there any better way for covering everything? I guess no, because FS can't tell us when the IO can be submitted successfully via NOWAIT, and we can't know if it may succeed without trying. And basically that is what the interface is designed. > > IO submission is queued to a different thread context because to > avoid a potential deadlock. That is, we are operating here in the > writeback context of another filesystem, and so we cannot do things > like depend on memory allocation making forwards progress for IO > submission. RWF_NOWAIT is not a guarantee that memory allocation > will not occur in the IO submission path - it is implemented as best > effort non-blocking behaviour. Yes, that is why BLK_MQ_F_BLOCKING is added. > > Further, if we have stacked loop devices (e.g. > xfs-loop-ext4-loop-btrfs-loop-xfs) we can will be stacking > RWF_NOWAIT IO submission contexts through multiple filesystems. This > is not a filesystem IO path I want to support - being in the middle > of such a stack creates all sorts of subtle constraints on behaviour > that otherwise wouldn't exist. We actually do this sort of multi-fs > stacking in fstests, so it's not a made up scenario. > > I'm also concerned that RWF_NOWAIT submission is not an optimisation > at all for the common sparse/COW image file case, because in this > case RWF_NOWAIT failing with EAGAIN is just as common (if not > moreso) than it succeeding. > > i.e. in this case, RWF_NOWAIT writes will fail with -EAGAIN very > frequently, so all that time spent doing IO submission is wasted > time. Right. I saw that when I wrote ublk/zoned in which every write needs to allocate space. It is workaround by preallocating space for one fixed range or the whole zone. > > Further, because allocation on write requires an exclusive lock and > it is held for some time, this will affect read performance from the > backing device as well. i.e. 
block mapping during a read while a > write is doing allocation will also return EAGAIN for RWF_NOWAIT. But that can't be avoided without using NOWAIT, and read is blocked when write(WAIT) is in-progress. > This will push the read off to the background worker thread to be > serviced and so that will go much slower than a RWF_NOWAIT read that > hits the backing file between writes doing allocation. i.e. read > latency is going to become much, much more variable. > > Hence I suspect RWF_NOWAIT is simply hiding the underlying issue > by providing this benchmark with a "pure overwrite" fast path that > avoids the overhead of queueing work through loop_queue_work().... > > Can you run these same loop dev tests using a sparse image file and > a sparse fio test file so that the fio benchmark measures the impact > of loop device block allocation on the test? I suspect the results > with the RWF_NOWAIT patch will be somewhat different to the fully > allocated case... Yes, it will be slower, and io_uring FS IO application is in the same situation, and usually application doesn't have such knowledge if RWF_NOWAIT can succeed. However, usually meta IO is much less compared with normal IO, so most times it will be a win to try NOWAIT first. > > > > > Yes - with these patches, it is much better. > > > > > 1) try single queue first by `modprobe loop` > > > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > > xfs/loop/xfs > > READ: bw=302MiB/s (317MB/s), 302MiB/s-302MiB/s (317MB/s-317MB/s), io=3024MiB (3170MB), run=10001-10001msec > > WRITE: bw=303MiB/s (317MB/s), 303MiB/s-303MiB/s (317MB/s-317MB/s), io=3026MiB (3173MB), run=10001-10001msec > > > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > > xfs/loop/xfs > > READ: bw=1055MiB/s (1106MB/s), 1055MiB/s-1055MiB/s (1106MB/s-1106MB/s), io=10.3GiB (11.1GB), run=10001-10001msec > > WRITE: bw=1056MiB/s (1107MB/s), 1056MiB/s-1056MiB/s (1107MB/s-1107MB/s), io=10.3GiB (11.1GB), run=10001-10001msec > > Yup, this sort of difference in performance simply from bypassing > loop_queue_work() implies the problem is the single threaded loop > device queue implementation needs to be fixed. > > loop_queue_work() > { > .... > spin_lock_irq(&lo->lo_work_lock); > .... > > } else { > work = &lo->rootcg_work; > cmd_list = &lo->rootcg_cmd_list; > } > list_add_tail(&cmd->list_entry, cmd_list); > queue_work(lo->workqueue, work); > spin_unlock_irq(&lo->lo_work_lock); > } > > Not only does every IO that is queued takes this queue lock, there > is only one work instance for the loop device. Therefore there is > only one kworker process per control group that does IO submission > for the loop device. And that kworker thread also takes the work > lock to do dequeue as well. > > That serialised queue with a single IO dispatcher thread looks to be > the performance bottleneck to me. We could get rid of the lock by > using a llist for this multi-producer/single consumer cmd list > pattern, though I suspect we can get rid of the list entirely... > > i.e. we have a work queue that can run a > thousand concurrent works for this loop device, but the request > queue is depth limited to 128 requests. hence we can have a full > set of requests in flight and not run out of submission worker > concurrency. 
There's no need to isolate IO from different cgroups in > this situation - they will not get blocked behind IO submission > from a different cgroup that is throttled... > > i.e. the cmd->list_entry list_head could be replaced with a struct > work_struct and that whole cmd list management and cgroup scheduling > thing could be replaced with a single call to > queue_work(cmd->io_work). i.e. the single point that all IO Then there will be a lot of FS write contention, :-) > submission is directed through goes away completely. It has been shown many times that AIO submitted from a single context, or far fewer contexts, is much more efficient than running IO concurrently from multiple contexts, especially for sequential IO. Please see the recent example of zloop vs. ublk/zoned: https://lore.kernel.org/linux-block/d5e59531-c19b-4332-8f47-b380ab9678be@kernel.org/ When zloop uses a single dispatcher just like the in-tree loop, sequential WRITE performs much better than the initial version of ublk/zoned, which just handles every IO in its own io-wq (one NOWAIT attempt returning -EAGAIN, then one fallback to io-wq). I tried to submit FS WRITEs via io-wq directly and it becomes what you suggested; the improvement is small, and still much worse than zloop's single dispatcher. Later, when I switched to pre-allocating space for each zone or one fixed range, each write was submitted with NOWAIT successfully, and the sequential write perf improved a lot: https://lore.kernel.org/linux-block/Z6QrceGGAJl_X_BM@fedora/ Thanks, Ming ^ permalink raw reply [flat|nested] 45+ messages in thread
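The NOWAIT-first, fall-back-on-EAGAIN pattern Ming describes above looks like this from a caller's point of view. The sketch below is a minimal userspace illustration using pwritev2(), where defer_to_worker() is a hypothetical helper standing in for whatever context performs the blocking submission (io-wq in io_uring, a workqueue in loop); the kernel-side loop patches set IOCB_NOWAIT on the in-kernel iocb instead of using RWF_NOWAIT, but the control flow is the same.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <errno.h>

/* hypothetical helper: hand the IO to a worker for blocking submission */
void defer_to_worker(int fd, const struct iovec *iov, int cnt, off_t off);

/*
 * Try a non-blocking write first; if the filesystem would block
 * (allocation, lock contention), fall back to a worker context.
 */
static ssize_t submit_write(int fd, const struct iovec *iov, int cnt, off_t off)
{
	ssize_t ret = pwritev2(fd, iov, cnt, off, RWF_NOWAIT);

	if (ret >= 0)
		return ret;			/* fast path, nothing blocked */
	if (errno != EAGAIN && errno != EOPNOTSUPP)
		return ret;			/* real error */

	defer_to_worker(fd, iov, cnt, off);	/* slow path */
	return 0;
}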
* Re: [PATCH] the dm-loop target 2025-03-11 10:43 ` Ming Lei @ 2025-03-12 2:34 ` Dave Chinner 2025-03-12 6:24 ` Christoph Hellwig 2025-03-12 8:26 ` Ming Lei 0 siblings, 2 replies; 45+ messages in thread From: Dave Chinner @ 2025-03-12 2:34 UTC (permalink / raw) To: Ming Lei Cc: Mikulas Patocka, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring On Tue, Mar 11, 2025 at 06:43:22PM +0800, Ming Lei wrote: > On Tue, Mar 11, 2025 at 12:27:23PM +1100, Dave Chinner wrote: > > On Mon, Mar 10, 2025 at 12:18:51PM +0100, Mikulas Patocka wrote: > > > On Sun, 9 Mar 2025, Ming Lei wrote: > > > > Please try the following patchset: > > > > > > > > https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@redhat.com/ > > > > > > > > which tries to handle IO in current context directly via NOWAIT, and > > > > supports MQ for loop since 12 io jobs are applied in your test. With this > > > > change, I can observe similar perf data on raw block device and loop/xfs > > > > over mq-virtio-scsi & nvme in my test VM. > > > > I'm not sure RWF_NOWAIT is a workable solution. > > > > Why? > > It is just the sane implementation of Mikulas's static mapping > approach: no need to move to workqueue if the mapping is immutable > or sort of. > > Also it matches with io_uring's FS read/write implementation: > > - try to submit IO with NOWAIT first > - then fallback to io-wq in case of -EAGAIN No, it doesn't match what io_uring does. Yes, the NOWAIT bit does, but the work queue implementation in the loop device is completely different to the way io_uring dispatches work. That is, io_uring dispatches one IO per worker thread context so they can all run in parallel down through the filesystem. The loop device has a single worker thread so it -serialises- IO submission to the filesystem. i.e. blocking a single IO submission in the loop worker blocks all IO submission, whereas io_uring submits all IO independently so they only block if serialisation occurs further down the stack. > It isn't perfect, sometime it may be slower than running on io-wq > directly. > > But is there any better way for covering everything? Yes - fix the loop queue workers. > I guess no, because FS can't tell us when the IO can be submitted > successfully via NOWAIT, and we can't know if it may succeed without > trying. And basically that is what the interface is designed. Wrong. Speculative non-blocking IO like NOWAIT is the wrong optimisation to make for workloads that are very likely to block in the IO path. It just adds overhead without adding any improvement in performance. Getting rid of the serialised IO submission problems that the loop device currently has will benefit *all* workloads that use the loop device, not just those that are fully allocated. Yes, it won't quite show the same performance as NOWAIT in that case, but it still should give 90-95% of native performance for the static file case. And it should also improve all the other cases, too, because now they will only serialise when the backing file needs IO operations to serialise (i.e. during allocation). > > IO submission is queued to a different thread context to > > avoid a potential deadlock. That is, we are operating here in the > > writeback context of another filesystem, and so we cannot do things > > like depend on memory allocation making forwards progress for IO > > submission. 
RWF_NOWAIT is not a guarantee that memory allocation > > will not occur in the IO submission path - it is implemented as best > > effort non-blocking behaviour. > > Yes, that is why BLK_MQ_F_BLOCKING is added. > > > > > Further, if we have stacked loop devices (e.g. > > xfs-loop-ext4-loop-btrfs-loop-xfs) we can will be stacking > > RWF_NOWAIT IO submission contexts through multiple filesystems. This > > is not a filesystem IO path I want to support - being in the middle > > of such a stack creates all sorts of subtle constraints on behaviour > > that otherwise wouldn't exist. We actually do this sort of multi-fs > > stacking in fstests, so it's not a made up scenario. > > > > I'm also concerned that RWF_NOWAIT submission is not an optimisation > > at all for the common sparse/COW image file case, because in this > > case RWF_NOWAIT failing with EAGAIN is just as common (if not > > moreso) than it succeeding. > > > > i.e. in this case, RWF_NOWAIT writes will fail with -EAGAIN very > > frequently, so all that time spent doing IO submission is wasted > > time. > > Right. > > I saw that when I wrote ublk/zoned in which every write needs to > allocate space. It is workaround by preallocating space for one fixed > range or the whole zone. *cough* You do realise that fallocate() serialises all IO on the file? i.e. not only does it block all IO submission, mmap IO and other metadata operations, it also waits for all IO in flight to complete and it doesn't allow any IO to restart until the preallocation is complete. i.e. preallocation creates a huge IO submission latency bubble in the IO path. Hence preallocation is something that should be avoided at runtime if at all possible. If you have problems with allocation overhead during IO, then we have things like extent size hints that allow the per-IO allocation to be made much larger than the IO itself. This effectively does preallocation during IO without the IO pipeline bubbles that preallocation requires to function correctly.... > > Further, because allocation on write requires an exclusive lock and > > it is held for some time, this will affect read performance from the > > backing device as well. i.e. block mapping during a read while a > > write is doing allocation will also return EAGAIN for RWF_NOWAIT. > > But that can't be avoided without using NOWAIT, and read is blocked > when write(WAIT) is in-progress. If the loop worker dispatches each IO in it's own task context, then we don't care if a read IO blocks on some other write in progress. It doesn't get in the way of any other IO submission... > > This will push the read off to the background worker thread to be > > serviced and so that will go much slower than a RWF_NOWAIT read that > > hits the backing file between writes doing allocation. i.e. read > > latency is going to become much, much more variable. > > > > Hence I suspect RWF_NOWAIT is simply hiding the underlying issue > > by providing this benchmark with a "pure overwrite" fast path that > > avoids the overhead of queueing work through loop_queue_work().... > > > > Can you run these same loop dev tests using a sparse image file and > > a sparse fio test file so that the fio benchmark measures the impact > > of loop device block allocation on the test? I suspect the results > > with the RWF_NOWAIT patch will be somewhat different to the fully > > allocated case... > > Yes, it will be slower, and io_uring FS IO application is in > the same situation, and usually application doesn't have such > knowledge if RWF_NOWAIT can succeed. 
And that's my point: nothing above the filesystem will have - or can have - any knowledge of whether NOWAIT IO will succeed or not. Therefore we have to use our brains to analyse what the typical behaviour of the filesystem will be for a given situation to determine how best to optimise IO submission. > However, usually meta IO is much less compared with normal IO, so most > times it will be a win to try NOWAIT first. Data vs metadata from the upper filesystem doesn't even enter into the equation here. Filesystems like XFS, bcachefs and btrfs all dynamically create, move and remove metadata, so metadata writes are as much of an allocation problem for sparse loop backing files as write-once user data IO is. > > Yup, this sort of difference in performance simply from bypassing > > loop_queue_work() implies the problem is the single threaded loop > > device queue implementation needs to be fixed. > > > > loop_queue_work() > > { > > .... > > spin_lock_irq(&lo->lo_work_lock); > > .... > > > > } else { > > work = &lo->rootcg_work; > > cmd_list = &lo->rootcg_cmd_list; > > } > > list_add_tail(&cmd->list_entry, cmd_list); > > queue_work(lo->workqueue, work); > > spin_unlock_irq(&lo->lo_work_lock); > > } > > > > Not only does every IO that is queued takes this queue lock, there > > is only one work instance for the loop device. Therefore there is > > only one kworker process per control group that does IO submission > > for the loop device. And that kworker thread also takes the work > > lock to do dequeue as well. > > > > That serialised queue with a single IO dispatcher thread looks to be > > the performance bottleneck to me. We could get rid of the lock by > > using a llist for this multi-producer/single consumer cmd list > > pattern, though I suspect we can get rid of the list entirely... > > > > i.e. we have a work queue that can run a > > thousand concurrent works for this loop device, but the request > > queue is depth limited to 128 requests. hence we can have a full > > set of requests in flight and not run out of submission worker > > concurrency. There's no need to isolate IO from different cgroups in > > this situation - they will not get blocked behind IO submission > > from a different cgroup that is throttled... > > > > i.e. the cmd->list_entry list_head could be replaced with a struct > > work_struct and that whole cmd list management and cgroup scheduling > > thing could be replaced with a single call to > > queue_work(cmd->io_work). i.e. the single point that all IO > > Then there will be many FS write contention, :-) Did you think about how much parallelism your NOWAIT patches are directing at the loop device backing file before making that comment? The fio test that was run had 12 jobs running, there is at least 12-way IO concurrency hitting the backing file directly via the NOWAIT path. Yup, the NOWAIT path exposes the filesystem directly to userspace write IO concurrency. Yet the filesystem IO scaled to near raw device speed.... It should be obvious that adding more IO concurrency to loop device IO submission isn't a problem for the XFS IO path. And if it proves to be an issue, then that's an XFS problem to solve, not a generic device scalability issue. > > submission is directed through goes away completely. > > It has been shown many times that AIO submitted from single or much less > contexts is much more efficient than running IO concurrently from multiple > contexts, especially for sequential IO. 
Yes, I know that, and I'm not disputing that this is the most optimal way to submit IO *when it doesn't require blocking to dispatch*. However, the fact is this benefit only exists when all IO can be submitted in a non-blocking manner. As soon as blocking occurs in IO submission, then AIO submission either becomes synchronous or is aborted with EAGAIN (i.e. NOWAIT IO). If we use blocking submission behaviour, then performance tanks. This is what the loop device does right now. If we use non-blocking submission and get -EAGAIN, then we still need to use the slightly less efficient method of per-IO task contexts to scale out blocking IO submission. With that in mind, we must accept that loop device IO *frequently needs to block* and so non-blocking IO is the wrong behaviour to optimise *exclusively* for. Hence we need to first make the loop device scale with per-IO task contexts as this will improve both IO that can be dispatched without blocking as well as IO that will block during dispatch. If this does not bring performance up to acceptable levels, then other optimisations can then be considered. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 45+ messages in thread
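The extent size hints Dave refers to above are a per-file attribute; on XFS they can be set with xfs_io or through the FS_IOC_FSSETXATTR ioctl before the image file gains data extents. A minimal sketch follows, with error handling trimmed and a 16MiB hint chosen arbitrarily; other filesystems may ignore or reject the flag.

#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Set an extent size hint so each allocating write reserves a larger
 * extent. On XFS the hint has to be applied before the file has data
 * extents, i.e. on a freshly created image file.
 */
static int set_extsize_hint(int fd, unsigned int extsize_bytes)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;
	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
	fsx.fsx_extsize = extsize_bytes;	/* e.g. 16 * 1024 * 1024 */
	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}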
* Re: [PATCH] the dm-loop target 2025-03-12 2:34 ` Dave Chinner @ 2025-03-12 6:24 ` Christoph Hellwig 2025-03-12 8:26 ` Ming Lei 1 sibling, 0 replies; 45+ messages in thread From: Christoph Hellwig @ 2025-03-12 6:24 UTC (permalink / raw) To: Dave Chinner Cc: Ming Lei, Mikulas Patocka, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring On Wed, Mar 12, 2025 at 01:34:02PM +1100, Dave Chinner wrote: > Wrong. > > Speculative non-blocking IO like NOWAIT is the wrong optimisation to > make for workloads that are very likely to block in the IO path. It > just adds overhead without adding any improvement in performance. Note that I suspect that most or at least many loop workloads are read-heavy. And at least for reads NOWAIT makes perfect sense. > Getting rid of the serialised IO submission problems that the loop > device current has will benefit *all* workloads that use the loop > device, not just those that are fully allocated. Yes, it won't quite > show the same performance as NOWAIT in that case, but it still > should give 90-95% of native performance for the static file case. > And it should also improve all the other cases, too, because now > they will only serialise when the backing file needs IO operations to > serialise (i.e. during allocation). And I agree that this should be a first step. > *cough* The whole ublk-zoned is a bit of a bullshit thing where Ming wrote up something that barely works to block inclusion of the zloop driver we really need for zoned xfs testing. Please don't take it serious. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-12 2:34 ` Dave Chinner 2025-03-12 6:24 ` Christoph Hellwig @ 2025-03-12 8:26 ` Ming Lei 2025-03-13 1:36 ` Ming Lei 2025-03-13 16:36 ` Mikulas Patocka 1 sibling, 2 replies; 45+ messages in thread From: Ming Lei @ 2025-03-12 8:26 UTC (permalink / raw) To: Dave Chinner Cc: Mikulas Patocka, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring [-- Attachment #1: Type: text/plain, Size: 9748 bytes --] On Wed, Mar 12, 2025 at 01:34:02PM +1100, Dave Chinner wrote: > On Tue, Mar 11, 2025 at 06:43:22PM +0800, Ming Lei wrote: > > On Tue, Mar 11, 2025 at 12:27:23PM +1100, Dave Chinner wrote: > > > On Mon, Mar 10, 2025 at 12:18:51PM +0100, Mikulas Patocka wrote: > > > > On Sun, 9 Mar 2025, Ming Lei wrote: > > > > > Please try the following patchset: > > > > > > > > > > https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@redhat.com/ > > > > > > > > > > which tries to handle IO in current context directly via NOWAIT, and > > > > > supports MQ for loop since 12 io jobs are applied in your test. With this > > > > > change, I can observe similar perf data on raw block device and loop/xfs > > > > > over mq-virtio-scsi & nvme in my test VM. > > > > > > I'm not sure RWF_NOWAIT is a workable solution. > > > > > > Why? > > > > It is just the sane implementation of Mikulas's static mapping > > approach: no need to move to workqueue if the mapping is immutable > > or sort of. > > > > Also it matches with io_uring's FS read/write implementation: > > > > - try to submit IO with NOWAIT first > > - then fallback to io-wq in case of -EAGAIN > > No, it doesn't match what io_uring does. Yes, the NOWAIT bit does, > but the work queue implementation in the loop device is > completely different to the way io_uring dispatches work. > > That is, io_uring dispatches one IO per worker thread context so > they can all run in parallel down through the filesystem. The loop > device has a single worker thread so it -serialises- IO submission > to the filesystem. > > i.e. blocking a single IO submission in the loop worker blocks all > IO submission, whereas io_uring submits all IO independently so they > only block if serialisation occurs further down the stack. The block layer/storage stack has many optimizations for batch handling; if IOs are submitted from many contexts: - this batch handling optimization is gone - IO is re-ordered from the underlying hardware's viewpoint - there is more contention on the FS write lock, because loop has a single backing file. That is why the single task context is taken from the beginning of loop aio, and it performs pretty well for sequential IO workloads, as I showed in the zloop example. > > > It isn't perfect, sometime it may be slower than running on io-wq > > > directly. > > > > > > But is there any better way for covering everything? > > > > Yes - fix the loop queue workers. What you suggested is threaded aio, submitting IO concurrently from different task contexts; this is not the most efficient way, otherwise modern languages wouldn't have invented async/.await. 
In my test VM, by running Mikulas's fio script on loop/nvme by the attached threaded_aio patch: NOWAIT with MQ 4 : 70K iops(read), 70K iops(write), cpu util: 40% threaded_aio with MQ 4 : 64k iops(read), 64K iops(write), cpu util: 52% in tree loop(SQ) : 58K iops(read), 58K iops(write) Mikulas, please feel free to run your tests with threaded_aio: modprobe loop nr_hw_queues=4 threaded_aio=1 by applying the attached the patch over the loop patchset. The performance gap could be more obvious in fast hardware. > > > I guess no, because FS can't tell us when the IO can be submitted > > successfully via NOWAIT, and we can't know if it may succeed without > > trying. And basically that is what the interface is designed. > > Wrong. > > Speculative non-blocking IO like NOWAIT is the wrong optimisation to > make for workloads that are very likely to block in the IO path. It > just adds overhead without adding any improvement in performance. Mikulas's ->bmap() implies that there isn't block in IO path too. And many io_uring application shouldn't have idea if the IO will block too, but they still use NOWAIT... > > Getting rid of the serialised IO submission problems that the loop > device current has will benefit *all* workloads that use the loop > device, not just those that are fully allocated. Yes, it won't quite > show the same performance as NOWAIT in that case, but it still > should give 90-95% of native performance for the static file case. > And it should also improve all the other cases, too, because now > they will only serialise when the backing file needs IO operations to > serialise (i.e. during allocation). The above test shows that concurrent submission isn't better than NOWAIT, and my ublk/zoned examples showed the point too. > > > > IO submission is queued to a different thread context because to > > > avoid a potential deadlock. That is, we are operating here in the > > > writeback context of another filesystem, and so we cannot do things > > > like depend on memory allocation making forwards progress for IO > > > submission. RWF_NOWAIT is not a guarantee that memory allocation > > > will not occur in the IO submission path - it is implemented as best > > > effort non-blocking behaviour. > > > > Yes, that is why BLK_MQ_F_BLOCKING is added. > > > > > > > > Further, if we have stacked loop devices (e.g. > > > xfs-loop-ext4-loop-btrfs-loop-xfs) we can will be stacking > > > RWF_NOWAIT IO submission contexts through multiple filesystems. This > > > is not a filesystem IO path I want to support - being in the middle > > > of such a stack creates all sorts of subtle constraints on behaviour > > > that otherwise wouldn't exist. We actually do this sort of multi-fs > > > stacking in fstests, so it's not a made up scenario. > > > > > > I'm also concerned that RWF_NOWAIT submission is not an optimisation > > > at all for the common sparse/COW image file case, because in this > > > case RWF_NOWAIT failing with EAGAIN is just as common (if not > > > moreso) than it succeeding. > > > > > > i.e. in this case, RWF_NOWAIT writes will fail with -EAGAIN very > > > frequently, so all that time spent doing IO submission is wasted > > > time. > > > > Right. > > > > I saw that when I wrote ublk/zoned in which every write needs to > > allocate space. It is workaround by preallocating space for one fixed > > range or the whole zone. > > *cough* > > You do realise that fallocate() serialises all IO on the file? i.e. 
> not only does it block all IO submission, mmap IO and other metadata > operations, it also waits for all IO in flight to complete and it > doesn't allow any IO to restart until the preallocation is complete. > It isn't one issue for zloop & ublk/zoned, each zone is emulated by one single file, and the zone is usual the parallelism unit. > i.e. preallocation creates a huge IO submission latency bubble in > the IO path. Hence preallocation is something that should be avoided > at runtime if at all possible. > > If you have problems with allocation overhead during IO, then we > have things like extent size hints that allow the per-IO allocation > to be made much larger than the IO itself. This effectively does > preallocation during IO without the IO pipeline bubbles that > preallocation requires to function correctly.... OK, I will try it later in ublk/zoned and see if the pre-allocation can be replaced by extent size hints, is it available for all popular FS? > > > > Further, because allocation on write requires an exclusive lock and > > > it is held for some time, this will affect read performance from the > > > backing device as well. i.e. block mapping during a read while a > > > write is doing allocation will also return EAGAIN for RWF_NOWAIT. > > > > But that can't be avoided without using NOWAIT, and read is blocked > > when write(WAIT) is in-progress. > > If the loop worker dispatches each IO in it's own task context, then > we don't care if a read IO blocks on some other write in progress. > It doesn't get in the way of any other IO submission... > > > > This will push the read off to the background worker thread to be > > > serviced and so that will go much slower than a RWF_NOWAIT read that > > > hits the backing file between writes doing allocation. i.e. read > > > latency is going to become much, much more variable. > > > > > > Hence I suspect RWF_NOWAIT is simply hiding the underlying issue > > > by providing this benchmark with a "pure overwrite" fast path that > > > avoids the overhead of queueing work through loop_queue_work().... > > > > > > Can you run these same loop dev tests using a sparse image file and > > > a sparse fio test file so that the fio benchmark measures the impact > > > of loop device block allocation on the test? I suspect the results > > > with the RWF_NOWAIT patch will be somewhat different to the fully > > > allocated case... > > > > Yes, it will be slower, and io_uring FS IO application is in > > the same situation, and usually application doesn't have such > > knowledge if RWF_NOWAIT can succeed. > > And that's my point: nothing above the filesystem will have - or > can have - any knowledge of whether NOWAIT IO will succeed or > not. > > Therefore we have to use our brains to analyse what the typical > behaviour of the filesystem will be for a given situation to > determine how best to optimise IO submission. Indeed, I think FS typical behaviour for loop case is valuable for optimizing loop driver, but care to provide more details if you have? > > > However, usually meta IO is much less compared with normal IO, so most > > times it will be a win to try NOWAIT first. > > Data vs metadata from the upper filesystem doesn't even enter into > the equation here. > > Filesystems like XFS, bcachefs and btrfs all dynamically create, > move and remove metadata, so metadata writes are as much of an > allocation problem for sparse loop backing files as write-once user > data IO is. 
For typical loop workloads, I still see much less metadata io, and that is why improvement from NOWAIT is big in Mikulas's test. Thanks, Ming [-- Attachment #2: 0001-loop-debug-add-threaded-aio.patch --] [-- Type: text/plain, Size: 3163 bytes --] From b3c78b04ac6b80e171271a833c0221b8ab3b2b00 Mon Sep 17 00:00:00 2001 From: Ming Lei <ming.lei@redhat.com> Date: Wed, 12 Mar 2025 07:33:55 +0000 Subject: [PATCH] loop: debug: add threaded aio Signed-off-by: Ming Lei <ming.lei@redhat.com> --- drivers/block/loop.c | 47 +++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 46 insertions(+), 1 deletion(-) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 393f38573169..113e29b3ac80 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -86,6 +86,8 @@ struct loop_cmd { struct bio_vec *bvec; struct cgroup_subsys_state *blkcg_css; struct cgroup_subsys_state *memcg_css; + + struct work_struct work; }; #define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ) @@ -98,6 +100,8 @@ static DEFINE_IDR(loop_index_idr); static DEFINE_MUTEX(loop_ctl_mutex); static DEFINE_MUTEX(loop_validate_mutex); +static unsigned int threaded_aio = 0; + /** * loop_global_lock_killable() - take locks for safe loop_validate_file() test * @@ -508,10 +512,17 @@ static int lo_rw_aio_prep(struct loop_device *lo, struct loop_cmd *cmd, static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, loff_t pos) { unsigned int nr_bvec = lo_cmd_nr_bvec(cmd); - int ret; + int ret = 0; + + if (threaded_aio) { + ret = lo_rw_aio_prep(lo, cmd, nr_bvec); + if (ret < 0) + goto exit; + } cmd->iocb.ki_flags &= ~IOCB_NOWAIT; ret = lo_submit_rw_aio(lo, cmd, pos, nr_bvec); +exit: if (ret != -EIOCBQUEUED) lo_rw_aio_complete(&cmd->iocb, ret); return 0; @@ -1941,10 +1952,39 @@ static const struct kernel_param_ops loop_nr_hw_q_param_ops = { device_param_cb(nr_hw_queues, &loop_nr_hw_q_param_ops, &nr_hw_queues, 0444); MODULE_PARM_DESC(nr_hw_queues, "number of hardware queues. Default: " __stringify(LOOP_DEFAULT_NR_HW_Q)); +module_param(threaded_aio, uint, 0644); +MODULE_PARM_DESC(threaded_aio, "threaded aio"); + MODULE_DESCRIPTION("Loopback device support"); MODULE_LICENSE("GPL"); MODULE_ALIAS_BLOCKDEV_MAJOR(LOOP_MAJOR); +static void loop_aio_workfn(struct work_struct *work) +{ + struct loop_cmd *cmd = container_of(work, struct loop_cmd, work); + struct request *rq = blk_mq_rq_from_pdu(cmd); + struct loop_device *lo = rq->q->queuedata; + int ret = do_req_filebacked(lo, rq); + + if (!cmd->use_aio || ret) { + if (ret == -EOPNOTSUPP) + cmd->ret = ret; + else + cmd->ret = ret ? -EIO : 0; + if (likely(!blk_should_fake_timeout(rq->q))) + blk_mq_complete_request(rq); + } +} + +static void loop_queue_rq_threaded_aio(struct loop_cmd *cmd) +{ + struct request *rq = blk_mq_rq_from_pdu(cmd); + struct loop_device *lo = rq->q->queuedata; + + INIT_WORK(&cmd->work, loop_aio_workfn); + queue_work(lo->workqueue, &cmd->work); +} + static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx, const struct blk_mq_queue_data *bd) { @@ -1968,6 +2008,11 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx, break; } + if (threaded_aio) { + loop_queue_rq_threaded_aio(cmd); + return BLK_STS_OK; + } + if (cmd->use_aio) { loff_t pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset; int ret = lo_rw_aio_nowait(lo, cmd, pos); -- 2.47.1 ^ permalink raw reply related [flat|nested] 45+ messages in thread
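The per-zone preallocation Ming refers to earlier in this exchange amounts to an upfront fallocate() of the zone-sized backing file, so that later writes become pure overwrites that can succeed with NOWAIT. A one-line sketch, assuming a plain mode-0 allocation is what the ublk/zoned code does:

#define _GNU_SOURCE
#include <fcntl.h>

/* Preallocate the whole zone file up front so subsequent writes do not
 * need to allocate blocks. */
static int prealloc_zone(int fd, off_t zone_size)
{
	return fallocate(fd, 0, 0, zone_size);
}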
* Re: [PATCH] the dm-loop target 2025-03-12 8:26 ` Ming Lei @ 2025-03-13 1:36 ` Ming Lei 2025-03-13 16:36 ` Mikulas Patocka 1 sibling, 0 replies; 45+ messages in thread From: Ming Lei @ 2025-03-13 1:36 UTC (permalink / raw) To: Dave Chinner Cc: Mikulas Patocka, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring On Wed, Mar 12, 2025 at 04:27:12PM +0800, Ming Lei wrote: > On Wed, Mar 12, 2025 at 01:34:02PM +1100, Dave Chinner wrote: ... > > block layer/storage has many optimization for batching handling, if IOs > are submitted from many contexts: > > - this batch handling optimization is gone > > - IO is re-ordered from underlying hardware viewpoint > > - more contention from FS write lock, because loop has single back file. > > That is why the single task context is taken from the beginning of loop aio, > and it performs pretty well for sequential IO workloads, as I shown > in the zloop example. > > > > > > It isn't perfect, sometime it may be slower than running on io-wq > > > directly. > > > > > > But is there any better way for covering everything? > > > > Yes - fix the loop queue workers. > > What you suggested is threaded aio by submitting IO concurrently from > different task context, this way is not the most efficient one, otherwise > modern language won't invent async/.await. > > In my test VM, by running Mikulas's fio script on loop/nvme by the attached > threaded_aio patch: > > NOWAIT with MQ 4 : 70K iops(read), 70K iops(write), cpu util: 40% > threaded_aio with MQ 4 : 64k iops(read), 64K iops(write), cpu util: 52% > in tree loop(SQ) : 58K iops(read), 58K iops(write) > > Mikulas, please feel free to run your tests with threaded_aio: > > modprobe loop nr_hw_queues=4 threaded_aio=1 > > by applying the attached the patch over the loop patchset. > > The performance gap could be more obvious in fast hardware. For the normal single job sequential WRITE workload, on same test VM, still loop over /dev/nvme0n1, and running fio over loop directly: fio --direct=1 --bs=4k --runtime=40 --time_based --numjobs=1 --ioengine=libaio \ --iodepth=16 --group_reporting=1 --filename=/dev/loop0 -name=job --rw=write threaded_aio(SQ) : 81k iops(write), cpu util: 20% in tree loop(SQ) : 100K iops(write), cpu util: 7% Thanks, Ming ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-12 8:26 ` Ming Lei 2025-03-13 1:36 ` Ming Lei @ 2025-03-13 16:36 ` Mikulas Patocka 2025-03-18 4:27 ` Dave Chinner 1 sibling, 1 reply; 45+ messages in thread From: Mikulas Patocka @ 2025-03-13 16:36 UTC (permalink / raw) To: Ming Lei Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring On Wed, 12 Mar 2025, Ming Lei wrote: > > > It isn't perfect, sometime it may be slower than running on io-wq > > > directly. > > > > > > But is there any better way for covering everything? > > > > Yes - fix the loop queue workers. > > What you suggested is threaded aio by submitting IO concurrently from > different task context, this way is not the most efficient one, otherwise > modern language won't invent async/.await. > > In my test VM, by running Mikulas's fio script on loop/nvme by the attached > threaded_aio patch: > > NOWAIT with MQ 4 : 70K iops(read), 70K iops(write), cpu util: 40% > threaded_aio with MQ 4 : 64k iops(read), 64K iops(write), cpu util: 52% > in tree loop(SQ) : 58K iops(read), 58K iops(write) > > Mikulas, please feel free to run your tests with threaded_aio: > > modprobe loop nr_hw_queues=4 threaded_aio=1 > > by applying the attached the patch over the loop patchset. > > The performance gap could be more obvious in fast hardware. With "threaded_aio=1": Sync io fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw xfs/loop/xfs READ: bw=300MiB/s (315MB/s), 300MiB/s-300MiB/s (315MB/s-315MB/s), io=3001MiB (3147MB), run=10001-10001msec WRITE: bw=300MiB/s (315MB/s), 300MiB/s-300MiB/s (315MB/s-315MB/s), io=3004MiB (3149MB), run=10001-10001msec Async io fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw xfs/loop/xfs READ: bw=869MiB/s (911MB/s), 869MiB/s-869MiB/s (911MB/s-911MB/s), io=8694MiB (9116MB), run=10002-10002msec WRITE: bw=870MiB/s (913MB/s), 870MiB/s-870MiB/s (913MB/s-913MB/s), io=8706MiB (9129MB), run=10002-10002msec Without "threaded_aio=1": Sync io fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw xfs/loop/xfs READ: bw=348MiB/s (365MB/s), 348MiB/s-348MiB/s (365MB/s-365MB/s), io=3481MiB (3650MB), run=10001-10001msec WRITE: bw=348MiB/s (365MB/s), 348MiB/s-348MiB/s (365MB/s-365MB/s), io=3484MiB (3653MB), run=10001-10001msec Async io fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw xfs/loop/xfs READ: bw=1186MiB/s (1244MB/s), 1186MiB/s-1186MiB/s (1244MB/s-1244MB/s), io=11.6GiB (12.4GB), run=10001-10001msec WRITE: bw=1187MiB/s (1245MB/s), 1187MiB/s-1187MiB/s (1245MB/s-1245MB/s), io=11.6GiB (12.5GB), run=10001-10001msec Mikulas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-13 16:36 ` Mikulas Patocka @ 2025-03-18 4:27 ` Dave Chinner 2025-03-18 7:57 ` Christoph Hellwig 0 siblings, 1 reply; 45+ messages in thread From: Dave Chinner @ 2025-03-18 4:27 UTC (permalink / raw) To: Mikulas Patocka Cc: Ming Lei, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring On Thu, Mar 13, 2025 at 05:36:53PM +0100, Mikulas Patocka wrote: > On Wed, 12 Mar 2025, Ming Lei wrote: > > > > > It isn't perfect, sometime it may be slower than running on io-wq > > > > directly. > > > > > > > > But is there any better way for covering everything? > > > > > > Yes - fix the loop queue workers. > > > > What you suggested is threaded aio by submitting IO concurrently from > > different task context, this way is not the most efficient one, otherwise > > modern language won't invent async/.await. > > > > In my test VM, by running Mikulas's fio script on loop/nvme by the attached > > threaded_aio patch: > > > > NOWAIT with MQ 4 : 70K iops(read), 70K iops(write), cpu util: 40% > > threaded_aio with MQ 4 : 64k iops(read), 64K iops(write), cpu util: 52% > > in tree loop(SQ) : 58K iops(read), 58K iops(write) > > > > Mikulas, please feel free to run your tests with threaded_aio: > > > > modprobe loop nr_hw_queues=4 threaded_aio=1 > > > > by applying the attached the patch over the loop patchset. > > > > The performance gap could be more obvious in fast hardware. > > With "threaded_aio=1": > > Sync io > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > xfs/loop/xfs > READ: bw=300MiB/s (315MB/s), 300MiB/s-300MiB/s (315MB/s-315MB/s), io=3001MiB (3147MB), run=10001-10001msec > WRITE: bw=300MiB/s (315MB/s), 300MiB/s-300MiB/s (315MB/s-315MB/s), io=3004MiB (3149MB), run=10001-10001msec > > Async io > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > xfs/loop/xfs > READ: bw=869MiB/s (911MB/s), 869MiB/s-869MiB/s (911MB/s-911MB/s), io=8694MiB (9116MB), run=10002-10002msec > WRITE: bw=870MiB/s (913MB/s), 870MiB/s-870MiB/s (913MB/s-913MB/s), io=8706MiB (9129MB), run=10002-10002msec The original numbers for the xfs/loop/xfs performance were 220MiB/s (sync) and 276MiB/s (async), so this is actually a very big step forward compared to the existing code. Yes, it's not quite as fast as the NOWAIT case for pure overwrites - 348MB/s (sync) and 1186MB/s (async), but we predicted (and expected) that this would be the case. However, this is still testing the static file, pure overwrite case only, so there is never any IO that blocks during submission. When IO will block (because there are allocating writes in progress) performance in the NOWAIT case will trend back towards the original performance levels because the single loop queue blocking submission will still be the limiting factor for all IO that needs to block. IOWs, these results show that to get decent, consistent performance out of the loop device we need threaded blocking submission so users do not have to care about optimising individual loop device instances for the layout of their image files. 
Yes, NOWAIT may then add an incremental performance improvement on top for optimal layout cases, but I'm still not yet convinced that it is a generally applicable loop device optimisation that everyone wants to always enable due to the potential for 100% NOWAIT submission failure on any given loop device..... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-18 4:27 ` Dave Chinner @ 2025-03-18 7:57 ` Christoph Hellwig 2025-03-18 9:34 ` Ming Lei 0 siblings, 1 reply; 45+ messages in thread From: Christoph Hellwig @ 2025-03-18 7:57 UTC (permalink / raw) To: Dave Chinner Cc: Mikulas Patocka, Ming Lei, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring On Tue, Mar 18, 2025 at 03:27:48PM +1100, Dave Chinner wrote: > Yes, NOWAIT may then add an incremental performance improvement on > top for optimal layout cases, but I'm still not yet convinced that > it is a generally applicable loop device optimisation that everyone > wants to always enable due to the potential for 100% NOWAIT > submission failure on any given loop device..... Yes, I think this is a really good first step: 1) switch loop to use a per-command work_item unconditionally, which also has the nice effect that it cleans up the horrible mess of the per-blkcg workers. (note that this is what the nvmet file backend has always done with good result) 2) look into NOWAIT submission, especially for reads, where this should be a clear winner and probably done unconditionally. For writes it might be a bit of a tradeoff if we expect the writes to allocate a lot, so we might want some kind of tunable for it. ^ permalink raw reply [flat|nested] 45+ messages in thread
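A rough sketch of what step 2 could look like on top of the per-command worker from Ming's threaded_aio patch above. It reuses the helpers from that patch (lo_submit_rw_aio(), lo_cmd_nr_bvec(), loop_aio_workfn(), cmd->work) and is illustrative only; completion handling as done in lo_rw_aio() is omitted.

/* Sketch: attempt reads with IOCB_NOWAIT in the submitting context and
 * fall back to the per-command worker when the filesystem would block.
 * Writes (which may need allocation) always go to the worker here.
 */
static blk_status_t loop_queue_rq_sketch(struct loop_device *lo,
					 struct loop_cmd *cmd,
					 struct request *rq)
{
	if (cmd->use_aio && req_op(rq) == REQ_OP_READ) {
		loff_t pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
		int ret;

		cmd->iocb.ki_flags |= IOCB_NOWAIT;
		ret = lo_submit_rw_aio(lo, cmd, pos, lo_cmd_nr_bvec(cmd));
		if (ret != -EAGAIN && ret != -EOPNOTSUPP)
			return BLK_STS_OK;	/* submitted without blocking */
		cmd->iocb.ki_flags &= ~IOCB_NOWAIT;
	}

	INIT_WORK(&cmd->work, loop_aio_workfn);
	queue_work(lo->workqueue, &cmd->work);
	return BLK_STS_OK;
}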
* Re: [PATCH] the dm-loop target 2025-03-18 7:57 ` Christoph Hellwig @ 2025-03-18 9:34 ` Ming Lei 2025-03-20 7:08 ` Christoph Hellwig 0 siblings, 1 reply; 45+ messages in thread From: Ming Lei @ 2025-03-18 9:34 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring On Tue, Mar 18, 2025 at 12:57:17AM -0700, Christoph Hellwig wrote: > On Tue, Mar 18, 2025 at 03:27:48PM +1100, Dave Chinner wrote: > > Yes, NOWAIT may then add an incremental performance improvement on > > top for optimal layout cases, but I'm still not yet convinced that > > it is a generally applicable loop device optimisation that everyone > > wants to always enable due to the potential for 100% NOWAIT > > submission failure on any given loop device..... NOWAIT failure can be avoided actually: https://lore.kernel.org/linux-block/20250314021148.3081954-6-ming.lei@redhat.com/ > > Yes, I think this is a really good first step: > > 1) switch loop to use a per-command work_item unconditionally, which also > has the nice effect that it cleans up the horrible mess of the > per-blkcg workers. (note that this is what the nvmet file backend has It could be worse to take per-command work, because IO handling crosses all system wq worker contexts. > always done with good result) per-command work does burn lots of CPU unnecessarily, it isn't good for use case of container, and it can not perform as well as NOWAIT. > 2) look into NOWAIT submission, especially for reads this should be > a clear winner and probaby done unconditionally. For writes it > might be a bit of a tradeoff if we expect the writes to allocate > a lot, so we might want some kind of tunable for it. It is a winner for over-write too. WRITE with allocation can be kept to submit from wq context, see my patchset V2. Thanks, Ming ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-18 9:34 ` Ming Lei @ 2025-03-20 7:08 ` Christoph Hellwig 2025-03-20 7:41 ` Ming Lei 0 siblings, 1 reply; 45+ messages in thread From: Christoph Hellwig @ 2025-03-20 7:08 UTC (permalink / raw) To: Ming Lei Cc: Christoph Hellwig, Dave Chinner, Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring On Tue, Mar 18, 2025 at 05:34:28PM +0800, Ming Lei wrote: > On Tue, Mar 18, 2025 at 12:57:17AM -0700, Christoph Hellwig wrote: > > On Tue, Mar 18, 2025 at 03:27:48PM +1100, Dave Chinner wrote: > > > Yes, NOWAIT may then add an incremental performance improvement on > > > top for optimal layout cases, but I'm still not yet convinced that > > > it is a generally applicable loop device optimisation that everyone > > > wants to always enable due to the potential for 100% NOWAIT > > > submission failure on any given loop device..... > > NOWAIT failure can be avoided actually: > > https://lore.kernel.org/linux-block/20250314021148.3081954-6-ming.lei@redhat.com/ That's a very complex set of heuristics which doesn't match up with other uses of it. > > > > > Yes, I think this is a really good first step: > > > > 1) switch loop to use a per-command work_item unconditionally, which also > > has the nice effect that it cleans up the horrible mess of the > > per-blkcg workers. (note that this is what the nvmet file backend has > > It could be worse to take per-command work, because IO handling crosses > all system wq worker contexts. So do other workloads with pretty good success. > > > always done with good result) > > per-command work does burn lots of CPU unnecessarily, it isn't good for > use case of container That does not match my observations in say nvmet. But if you have numbers please share them. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-20 7:08 ` Christoph Hellwig @ 2025-03-20 7:41 ` Ming Lei 2025-03-20 14:22 ` Christoph Hellwig 2025-03-25 10:15 ` Dave Chinner 0 siblings, 2 replies; 45+ messages in thread From: Ming Lei @ 2025-03-20 7:41 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring On Thu, Mar 20, 2025 at 12:08:19AM -0700, Christoph Hellwig wrote: > On Tue, Mar 18, 2025 at 05:34:28PM +0800, Ming Lei wrote: > > On Tue, Mar 18, 2025 at 12:57:17AM -0700, Christoph Hellwig wrote: > > > On Tue, Mar 18, 2025 at 03:27:48PM +1100, Dave Chinner wrote: > > > > Yes, NOWAIT may then add an incremental performance improvement on > > > > top for optimal layout cases, but I'm still not yet convinced that > > > > it is a generally applicable loop device optimisation that everyone > > > > wants to always enable due to the potential for 100% NOWAIT > > > > submission failure on any given loop device..... > > > > NOWAIT failure can be avoided actually: > > > > https://lore.kernel.org/linux-block/20250314021148.3081954-6-ming.lei@redhat.com/ > > That's a very complex set of heuristics which doesn't match up > with other uses of it. I'd suggest you to point them out in the patch review. > > > > > > > > > Yes, I think this is a really good first step: > > > > > > 1) switch loop to use a per-command work_item unconditionally, which also > > > has the nice effect that it cleans up the horrible mess of the > > > per-blkcg workers. (note that this is what the nvmet file backend has > > > > It could be worse to take per-command work, because IO handling crosses > > all system wq worker contexts. > > So do other workloads with pretty good success. > > > > > > always done with good result) > > > > per-command work does burn lots of CPU unnecessarily, it isn't good for > > use case of container > > That does not match my observations in say nvmet. But if you have > numbers please share them. Please see the result I posted: https://lore.kernel.org/linux-block/Z9FFTiuMC8WD6qMH@fedora/ Thanks, Ming ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-20 7:41 ` Ming Lei @ 2025-03-20 14:22 ` Christoph Hellwig 2025-03-20 14:36 ` Ming Lei 0 siblings, 1 reply; 45+ messages in thread From: Christoph Hellwig @ 2025-03-20 14:22 UTC (permalink / raw) To: Ming Lei Cc: Christoph Hellwig, Dave Chinner, Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring On Thu, Mar 20, 2025 at 03:41:58PM +0800, Ming Lei wrote: > > That does not match my observations in say nvmet. But if you have > > numbers please share them. > > Please see the result I posted: > > https://lore.kernel.org/linux-block/Z9FFTiuMC8WD6qMH@fedora/ That shows it improves numbers and not that it doesn't. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-20 14:22 ` Christoph Hellwig @ 2025-03-20 14:36 ` Ming Lei 0 siblings, 0 replies; 45+ messages in thread From: Ming Lei @ 2025-03-20 14:36 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring On Thu, Mar 20, 2025 at 10:22 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Thu, Mar 20, 2025 at 03:41:58PM +0800, Ming Lei wrote: > > > That does not match my observations in say nvmet. But if you have > > > numbers please share them. > > > > Please see the result I posted: > > > > https://lore.kernel.org/linux-block/Z9FFTiuMC8WD6qMH@fedora/ > > That shows it improves numbers and not that it doesn't. Fine, then please look at the result in the following reply: https://lore.kernel.org/linux-block/Z9I2lm31KOQ784nb@fedora/ Thanks, ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-20 7:41 ` Ming Lei 2025-03-20 14:22 ` Christoph Hellwig @ 2025-03-25 10:15 ` Dave Chinner 2025-03-25 12:23 ` Ming Lei 1 sibling, 1 reply; 45+ messages in thread From: Dave Chinner @ 2025-03-25 10:15 UTC (permalink / raw) To: Ming Lei Cc: Christoph Hellwig, Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring On Thu, Mar 20, 2025 at 03:41:58PM +0800, Ming Lei wrote: > On Thu, Mar 20, 2025 at 12:08:19AM -0700, Christoph Hellwig wrote: > > On Tue, Mar 18, 2025 at 05:34:28PM +0800, Ming Lei wrote: > > > On Tue, Mar 18, 2025 at 12:57:17AM -0700, Christoph Hellwig wrote: > > > > On Tue, Mar 18, 2025 at 03:27:48PM +1100, Dave Chinner wrote: > > > > > Yes, NOWAIT may then add an incremental performance improvement on > > > > > top for optimal layout cases, but I'm still not yet convinced that > > > > > it is a generally applicable loop device optimisation that everyone > > > > > wants to always enable due to the potential for 100% NOWAIT > > > > > submission failure on any given loop device..... > > > > > > NOWAIT failure can be avoided actually: > > > > > > https://lore.kernel.org/linux-block/20250314021148.3081954-6-ming.lei@redhat.com/ > > > > That's a very complex set of heuristics which doesn't match up > > with other uses of it. > > I'd suggest you to point them out in the patch review. Until you pointed them out here, I didn't know these patches existed. Please cc linux-fsdevel on any loop device changes you are proposing, Ming. It is as much a filesystem driver as it is a block device, and it changes need review from both sides of the fence. > > > > Yes, I think this is a really good first step: > > > > > > > > 1) switch loop to use a per-command work_item unconditionally, which also > > > > has the nice effect that it cleans up the horrible mess of the > > > > per-blkcg workers. (note that this is what the nvmet file backend has > > > > > > It could be worse to take per-command work, because IO handling crosses > > > all system wq worker contexts. > > > > So do other workloads with pretty good success. > > > > > > > > > always done with good result) > > > > > > per-command work does burn lots of CPU unnecessarily, it isn't good for > > > use case of container > > > > That does not match my observations in say nvmet. But if you have > > numbers please share them. > > Please see the result I posted: > > https://lore.kernel.org/linux-block/Z9FFTiuMC8WD6qMH@fedora/ You are arguing in circles about how we need to optimise for static file layouts. Please listen to the filesystem people when they tell you that static file layouts are a -secondary- optimisation target for loop devices. The primary optimisation target is the modification that makes all types of IO perform better in production, not just the one use case that overwrite-specific IO benchmarks exercise. If you want me to test your changes, I have a very loop device heavy workload here - it currently creates about 300 *sparse* loop devices totalling about 1.2TB of capacity, then does all sorts of IO to them through both the loop devices themselves and filesystems created on top of the loop devices. It typically generates 4-5GB/s of IO through the loop devices to the backing filesystem and it's physical storage. Speeding up or slowing down IO submission through the loop devices has direct impact on the speed of the workload. 
Using buffered IO through the loop device right now is about 25% faster than using aio+dio for the loop because there is some amount of re-read and re-write in the filesystem IO patterns. That said, AIO+DIO should be much faster than it is, hence my interest in making all the AIO+DIO IO submission independent of potential blocking operations. Hence if you have patch sets that improve loop device performance, then you need to make sure filesystem people like myself see those patch series so they can be tested and reviewed in a timely manner. That means you need to cc loop device patches to linux-fsdevel.... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-25 10:15 ` Dave Chinner @ 2025-03-25 12:23 ` Ming Lei 0 siblings, 0 replies; 45+ messages in thread From: Ming Lei @ 2025-03-25 12:23 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel, io-uring On Tue, Mar 25, 2025 at 09:15:26PM +1100, Dave Chinner wrote: > On Thu, Mar 20, 2025 at 03:41:58PM +0800, Ming Lei wrote: > > On Thu, Mar 20, 2025 at 12:08:19AM -0700, Christoph Hellwig wrote: > > > On Tue, Mar 18, 2025 at 05:34:28PM +0800, Ming Lei wrote: > > > > On Tue, Mar 18, 2025 at 12:57:17AM -0700, Christoph Hellwig wrote: > > > > > On Tue, Mar 18, 2025 at 03:27:48PM +1100, Dave Chinner wrote: > > > > > > Yes, NOWAIT may then add an incremental performance improvement on > > > > > > top for optimal layout cases, but I'm still not yet convinced that > > > > > > it is a generally applicable loop device optimisation that everyone > > > > > > wants to always enable due to the potential for 100% NOWAIT > > > > > > submission failure on any given loop device..... > > > > > > > > NOWAIT failure can be avoided actually: > > > > > > > > https://lore.kernel.org/linux-block/20250314021148.3081954-6-ming.lei@redhat.com/ > > > > > > That's a very complex set of heuristics which doesn't match up > > > with other uses of it. > > > > I'd suggest you to point them out in the patch review. > > Until you pointed them out here, I didn't know these patches > existed. > > Please cc linux-fsdevel on any loop device changes you are > proposing, Ming. It is as much a filesystem driver as it is a block > device, and it changes need review from both sides of the fence. Please see the patchset: https://lore.kernel.org/linux-block/20250322012617.354222-1-ming.lei@redhat.com/ > > > > > > Yes, I think this is a really good first step: > > > > > > > > > > 1) switch loop to use a per-command work_item unconditionally, which also > > > > > has the nice effect that it cleans up the horrible mess of the > > > > > per-blkcg workers. (note that this is what the nvmet file backend has > > > > > > > > It could be worse to take per-command work, because IO handling crosses > > > > all system wq worker contexts. > > > > > > So do other workloads with pretty good success. > > > > > > > > > > > > always done with good result) > > > > > > > > per-command work does burn lots of CPU unnecessarily, it isn't good for > > > > use case of container > > > > > > That does not match my observations in say nvmet. But if you have > > > numbers please share them. > > > > Please see the result I posted: > > > > https://lore.kernel.org/linux-block/Z9FFTiuMC8WD6qMH@fedora/ > > You are arguing in circles about how we need to optimise for static > file layouts. > > Please listen to the filesystem people when they tell you that > static file layouts are a -secondary- optimisation target for loop > devices. > > The primary optimisation target is the modification that makes all > types of IO perform better in production, not just the one use case > that overwrite-specific IO benchmarks exercise. > > If you want me to test your changes, I have a very loop device heavy > workload here - it currently creates about 300 *sparse* loop devices > totalling about 1.2TB of capacity, then does all sorts of IO to them > through both the loop devices themselves and filesystems created on > top of the loop devices. 
> It typically generates 4-5GB/s of IO > through the loop devices to the backing filesystem and it's physical > storage. The patchset does cover the sparse backing file case, and I also provide one test case in which a completely sparse file is used, to make sure that there isn't a regression in this case. This patchset is supposed to address Mikulas's case of a stable FS mapping while not introducing regressions in other cases, such as a sparse backing file. > > Speeding up or slowing down IO submission through the loop devices > has direct impact on the speed of the workload. Using buffered IO > through the loop device right now is about 25% faster than using > aio+dio for the loop because there is some amount of re-read and > re-write in the filesystem IO patterns. That said, AIO+DIO should be > much faster than it is, hence my interest in making all the AIO+DIO > IO submission independent of potential blocking operations. > > Hence if you have patch sets that improve loop device performance, > then you need to make sure filesystem people like myself see those > patch series so they can be tested and reviewed in a timely manner. > That means you need to cc loop device patches to linux-fsdevel.... OK, I will Cc you and linux-fsdevel on future loop patch submissions. Thanks, Ming ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-07 15:21 ` Mikulas Patocka 2025-03-08 3:49 ` Darrick J. Wong 2025-03-09 0:05 ` Ming Lei @ 2025-03-09 0:16 ` Ming Lei 2025-03-10 11:20 ` Mikulas Patocka 2 siblings, 1 reply; 45+ messages in thread From: Ming Lei @ 2025-03-09 0:16 UTC (permalink / raw) To: Mikulas Patocka Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote: > > I didn't say you were. I said the concept that dm-loop is based on > > is fundamentally flawed and that your benchmark setup does not > > reflect real world usage of loop devices. > > > Where are the bug reports about the loop device being slow and the > > analysis that indicates that it is unfixable? > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow. > > synchronous I/O: > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > raw block device: > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec > ext4/loop/ext4: > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec > xfs/loop/xfs: > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec > ext4/dm-loop/ext4: > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec > xfs/dm-loop/xfs: > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec > > asynchronous I/O: > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > raw block device: > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec BTW, raw device is supposed to be xfs or ext4 over raw block device, right? Otherwise, please provide test data for this case, then it becomes one fair comparison because there should be lock contention for FS write IOs on same file. Thanks, Ming ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-09 0:16 ` Ming Lei @ 2025-03-10 11:20 ` Mikulas Patocka 0 siblings, 0 replies; 45+ messages in thread From: Mikulas Patocka @ 2025-03-10 11:20 UTC (permalink / raw) To: Ming Lei Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Sun, 9 Mar 2025, Ming Lei wrote: > On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote: > > > I didn't say you were. I said the concept that dm-loop is based on > > > is fundamentally flawed and that your benchmark setup does not > > > reflect real world usage of loop devices. > > > > > Where are the bug reports about the loop device being slow and the > > > analysis that indicates that it is unfixable? > > > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG > > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup > > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow. > > > > synchronous I/O: > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > > raw block device: > > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec > > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec > > ext4/loop/ext4: > > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec > > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec > > xfs/loop/xfs: > > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec > > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec > > ext4/dm-loop/ext4: > > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec > > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec > > xfs/dm-loop/xfs: > > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec > > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec > > > > asynchronous I/O: > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > > raw block device: > > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > > BTW, raw device is supposed to be xfs or ext4 over raw block device, right? No - raw device means fio directly on /dev/nvme0n1 > Otherwise, please provide test data for this case, then it becomes one fair > comparison because there should be lock contention for FS write IOs on same file. > > Thanks, > Ming Mikulas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-03 21:03 ` Mikulas Patocka 2025-03-04 2:13 ` Dave Chinner @ 2025-03-04 13:49 ` Christoph Hellwig [not found] ` <CAM23Vxr=fKy-0L1R5P-5h6A95acKT_d=CC1E+TAzAs8v6q9gHw@mail.gmail.com> 2025-03-12 13:26 ` Kent Overstreet 1 sibling, 2 replies; 45+ messages in thread From: Christoph Hellwig @ 2025-03-04 13:49 UTC (permalink / raw) To: Mikulas Patocka Cc: Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote: > Swapfile does ahead of time mapping. It does. But it is: a) protected against modification by the S_SWAPFILE flag and checked for full allocation first b) something we want to get rid of because even with the above it is rather problematic > And I just looked at what swapfile > does and copied the logic into dm-loop. If swapfile is not broken, how > could dm-loop be broken? As said above, swapfile works around the brokenness in ways that you can't. And just blindly copying old code without understanding it is never a good idea. > > > > Would Jens Axboe agree to merge the dm-loop logic into the existing loop > > > driver? > > > > What logic? > > The ahead-of-time mapping. As said multiple times you can't do that. The block mapping is file system private information. ^ permalink raw reply [flat|nested] 45+ messages in thread
[parent not found: <CAM23Vxr=fKy-0L1R5P-5h6A95acKT_d=CC1E+TAzAs8v6q9gHw@mail.gmail.com>]
* Re: [PATCH] the dm-loop target [not found] ` <CAM23Vxr=fKy-0L1R5P-5h6A95acKT_d=CC1E+TAzAs8v6q9gHw@mail.gmail.com> @ 2025-03-04 16:04 ` Christoph Hellwig [not found] ` <CAM23VxqJX46DCpCiH5qxPpDLtMVg87Ba8sx55aQ4hvt-XaHzuQ@mail.gmail.com> 0 siblings, 1 reply; 45+ messages in thread From: Christoph Hellwig @ 2025-03-04 16:04 UTC (permalink / raw) To: Heinz Mauelshagen Cc: Christoph Hellwig, Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, zkabelac, dm-devel, linux-block, linux-fsdevel On Tue, Mar 04, 2025 at 03:50:36PM +0100, Heinz Mauelshagen wrote: > Seems to have a loop target to add respective mapped device semantics, the > way to go is to design it using VFS API to get it upstream. > > Won't help the Android use case requested by Google thus they'll have to > stick to their multi-segment linear mapped approach to achieve their > performance goals unless they want to support dm-loop OOT to make their > life easier on immutable file backings. What are "the performance goals" and what is in the way of achieving them? This thread has massively blown up but no one has even tried to explain what the thing tries to address. ^ permalink raw reply [flat|nested] 45+ messages in thread
[parent not found: <CAM23VxqJX46DCpCiH5qxPpDLtMVg87Ba8sx55aQ4hvt-XaHzuQ@mail.gmail.com>]
* Re: [PATCH] the dm-loop target [not found] ` <CAM23VxqJX46DCpCiH5qxPpDLtMVg87Ba8sx55aQ4hvt-XaHzuQ@mail.gmail.com> @ 2025-03-04 17:17 ` Christoph Hellwig 0 siblings, 0 replies; 45+ messages in thread From: Christoph Hellwig @ 2025-03-04 17:17 UTC (permalink / raw) To: Heinz Mauelshagen Cc: Christoph Hellwig, Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, zkabelac, dm-devel, linux-block, linux-fsdevel On Tue, Mar 04, 2025 at 06:01:04PM +0100, Heinz Mauelshagen wrote: > > As Mikulas shared earlier: > > The Android people concluded that loop is too slow and rather than using > > loop they want to map a file using a table with dm-linear targets over the > > image of the host filesystem. So, they are already doing what dm-loop is > > doing. Which again is completely handwavy. What is the workload? What are the issues with the workload? Where is the time spent that makes it slower? E.g. the extent mapping code in this patch is a lot less efficient than the btree-based extent lookup in XFS. If it is faster, it is not the extent lookup but something else, and you need to figure out what it is. ^ permalink raw reply [flat|nested] 45+ messages in thread
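For readers following the dm-linear reference above: such an ahead-of-time mapping can be built from userspace by querying the file's extents with the FIEMAP ioctl and emitting one dm-linear table line per extent. The sketch below is hypothetical and has error handling trimmed; FS_IOC_FIEMAP and the fiemap structures are the real UAPI, while the program itself is only an illustration. Nothing in it asks the filesystem to keep the returned mapping stable.

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    /* Print a dm-linear table for FILE backed by DEV (units: 512-byte sectors). */
    int main(int argc, char **argv)
    {
            int fd = open(argv[1], O_RDONLY);
            const char *dev = argv[2];
            unsigned int max = 512;         /* single ioctl call, sketch only */
            struct fiemap *fm = calloc(1, sizeof(*fm) + max * sizeof(struct fiemap_extent));

            if (fd < 0 || !fm)
                    return 1;

            fm->fm_start = 0;
            fm->fm_length = FIEMAP_MAX_OFFSET;
            fm->fm_flags = FIEMAP_FLAG_SYNC;        /* flush dirty data first */
            fm->fm_extent_count = max;

            if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
                    return 1;

            for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
                    struct fiemap_extent *fe = &fm->fm_extents[i];

                    /* Unwritten or delalloc extents (and holes) cannot be mapped this way. */
                    if (fe->fe_flags & (FIEMAP_EXTENT_UNWRITTEN | FIEMAP_EXTENT_DELALLOC))
                            return 1;

                    printf("%llu %llu linear %s %llu\n",
                           (unsigned long long)fe->fe_logical / 512,
                           (unsigned long long)fe->fe_length / 512,
                           dev,
                           (unsigned long long)fe->fe_physical / 512);
            }
            return 0;
    }

If the filesystem later relocates, shares, or copy-on-writes any of those blocks, the generated table silently points at stale sectors, which is why the thread insists that the block mapping is filesystem-private information.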
* Re: [PATCH] the dm-loop target 2025-03-04 13:49 ` Christoph Hellwig [not found] ` <CAM23Vxr=fKy-0L1R5P-5h6A95acKT_d=CC1E+TAzAs8v6q9gHw@mail.gmail.com> @ 2025-03-12 13:26 ` Kent Overstreet 2025-03-12 14:20 ` Christoph Hellwig 1 sibling, 1 reply; 45+ messages in thread From: Kent Overstreet @ 2025-03-12 13:26 UTC (permalink / raw) To: Christoph Hellwig Cc: Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Tue, Mar 04, 2025 at 05:49:51AM -0800, Christoph Hellwig wrote: > On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote: > > Swapfile does ahead of time mapping. > > It does. But it is: > > a) protected against modification by the S_SWAPFILE flag and checked > for full allocation first > b) something we want to get rid of because even with the above it is > rather problematic > > > And I just looked at what swapfile > > does and copied the logic into dm-loop. If swapfile is not broken, how > > could dm-loop be broken? > > As said above, swapfile works around the brokenness in ways that you > can't. And just blindly copying old code without understanding it is > never a good idea. > > > > > > > Would Jens Axboe agree to merge the dm-loop logic into the existing loop > > > > driver? > > > > > > What logic? > > > > The ahead-of-time mapping. > > As said multiple times you can't do that. The block mapping is > file system private information. We might be able to provide an API to _request_ a stable mapping to a file - with proper synchronization, of course. I don't recall anyone ever trying that; it'd replace all the weird IS_SWAPFILE() hacks and be a safe way to do these kinds of performance optimizations. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-12 13:26 ` Kent Overstreet @ 2025-03-12 14:20 ` Christoph Hellwig 2025-03-12 16:09 ` Kent Overstreet 2025-03-13 16:21 ` Mikulas Patocka 0 siblings, 2 replies; 45+ messages in thread From: Christoph Hellwig @ 2025-03-12 14:20 UTC (permalink / raw) To: Kent Overstreet Cc: Christoph Hellwig, Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Wed, Mar 12, 2025 at 09:26:55AM -0400, Kent Overstreet wrote: > We might be able to provide an API to _request_ a stable mapping to a > file - with proper synchronization, of course. We already have that with the pNFS layouts. > I don't recall anyone ever trying that, it'd replace all the weird > IF_SWAPFILE() hacks and be a safe way to do these kinds of performance > optimizations. IS_SWAPFILE isn't going away, as we can't allow other writers to it. Also, aside from that, the layouts are fairly complex. The right way ahead for swap is to literally just treat it as a slightly special case of direct I/O that is allowed on IS_SWAPFILE files. We can safely do writeback to file-backed folios under memory pressure, so we can also go through the normal file system path. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-12 14:20 ` Christoph Hellwig @ 2025-03-12 16:09 ` Kent Overstreet 2025-03-13 12:44 ` Christoph Hellwig 2025-03-13 16:21 ` Mikulas Patocka 1 sibling, 1 reply; 45+ messages in thread From: Kent Overstreet @ 2025-03-12 16:09 UTC (permalink / raw) To: Christoph Hellwig Cc: Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Wed, Mar 12, 2025 at 07:20:11AM -0700, Christoph Hellwig wrote: > On Wed, Mar 12, 2025 at 09:26:55AM -0400, Kent Overstreet wrote: > > We might be able to provide an API to _request_ a stable mapping to a > > file - with proper synchronization, of course. > > We already have that with the pNFS layouts. Good point, maybe Mikulas wants to look at that if they care about squeezing every iops... > > > I don't recall anyone ever trying that, it'd replace all the weird > > IF_SWAPFILE() hacks and be a safe way to do these kinds of performance > > optimizations. > > IS_SWAPFILE isn't going way, as can't allow other writers to it. > Also asides from the that the layouts are fairly complex. > > The right way ahead for swap is to literally just treat it as a slightly > special case of direct I/o that is allowed to IS_SWAPFILE files. We > can safely do writeback to file backed folios under memory pressure, > so we can also go through the normal file system path. Yeah, and that gets us e.g. encrypted swap on bcachefs ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-12 16:09 ` Kent Overstreet @ 2025-03-13 12:44 ` Christoph Hellwig 0 siblings, 0 replies; 45+ messages in thread From: Christoph Hellwig @ 2025-03-13 12:44 UTC (permalink / raw) To: Kent Overstreet Cc: Christoph Hellwig, Mikulas Patocka, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Wed, Mar 12, 2025 at 12:09:41PM -0400, Kent Overstreet wrote: > > The right way ahead for swap is to literally just treat it as a slightly > > special case of direct I/o that is allowed to IS_SWAPFILE files. We > > can safely do writeback to file backed folios under memory pressure, > > so we can also go through the normal file system path. > > Yeah, and that gets us e.g. encrypted swap on bcachefs You can already do this pretty easily, take a look at nfs_swap_rw. It just would be nice to lift the boilerplate code to common code and make this the default. ^ permalink raw reply [flat|nested] 45+ messages in thread
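For reference, the boilerplate Christoph is describing is an address_space_operations swap_rw hook that simply routes swap I/O through the filesystem's existing direct I/O path. The sketch below is loosely modelled on that pattern; the example_* names are hypothetical placeholders for a filesystem's own entry points, and exact hook signatures differ between kernel versions, so treat this as illustration rather than the NFS code.

    #include <linux/fs.h>
    #include <linux/uio.h>
    #include <linux/swap.h>

    /* Placeholders for the filesystem's existing direct I/O entry points. */
    static ssize_t example_direct_read(struct kiocb *iocb, struct iov_iter *iter);
    static ssize_t example_direct_write(struct kiocb *iocb, struct iov_iter *iter);

    static int example_swap_activate(struct swap_info_struct *sis,
                                     struct file *file, sector_t *span)
    {
            /* Validate the file (fully allocated, not shared) and report its span. */
            return 0;
    }

    /* Route swap I/O through the filesystem's normal direct I/O implementation. */
    static int example_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
    {
            ssize_t ret;

            if (iov_iter_rw(iter) == WRITE)
                    ret = example_direct_write(iocb, iter);
            else
                    ret = example_direct_read(iocb, iter);

            return ret < 0 ? ret : 0;
    }

    static const struct address_space_operations example_aops = {
            /* ... regular aops ... */
            .swap_activate  = example_swap_activate,
            .swap_rw        = example_swap_rw,
    };

Lifting this into common code, as suggested above, would mean a filesystem only has to point swap_rw at its direct I/O implementation instead of open-coding the wrapper.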
* Re: [PATCH] the dm-loop target 2025-03-12 14:20 ` Christoph Hellwig 2025-03-12 16:09 ` Kent Overstreet @ 2025-03-13 16:21 ` Mikulas Patocka 2025-03-13 16:33 ` Kent Overstreet 1 sibling, 1 reply; 45+ messages in thread From: Mikulas Patocka @ 2025-03-13 16:21 UTC (permalink / raw) To: Christoph Hellwig Cc: Kent Overstreet, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel > IS_SWAPFILE isn't going way, as can't allow other writers to it. > Also asides from the that the layouts are fairly complex. > > The right way ahead for swap is to literally just treat it as a slightly > special case of direct I/o that is allowed to IS_SWAPFILE files. We > can safely do writeback to file backed folios under memory pressure, > so we can also go through the normal file system path. But that is prone to a low-memory deadlock because the filesystems allocate buffer memory when they are mapping logical blocks to physical blocks. You would need to have a mempool in the filesystems, so that they can make forward progress even if there is no memory free - and that would complicate them. GFP_NOIO won't help if the memory is completely exhausted. Only a mempool would help in this situation. Mikulas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] the dm-loop target 2025-03-13 16:21 ` Mikulas Patocka @ 2025-03-13 16:33 ` Kent Overstreet 0 siblings, 0 replies; 45+ messages in thread From: Kent Overstreet @ 2025-03-13 16:33 UTC (permalink / raw) To: Mikulas Patocka Cc: Christoph Hellwig, Jens Axboe, Jooyung Han, Alasdair Kergon, Mike Snitzer, Heinz Mauelshagen, zkabelac, dm-devel, linux-block, linux-fsdevel On Thu, Mar 13, 2025 at 05:21:36PM +0100, Mikulas Patocka wrote: > > IS_SWAPFILE isn't going way, as can't allow other writers to it. > > Also asides from the that the layouts are fairly complex. > > > > The right way ahead for swap is to literally just treat it as a slightly > > special case of direct I/o that is allowed to IS_SWAPFILE files. We > > can safely do writeback to file backed folios under memory pressure, > > so we can also go through the normal file system path. > > But that is prone to low-memory-deadlock because the filesystems allocate > buffer memory when they are mapping logical blocks to physical blocks. You > would need to have a mempool in the filesystems, so that they can make > forward progress even if there is no memory free - and that would > complicate them. I can't speak for everyone else, but bcachefs has those mempools. ^ permalink raw reply [flat|nested] 45+ messages in thread
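For context on the forward-progress argument above: a mempool reserves a fixed number of objects up front, and an allocation that is allowed to reclaim waits for a reserved object to be returned rather than fail, so the mapping path can always complete even when the page allocator is empty. The mempool_* and KMEM_CACHE calls below are the real kernel API; struct example_map_state and the surrounding functions are invented for illustration.

    #include <linux/types.h>
    #include <linux/mempool.h>
    #include <linux/slab.h>

    /* Hypothetical scratch state used while resolving a logical-to-physical mapping. */
    struct example_map_state {
            u64     logical;
            u64     physical;
    };

    static struct kmem_cache *map_cache;
    static mempool_t *map_pool;

    static int example_map_pool_init(void)
    {
            map_cache = KMEM_CACHE(example_map_state, 0);
            if (!map_cache)
                    return -ENOMEM;

            /* Reserve two objects so at least one I/O can always proceed. */
            map_pool = mempool_create_slab_pool(2, map_cache);
            if (!map_pool) {
                    kmem_cache_destroy(map_cache);
                    return -ENOMEM;
            }
            return 0;
    }

    static void example_map_block(u64 logical)
    {
            /*
             * GFP_NOIO allows reclaim without recursing into the I/O path; if the
             * allocator has nothing left, mempool_alloc() waits for a reserved
             * element instead of failing, which is the forward-progress guarantee.
             */
            struct example_map_state *ms = mempool_alloc(map_pool, GFP_NOIO);

            ms->logical = logical;
            /* ... resolve ms->physical via the filesystem's extent lookup ... */

            mempool_free(ms, map_pool);
    }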