linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] block: Improve read ahead size for rotational devices
@ 2025-06-16  6:28 Damien Le Moal
  2025-06-16 21:24 ` Martin K. Petersen
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Damien Le Moal @ 2025-06-16  6:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel, linux-mm

For a device that does not advertize an optimal I/O size, the function
blk_apply_bdi_limits() defaults to an initial setting of the ra_pages
field of struct backing_dev_info to VM_READAHEAD_PAGES, that is, 128 KB.

This low I/O size value is far from being optimal for hard-disk devices:
when reading files from multiple contexts using buffered I/Os, the seek
overhead between the small read commands generated to read-ahead
multiple files will significantly limit the performance that can be
achieved.

This fact applies to all ATA devices as ATA does not define an optimal
I/O size and the SCSI SAT specification does not define a default value
to expose to the host.

Modify blk_apply_bdi_limits() to use a device max_sectors limit to
calculate the ra_pages field of struct backing_dev_info, when the device
is a rotational one (BLK_FEAT_ROTATIONAL feature is set). For a SCSI
disk, this defaults to 2560 KB, which significantly improve performance
for buffered reads. Using XFS and sequentially reading randomly selected
(large) files stored on a SATA HDD, the maximum throughput achieved with
8 readers reading files with 1MB buffered I/Os increases from 122 MB/s
to 167 MB/s (+36%). The improvement is even larger when reading files
using 128 KB buffered I/Os, with a throughput increasing from 57 MB/s to
165 MB/s (+189%).

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-settings.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index a000daafbfb4..66d402de9026 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -58,16 +58,24 @@ EXPORT_SYMBOL(blk_set_stacking_limits);
 void blk_apply_bdi_limits(struct backing_dev_info *bdi,
 		struct queue_limits *lim)
 {
+	u64 io_opt = lim->io_opt;
+
 	/*
 	 * For read-ahead of large files to be effective, we need to read ahead
-	 * at least twice the optimal I/O size.
+	 * at least twice the optimal I/O size. For rotational devices that do
+	 * not report an optimal I/O size (e.g. ATA HDDs), use the maximum I/O
+	 * size to avoid falling back to the (rather inefficient) small default
+	 * read-ahead size.
 	 *
 	 * There is no hardware limitation for the read-ahead size and the user
 	 * might have increased the read-ahead size through sysfs, so don't ever
 	 * decrease it.
 	 */
+	if (!io_opt && (lim->features & BLK_FEAT_ROTATIONAL))
+		io_opt = (u64)lim->max_sectors << SECTOR_SHIFT;
+
 	bdi->ra_pages = max3(bdi->ra_pages,
-				lim->io_opt * 2 / PAGE_SIZE,
+				io_opt * 2 >> PAGE_SHIFT,
 				VM_READAHEAD_PAGES);
 	bdi->io_pages = lim->max_sectors >> PAGE_SECTORS_SHIFT;
 }
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] block: Improve read ahead size for rotational devices
  2025-06-16  6:28 [PATCH] block: Improve read ahead size for rotational devices Damien Le Moal
@ 2025-06-16 21:24 ` Martin K. Petersen
  2025-06-16 22:45   ` Damien Le Moal
  2025-06-17  5:10   ` Christoph Hellwig
  2025-06-17  5:09 ` Christoph Hellwig
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 8+ messages in thread
From: Martin K. Petersen @ 2025-06-16 21:24 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Jens Axboe, linux-block, linux-fsdevel, linux-mm


Hi Damien!

> Modify blk_apply_bdi_limits() to use a device max_sectors limit to
> calculate the ra_pages field of struct backing_dev_info, when the
> device is a rotational one (BLK_FEAT_ROTATIONAL feature is set).

I much prefer doing it here. I don't think overriding io_opt in SCSI is
appropriate. Applications and filesystems need to be able to determine
whether a SCSI device reports an optimal I/O size or not. Overloading
the queue limit with readahead semantics does not belong in SCSI.

> For a SCSI disk, this defaults to 2560 KB, which significantly improve
> performance for buffered reads.

I believe this number came from a common RAID stripe configuration at
the time. However, it's really not a great default and has caused
problems with many devices that expect a power of two. Personally, I'd
like this default to be something like 2MB or 4MB. MD, DM, and most
hardware RAID devices report their stripe width correctly so the
existing "RAID-friendly" default really shouldn't be needed.

Anyway. That's orthogonal to this particular change...

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block: Improve read ahead size for rotational devices
  2025-06-16 21:24 ` Martin K. Petersen
@ 2025-06-16 22:45   ` Damien Le Moal
  2025-06-17  5:10   ` Christoph Hellwig
  1 sibling, 0 replies; 8+ messages in thread
From: Damien Le Moal @ 2025-06-16 22:45 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Jens Axboe, linux-block, linux-fsdevel, linux-mm

On 6/17/25 6:24 AM, Martin K. Petersen wrote:
> 
> Hi Damien!
> 
>> Modify blk_apply_bdi_limits() to use a device max_sectors limit to
>> calculate the ra_pages field of struct backing_dev_info, when the
>> device is a rotational one (BLK_FEAT_ROTATIONAL feature is set).
> 
> I much prefer doing it here. I don't think overriding io_opt in SCSI is
> appropriate. Applications and filesystems need to be able to determine
> whether a SCSI device reports an optimal I/O size or not. Overloading
> the queue limit with readahead semantics does not belong in SCSI.
> 
>> For a SCSI disk, this defaults to 2560 KB, which significantly improve
>> performance for buffered reads.
> 
> I believe this number came from a common RAID stripe configuration at
> the time. However, it's really not a great default and has caused
> problems with many devices that expect a power of two. Personally, I'd
> like this default to be something like 2MB or 4MB. MD, DM, and most
> hardware RAID devices report their stripe width correctly so the
> existing "RAID-friendly" default really shouldn't be needed.

That sounds good. Recently, I have been doing a lot of performance benchmarks
with large IOs on HDDs (2, 4 8 and 16 MB IOs). And with the improved memory
allocation these days (transparent huge pages), even a simple malloc() IO
buffer can have far less memory segments that the HBA maximum a majority of the
time. So doing such large I/Os is fairly easy and really improves HDD
performance. And in that context, the current default 1280 value for
max_sectors_kb is really a limiting factor. So I am all for increasing the
default to something like 4MB. I will send a patch.

> Anyway. That's orthogonal to this particular change...
> 
> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

Thanks.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block: Improve read ahead size for rotational devices
  2025-06-16  6:28 [PATCH] block: Improve read ahead size for rotational devices Damien Le Moal
  2025-06-16 21:24 ` Martin K. Petersen
@ 2025-06-17  5:09 ` Christoph Hellwig
  2025-06-17  6:58 ` John Garry
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2025-06-17  5:09 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Jens Axboe, linux-block, linux-fsdevel, linux-mm

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block: Improve read ahead size for rotational devices
  2025-06-16 21:24 ` Martin K. Petersen
  2025-06-16 22:45   ` Damien Le Moal
@ 2025-06-17  5:10   ` Christoph Hellwig
  1 sibling, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2025-06-17  5:10 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Damien Le Moal, Jens Axboe, linux-block, linux-fsdevel, linux-mm

On Mon, Jun 16, 2025 at 05:24:43PM -0400, Martin K. Petersen wrote:
> > For a SCSI disk, this defaults to 2560 KB, which significantly improve
> > performance for buffered reads.
> 
> I believe this number came from a common RAID stripe configuration at
> the time. However, it's really not a great default and has caused
> problems with many devices that expect a power of two. Personally, I'd
> like this default to be something like 2MB or 4MB. MD, DM, and most
> hardware RAID devices report their stripe width correctly so the
> existing "RAID-friendly" default really shouldn't be needed.

Yeah, that number always felt weird.  I'm all for using a more
round value here.  But as you said, separate isue.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block: Improve read ahead size for rotational devices
  2025-06-16  6:28 [PATCH] block: Improve read ahead size for rotational devices Damien Le Moal
  2025-06-16 21:24 ` Martin K. Petersen
  2025-06-17  5:09 ` Christoph Hellwig
@ 2025-06-17  6:58 ` John Garry
  2025-07-29  7:58 ` Damien Le Moal
  2025-07-29 12:27 ` Jens Axboe
  4 siblings, 0 replies; 8+ messages in thread
From: John Garry @ 2025-06-17  6:58 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe, linux-block, linux-fsdevel, linux-mm

On 16/06/2025 07:28, Damien Le Moal wrote:
> For a device that does not advertize an optimal I/O size, the function
> blk_apply_bdi_limits() defaults to an initial setting of the ra_pages
> field of struct backing_dev_info to VM_READAHEAD_PAGES, that is, 128 KB.
> 
> This low I/O size value is far from being optimal for hard-disk devices:
> when reading files from multiple contexts using buffered I/Os, the seek
> overhead between the small read commands generated to read-ahead
> multiple files will significantly limit the performance that can be
> achieved.
> 
> This fact applies to all ATA devices as ATA does not define an optimal
> I/O size and the SCSI SAT specification does not define a default value
> to expose to the host.
> 
> Modify blk_apply_bdi_limits() to use a device max_sectors limit to
> calculate the ra_pages field of struct backing_dev_info, when the device
> is a rotational one (BLK_FEAT_ROTATIONAL feature is set). For a SCSI
> disk, this defaults to 2560 KB, which significantly improve performance
> for buffered reads. Using XFS and sequentially reading randomly selected
> (large) files stored on a SATA HDD, the maximum throughput achieved with
> 8 readers reading files with 1MB buffered I/Os increases from 122 MB/s
> to 167 MB/s (+36%). The improvement is even larger when reading files
> using 128 KB buffered I/Os, with a throughput increasing from 57 MB/s to
> 165 MB/s (+189%).
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-settings.c | 12 ++++++++++--
>   1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index a000daafbfb4..66d402de9026 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -58,16 +58,24 @@ EXPORT_SYMBOL(blk_set_stacking_limits);
>   void blk_apply_bdi_limits(struct backing_dev_info *bdi,
>   		struct queue_limits *lim)
>   {
> +	u64 io_opt = lim->io_opt;
> +
>   	/*
>   	 * For read-ahead of large files to be effective, we need to read ahead
> -	 * at least twice the optimal I/O size.
> +	 * at least twice the optimal I/O size. For rotational devices that do
> +	 * not report an optimal I/O size (e.g. ATA HDDs), use the maximum I/O
> +	 * size to avoid falling back to the (rather inefficient) small default
> +	 * read-ahead size.
>   	 *
>   	 * There is no hardware limitation for the read-ahead size and the user
>   	 * might have increased the read-ahead size through sysfs, so don't ever
>   	 * decrease it.
>   	 */
> +	if (!io_opt && (lim->features & BLK_FEAT_ROTATIONAL))
> +		io_opt = (u64)lim->max_sectors << SECTOR_SHIFT;

I'm still not sure it's even possible that lim->max_sectors << 
SECTOR_SHIFT overflows an unsigned int.

Anyway, FWIW:

Reviewed-by: John Garry <john.g.garry@oracle.com>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block: Improve read ahead size for rotational devices
  2025-06-16  6:28 [PATCH] block: Improve read ahead size for rotational devices Damien Le Moal
                   ` (2 preceding siblings ...)
  2025-06-17  6:58 ` John Garry
@ 2025-07-29  7:58 ` Damien Le Moal
  2025-07-29 12:27 ` Jens Axboe
  4 siblings, 0 replies; 8+ messages in thread
From: Damien Le Moal @ 2025-07-29  7:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel, linux-mm

On 6/16/25 15:28, Damien Le Moal wrote:
> For a device that does not advertize an optimal I/O size, the function
> blk_apply_bdi_limits() defaults to an initial setting of the ra_pages
> field of struct backing_dev_info to VM_READAHEAD_PAGES, that is, 128 KB.
> 
> This low I/O size value is far from being optimal for hard-disk devices:
> when reading files from multiple contexts using buffered I/Os, the seek
> overhead between the small read commands generated to read-ahead
> multiple files will significantly limit the performance that can be
> achieved.
> 
> This fact applies to all ATA devices as ATA does not define an optimal
> I/O size and the SCSI SAT specification does not define a default value
> to expose to the host.
> 
> Modify blk_apply_bdi_limits() to use a device max_sectors limit to
> calculate the ra_pages field of struct backing_dev_info, when the device
> is a rotational one (BLK_FEAT_ROTATIONAL feature is set). For a SCSI
> disk, this defaults to 2560 KB, which significantly improve performance
> for buffered reads. Using XFS and sequentially reading randomly selected
> (large) files stored on a SATA HDD, the maximum throughput achieved with
> 8 readers reading files with 1MB buffered I/Os increases from 122 MB/s
> to 167 MB/s (+36%). The improvement is even larger when reading files
> using 128 KB buffered I/Os, with a throughput increasing from 57 MB/s to
> 165 MB/s (+189%).
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>

Jens,

Ping ? This is reviewed. Did it got lost ? Or will it be part of a different PR ?

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block: Improve read ahead size for rotational devices
  2025-06-16  6:28 [PATCH] block: Improve read ahead size for rotational devices Damien Le Moal
                   ` (3 preceding siblings ...)
  2025-07-29  7:58 ` Damien Le Moal
@ 2025-07-29 12:27 ` Jens Axboe
  4 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2025-07-29 12:27 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-mm, Damien Le Moal


On Mon, 16 Jun 2025 15:28:56 +0900, Damien Le Moal wrote:
> For a device that does not advertize an optimal I/O size, the function
> blk_apply_bdi_limits() defaults to an initial setting of the ra_pages
> field of struct backing_dev_info to VM_READAHEAD_PAGES, that is, 128 KB.
> 
> This low I/O size value is far from being optimal for hard-disk devices:
> when reading files from multiple contexts using buffered I/Os, the seek
> overhead between the small read commands generated to read-ahead
> multiple files will significantly limit the performance that can be
> achieved.
> 
> [...]

Applied, thanks!

[1/1] block: Improve read ahead size for rotational devices
      (no commit info)

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-07-29 12:27 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-16  6:28 [PATCH] block: Improve read ahead size for rotational devices Damien Le Moal
2025-06-16 21:24 ` Martin K. Petersen
2025-06-16 22:45   ` Damien Le Moal
2025-06-17  5:10   ` Christoph Hellwig
2025-06-17  5:09 ` Christoph Hellwig
2025-06-17  6:58 ` John Garry
2025-07-29  7:58 ` Damien Le Moal
2025-07-29 12:27 ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).