* Re: Improper io_opt setting for md raid5 [not found] <ywsfp3lqnijgig6yrlv2ztxram6ohf5z4yfeebswjkvp2dzisd@f5ikoyo3sfq5> @ 2025-07-27 10:50 ` Csordás Hunor 2025-07-28 0:39 ` Damien Le Moal 2025-07-29 4:44 ` Martin K. Petersen 0 siblings, 2 replies; 20+ messages in thread From: Csordás Hunor @ 2025-07-27 10:50 UTC (permalink / raw) To: Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, Damien Le Moal Adding the SCSI maintainers because I believe the culprit is in drivers/scsi/sd.c, and Damien Le Moal because he has a pending patch modifying the relevant part and he might be interested in the implications. On 7/15/2025 5:56 PM, Coly Li wrote: > Let me restate the problem I encountered. > 1, There is an 8-disk raid5 with 64K chunk size on my machine, I observe > /sys/block/md0/queue/optimal_io_size is a very large value, which isn’t > a reasonable size IMHO. I have come across the same problem after moving all 8 disks of a RAID6 md array from two separate SATA controllers to an mpt3sas device. In my case, the readahead on the array became almost 4 GB: # grep ^ /sys/block/{sda,md_helium}/queue/{optimal_io_size,read_ahead_kb} /sys/block/sda/queue/optimal_io_size:16773120 /sys/block/sda/queue/read_ahead_kb:32760 /sys/block/md_helium/queue/optimal_io_size:4293918720 /sys/block/md_helium/queue/read_ahead_kb:4192256 Note: the readahead is supposed to be twice the optimal I/O size (after a unit conversion). On the md array it isn't because of an overflow in blk_apply_bdi_limits. This overflow is avoidable but basically irrelevant; however, it nicely highlights the fact that io_opt should really never get this large. > 2, It was from drivers/scsi/mpt3sas/mpt3sas_scsih.c, > 11939 static const struct scsi_host_template mpt3sas_driver_template = { ... > 11960 .max_sectors = 32767, ... > 11969 }; > at line 11960, max_sectors of mpt3sas driver is defined as 32767.
> > Then in drivers/scsi/scsi_transport_sas.c, at line 241 inside sas_host_setup(), > shost->opt_sectors is assigned 32767 from the following code, > 240 if (dma_dev->dma_mask) { > 241 shost->opt_sectors = min_t(unsigned int, shost->max_sectors, > 242 dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT); > 243 } > > Then in drivers/scsi/sd.c, inside sd_revalidate_disk() from the following code, > 3785 /* > 3786 * Limit default to SCSI host optimal sector limit if set. There may be > 3787 * an impact on performance for when the size of a request exceeds this > 3788 * host limit. > 3789 */ > 3790 lim.io_opt = sdp->host->opt_sectors << SECTOR_SHIFT; > 3791 if (sd_validate_opt_xfer_size(sdkp, dev_max)) { > 3792 lim.io_opt = min_not_zero(lim.io_opt, > 3793 logical_to_bytes(sdp, sdkp->opt_xfer_blocks)); > 3794 } > > lim.io_opt of all my sata disks attached to the mpt3sas HBA is 32767 sectors, > because of the above code block. > > Then when my raid5 array sets its queue limits, because its io_opt is 64KiB*7, > and the raid component sata hard drive has an io_opt of 32767 sectors, by > calculation in block/blk-settings.c:blk_stack_limits() at line 753, > 753 t->io_opt = lcm_not_zero(t->io_opt, b->io_opt); > the calculated opt_io_size of my raid5 array is more than 1GiB. It is too large. > > I know the purpose of lcm_not_zero() is to get an optimized io size for both > the raid device and the underlying component devices, but the resulting io_opt is bigger > than 1 GiB, and that's too big. > > For me, I just feel uncomfortable about using max_sectors as opt_sectors in > sas_host_setup(), but I don't know a better way to improve it. Currently I just > modified the mpt3sas_driver_template's max_sectors from 32767 to 64, and observed > 5~10% sequential write performance improvement (direct io) for my raid5 devices > with fio. In my case, the impact was more noticeable.
The system seemed to work surprisingly fine under light loads, but an increased number of parallel I/O operations completely tanked its performance until I set the readaheads to their expected values and gave the system some time to recover. I came to the same conclusion as Coly Li: io_opt ultimately gets populated from shost->max_sectors, which (in the case of mpt3sas and several other SCSI controllers) contains a value which is both: - unnecessarily large for this purpose and, more importantly, - not a nice number without any large odd divisors, as blk_stack_limits clearly expects. Populating io_opt from shost->max_sectors happens via shost->opt_sectors. This variable was introduced in commits 608128d391fa ("scsi: sd: allow max_sectors be capped at DMA optimal size limit") and 4cbfca5f7750 ("scsi: scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit"). Despite the (in hindsight perhaps unfortunate) name, it wasn't used to set io_opt. It was optimal in a different sense: it was used as a (user-overridable) upper limit to max_sectors, constraining the size of requests to play nicely with IOMMU which might get slow with large mappings. Commit 608128d391fa even mentions io_opt: It could be considered to have request queues io_opt value initially set at Scsi_Host.opt_sectors in __scsi_init_queue(), but that is not really the purpose of io_opt. The last part is correct. shost->opt_sectors is an _upper_ bound on the size of requests, while io_opt is used both as a sort of _lower_ bound (in the form of readahead), and as a sort of indivisible "block size" for I/O (by blk_stack_limits). These two existing purposes may or may not already be too much for a single variable; adding a third one clearly doesn't work well. It was commit a23634644afc ("block: take io_opt and io_min into account for max_sectors") which started setting io_opt from shost->opt_sectors. 
It did so to stop abusing max_user_sectors to set max_sectors from shost->opt_sectors, but it ended up misusing another variable for this purpose -- perhaps due to inadvertently conflating the two "optimal" transfer sizes, which are optimal in two very different contexts. Interestingly, while I've verified that the increased values for io_opt and readahead on the actual disks definitely come from this commit (a23634644afc), the io_opt and readahead of the md array are unaffected until commit 9c0ba14828d6 ("blk-settings: round down io_opt to physical_block_size") due to a weird coincidence. This commit rounds io_opt down to the physical block size in blk_validate_limits. Without this commit, io_opt for the disks is 16776704, which looks even worse at first glance (512 * 32767 instead of 4096 * 4095). However, this ends up overflowing in a funny way when combined with the fact that blk_stack_limits (and thus lcm_not_zero) is called once per component device: u32 t = 3145728; // 3 MB, the optimal I/O size for the array u32 b = 16776704; // the (incorrect) optimal I/O size of the disks u32 x = lcm(t, b); // x == (u32)103076069376 == 4291821568 u32 y = lcm(x, b); // y == (u32)140630117318656 == t Repeat for an even number of component devices to get the right answer from the wrong inputs by an incorrect method. I'm sure the issue can be reproduced before commit 9c0ba14828d6 (although I haven't actually tried -- if I had to, I'd start with an array with an odd number of component devices), but at the same time, the issue may still be present and hidden on some systems even after that commit (for example, the rounding does nothing if the physical block size is 512). This might help a little bit to explain why the problem doesn't seem more widespread. > So there should be something to fix. Can you take a look, or give me some hint > to fix? > > Thanks in advance. > > Coly Li I would have loved to finish with a patch here but I'm not sure what the correct fix is.
shost->opt_sectors was clearly added for a reason and it should reach max_sectors in struct queue_limits in some way. It probably isn't included in max_hw_sectors because it's meant to be overridable. Apparently just setting max_sectors causes problems, and so does setting max_sectors and max_user_sectors. I don't know how to fix this correctly without introducing a new variable to struct queue_limits, but maybe people more familiar with the code can think of a less intrusive way. Hunor Csordás ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-27 10:50 ` Improper io_opt setting for md raid5 Csordás Hunor @ 2025-07-28 0:39 ` Damien Le Moal 2025-07-28 0:55 ` Yu Kuai 2025-07-29 4:44 ` Martin K. Petersen 1 sibling, 1 reply; 20+ messages in thread From: Damien Le Moal @ 2025-07-28 0:39 UTC (permalink / raw) To: Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi On 7/27/25 7:50 PM, Csordás Hunor wrote: > Adding the SCSI maintainers because I believe the culprit is in > drivers/scsi/sd.c, and Damien Le Moal because he has a pending > patch modifying the relevant part and he might be interested in the > implications. > > On 7/15/2025 5:56 PM, Coly Li wrote: >> Let me restate the problem I encountered. >> 1, There is an 8-disk raid5 with 64K chunk size on my machine, I observe >> /sys/block/md0/queue/optimal_io_size is a very large value, which isn’t >> a reasonable size IMHO. > > I have come across the same problem after moving all 8 disks of a RAID6 > md array from two separate SATA controllers to an mpt3sas device. In my > case, the readahead on the array became almost 4 GB: > > # grep ^ /sys/block/{sda,md_helium}/queue/{optimal_io_size,read_ahead_kb} > /sys/block/sda/queue/optimal_io_size:16773120 > /sys/block/sda/queue/read_ahead_kb:32760 For a SATA drive connected to an mpt3sas HBA, I see the same. But note that the optimal_io_size here is completely made up by the HBA/driver because ATA does not advertise/define an optimal IO size. For SATA drives connected to AHCI SATA ports, I see: /sys/block/sda/queue/optimal_io_size:0 /sys/block/sda/queue/read_ahead_kb:8192 read_ahead_kb in this case is twice max_sectors_kb (which with my patch is now 4MB). > /sys/block/md_helium/queue/optimal_io_size:4293918720 > /sys/block/md_helium/queue/read_ahead_kb:4192256 > > Note: the readahead is supposed to be twice the optimal I/O size (after > a unit conversion).
On the md array it isn't because of an overflow in > blk_apply_bdi_limits. This overflow is avoidable but basically > irrelevant; however, it nicely highlights the fact that io_opt should > really never get this large. Only if io_opt is non-zero. If io_opt is zero, then read_ahead_kb by default is twice max_sectors_kb. > >> 2, It was from drivers/scsi/mpt3sas/mpt3sas_scsih.c, >> 11939 static const struct scsi_host_template mpt3sas_driver_template = { > ... >> 11960 .max_sectors = 32767, > ... >> 11969 }; >> at line 11960, max_sectors of mpt3sas driver is defined as 32767. This is another completely made-up value since SCSI allows command transfer lengths up to 4GB (32-bit value in bytes). Even ATA drives allow up to 65536 logical sectors per command (so 65536 * 4K = 256MB per command for 4K logical sector drives). Not sure why it is set to this completely arbitrary value. >> Then in drivers/scsi/scsi_transport_sas.c, at line 241 inside sas_host_setup(), >> shost->opt_sectors is assigned 32767 from the following code, >> 240 if (dma_dev->dma_mask) { >> 241 shost->opt_sectors = min_t(unsigned int, shost->max_sectors, >> 242 dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT); >> 243 } >> >> Then in drivers/scsi/sd.c, inside sd_revalidate_disk() from the following code, >> 3785 /* >> 3786 * Limit default to SCSI host optimal sector limit if set. There may be >> 3787 * an impact on performance for when the size of a request exceeds this >> 3788 * host limit. >> 3789 */ >> 3790 lim.io_opt = sdp->host->opt_sectors << SECTOR_SHIFT; >> 3791 if (sd_validate_opt_xfer_size(sdkp, dev_max)) { >> 3792 lim.io_opt = min_not_zero(lim.io_opt, >> 3793 logical_to_bytes(sdp, sdkp->opt_xfer_blocks)); >> 3794 } >> >> lim.io_opt of all my sata disks attached to the mpt3sas HBA is 32767 sectors, >> because of the above code block.
>> >> Then when my raid5 array sets its queue limits, because its io_opt is 64KiB*7, >> and the raid component sata hard drive has an io_opt of 32767 sectors, by >> calculation in block/blk-settings.c:blk_stack_limits() at line 753, >> 753 t->io_opt = lcm_not_zero(t->io_opt, b->io_opt); >> the calculated opt_io_size of my raid5 array is more than 1GiB. It is too large. md setting its io_opt to 64K*number of drives in the array is strange... It does not have to be that large since io_opt is an upper bound and not an "issue that IO size for optimal performance". io_opt is simply a limit saying: if you exceed that IO size, performance may suffer. So a default of stride size x number of drives for the io_opt may be OK, but that should be bound to some reasonable value. Furthermore, this is likely suboptimal. I would think that setting the md array io_opt initially to min(all drives io_opt) x number of drives would be a better default. >> I know the purpose of lcm_not_zero() is to get an optimized io size for both >> the raid device and the underlying component devices, but the resulting io_opt is bigger >> than 1 GiB, and that's too big. >> >> For me, I just feel uncomfortable about using max_sectors as opt_sectors in >> sas_host_setup(), but I don't know a better way to improve it. Currently I just >> modified the mpt3sas_driver_template's max_sectors from 32767 to 64, and observed >> 5~10% sequential write performance improvement (direct io) for my raid5 devices >> with fio. > In my case, the impact was more noticeable. The system seemed to work > surprisingly fine under light loads, but an increased number of > parallel I/O operations completely tanked its performance until I > set the readaheads to their expected values and gave the system some > time to recover.
> > I came to the same conclusion as Coly Li: io_opt ultimately gets > populated from shost->max_sectors, which (in the case of mpt3sas and > several other SCSI controllers) contains a value which is both: > - unnecessarily large for this purpose and, more importantly, > - not a nice number without any large odd divisors, as blk_stack_limits > clearly expects. Sounds to me like this is an md driver issue: md should tweak the limits automatically calculated by stacking the limits of the array members. > Populating io_opt from shost->max_sectors happens via > shost->opt_sectors. This variable was introduced in commits > 608128d391fa ("scsi: sd: allow max_sectors be capped at DMA optimal > size limit") and 4cbfca5f7750 ("scsi: scsi_transport_sas: cap shost > opt_sectors according to DMA optimal limit"). Despite the (in hindsight > perhaps unfortunate) name, it wasn't used to set io_opt. It was optimal > in a different sense: it was used as a (user-overridable) upper limit > to max_sectors, constraining the size of requests to play nicely with > IOMMU which might get slow with large mappings. > > Commit 608128d391fa even mentions io_opt: > > It could be considered to have request queues io_opt value initially > set at Scsi_Host.opt_sectors in __scsi_init_queue(), but that is not > really the purpose of io_opt. > > The last part is correct. shost->opt_sectors is an _upper_ bound on the > size of requests, while io_opt is used both as a sort of _lower_ bound > (in the form of readahead), and as a sort of indivisible "block size" > for I/O (by blk_stack_limits). These two existing purposes may or may > not already be too much for a single variable; adding a third one > clearly doesn't work well. > > It was commit a23634644afc ("block: take io_opt and io_min into account > for max_sectors") which started setting io_opt from shost->opt_sectors.
> It did so to stop abusing max_user_sectors to set max_sectors from > shost->opt_sectors, but it ended up misusing another variable for this > purpose -- perhaps due to inadvertently conflating the two "optimal" > transfer sizes, which are optimal in two very different contexts. > > Interestingly, while I've verified that the increased values for io_opt > and readahead on the actual disks definitely comes from this commit > (a23634644afc), the io_opt and readahead of the md array are unaffected > until commit 9c0ba14828d6 ("blk-settings: round down io_opt to > physical_block_size") due to a weird coincidence. This commit rounds > io_opt down to the physical block size in blk_validate_limits. Without > this commit, io_opt for the disks is 16776704, which looks even worse > at first glance (512 * 32767 instead of 4096 * 4095). However, this > ends up overflowing in a funny way when combined with the fact that > blk_stack_limits (and thus lcm_not_zero) is called once per component > device: > > u32 t = 3145728; // 3 MB, the optimal I/O size for the array > u32 b = 16776704; // the (incorrect) optimal I/O size of the disks It is not incorrect. It is a made-up value. For a SATA drive, reporting 0 would be the correct thing to do. > u32 x = lcm(t, b); // x == (u32)103076069376 == 4291821568 > u32 y = lcm(x, b); // y == (u32)140630117318656 == t > > Repeat for an even number of component devices to get the right answer > from the wrong inputs by an incorrect method. > > I'm sure the issue can be reproduced before commit 9c0ba14828d6 > (although I haven't actually tried -- if I had to, I'd start with an > array with an odd number of component devices), but at the same time, > the issue may be still present and hidden on some systems even after > that commit (for example, the rounding does nothing if the physical > block size is 512). This might help a little bit to explain why the > problem doesn't seem more widespread. > >> So there should be something to fix. 
Can you take a look, or give me some hint >> to fix? >> >> Thanks in advance. >> >> Coly Li > > I would have loved to finish with a patch here but I'm not sure what > the correct fix is. shost->opt_sectors was clearly added for a reason > and it should reach max_sectors in struct queue_limits in some way. It > probably isn't included in max_hw_sectors because it's meant to be > overridable. Apparently just setting max_sectors causes problems, and > so does setting max_sectors and max_user_sectors. I don't know how to > to fix this correctly without introducing a new variable to struct > queue_limits but maybe people more familiar with the code can think of > a less intrusive way. > > Hunor Csordás > -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 0:39 ` Damien Le Moal @ 2025-07-28 0:55 ` Yu Kuai 2025-07-28 2:41 ` Damien Le Moal 0 siblings, 1 reply; 20+ messages in thread From: Yu Kuai @ 2025-07-28 0:55 UTC (permalink / raw) To: Damien Le Moal, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) Hi, 在 2025/07/28 8:39, Damien Le Moal 写道: > md setting its io_opt to 64K*number of drives in the array is strange... It > does not have to be that large since io_opt is an upper bound and not an "issue > that IO size for optimal performance". io_opt is simply a limit saying: if you > exceed that IO size, performance may suffer. > At least from the Documentation, for raid arrays, a multiple of io_opt is the preferred io size for optimal io performance, and for raid5, this is chunksize * data disks. > So a default of stride size x number of drives for the io_opt may be OK, but > that should be bound to some reasonable value. Furthermore, this is likely > suboptimal. I would think that setting the md array io_opt initially to > min(all drives io_opt) x number of drives would be a better default. For raid5, this is not ok; the value has to be chunksize * data disks, regardless of the io_opt of the member disks, otherwise raid5 has to issue additional IO to other disks to build xor data. For example: - writing an aligned chunksize to one disk actually means reading a chunksize of old xor data, then writing a chunksize of data and a chunksize of new xor data. - writing an aligned chunksize * data disks, the new xor data can be built directly without reading the old xor data. Thanks, Kuai ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 0:55 ` Yu Kuai @ 2025-07-28 2:41 ` Damien Le Moal 2025-07-28 3:08 ` Yu Kuai 0 siblings, 1 reply; 20+ messages in thread From: Damien Le Moal @ 2025-07-28 2:41 UTC (permalink / raw) To: Yu Kuai, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) On 7/28/25 9:55 AM, Yu Kuai wrote: > Hi, > > 在 2025/07/28 8:39, Damien Le Moal 写道: >> md setting its io_opt to 64K*number of drives in the array is strange... It >> does not have to be that large since io_opt is an upper bound and not a "issue >> that IO size for optimal performance". io_opt is simply a limit saying: if you >> exceed that IO size, performance may suffer. >> > > At least from Documentation, for raid arrays, multiple of io_opt is the > prefereed io size to the optimal io performance, and for raid5, this is > chunksize * data disks. > >> So a default of stride size x number of drives for the io_opt may be OK, but >> that should be bound to some reasonable value. Furthermore, this is likely >> suboptimal. I woulld think that setting the md array io_opt initially to >> min(all drives io_opt) x number of drives would be a better default. > > For raid5, this is not ok, the value have to be chunksize * data disks, > regardless of io_opt from member disks, otherwise raid5 have to issue > additional IO from other disks to build xor data. > > For example: > > - write aligned chunksize to one disk, actually means read chunksize > old xor data,then write chunksize data and chunksize new xor data. > - write aligned chunksize * data disks, new xor data can be build > directly without reading old xor data. I understand all of that. But you missed my point: io_opt simply indicates an upper bound for an IO size. If exceeded, performance may be degraded. This has *nothing* to do with the io granularity, which for a RAID array should ideally be equal to stride size x number of data disks. This is the confusion here. 
md setting io_opt to stride x number of disks in the array is simply not what io_opt is supposed to indicate. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 2:41 ` Damien Le Moal @ 2025-07-28 3:08 ` Yu Kuai 2025-07-28 3:49 ` Damien Le Moal 0 siblings, 1 reply; 20+ messages in thread From: Yu Kuai @ 2025-07-28 3:08 UTC (permalink / raw) To: Damien Le Moal, Yu Kuai, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) Hi, 在 2025/07/28 10:41, Damien Le Moal 写道: > On 7/28/25 9:55 AM, Yu Kuai wrote: >> Hi, >> >> 在 2025/07/28 8:39, Damien Le Moal 写道: >>> md setting its io_opt to 64K*number of drives in the array is strange... It >>> does not have to be that large since io_opt is an upper bound and not a "issue >>> that IO size for optimal performance". io_opt is simply a limit saying: if you >>> exceed that IO size, performance may suffer. >>> >> >> At least from Documentation, for raid arrays, multiple of io_opt is the >> prefereed io size to the optimal io performance, and for raid5, this is >> chunksize * data disks. >> >>> So a default of stride size x number of drives for the io_opt may be OK, but >>> that should be bound to some reasonable value. Furthermore, this is likely >>> suboptimal. I woulld think that setting the md array io_opt initially to >>> min(all drives io_opt) x number of drives would be a better default. >> >> For raid5, this is not ok, the value have to be chunksize * data disks, >> regardless of io_opt from member disks, otherwise raid5 have to issue >> additional IO from other disks to build xor data. >> >> For example: >> >> - write aligned chunksize to one disk, actually means read chunksize >> old xor data,then write chunksize data and chunksize new xor data. >> - write aligned chunksize * data disks, new xor data can be build >> directly without reading old xor data. > > I understand all of that. But you missed my point: io_opt simply indicates an > upper bound for an IO size. If exceeded, performance may be degraded. 
This has > *nothing* to do with the io granularity, which for a RAID array should ideally > be equal to stride size x number of data disks. > > This is the confusion here. md setting io_opt to stride x number of disks in > the array is simply not what io_opt is supposed to indicate. ok, can I ask where this upper bound for IO size comes from? With git log, starting from commit 7e5f5fb09e6f ("block: Update topology documentation"), the documentation contains a special explanation for raid arrays, and the optimal_io_size says: For RAID arrays it is usually the stripe width or the internal track size. A properly aligned multiple of optimal_io_size is the preferred request size for workloads where sustained throughput is desired. And this explanation is exactly what raid5 does: it's important that the io size is an aligned multiple of io_opt. Thanks, Kuai > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 3:08 ` Yu Kuai @ 2025-07-28 3:49 ` Damien Le Moal 2025-07-28 7:14 ` Yu Kuai 2025-07-29 3:49 ` Martin K. Petersen 0 siblings, 2 replies; 20+ messages in thread From: Damien Le Moal @ 2025-07-28 3:49 UTC (permalink / raw) To: Yu Kuai, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) On 7/28/25 12:08 PM, Yu Kuai wrote: > Hi, > > 在 2025/07/28 10:41, Damien Le Moal 写道: >> On 7/28/25 9:55 AM, Yu Kuai wrote: >>> Hi, >>> >>> 在 2025/07/28 8:39, Damien Le Moal 写道: >>>> md setting its io_opt to 64K*number of drives in the array is strange... It >>>> does not have to be that large since io_opt is an upper bound and not a "issue >>>> that IO size for optimal performance". io_opt is simply a limit saying: if you >>>> exceed that IO size, performance may suffer. >>>> >>> >>> At least from Documentation, for raid arrays, multiple of io_opt is the >>> prefereed io size to the optimal io performance, and for raid5, this is >>> chunksize * data disks. >>> >>>> So a default of stride size x number of drives for the io_opt may be OK, but >>>> that should be bound to some reasonable value. Furthermore, this is likely >>>> suboptimal. I woulld think that setting the md array io_opt initially to >>>> min(all drives io_opt) x number of drives would be a better default. >>> >>> For raid5, this is not ok, the value have to be chunksize * data disks, >>> regardless of io_opt from member disks, otherwise raid5 have to issue >>> additional IO from other disks to build xor data. >>> >>> For example: >>> >>> - write aligned chunksize to one disk, actually means read chunksize >>> old xor data,then write chunksize data and chunksize new xor data. >>> - write aligned chunksize * data disks, new xor data can be build >>> directly without reading old xor data. >> >> I understand all of that. But you missed my point: io_opt simply indicates an >> upper bound for an IO size. 
If exceeded, performance may be degraded. This has >> *nothing* to do with the io granularity, which for a RAID array should ideally >> be equal to stride size x number of data disks. >> >> This is the confusion here. md setting io_opt to stride x number of disks in >> the array is simply not what io_opt is supposed to indicate. > > ok, can I ask where this upper bound for IO size comes from? SCSI SBC specifications, Block Limits VPD page (B0h): 3 values are important in there: * OPTIMAL TRANSFER LENGTH GRANULARITY: An OPTIMAL TRANSFER LENGTH GRANULARITY field set to a non-zero value indicates the optimal transfer length granularity size in logical blocks for a single command shown in the command column of table 33. If a device server receives one of these commands with a transfer size that is not equal to a multiple of this value, then the device server may incur delays in processing the command. An OPTIMAL TRANSFER LENGTH GRANULARITY field set to 0000h indicates that the device server does not report optimal transfer length granularity. For a SCSI disk, sd.c uses this value for sdkp->min_xfer_blocks. Note that the naming here is dubious since this is not a minimum. The minimum is the logical block size. This is a "hint" for better performance. For a RAID array, this should be the stripe size of the RAID volume (stride x number of data disks). This value is used for queue->limits.io_min. * MAXIMUM TRANSFER LENGTH: A MAXIMUM TRANSFER LENGTH field set to a non-zero value indicates the maximum transfer length in logical blocks that the device server accepts for a single command shown in table 33. If a device server receives one of these commands with a transfer size greater than this value, then the device server shall terminate the command with CHECK CONDITION status with the sense key set to ILLEGAL REQUEST and the additional sense code set to the value shown in table 33.
A MAXIMUM TRANSFER LENGTH field set to 0000_0000h indicates that the device server does not report a limit on the transfer length. For a SCSI disk, sd.c uses this value for sdkp->max_xfer_blocks. This is a hard limit which will be reflected in queue->limits.max_dev_sectors (max_hw_sectors_kb in sysfs). * OPTIMAL TRANSFER LENGTH: An OPTIMAL TRANSFER LENGTH field set to a non-zero value indicates the optimal transfer size in logical blocks for a single command shown in table 33. If a device server receives one of these commands with a transfer size greater than this value, then the device server may incur delays in processing the command. An OPTIMAL TRANSFER LENGTH field set to 0000_0000h indicates that the device server does not report an optimal transfer size. For a SCSI disk, sd.c uses this value for sdkp->opt_xfer_blocks. This value is used for queue->limits.io_opt. > With git log, starting from commit 7e5f5fb09e6f ("block: Update topology > documentation"), the documentation contains a special explanation for > raid arrays, and the optimal_io_size says: > > For RAID arrays it is usually the > stripe width or the internal track size. A properly aligned > multiple of optimal_io_size is the preferred request size for > workloads where sustained throughput is desired. > > And this explanation is exactly what raid5 does: it's important that > the io size is an aligned multiple of io_opt. Looking at the sysfs doc for the above fields, they are described as follows: * /sys/block/<disk>/queue/minimum_io_size [RO] Storage devices may report a granularity or preferred minimum I/O size which is the smallest request the device can perform without incurring a performance penalty. For disk drives this is often the physical block size. For RAID arrays it is often the stripe chunk size. A properly aligned multiple of minimum_io_size is the preferred request size for workloads where a high number of I/O operations is desired.
So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY and for a RAID array, this indeed should be the stride x number of data disks. * /sys/block/<disk>/queue/max_hw_sectors_kb [RO] This is the maximum number of kilobytes supported in a single data transfer. No problem here. * /sys/block/<disk>/queue/optimal_io_size Storage devices may report an optimal I/O size, which is the device's preferred unit for sustained I/O. This is rarely reported for disk drives. For RAID arrays it is usually the stripe width or the internal track size. A properly aligned multiple of optimal_io_size is the preferred request size for workloads where sustained throughput is desired. If no optimal I/O size is reported this file contains 0. Well, I find this definition not correct *at all*. This is repeating the definition of minimum_io_size (limits->io_min) and completely disregards the eventual optimal_io_size limit of the drives in the array. For a raid array, this value should obviously be a multiple of minimum_io_size (the array stripe size), but it can be much larger, since this should be an upper bound for IO size. read_ahead_kb being set using this value is thus not correct, I think. read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 3:49 ` Damien Le Moal @ 2025-07-28 7:14 ` Yu Kuai 2025-07-28 7:44 ` Damien Le Moal 2025-07-29 3:53 ` Martin K. Petersen 2025-07-29 3:49 ` Martin K. Petersen 1 sibling, 2 replies; 20+ messages in thread From: Yu Kuai @ 2025-07-28 7:14 UTC (permalink / raw) To: Damien Le Moal, Yu Kuai, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) Hi, 在 2025/07/28 11:49, Damien Le Moal 写道: > On 7/28/25 12:08 PM, Yu Kuai wrote: >> Hi, >> >> 在 2025/07/28 10:41, Damien Le Moal 写道: >>> On 7/28/25 9:55 AM, Yu Kuai wrote: >>>> Hi, >>>> >>>> 在 2025/07/28 8:39, Damien Le Moal 写道: >>>>> md setting its io_opt to 64K*number of drives in the array is strange... It >>>>> does not have to be that large since io_opt is an upper bound and not a "issue >>>>> that IO size for optimal performance". io_opt is simply a limit saying: if you >>>>> exceed that IO size, performance may suffer. >>>>> >>>> >>>> At least from Documentation, for raid arrays, multiple of io_opt is the >>>> prefereed io size to the optimal io performance, and for raid5, this is >>>> chunksize * data disks. >>>> >>>>> So a default of stride size x number of drives for the io_opt may be OK, but >>>>> that should be bound to some reasonable value. Furthermore, this is likely >>>>> suboptimal. I woulld think that setting the md array io_opt initially to >>>>> min(all drives io_opt) x number of drives would be a better default. >>>> >>>> For raid5, this is not ok, the value have to be chunksize * data disks, >>>> regardless of io_opt from member disks, otherwise raid5 have to issue >>>> additional IO from other disks to build xor data. >>>> >>>> For example: >>>> >>>> - write aligned chunksize to one disk, actually means read chunksize >>>> old xor data,then write chunksize data and chunksize new xor data. >>>> - write aligned chunksize * data disks, new xor data can be build >>>> directly without reading old xor data. 
>>> >>> I understand all of that. But you missed my point: io_opt simply indicates an >>> upper bound for an IO size. If exceeded, performance may be degraded. This has >>> *nothing* to do with the io granularity, which for a RAID array should ideally >>> be equal to stride size x number of data disks. >>> >>> This is the confusion here. md setting io_opt to stride x number of disks in >>> the array is simply not what io_opt is supposed to indicate. >> >> ok, can I ask where is this upper bound for IO size from? > > SCSI SBC specifications, Block Limits VPD page (B0h): > > 3 values are important in there: > > * OPTIMAL TRANSFER LENGTH GRANULARITY: > > An OPTIMAL TRANSFER LENGTH GRANULARITY field set to a non-zero value indicates > the optimal transfer length granularity size in logical blocks for a single > command shown in the command column of table 33. If a device server receives > one of these commands with a transfer size that is not equal to a multiple of > this value, then the device server may incur delays in processing the command. > An OPTIMAL TRANSFER LENGTH GRANULARITY field set to 0000h indicates that the > device server does not report optimal transfer length granularity. > > For a SCSI disk, sd.c uses this value for sdkp->min_xfer_blocks. Note that the > naming here is dubious since this is not a minimum. The minimum is the logical > block size. This is a "hint" for better performance. For a RAID area, this > should be the stripe size of the RAID volume (stride x number of data disks). > This value is used for queue->limits.io_min. > > * MAXIMUM TRANSFER LENGTH: > > A MAXIMUM TRANSFER LENGTH field set to a non-zero value indicates the maximum > transfer length in logical blocks that the device server accepts for a single > command shown in table 33. 
If a device server receives one of these commands > with a transfer size greater than this value, then the device server shall > terminate the command with CHECK CONDITION status with the sense key set to > ILLEGAL REQUEST and the additional sense code set to the value shown in table > 33. A MAXIMUM TRANSFER LENGTH field set to 0000_0000h indicates that the device > server does not report a limit on the transfer length. > > For a SCSI disk, sd.c uses this value for sdkp->max_xfer_blocks. This is a hard > limit which will be reflected in queue->limits.max_dev_sectors > (max_hw_sectors_kb in sysfs). > > * OPTIMAL TRANSFER LENGTH: > > An OPTIMAL TRANSFER LENGTH field set to a non-zero value indicates the optimal > transfer size in logical blocks for a single command shown in table 33. If a > device server receives one of these commands with a transfer size greater than > this value, then the device server may incur delays in processing the command. > An OPTIMAL TRANSFER LENGTH field set to 0000_0000h indicates that the device > server does not report an optimal transfer size. > > For a SCSI disk, sd.c uses this value for sdkp->opt_xfer_blocks. This value is > used for queue->limit.io_opt. Thanks for the explanation. > >> With git log, start from commit 7e5f5fb09e6f ("block: Update topology >> documentation"), the documentation start contain specail explanation for >> raid array, and the optimal_io_size says: >> >> For RAID arrays it is usually the >> stripe width or the internal track size. A properly aligned >> multiple of optimal_io_size is the preferred request size for >> workloads where sustained throughput is desired. >> >> And this explanation is exactly what raid5 did, it's important that >> io size is aligned multiple of io_opt. 
> > Looking at the sysfs doc for the above fields, they are described as follows: > > * /sys/block/<disk>/queue/minimum_io_size > > [RO] Storage devices may report a granularity or preferred > minimum I/O size which is the smallest request the device can > perform without incurring a performance penalty. For disk > drives this is often the physical block size. For RAID arrays > it is often the stripe chunk size. A properly aligned multiple > of minimum_io_size is the preferred request size for workloads > where a high number of I/O operations is desired. > > So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY and for a > RAID array, this indeed should be the stride x number of data disks. Do you mean stripe here? io_min for raid array is always just one chunksize. > > * /sys/block/<disk>/queue/max_hw_sectors_kb > > [RO] This is the maximum number of kilobytes supported in a > single data transfer. > > No problem here. > > * /sys/block/<disk>/queue/optimal_io_size > > Storage devices may report an optimal I/O size, which is > the device's preferred unit for sustained I/O. This is rarely > reported for disk drives. For RAID arrays it is usually the > stripe width or the internal track size. A properly aligned > multiple of optimal_io_size is the preferred request size for > workloads where sustained throughput is desired. If no optimal > I/O size is reported this file contains 0. > > Well, I find this definition not correct *at all*. This is repeating the > definition of minimum_io_size (limits->io_min) and completely disregard the > eventual optimal_io_size limit of the drives in the array. For a raid array, > this value should obviously be a multiple of minimum_io_size (the array stripe > size), but it can be much larger, since this should be an upper bound for IO > size. read_ahead_kb being set using this value is thus not correct I think. > read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size. 
I think this is actually different from io_min, and io_opt for different levels is not the same; for raid0, raid10, raid456 (raid1 doesn't have a chunksize): - lim.io_min = mddev->chunk_sectors << 9; - lim.io_opt = lim.io_min * (number of data copies); And I think they do match the definition above, specifically: - a properly aligned multiple of io_min to *prevent a performance penalty*; - a properly aligned multiple of io_opt to *get optimal performance*, the number of data copies times the performance of a single disk; The original problem is that scsi disks report an unusual io_opt of 32767 sectors, and raid5 sets io_opt to 64k * 7 (8 disks with a 64k chunksize). The lcm_not_zero() from blk_stack_limits() ends up with a huge value: blk_stack_limits() t->io_min = max(t->io_min, b->io_min); t->io_opt = lcm_not_zero(t->io_opt, b->io_opt); > read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size. The io_opt is used in a raid array as the minimal aligned size to get optimal IO performance, not as an upper bound. In that respect, using this value for ra_pages makes sense. However, if scsi is using this value as an IO upper bound, then indeed this doesn't make sense. Thanks, Kuai > > ^ permalink raw reply [flat|nested] 20+ messages in thread
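The blow-up Kuai describes is easy to reproduce outside the kernel. A quick sketch (plain Python, using the numbers from this thread: a 448 KiB raid5 stripe stacked on a member disk reporting an io_opt of 32767 sectors):

```python
# Reproducing the io_opt blow-up from lcm_not_zero() in blk_stack_limits()
# (a sketch, not kernel code). 32767 sectors is an odd-sized limit, so it
# shares almost no factors with the 448 KiB stripe, and the lcm explodes.
from math import lcm

SECTOR = 512
raid_io_opt = 64 * 1024 * 7    # 458752 bytes: 64 KiB chunk x 7 data disks
disk_io_opt = 32767 * SECTOR   # 16776704 bytes: the max_sectors-derived limit

stacked = lcm(raid_io_opt, disk_io_opt)
print(stacked)  # 2147418112 (~2 GiB), matching the ">1 GiB" in the report
```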
* Re: Improper io_opt setting for md raid5 2025-07-28 7:14 ` Yu Kuai @ 2025-07-28 7:44 ` Damien Le Moal 2025-07-28 9:02 ` Yu Kuai ` (2 more replies) 2025-07-29 3:53 ` Martin K. Petersen 1 sibling, 3 replies; 20+ messages in thread From: Damien Le Moal @ 2025-07-28 7:44 UTC (permalink / raw) To: Yu Kuai, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) On 7/28/25 4:14 PM, Yu Kuai wrote: >>> With git log, start from commit 7e5f5fb09e6f ("block: Update topology >>> documentation"), the documentation start contain specail explanation for >>> raid array, and the optimal_io_size says: >>> >>> For RAID arrays it is usually the >>> stripe width or the internal track size. A properly aligned >>> multiple of optimal_io_size is the preferred request size for >>> workloads where sustained throughput is desired. >>> >>> And this explanation is exactly what raid5 did, it's important that >>> io size is aligned multiple of io_opt. >> >> Looking at the sysfs doc for the above fields, they are described as follows: >> >> * /sys/block/<disk>/queue/minimum_io_size >> >> [RO] Storage devices may report a granularity or preferred >> minimum I/O size which is the smallest request the device can >> perform without incurring a performance penalty. For disk >> drives this is often the physical block size. For RAID arrays >> it is often the stripe chunk size. A properly aligned multiple >> of minimum_io_size is the preferred request size for workloads >> where a high number of I/O operations is desired. >> >> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY and for a >> RAID array, this indeed should be the stride x number of data disks. > > Do you mean stripe here? io_min for raid array is always just one > chunksize. My bad, yes, that is the definition in sysfs. So io_min is the stride size, where: stride size x number of data disks == stripe_size. 
Note that the chunk_sectors limit is the *stripe* size, not the per-drive stride. Beware of the wording here to avoid confusion (this is all already super confusing!). Well, at least, that is how I interpret the io_min definition of minimum_io_size in Documentation/ABI/stable/sysfs-block. But the wording "For RAID arrays it is often the stripe chunk size." is super confusing. Not entirely sure if stride or stripe was meant here... >> * /sys/block/<disk>/queue/optimal_io_size >> >> Storage devices may report an optimal I/O size, which is >> the device's preferred unit for sustained I/O. This is rarely >> reported for disk drives. For RAID arrays it is usually the >> stripe width or the internal track size. A properly aligned >> multiple of optimal_io_size is the preferred request size for >> workloads where sustained throughput is desired. If no optimal >> I/O size is reported this file contains 0. >> >> Well, I find this definition not correct *at all*. This is repeating the >> definition of minimum_io_size (limits->io_min) and completely disregard the >> eventual optimal_io_size limit of the drives in the array. For a raid array, >> this value should obviously be a multiple of minimum_io_size (the array stripe >> size), but it can be much larger, since this should be an upper bound for IO >> size. read_ahead_kb being set using this value is thus not correct I think. >> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size. > > I think this is actually different than io_min, and io_opt for different > levels are not the same, for raid0, raid10, raid456(raid1 doesn't have > chunksize): > - lim.io_min = mddev->chunk_sectors << 9; See above. Given how confusing the definition of minimum_io_size is, not sure that is correct. This code assumes that io_min is the stripe size and not the stride size. > - lim.io_opt = lim.io_min * (number of data copies); I do not understand what you mean by "number of data copies"...
There is no data copy in a RAID 5/6 array. > And I think they do match the definition above, specifically: > - properly multiple aligned io_min to *prevent performance penalty*; Yes. > - properly multiple aligned io_opt to *get optimal performance*, the > number of data copies times the performance of a single disk; That is how this field is defined for RAID, but that is far from what it means for a single disk. It is unfortunate that it was defined like that. For a single disk, io_opt is NOT about getting optimal_performance. It is about an upper bound for the IO size to NOT get a performance penalty (e.g. due to a DMA mapping that is too large for what the IOMMU can handle). And for a RAID array, it means that we should always have io_min == io_opt but it seems that the scsi code and limit stacking code try to make this limit an upper bound on the IO size, aligned to the stripe size. > The orginal problem is that scsi disks report unusual io_opt 32767, > and raid5 set io_opt to 64k * 7(8 disks with 64k chunksise). The > lcm_not_zero() from blk_stack_limits() end up with a huge value: > > blk_stack_limits() > t->io_min = max(t->io_min, b->io_min); > t->io_opt = lcm_not_zero(t->io_opt, b->io_opt); I understand the "problem" that was stated. There is an overflow that result in a large io_opt and a ridiculously large read_ahead_kb. io_opt being large should in my opinion not be an issue in itself, since it should be an upper bound on IO size and not the stripe size (io_min indicates that). >> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size. > > The io_opt is used in raid array as minimal aligned size to get optimal > IO performance, not the upper bound. With the respect of this, use this > value for ra_pages make sense. However, if scsi is using this value as > IO upper bound, it's right this doesn't make sense. Here is your issue. 
People misunderstood optimal_io_size and used it instead of the minimal_io_size/io_min limit for the granularity/alignment of IOs. Using optimal_io_size as the "granularity" for optimal IOs that do not require read-modify-write of RAID stripes is simply wrong in my opinion. io_min/minimal_io_size is the attribute indicating that. As for read_ahead_kb, it should be bounded by io_opt (upper bound) but should be initialized to a smaller value aligned to io_min (if io_opt is unreasonably large). Given all of that and how misused io_opt seems to be, I am not sure how to fix this though. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 20+ messages in thread
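For reference, the readahead side effect reported at the top of the thread follows directly from doubling such a huge io_opt. A sketch of the arithmetic (plain Python modeling the 32-bit math in blk_apply_bdi_limits(), using the md RAID6 value from the original report):

```python
# Sketch of how a ~4 GiB io_opt corrupts the default readahead: the
# kernel derives ra_pages from io_opt * 2, and the doubling wraps a
# 32-bit value (modeled here explicitly; Python ints do not overflow).
PAGE_SIZE = 4096
io_opt = 4293918720  # bytes; the md RAID6 io_opt from the original report

ra_pages_intended = io_opt * 2 // PAGE_SIZE           # without overflow
ra_pages_actual = (io_opt * 2) % 2**32 // PAGE_SIZE   # with u32 wraparound

print(ra_pages_actual * PAGE_SIZE // 1024)  # 4192256, the read_ahead_kb seen
```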
* Re: Improper io_opt setting for md raid5 2025-07-28 7:44 ` Damien Le Moal @ 2025-07-28 9:02 ` Yu Kuai 2025-07-29 4:23 ` Martin K. Petersen 2025-07-29 6:13 ` Hannes Reinecke 2025-07-28 10:56 ` Csordás Hunor 2025-07-29 4:08 ` Martin K. Petersen 2 siblings, 2 replies; 20+ messages in thread From: Yu Kuai @ 2025-07-28 9:02 UTC (permalink / raw) To: Damien Le Moal, Yu Kuai, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) Hi, 在 2025/07/28 15:44, Damien Le Moal 写道: > On 7/28/25 4:14 PM, Yu Kuai wrote: >>>> With git log, start from commit 7e5f5fb09e6f ("block: Update topology >>>> documentation"), the documentation start contain specail explanation for >>>> raid array, and the optimal_io_size says: >>>> >>>> For RAID arrays it is usually the >>>> stripe width or the internal track size. A properly aligned >>>> multiple of optimal_io_size is the preferred request size for >>>> workloads where sustained throughput is desired. >>>> >>>> And this explanation is exactly what raid5 did, it's important that >>>> io size is aligned multiple of io_opt. >>> >>> Looking at the sysfs doc for the above fields, they are described as follows: >>> >>> * /sys/block/<disk>/queue/minimum_io_size >>> >>> [RO] Storage devices may report a granularity or preferred >>> minimum I/O size which is the smallest request the device can >>> perform without incurring a performance penalty. For disk >>> drives this is often the physical block size. For RAID arrays >>> it is often the stripe chunk size. A properly aligned multiple >>> of minimum_io_size is the preferred request size for workloads >>> where a high number of I/O operations is desired. >>> >>> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY and for a >>> RAID array, this indeed should be the stride x number of data disks. >> >> Do you mean stripe here? io_min for raid array is always just one >> chunksize. > > My bad, yes, that is the definition in sysfs. 
So io_min is the stride size, where: > > stride size x number of data disks == stripe_size. > Yes. > > Note that chunk_sectors limit is the *stripe* size, not per drive stride. > Beware of the wording here to avoid confusion (this is all already super > confusing !). This is something where we're not on the same page :( For example, take an 8-disk raid5 with the default chunk size. Then the above calculation is: 64k * 7 = 448k The chunksize I meant is 64k... > > Well, at least, that is how I interpret the io_min definition of > minimum_io_size in Documentation/ABI/stable/sysfs-block. But the wording "For > RAID arrays it is often the stripe chunk size." is super confusing. Not > entirely sure if stride or stripe was meant here... > Hope it's clear now. > >>> * /sys/block/<disk>/queue/optimal_io_size >>> >>> Storage devices may report an optimal I/O size, which is >>> the device's preferred unit for sustained I/O. This is rarely >>> reported for disk drives. For RAID arrays it is usually the >>> stripe width or the internal track size. A properly aligned >>> multiple of optimal_io_size is the preferred request size for >>> workloads where sustained throughput is desired. If no optimal >>> I/O size is reported this file contains 0. >>> >>> Well, I find this definition not correct *at all*. This is repeating the >>> definition of minimum_io_size (limits->io_min) and completely disregard the >>> eventual optimal_io_size limit of the drives in the array. For a raid array, >>> this value should obviously be a multiple of minimum_io_size (the array stripe >>> size), but it can be much larger, since this should be an upper bound for IO >>> size. read_ahead_kb being set using this value is thus not correct I think. >>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size.
>> >> I think this is actually different than io_min, and io_opt for different >> levels are not the same, for raid0, raid10, raid456(raid1 doesn't have >> chunksize): >> - lim.io_min = mddev->chunk_sectors << 9; In the above example, io_min = 64k, and io_opt = 448k. And to make sure we're on the same page: io_min is the *stride* and io_opt is the *stripe*. > > See above. Given how confusing the definition of minimum_io_size is, not sure > that is correct. This code assumes that io_min is the stripe size and not the > stride size. > >> - lim.io_opt = lim.io_min * (number of data copies); > > I do not understand what you mean with "number of data copies"... There is no > data copy in a RAID 5/6 array. Yes, this is my bad, *data disks* is the better word. > >> And I think they do match the definition above, specifically: >> - properly multiple aligned io_min to *prevent performance penalty*; > > Yes. > >> - properly multiple aligned io_opt to *get optimal performance*, the >> number of data copies times the performance of a single disk; > > That is how this field is defined for RAID, but that is far from what it means > for a single disk. It is unfortunate that it was defined like that. > > For a single disk, io_opt is NOT about getting optimal_performance. It is about > an upper bound for the IO size to NOT get a performance penalty (e.g. due to a > DMA mapping that is too large for what the IOMMU can handle). The name itself is misleading. :( I didn't know this definition until now. > > And for a RAID array, it means that we should always have io_min == io_opt but > it seems that the scsi code and limit stacking code try to make this limit an > upper bound on the IO size, aligned to the stripe size. > >> The orginal problem is that scsi disks report unusual io_opt 32767, >> and raid5 set io_opt to 64k * 7(8 disks with 64k chunksise).
The >> lcm_not_zero() from blk_stack_limits() end up with a huge value: >> >> blk_stack_limits() >> t->io_min = max(t->io_min, b->io_min); >> t->io_opt = lcm_not_zero(t->io_opt, b->io_opt); > > I understand the "problem" that was stated. There is an overflow that result in > a large io_opt and a ridiculously large read_ahead_kb. > io_opt being large should in my opinion not be an issue in itself, since it > should be an upper bound on IO size and not the stripe size (io_min indicates > that). > >>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size. >> >> The io_opt is used in raid array as minimal aligned size to get optimal >> IO performance, not the upper bound. With the respect of this, use this >> value for ra_pages make sense. However, if scsi is using this value as >> IO upper bound, it's right this doesn't make sense. > > Here is your issue. People misunderstood optimal_io_size and used that instead > of using minimal_io_size/io_min limit for the granularity/alignment of IOs. > Using optimal_io_size as the "granularity" for optimal IOs that do not require > read-modify-write of RAID stripes is simply wrong in my optinion. > io_min/minimal_io_size is the attribute indicating that. Ok, looks like there are two problems now: a) io_min, size to prevent performance penalty; 1) For raid5, to avoid read-modify-write, this value should be 448k, but now it's 64k; 2) For raid0/raid10, this value is set to 64k now, however, this value should not be set. If the value in member disks is 4k, issuing 4k is just fine, there won't be any performance penalty; 3) For raid1, this value is not set, and will use member disks, this is correct. b) io_opt, size to ??? 4) For raid0/raid10/raid5, this value is set to the minimal IO size to get best performance. 5) For raid1, this value is not set, and will use member disks. Problem a can be fixed easily, but for problem b, I'm not sure how to fix it; it depends on what we think io_opt means.
If io_opt should be *upper bound*, problem 4) should be fixed like case 5), and other places like blk_apply_bdi_limits() setting ra_pages by io_opt should be fixed as well. If io_opt should be *minimal IO size to get best performance*, problem 5) should be fixed like case 4), and I don't know if scsi or other drivers that set the initial io_opt should be changed. :( Thanks, Kuai > > As for read_ahead_kb, it should be bounded by io_opt (upper bound) but should > be initialized to a smaller value aligned to io_min (if io_opt is unreasonably > large). > > Given all of that and how misused io_opt seems to be, I am not sure how to fix > this though. > ^ permalink raw reply [flat|nested] 20+ messages in thread
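Kuai's enumeration of the current per-level behaviour can be condensed into a small sketch (hypothetical `md_limits()` helper, illustrative only; this is not a kernel function):

```python
# Condensed sketch of the md queue-limit choices enumerated above
# (hypothetical helper, not kernel code). chunk is the per-drive chunk
# size in bytes; data_disks is the number of data-bearing disks.
def md_limits(level, chunk, data_disks):
    if level == 1:
        # raid1 has no chunk; limits are inherited from the member disks
        return {"io_min": None, "io_opt": None}
    # raid0/raid10/raid456: io_min = chunk, io_opt = chunk * data disks
    return {"io_min": chunk, "io_opt": chunk * data_disks}

print(md_limits(5, 64 * 1024, 7))  # {'io_min': 65536, 'io_opt': 458752}
```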
* Re: Improper io_opt setting for md raid5 2025-07-28 9:02 ` Yu Kuai @ 2025-07-29 4:23 ` Martin K. Petersen 2025-07-29 6:25 ` Yu Kuai 2025-07-29 22:02 ` Tony Battersby 2025-07-29 6:13 ` Hannes Reinecke 1 sibling, 2 replies; 20+ messages in thread From: Martin K. Petersen @ 2025-07-29 4:23 UTC (permalink / raw) To: Yu Kuai Cc: Damien Le Moal, Csordás Hunor, Coly Li, hch, linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) > Ok, looks like there are two problems now: > > a) io_min, size to prevent performance penalty; > > 1) For raid5, to avoid read-modify-write, this value should be 448k, > but now it's 64k; You have two penalties for RAID5: Writes smaller than the stripe chunk size and writes smaller than the full stripe width. > 2) For raid0/raid10, this value is set to 64k now, however, this value > should not set. If the value in member disks is 4k, issue 4k is just > fine, there won't be any performance penalty; Correct. > 3) For raid1, this value is not set, and will use member disks, this is > correct. Correct. > b) io_opt, size to ??? > 4) For raid0/raid10/rai5, this value is set to mininal IO size to get > best performance. For RAID 0 you want to set io_opt to the stripe width. io_opt is for sequential, throughput-optimized I/O. Presumably the MD stripe chunk size has been chosen based on knowledge about the underlying disks and their performance. And thus maximum throughput will be achieved when doing full stripe writes across all drives. For software RAID I am not sure how much this really matters in a modern context. It certainly did 25 years ago when we benchmarked things for XFS. Full stripe writes were a big improvement with both software and hardware RAID. But how much this matters today, I am not sure. > 5) For raid1, this value is not set, and will use member disks. Correct. 
> > If io_opt should be *upper bound*, problem 4) should be fixed like case > 5), and other places like blk_apply_bdi_limits() setting ra_pages by > io_opt should be fixed as well. I understand Damien's "upper bound" interpretation but it does not take alignment and granularity into account. And both are imperative for io_opt. > If io_opt should be *mininal IO size to get best performance*, What is "best performance"? IOPS or throughput? io_min is about IOPS. io_opt is about throughput. -- Martin K. Petersen ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-29 4:23 ` Martin K. Petersen @ 2025-07-29 6:25 ` Yu Kuai 2025-07-29 22:02 ` Tony Battersby 1 sibling, 0 replies; 20+ messages in thread From: Yu Kuai @ 2025-07-29 6:25 UTC (permalink / raw) To: Martin K. Petersen, Yu Kuai Cc: Damien Le Moal, Csordás Hunor, Coly Li, hch, linux-block, James E.J. Bottomley, linux-scsi, yukuai (C) Hi, Martin 在 2025/07/29 12:23, Martin K. Petersen 写道: > >> Ok, looks like there are two problems now: >> >> a) io_min, size to prevent performance penalty; >> >> 1) For raid5, to avoid read-modify-write, this value should be 448k, >> but now it's 64k; > > You have two penalties for RAID5: Writes smaller than the stripe chunk > size and writes smaller than the full stripe width. Yes, the internal IO size for raid5 is 4k; however, only a full stripe write (448k here) can prevent read-modify-write. > >> 2) For raid0/raid10, this value is set to 64k now, however, this value >> should not set. If the value in member disks is 4k, issue 4k is just >> fine, there won't be any performance penalty; > > Correct. > >> 3) For raid1, this value is not set, and will use member disks, this is >> correct. > > Correct. > >> b) io_opt, size to ??? >> 4) For raid0/raid10/rai5, this value is set to mininal IO size to get >> best performance. > > For RAID 0 you want to set io_opt to the stripe width. io_opt is for > sequential, throughput-optimized I/O. Presumably the MD stripe chunk > size has been chosen based on knowledge about the underlying disks and > their performance. And thus maximum throughput will be achieved when > doing full stripe writes across all drives. Yes, raid0/raid10/raid5 all follow the same logic: properly aligned sequential IO can reach the number of data disks times single disk performance. > > For software RAID I am not sure how much this really matters in a modern > context. It certainly did 25 years ago when we benchmarked things for > XFS.
Full stripe writes were a big improvement with both software and > hardware RAID. But how much this matters today, I am not sure. For raid1, write throughput will be no better than single disk performance. However, for reads, the io_opt should be the sum of the member disks' io_opt; see should_choose_next(): for sequential reads, raid1 will switch to the next rdev after reading io_opt bytes from the current rdev. > >> 5) For raid1, this value is not set, and will use member disks. > > Correct. > >> >> If io_opt should be *upper bound*, problem 4) should be fixed like case >> 5), and other places like blk_apply_bdi_limits() setting ra_pages by >> io_opt should be fixed as well. > > I understand Damien's "upper bound" interpretation but it does not take > alignment and granularity into account. And both are imperative for > io_opt. > >> If io_opt should be *mininal IO size to get best performance*, > > What is "best performance"? IOPS or throughput? > > io_min is about IOPS. io_opt is about throughput. I mean throughput here. > Thanks, Kuai ^ permalink raw reply [flat|nested] 20+ messages in thread
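The raid1 sequential-read behaviour Kuai refers to can be sketched as follows (a loose model of the should_choose_next() idea, with a hypothetical pick_mirror() helper; not the actual kernel logic):

```python
# Loose model of raid1 sequential-read balancing: once io_opt bytes have
# been read from the current mirror, rotate to the next one, so that a
# long sequential read ends up spread across all mirrors.
def pick_mirror(current, read_so_far, io_opt, nr_mirrors):
    """Return the mirror index to use for the next sequential read."""
    if read_so_far >= io_opt:
        return (current + 1) % nr_mirrors  # switch after io_opt bytes
    return current  # keep streaming from the current mirror

print(pick_mirror(0, 64 * 1024, 64 * 1024, 2))  # 1: time to switch
```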
* Re: Improper io_opt setting for md raid5 2025-07-29 4:23 ` Martin K. Petersen 2025-07-29 6:25 ` Yu Kuai @ 2025-07-29 22:02 ` Tony Battersby 1 sibling, 0 replies; 20+ messages in thread From: Tony Battersby @ 2025-07-29 22:02 UTC (permalink / raw) To: Martin K. Petersen, Yu Kuai Cc: Damien Le Moal, Csordás Hunor, Coly Li, hch, linux-block, James E.J. Bottomley, linux-scsi, yukuai (C) On 7/29/25 00:23, Martin K. Petersen wrote: >> b) io_opt, size to ??? >> 4) For raid0/raid10/rai5, this value is set to mininal IO size to get >> best performance. > For software RAID I am not sure how much this really matters in a modern > context. It certainly did 25 years ago when we benchmarked things for > XFS. Full stripe writes were a big improvement with both software and > hardware RAID. But how much this matters today, I am not sure. > FWIW, I just posted a patch that aligns writes to stripe boundaries using io_opt: https://lore.kernel.org/all/55deda1d-967d-4d68-a9ba-4d5139374a37@cybernetics.com/ I get about a 2.3% performance improvement with md-raid6, but I have an out-of-tree RAID driver that gets more like 4x improvement. If io_opt means different things to different code, might we consider adding another field to the queue limits to give explicit stripe parameters? Tony Battersby Cybernetics ^ permalink raw reply [flat|nested] 20+ messages in thread
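The idea in Tony's patch, splitting a write so that its tail lands on a stripe boundary, can be sketched like this (hypothetical split_at_stripe() helper; the actual patch linked above is the authoritative implementation):

```python
# Sketch of stripe-aligned write splitting: if a write starts mid-stripe
# and crosses the next io_opt (stripe) boundary, split off a head piece
# so the remainder starts stripe-aligned (hypothetical helper).
def split_at_stripe(offset, length, io_opt):
    """Return (head, tail) lengths for a write at the given byte offset."""
    boundary = (offset // io_opt + 1) * io_opt
    if offset % io_opt == 0 or offset + length <= boundary:
        return (length, 0)  # already aligned, or fits before the boundary
    return (boundary - offset, offset + length - boundary)

# 448 KiB write starting 100 KiB into a 448 KiB stripe:
print(split_at_stripe(100 * 1024, 448 * 1024, 448 * 1024))  # (356352, 102400)
```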
* Re: Improper io_opt setting for md raid5 2025-07-28 9:02 ` Yu Kuai 2025-07-29 4:23 ` Martin K. Petersen @ 2025-07-29 6:13 ` Hannes Reinecke 2025-07-29 6:29 ` Yu Kuai 2025-07-29 22:24 ` Keith Busch 1 sibling, 2 replies; 20+ messages in thread From: Hannes Reinecke @ 2025-07-29 6:13 UTC (permalink / raw) To: Yu Kuai, Damien Le Moal, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) On 7/28/25 11:02, Yu Kuai wrote: > Hi, > > 在 2025/07/28 15:44, Damien Le Moal 写道: >> On 7/28/25 4:14 PM, Yu Kuai wrote: >>>>> With git log, start from commit 7e5f5fb09e6f ("block: Update topology >>>>> documentation"), the documentation start contain specail >>>>> explanation for >>>>> raid array, and the optimal_io_size says: >>>>> >>>>> For RAID arrays it is usually the >>>>> stripe width or the internal track size. A properly aligned >>>>> multiple of optimal_io_size is the preferred request size for >>>>> workloads where sustained throughput is desired. >>>>> >>>>> And this explanation is exactly what raid5 did, it's important that >>>>> io size is aligned multiple of io_opt. >>>> >>>> Looking at the sysfs doc for the above fields, they are described as >>>> follows: >>>> >>>> * /sys/block/<disk>/queue/minimum_io_size >>>> >>>> [RO] Storage devices may report a granularity or preferred >>>> minimum I/O size which is the smallest request the device can >>>> perform without incurring a performance penalty. For disk >>>> drives this is often the physical block size. For RAID arrays >>>> it is often the stripe chunk size. A properly aligned multiple >>>> of minimum_io_size is the preferred request size for workloads >>>> where a high number of I/O operations is desired. >>>> >>>> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY >>>> and for a >>>> RAID array, this indeed should be the stride x number of data disks. >>> >>> Do you mean stripe here? io_min for raid array is always just one >>> chunksize. 
>> >> My bad, yes, that is the definition in sysfs. So io_min is the stride >> size, where: >> >> stride size x number of data disks == stripe_size. >> > Yes. > >> Note that chunk_sectors limit is the *stripe* size, not per drive stride. >> Beware of the wording here to avoid confusion (this is all already super >> confusing !). > > This is something we're not in the same page :( For example, 8 disks > raid5, with default chunk size. Then the above calculation is: > > 64k * 7 = 448k > > The chunksize I said is 64k... Hmm. I always thought that the 'chunksize' is the limit which I/O must not cross to avoid being split. So for RAID 4/5/6 I would have thought this to be the stride size, as MD must split larger I/O onto two disks. Sure, one could argue that the stripe size is the chunk size, but then MD will have to split that I/O... Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-29 6:13 ` Hannes Reinecke @ 2025-07-29 6:29 ` Yu Kuai 2025-07-29 22:24 ` Keith Busch 1 sibling, 0 replies; 20+ messages in thread From: Yu Kuai @ 2025-07-29 6:29 UTC (permalink / raw) To: Hannes Reinecke, Yu Kuai, Damien Le Moal, Csordás Hunor, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) Hi, 在 2025/07/29 14:13, Hannes Reinecke 写道: > On 7/28/25 11:02, Yu Kuai wrote: >> Hi, >> >> 在 2025/07/28 15:44, Damien Le Moal 写道: >>> On 7/28/25 4:14 PM, Yu Kuai wrote: >>>>>> With git log, start from commit 7e5f5fb09e6f ("block: Update topology >>>>>> documentation"), the documentation start contain specail >>>>>> explanation for >>>>>> raid array, and the optimal_io_size says: >>>>>> >>>>>> For RAID arrays it is usually the >>>>>> stripe width or the internal track size. A properly aligned >>>>>> multiple of optimal_io_size is the preferred request size for >>>>>> workloads where sustained throughput is desired. >>>>>> >>>>>> And this explanation is exactly what raid5 did, it's important that >>>>>> io size is aligned multiple of io_opt. >>>>> >>>>> Looking at the sysfs doc for the above fields, they are described >>>>> as follows: >>>>> >>>>> * /sys/block/<disk>/queue/minimum_io_size >>>>> >>>>> [RO] Storage devices may report a granularity or preferred >>>>> minimum I/O size which is the smallest request the device can >>>>> perform without incurring a performance penalty. For disk >>>>> drives this is often the physical block size. For RAID arrays >>>>> it is often the stripe chunk size. A properly aligned multiple >>>>> of minimum_io_size is the preferred request size for workloads >>>>> where a high number of I/O operations is desired. >>>>> >>>>> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY >>>>> and for a >>>>> RAID array, this indeed should be the stride x number of data disks. >>>> >>>> Do you mean stripe here? 
io_min for raid array is always just one >>>> chunksize. >>> >>> My bad, yes, that is the definition in sysfs. So io_min is the stride >>> size, where: >>> >>> stride size x number of data disks == stripe_size. >>> >> Yes. >> >>> Note that chunk_sectors limit is the *stripe* size, not per drive >>> stride. >>> Beware of the wording here to avoid confusion (this is all already super >>> confusing !). >> >> This is something we're not in the same page :( For example, 8 disks >> raid5, with default chunk size. Then the above calculation is: >> >> 64k * 7 = 448k >> >> The chunksize I said is 64k... > > Hmm. I always thought that the 'chunksize' is the limit which I/O must > not cross to avoid being split. > So for RAID 4/5/6 I would have thought this to be the stride size, > as MD must split larger I/O onto two disks. > Sure, one could argue that the stripe size is the chunk size, but then > MD will have to split that I/O... BTW, I always thought chunksize to be stride size simply because there is a metadata field in mddev superblock named 'chunk_size', which is the stride size. Thanks, Kuai > > Cheers, > > Hannes ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-29 6:13 ` Hannes Reinecke 2025-07-29 6:29 ` Yu Kuai @ 2025-07-29 22:24 ` Keith Busch 1 sibling, 0 replies; 20+ messages in thread From: Keith Busch @ 2025-07-29 22:24 UTC (permalink / raw) To: Hannes Reinecke Cc: Yu Kuai, Damien Le Moal, Csordás Hunor, Coly Li, hch, linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) On Tue, Jul 29, 2025 at 08:13:31AM +0200, Hannes Reinecke wrote: > > > Note that chunk_sectors limit is the *stripe* size, not per drive stride. > > > Beware of the wording here to avoid confusion (this is all already super > > > confusing !). > > > > This is something we're not in the same page :( For example, 8 disks > > raid5, with default chunk size. Then the above calculation is: > > > > 64k * 7 = 448k > > > > The chunksize I said is 64k... > > Hmm. I always thought that the 'chunksize' is the limit which I/O must > not cross to avoid being split. > So for RAID 4/5/6 I would have thought this to be the stride size, > as MD must split larger I/O onto two disks. > Sure, one could argue that the stripe size is the chunk size, but then > MD will have to split that I/O... Yah, I think that makes sense. At least the way nvme uses "chunk_size", it was assumed to mean the boundary for when the backend handling will split your request for different isolated media. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 7:44 ` Damien Le Moal 2025-07-28 9:02 ` Yu Kuai @ 2025-07-28 10:56 ` Csordás Hunor 2025-07-29 4:08 ` Martin K. Petersen 2 siblings, 0 replies; 20+ messages in thread From: Csordás Hunor @ 2025-07-28 10:56 UTC (permalink / raw) To: Damien Le Moal, Yu Kuai, Coly Li, hch Cc: linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) On 7/28/2025 11:02 AM, Yu Kuai wrote: > Hi, > > 在 2025/07/28 15:44, Damien Le Moal 写道: >> On 7/28/25 4:14 PM, Yu Kuai wrote: >>>>> With git log, start from commit 7e5f5fb09e6f ("block: Update topology >>>>> documentation"), the documentation start contain specail explanation for >>>>> raid array, and the optimal_io_size says: >>>>> >>>>> For RAID arrays it is usually the >>>>> stripe width or the internal track size. A properly aligned >>>>> multiple of optimal_io_size is the preferred request size for >>>>> workloads where sustained throughput is desired. >>>>> >>>>> And this explanation is exactly what raid5 did, it's important that >>>>> io size is aligned multiple of io_opt. >>>> >>>> Looking at the sysfs doc for the above fields, they are described as follows: >>>> >>>> * /sys/block/<disk>/queue/minimum_io_size >>>> >>>> [RO] Storage devices may report a granularity or preferred >>>> minimum I/O size which is the smallest request the device can >>>> perform without incurring a performance penalty. For disk >>>> drives this is often the physical block size. For RAID arrays >>>> it is often the stripe chunk size. A properly aligned multiple >>>> of minimum_io_size is the preferred request size for workloads >>>> where a high number of I/O operations is desired. >>>> >>>> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY and for a >>>> RAID array, this indeed should be the stride x number of data disks. >>> >>> Do you mean stripe here? io_min for raid array is always just one >>> chunksize. >> >> My bad, yes, that is the definition in sysfs. 
So io_min is the stride size, where: >> >> stride size x number of data disks == stripe_size. >> > Yes. > >> Note that chunk_sectors limit is the *stripe* size, not per drive stride. >> Beware of the wording here to avoid confusion (this is all already super >> confusing !). > > This is something we're not in the same page :( For example, 8 disks > raid5, with default chunk size. Then the above calculation is: > > 64k * 7 = 448k > > The chunksize I said is 64k... >> >> Well, at least, that is how I interpret the io_min definition of >> minimum_io_size in Documentation/ABI/stable/sysfs-block. But the wording "For >> RAID arrays it is often the stripe chunk size." is super confusing. Not >> entirely sure if stride or stripe was meant here... >> > > Hope it's clear now. >> >>>> * /sys/block/<disk>/queue/optimal_io_size >>>> >>>> Storage devices may report an optimal I/O size, which is >>>> the device's preferred unit for sustained I/O. This is rarely >>>> reported for disk drives. For RAID arrays it is usually the >>>> stripe width or the internal track size. A properly aligned >>>> multiple of optimal_io_size is the preferred request size for >>>> workloads where sustained throughput is desired. If no optimal >>>> I/O size is reported this file contains 0. >>>> >>>> Well, I find this definition not correct *at all*. This is repeating the >>>> definition of minimum_io_size (limits->io_min) and completely disregard the >>>> eventual optimal_io_size limit of the drives in the array. For a raid array, >>>> this value should obviously be a multiple of minimum_io_size (the array stripe >>>> size), but it can be much larger, since this should be an upper bound for IO >>>> size. read_ahead_kb being set using this value is thus not correct I think. >>>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size. 
>>> >>> I think this is actually different than io_min, and io_opt for different >>> levels are not the same, for raid0, raid10, raid456(raid1 doesn't have >>> chunksize): >>> - lim.io_min = mddev->chunk_sectors << 9; > > By the above example, io_min = 64k, and io_opt = 448k. And make sure > we're on the same page, io_min is the *stride* and io_opt is the > *stripe*. >> >> See above. Given how confusing the definition of minimum_io_size is, not sure >> that is correct. This code assumes that io_min is the stripe size and not the >> stride size. >> >>> - lim.io_opt = lim.io_min * (number of data copies); >> >> I do not understand what you mean with "number of data copies"... There is no >> data copy in a RAID 5/6 array. > > Yes, this is my bad, *data disks* is the better word. >> >>> And I think they do match the definition above, specifically: >>> - properly multiple aligned io_min to *prevent performance penalty*; >> >> Yes. >> >>> - properly multiple aligned io_opt to *get optimal performance*, the >>> number of data copies times the performance of a single disk; >> >> That is how this field is defined for RAID, but that is far from what it means >> for a single disk. It is unfortunate that it was defined like that. >> >> For a single disk, io_opt is NOT about getting optimal_performance. It is about >> an upper bound for the IO size to NOT get a performance penalty (e.g. due to a >> DMA mapping that is too large for what the IOMMU can handle). > > The name itself is misleading. :( I didn't know this definition until > now. > >> >> And for a RAID array, it means that we should always have io_min == io_opt but >> it seems that the scsi code and limit stacking code try to make this limit an >> upper bound on the IO size, aligned to the stripe size. >> >>> The orginal problem is that scsi disks report unusual io_opt 32767, >>> and raid5 set io_opt to 64k * 7(8 disks with 64k chunksise). 
>>> The lcm_not_zero() from blk_stack_limits() end up with a huge value:
>>>
>>> blk_stack_limits()
>>>     t->io_min = max(t->io_min, b->io_min);
>>>     t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
>>
>> I understand the "problem" that was stated. There is an overflow that results in
>> a large io_opt and a ridiculously large read_ahead_kb.
>> io_opt being large should in my opinion not be an issue in itself, since it
>> should be an upper bound on IO size and not the stripe size (io_min indicates
>> that).
>>
>>>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size.
>>>
>>> The io_opt is used in raid array as minimal aligned size to get optimal
>>> IO performance, not the upper bound. With the respect of this, use this
>>> value for ra_pages make sense. However, if scsi is using this value as
>>> IO upper bound, it's right this doesn't make sense.
>>
>> Here is your issue. People misunderstood optimal_io_size and used that instead
>> of using minimal_io_size/io_min limit for the granularity/alignment of IOs.
>> Using optimal_io_size as the "granularity" for optimal IOs that do not require
>> read-modify-write of RAID stripes is simply wrong in my opinion.
>> io_min/minimal_io_size is the attribute indicating that.

I'm not familiar enough with all the code using io_opt to be certain, but I think it's a little bit the other way around. The problem definitely seems to be that we want to use one variable for multiple different purposes.

io_opt in struct queue_limits is the "optimal I/O size", but "optimal" can mean many things in different contexts. In general, if I want to do some I/O, I can say that the optimal way to do it is to make I/O requests that satisfy some condition. The condition can be many things:

- The request size should be at least X (combining two small requests into a big one may be faster, or extending a small request into a bigger one may avoid having to do another request later).
- The request size should be at most X (for example, because DMA is inefficient on this system with request sizes too large -- this is the _only_ thing that shost->opt_sectors is currently set for).
- The request size should be a multiple of X, _and_ the request should have an alignment of X (this is what a striped md array wants).

And, of course, there can be many other "optimality" conditions. The ones listed above all have a parameter (X), which can, in the context of that condition, be called the "optimal I/O size". These parameters, however, can be very different for each condition; the right parameter for one condition can be completely inappropriate for another.

The SCSI standard may have a definition for "optimal transfer length", but io_opt in struct queue_limits seems to be used for a different purpose since its introduction in 2009:

- It was introduced in commit c72758f33784 ("block: Export I/O topology for block devices and partitions"). The commit message specifically mentions its use by md:

      The io_opt characteristic indicates the optimal I/O size reported by
      the device. This is usually the stripe width for arrays.

- It actually started being set by md in commit 8f6c2e4b325a ("md: Use new topology calls to indicate alignment and I/O sizes"). The commit date is more than a month later than the last, but as far as I can see, they were originally posted in the same patch series: https://lore.kernel.org/all/20090424123054.GA1926@parisc-linux.org/T/#u In the context of that patch series, md was literally the first user of io_opt.
- Using the lowest common multiple of the component devices and the array to calculate the final io_opt of the array happened in commit 70dd5bf3b999 ("block: Stack optimal I/O size"), still in 2009.
It wasn't until commit a23634644afc ("block: take io_opt and io_min into account for max_sectors") in 2024 that sd_revalidate_disk started to set io_opt from what, in the context of the SCSI standard, should be called the optimal I/O size. However, in doing so, it broke md arrays. With my setup, this was hidden until commit 9c0ba14828d6 ("blk-settings: round down io_opt to physical_block_size"), but nonetheless, it happened here.

> Ok, looks like there are two problems now:
>
> a) io_min, size to prevent performance penalty;
>
> 1) For raid5, to avoid read-modify-write, this value should be 448k,
> but now it's 64k;
> 2) For raid0/raid10, this value is set to 64k now, however, this value
> should not set. If the value in member disks is 4k, issue 4k is just
> fine, there won't be any performance penalty;
> 3) For raid1, this value is not set, and will use member disks, this is
> correct.
>
> b) io_opt, size to ???
> 4) For raid0/raid10/raid5, this value is set to minimal IO size to get
> best performance.
> 5) For raid1, this value is not set, and will use member disks.
>
> Problem a can be fixed easily, and for problem b, I'm not sure how to
> fix it as well, it depends on how we think io_opt is.
>
> If io_opt should be *upper bound*, problem 4) should be fixed like case
> 5), and other places like blk_apply_bdi_limits() setting ra_pages by
> io_opt should be fixed as well.
>
> If io_opt should be *minimal IO size to get best performance*, problem
> 5) should be fixed like case 4), and I don't know if scsi or other
> drivers to set initial io_opt should be changed. :(
>
> Thanks,
> Kuai
>
>>
>> As for read_ahead_kb, it should be bounded by io_opt (upper bound) but should
>> be initialized to a smaller value aligned to io_min (if io_opt is unreasonably
>> large).
>>
>> Given all of that and how misused io_opt seems to be, I am not sure how to fix
>> this though.
>>

Hunor Csordás
* Re: Improper io_opt setting for md raid5 2025-07-28 7:44 ` Damien Le Moal 2025-07-28 9:02 ` Yu Kuai 2025-07-28 10:56 ` Csordás Hunor @ 2025-07-29 4:08 ` Martin K. Petersen 2 siblings, 0 replies; 20+ messages in thread From: Martin K. Petersen @ 2025-07-29 4:08 UTC (permalink / raw) To: Damien Le Moal Cc: Yu Kuai, Csordás Hunor, Coly Li, hch, linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) Damien, > My bad, yes, that is the definition in sysfs. So io_min is the stride > size, where: Depends on the RAID type. For RAID0 and 1 there is no inherent penalty wrt. writing less than the stride size. But for RAID 5/6 there clearly is. > stride size x number of data disks == stripe_size. > > Note that chunk_sectors limit is the *stripe* size, not per drive stride. > Beware of the wording here to avoid confusion (this is all already super > confusing !). The choice of "chunk" to describe the LBA boundary queue limit is unfortunate since MD uses chunk_sectors as the term for what you call stride. > Well, at least, that is how I interpret the io_min definition of > minimum_io_size in Documentation/ABI/stable/sysfs-block. But the wording "For > RAID arrays it is often the stripe chunk size." is super confusing. Not > entirely sure if stride or stripe was meant here... The stripe chunk or stripe unit is what you call stride. Stripe width is the full stripe across all drives. > As for read_ahead_kb, it should be bounded by io_opt (upper bound) but > should be initialized to a smaller value aligned to io_min (if io_opt > is unreasonably large). In retrospect I am not really a fan of using io_opt for read_ahead_kb since, to my knowledge, there is no guarantee that the readahead I/O will be naturally aligned. That said, I don't really know of devices where this matters much for reads. With writes, this would be much more of an issue. -- Martin K. Petersen ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 7:14 ` Yu Kuai 2025-07-28 7:44 ` Damien Le Moal @ 2025-07-29 3:53 ` Martin K. Petersen 1 sibling, 0 replies; 20+ messages in thread From: Martin K. Petersen @ 2025-07-29 3:53 UTC (permalink / raw) To: Yu Kuai Cc: Damien Le Moal, Csordás Hunor, Coly Li, hch, linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) > The orginal problem is that scsi disks report unusual io_opt 32767, We should fix that. We already ignore a bunch of other oddball clues. -- Martin K. Petersen ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-28 3:49 ` Damien Le Moal 2025-07-28 7:14 ` Yu Kuai @ 2025-07-29 3:49 ` Martin K. Petersen 1 sibling, 0 replies; 20+ messages in thread From: Martin K. Petersen @ 2025-07-29 3:49 UTC (permalink / raw) To: Damien Le Moal Cc: Yu Kuai, Csordás Hunor, Coly Li, hch, linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, yukuai (C) Damien, > An OPTIMAL TRANSFER LENGTH GRANULARITY field set to 0000h indicates > that the device server does not report optimal transfer length > granularity. > > For a SCSI disk, sd.c uses this value for sdkp->min_xfer_blocks. Note > that the naming here is dubious since this is not a minimum. The > minimum is the logical block size. min_io is a preferred minimum, not an absolute minimum, just like the physical block size. You can do smaller I/Os but you don't want to. > Storage devices may report an optimal I/O size, which is the device's > preferred unit for sustained I/O. This is rarely reported for disk > drives. For RAID arrays it is usually the stripe width or the internal > track size. A properly aligned multiple of optimal_io_size is the > preferred request size for workloads where sustained throughput is > desired. If no optimal I/O size is reported this file contains 0. > > Well, I find this definition not correct *at all*. This is repeating > the definition of minimum_io_size (limits->io_min) and completely > disregard the eventual optimal_io_size limit of the drives in the > array. opt_io defines the optimal I/O size for a sequential/throughput-focused workload. You can do larger I/Os but you don't want to. RAID arrays at the time the spec was written had sophisticated non-volatile caches which managed data in tracks or cache lines. When you issued an I/O which straddled internal cache lines in the array, you would fall back to a slow path as opposed to doing things entirely in hardware. 
So the purpose of the optimal transfer length and granularity in SCSI was to encourage applications to write naturally aligned full tracks/cache lines/stripes. > For a raid array, this value should obviously be a multiple of > minimum_io_size (the array stripe size), SCSI does not require this but we enforce it in sd.c. -- Martin K. Petersen ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Improper io_opt setting for md raid5 2025-07-27 10:50 ` Improper io_opt setting for md raid5 Csordás Hunor 2025-07-28 0:39 ` Damien Le Moal @ 2025-07-29 4:44 ` Martin K. Petersen 1 sibling, 0 replies; 20+ messages in thread From: Martin K. Petersen @ 2025-07-29 4:44 UTC (permalink / raw) To: Csordás Hunor Cc: Coly Li, hch, linux-block, James E.J. Bottomley, Martin K. Petersen, linux-scsi, Damien Le Moal Csordás, I basically agree with everything in your analysis. >> Then in drivers/scsi/sd.c, inside sd_revalidate_disk() from the following coce, >> 3785 /* >> 3786 * Limit default to SCSI host optimal sector limit if set. There may be >> 3787 * an impact on performance for when the size of a request exceeds this >> 3788 * host limit. >> 3789 */ >> 3790 lim.io_opt = sdp->host->opt_sectors << SECTOR_SHIFT; >> 3791 if (sd_validate_opt_xfer_size(sdkp, dev_max)) { >> 3792 lim.io_opt = min_not_zero(lim.io_opt, >> 3793 logical_to_bytes(sdp, sdkp->opt_xfer_blocks)); >> 3794 } >> >> lim.io_opt of all my sata disks attached to mpt3sas HBA are all 32767 sectors, >> because the above code block. shost->opt_sectors was originally used to seed rw_max and not io_opt, I think that is the appropriate thing to do. A SCSI host driver reporting some ludicrous limit is not necessarily representative of a "good" I/O size as far as neither disk drive, nor the Linux I/O stack is concerned. shost->opt_sectors should clearly set an upper bound for max_sectors. But I don't think we should ever increase any queue limit based on what is reported in opt_sectors. -- Martin K. Petersen ^ permalink raw reply [flat|nested] 20+ messages in thread
Thread overview: 20+ messages
[not found] <ywsfp3lqnijgig6yrlv2ztxram6ohf5z4yfeebswjkvp2dzisd@f5ikoyo3sfq5>
2025-07-27 10:50 ` Improper io_opt setting for md raid5 Csordás Hunor
2025-07-28 0:39 ` Damien Le Moal
2025-07-28 0:55 ` Yu Kuai
2025-07-28 2:41 ` Damien Le Moal
2025-07-28 3:08 ` Yu Kuai
2025-07-28 3:49 ` Damien Le Moal
2025-07-28 7:14 ` Yu Kuai
2025-07-28 7:44 ` Damien Le Moal
2025-07-28 9:02 ` Yu Kuai
2025-07-29 4:23 ` Martin K. Petersen
2025-07-29 6:25 ` Yu Kuai
2025-07-29 22:02 ` Tony Battersby
2025-07-29 6:13 ` Hannes Reinecke
2025-07-29 6:29 ` Yu Kuai
2025-07-29 22:24 ` Keith Busch
2025-07-28 10:56 ` Csordás Hunor
2025-07-29 4:08 ` Martin K. Petersen
2025-07-29 3:53 ` Martin K. Petersen
2025-07-29 3:49 ` Martin K. Petersen
2025-07-29 4:44 ` Martin K. Petersen