* [PATCH 1/1] dm-stripe: adjust max_hw_discard_sectors to avoid unnecessary discard bio splitting @ 2025-12-08 10:07 Yongpeng Yang 2025-12-10 6:07 ` Benjamin Marzinski 0 siblings, 1 reply; 5+ messages in thread From: Yongpeng Yang @ 2025-12-08 10:07 UTC (permalink / raw) To: Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Benjamin Marzinski Cc: dm-devel, Yongpeng Yang From: Yongpeng Yang <yangyongpeng@xiaomi.com> Currently, the max_hw_discard_sectors of a stripe target is set to the minimum max_hw_discard_sectors among all sub devices. When the discard bio is larger than max_hw_discard_sectors, this may cause the stripe device to split discard bios unnecessarily, because the value of max_hw_discard_sectors affects max_discard_sectors, which equal to min(max_hw_discard_sectors, max_user_discard_sectors). For example: root@vm:~# echo '0 33554432 striped 2 256 /dev/vdd 0 /dev/vde 0' | dmsetup create stripe_dev root@vm:~# cat /sys/block/dm-1/queue/discard_max_bytes 536870912 root@vm:~# cat /sys/block/dm-1/slaves/vdd/queue/discard_max_bytes 536870912 root@vm:~# blkdiscard -o 0 -l 1073741824 -p 1073741824 /dev/mapper/stripe_dev dm-1 is the stripe device, and its discard_max_bytes is equal to each sub device’s discard_max_bytes. Since the requested discard length exceeds discard_max_bytes, the block layer splits the discard bio: block_bio_queue: 252,1 DS 0 + 2097152 [blkdiscard] block_split: 252,1 DS 0 / 1048576 [blkdiscard] block_rq_issue: 253,48 DS 268435456 () 0 + 524288 be,0,4 [blkdiscard] block_bio_queue: 253,64 DS 524288 + 524288 [blkdiscard] However, both vdd and vde can actually handle a discard bio of 536870912 bytes, so this split is not necessary. This patch updates the stripe target’s q->limits.max_hw_discard_sectors to be the minimum max_hw_discard_sectors of the sub devices multiplied by the # of stripe devices. This enables the stripe device to handle larger discard bios without incurring unnecessary splitting. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> --- drivers/md/dm-stripe.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c index 1461dc740dae..799d6def699b 100644 --- a/drivers/md/dm-stripe.c +++ b/drivers/md/dm-stripe.c @@ -38,6 +38,9 @@ struct stripe_c { uint32_t chunk_size; int chunk_size_shift; + /* The minimum max_hw_discard_sectors of all sub devices. */ + unsigned int max_hw_discard_sectors; + /* Needed for handling events */ struct dm_target *ti; @@ -169,6 +172,8 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv) * Get the stripe destinations. */ for (i = 0; i < stripes; i++) { + struct request_queue *q; + argv += 2; r = get_stripe(ti, sc, i, argv); @@ -180,6 +185,13 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv) return r; } atomic_set(&(sc->stripe[i].error_count), 0); + + q = bdev_get_queue(sc->stripe[i].dev->bdev); + if (i == 0) + sc->max_hw_discard_sectors = q->limits.max_hw_discard_sectors; + else + sc->max_hw_discard_sectors = min_not_zero(sc->max_hw_discard_sectors, + q->limits.max_hw_discard_sectors); } ti->private = sc; @@ -456,7 +468,7 @@ static void stripe_io_hints(struct dm_target *ti, struct queue_limits *limits) { struct stripe_c *sc = ti->private; - unsigned int io_min, io_opt; + unsigned int io_min, io_opt, max_hw_discard_sectors; limits->chunk_sectors = sc->chunk_size; @@ -465,6 +477,8 @@ static void stripe_io_hints(struct dm_target *ti, limits->io_min = io_min; limits->io_opt = io_opt; } + if (!check_mul_overflow(sc->max_hw_discard_sectors, sc->stripes, &max_hw_discard_sectors)) + limits->max_hw_discard_sectors = max_hw_discard_sectors; } static struct target_type stripe_target = { -- 2.43.0 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH 1/1] dm-stripe: adjust max_hw_discard_sectors to avoid unnecessary discard bio splitting 2025-12-08 10:07 [PATCH 1/1] dm-stripe: adjust max_hw_discard_sectors to avoid unnecessary discard bio splitting Yongpeng Yang @ 2025-12-10 6:07 ` Benjamin Marzinski 2025-12-11 7:03 ` Yongpeng Yang 0 siblings, 1 reply; 5+ messages in thread From: Benjamin Marzinski @ 2025-12-10 6:07 UTC (permalink / raw) To: Yongpeng Yang Cc: Alasdair Kergon, Mike Snitzer, Mikulas Patocka, dm-devel, Yongpeng Yang On Mon, Dec 08, 2025 at 06:07:14PM +0800, Yongpeng Yang wrote: > From: Yongpeng Yang <yangyongpeng@xiaomi.com> > > Currently, the max_hw_discard_sectors of a stripe target is set to the > minimum max_hw_discard_sectors among all sub devices. When the discard > bio is larger than max_hw_discard_sectors, this may cause the stripe > device to split discard bios unnecessarily, because the value of > max_hw_discard_sectors affects max_discard_sectors, which equal to > min(max_hw_discard_sectors, max_user_discard_sectors). > > For example: > root@vm:~# echo '0 33554432 striped 2 256 /dev/vdd 0 /dev/vde 0' | dmsetup create stripe_dev > root@vm:~# cat /sys/block/dm-1/queue/discard_max_bytes > 536870912 > root@vm:~# cat /sys/block/dm-1/slaves/vdd/queue/discard_max_bytes > 536870912 > root@vm:~# blkdiscard -o 0 -l 1073741824 -p 1073741824 /dev/mapper/stripe_dev > > dm-1 is the stripe device, and its discard_max_bytes is equal to > each sub device’s discard_max_bytes. Since the requested discard > length exceeds discard_max_bytes, the block layer splits the discard bio: > > block_bio_queue: 252,1 DS 0 + 2097152 [blkdiscard] > block_split: 252,1 DS 0 / 1048576 [blkdiscard] > block_rq_issue: 253,48 DS 268435456 () 0 + 524288 be,0,4 [blkdiscard] > block_bio_queue: 253,64 DS 524288 + 524288 [blkdiscard] > > However, both vdd and vde can actually handle a discard bio of 536870912 > bytes, so this split is not necessary. > > This patch updates the stripe target’s q->limits.max_hw_discard_sectors > to be the minimum max_hw_discard_sectors of the sub devices multiplied > by the # of stripe devices. This enables the stripe device to handle > larger discard bios without incurring unnecessary splitting. > > Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> > --- > drivers/md/dm-stripe.c | 16 +++++++++++++++- > 1 file changed, 15 insertions(+), 1 deletion(-) > > diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c > index 1461dc740dae..799d6def699b 100644 > --- a/drivers/md/dm-stripe.c > +++ b/drivers/md/dm-stripe.c > @@ -38,6 +38,9 @@ struct stripe_c { > uint32_t chunk_size; > int chunk_size_shift; > > + /* The minimum max_hw_discard_sectors of all sub devices. */ > + unsigned int max_hw_discard_sectors; > + > /* Needed for handling events */ > struct dm_target *ti; > > @@ -169,6 +172,8 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv) > * Get the stripe destinations. > */ > for (i = 0; i < stripes; i++) { > + struct request_queue *q; > + > argv += 2; > > r = get_stripe(ti, sc, i, argv); > @@ -180,6 +185,13 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv) > return r; > } > atomic_set(&(sc->stripe[i].error_count), 0); > + > + q = bdev_get_queue(sc->stripe[i].dev->bdev); > + if (i == 0) > + sc->max_hw_discard_sectors = q->limits.max_hw_discard_sectors; > + else > + sc->max_hw_discard_sectors = min_not_zero(sc->max_hw_discard_sectors, > + q->limits.max_hw_discard_sectors); I don't think any of the above is necessary. When stripe_io_hints() is called, dm_set_device_limits() will already have been called on all the underlying stripe devices to combine the limits. So limits->max_hw_discard_sectors should already be set to the same value as you're computing for sc->max_hw_discard_sectors. Right? > } > > ti->private = sc; > @@ -456,7 +468,7 @@ static void stripe_io_hints(struct dm_target *ti, > struct queue_limits *limits) > { > struct stripe_c *sc = ti->private; > - unsigned int io_min, io_opt; > + unsigned int io_min, io_opt, max_hw_discard_sectors; > > limits->chunk_sectors = sc->chunk_size; > > @@ -465,6 +477,8 @@ static void stripe_io_hints(struct dm_target *ti, > limits->io_min = io_min; > limits->io_opt = io_opt; > } > + if (!check_mul_overflow(sc->max_hw_discard_sectors, sc->stripes, &max_hw_discard_sectors)) sc->max_hw_discard_sectors should be the same as limits->max_hw_discard_sectors here > + limits->max_hw_discard_sectors = max_hw_discard_sectors; I see a couple of issues with this calculation. First, this only works if max_hw_discard_sectors is greater than sc->chunk_size. But more than that, before you multiply the original limit by the number of stripes, you must round it down to a multiple of chunk_size. Otherwise, you can end up writing too much to some of the underlying devices. To show this with simple numbers, imagine you had 3 stripes that with a chunk_size of 4 and all the underlying devices had a max_hw_discard_sectors of 5. You would set max_hw_discard_sectors to 5 * 3 = 15. But if you discared 15 sectors from the beginning of the device, you would end up discarding the first 4 sectors of each underlying device, and then loop around discard 3 more sectors of the first device. This means that you would discard 7 from the first device, instead of the 5 that it could handle. Rounding max_hw_discard_sectors down to a multiple of chunk_size will fix that. Lastly, if you do overflow when multiplying max_hw_discard_sectors by the number of stripes, you should probably just set limits->max_hw_discard_sectors to UINT_MAX >> SECTOR_SHIFT, instead of leaving it at what it was. -Ben > } > > static struct target_type stripe_target = { > -- > 2.43.0 ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 1/1] dm-stripe: adjust max_hw_discard_sectors to avoid unnecessary discard bio splitting 2025-12-10 6:07 ` Benjamin Marzinski @ 2025-12-11 7:03 ` Yongpeng Yang 2025-12-11 23:13 ` Benjamin Marzinski 0 siblings, 1 reply; 5+ messages in thread From: Yongpeng Yang @ 2025-12-11 7:03 UTC (permalink / raw) To: Benjamin Marzinski, Yongpeng Yang Cc: Alasdair Kergon, Mike Snitzer, Mikulas Patocka, dm-devel, Yongpeng Yang On 12/10/25 14:07, Benjamin Marzinski wrote: > On Mon, Dec 08, 2025 at 06:07:14PM +0800, Yongpeng Yang wrote: >> From: Yongpeng Yang <yangyongpeng@xiaomi.com> >> >> Currently, the max_hw_discard_sectors of a stripe target is set to the >> minimum max_hw_discard_sectors among all sub devices. When the discard >> bio is larger than max_hw_discard_sectors, this may cause the stripe >> device to split discard bios unnecessarily, because the value of >> max_hw_discard_sectors affects max_discard_sectors, which equal to >> min(max_hw_discard_sectors, max_user_discard_sectors). >> >> For example: >> root@vm:~# echo '0 33554432 striped 2 256 /dev/vdd 0 /dev/vde 0' | dmsetup create stripe_dev >> root@vm:~# cat /sys/block/dm-1/queue/discard_max_bytes >> 536870912 >> root@vm:~# cat /sys/block/dm-1/slaves/vdd/queue/discard_max_bytes >> 536870912 >> root@vm:~# blkdiscard -o 0 -l 1073741824 -p 1073741824 /dev/mapper/stripe_dev >> >> dm-1 is the stripe device, and its discard_max_bytes is equal to >> each sub device’s discard_max_bytes. Since the requested discard >> length exceeds discard_max_bytes, the block layer splits the discard bio: >> >> block_bio_queue: 252,1 DS 0 + 2097152 [blkdiscard] >> block_split: 252,1 DS 0 / 1048576 [blkdiscard] >> block_rq_issue: 253,48 DS 268435456 () 0 + 524288 be,0,4 [blkdiscard] >> block_bio_queue: 253,64 DS 524288 + 524288 [blkdiscard] >> >> However, both vdd and vde can actually handle a discard bio of 536870912 >> bytes, so this split is not necessary. >> >> This patch updates the stripe target’s q->limits.max_hw_discard_sectors >> to be the minimum max_hw_discard_sectors of the sub devices multiplied >> by the # of stripe devices. This enables the stripe device to handle >> larger discard bios without incurring unnecessary splitting. >> >> Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> >> --- >> drivers/md/dm-stripe.c | 16 +++++++++++++++- >> 1 file changed, 15 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c >> index 1461dc740dae..799d6def699b 100644 >> --- a/drivers/md/dm-stripe.c >> +++ b/drivers/md/dm-stripe.c >> @@ -38,6 +38,9 @@ struct stripe_c { >> uint32_t chunk_size; >> int chunk_size_shift; >> >> + /* The minimum max_hw_discard_sectors of all sub devices. */ >> + unsigned int max_hw_discard_sectors; >> + >> /* Needed for handling events */ >> struct dm_target *ti; >> >> @@ -169,6 +172,8 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv) >> * Get the stripe destinations. >> */ >> for (i = 0; i < stripes; i++) { >> + struct request_queue *q; >> + >> argv += 2; >> >> r = get_stripe(ti, sc, i, argv); >> @@ -180,6 +185,13 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv) >> return r; >> } >> atomic_set(&(sc->stripe[i].error_count), 0); >> + >> + q = bdev_get_queue(sc->stripe[i].dev->bdev); >> + if (i == 0) >> + sc->max_hw_discard_sectors = q->limits.max_hw_discard_sectors; >> + else >> + sc->max_hw_discard_sectors = min_not_zero(sc->max_hw_discard_sectors, >> + q->limits.max_hw_discard_sectors); > > I don't think any of the above is necessary. When stripe_io_hints() is > called, dm_set_device_limits() will already have been called on all the > underlying stripe devices to combine the limits. So > limits->max_hw_discard_sectors should already be set to the same value > as you're computing for sc->max_hw_discard_sectors. Right? dm_set_device_limits() initializes the stack-local ti_limits defined in dm_calculate_queue_limits(), and stripe_io_hints() modifies this same variable as well. These updates do not affect the stripe device’s q->limits until dm_calculate_queue_limits() later calls blk_stack_limits(). Therefore, the q->limits.max_hw_discard_sectors of each sub device in a stripe target is never modified and it is necessary to record the minimum q->limits.max_hw_discard_sectors across all sub devices. > >> } >> >> ti->private = sc; >> @@ -456,7 +468,7 @@ static void stripe_io_hints(struct dm_target *ti, >> struct queue_limits *limits) >> { >> struct stripe_c *sc = ti->private; >> - unsigned int io_min, io_opt; >> + unsigned int io_min, io_opt, max_hw_discard_sectors; >> >> limits->chunk_sectors = sc->chunk_size; >> >> @@ -465,6 +477,8 @@ static void stripe_io_hints(struct dm_target *ti, >> limits->io_min = io_min; >> limits->io_opt = io_opt; >> } >> + if (!check_mul_overflow(sc->max_hw_discard_sectors, sc->stripes, &max_hw_discard_sectors)) > > sc->max_hw_discard_sectors should be the same as > limits->max_hw_discard_sectors here > >> + limits->max_hw_discard_sectors = max_hw_discard_sectors; > > I see a couple of issues with this calculation. First, this only works > if max_hw_discard_sectors is greater than sc->chunk_size. But more than > that, before you multiply the original limit by the number of stripes, > you must round it down to a multiple of chunk_size. Otherwise, you can > end up writing too much to some of the underlying devices. To show this > with simple numbers, imagine you had 3 stripes that with a chunk_size of > 4 and all the underlying devices had a max_hw_discard_sectors of 5. You > would set max_hw_discard_sectors to 5 * 3 = 15. But if you discared 15 > sectors from the beginning of the device, you would end up discarding > the first 4 sectors of each underlying device, and then loop around > discard 3 more sectors of the first device. This means that you would > discard 7 from the first device, instead of the 5 that it could handle. > Rounding max_hw_discard_sectors down to a multiple of chunk_size will > fix that. The splitting of discard bios is based on limits->max_discard_sectors, which is independent of the stripe size. As a result, the discard bio size passed to stripe_map->stripe_map_range() may exceed `sc->chunk_size * sc->stripes`. In the example above, stripe_map_range() is invoked three times, and each invocation reduces bio->bi_iter.bi_size to 5 sectors for each sub devices. > > Lastly, if you do overflow when multiplying max_hw_discard_sectors by > the number of stripes, you should probably just set > limits->max_hw_discard_sectors to UINT_MAX >> SECTOR_SHIFT, instead of > leaving it at what it was. Yes, I overlooked that. I’ll fix it in v2. Thanks Yongpeng, > > -Ben > >> } >> >> static struct target_type stripe_target = { >> -- >> 2.43.0 > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 1/1] dm-stripe: adjust max_hw_discard_sectors to avoid unnecessary discard bio splitting 2025-12-11 7:03 ` Yongpeng Yang @ 2025-12-11 23:13 ` Benjamin Marzinski 2025-12-12 12:11 ` Yongpeng Yang 0 siblings, 1 reply; 5+ messages in thread From: Benjamin Marzinski @ 2025-12-11 23:13 UTC (permalink / raw) To: Yongpeng Yang Cc: Yongpeng Yang, Alasdair Kergon, Mike Snitzer, Mikulas Patocka, dm-devel, Yongpeng Yang Resending, since somehow, I didn't reply-to-all last time. On Thu, Dec 11, 2025 at 03:03:12PM +0800, Yongpeng Yang wrote: > On 12/10/25 14:07, Benjamin Marzinski wrote: > > On Mon, Dec 08, 2025 at 06:07:14PM +0800, Yongpeng Yang wrote: > >> From: Yongpeng Yang <yangyongpeng@xiaomi.com> > >> > >> Currently, the max_hw_discard_sectors of a stripe target is set to the > >> minimum max_hw_discard_sectors among all sub devices. When the discard > >> bio is larger than max_hw_discard_sectors, this may cause the stripe > >> device to split discard bios unnecessarily, because the value of > >> max_hw_discard_sectors affects max_discard_sectors, which equal to > >> min(max_hw_discard_sectors, max_user_discard_sectors). > >> > >> For example: > >> root@vm:~# echo '0 33554432 striped 2 256 /dev/vdd 0 /dev/vde 0' | dmsetup create stripe_dev > >> root@vm:~# cat /sys/block/dm-1/queue/discard_max_bytes > >> 536870912 > >> root@vm:~# cat /sys/block/dm-1/slaves/vdd/queue/discard_max_bytes > >> 536870912 > >> root@vm:~# blkdiscard -o 0 -l 1073741824 -p 1073741824 /dev/mapper/stripe_dev > >> > >> dm-1 is the stripe device, and its discard_max_bytes is equal to > >> each sub device’s discard_max_bytes. Since the requested discard > >> length exceeds discard_max_bytes, the block layer splits the discard bio: > >> > >> block_bio_queue: 252,1 DS 0 + 2097152 [blkdiscard] > >> block_split: 252,1 DS 0 / 1048576 [blkdiscard] > >> block_rq_issue: 253,48 DS 268435456 () 0 + 524288 be,0,4 [blkdiscard] > >> block_bio_queue: 253,64 DS 524288 + 524288 [blkdiscard] > >> > >> However, both vdd and vde can actually handle a discard bio of 536870912 > >> bytes, so this split is not necessary. > >> > >> This patch updates the stripe target’s q->limits.max_hw_discard_sectors > >> to be the minimum max_hw_discard_sectors of the sub devices multiplied > >> by the # of stripe devices. This enables the stripe device to handle > >> larger discard bios without incurring unnecessary splitting. > >> > >> Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> > >> --- > >> drivers/md/dm-stripe.c | 16 +++++++++++++++- > >> 1 file changed, 15 insertions(+), 1 deletion(-) > >> > >> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c > >> index 1461dc740dae..799d6def699b 100644 > >> --- a/drivers/md/dm-stripe.c > >> +++ b/drivers/md/dm-stripe.c > >> @@ -38,6 +38,9 @@ struct stripe_c { > >> uint32_t chunk_size; > >> int chunk_size_shift; > >> > >> + /* The minimum max_hw_discard_sectors of all sub devices. */ > >> + unsigned int max_hw_discard_sectors; > >> + > >> /* Needed for handling events */ > >> struct dm_target *ti; > >> > >> @@ -169,6 +172,8 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv) > >> * Get the stripe destinations. > >> */ > >> for (i = 0; i < stripes; i++) { > >> + struct request_queue *q; > >> + > >> argv += 2; > >> > >> r = get_stripe(ti, sc, i, argv); > >> @@ -180,6 +185,13 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv) > >> return r; > >> } > >> atomic_set(&(sc->stripe[i].error_count), 0); > >> + > >> + q = bdev_get_queue(sc->stripe[i].dev->bdev); > >> + if (i == 0) > >> + sc->max_hw_discard_sectors = q->limits.max_hw_discard_sectors; > >> + else > >> + sc->max_hw_discard_sectors = min_not_zero(sc->max_hw_discard_sectors, > >> + q->limits.max_hw_discard_sectors); > > > > I don't think any of the above is necessary. When stripe_io_hints() is > > called, dm_set_device_limits() will already have been called on all the > > underlying stripe devices to combine the limits. So > > limits->max_hw_discard_sectors should already be set to the same value > > as you're computing for sc->max_hw_discard_sectors. Right? > > dm_set_device_limits() initializes the stack-local ti_limits defined in > dm_calculate_queue_limits(), and stripe_io_hints() modifies this same > variable as well. Yep. That's why it's o.k. to use limits->max_hw_discard_sectors in stripe_io_hints(). dm_set_device_limits() -> blk_stack_limits() is called on all of the underlying stripe devices before stripe_io_hints() is called. blk_stack_limits() includes the line t->max_hw_discard_sectors = min_not_zero(t->max_hw_discard_sectors, b->max_hw_discard_sectors); Which updates ti_limits.max_hw_discard_sectors. ti_limits is then passed into stripe_io_hints(), and has the same value that you compute in stripe_ctr(). > These updates do not affect the stripe device’s > q->limits until dm_calculate_queue_limits() later calls > blk_stack_limits(). Actually, we don't update q->limits for the stripe device itself until later, via __bind() -> dm_table_set_restrictions() -> queue_limits_commit_update(). But we're not checking the queue limits of the stripe device. We are checking the queue limits of the underlying devices. Those devices are completely set up before we create the dm device on top of them. > Therefore, the q->limits.max_hw_discard_sectors of > each sub device in a stripe target is never modified and it is necessary > to record the minimum q->limits.max_hw_discard_sectors > across all sub devices. You are checking the same underlying q->limits.max_hw_discard of each sub device in stripe_ctr(). stripe_ctr() is called when we set up the table, well before we call dm_calculate_queue_limits() as part of making the table active in dm_swap_table(). So if the q->limits.max_hw_discard of the underlying devices is correct in stripe_ctr(), it surely must be correct when dm_set_device_limits() is called on them. If you don't believe be try adding DMERR("%u vs %u", sc->max_hw_discard_sectors, limits->max_hw_discard_sectors); to stripe_io_hints(). It should show that they have the same number. If it doesn't, I apologize for all this noise. > > > >> } > >> > >> ti->private = sc; > >> @@ -456,7 +468,7 @@ static void stripe_io_hints(struct dm_target *ti, > >> struct queue_limits *limits) > >> { > >> struct stripe_c *sc = ti->private; > >> - unsigned int io_min, io_opt; > >> + unsigned int io_min, io_opt, max_hw_discard_sectors; > >> > >> limits->chunk_sectors = sc->chunk_size; > >> > >> @@ -465,6 +477,8 @@ static void stripe_io_hints(struct dm_target *ti, > >> limits->io_min = io_min; > >> limits->io_opt = io_opt; > >> } > >> + if (!check_mul_overflow(sc->max_hw_discard_sectors, sc->stripes, &max_hw_discard_sectors)) > > > > sc->max_hw_discard_sectors should be the same as > > limits->max_hw_discard_sectors here > > > >> + limits->max_hw_discard_sectors = max_hw_discard_sectors; > > > > I see a couple of issues with this calculation. First, this only works > > if max_hw_discard_sectors is greater than sc->chunk_size. But more than > > that, before you multiply the original limit by the number of stripes, > > you must round it down to a multiple of chunk_size. Otherwise, you can > > end up writing too much to some of the underlying devices. To show this > > with simple numbers, imagine you had 3 stripes that with a chunk_size of > > 4 and all the underlying devices had a max_hw_discard_sectors of 5. You > > would set max_hw_discard_sectors to 5 * 3 = 15. But if you discared 15 > > sectors from the beginning of the device, you would end up discarding > > the first 4 sectors of each underlying device, and then loop around > > discard 3 more sectors of the first device. This means that you would > > discard 7 from the first device, instead of the 5 that it could handle. > > Rounding max_hw_discard_sectors down to a multiple of chunk_size will > > fix that. > > The splitting of discard bios is based on limits->max_discard_sectors, > which is independent of the stripe size. As a result, the discard bio > size passed to stripe_map->stripe_map_range() may exceed > `sc->chunk_size * sc->stripes`. Of course this is true, but my example still holds if you had a max_hw_discard_sectors of 9 instead. 9 * 3 = 27, so you would discard the first 8 sectors (two chunks) of each stripe device, and then 3 more sectors of the first stripe device. This means that you would send a discard of 11 sectors to the first device, instead of the 9 that it can handle at once. > In the example above, stripe_map_range() > is invoked three times, and each invocation reduces bio->bi_iter.bi_size > to 5 sectors for each sub devices. If it did that, that would be a bug. You were intending to discard the first 15 sectors of the dm device, but instead, you would be discarding the first 13 sectors (0-12), plus sector 16 and sector 20. It's easy to test my original thought experiment using a scsi debug device with a discard granularity of 1 and a max_hw_discard_sectors of 5, and a stripe device with a 4 sector stripe size. Here's the test: # modprobe scsi_debug dev_size_mb=128 max_luns=3 lbpu=1 unmap_max_blocks=5 # echo '0 786432 striped 3 4 /dev/sda 0 /dev/sdb 0 /dev/sdc 0' | create stripe_dev # blkdiscard -o 0 -l 7680 /dev/mapper/stripe_dev And here's the perf results: block:block_bio_queue: 253,1 DS 0 + 15 [blkdiscard] block:block_bio_remap: 8,0 DS 0 + 7 <- (253,1) 0 block:block_bio_queue: 8,0 DS 0 + 7 [blkdiscard] block:block_bio_remap: 8,16 DS 0 + 4 <- (253,1) 0 block:block_bio_queue: 8,16 DS 0 + 4 [blkdiscard] block:block_bio_remap: 8,32 DS 0 + 4 <- (253,1) 0 block:block_bio_queue: 8,32 DS 0 + 4 [blkdiscard] block:block_split: 8,0 DS 0 / 5 [blkdiscard] block:block_rq_issue: 8,0 DS 2560 () 0 + 5 0x2,0,4 [kworker/5:1H] block:block_rq_issue: 8,0 DS 1024 () 5 + 2 0x2,0,4 [kworker/5:1H] block:block_rq_issue: 8,16 DS 2048 () 0 + 4 0x2,0,4 [kworker/5:1H] block:block_rq_issue: 8,32 DS 2048 () 0 + 4 0x2,0,4 [kworker/5:1H] As you can see, it sends 7 sectors down to sda (8:0), which is too large, so it needs to get split by the scsi device. -Ben > > > > Lastly, if you do overflow when multiplying max_hw_discard_sectors by > > the number of stripes, you should probably just set > > limits->max_hw_discard_sectors to UINT_MAX >> SECTOR_SHIFT, instead of > > leaving it at what it was. > > Yes, I overlooked that. I’ll fix it in v2. > > Thanks > Yongpeng, > > > > > -Ben > > > >> } > >> > >> static struct target_type stripe_target = { > >> -- > >> 2.43.0 > > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 1/1] dm-stripe: adjust max_hw_discard_sectors to avoid unnecessary discard bio splitting 2025-12-11 23:13 ` Benjamin Marzinski @ 2025-12-12 12:11 ` Yongpeng Yang 0 siblings, 0 replies; 5+ messages in thread From: Yongpeng Yang @ 2025-12-12 12:11 UTC (permalink / raw) To: Benjamin Marzinski, Yongpeng Yang Cc: Yongpeng Yang, Alasdair Kergon, Mike Snitzer, Mikulas Patocka, dm-devel, Yongpeng Yang On 12/12/25 07:13, Benjamin Marzinski wrote: > Resending, since somehow, I didn't reply-to-all last time. > > On Thu, Dec 11, 2025 at 03:03:12PM +0800, Yongpeng Yang wrote: >> On 12/10/25 14:07, Benjamin Marzinski wrote: >>> On Mon, Dec 08, 2025 at 06:07:14PM +0800, Yongpeng Yang wrote: >>>> From: Yongpeng Yang <yangyongpeng@xiaomi.com> >>>> >>>> Currently, the max_hw_discard_sectors of a stripe target is set to the >>>> minimum max_hw_discard_sectors among all sub devices. When the discard >>>> bio is larger than max_hw_discard_sectors, this may cause the stripe >>>> device to split discard bios unnecessarily, because the value of >>>> max_hw_discard_sectors affects max_discard_sectors, which equal to >>>> min(max_hw_discard_sectors, max_user_discard_sectors). >>>> >>>> For example: >>>> root@vm:~# echo '0 33554432 striped 2 256 /dev/vdd 0 /dev/vde 0' | dmsetup create stripe_dev >>>> root@vm:~# cat /sys/block/dm-1/queue/discard_max_bytes >>>> 536870912 >>>> root@vm:~# cat /sys/block/dm-1/slaves/vdd/queue/discard_max_bytes >>>> 536870912 >>>> root@vm:~# blkdiscard -o 0 -l 1073741824 -p 1073741824 /dev/mapper/stripe_dev >>>> >>>> dm-1 is the stripe device, and its discard_max_bytes is equal to >>>> each sub device’s discard_max_bytes. Since the requested discard >>>> length exceeds discard_max_bytes, the block layer splits the discard bio: >>>> >>>> block_bio_queue: 252,1 DS 0 + 2097152 [blkdiscard] >>>> block_split: 252,1 DS 0 / 1048576 [blkdiscard] >>>> block_rq_issue: 253,48 DS 268435456 () 0 + 524288 be,0,4 [blkdiscard] >>>> block_bio_queue: 253,64 DS 524288 + 524288 [blkdiscard] >>>> >>>> However, both vdd and vde can actually handle a discard bio of 536870912 >>>> bytes, so this split is not necessary. >>>> >>>> This patch updates the stripe target’s q->limits.max_hw_discard_sectors >>>> to be the minimum max_hw_discard_sectors of the sub devices multiplied >>>> by the # of stripe devices. This enables the stripe device to handle >>>> larger discard bios without incurring unnecessary splitting. >>>> >>>> Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> >>>> --- >>>> drivers/md/dm-stripe.c | 16 +++++++++++++++- >>>> 1 file changed, 15 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c >>>> index 1461dc740dae..799d6def699b 100644 >>>> --- a/drivers/md/dm-stripe.c >>>> +++ b/drivers/md/dm-stripe.c >>>> @@ -38,6 +38,9 @@ struct stripe_c { >>>> uint32_t chunk_size; >>>> int chunk_size_shift; >>>> >>>> + /* The minimum max_hw_discard_sectors of all sub devices. */ >>>> + unsigned int max_hw_discard_sectors; >>>> + >>>> /* Needed for handling events */ >>>> struct dm_target *ti; >>>> >>>> @@ -169,6 +172,8 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv) >>>> * Get the stripe destinations. >>>> */ >>>> for (i = 0; i < stripes; i++) { >>>> + struct request_queue *q; >>>> + >>>> argv += 2; >>>> >>>> r = get_stripe(ti, sc, i, argv); >>>> @@ -180,6 +185,13 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv) >>>> return r; >>>> } >>>> atomic_set(&(sc->stripe[i].error_count), 0); >>>> + >>>> + q = bdev_get_queue(sc->stripe[i].dev->bdev); >>>> + if (i == 0) >>>> + sc->max_hw_discard_sectors = q->limits.max_hw_discard_sectors; >>>> + else >>>> + sc->max_hw_discard_sectors = min_not_zero(sc->max_hw_discard_sectors, >>>> + q->limits.max_hw_discard_sectors); >>> >>> I don't think any of the above is necessary. When stripe_io_hints() is >>> called, dm_set_device_limits() will already have been called on all the >>> underlying stripe devices to combine the limits. So >>> limits->max_hw_discard_sectors should already be set to the same value >>> as you're computing for sc->max_hw_discard_sectors. Right? >> >> dm_set_device_limits() initializes the stack-local ti_limits defined in >> dm_calculate_queue_limits(), and stripe_io_hints() modifies this same >> variable as well. > > Yep. That's why it's o.k. to use limits->max_hw_discard_sectors in > stripe_io_hints(). dm_set_device_limits() -> blk_stack_limits() > is called on all of the underlying stripe devices before > stripe_io_hints() is called. blk_stack_limits() includes the line > > t->max_hw_discard_sectors = min_not_zero(t->max_hw_discard_sectors, > b->max_hw_discard_sectors); > > Which updates ti_limits.max_hw_discard_sectors. ti_limits is then passed > into stripe_io_hints(), and has the same value that you compute in > stripe_ctr(). > >> These updates do not affect the stripe device’s >> q->limits until dm_calculate_queue_limits() later calls >> blk_stack_limits(). > > Actually, we don't update q->limits for the stripe device itself until > later, via __bind() -> dm_table_set_restrictions() -> > queue_limits_commit_update(). But we're not checking the queue limits of > the stripe device. We are checking the queue limits of the underlying > devices. Those devices are completely set up before we create the dm > device on top of them. > >> Therefore, the q->limits.max_hw_discard_sectors of >> each sub device in a stripe target is never modified and it is necessary >> to record the minimum q->limits.max_hw_discard_sectors >> across all sub devices. > > You are checking the same underlying q->limits.max_hw_discard of > each sub device in stripe_ctr(). stripe_ctr() is called when we set up > the table, well before we call dm_calculate_queue_limits() as part > of making the table active in dm_swap_table(). So if the > q->limits.max_hw_discard of the underlying devices is correct in > stripe_ctr(), it surely must be correct when dm_set_device_limits() > is called on them. > > If you don't believe be try adding > DMERR("%u vs %u", sc->max_hw_discard_sectors, limits->max_hw_discard_sectors); > > to stripe_io_hints(). It should show that they have the same number. If > it doesn't, I apologize for all this noise. > You are right. I was mistaken. Thanks for the patient explanation. We don't need to add sc->max_hw_discard_sectors. >>> >>>> } >>>> >>>> ti->private = sc; >>>> @@ -456,7 +468,7 @@ static void stripe_io_hints(struct dm_target *ti, >>>> struct queue_limits *limits) >>>> { >>>> struct stripe_c *sc = ti->private; >>>> - unsigned int io_min, io_opt; >>>> + unsigned int io_min, io_opt, max_hw_discard_sectors; >>>> >>>> limits->chunk_sectors = sc->chunk_size; >>>> >>>> @@ -465,6 +477,8 @@ static void stripe_io_hints(struct dm_target *ti, >>>> limits->io_min = io_min; >>>> limits->io_opt = io_opt; >>>> } >>>> + if (!check_mul_overflow(sc->max_hw_discard_sectors, sc->stripes, &max_hw_discard_sectors)) >>> >>> sc->max_hw_discard_sectors should be the same as >>> limits->max_hw_discard_sectors here >>> >>>> + limits->max_hw_discard_sectors = max_hw_discard_sectors; >>> >>> I see a couple of issues with this calculation. First, this only works >>> if max_hw_discard_sectors is greater than sc->chunk_size. But more than >>> that, before you multiply the original limit by the number of stripes, >>> you must round it down to a multiple of chunk_size. Otherwise, you can >>> end up writing too much to some of the underlying devices. To show this >>> with simple numbers, imagine you had 3 stripes that with a chunk_size of >>> 4 and all the underlying devices had a max_hw_discard_sectors of 5. You >>> would set max_hw_discard_sectors to 5 * 3 = 15. But if you discared 15 >>> sectors from the beginning of the device, you would end up discarding >>> the first 4 sectors of each underlying device, and then loop around >>> discard 3 more sectors of the first device. This means that you would >>> discard 7 from the first device, instead of the 5 that it could handle. >>> Rounding max_hw_discard_sectors down to a multiple of chunk_size will >>> fix that. Yes, Rounding down is right, I'll fix this in v3. >> >> The splitting of discard bios is based on limits->max_discard_sectors, >> which is independent of the stripe size. As a result, the discard bio >> size passed to stripe_map->stripe_map_range() may exceed >> `sc->chunk_size * sc->stripes`. > > Of course this is true, but my example still holds if you had a > max_hw_discard_sectors of 9 instead. 9 * 3 = 27, so you would discard > the first 8 sectors (two chunks) of each stripe device, and then 3 more > sectors of the first stripe device. This means that you would send a > discard of 11 sectors to the first device, instead of the 9 that it can > handle at once. > >> In the example above, stripe_map_range() >> is invoked three times, and each invocation reduces bio->bi_iter.bi_size >> to 5 sectors for each sub devices. > > If it did that, that would be a bug. You were intending to discard > the first 15 sectors of the dm device, but instead, you would be > discarding the first 13 sectors (0-12), plus sector 16 and sector 20. > > It's easy to test my original thought experiment using a scsi debug > device with a discard granularity of 1 and a max_hw_discard_sectors of > 5, and a stripe device with a 4 sector stripe size. Here's the test: > > # modprobe scsi_debug dev_size_mb=128 max_luns=3 lbpu=1 unmap_max_blocks=5 > # echo '0 786432 striped 3 4 /dev/sda 0 /dev/sdb 0 /dev/sdc 0' | create stripe_dev > # blkdiscard -o 0 -l 7680 /dev/mapper/stripe_dev > > And here's the perf results: > > block:block_bio_queue: 253,1 DS 0 + 15 [blkdiscard] > block:block_bio_remap: 8,0 DS 0 + 7 <- (253,1) 0 > block:block_bio_queue: 8,0 DS 0 + 7 [blkdiscard] > block:block_bio_remap: 8,16 DS 0 + 4 <- (253,1) 0 > block:block_bio_queue: 8,16 DS 0 + 4 [blkdiscard] > block:block_bio_remap: 8,32 DS 0 + 4 <- (253,1) 0 > block:block_bio_queue: 8,32 DS 0 + 4 [blkdiscard] > block:block_split: 8,0 DS 0 / 5 [blkdiscard] > block:block_rq_issue: 8,0 DS 2560 () 0 + 5 0x2,0,4 [kworker/5:1H] > block:block_rq_issue: 8,0 DS 1024 () 5 + 2 0x2,0,4 [kworker/5:1H] > block:block_rq_issue: 8,16 DS 2048 () 0 + 4 0x2,0,4 [kworker/5:1H] > block:block_rq_issue: 8,32 DS 2048 () 0 + 4 0x2,0,4 [kworker/5:1H] > > As you can see, it sends 7 sectors down to sda (8:0), which is too > large, so it needs to get split by the scsi device. Yes, my previous calculation was incorrect. Thanks Yongpeng, > > -Ben > >>> >>> Lastly, if you do overflow when multiplying max_hw_discard_sectors by >>> the number of stripes, you should probably just set >>> limits->max_hw_discard_sectors to UINT_MAX >> SECTOR_SHIFT, instead of >>> leaving it at what it was. >> >> Yes, I overlooked that. I’ll fix it in v2. >> >> Thanks >> Yongpeng, >> >>> >>> -Ben >>> >>>> } >>>> >>>> static struct target_type stripe_target = { >>>> -- >>>> 2.43.0 >>> > ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-12-12 12:11 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-12-08 10:07 [PATCH 1/1] dm-stripe: adjust max_hw_discard_sectors to avoid unnecessary discard bio splitting Yongpeng Yang 2025-12-10 6:07 ` Benjamin Marzinski 2025-12-11 7:03 ` Yongpeng Yang 2025-12-11 23:13 ` Benjamin Marzinski 2025-12-12 12:11 ` Yongpeng Yang
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.