* [PATCH] dm: retain stacked max_sectors when setting queue_limits
@ 2024-05-22 2:51 Mike Snitzer
2024-05-22 14:24 ` Christoph Hellwig
2024-05-22 20:33 ` [PATCH] " Ewan Milne
0 siblings, 2 replies; 21+ messages in thread
From: Mike Snitzer @ 2024-05-22 2:51 UTC (permalink / raw)
To: dm-devel; +Cc: linux-block, hch, Marco Patalano, Ewan Milne
Otherwise, blk_validate_limits() will throw-away the max_sectors that
was stacked from underlying device(s). In doing so it can set a
max_sectors limit that violates underlying device limits.

This caused dm-multipath IO failures like the following because the
underlying devices' max_sectors were stacked up to be 1024, yet
blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
(2560):

[ 1214.673233] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 1214.673267] device-mapper: multipath: 254:3: Failing path 8:32.
[ 1214.675196] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 1214.675224] device-mapper: multipath: 254:3: Failing path 8:16.
[ 1214.675309] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 1214.675338] device-mapper: multipath: 254:3: Failing path 8:48.
[ 1214.675413] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 1214.675441] device-mapper: multipath: 254:3: Failing path 8:64.

The initial bug report included:

[ 13.822701] blk_insert_cloned_request: over max size limit. (248 > 128)
[ 13.829351] device-mapper: multipath: 253:3: Failing path 8:32.
[ 13.835307] blk_insert_cloned_request: over max size limit. (248 > 128)
[ 13.841928] device-mapper: multipath: 253:3: Failing path 65:16.
[ 13.844532] blk_insert_cloned_request: over max size limit. (248 > 128)
[ 13.854363] blk_insert_cloned_request: over max size limit. (248 > 128)
[ 13.854580] device-mapper: multipath: 253:4: Failing path 8:48.
[ 13.861166] device-mapper: multipath: 253:3: Failing path 8:192.

Reported-by: Marco Patalano <mpatalan@redhat.com>
Reported-by: Ewan Milne <emilne@redhat.com>
Fixes: 1c0e720228ad ("dm: use queue_limits_set")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 drivers/md/dm-table.c | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 88114719fe18..6463b4afeaa4 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1961,6 +1961,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 			      struct queue_limits *limits)
 {
 	bool wc = false, fua = false;
+	unsigned int max_hw_sectors;
 	int r;
 
 	if (dm_table_supports_nowait(t))
@@ -1981,9 +1982,16 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 	if (!dm_table_supports_secure_erase(t))
 		limits->max_secure_erase_sectors = 0;
 
+	/* Don't allow queue_limits_set() to throw-away stacked max_sectors */
+	max_hw_sectors = limits->max_hw_sectors;
+	limits->max_hw_sectors = limits->max_sectors;
 	r = queue_limits_set(q, limits);
 	if (r)
 		return r;
+	/* Restore stacked max_hw_sectors */
+	mutex_lock(&q->limits_lock);
+	limits->max_hw_sectors = max_hw_sectors;
+	mutex_unlock(&q->limits_lock);
 
 	if (dm_table_supports_flush(t, (1UL << QUEUE_FLAG_WC))) {
 		wc = true;
--
2.43.0
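
For context: the recalculation at issue happens inside blk_validate_limits(),
which derives max_sectors roughly as follows (condensed and paraphrased
sketch, not the verbatim kernel source):

	/* sketch of blk_validate_limits(), simplified */
	if (!lim->max_hw_sectors)
		lim->max_hw_sectors = BLK_SAFE_MAX_SECTORS;
	max_hw_sectors = min_not_zero(lim->max_hw_sectors,
				      lim->max_dev_sectors);
	if (lim->max_user_sectors)
		lim->max_sectors = min(max_hw_sectors, lim->max_user_sectors);
	else
		lim->max_sectors = min(max_hw_sectors, BLK_DEF_MAX_SECTORS_CAP);

Whatever max_sectors the DM stacking code computed beforehand is overwritten
here, which is why the patch above temporarily lowers max_hw_sectors to the
stacked max_sectors around the queue_limits_set() call.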

* Re: [PATCH] dm: retain stacked max_sectors when setting queue_limits
  2024-05-22  2:51 [PATCH] dm: retain stacked max_sectors when setting queue_limits Mike Snitzer
@ 2024-05-22 14:24 ` Christoph Hellwig
  2024-05-22 16:48   ` Mike Snitzer
  2024-05-22 20:33 ` [PATCH] " Ewan Milne
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2024-05-22 14:24 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, linux-block, hch, Marco Patalano, Ewan Milne

On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> Otherwise, blk_validate_limits() will throw-away the max_sectors that
> was stacked from underlying device(s). In doing so it can set a
> max_sectors limit that violates underlying device limits.

Hmm, yes it sort of is "throwing the limit away", but it really
recalculates it from max_hw_sectors, max_dev_sectors and user_max_sectors.

> This caused dm-multipath IO failures like the following because the
> underlying devices' max_sectors were stacked up to be 1024, yet
> blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
> (2560):

I suspect the problem is that SCSI messed directly with max_sectors instead
and ignores max_user_sectors (and really shouldn't touch either, but that's
a separate discussion). Can you try the patch below and maybe also provide
the sysfs output for max_sectors_kb and max_hw_sectors_kb for all involved
devices?

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 332eb9dac22d91..f6c822c9cbd2d3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3700,8 +3700,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
 	 */
 	if (sdkp->first_scan ||
 	    q->limits.max_sectors > q->limits.max_dev_sectors ||
-	    q->limits.max_sectors > q->limits.max_hw_sectors)
+	    q->limits.max_sectors > q->limits.max_hw_sectors) {
 		q->limits.max_sectors = rw_max;
+		q->limits.max_user_sectors = rw_max;
+	}
 
 	sdkp->first_scan = 0;

* Re: dm: retain stacked max_sectors when setting queue_limits
  2024-05-22 14:24 ` Christoph Hellwig
@ 2024-05-22 16:48   ` Mike Snitzer
  2024-05-22 17:37     ` Ewan Milne
                       ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Mike Snitzer @ 2024-05-22 16:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: dm-devel, linux-block, Marco Patalano, Ewan Milne

On Wed, May 22, 2024 at 04:24:58PM +0200, Christoph Hellwig wrote:
> On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> > Otherwise, blk_validate_limits() will throw-away the max_sectors that
> > was stacked from underlying device(s). In doing so it can set a
> > max_sectors limit that violates underlying device limits.
>
> Hmm, yes it sort of is "throwing the limit away", but it really
> recalculates it from max_hw_sectors, max_dev_sectors and user_max_sectors.

Yes, but it needs to do that recalculation at each level of a stacked
device.  And then we need to combine them via blk_stack_limits() -- as
is done with the various limits stacking loops in
drivers/md/dm-table.c:dm_calculate_queue_limits

> > This caused dm-multipath IO failures like the following because the
> > underlying devices' max_sectors were stacked up to be 1024, yet
> > blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
> > (2560):
>
> I suspect the problem is that SCSI messed directly with max_sectors instead
> and ignores max_user_sectors (and really shouldn't touch either, but that's
> a separate discussion). Can you try the patch below and maybe also provide
> the sysfs output for max_sectors_kb and max_hw_sectors_kb for all involved
> devices?

FYI, you can easily reproduce with:

git clone https://github.com/snitm/mptest.git
cd mptest
<edit so it uses: export MULTIPATH_BACKEND_MODULE="scsidebug">
./runtest ./tests/test_00_no_failure

Also, best to change this line:
./lib/mpath_generic: local _feature="4 queue_if_no_path retain_attached_hw_handler queue_mode $MULTIPATH_QUEUE_MODE"
to:
./lib/mpath_generic: local _feature="3 retain_attached_hw_handler queue_mode $MULTIPATH_QUEUE_MODE"
Otherwise the test will hang due to queue_if_no_path.

all underlying scsi-debug scsi devices:

./max_hw_sectors_kb:2147483647
./max_sectors_kb:512

multipath device:

before my fix:
./max_hw_sectors_kb:2147483647
./max_sectors_kb:1280

after my fix:
./max_hw_sectors_kb:2147483647
./max_sectors_kb:512

> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 332eb9dac22d91..f6c822c9cbd2d3 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3700,8 +3700,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
>  	 */
>  	if (sdkp->first_scan ||
>  	    q->limits.max_sectors > q->limits.max_dev_sectors ||
> -	    q->limits.max_sectors > q->limits.max_hw_sectors)
> +	    q->limits.max_sectors > q->limits.max_hw_sectors) {
>  		q->limits.max_sectors = rw_max;
> +		q->limits.max_user_sectors = rw_max;
> +	}
>
>  	sdkp->first_scan = 0;

Driver shouldn't be changing max_user_sectors..

But it also didn't fix it (mpath still gets ./max_sectors_kb:1280):

[ 74.872485] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 74.872505] device-mapper: multipath: 254:3: Failing path 8:16.
[ 74.872620] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 74.872641] device-mapper: multipath: 254:3: Failing path 8:32.
[ 74.872712] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 74.872732] device-mapper: multipath: 254:3: Failing path 8:48.
[ 74.872788] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 74.872808] device-mapper: multipath: 254:3: Failing path 8:64.

Simply setting max_user_sectors won't help with stacked devices
because blk_stack_limits() doesn't stack max_user_sectors.  It'll
inform the underlying device's blk_validate_limits() calculation which
will result in max_sectors having the desired value (which it already
did, as I showed above).  But when stacking limits from underlying
devices up to the higher-level dm-mpath queue_limits we still have
information loss.

Mike
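
Putting numbers on the information loss described above (sector counts assume
512-byte sectors; values taken from the sysfs output in this message):

	underlying sd devices:  max_sectors_kb = 512  ->  max_sectors = 1024
	DM stacking result:     max_sectors = 1024 (still correct)
	queue_limits_set() on the mpath queue -> blk_validate_limits():
	    max_user_sectors == 0 and max_hw_sectors is huge, so
	    max_sectors = BLK_DEF_MAX_SECTORS_CAP = 2560  ->  max_sectors_kb = 1280

A 2048-sector cloned request then fits the revalidated top-level limit but
exceeds the 1024-sector limit of each path device, which is exactly the
"over max size limit. (2048 > 1024)" failure in the logs.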

* Re: dm: retain stacked max_sectors when setting queue_limits
  2024-05-22 16:48 ` Mike Snitzer
@ 2024-05-22 17:37   ` Ewan Milne
  2024-05-23  1:52   ` Ming Lei
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Ewan Milne @ 2024-05-22 17:37 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Christoph Hellwig, dm-devel, linux-block, Marco Patalano

We tried the sd_revalidate_disk() change, just to be sure.
It didn't fix it.

-Ewan

On Wed, May 22, 2024 at 12:49 PM Mike Snitzer <snitzer@kernel.org> wrote:
>
> On Wed, May 22, 2024 at 04:24:58PM +0200, Christoph Hellwig wrote:
> > On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> > > Otherwise, blk_validate_limits() will throw-away the max_sectors that
> > > was stacked from underlying device(s). In doing so it can set a
> > > max_sectors limit that violates underlying device limits.
> >
> > Hmm, yes it sort of is "throwing the limit away", but it really
> > recalculates it from max_hw_sectors, max_dev_sectors and user_max_sectors.
>
> Yes, but it needs to do that recalculation at each level of a stacked
> device.  And then we need to combine them via blk_stack_limits() -- as
> is done with the various limits stacking loops in
> drivers/md/dm-table.c:dm_calculate_queue_limits
>
> > > This caused dm-multipath IO failures like the following because the
> > > underlying devices' max_sectors were stacked up to be 1024, yet
> > > blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
> > > (2560):
> >
> > I suspect the problem is that SCSI messed directly with max_sectors instead
> > and ignores max_user_sectors (and really shouldn't touch either, but that's
> > a separate discussion). Can you try the patch below and maybe also provide
> > the sysfs output for max_sectors_kb and max_hw_sectors_kb for all involved
> > devices?
>
> FYI, you can easily reproduce with:
>
> git clone https://github.com/snitm/mptest.git
> cd mptest
> <edit so it uses: export MULTIPATH_BACKEND_MODULE="scsidebug">
> ./runtest ./tests/test_00_no_failure
>
> Also, best to change this line:
> ./lib/mpath_generic: local _feature="4 queue_if_no_path retain_attached_hw_handler queue_mode $MULTIPATH_QUEUE_MODE"
> to:
> ./lib/mpath_generic: local _feature="3 retain_attached_hw_handler queue_mode $MULTIPATH_QUEUE_MODE"
> Otherwise the test will hang due to queue_if_no_path.
>
> all underlying scsi-debug scsi devices:
>
> ./max_hw_sectors_kb:2147483647
> ./max_sectors_kb:512
>
> multipath device:
>
> before my fix:
> ./max_hw_sectors_kb:2147483647
> ./max_sectors_kb:1280
>
> after my fix:
> ./max_hw_sectors_kb:2147483647
> ./max_sectors_kb:512
>
> > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> > index 332eb9dac22d91..f6c822c9cbd2d3 100644
> > --- a/drivers/scsi/sd.c
> > +++ b/drivers/scsi/sd.c
> > @@ -3700,8 +3700,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
> >  	 */
> >  	if (sdkp->first_scan ||
> >  	    q->limits.max_sectors > q->limits.max_dev_sectors ||
> > -	    q->limits.max_sectors > q->limits.max_hw_sectors)
> > +	    q->limits.max_sectors > q->limits.max_hw_sectors) {
> >  		q->limits.max_sectors = rw_max;
> > +		q->limits.max_user_sectors = rw_max;
> > +	}
> >
> >  	sdkp->first_scan = 0;
>
> Driver shouldn't be changing max_user_sectors..
>
> But it also didn't fix it (mpath still gets ./max_sectors_kb:1280):
>
> [ 74.872485] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 74.872505] device-mapper: multipath: 254:3: Failing path 8:16.
> [ 74.872620] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 74.872641] device-mapper: multipath: 254:3: Failing path 8:32.
> [ 74.872712] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 74.872732] device-mapper: multipath: 254:3: Failing path 8:48.
> [ 74.872788] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 74.872808] device-mapper: multipath: 254:3: Failing path 8:64.
>
> Simply setting max_user_sectors won't help with stacked devices
> because blk_stack_limits() doesn't stack max_user_sectors.  It'll
> inform the underlying device's blk_validate_limits() calculation which
> will result in max_sectors having the desired value (which it already
> did, as I showed above).  But when stacking limits from underlying
> devices up to the higher-level dm-mpath queue_limits we still have
> information loss.
>
> Mike

* Re: dm: retain stacked max_sectors when setting queue_limits
  2024-05-22 16:48 ` Mike Snitzer
  2024-05-22 17:37   ` Ewan Milne
@ 2024-05-23  1:52   ` Ming Lei
  2024-05-23 15:38     ` [PATCH for-6.10-rc1] block: fix blk_validate_limits() to properly handle stacked devices Mike Snitzer
  2024-05-23  7:16   ` dm: retain stacked max_sectors when setting queue_limits Christoph Hellwig
  2024-05-23  8:27   ` Christoph Hellwig
  3 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-05-23 1:52 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, dm-devel, linux-block, Marco Patalano,
	Ewan Milne, linux-raid

On Wed, May 22, 2024 at 12:48:59PM -0400, Mike Snitzer wrote:
> On Wed, May 22, 2024 at 04:24:58PM +0200, Christoph Hellwig wrote:
> > On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> > > Otherwise, blk_validate_limits() will throw-away the max_sectors that
> > > was stacked from underlying device(s). In doing so it can set a
> > > max_sectors limit that violates underlying device limits.
> >
> > Hmm, yes it sort of is "throwing the limit away", but it really
> > recalculates it from max_hw_sectors, max_dev_sectors and user_max_sectors.
>
> Yes, but it needs to do that recalculation at each level of a stacked
> device.  And then we need to combine them via blk_stack_limits() -- as
> is done with the various limits stacking loops in
> drivers/md/dm-table.c:dm_calculate_queue_limits

This way looks one stacking specific requirement, just wondering why not
put the logic into blk_validate_limits() by passing 'stacking' parameter?
Then raid can benefit from it too.

thanks,
Ming

* [PATCH for-6.10-rc1] block: fix blk_validate_limits() to properly handle stacked devices
  2024-05-23  1:52 ` Ming Lei
@ 2024-05-23 15:38   ` Mike Snitzer
  2024-05-23 15:44     ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2024-05-23 15:38 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig, axboe
  Cc: dm-devel, linux-block, Marco Patalano, Ewan Milne, linux-raid

On Thu, May 23, 2024 at 09:52:40AM +0800, Ming Lei wrote:
> On Wed, May 22, 2024 at 12:48:59PM -0400, Mike Snitzer wrote:
> > On Wed, May 22, 2024 at 04:24:58PM +0200, Christoph Hellwig wrote:
> > > On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> > > > Otherwise, blk_validate_limits() will throw-away the max_sectors that
> > > > was stacked from underlying device(s). In doing so it can set a
> > > > max_sectors limit that violates underlying device limits.
> > >
> > > Hmm, yes it sort of is "throwing the limit away", but it really
> > > recalculates it from max_hw_sectors, max_dev_sectors and user_max_sectors.
> >
> > Yes, but it needs to do that recalculation at each level of a stacked
> > device.  And then we need to combine them via blk_stack_limits() -- as
> > is done with the various limits stacking loops in
> > drivers/md/dm-table.c:dm_calculate_queue_limits
>
> This way looks one stacking specific requirement, just wondering why not
> put the logic into blk_validate_limits() by passing 'stacking' parameter?
> Then raid can benefit from it too.

Sure, we could elevate it to blk_validate_limits (and callers) but
adding a 'stacking' parameter is more intrusive on an API level.

Best to just update blk_set_stacking_limits() to set a new 'stacking'
flag in struct queue_limits, and update blk_stack_limits() to stack
that flag up.

I've verified this commit to work and have staged it in linux-next via
linux-dm.git's 'for-next', see:

https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=cedc03d697ff255dd5b600146521434e2e921815

Jens (and obviously: Christoph, Ming and others), I'm happy to send
this to Linus tomorrow morning if you could please provide your
Reviewed-by or Acked-by.  I'd prefer to keep the intermediate DM fix
just to "show the work and testing".

Thanks,
Mike

From cedc03d697ff255dd5b600146521434e2e921815 Mon Sep 17 00:00:00 2001
From: Mike Snitzer <snitzer@kernel.org>
Date: Thu, 23 May 2024 11:19:29 -0400
Subject: [PATCH] block: fix blk_validate_limits() to properly handle stacked
 devices

For the benefit of other stacking block drivers, e.g. MD, elevate the
DM fix from commit 0ead1c8e8e48 ("dm: retain stacked max_sectors when
setting queue_limits") to block core.

Switches to using a bool bitfield in struct queue_limits (for old
member 'zoned' and new member 'stacking') to not grow that struct.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 block/blk-settings.c   | 30 ++++++++++++++++++++++++++++--
 drivers/md/dm-table.c  |  8 --------
 include/linux/blkdev.h |  3 ++-
 3 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index cdbaef159c4b..24c799072f6c 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -35,6 +35,7 @@ EXPORT_SYMBOL_GPL(blk_queue_rq_timeout);
 void blk_set_stacking_limits(struct queue_limits *lim)
 {
 	memset(lim, 0, sizeof(*lim));
+	lim->stacking = true;
 	lim->logical_block_size = SECTOR_SIZE;
 	lim->physical_block_size = SECTOR_SIZE;
 	lim->io_min = SECTOR_SIZE;
@@ -103,7 +104,7 @@ static int blk_validate_zoned_limits(struct queue_limits *lim)
  */
 static int blk_validate_limits(struct queue_limits *lim)
 {
-	unsigned int max_hw_sectors;
+	unsigned int max_hw_sectors, stacked_max_hw_sectors = 0;
 
 	/*
 	 * Unless otherwise specified, default to 512 byte logical blocks and a
@@ -121,6 +122,23 @@ static int blk_validate_limits(struct queue_limits *lim)
 	if (lim->io_min < lim->physical_block_size)
 		lim->io_min = lim->physical_block_size;
 
+	/*
+	 * For stacked block devices, don't throw-away stacked max_sectors.
+	 */
+	if (lim->stacking && lim->max_hw_sectors) {
+		/*
+		 * lim->max_sectors and lim->max_hw_sectors were already
+		 * validated, relative underlying device(s) in this stacked
+		 * block device.
+		 */
+		stacked_max_hw_sectors = lim->max_hw_sectors;
+		/*
+		 * Impose stacked max_sectors as upper-bound for code below.
+		 */
+		lim->max_hw_sectors = lim->max_sectors;
+	}
+
 	/*
 	 * max_hw_sectors has a somewhat weird default for historical reason,
 	 * but driver really should set their own instead of relying on this
@@ -155,6 +173,11 @@ static int blk_validate_limits(struct queue_limits *lim)
 	lim->max_sectors = round_down(lim->max_sectors,
 			lim->logical_block_size >> SECTOR_SHIFT);
 
+	if (stacked_max_hw_sectors) {
+		/* Restore previously validated stacked max_hw_sectors */
+		lim->max_hw_sectors = max_hw_sectors;
+	}
+
 	/*
 	 * Random default for the maximum number of segments. Driver should not
 	 * rely on this and set their own.
@@ -881,11 +904,14 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		b->max_secure_erase_sectors);
 	t->zone_write_granularity = max(t->zone_write_granularity,
 					b->zone_write_granularity);
-	t->zoned = max(t->zoned, b->zoned);
+	t->zoned |= b->zoned;
 	if (!t->zoned) {
 		t->zone_write_granularity = 0;
 		t->max_zone_append_sectors = 0;
 	}
+
+	t->stacking |= b->stacking;
+
 	return ret;
 }
 EXPORT_SYMBOL(blk_stack_limits);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 6463b4afeaa4..88114719fe18 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1961,7 +1961,6 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 			      struct queue_limits *limits)
 {
 	bool wc = false, fua = false;
-	unsigned int max_hw_sectors;
 	int r;
 
 	if (dm_table_supports_nowait(t))
@@ -1982,16 +1981,9 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 	if (!dm_table_supports_secure_erase(t))
 		limits->max_secure_erase_sectors = 0;
 
-	/* Don't allow queue_limits_set() to throw-away stacked max_sectors */
-	max_hw_sectors = limits->max_hw_sectors;
-	limits->max_hw_sectors = limits->max_sectors;
 	r = queue_limits_set(q, limits);
 	if (r)
 		return r;
-	/* Restore stacked max_hw_sectors */
-	mutex_lock(&q->limits_lock);
-	limits->max_hw_sectors = max_hw_sectors;
-	mutex_unlock(&q->limits_lock);
 
 	if (dm_table_supports_flush(t, (1UL << QUEUE_FLAG_WC))) {
 		wc = true;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c3e8f7cf96be..ad1b00e5cc3e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -307,7 +307,8 @@ struct queue_limits {
 	unsigned char		misaligned;
 	unsigned char		discard_misaligned;
 	unsigned char		raid_partial_stripes_expensive;
-	bool			zoned;
+	bool			zoned:1;
+	bool			stacking:1;
 
 	unsigned int		max_open_zones;
 	unsigned int		max_active_zones;
-- 
2.44.0
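
For orientation, the dm_calculate_queue_limits()/dm_table_set_restrictions()
flow this patch piggybacks on is roughly the following (condensed pseudo-C,
not the actual DM source):

	struct queue_limits limits;

	blk_set_stacking_limits(&limits);   /* with this patch: limits.stacking = true */
	/* for each target, for each underlying device: */
	blk_stack_limits(&limits, &bdev_get_queue(bdev)->limits, start);
	                                    /* min()s the caps; now also ORs 'stacking' up */
	...
	r = queue_limits_set(q, &limits);   /* blk_validate_limits() sees lim->stacking */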

* Re: [PATCH for-6.10-rc1] block: fix blk_validate_limits() to properly handle stacked devices
  2024-05-23 15:38 ` Mike Snitzer
@ 2024-05-23 15:44   ` Christoph Hellwig
  2024-05-23 15:48     ` Mike Snitzer
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2024-05-23 15:44 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Ming Lei, Christoph Hellwig, axboe, dm-devel, linux-block,
	Marco Patalano, Ewan Milne, linux-raid

On Thu, May 23, 2024 at 11:38:21AM -0400, Mike Snitzer wrote:
> Sure, we could elevate it to blk_validate_limits (and callers) but
> adding a 'stacking' parameter is more intrusive on an API level.
>
> Best to just update blk_set_stacking_limits() to set a new 'stacking'
> flag in struct queue_limits, and update blk_stack_limits() to stack
> that flag up.
>
> I've verified this commit to work and have staged it in linux-next via
> linux-dm.git's 'for-next', see:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=cedc03d697ff255dd5b600146521434e2e921815
>
> Jens (and obviously: Christoph, Ming and others), I'm happy to send
> this to Linus tomorrow morning if you could please provide your
> Reviewed-by or Acked-by.  I'd prefer to keep the intermediate DM fix
> just to "show the work and testing".

A stacking flag in the limits is fundamentally wrong, please don't
do this.

* Re: [PATCH for-6.10-rc1] block: fix blk_validate_limits() to properly handle stacked devices
  2024-05-23 15:44 ` Christoph Hellwig
@ 2024-05-23 15:48   ` Mike Snitzer
  2024-05-23 15:52     ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2024-05-23 15:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, axboe, dm-devel, linux-block, Marco Patalano,
	Ewan Milne, linux-raid

On Thu, May 23, 2024 at 05:44:35PM +0200, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 11:38:21AM -0400, Mike Snitzer wrote:
> > Sure, we could elevate it to blk_validate_limits (and callers) but
> > adding a 'stacking' parameter is more intrusive on an API level.
> >
> > Best to just update blk_set_stacking_limits() to set a new 'stacking'
> > flag in struct queue_limits, and update blk_stack_limits() to stack
> > that flag up.
> >
> > I've verified this commit to work and have staged it in linux-next via
> > linux-dm.git's 'for-next', see:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=cedc03d697ff255dd5b600146521434e2e921815
> >
> > Jens (and obviously: Christoph, Ming and others), I'm happy to send
> > this to Linus tomorrow morning if you could please provide your
> > Reviewed-by or Acked-by.  I'd prefer to keep the intermediate DM fix
> > just to "show the work and testing".
>
> A stacking flag in the limits is fundamentally wrong, please don't
> do this.

Um, how so?  It serves as a hint to how the limits were constructed.

Reality is, we have stacking block devices that regularly are _not_
accounted for when people make changes to block core queue_limits
code. That is a serious problem.

Happy to see the need for the 'stacking' flag to go away in time but I
fail to see why it is "fundamentally wrong".

Mike

* Re: [PATCH for-6.10-rc1] block: fix blk_validate_limits() to properly handle stacked devices
  2024-05-23 15:48 ` Mike Snitzer
@ 2024-05-23 15:52   ` Christoph Hellwig
  2024-05-23 16:38     ` Mike Snitzer
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2024-05-23 15:52 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Ming Lei, axboe, dm-devel, linux-block,
	Marco Patalano, Ewan Milne, linux-raid

On Thu, May 23, 2024 at 11:48:50AM -0400, Mike Snitzer wrote:
>
> Reality is, we have stacking block devices that regularly are _not_
> accounted for when people make changes to block core queue_limits
> code. That is a serious problem.

Well, maybe we need to make sure blktests covers this instead of either
impossible to run on a stock distro (device_mapper tests) or always
somehow hanging (lvm2 testsuite) separate tests..

> Happy to see the need for the 'stacking' flag to go away in time but I
> fail to see why it is "fundamentally wrong".

Because stacking limits should not in any way be built different.
Both the stacking flag and your restoring of the value just hack
around the problem instead of trying to address it.  Let's use a
little time to sort this out properly instead of piling up hacks.

* Re: [PATCH for-6.10-rc1] block: fix blk_validate_limits() to properly handle stacked devices
  2024-05-23 15:52 ` Christoph Hellwig
@ 2024-05-23 16:38   ` Mike Snitzer
  2024-05-23 17:05     ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2024-05-23 16:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, axboe, dm-devel, linux-block, Marco Patalano,
	Ewan Milne, linux-raid

On Thu, May 23, 2024 at 05:52:17PM +0200, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 11:48:50AM -0400, Mike Snitzer wrote:
> >
> > Reality is, we have stacking block devices that regularly are _not_
> > accounted for when people make changes to block core queue_limits
> > code. That is a serious problem.
>
> Well, maybe we need to make sure blktests covers this instead of either
> impossible to run on a stock distro (device_mapper tests) or always
> somehow hanging (lvm2 testsuite) separate tests..

Happy to see mptest get folded into blktests (it's just bash code) --
but it doesn't reproduce for you so not a reliable safeguard.

The other testsuites aren't applicable to this particular issue, but I
don't disagree that the unicorn automated test frameworks used for DM
aren't robust enough.

> > Happy to see the need for the 'stacking' flag to go away in time but I
> > fail to see why it is "fundamentally wrong".
>
> Because stacking limits should not in any way be built different.
> Both the stacking flag and your restoring of the value just hack
> around the problem instead of trying to address it.  Let's use a
> little time to sort this out properly instead of piling up hacks.

I have limited patience for queue_limits bugs given how long it took
us to stabilize limits stacking (and how regressions keep happening).

All of the changes to render max_sectors and max_discard_sectors
unsettable directly (and only allow them being set in terms of an ever
growing array of overrides) were quite the big hammer.  But yeah, it
is what it is.

I do appreciate your concern with me wanting limits stacking to be a
more pronounced first-class citizen in block core's queue_limits code.
I was trying to get the code back to a reliable state relative to
blk_stack_limits() et al -- for any underlying driver that might also
be blind to (ab)using max_user_sectors to steer blk_validate_limits().
I'm concerned such hacks are also fragile.

But in general I know we share the same goal, so I'll slow down and
yield to you to run with piecewise fixes where needed.

Will reply to your latest patch now.

Mike

* Re: [PATCH for-6.10-rc1] block: fix blk_validate_limits() to properly handle stacked devices
  2024-05-23 16:38 ` Mike Snitzer
@ 2024-05-23 17:05   ` Christoph Hellwig
  2024-05-23 17:14     ` Mike Snitzer
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2024-05-23 17:05 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Ming Lei, axboe, dm-devel, linux-block,
	Marco Patalano, Ewan Milne, linux-raid

On Thu, May 23, 2024 at 12:38:24PM -0400, Mike Snitzer wrote:
> Happy to see mptest get folded into blktests (it's just bash code) --
> but it doesn't reproduce for you so not a reliable safeguard.

It is a lot better than not running it.  And I'll look into why
it doesn't reproduce.  Right now the only thing I can think of is
different kernel configs, maybe related to schedulers.  Can you send
me your .config?

Is adding mptests something you want to do, or you'd prefer others
to take care of?

* Re: [PATCH for-6.10-rc1] block: fix blk_validate_limits() to properly handle stacked devices
  2024-05-23 17:05 ` Christoph Hellwig
@ 2024-05-23 17:14   ` Mike Snitzer
  0 siblings, 0 replies; 21+ messages in thread
From: Mike Snitzer @ 2024-05-23 17:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, axboe, dm-devel, linux-block, Marco Patalano,
	Ewan Milne, linux-raid

On Thu, May 23, 2024 at 07:05:29PM +0200, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 12:38:24PM -0400, Mike Snitzer wrote:
> > Happy to see mptest get folded into blktests (it's just bash code) --
> > but it doesn't reproduce for you so not a reliable safeguard.
>
> It is a lot better than not running it.  And I'll look into why
> it doesn't reproduce.  Right now the only thing I can think of is
> different kernel configs, maybe related to schedulers.  Can you send
> me your .config?

Will do off-list.

> Is adding mptests something you want to do, or you'd prefer others
> to take care of?

I won't have time in the near or mid-term.  So if someone else would
like to convert mptests over to blktests that'd be wonderful.

Fanning out to test the various supported test permutations would be
cool (meaning, test with both MULTIPATH_BACKEND_MODULE=scsidebug and
MULTIPATH_BACKEND_MODULE=tcmloop ... but tcmloop is brittle due to
sysfs changes vs targetcli's sysfs expectations -- but that's a
targetcli issue that might have been fixed, assuming that code is
supported still?  Not revisited in a few months)

* Re: dm: retain stacked max_sectors when setting queue_limits
  2024-05-22 16:48 ` Mike Snitzer
  2024-05-22 17:37   ` Ewan Milne
  2024-05-23  1:52   ` Ming Lei
@ 2024-05-23  7:16   ` Christoph Hellwig
  2024-05-23  8:27   ` Christoph Hellwig
  3 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2024-05-23 7:16 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, dm-devel, linux-block, Marco Patalano, Ewan Milne

On Wed, May 22, 2024 at 12:48:59PM -0400, Mike Snitzer wrote:
> On Wed, May 22, 2024 at 04:24:58PM +0200, Christoph Hellwig wrote:
> > On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> > > Otherwise, blk_validate_limits() will throw-away the max_sectors that
> > > was stacked from underlying device(s). In doing so it can set a
> > > max_sectors limit that violates underlying device limits.
> >
> > Hmm, yes it sort of is "throwing the limit away", but it really
> > recalculates it from max_hw_sectors, max_dev_sectors and user_max_sectors.
>
> Yes, but it needs to do that recalculation at each level of a stacked
> device.  And then we need to combine them via blk_stack_limits() -- as
> is done with the various limits stacking loops in
> drivers/md/dm-table.c:dm_calculate_queue_limits
>
> > > This caused dm-multipath IO failures like the following because the
> > > underlying devices' max_sectors were stacked up to be 1024, yet
> > > blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
> > > (2560):
> >
> > I suspect the problem is that SCSI messed directly with max_sectors instead
> > and ignores max_user_sectors (and really shouldn't touch either, but that's
> > a separate discussion). Can you try the patch below and maybe also provide
> > the sysfs output for max_sectors_kb and max_hw_sectors_kb for all involved
> > devices?
>
> FYI, you can easily reproduce with:

Running this (with your suggested edits) on Linus' current tree
(commit c760b3725e52403dc1b28644fb09c47a83cacea6) doesn't show
any failure even after dozens of runs.  What am I doing wrong?

* Re: dm: retain stacked max_sectors when setting queue_limits
  2024-05-22 16:48 ` Mike Snitzer
                     ` (2 preceding siblings ...)
  2024-05-23  7:16   ` Christoph Hellwig
@ 2024-05-23  8:27   ` Christoph Hellwig
  2024-05-23 14:12     ` Mike Snitzer
  3 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2024-05-23 8:27 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, dm-devel, linux-block, Marco Patalano, Ewan Milne

On Wed, May 22, 2024 at 12:48:59PM -0400, Mike Snitzer wrote:
> [ 74.872485] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 74.872505] device-mapper: multipath: 254:3: Failing path 8:16.
> [ 74.872620] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 74.872641] device-mapper: multipath: 254:3: Failing path 8:32.
> [ 74.872712] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 74.872732] device-mapper: multipath: 254:3: Failing path 8:48.
> [ 74.872788] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 74.872808] device-mapper: multipath: 254:3: Failing path 8:64.
>
> Simply setting max_user_sectors won't help with stacked devices
> because blk_stack_limits() doesn't stack max_user_sectors.  It'll
> inform the underlying device's blk_validate_limits() calculation which
> will result in max_sectors having the desired value (which it already
> did, as I showed above).  But when stacking limits from underlying
> devices up to the higher-level dm-mpath queue_limits we still have
> information loss.

So while I can't reproduce it, I think the main issue is that
max_sectors really just is a voluntary limit, and enforcing that at
the lower device doesn't really make any sense.  So we could just
check blk_insert_cloned_request to check max_hw_sectors instead.

Or my below preferred variant to just drop the check, as the
max_sectors == 0 check indicates it's pretty sketchy to start with.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index fc364a226e952f..61b108aa20044d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3041,29 +3041,9 @@ void blk_mq_submit_bio(struct bio *bio)
 blk_status_t blk_insert_cloned_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
 	unsigned int max_segments = blk_rq_get_max_segments(rq);
 	blk_status_t ret;
 
-	if (blk_rq_sectors(rq) > max_sectors) {
-		/*
-		 * SCSI device does not have a good way to return if
-		 * Write Same/Zero is actually supported. If a device rejects
-		 * a non-read/write command (discard, write same,etc.) the
-		 * low-level device driver will set the relevant queue limit to
-		 * 0 to prevent blk-lib from issuing more of the offending
-		 * operations. Commands queued prior to the queue limit being
-		 * reset need to be completed with BLK_STS_NOTSUPP to avoid I/O
-		 * errors being propagated to upper layers.
-		 */
-		if (max_sectors == 0)
-			return BLK_STS_NOTSUPP;
-
-		printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
-			__func__, blk_rq_sectors(rq), max_sectors);
-		return BLK_STS_IOERR;
-	}
-
 	/*
 	 * The queue settings related to segment counting may differ from the
 	 * original queue.

>
> Mike
---end quoted text---
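
The first alternative mentioned above -- checking the hard limit instead of
the soft one in blk_insert_cloned_request() -- would amount to something like
this sketch (shown only for comparison; not taken from a posted patch):

	if (blk_rq_sectors(rq) > queue_max_hw_sectors(q)) {
		printk(KERN_ERR "%s: over max hw size limit. (%u > %u)\n",
		       __func__, blk_rq_sectors(rq), queue_max_hw_sectors(q));
		return BLK_STS_IOERR;
	}

Since max_hw_sectors reflects what the hardware can actually transfer, a
clone that honors the top-level queue's validated limits would only trip
this check if the stacking itself were broken.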

* Re: dm: retain stacked max_sectors when setting queue_limits
  2024-05-23  8:27 ` Christoph Hellwig
@ 2024-05-23 14:12   ` Mike Snitzer
  2024-05-23 14:49     ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2024-05-23 14:12 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: dm-devel, linux-block, Marco Patalano, Ewan Milne

On Thu, May 23, 2024 at 10:27:31AM +0200, Christoph Hellwig wrote:
> On Wed, May 22, 2024 at 12:48:59PM -0400, Mike Snitzer wrote:
> > [ 74.872485] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> > [ 74.872505] device-mapper: multipath: 254:3: Failing path 8:16.
> > [ 74.872620] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> > [ 74.872641] device-mapper: multipath: 254:3: Failing path 8:32.
> > [ 74.872712] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> > [ 74.872732] device-mapper: multipath: 254:3: Failing path 8:48.
> > [ 74.872788] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> > [ 74.872808] device-mapper: multipath: 254:3: Failing path 8:64.
> >
> > Simply setting max_user_sectors won't help with stacked devices
> > because blk_stack_limits() doesn't stack max_user_sectors.  It'll
> > inform the underlying device's blk_validate_limits() calculation which
> > will result in max_sectors having the desired value (which it already
> > did, as I showed above).  But when stacking limits from underlying
> > devices up to the higher-level dm-mpath queue_limits we still have
> > information loss.
>
> So while I can't reproduce it, I think the main issue is that
> max_sectors really just is a voluntary limit, and enforcing that at
> the lower device doesn't really make any sense.  So we could just
> check blk_insert_cloned_request to check max_hw_sectors instead.

I haven't tried your patch but we still want properly stacked
max_sectors configured for the device.

> Or my below preferred variant to just drop the check, as the
> max_sectors == 0 check indicates it's pretty sketchy to start with.

At this point in the 6.10 release I don't want further whack-a-mole
fixes due to fallout from removing longstanding negative checks.

Not sure what is sketchy about the max_sectors == 0 check, the large
comment block explains that check quite well.  We want to avoid EIO
for unsupported operations (otherwise we'll get spurious path failures
in the context of dm-multipath).  Could be we can remove this check
after an audit of how LLD handle servicing IO for unsupported
operations -- so best to work through it during a devel cycle.

Not sure why scsi_debug based testing with mptest isn't triggering it
for you.  Are you seeing these limits for the underlying scsi_debug
devices?

./max_hw_sectors_kb:2147483647
./max_sectors_kb:512

What are those limits for the mptest created 'mp' dm-multipath device?

Mike

* Re: dm: retain stacked max_sectors when setting queue_limits
  2024-05-23 14:12 ` Mike Snitzer
@ 2024-05-23 14:49   ` Christoph Hellwig
  2024-05-23 15:44     ` Mike Snitzer
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2024-05-23 14:49 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, dm-devel, linux-block, Marco Patalano, Ewan Milne

On Thu, May 23, 2024 at 10:12:24AM -0400, Mike Snitzer wrote:
> Not sure what is sketchy about the max_sectors == 0 check, the large
> comment block explains that check quite well.  We want to avoid EIO
> for unsupported operations (otherwise we'll get spurious path failures
> in the context of dm-multipath).  Could be we can remove this check
> after an audit of how LLD handle servicing IO for unsupported
> operations -- so best to work through it during a devel cycle.

Think of what happens if you create a dm device, and then reduce
max_sectors using sysfs on the lower device after the dm device
was created: you'll trivially trigger this check.

> Not sure why scsi_debug based testing with mptest isn't triggering it
> for you.  Are you seeing these limits for the underlying scsi_debug
> devices?
>
> ./max_hw_sectors_kb:2147483647
> ./max_sectors_kb:512

root@testvm:~/mptest# cat /sys/block/sdc/queue/max_hw_sectors_kb
2147483647

root@testvm:~/mptest# cat /sys/block/sdd/queue/max_sectors_kb
512

root@testvm:~/mptest# cat /sys/block/dm-0/queue/max_hw_sectors_kb
2147483647

root@testvm:~/mptest# cat /sys/block/dm-0/queue/max_sectors_kb
1280

so they don't match, but for some reason bigger bios never get built.

* Re: dm: retain stacked max_sectors when setting queue_limits
  2024-05-23 14:49 ` Christoph Hellwig
@ 2024-05-23 15:44   ` Mike Snitzer
  2024-05-23 15:50     ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2024-05-23 15:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: dm-devel, linux-block, Marco Patalano, Ewan Milne

On Thu, May 23, 2024 at 04:49:38PM +0200, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 10:12:24AM -0400, Mike Snitzer wrote:
> > Not sure what is sketchy about the max_sectors == 0 check, the large
> > comment block explains that check quite well.  We want to avoid EIO
> > for unsupported operations (otherwise we'll get spurious path failures
> > in the context of dm-multipath).  Could be we can remove this check
> > after an audit of how LLD handle servicing IO for unsupported
> > operations -- so best to work through it during a devel cycle.
>
> Think of what happens if you create a dm device, and then reduce
> max_sectors using sysfs on the lower device after the dm device
> was created: you'll trivially trigger this check.
>
> > Not sure why scsi_debug based testing with mptest isn't triggering it
> > for you.  Are you seeing these limits for the underlying scsi_debug
> > devices?
> >
> > ./max_hw_sectors_kb:2147483647
> > ./max_sectors_kb:512
>
> root@testvm:~/mptest# cat /sys/block/sdc/queue/max_hw_sectors_kb
> 2147483647
>
> root@testvm:~/mptest# cat /sys/block/sdd/queue/max_sectors_kb
> 512
>
> root@testvm:~/mptest# cat /sys/block/dm-0/queue/max_hw_sectors_kb
> 2147483647
>
> root@testvm:~/mptest# cat /sys/block/dm-0/queue/max_sectors_kb
> 1280
>
> so they don't match, but for some reason bigger bios never get built.

Weird... I'm running in a VMware guest but I don't see why that'd make
a difference on larger IOs being formed (given it is virtual
scsi_debug devices).

In any case, we know I can reproduce with this scsi_debug-based mptest
test and Marco has verified my fix resolves the issue on his FC
multipath testbed.

But I've just floated a patch to elevate the fix to block core (based
on Ming's suggestion):
https://patchwork.kernel.org/project/dm-devel/patch/Zk9i7V2GRoHxBPRu@kernel.org/

Let me know, thanks.

* Re: dm: retain stacked max_sectors when setting queue_limits
  2024-05-23 15:44 ` Mike Snitzer
@ 2024-05-23 15:50   ` Christoph Hellwig
  2024-05-23 16:44     ` Mike Snitzer
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2024-05-23 15:50 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, dm-devel, linux-block, Marco Patalano, Ewan Milne

On Thu, May 23, 2024 at 11:44:05AM -0400, Mike Snitzer wrote:
> a difference on larger IOs being formed (given it is virtual
> scsi_debug devices).
>
> In any case, we know I can reproduce with this scsi_debug-based mptest
> test and Marco has verified my fix resolves the issue on his FC
> multipath testbed.
>
> But I've just floated a patch to elevate the fix to block core (based
> on Ming's suggestion):
> https://patchwork.kernel.org/project/dm-devel/patch/Zk9i7V2GRoHxBPRu@kernel.org/

I still think that is wrong.  Unfortunately I can't actually reproduce
the issue locally, but I think we want sd to set the user_max_sectors
and stack if you want to see the limits propagated, i.e. the combined
patch below.  In the longer run I need to get SCSI out of messing
with max_sectors directly, and the blk-mq stacking to stop looking
at it vs just the hardware limits (or just drop the check).

diff --git a/block/blk-settings.c b/block/blk-settings.c
index a7fe8e90240a6e..7a672021daee6a 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -611,6 +611,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	unsigned int top, bottom, alignment, ret = 0;
 
 	t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
+	t->max_user_sectors = min_not_zero(t->max_user_sectors,
+			b->max_user_sectors);
 	t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
 	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 332eb9dac22d91..f6c822c9cbd2d3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3700,8 +3700,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
 	 */
 	if (sdkp->first_scan ||
 	    q->limits.max_sectors > q->limits.max_dev_sectors ||
-	    q->limits.max_sectors > q->limits.max_hw_sectors)
+	    q->limits.max_sectors > q->limits.max_hw_sectors) {
 		q->limits.max_sectors = rw_max;
+		q->limits.max_user_sectors = rw_max;
+	}
 
 	sdkp->first_scan = 0;
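
A note on the stacking primitive used above: min_not_zero() treats zero as
"no limit set", so the new max_user_sectors line only constrains the stacked
device when at least one side actually carries a user-imposed limit.  Its
behavior is equivalent to this sketch:

	/* equivalent behavior of the kernel's min_not_zero(x, y) */
	static inline unsigned int min_not_zero_sketch(unsigned int x, unsigned int y)
	{
		if (x == 0)
			return y;
		if (y == 0)
			return x;
		return x < y ? x : y;
	}

This is what lets a 512 KB max_sectors_kb on the scsi_debug paths propagate
up to the multipath queue once sd sets max_user_sectors and
blk_stack_limits() stacks it.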

* Re: dm: retain stacked max_sectors when setting queue_limits
  2024-05-23 15:50 ` Christoph Hellwig
@ 2024-05-23 16:44   ` Mike Snitzer
  2024-05-23 17:03     ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2024-05-23 16:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: dm-devel, linux-block, Marco Patalano, Ewan Milne

On Thu, May 23, 2024 at 05:50:09PM +0200, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 11:44:05AM -0400, Mike Snitzer wrote:
> > a difference on larger IOs being formed (given it is virtual
> > scsi_debug devices).
> >
> > In any case, we know I can reproduce with this scsi_debug-based mptest
> > test and Marco has verified my fix resolves the issue on his FC
> > multipath testbed.
> >
> > But I've just floated a patch to elevate the fix to block core (based
> > on Ming's suggestion):
> > https://patchwork.kernel.org/project/dm-devel/patch/Zk9i7V2GRoHxBPRu@kernel.org/
>
> I still think that is wrong.  Unfortunately I can't actually reproduce
> the issue locally, but I think we want sd to set the user_max_sectors
> and stack if you want to see the limits propagated, i.e. the combined
> patch below.  In the longer run I need to get SCSI out of messing
> with max_sectors directly, and the blk-mq stacking to stop looking
> at it vs just the hardware limits (or just drop the check).

This "works" but it doesn't safeguard blk_stack_limits() and
blk_validate_limits() from other drivers that weren't trained to
(ab)use max_user_sectors to get blk_validate_limits() to preserve the
underlying device's max_sectors.

But I suppose we can worry about any other similar issues if/when
reported.

Please send a proper patch to Jens so we can get this fixed for
6.10-rc1.  Thanks.

Acked-by: Mike Snitzer <snitzer@kernel.org>

> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index a7fe8e90240a6e..7a672021daee6a 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -611,6 +611,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  	unsigned int top, bottom, alignment, ret = 0;
>  
>  	t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
> +	t->max_user_sectors = min_not_zero(t->max_user_sectors,
> +			b->max_user_sectors);
>  	t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
>  	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
>  	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 332eb9dac22d91..f6c822c9cbd2d3 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3700,8 +3700,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
>  	 */
>  	if (sdkp->first_scan ||
>  	    q->limits.max_sectors > q->limits.max_dev_sectors ||
> -	    q->limits.max_sectors > q->limits.max_hw_sectors)
> +	    q->limits.max_sectors > q->limits.max_hw_sectors) {
>  		q->limits.max_sectors = rw_max;
> +		q->limits.max_user_sectors = rw_max;
> +	}
>
>  	sdkp->first_scan = 0;

* Re: dm: retain stacked max_sectors when setting queue_limits
  2024-05-23 16:44 ` Mike Snitzer
@ 2024-05-23 17:03   ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2024-05-23 17:03 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, dm-devel, linux-block, Marco Patalano, Ewan Milne

On Thu, May 23, 2024 at 12:44:05PM -0400, Mike Snitzer wrote:
> This "works" but it doesn't safeguard blk_stack_limits() and
> blk_validate_limits() from other drivers that weren't trained to
> (ab)use max_user_sectors to get blk_validate_limits() to preserve the
> underlying device's max_sectors.

It does in that sd/sr are the only remaining places directly setting
queue limits without the commit API.  And I am working on converting
them for 6.11.

* Re: [PATCH] dm: retain stacked max_sectors when setting queue_limits
  2024-05-22  2:51 [PATCH] dm: retain stacked max_sectors when setting queue_limits Mike Snitzer
  2024-05-22 14:24 ` Christoph Hellwig
@ 2024-05-22 20:33 ` Ewan Milne
  1 sibling, 0 replies; 21+ messages in thread
From: Ewan Milne @ 2024-05-22 20:33 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, linux-block, hch, Marco Patalano

On Tue, May 21, 2024 at 10:51 PM Mike Snitzer <snitzer@kernel.org> wrote:
>
> Otherwise, blk_validate_limits() will throw-away the max_sectors that
> was stacked from underlying device(s). In doing so it can set a
> max_sectors limit that violates underlying device limits.
>
> This caused dm-multipath IO failures like the following because the
> underlying devices' max_sectors were stacked up to be 1024, yet
> blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
> (2560):
>
> [ 1214.673233] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 1214.673267] device-mapper: multipath: 254:3: Failing path 8:32.
> [ 1214.675196] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 1214.675224] device-mapper: multipath: 254:3: Failing path 8:16.
> [ 1214.675309] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 1214.675338] device-mapper: multipath: 254:3: Failing path 8:48.
> [ 1214.675413] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 1214.675441] device-mapper: multipath: 254:3: Failing path 8:64.
>
> The initial bug report included:
>
> [ 13.822701] blk_insert_cloned_request: over max size limit. (248 > 128)
> [ 13.829351] device-mapper: multipath: 253:3: Failing path 8:32.
> [ 13.835307] blk_insert_cloned_request: over max size limit. (248 > 128)
> [ 13.841928] device-mapper: multipath: 253:3: Failing path 65:16.
> [ 13.844532] blk_insert_cloned_request: over max size limit. (248 > 128)
> [ 13.854363] blk_insert_cloned_request: over max size limit. (248 > 128)
> [ 13.854580] device-mapper: multipath: 253:4: Failing path 8:48.
> [ 13.861166] device-mapper: multipath: 253:3: Failing path 8:192.
>
> Reported-by: Marco Patalano <mpatalan@redhat.com>
> Reported-by: Ewan Milne <emilne@redhat.com>
> Fixes: 1c0e720228ad ("dm: use queue_limits_set")
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  drivers/md/dm-table.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 88114719fe18..6463b4afeaa4 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -1961,6 +1961,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>  			      struct queue_limits *limits)
>  {
>  	bool wc = false, fua = false;
> +	unsigned int max_hw_sectors;
>  	int r;
>
>  	if (dm_table_supports_nowait(t))
> @@ -1981,9 +1982,16 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>  	if (!dm_table_supports_secure_erase(t))
>  		limits->max_secure_erase_sectors = 0;
>
> +	/* Don't allow queue_limits_set() to throw-away stacked max_sectors */
> +	max_hw_sectors = limits->max_hw_sectors;
> +	limits->max_hw_sectors = limits->max_sectors;
>  	r = queue_limits_set(q, limits);
>  	if (r)
>  		return r;
> +	/* Restore stacked max_hw_sectors */
> +	mutex_lock(&q->limits_lock);
> +	limits->max_hw_sectors = max_hw_sectors;
> +	mutex_unlock(&q->limits_lock);
>
>  	if (dm_table_supports_flush(t, (1UL << QUEUE_FLAG_WC))) {
>  		wc = true;
> --
> 2.43.0
>

This fixed the FC DM-MP failures, so:

Tested-by: Marco Patalano <mpatalan@redhat.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>

end of thread, other threads:[~2024-05-23 17:14 UTC | newest]

Thread overview: 21+ messages
2024-05-22  2:51 [PATCH] dm: retain stacked max_sectors when setting queue_limits Mike Snitzer
2024-05-22 14:24 ` Christoph Hellwig
2024-05-22 16:48   ` Mike Snitzer
2024-05-22 17:37     ` Ewan Milne
2024-05-23  1:52     ` Ming Lei
2024-05-23 15:38       ` [PATCH for-6.10-rc1] block: fix blk_validate_limits() to properly handle stacked devices Mike Snitzer
2024-05-23 15:44         ` Christoph Hellwig
2024-05-23 15:48           ` Mike Snitzer
2024-05-23 15:52             ` Christoph Hellwig
2024-05-23 16:38               ` Mike Snitzer
2024-05-23 17:05                 ` Christoph Hellwig
2024-05-23 17:14                   ` Mike Snitzer
2024-05-23  7:16     ` dm: retain stacked max_sectors when setting queue_limits Christoph Hellwig
2024-05-23  8:27     ` Christoph Hellwig
2024-05-23 14:12       ` Mike Snitzer
2024-05-23 14:49         ` Christoph Hellwig
2024-05-23 15:44           ` Mike Snitzer
2024-05-23 15:50             ` Christoph Hellwig
2024-05-23 16:44               ` Mike Snitzer
2024-05-23 17:03                 ` Christoph Hellwig
2024-05-22 20:33 ` [PATCH] " Ewan Milne