Re: [PATCH] dm: Fix deadlock when reloading a multipath table

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Benjamin Marzinski <bmarzins@redhat.com>
To: Martin Wilck <mwilck@suse.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>,
	Mikulas Patocka <mpatocka@redhat.com>,
	Mike Snitzer <snitzer@kernel.org>,
	dm-devel@lists.linux.dev, linux-block@vger.kernel.org
Subject: Re: [PATCH] dm: Fix deadlock when reloading a multipath table
Date: Tue, 21 Oct 2025 15:16:17 -0400	[thread overview]
Message-ID: <aPfcAfn6gsgNLwC7@redhat.com> (raw)
In-Reply-To: <e407b683dceb9516b54cede5624baa399f8fa638.camel@suse.com>

On Fri, Oct 10, 2025 at 12:19:51PM +0200, Martin Wilck wrote:
> On Thu, 2025-10-09 at 18:25 -0400, Benjamin Marzinski wrote:
> > On Thu, Oct 09, 2025 at 11:57:08AM +0200, Martin Wilck wrote:
> > > On Wed, 2025-10-08 at 23:04 -0400, Benjamin Marzinski wrote:
> > > > Request-based devices (dm-multipath) queue I/O in blk-mq on
> > > > noflush
> > > > suspends. Any queued IO will make it impossible to freeze the
> > > > queue.
> > > > If
> > > > a process attempts to update the queue limits while there is
> > > > queued
> > > > IO,
> > > > it can be get stuck holding the limits lock, while unable to
> > > > freeze
> > > > the
> > > > queue. If device-mapper then attempts to update the limits during
> > > > a
> > > > table swap, it will deadlock trying to grab the limits lock while
> > > > making
> > > > it impossible to flush the IO.
> > > > 
> > > > Disallow updating the queue limits during a table swap, when
> > > > updating
> > > > an
> > > > immutable request-based dm device (dm-multipath) during a noflush
> > > > suspend. It is userspace's responsibility to make sure that the
> > > > new
> > > > table uses the same limits as the existing table if it asks for a
> > > > noflush suspend.
> > > > 
> > > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > > > ---
> > > >  drivers/md/dm-table.c |  4 ++++
> > > >  drivers/md/dm-thin.c  |  7 ++-----
> > > >  drivers/md/dm.c       | 35 +++++++++++++++++++++++------------
> > > >  3 files changed, 29 insertions(+), 17 deletions(-)
> > > > 
> > > > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > > > index ad0a60a07b93..0522cd700e0e 100644
> > > > --- a/drivers/md/dm-table.c
> > > > +++ b/drivers/md/dm-table.c
> > > > @@ -2043,6 +2043,10 @@ bool dm_table_supports_size_change(struct
> > > > dm_table *t, sector_t old_size,
> > > >  	return true;
> > > >  }
> > > >  
> > > > +/*
> > > > + * This function will be skipped by noflush reloads of immutable
> > > > request
> > > > + * based devices (dm-mpath).
> > > > + */
> > > >  int dm_table_set_restrictions(struct dm_table *t, struct
> > > > request_queue *q,
> > > >  			      struct queue_limits *limits)
> > > >  {
> > > > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > > > index c84149ba4e38..6f98936f0e05 100644
> > > > --- a/drivers/md/dm-thin.c
> > > > +++ b/drivers/md/dm-thin.c
> > > > @@ -4383,11 +4383,8 @@ static void thin_postsuspend(struct
> > > > dm_target
> > > > *ti)
> > > >  {
> > > >  	struct thin_c *tc = ti->private;
> > > >  
> > > > -	/*
> > > > -	 * The dm_noflush_suspending flag has been cleared by
> > > > now,
> > > > so
> > > > -	 * unfortunately we must always run this.
> > > > -	 */
> > > > -	noflush_work(tc, do_noflush_stop);
> > > > +	if (dm_noflush_suspending(ti))
> > > > +		noflush_work(tc, do_noflush_stop);
> > > >  }
> > > >  
> > > >  static int thin_preresume(struct dm_target *ti)
> > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > index 3ede5ba4cf7e..87e65c48dd04 100644
> > > > --- a/drivers/md/dm.c
> > > > +++ b/drivers/md/dm.c
> > > > @@ -2439,7 +2439,6 @@ static struct dm_table *__bind(struct
> > > > mapped_device *md, struct dm_table *t,
> > > >  {
> > > >  	struct dm_table *old_map;
> > > >  	sector_t size, old_size;
> > > > -	int ret;
> > > >  
> > > >  	lockdep_assert_held(&md->suspend_lock);
> > > >  
> > > > @@ -2454,11 +2453,13 @@ static struct dm_table *__bind(struct
> > > > mapped_device *md, struct dm_table *t,
> > > >  
> > > >  	set_capacity(md->disk, size);
> > > >  
> > > > -	ret = dm_table_set_restrictions(t, md->queue, limits);
> > > > -	if (ret) {
> > > > -		set_capacity(md->disk, old_size);
> > > > -		old_map = ERR_PTR(ret);
> > > > -		goto out;
> > > > +	if (limits) {
> > > > +		int ret = dm_table_set_restrictions(t, md-
> > > > >queue,
> > > > limits);
> > > > +		if (ret) {
> > > > +			set_capacity(md->disk, old_size);
> > > > +			old_map = ERR_PTR(ret);
> > > > +			goto out;
> > > > +		}
> > > >  	}
> > > >  
> > > >  	/*
> > > > @@ -2836,6 +2837,7 @@ static void dm_wq_work(struct work_struct
> > > > *work)
> > > >  
> > > >  static void dm_queue_flush(struct mapped_device *md)
> > > >  {
> > > > +	clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
> > > >  	clear_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags);
> > > >  	smp_mb__after_atomic();
> > > >  	queue_work(md->wq, &md->work);
> > > > @@ -2848,6 +2850,7 @@ struct dm_table *dm_swap_table(struct
> > > > mapped_device *md, struct dm_table *table)
> > > >  {
> > > >  	struct dm_table *live_map = NULL, *map = ERR_PTR(-
> > > > EINVAL);
> > > >  	struct queue_limits limits;
> > > > +	bool update_limits = true;
> > > >  	int r;
> > > >  
> > > >  	mutex_lock(&md->suspend_lock);
> > > > @@ -2856,20 +2859,31 @@ struct dm_table *dm_swap_table(struct
> > > > mapped_device *md, struct dm_table *table)
> > > >  	if (!dm_suspended_md(md))
> > > >  		goto out;
> > > >  
> > > > +	/*
> > > > +	 * To avoid a potential deadlock locking the queue
> > > > limits,
> > > > disallow
> > > > +	 * updating the queue limits during a table swap, when
> > > > updating an
> > > > +	 * immutable request-based dm device (dm-multipath)
> > > > during a
> > > > noflush
> > > > +	 * suspend. It is userspace's responsibility to make
> > > > sure
> > > > that the new
> > > > +	 * table uses the same limits as the existing table, if
> > > > it
> > > > asks for a
> > > > +	 * noflush suspend.
> > > > +	 */
> > > > +	if (dm_request_based(md) && md->immutable_target &&
> > > > +	    __noflush_suspending(md))
> > > > +		update_limits = false;
> > > >  	/*
> > > >  	 * If the new table has no data devices, retain the
> > > > existing
> > > > limits.
> > > >  	 * This helps multipath with queue_if_no_path if all
> > > > paths
> > > > disappear,
> > > >  	 * then new I/O is queued based on these limits, and
> > > > then
> > > > some paths
> > > >  	 * reappear.
> > > >  	 */
> > > > -	if (dm_table_has_no_data_devices(table)) {
> > > > +	else if (dm_table_has_no_data_devices(table)) {
> > > >  		live_map = dm_get_live_table_fast(md);
> > > >  		if (live_map)
> > > >  			limits = md->queue->limits;
> > > >  		dm_put_live_table_fast(md);
> > > >  	}
> > > >  
> > > > -	if (!live_map) {
> > > > +	if (update_limits && !live_map) {
> > > >  		r = dm_calculate_queue_limits(table, &limits);
> > > >  		if (r) {
> > > >  			map = ERR_PTR(r);
> > > > @@ -2877,7 +2891,7 @@ struct dm_table *dm_swap_table(struct
> > > > mapped_device *md, struct dm_table *table)
> > > >  		}
> > > >  	}
> > > >  
> > > > -	map = __bind(md, table, &limits);
> > > > +	map = __bind(md, table, update_limits ? &limits : NULL);
> > > >  	dm_issue_global_event();
> > > >  
> > > >  out:
> > > > @@ -2930,7 +2944,6 @@ static int __dm_suspend(struct
> > > > mapped_device
> > > > *md, struct dm_table *map,
> > > >  
> > > >  	/*
> > > >  	 * DMF_NOFLUSH_SUSPENDING must be set before presuspend.
> > > > -	 * This flag is cleared before dm_suspend returns.
> > > >  	 */
> > > >  	if (noflush)
> > > >  		set_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
> > > > @@ -2993,8 +3006,6 @@ static int __dm_suspend(struct
> > > > mapped_device
> > > > *md, struct dm_table *map,
> > > >  	if (!r)
> > > >  		set_bit(dmf_suspended_flag, &md->flags);
> > > >  
> > > > -	if (noflush)
> > > > -		clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
> > > >  	if (map)
> > > >  		synchronize_srcu(&md->io_barrier);
> > > 
> > > This moves the clear_bit() behind synchronize_rcu(). Is that safe?
> > 
> > It's not just moved behind the synchronize_rcu().
> > DMF_NOFLUSH_SUSPENDING
> > is now set for the entire time the device is suspended. If people
> > would
> > like, I'll update the patch to rename it to DMF_NOFLUSH_SUSPEND.
> 
> Right. I realize now that there's a smp_mb__after_atomic() in
> dm_queue_flush().
> 
> > I did check to see if holding it for the entire suspend would cause
> > issues, but I didn't see any case where it would. If I missed a 
> > case where __noflush_suspending() should only return true if we are
> > actually in the process of suspending, I can easily update that
> > function to do that.
> 
> If this is necessary, I agree that the flag an related function should
> be renamed. But there are already generic DM flags to indicate that a
> queue is suspend*ed*. Perhaps, instead of changing the semantics of
> DMF_NOFLUSH_SUSPENDING, it would make more sense to test 
> 
>   (__noflush_suspending || test_bit(DMF_BLOCK_IO_FOR_SUSPEND)
> 
> in dm_swap_table()?

Won't we ALWAYS be suspended when we are in dm_swap_table()? We do need
to refresh the limits in some cases (the cases where multipath-tools
currently reloads the table without setting noflush). What we need to
know is "is this table swap happening in a noflush suspend, where
userspace understands that it can't modify the device table in a way
that would change the limits". For multipath, this is almost always the
case. 

> 
> Anyway, I am still not convinced that we want to have this specific
> exception for multipath only.

I agree that Bart's solution looks better to me, if it can get in.
 
> > > In general, I'm wondering whether we need a more generic solution
> > > to
> > > this problem. Therefore I've added linux-block to cc.
> > > 
> > > The way I see it, if a device has queued IO without any means to
> > > perform the IO, it can't be frozen. We'd either need to fail all
> > > queued
> > > IO in this case, or refuse attempts to freeze the queue.
> > 
> > In general, it's perfectly normal to start freezing the queue while
> > there are still outstanding requests.
> 
> Sure, but my point was "without any means to perform the IO". As you
> pointed out yourself, a freeze attempt in this situation would get
> stuck.
> 
> I find Bart's approach very attractive; freezing might not be necessary
> at all in that case. We'dd just need to avoid a race where paths get
> reinstated while the operation that would normally have required a
> freeze is ongoing.

I agree. Even just the timing out of freezes, his
"[PATCH 2/3] block: Restrict the duration of sysfs attribute changes"
would be enough to keep this from deadlocking the system.

-Ben

> Martin

next prev parent reply	other threads:[~2025-10-21 19:16 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-09  3:04 [PATCH] dm: Fix deadlock when reloading a multipath table Benjamin Marzinski
2025-10-09  9:57 ` Martin Wilck
2025-10-09 22:25   ` Benjamin Marzinski
2025-10-10 10:19     ` Martin Wilck
2025-10-21 19:16       ` Benjamin Marzinski [this message]
2025-10-22 14:11         ` Martin Wilck
2025-10-23 17:39           ` Benjamin Marzinski
2025-10-23 17:54             ` Bart Van Assche
2025-10-09 23:29   ` Bart Van Assche
2025-10-10  0:39     ` Benjamin Marzinski
2025-10-10  9:23     ` Martin Wilck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aPfcAfn6gsgNLwC7@redhat.com \
    --to=bmarzins@redhat.com \
    --cc=bart.vanassche@sandisk.com \
    --cc=dm-devel@lists.linux.dev \
    --cc=linux-block@vger.kernel.org \
    --cc=mpatocka@redhat.com \
    --cc=mwilck@suse.com \
    --cc=snitzer@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.