From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D9E0F34A3C0 for ; Tue, 21 Oct 2025 19:16:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761074194; cv=none; b=bJsSk+a/9R8wXycjKmLyp7WZPBXlqBN4YuFZMtqYP3DplwAi+VRlqrBX+vUu1nQKJj+kTRE4LHSMuWN2X+hi40ugViqjTc2O+DFbcMMPqHHpbxVK7jXiGB1dP8UkSdtvXlK4Do48oymFNBEXiuAww3WHFdp21Ew0bUuL5DQkz3A= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761074194; c=relaxed/simple; bh=AmReOZCD1x8Sk/N6D5ZiKjTaLrZxSusaJgyBMpGb2Gk=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=VBg/RXiU0DnTdU0fPHk1m5iks5W0cryQw6ltcxKSe4mIT3oQp7WLk7MkOYlK5k3qvjW9152eoT7yjFeuXJhLVD4NO8qV7RzV1kFrtN1ulkRCaY84PjHxW3qmzhbKjV2pwJjRvr7d9vYl+E7TXACga2pc6yzSZSqEJcG5WrXsxMc= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Mo5AHGhZ; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Mo5AHGhZ" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1761074190; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=v0Oe/5HuoWDLpJoIPd3oyR/J28Ecq7NQwP7IC26zECo=; b=Mo5AHGhZNQbOi8oetcQGJMs2g7kboSfzg3o2n8f/aF+51jSqtYBJMizIyLeAMYIcELbk+g i/REbOnKzoq4IHrkn25h0It9VuwfwZOV3VnYjXyKg1XWVnpYtTIoVvZSYq4WUYRoTOIw0S 7G8Lyzf8if+fklbmQhGr3w0b/jPq/u8= Received: from mx-prod-mc-04.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-542-HRIPV6LgPDCbeRjYFokaFA-1; Tue, 21 Oct 2025 15:16:22 -0400 X-MC-Unique: HRIPV6LgPDCbeRjYFokaFA-1 X-Mimecast-MFC-AGG-ID: HRIPV6LgPDCbeRjYFokaFA_1761074180 Received: from mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-04.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id B79321954224; Tue, 21 Oct 2025 19:16:20 +0000 (UTC) Received: from bmarzins-01.fast.eng.rdu2.dc.redhat.com (unknown [10.6.23.247]) by mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 4073330001A7; Tue, 21 Oct 2025 19:16:20 +0000 (UTC) Received: from bmarzins-01.fast.eng.rdu2.dc.redhat.com (localhost [127.0.0.1]) by bmarzins-01.fast.eng.rdu2.dc.redhat.com (8.18.1/8.17.1) with ESMTPS id 59LJGIEH3453173 (version=TLSv1.3 cipher=TLS_AES_256_GCM_SHA384 bits=256 verify=NOT); Tue, 21 Oct 2025 15:16:18 -0400 Received: (from bmarzins@localhost) by bmarzins-01.fast.eng.rdu2.dc.redhat.com (8.18.1/8.18.1/Submit) id 59LJGHSQ3453172; Tue, 21 Oct 2025 15:16:17 -0400 Date: Tue, 21 Oct 2025 15:16:17 -0400 From: Benjamin Marzinski To: Martin Wilck Cc: Bart Van Assche , Mikulas Patocka , Mike Snitzer , dm-devel@lists.linux.dev, linux-block@vger.kernel.org Subject: Re: [PATCH] dm: Fix deadlock when reloading a multipath table Message-ID: References: <20251009030431.2895495-1-bmarzins@redhat.com> Precedence: bulk X-Mailing-List: dm-devel@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 On Fri, Oct 10, 2025 at 12:19:51PM +0200, Martin Wilck wrote: > On Thu, 2025-10-09 at 18:25 -0400, Benjamin Marzinski wrote: > > On Thu, Oct 09, 2025 at 11:57:08AM +0200, Martin Wilck wrote: > > > On Wed, 2025-10-08 at 23:04 -0400, Benjamin Marzinski wrote: > > > > Request-based devices (dm-multipath) queue I/O in blk-mq on > > > > noflush > > > > suspends. Any queued IO will make it impossible to freeze the > > > > queue. > > > > If > > > > a process attempts to update the queue limits while there is > > > > queued > > > > IO, > > > > it can be get stuck holding the limits lock, while unable to > > > > freeze > > > > the > > > > queue. If device-mapper then attempts to update the limits during > > > > a > > > > table swap, it will deadlock trying to grab the limits lock while > > > > making > > > > it impossible to flush the IO. > > > > > > > > Disallow updating the queue limits during a table swap, when > > > > updating > > > > an > > > > immutable request-based dm device (dm-multipath) during a noflush > > > > suspend. It is userspace's responsibility to make sure that the > > > > new > > > > table uses the same limits as the existing table if it asks for a > > > > noflush suspend. > > > > > > > > Signed-off-by: Benjamin Marzinski > > > > --- > > > >  drivers/md/dm-table.c |  4 ++++ > > > >  drivers/md/dm-thin.c  |  7 ++----- > > > >  drivers/md/dm.c       | 35 +++++++++++++++++++++++------------ > > > >  3 files changed, 29 insertions(+), 17 deletions(-) > > > > > > > > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c > > > > index ad0a60a07b93..0522cd700e0e 100644 > > > > --- a/drivers/md/dm-table.c > > > > +++ b/drivers/md/dm-table.c > > > > @@ -2043,6 +2043,10 @@ bool dm_table_supports_size_change(struct > > > > dm_table *t, sector_t old_size, > > > >   return true; > > > >  } > > > >   > > > > +/* > > > > + * This function will be skipped by noflush reloads of immutable > > > > request > > > > + * based devices (dm-mpath). > > > > + */ > > > >  int dm_table_set_restrictions(struct dm_table *t, struct > > > > request_queue *q, > > > >         struct queue_limits *limits) > > > >  { > > > > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c > > > > index c84149ba4e38..6f98936f0e05 100644 > > > > --- a/drivers/md/dm-thin.c > > > > +++ b/drivers/md/dm-thin.c > > > > @@ -4383,11 +4383,8 @@ static void thin_postsuspend(struct > > > > dm_target > > > > *ti) > > > >  { > > > >   struct thin_c *tc = ti->private; > > > >   > > > > - /* > > > > - * The dm_noflush_suspending flag has been cleared by > > > > now, > > > > so > > > > - * unfortunately we must always run this. > > > > - */ > > > > - noflush_work(tc, do_noflush_stop); > > > > + if (dm_noflush_suspending(ti)) > > > > + noflush_work(tc, do_noflush_stop); > > > >  } > > > >   > > > >  static int thin_preresume(struct dm_target *ti) > > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > > > > index 3ede5ba4cf7e..87e65c48dd04 100644 > > > > --- a/drivers/md/dm.c > > > > +++ b/drivers/md/dm.c > > > > @@ -2439,7 +2439,6 @@ static struct dm_table *__bind(struct > > > > mapped_device *md, struct dm_table *t, > > > >  { > > > >   struct dm_table *old_map; > > > >   sector_t size, old_size; > > > > - int ret; > > > >   > > > >   lockdep_assert_held(&md->suspend_lock); > > > >   > > > > @@ -2454,11 +2453,13 @@ static struct dm_table *__bind(struct > > > > mapped_device *md, struct dm_table *t, > > > >   > > > >   set_capacity(md->disk, size); > > > >   > > > > - ret = dm_table_set_restrictions(t, md->queue, limits); > > > > - if (ret) { > > > > - set_capacity(md->disk, old_size); > > > > - old_map = ERR_PTR(ret); > > > > - goto out; > > > > + if (limits) { > > > > + int ret = dm_table_set_restrictions(t, md- > > > > >queue, > > > > limits); > > > > + if (ret) { > > > > + set_capacity(md->disk, old_size); > > > > + old_map = ERR_PTR(ret); > > > > + goto out; > > > > + } > > > >   } > > > >   > > > >   /* > > > > @@ -2836,6 +2837,7 @@ static void dm_wq_work(struct work_struct > > > > *work) > > > >   > > > >  static void dm_queue_flush(struct mapped_device *md) > > > >  { > > > > + clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags); > > > >   clear_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags); > > > >   smp_mb__after_atomic(); > > > >   queue_work(md->wq, &md->work); > > > > @@ -2848,6 +2850,7 @@ struct dm_table *dm_swap_table(struct > > > > mapped_device *md, struct dm_table *table) > > > >  { > > > >   struct dm_table *live_map = NULL, *map = ERR_PTR(- > > > > EINVAL); > > > >   struct queue_limits limits; > > > > + bool update_limits = true; > > > >   int r; > > > >   > > > >   mutex_lock(&md->suspend_lock); > > > > @@ -2856,20 +2859,31 @@ struct dm_table *dm_swap_table(struct > > > > mapped_device *md, struct dm_table *table) > > > >   if (!dm_suspended_md(md)) > > > >   goto out; > > > >   > > > > + /* > > > > + * To avoid a potential deadlock locking the queue > > > > limits, > > > > disallow > > > > + * updating the queue limits during a table swap, when > > > > updating an > > > > + * immutable request-based dm device (dm-multipath) > > > > during a > > > > noflush > > > > + * suspend. It is userspace's responsibility to make > > > > sure > > > > that the new > > > > + * table uses the same limits as the existing table, if > > > > it > > > > asks for a > > > > + * noflush suspend. > > > > + */ > > > > + if (dm_request_based(md) && md->immutable_target && > > > > +     __noflush_suspending(md)) > > > > + update_limits = false; > > > >   /* > > > >   * If the new table has no data devices, retain the > > > > existing > > > > limits. > > > >   * This helps multipath with queue_if_no_path if all > > > > paths > > > > disappear, > > > >   * then new I/O is queued based on these limits, and > > > > then > > > > some paths > > > >   * reappear. > > > >   */ > > > > - if (dm_table_has_no_data_devices(table)) { > > > > + else if (dm_table_has_no_data_devices(table)) { > > > >   live_map = dm_get_live_table_fast(md); > > > >   if (live_map) > > > >   limits = md->queue->limits; > > > >   dm_put_live_table_fast(md); > > > >   } > > > >   > > > > - if (!live_map) { > > > > + if (update_limits && !live_map) { > > > >   r = dm_calculate_queue_limits(table, &limits); > > > >   if (r) { > > > >   map = ERR_PTR(r); > > > > @@ -2877,7 +2891,7 @@ struct dm_table *dm_swap_table(struct > > > > mapped_device *md, struct dm_table *table) > > > >   } > > > >   } > > > >   > > > > - map = __bind(md, table, &limits); > > > > + map = __bind(md, table, update_limits ? &limits : NULL); > > > >   dm_issue_global_event(); > > > >   > > > >  out: > > > > @@ -2930,7 +2944,6 @@ static int __dm_suspend(struct > > > > mapped_device > > > > *md, struct dm_table *map, > > > >   > > > >   /* > > > >   * DMF_NOFLUSH_SUSPENDING must be set before presuspend. > > > > - * This flag is cleared before dm_suspend returns. > > > >   */ > > > >   if (noflush) > > > >   set_bit(DMF_NOFLUSH_SUSPENDING, &md->flags); > > > > @@ -2993,8 +3006,6 @@ static int __dm_suspend(struct > > > > mapped_device > > > > *md, struct dm_table *map, > > > >   if (!r) > > > >   set_bit(dmf_suspended_flag, &md->flags); > > > >   > > > > - if (noflush) > > > > - clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags); > > > >   if (map) > > > >   synchronize_srcu(&md->io_barrier); > > > > > > This moves the clear_bit() behind synchronize_rcu(). Is that safe? > > > > It's not just moved behind the synchronize_rcu(). > > DMF_NOFLUSH_SUSPENDING > > is now set for the entire time the device is suspended. If people > > would > > like, I'll update the patch to rename it to DMF_NOFLUSH_SUSPEND. > > Right. I realize now that there's a smp_mb__after_atomic() in > dm_queue_flush(). > > > I did check to see if holding it for the entire suspend would cause > > issues, but I didn't see any case where it would. If I missed a > > case where __noflush_suspending() should only return true if we are > > actually in the process of suspending, I can easily update that > > function to do that. > > If this is necessary, I agree that the flag an related function should > be renamed. But there are already generic DM flags to indicate that a > queue is suspend*ed*. Perhaps, instead of changing the semantics of > DMF_NOFLUSH_SUSPENDING, it would make more sense to test  > > (__noflush_suspending || test_bit(DMF_BLOCK_IO_FOR_SUSPEND) > > in dm_swap_table()? Won't we ALWAYS be suspended when we are in dm_swap_table()? We do need to refresh the limits in some cases (the cases where multipath-tools currently reloads the table without setting noflush). What we need to know is "is this table swap happening in a noflush suspend, where userspace understands that it can't modify the device table in a way that would change the limits". For multipath, this is almost always the case. > > Anyway, I am still not convinced that we want to have this specific > exception for multipath only. I agree that Bart's solution looks better to me, if it can get in. > > > In general, I'm wondering whether we need a more generic solution > > > to > > > this problem. Therefore I've added linux-block to cc. > > > > > > The way I see it, if a device has queued IO without any means to > > > perform the IO, it can't be frozen. We'd either need to fail all > > > queued > > > IO in this case, or refuse attempts to freeze the queue. > > > > In general, it's perfectly normal to start freezing the queue while > > there are still outstanding requests. > > Sure, but my point was "without any means to perform the IO". As you > pointed out yourself, a freeze attempt in this situation would get > stuck. > > I find Bart's approach very attractive; freezing might not be necessary > at all in that case. We'dd just need to avoid a race where paths get > reinstated while the operation that would normally have required a > freeze is ongoing. I agree. Even just the timing out of freezes, his "[PATCH 2/3] block: Restrict the duration of sysfs attribute changes" would be enough to keep this from deadlocking the system. -Ben > Martin