fix md disk_name lifetime problems

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* fix md disk_name lifetime problems
@ 2022-07-12  7:03 Christoph Hellwig
  2022-07-12  7:03 ` [PATCH 1/8] md: fix kobject_add error handling Christoph Hellwig
                   ` (8 more replies)
  0 siblings, 9 replies; 13+ messages in thread
From: Christoph Hellwig @ 2022-07-12  7:03 UTC (permalink / raw)
  To: Song Liu; +Cc: Logan Gunthorpe, linux-raid, linux-block

Hi all,

this series tries to fix a problem repored by Logan where we see
duplicate sysfs file name in md.  It is due to the fact that the
md driver only checks for duplicates on currently live mddevs,
while the sysfs name can live on longer.  It is an old problem,
but the race window got longer due to waiting for the device freeze
earlier in del_gendisk.

Note that I still can't reproduce this problem so this was based
on code inspection.  Also note that I occasionally run into a hang
in the 07layouts tests with or without this series.

Diffstat:
 md.c |  272 +++++++++++++++++++++++++++++++++----------------------------------
 md.h |    1 
 2 files changed, 139 insertions(+), 134 deletions(-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/8] md: fix kobject_add error handling
  2022-07-12  7:03 fix md disk_name lifetime problems Christoph Hellwig
@ 2022-07-12  7:03 ` Christoph Hellwig
  2022-07-12  7:03 ` [PATCH 2/8] md: implement ->free_disk Christoph Hellwig
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2022-07-12  7:03 UTC (permalink / raw)
  To: Song Liu; +Cc: Logan Gunthorpe, linux-raid, linux-block

Always use the deferred kobject_put from mddev_put to clean up a mddev.
To make sure that happens properly clear the hold_active on error,
and clear the ->gendisk field and put the disk manually when ->add_disk
fails to avoid a double free.

Fixes: 5e55e2f5fc95 ("[PATCH] md: convert compile time warnings into runtime warnings")
Fixes: 9be68dd7ac0e ("md: add error handling support for add_disk()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/md.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 076255ec9ba18..861d6a9481b2e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5703,23 +5703,23 @@ static int md_alloc(dev_t dev, char *name)
 	disk->events |= DISK_EVENT_MEDIA_CHANGE;
 	mddev->gendisk = disk;
 	error = add_disk(disk);
-	if (error)
-		goto out_cleanup_disk;
+	if (error) {
+		mddev->gendisk = NULL;
+		put_disk(disk);
+		goto out_unlock_disks_mutex;
+	}
 
 	error = kobject_add(&mddev->kobj, &disk_to_dev(disk)->kobj, "%s", "md");
 	if (error)
-		goto out_del_gendisk;
+		goto out_unlock_disks_mutex;
 
 	kobject_uevent(&mddev->kobj, KOBJ_ADD);
 	mddev->sysfs_state = sysfs_get_dirent_safe(mddev->kobj.sd, "array_state");
 	mddev->sysfs_level = sysfs_get_dirent_safe(mddev->kobj.sd, "level");
-	goto out_unlock_disks_mutex;
 
-out_del_gendisk:
-	del_gendisk(disk);
-out_cleanup_disk:
-	put_disk(disk);
 out_unlock_disks_mutex:
+	if (error)
+		mddev->hold_active = 0;
 	mutex_unlock(&disks_mutex);
 	mddev_put(mddev);
 	return error;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/8] md: implement ->free_disk
  2022-07-12  7:03 fix md disk_name lifetime problems Christoph Hellwig
  2022-07-12  7:03 ` [PATCH 1/8] md: fix kobject_add error handling Christoph Hellwig
@ 2022-07-12  7:03 ` Christoph Hellwig
  2022-07-12 23:13   ` Logan Gunthorpe
  2022-07-12  7:03 ` [PATCH 3/8] md: factor out the rdev overlaps check from rdev_size_store Christoph Hellwig
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2022-07-12  7:03 UTC (permalink / raw)
  To: Song Liu; +Cc: Logan Gunthorpe, linux-raid, linux-block

Ensure that all private data is only freed once all accesses are done.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/md.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 861d6a9481b2e..ae076a7a87796 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5581,11 +5581,6 @@ static void md_free(struct kobject *ko)
 		del_gendisk(mddev->gendisk);
 		put_disk(mddev->gendisk);
 	}
-	percpu_ref_exit(&mddev->writes_pending);
-
-	bioset_exit(&mddev->bio_set);
-	bioset_exit(&mddev->sync_set);
-	kfree(mddev);
 }
 
 static const struct sysfs_ops md_sysfs_ops = {
@@ -7844,6 +7839,17 @@ static unsigned int md_check_events(struct gendisk *disk, unsigned int clearing)
 	return ret;
 }
 
+static void md_free_disk(struct gendisk *disk)
+{
+	struct mddev *mddev = disk->private_data;
+
+	percpu_ref_exit(&mddev->writes_pending);
+	bioset_exit(&mddev->bio_set);
+	bioset_exit(&mddev->sync_set);
+
+	kfree(mddev);
+}
+
 const struct block_device_operations md_fops =
 {
 	.owner		= THIS_MODULE,
@@ -7857,6 +7863,7 @@ const struct block_device_operations md_fops =
 	.getgeo		= md_getgeo,
 	.check_events	= md_check_events,
 	.set_read_only	= md_set_read_only,
+	.free_disk	= md_free_disk,
 };
 
 static int md_thread(void *arg)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/8] md: factor out the rdev overlaps check from rdev_size_store
  2022-07-12  7:03 fix md disk_name lifetime problems Christoph Hellwig
  2022-07-12  7:03 ` [PATCH 1/8] md: fix kobject_add error handling Christoph Hellwig
  2022-07-12  7:03 ` [PATCH 2/8] md: implement ->free_disk Christoph Hellwig
@ 2022-07-12  7:03 ` Christoph Hellwig
  2022-07-12  7:03 ` [PATCH 4/8] md: stop using for_each_mddev in md_do_sync Christoph Hellwig
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2022-07-12  7:03 UTC (permalink / raw)
  To: Song Liu; +Cc: Logan Gunthorpe, linux-raid, linux-block

This splits the code into nicely readable chunks and also avoids
the refcount inc/dec manipulations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/md.c | 84 +++++++++++++++++++++++--------------------------
 1 file changed, 39 insertions(+), 45 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index ae076a7a87796..61eceef17d5fd 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3335,14 +3335,33 @@ rdev_size_show(struct md_rdev *rdev, char *page)
 	return sprintf(page, "%llu\n", (unsigned long long)rdev->sectors / 2);
 }
 
-static int overlaps(sector_t s1, sector_t l1, sector_t s2, sector_t l2)
+static int md_rdevs_overlap(struct md_rdev *a, struct md_rdev *b)
 {
 	/* check if two start/length pairs overlap */
-	if (s1+l1 <= s2)
-		return 0;
-	if (s2+l2 <= s1)
-		return 0;
-	return 1;
+	if (a->data_offset + a->sectors <= b->data_offset)
+		return false;
+	if (b->data_offset + b->sectors <= a->data_offset)
+		return false;
+	return true;
+}
+
+static bool md_rdev_overlaps(struct md_rdev *rdev)
+{
+	struct mddev *mddev;
+	struct md_rdev *rdev2;
+
+	spin_lock(&all_mddevs_lock);
+	list_for_each_entry(mddev, &all_mddevs, all_mddevs) {
+		rdev_for_each(rdev2, mddev) {
+			if (rdev != rdev2 && rdev->bdev == rdev2->bdev &&
+			    md_rdevs_overlap(rdev, rdev2)) {
+				spin_unlock(&all_mddevs_lock);
+				return true;
+			}
+		}
+	}
+	spin_unlock(&all_mddevs_lock);
+	return false;
 }
 
 static int strict_blocks_to_sectors(const char *buf, sector_t *sectors)
@@ -3394,46 +3413,21 @@ rdev_size_store(struct md_rdev *rdev, const char *buf, size_t len)
 		return -EINVAL; /* component must fit device */
 
 	rdev->sectors = sectors;
-	if (sectors > oldsectors && my_mddev->external) {
-		/* Need to check that all other rdevs with the same
-		 * ->bdev do not overlap.  'rcu' is sufficient to walk
-		 * the rdev lists safely.
-		 * This check does not provide a hard guarantee, it
-		 * just helps avoid dangerous mistakes.
-		 */
-		struct mddev *mddev;
-		int overlap = 0;
-		struct list_head *tmp;
 
-		rcu_read_lock();
-		for_each_mddev(mddev, tmp) {
-			struct md_rdev *rdev2;
-
-			rdev_for_each(rdev2, mddev)
-				if (rdev->bdev == rdev2->bdev &&
-				    rdev != rdev2 &&
-				    overlaps(rdev->data_offset, rdev->sectors,
-					     rdev2->data_offset,
-					     rdev2->sectors)) {
-					overlap = 1;
-					break;
-				}
-			if (overlap) {
-				mddev_put(mddev);
-				break;
-			}
-		}
-		rcu_read_unlock();
-		if (overlap) {
-			/* Someone else could have slipped in a size
-			 * change here, but doing so is just silly.
-			 * We put oldsectors back because we *know* it is
-			 * safe, and trust userspace not to race with
-			 * itself
-			 */
-			rdev->sectors = oldsectors;
-			return -EBUSY;
-		}
+	/*
+	 * Check that all other rdevs with the same bdev do not overlap.  This
+	 * check does not provide a hard guarantee, it just helps avoid
+	 * dangerous mistakes.
+	 */
+	if (sectors > oldsectors && my_mddev->external &&
+	    md_rdev_overlaps(rdev)) {
+		/*
+		 * Someone else could have slipped in a size change here, but
+		 * doing so is just silly.  We put oldsectors back because we
+		 * know it is safe, and trust userspace not to race with itself.
+		 */
+		rdev->sectors = oldsectors;
+		return -EBUSY;
 	}
 	return len;
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/8] md: stop using for_each_mddev in md_do_sync
  2022-07-12  7:03 fix md disk_name lifetime problems Christoph Hellwig
                   ` (2 preceding siblings ...)
  2022-07-12  7:03 ` [PATCH 3/8] md: factor out the rdev overlaps check from rdev_size_store Christoph Hellwig
@ 2022-07-12  7:03 ` Christoph Hellwig
  2022-07-12  7:03 ` [PATCH 5/8] md: stop using for_each_mddev in md_notify_reboot Christoph Hellwig
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2022-07-12  7:03 UTC (permalink / raw)
  To: Song Liu; +Cc: Logan Gunthorpe, linux-raid, linux-block

Just do a plain list_for_each that only grabs a mddev reference in
the case where the thread sleeps and restarts the list iteration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/md.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 61eceef17d5fd..d22608dcb25fe 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8683,7 +8683,6 @@ void md_do_sync(struct md_thread *thread)
 	unsigned long update_time;
 	sector_t mark_cnt[SYNC_MARKS];
 	int last_mark,m;
-	struct list_head *tmp;
 	sector_t last_check;
 	int skipped = 0;
 	struct md_rdev *rdev;
@@ -8753,7 +8752,8 @@ void md_do_sync(struct md_thread *thread)
 	try_again:
 		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
 			goto skip;
-		for_each_mddev(mddev2, tmp) {
+		spin_lock(&all_mddevs_lock);
+		list_for_each_entry(mddev2, &all_mddevs, all_mddevs) {
 			if (mddev2 == mddev)
 				continue;
 			if (!mddev->parallel_resync
@@ -8783,7 +8783,8 @@ void md_do_sync(struct md_thread *thread)
 							desc, mdname(mddev),
 							mdname(mddev2));
 					}
-					mddev_put(mddev2);
+					spin_unlock(&all_mddevs_lock);
+
 					if (signal_pending(current))
 						flush_signals(current);
 					schedule();
@@ -8793,6 +8794,7 @@ void md_do_sync(struct md_thread *thread)
 				finish_wait(&resync_wait, &wq);
 			}
 		}
+		spin_unlock(&all_mddevs_lock);
 	} while (mddev->curr_resync < 2);
 
 	j = 0;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 5/8] md: stop using for_each_mddev in md_notify_reboot
  2022-07-12  7:03 fix md disk_name lifetime problems Christoph Hellwig
                   ` (3 preceding siblings ...)
  2022-07-12  7:03 ` [PATCH 4/8] md: stop using for_each_mddev in md_do_sync Christoph Hellwig
@ 2022-07-12  7:03 ` Christoph Hellwig
  2022-07-12  7:03 ` [PATCH 6/8] md: stop using for_each_mddev in md_exit Christoph Hellwig
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2022-07-12  7:03 UTC (permalink / raw)
  To: Song Liu; +Cc: Logan Gunthorpe, linux-raid, linux-block

Just do a simple list_for_each_entry_safe on all_mddevs, and only grab a
reference when we drop the lock.

Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/md.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index d22608dcb25fe..73abeaed0b6fc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9547,11 +9547,13 @@ EXPORT_SYMBOL_GPL(rdev_clear_badblocks);
 static int md_notify_reboot(struct notifier_block *this,
 			    unsigned long code, void *x)
 {
-	struct list_head *tmp;
-	struct mddev *mddev;
+	struct mddev *mddev, *n;
 	int need_delay = 0;
 
-	for_each_mddev(mddev, tmp) {
+	spin_lock(&all_mddevs_lock);
+	list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
+		mddev_get(mddev);
+		spin_unlock(&all_mddevs_lock);
 		if (mddev_trylock(mddev)) {
 			if (mddev->pers)
 				__md_stop_writes(mddev);
@@ -9560,7 +9562,11 @@ static int md_notify_reboot(struct notifier_block *this,
 			mddev_unlock(mddev);
 		}
 		need_delay = 1;
+		mddev_put(mddev);
+		spin_lock(&all_mddevs_lock);
 	}
+	spin_unlock(&all_mddevs_lock);
+
 	/*
 	 * certain more exotic SCSI devices are known to be
 	 * volatile wrt too early system reboots. While the
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 6/8] md: stop using for_each_mddev in md_exit
  2022-07-12  7:03 fix md disk_name lifetime problems Christoph Hellwig
                   ` (4 preceding siblings ...)
  2022-07-12  7:03 ` [PATCH 5/8] md: stop using for_each_mddev in md_notify_reboot Christoph Hellwig
@ 2022-07-12  7:03 ` Christoph Hellwig
  2022-07-12  7:03 ` [PATCH 7/8] md: only delete entries from all_mddevs when the disk is freed Christoph Hellwig
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2022-07-12  7:03 UTC (permalink / raw)
  To: Song Liu; +Cc: Logan Gunthorpe, linux-raid, linux-block

Just do a simple list_for_each_entry_safe on all_mddevs, and only grab a
reference when we drop the lock and delete the now unused for_each_mddev
macro.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/md.c | 39 +++++++++++----------------------------
 1 file changed, 11 insertions(+), 28 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 73abeaed0b6fc..f15bc6bb65f2d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -368,28 +368,6 @@ EXPORT_SYMBOL_GPL(md_new_event);
 static LIST_HEAD(all_mddevs);
 static DEFINE_SPINLOCK(all_mddevs_lock);
 
-/*
- * iterates through all used mddevs in the system.
- * We take care to grab the all_mddevs_lock whenever navigating
- * the list, and to always hold a refcount when unlocked.
- * Any code which breaks out of this loop while own
- * a reference to the current mddev and must mddev_put it.
- */
-#define for_each_mddev(_mddev,_tmp)					\
-									\
-	for (({ spin_lock(&all_mddevs_lock);				\
-		_tmp = all_mddevs.next;					\
-		_mddev = NULL;});					\
-	     ({ if (_tmp != &all_mddevs)				\
-			mddev_get(list_entry(_tmp, struct mddev, all_mddevs));\
-		spin_unlock(&all_mddevs_lock);				\
-		if (_mddev) mddev_put(_mddev);				\
-		_mddev = list_entry(_tmp, struct mddev, all_mddevs);	\
-		_tmp != &all_mddevs;});					\
-	     ({ spin_lock(&all_mddevs_lock);				\
-		_tmp = _tmp->next;})					\
-		)
-
 /* Rather than calling directly into the personality make_request function,
  * IO requests come here first so that we can check if the device is
  * being suspended pending a reconfiguration.
@@ -9885,8 +9863,7 @@ void md_autostart_arrays(int part)
 
 static __exit void md_exit(void)
 {
-	struct mddev *mddev;
-	struct list_head *tmp;
+	struct mddev *mddev, *n;
 	int delay = 1;
 
 	unregister_blkdev(MD_MAJOR,"md");
@@ -9906,17 +9883,23 @@ static __exit void md_exit(void)
 	}
 	remove_proc_entry("mdstat", NULL);
 
-	for_each_mddev(mddev, tmp) {
+	spin_lock(&all_mddevs_lock);
+	list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
+		mddev_get(mddev);
+		spin_unlock(&all_mddevs_lock);
 		export_array(mddev);
 		mddev->ctime = 0;
 		mddev->hold_active = 0;
 		/*
-		 * for_each_mddev() will call mddev_put() at the end of each
-		 * iteration.  As the mddev is now fully clear, this will
-		 * schedule the mddev for destruction by a workqueue, and the
+		 * As the mddev is now fully clear, mddev_put will schedule
+		 * the mddev for destruction by a workqueue, and the
 		 * destroy_workqueue() below will wait for that to complete.
 		 */
+		mddev_put(mddev);
+		spin_lock(&all_mddevs_lock);
 	}
+	spin_unlock(&all_mddevs_lock);
+
 	destroy_workqueue(md_rdev_misc_wq);
 	destroy_workqueue(md_misc_wq);
 	destroy_workqueue(md_wq);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 7/8] md: only delete entries from all_mddevs when the disk is freed
  2022-07-12  7:03 fix md disk_name lifetime problems Christoph Hellwig
                   ` (5 preceding siblings ...)
  2022-07-12  7:03 ` [PATCH 6/8] md: stop using for_each_mddev in md_exit Christoph Hellwig
@ 2022-07-12  7:03 ` Christoph Hellwig
  2022-07-12  7:03 ` [PATCH 8/8] md: simplify md_open Christoph Hellwig
  2022-07-12 23:21 ` fix md disk_name lifetime problems Logan Gunthorpe
  8 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2022-07-12  7:03 UTC (permalink / raw)
  To: Song Liu; +Cc: Logan Gunthorpe, linux-raid, linux-block

This ensures device names don't get prematurely reused.  Instead add a
deleted flag to skip already deleted devices in mddev_get and other
places that only want to see live mddevs.

Reported-by; Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/md.c | 58 ++++++++++++++++++++++++++++++++++---------------
 drivers/md/md.h |  1 +
 2 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f15bc6bb65f2d..9afc438a08e4d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -625,6 +625,10 @@ EXPORT_SYMBOL(md_flush_request);
 
 static inline struct mddev *mddev_get(struct mddev *mddev)
 {
+	lockdep_assert_held(&all_mddevs_lock);
+
+	if (mddev->deleted)
+		return NULL;
 	atomic_inc(&mddev->active);
 	return mddev;
 }
@@ -639,7 +643,7 @@ static void mddev_put(struct mddev *mddev)
 	    mddev->ctime == 0 && !mddev->hold_active) {
 		/* Array is not configured at all, and not held active,
 		 * so destroy it */
-		list_del_init(&mddev->all_mddevs);
+		mddev->deleted = true;
 
 		/*
 		 * Call queue_work inside the spinlock so that
@@ -720,8 +724,8 @@ static struct mddev *mddev_find(dev_t unit)
 
 	spin_lock(&all_mddevs_lock);
 	mddev = mddev_find_locked(unit);
-	if (mddev)
-		mddev_get(mddev);
+	if (mddev && !mddev_get(mddev))
+		mddev = NULL;
 	spin_unlock(&all_mddevs_lock);
 
 	return mddev;
@@ -3330,6 +3334,8 @@ static bool md_rdev_overlaps(struct md_rdev *rdev)
 
 	spin_lock(&all_mddevs_lock);
 	list_for_each_entry(mddev, &all_mddevs, all_mddevs) {
+		if (mddev->deleted)
+			continue;
 		rdev_for_each(rdev2, mddev) {
 			if (rdev != rdev2 && rdev->bdev == rdev2->bdev &&
 			    md_rdevs_overlap(rdev, rdev2)) {
@@ -5504,11 +5510,10 @@ md_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
 	if (!entry->show)
 		return -EIO;
 	spin_lock(&all_mddevs_lock);
-	if (list_empty(&mddev->all_mddevs)) {
+	if (!mddev_get(mddev)) {
 		spin_unlock(&all_mddevs_lock);
 		return -EBUSY;
 	}
-	mddev_get(mddev);
 	spin_unlock(&all_mddevs_lock);
 
 	rv = entry->show(mddev, page);
@@ -5529,11 +5534,10 @@ md_attr_store(struct kobject *kobj, struct attribute *attr,
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
 	spin_lock(&all_mddevs_lock);
-	if (list_empty(&mddev->all_mddevs)) {
+	if (!mddev_get(mddev)) {
 		spin_unlock(&all_mddevs_lock);
 		return -EBUSY;
 	}
-	mddev_get(mddev);
 	spin_unlock(&all_mddevs_lock);
 	rv = entry->store(mddev, page, length);
 	mddev_put(mddev);
@@ -7815,6 +7819,10 @@ static void md_free_disk(struct gendisk *disk)
 {
 	struct mddev *mddev = disk->private_data;
 
+	spin_lock(&all_mddevs_lock);
+	list_del_init(&mddev->all_mddevs);
+	spin_unlock(&all_mddevs_lock);
+
 	percpu_ref_exit(&mddev->writes_pending);
 	bioset_exit(&mddev->bio_set);
 	bioset_exit(&mddev->sync_set);
@@ -8131,6 +8139,8 @@ static void *md_seq_start(struct seq_file *seq, loff_t *pos)
 		if (!l--) {
 			mddev = list_entry(tmp, struct mddev, all_mddevs);
 			mddev_get(mddev);
+			if (!mddev_get(mddev))
+				continue;
 			spin_unlock(&all_mddevs_lock);
 			return mddev;
 		}
@@ -8144,25 +8154,35 @@ static void *md_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
 	struct list_head *tmp;
 	struct mddev *next_mddev, *mddev = v;
+	struct mddev *to_put = NULL;
 
 	++*pos;
 	if (v == (void*)2)
 		return NULL;
 
 	spin_lock(&all_mddevs_lock);
-	if (v == (void*)1)
+	if (v == (void*)1) {
 		tmp = all_mddevs.next;
-	else
+	} else {
+		to_put = mddev;
 		tmp = mddev->all_mddevs.next;
-	if (tmp != &all_mddevs)
-		next_mddev = mddev_get(list_entry(tmp,struct mddev,all_mddevs));
-	else {
-		next_mddev = (void*)2;
-		*pos = 0x10000;
 	}
+
+	for (;;) {
+		if (tmp == &all_mddevs) {
+			next_mddev = (void*)2;
+			*pos = 0x10000;
+			break;
+		}
+		next_mddev = list_entry(tmp, struct mddev, all_mddevs);
+		if (mddev_get(next_mddev))
+			break;
+		mddev = next_mddev;
+		tmp = mddev->all_mddevs.next;
+	};
 	spin_unlock(&all_mddevs_lock);
 
-	if (v != (void*)1)
+	if (to_put)
 		mddev_put(mddev);
 	return next_mddev;
 
@@ -8732,6 +8752,8 @@ void md_do_sync(struct md_thread *thread)
 			goto skip;
 		spin_lock(&all_mddevs_lock);
 		list_for_each_entry(mddev2, &all_mddevs, all_mddevs) {
+			if (mddev2->deleted)
+				continue;
 			if (mddev2 == mddev)
 				continue;
 			if (!mddev->parallel_resync
@@ -9530,7 +9552,8 @@ static int md_notify_reboot(struct notifier_block *this,
 
 	spin_lock(&all_mddevs_lock);
 	list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
-		mddev_get(mddev);
+		if (!mddev_get(mddev))
+			continue;
 		spin_unlock(&all_mddevs_lock);
 		if (mddev_trylock(mddev)) {
 			if (mddev->pers)
@@ -9885,7 +9908,8 @@ static __exit void md_exit(void)
 
 	spin_lock(&all_mddevs_lock);
 	list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
-		mddev_get(mddev);
+		if (!mddev_get(mddev))
+			continue;
 		spin_unlock(&all_mddevs_lock);
 		export_array(mddev);
 		mddev->ctime = 0;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index cf2cbb17acbd4..e731a2fdc6ac1 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -488,6 +488,7 @@ struct mddev {
 
 	atomic_t			max_corr_read_errors; /* max read retries */
 	struct list_head		all_mddevs;
+	bool				deleted;
 
 	const struct attribute_group	*to_remove;
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 8/8] md: simplify md_open
  2022-07-12  7:03 fix md disk_name lifetime problems Christoph Hellwig
                   ` (6 preceding siblings ...)
  2022-07-12  7:03 ` [PATCH 7/8] md: only delete entries from all_mddevs when the disk is freed Christoph Hellwig
@ 2022-07-12  7:03 ` Christoph Hellwig
  2022-07-12 23:21 ` fix md disk_name lifetime problems Logan Gunthorpe
  8 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2022-07-12  7:03 UTC (permalink / raw)
  To: Song Liu; +Cc: Logan Gunthorpe, linux-raid, linux-block

Now that devices are on the all_mddevs list until the gendisk is freed,
there can't be any duplicates.  Remove the global list lookup and just
grab a reference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/md.c | 42 +++++++++++++++---------------------------
 1 file changed, 15 insertions(+), 27 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 9afc438a08e4d..53a92b306b1fc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7753,45 +7753,33 @@ static int md_set_read_only(struct block_device *bdev, bool ro)
 
 static int md_open(struct block_device *bdev, fmode_t mode)
 {
-	/*
-	 * Succeed if we can lock the mddev, which confirms that
-	 * it isn't being stopped right now.
-	 */
-	struct mddev *mddev = mddev_find(bdev->bd_dev);
+	struct mddev *mddev;
 	int err;
 
+	spin_lock(&all_mddevs_lock);
+	mddev = mddev_get(bdev->bd_disk->private_data);
+	spin_unlock(&all_mddevs_lock);
 	if (!mddev)
 		return -ENODEV;
 
-	if (mddev->gendisk != bdev->bd_disk) {
-		/* we are racing with mddev_put which is discarding this
-		 * bd_disk.
-		 */
-		mddev_put(mddev);
-		/* Wait until bdev->bd_disk is definitely gone */
-		if (work_pending(&mddev->del_work))
-			flush_workqueue(md_misc_wq);
-		return -EBUSY;
-	}
-	BUG_ON(mddev != bdev->bd_disk->private_data);
-
-	if ((err = mutex_lock_interruptible(&mddev->open_mutex)))
+	err = mutex_lock_interruptible(&mddev->open_mutex);
+	if (err)
 		goto out;
 
-	if (test_bit(MD_CLOSING, &mddev->flags)) {
-		mutex_unlock(&mddev->open_mutex);
-		err = -ENODEV;
-		goto out;
-	}
+	err = -ENODEV;
+	if (test_bit(MD_CLOSING, &mddev->flags))
+		goto out_unlock;
 
-	err = 0;
 	atomic_inc(&mddev->openers);
 	mutex_unlock(&mddev->open_mutex);
 
 	bdev_check_media_change(bdev);
- out:
-	if (err)
-		mddev_put(mddev);
+	return 0;
+
+out_unlock:
+	mutex_unlock(&mddev->open_mutex);
+out:
+	mddev_put(mddev);
 	return err;
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/8] md: implement ->free_disk
  2022-07-12  7:03 ` [PATCH 2/8] md: implement ->free_disk Christoph Hellwig
@ 2022-07-12 23:13   ` Logan Gunthorpe
  2022-07-13  7:17     ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Logan Gunthorpe @ 2022-07-12 23:13 UTC (permalink / raw)
  To: Christoph Hellwig, Song Liu; +Cc: linux-raid, linux-block



On 2022-07-12 01:03, Christoph Hellwig wrote:
> Ensure that all private data is only freed once all accesses are done.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/md/md.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 861d6a9481b2e..ae076a7a87796 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5581,11 +5581,6 @@ static void md_free(struct kobject *ko)
>  		del_gendisk(mddev->gendisk);
>  		put_disk(mddev->gendisk);
>  	}
> -	percpu_ref_exit(&mddev->writes_pending);
> -
> -	bioset_exit(&mddev->bio_set);
> -	bioset_exit(&mddev->sync_set);
> -	kfree(mddev);
>  }
>  
>  static const struct sysfs_ops md_sysfs_ops = {
> @@ -7844,6 +7839,17 @@ static unsigned int md_check_events(struct gendisk *disk, unsigned int clearing)
>  	return ret;
>  }
>  
> +static void md_free_disk(struct gendisk *disk)
> +{
> +	struct mddev *mddev = disk->private_data;
> +
> +	percpu_ref_exit(&mddev->writes_pending);
> +	bioset_exit(&mddev->bio_set);
> +	bioset_exit(&mddev->sync_set);
> +
> +	kfree(mddev);
> +}

I still don't think this is entirely correct. There are error paths that
will put the kobject before the disk is created and if they get hit then
the kfree(mddev) will never be called and the memory will be leaked.

Instead of creating an ugly special path for that, I came up with a solution 
that I think  makes a bit more sense: the kobject is still freed in it's 
own free  function, but the disk holds a reference to the kobject and drops
it in its free function. The sysfs puts and del_gendisk are then moved
into mddev_delayed_delete() so they happen earlier.

Thoughts?

Logan

--

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 78e2588ed43e..330db78a5b38 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5585,15 +5585,6 @@ static void md_free(struct kobject *ko)
 {
        struct mddev *mddev = container_of(ko, struct mddev, kobj);
 
-       if (mddev->sysfs_state)
-               sysfs_put(mddev->sysfs_state);
-       if (mddev->sysfs_level)
-               sysfs_put(mddev->sysfs_level);
-
-       if (mddev->gendisk) {
-               del_gendisk(mddev->gendisk);
-               blk_cleanup_disk(mddev->gendisk);
-       }
        percpu_ref_exit(&mddev->writes_pending);
 
        bioset_exit(&mddev->bio_set);
@@ -5618,6 +5609,17 @@ static void mddev_delayed_delete(struct work_struct *ws)
        struct mddev *mddev = container_of(ws, struct mddev, del_work);
 
        kobject_del(&mddev->kobj);
+
+       if (mddev->sysfs_state)
+               sysfs_put(mddev->sysfs_state);
+       if (mddev->sysfs_level)
+               sysfs_put(mddev->sysfs_level);
+
+       if (mddev->gendisk) {
+               del_gendisk(mddev->gendisk);
+               blk_cleanup_disk(mddev->gendisk);
+       }
+
        kobject_put(&mddev->kobj);
 }
 
@@ -5708,6 +5710,7 @@ int md_alloc(dev_t dev, char *name)
        else
                sprintf(disk->disk_name, "md%d", unit);
        disk->fops = &md_fops;
+       kobject_get(&mddev->kobj);
        disk->private_data = mddev;
 
        mddev->queue = disk->queue;
@@ -7858,6 +7861,13 @@ static unsigned int md_check_events(struct gendisk *disk, unsign>
        return ret;
 }
 
+static void md_free_disk(struct gendisk *disk)
+{
+       struct mddev *mddev = disk->private_data;
+
+       kobject_put(&mddev->kobj);
+}
+
 const struct block_device_operations md_fops =
 {
        .owner          = THIS_MODULE,
@@ -7871,6 +7881,7 @@ const struct block_device_operations md_fops =
        .getgeo         = md_getgeo,
        .check_events   = md_check_events,
        .set_read_only  = md_set_read_only,
+       .free_disk      = md_free_disk,
 };

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: fix md disk_name lifetime problems
  2022-07-12  7:03 fix md disk_name lifetime problems Christoph Hellwig
                   ` (7 preceding siblings ...)
  2022-07-12  7:03 ` [PATCH 8/8] md: simplify md_open Christoph Hellwig
@ 2022-07-12 23:21 ` Logan Gunthorpe
  8 siblings, 0 replies; 13+ messages in thread
From: Logan Gunthorpe @ 2022-07-12 23:21 UTC (permalink / raw)
  To: Christoph Hellwig, Song Liu; +Cc: linux-raid, linux-block



On 2022-07-12 01:03, Christoph Hellwig wrote:
> Hi all,
> 
> this series tries to fix a problem repored by Logan where we see
> duplicate sysfs file name in md.  It is due to the fact that the
> md driver only checks for duplicates on currently live mddevs,
> while the sysfs name can live on longer.  It is an old problem,
> but the race window got longer due to waiting for the device freeze
> earlier in del_gendisk.
> 
> Note that I still can't reproduce this problem so this was based
> on code inspection.  Also note that I occasionally run into a hang
> in the 07layouts tests with or without this series.

Yes, there should be a fix for 07layouts in md-next already 
(92a2748dc3c5).

Your branch wasn't based off md-next and does have some conflicts. 

I've rebased your patches onto md-next, replaced patch 2 with 
my suggestion and ran my battery of tests on it and have found that 
it fixes the bug I identified and looks good to me.

https://github.com/sbates130272/linux-p2pmem   md-lifetime-fixes

This branch will still conflict (easily) with the patch
8b9ab62662 ("block: remove blk_cleanup_disk") in your branch.

How do you want to proceed with getting these merged?

Logan



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/8] md: implement ->free_disk
  2022-07-12 23:13   ` Logan Gunthorpe
@ 2022-07-13  7:17     ` Christoph Hellwig
  2022-07-13 15:30       ` Logan Gunthorpe
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2022-07-13  7:17 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: Christoph Hellwig, Song Liu, linux-raid, linux-block

On Tue, Jul 12, 2022 at 05:13:48PM -0600, Logan Gunthorpe wrote:
> > +
> > +	percpu_ref_exit(&mddev->writes_pending);
> > +	bioset_exit(&mddev->bio_set);
> > +	bioset_exit(&mddev->sync_set);
> > +
> > +	kfree(mddev);
> > +}
> 
> I still don't think this is entirely correct. There are error paths that
> will put the kobject before the disk is created and if they get hit then
> the kfree(mddev) will never be called and the memory will be leaked.

True.

> Instead of creating an ugly special path for that, I came up with a solution 
> that I think  makes a bit more sense: the kobject is still freed in it's 
> own free  function, but the disk holds a reference to the kobject and drops
> it in its free function. The sysfs puts and del_gendisk are then moved
> into mddev_delayed_delete() so they happen earlier.

I'm not sure this is a good idea.  The mddev kobject hangs off the
disk, so I don't think that it should in any way control the life
time of the disk, as that just creates potential problems down the
road.

Moreover we actually need the special kfree path anyway for the
add_disk failure, something that I had missed.  Something like
the untested patch below on top of my series.  I'll look into folding
and testing it.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 53a92b306b1fc..62f40c49344e4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5641,7 +5641,7 @@ static int md_alloc(dev_t dev, char *name)
 			    strcmp(mddev2->gendisk->disk_name, name) == 0) {
 				spin_unlock(&all_mddevs_lock);
 				error = -EEXIST;
-				goto out_unlock_disks_mutex;
+				goto out_free_mddev;
 			}
 		spin_unlock(&all_mddevs_lock);
 	}
@@ -5654,7 +5654,7 @@ static int md_alloc(dev_t dev, char *name)
 	error = -ENOMEM;
 	disk = blk_alloc_disk(NUMA_NO_NODE);
 	if (!disk)
-		goto out_unlock_disks_mutex;
+		goto out_free_mddev;
 
 	disk->major = MAJOR(mddev->unit);
 	disk->first_minor = unit << shift;
@@ -5674,26 +5674,35 @@ static int md_alloc(dev_t dev, char *name)
 	disk->events |= DISK_EVENT_MEDIA_CHANGE;
 	mddev->gendisk = disk;
 	error = add_disk(disk);
-	if (error) {
-		mddev->gendisk = NULL;
-		put_disk(disk);
-		goto out_unlock_disks_mutex;
-	}
+	if (error)
+		goto out_put_disk;
 
 	error = kobject_add(&mddev->kobj, &disk_to_dev(disk)->kobj, "%s", "md");
-	if (error)
+	if (error) {
+		/*
+		 * The disk is already live at this point.  Clear the hold flag
+		 * and let mddev_put take care of the deletion, as it isn't any
+		 * different from a normal close on last release now.
+		 */
+		mddev->hold_active = 0;
 		goto out_unlock_disks_mutex;
+	}
 
 	kobject_uevent(&mddev->kobj, KOBJ_ADD);
 	mddev->sysfs_state = sysfs_get_dirent_safe(mddev->kobj.sd, "array_state");
 	mddev->sysfs_level = sysfs_get_dirent_safe(mddev->kobj.sd, "level");
 
 out_unlock_disks_mutex:
-	if (error)
-		mddev->hold_active = 0;
 	mutex_unlock(&disks_mutex);
 	mddev_put(mddev);
 	return error;
+
+out_put_disk:
+	put_disk(disk);
+out_free_mddev:
+	mutex_unlock(&disks_mutex);
+	kfree(mddev);
+	return error;
 }
 
 static void md_probe(dev_t dev)

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/8] md: implement ->free_disk
  2022-07-13  7:17     ` Christoph Hellwig
@ 2022-07-13 15:30       ` Logan Gunthorpe
  0 siblings, 0 replies; 13+ messages in thread
From: Logan Gunthorpe @ 2022-07-13 15:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Song Liu, linux-raid, linux-block




On 2022-07-13 01:17, Christoph Hellwig wrote:
> On Tue, Jul 12, 2022 at 05:13:48PM -0600, Logan Gunthorpe wrote:
>>> +
>>> +	percpu_ref_exit(&mddev->writes_pending);
>>> +	bioset_exit(&mddev->bio_set);
>>> +	bioset_exit(&mddev->sync_set);
>>> +
>>> +	kfree(mddev);
>>> +}
>>
>> I still don't think this is entirely correct. There are error paths that
>> will put the kobject before the disk is created and if they get hit then
>> the kfree(mddev) will never be called and the memory will be leaked.
> 
> True.
> 
>> Instead of creating an ugly special path for that, I came up with a solution 
>> that I think  makes a bit more sense: the kobject is still freed in it's 
>> own free  function, but the disk holds a reference to the kobject and drops
>> it in its free function. The sysfs puts and del_gendisk are then moved
>> into mddev_delayed_delete() so they happen earlier.
> 
> I'm not sure this is a good idea.  The mddev kobject hangs off the
> disk, so I don't think that it should in any way control the life
> time of the disk, as that just creates potential problems down the
> road.

My interpretation was that kobject_del() (which is called in
mddev_delayed_delete() before the disk would be removed) removes the
mddev from the disk. At that point, it's just a structure hanging around
that will be freed when its last reference is dropped, which is the disk
itself cleaning up. I'm not sure how that would cause potential problems.

Logan

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-07-13 15:30 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-07-12  7:03 fix md disk_name lifetime problems Christoph Hellwig
2022-07-12  7:03 ` [PATCH 1/8] md: fix kobject_add error handling Christoph Hellwig
2022-07-12  7:03 ` [PATCH 2/8] md: implement ->free_disk Christoph Hellwig
2022-07-12 23:13   ` Logan Gunthorpe
2022-07-13  7:17     ` Christoph Hellwig
2022-07-13 15:30       ` Logan Gunthorpe
2022-07-12  7:03 ` [PATCH 3/8] md: factor out the rdev overlaps check from rdev_size_store Christoph Hellwig
2022-07-12  7:03 ` [PATCH 4/8] md: stop using for_each_mddev in md_do_sync Christoph Hellwig
2022-07-12  7:03 ` [PATCH 5/8] md: stop using for_each_mddev in md_notify_reboot Christoph Hellwig
2022-07-12  7:03 ` [PATCH 6/8] md: stop using for_each_mddev in md_exit Christoph Hellwig
2022-07-12  7:03 ` [PATCH 7/8] md: only delete entries from all_mddevs when the disk is freed Christoph Hellwig
2022-07-12  7:03 ` [PATCH 8/8] md: simplify md_open Christoph Hellwig
2022-07-12 23:21 ` fix md disk_name lifetime problems Logan Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).