[PATCH RFC 0/3] md: call del_gendisk in sync way

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC 0/3] md: call del_gendisk in sync way
@ 2025-04-28  8:26 Xiao Ni
  2025-04-28  8:26 ` [PATCH v2 1/3] md: replace MD_DELETED with MD_CLOSING Xiao Ni
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Xiao Ni @ 2025-04-28  8:26 UTC (permalink / raw)
  To: linux-raid; +Cc: song, yukuai1, ncroxon, mtkaczyk

This patch clears some flags and counters related stopping an array
and calls del_gendisk in sync way.

This patch set can cause a deadlock when stopping imsm raid1, I'm
investigating this deadlock problem to resovle this.

Xiao Ni (3):
  md: replace MD_DELETED with MD_CLOSING
  md: replace ->openers with ->active
  md: call del_gendisk in sync way

 drivers/md/md.c | 67 ++++++++++++++++++++++---------------------------
 drivers/md/md.h | 13 ----------
 2 files changed, 30 insertions(+), 50 deletions(-)

-- 
2.32.0 (Apple Git-132)


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/3] md: replace MD_DELETED with MD_CLOSING
  2025-04-28  8:26 [PATCH RFC 0/3] md: call del_gendisk in sync way Xiao Ni
@ 2025-04-28  8:26 ` Xiao Ni
  2025-04-29  1:54   ` Yu Kuai
  2025-04-28  8:26 ` [PATCH 2/3] md: replace ->openers with ->active Xiao Ni
  2025-04-28  8:26 ` [PATCH 3/3] md: call del_gendisk in sync way Xiao Ni
  2 siblings, 1 reply; 8+ messages in thread
From: Xiao Ni @ 2025-04-28  8:26 UTC (permalink / raw)
  To: linux-raid; +Cc: song, yukuai1, ncroxon, mtkaczyk

Before commit 12a6caf27324 ("md: only delete entries from all_mddevs when
the disk is freed") MD_CLOSING is cleared in ioctl path. Now MD_CLOSING
will keep until mddev is freed. So MD_CLOSING can be used to check if the
array is stopping.

This patch also removes the ->hold_active check in md_clean function.
->hold_active is used to avoid mddev is released on the last close before
adding disks to mddev. udev rule will open/close array once it's created.
The array can be closed without ->hold_active support. But it should not
work again when stopping an array. So remove the check ->hold_active in
md_clean function.

Signed-off-by: Xiao Ni <xni@redhat.com>
---
v2 removes ->hold_active check in md_clean
 drivers/md/md.c | 24 +++++++-----------------
 drivers/md/md.h |  2 --
 2 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 9daa78c5fe33..e14253433c49 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -599,7 +599,7 @@ static inline struct mddev *mddev_get(struct mddev *mddev)
 {
 	lockdep_assert_held(&all_mddevs_lock);
 
-	if (test_bit(MD_DELETED, &mddev->flags))
+	if (test_bit(MD_CLOSING, &mddev->flags))
 		return NULL;
 	atomic_inc(&mddev->active);
 	return mddev;
@@ -613,9 +613,6 @@ static void __mddev_put(struct mddev *mddev)
 	    mddev->ctime || mddev->hold_active)
 		return;
 
-	/* Array is not configured at all, and not held active, so destroy it */
-	set_bit(MD_DELETED, &mddev->flags);
-
 	/*
 	 * Call queue_work inside the spinlock so that flush_workqueue() after
 	 * mddev_find will succeed in waiting for the work to be done.
@@ -3312,7 +3309,7 @@ static bool md_rdev_overlaps(struct md_rdev *rdev)
 
 	spin_lock(&all_mddevs_lock);
 	list_for_each_entry(mddev, &all_mddevs, all_mddevs) {
-		if (test_bit(MD_DELETED, &mddev->flags))
+		if (test_bit(MD_CLOSING, &mddev->flags))
 			continue;
 		rdev_for_each(rdev2, mddev) {
 			if (rdev != rdev2 && rdev->bdev == rdev2->bdev &&
@@ -6360,15 +6357,10 @@ static void md_clean(struct mddev *mddev)
 	mddev->persistent = 0;
 	mddev->level = LEVEL_NONE;
 	mddev->clevel[0] = 0;
-	/*
-	 * Don't clear MD_CLOSING, or mddev can be opened again.
-	 * 'hold_active != 0' means mddev is still in the creation
-	 * process and will be used later.
-	 */
-	if (mddev->hold_active)
-		mddev->flags = 0;
-	else
-		mddev->flags &= BIT_ULL_MASK(MD_CLOSING);
+	/* UNTIL_STOP is cleared here */
+	mddev->hold_active = 0;
+	/* Don't clear MD_CLOSING, or mddev can be opened again. */
+	mddev->flags &= BIT_ULL_MASK(MD_CLOSING);
 	mddev->sb_flags = 0;
 	mddev->ro = MD_RDWR;
 	mddev->metadata_type[0] = 0;
@@ -6595,8 +6587,6 @@ static int do_md_stop(struct mddev *mddev, int mode)
 		export_array(mddev);
 
 		md_clean(mddev);
-		if (mddev->hold_active == UNTIL_STOP)
-			mddev->hold_active = 0;
 	}
 	md_new_event();
 	sysfs_notify_dirent_safe(mddev->sysfs_state);
@@ -8999,7 +8989,7 @@ void md_do_sync(struct md_thread *thread)
 			goto skip;
 		spin_lock(&all_mddevs_lock);
 		list_for_each_entry(mddev2, &all_mddevs, all_mddevs) {
-			if (test_bit(MD_DELETED, &mddev2->flags))
+			if (test_bit(MD_CLOSING, &mddev2->flags))
 				continue;
 			if (mddev2 == mddev)
 				continue;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1cf00a04bcdd..a9dccb3d84ed 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -338,7 +338,6 @@ struct md_cluster_operations;
  * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
  *		   array is ready yet.
  * @MD_BROKEN: This is used to stop writes and mark array as failed.
- * @MD_DELETED: This device is being deleted
  *
  * change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added
  */
@@ -353,7 +352,6 @@ enum mddev_flags {
 	MD_HAS_MULTIPLE_PPLS,
 	MD_NOT_READY,
 	MD_BROKEN,
-	MD_DELETED,
 };
 
 enum mddev_sb_flags {
-- 
2.32.0 (Apple Git-132)


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] md: replace ->openers with ->active
  2025-04-28  8:26 [PATCH RFC 0/3] md: call del_gendisk in sync way Xiao Ni
  2025-04-28  8:26 ` [PATCH v2 1/3] md: replace MD_DELETED with MD_CLOSING Xiao Ni
@ 2025-04-28  8:26 ` Xiao Ni
  2025-04-29  2:05   ` Yu Kuai
  2025-04-28  8:26 ` [PATCH 3/3] md: call del_gendisk in sync way Xiao Ni
  2 siblings, 1 reply; 8+ messages in thread
From: Xiao Ni @ 2025-04-28  8:26 UTC (permalink / raw)
  To: linux-raid; +Cc: song, yukuai1, ncroxon, mtkaczyk

It only checks openers when stopping an array. But some users still can
access mddev through sysfs interfaces. Now ->active is added once mddev
is accessed which is done in function mddev_get protected by lock
all_mddevs_lock. They are md_open, md_seq_show, md_attr_show/store and
md_notify_reboot and md_exit.

->openers is only added in md_open while mddev_get is called too. So we
can replace ->openers with ->active. This can guarantee no one access the
array once MD_CLOSING is set.

At the same time, ->open_mutex is replaced with all_mddevs_lock. Though
all_mddevs_lock is a global lock, the only place checks ->active and sets
MD_CLOSING in ioctl path. So it doesn't affect performance.

Signed-off-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/md.c | 47 +++++++++++++++++------------------------------
 drivers/md/md.h | 11 -----------
 2 files changed, 17 insertions(+), 41 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index e14253433c49..4ca3d04ce13f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -523,18 +523,27 @@ void mddev_resume(struct mddev *mddev)
 EXPORT_SYMBOL_GPL(mddev_resume);
 
 /* sync bdev before setting device to readonly or stopping raid*/
-static int mddev_set_closing_and_sync_blockdev(struct mddev *mddev, int opener_num)
+static int mddev_set_closing_and_sync_blockdev(struct mddev *mddev)
 {
-	mutex_lock(&mddev->open_mutex);
-	if (mddev->pers && atomic_read(&mddev->openers) > opener_num) {
-		mutex_unlock(&mddev->open_mutex);
+	spin_lock(&all_mddevs_lock);
+
+	/*
+	 * there are two places that call this function and ->active
+	 * is added before calling this function. So the array can't
+	 *  be stopped when ->active is bigger than 1.
+	 */
+	if (mddev->pers && atomic_read(&mddev->active) > 1) {
+
+		spin_unlock(&all_mddevs_lock);
 		return -EBUSY;
 	}
+
 	if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
-		mutex_unlock(&mddev->open_mutex);
+		spin_unlock(&all_mddevs_lock);
 		return -EBUSY;
 	}
-	mutex_unlock(&mddev->open_mutex);
+
+	spin_unlock(&all_mddevs_lock);
 
 	sync_blockdev(mddev->gendisk->part0);
 	return 0;
@@ -663,7 +672,6 @@ int mddev_init(struct mddev *mddev)
 	/* We want to start with the refcount at zero */
 	percpu_ref_put(&mddev->writes_pending);
 
-	mutex_init(&mddev->open_mutex);
 	mutex_init(&mddev->reconfig_mutex);
 	mutex_init(&mddev->suspend_mutex);
 	mutex_init(&mddev->bitmap_info.mutex);
@@ -672,7 +680,6 @@ int mddev_init(struct mddev *mddev)
 	INIT_LIST_HEAD(&mddev->deleting);
 	timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
 	atomic_set(&mddev->active, 1);
-	atomic_set(&mddev->openers, 0);
 	atomic_set(&mddev->sync_seq, 0);
 	spin_lock_init(&mddev->lock);
 	init_waitqueue_head(&mddev->sb_wait);
@@ -4421,8 +4428,7 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 	case read_auto:
 		if (!mddev->pers || !md_is_rdwr(mddev))
 			break;
-		/* write sysfs will not open mddev and opener should be 0 */
-		err = mddev_set_closing_and_sync_blockdev(mddev, 0);
+		err = mddev_set_closing_and_sync_blockdev(mddev);
 		if (err)
 			return err;
 		break;
@@ -7738,7 +7744,7 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 		/* Need to flush page cache, and ensure no-one else opens
 		 * and writes
 		 */
-		err = mddev_set_closing_and_sync_blockdev(mddev, 1);
+		err = mddev_set_closing_and_sync_blockdev(mddev);
 		if (err)
 			return err;
 	}
@@ -7938,7 +7944,6 @@ static int md_set_read_only(struct block_device *bdev, bool ro)
 static int md_open(struct gendisk *disk, blk_mode_t mode)
 {
 	struct mddev *mddev;
-	int err;
 
 	spin_lock(&all_mddevs_lock);
 	mddev = mddev_get(disk->private_data);
@@ -7946,25 +7951,8 @@ static int md_open(struct gendisk *disk, blk_mode_t mode)
 	if (!mddev)
 		return -ENODEV;
 
-	err = mutex_lock_interruptible(&mddev->open_mutex);
-	if (err)
-		goto out;
-
-	err = -ENODEV;
-	if (test_bit(MD_CLOSING, &mddev->flags))
-		goto out_unlock;
-
-	atomic_inc(&mddev->openers);
-	mutex_unlock(&mddev->open_mutex);
-
 	disk_check_media_change(disk);
 	return 0;
-
-out_unlock:
-	mutex_unlock(&mddev->open_mutex);
-out:
-	mddev_put(mddev);
-	return err;
 }
 
 static void md_release(struct gendisk *disk)
@@ -7972,7 +7960,6 @@ static void md_release(struct gendisk *disk)
 	struct mddev *mddev = disk->private_data;
 
 	BUG_ON(!mddev);
-	atomic_dec(&mddev->openers);
 	mddev_put(mddev);
 }
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index a9dccb3d84ed..60dc0943e05b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -496,19 +496,8 @@ struct mddev {
 	int				recovery_disabled;
 
 	int				in_sync;	/* know to not need resync */
-	/* 'open_mutex' avoids races between 'md_open' and 'do_md_stop', so
-	 * that we are never stopping an array while it is open.
-	 * 'reconfig_mutex' protects all other reconfiguration.
-	 * These locks are separate due to conflicting interactions
-	 * with disk->open_mutex.
-	 * Lock ordering is:
-	 *  reconfig_mutex -> disk->open_mutex
-	 *  disk->open_mutex -> open_mutex:  e.g. __blkdev_get -> md_open
-	 */
-	struct mutex			open_mutex;
 	struct mutex			reconfig_mutex;
 	atomic_t			active;		/* general refcount */
-	atomic_t			openers;	/* number of active opens */
 
 	int				changed;	/* True if we might need to
 							 * reread partition info */
-- 
2.32.0 (Apple Git-132)


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] md: call del_gendisk in sync way
  2025-04-28  8:26 [PATCH RFC 0/3] md: call del_gendisk in sync way Xiao Ni
  2025-04-28  8:26 ` [PATCH v2 1/3] md: replace MD_DELETED with MD_CLOSING Xiao Ni
  2025-04-28  8:26 ` [PATCH 2/3] md: replace ->openers with ->active Xiao Ni
@ 2025-04-28  8:26 ` Xiao Ni
  2 siblings, 0 replies; 8+ messages in thread
From: Xiao Ni @ 2025-04-28  8:26 UTC (permalink / raw)
  To: linux-raid; +Cc: song, yukuai1, ncroxon, mtkaczyk

Now del_gendisk and put_disk are called asynchronously in workqueue work.
The asynchronous way also has a problem that the device node can still
exist after mdadm --stop command returns in a short window. So udev rule
can open this device node and create the struct mddev in kernel again.

So put del_gendisk in ioctl path and still leave put_disk in
md_kobj_release to avoid uaf.

Signed-off-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/md.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4ca3d04ce13f..0c921072ffff 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5724,11 +5724,16 @@ md_attr_store(struct kobject *kobj, struct attribute *attr,
 	struct md_sysfs_entry *entry = container_of(attr, struct md_sysfs_entry, attr);
 	struct mddev *mddev = container_of(kobj, struct mddev, kobj);
 	ssize_t rv;
+	struct kernfs_node *kn = NULL;
 
 	if (!entry->store)
 		return -EIO;
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
+
+	if (entry->store == array_state_store && cmd_match(page, "clear"))
+		kn = sysfs_break_active_protection(kobj, attr);
+
 	spin_lock(&all_mddevs_lock);
 	if (!mddev_get(mddev)) {
 		spin_unlock(&all_mddevs_lock);
@@ -5737,6 +5742,10 @@ md_attr_store(struct kobject *kobj, struct attribute *attr,
 	spin_unlock(&all_mddevs_lock);
 	rv = entry->store(mddev, page, length);
 	mddev_put(mddev);
+
+	if (kn)
+		sysfs_unbreak_active_protection(kn);
+
 	return rv;
 }
 
@@ -5749,7 +5758,6 @@ static void md_kobj_release(struct kobject *ko)
 	if (mddev->sysfs_level)
 		sysfs_put(mddev->sysfs_level);
 
-	del_gendisk(mddev->gendisk);
 	put_disk(mddev->gendisk);
 }
 
@@ -6591,8 +6599,8 @@ static int do_md_stop(struct mddev *mddev, int mode)
 		mddev->bitmap_info.offset = 0;
 
 		export_array(mddev);
-
 		md_clean(mddev);
+		del_gendisk(mddev->gendisk);
 	}
 	md_new_event();
 	sysfs_notify_dirent_safe(mddev->sysfs_state);
-- 
2.32.0 (Apple Git-132)


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 1/3] md: replace MD_DELETED with MD_CLOSING
  2025-04-28  8:26 ` [PATCH v2 1/3] md: replace MD_DELETED with MD_CLOSING Xiao Ni
@ 2025-04-29  1:54   ` Yu Kuai
  0 siblings, 0 replies; 8+ messages in thread
From: Yu Kuai @ 2025-04-29  1:54 UTC (permalink / raw)
  To: Xiao Ni, linux-raid; +Cc: song, yukuai1, ncroxon, mtkaczyk, yukuai (C)

Hi,

在 2025/04/28 16:26, Xiao Ni 写道:
> Before commit 12a6caf27324 ("md: only delete entries from all_mddevs when
> the disk is freed") MD_CLOSING is cleared in ioctl path. Now MD_CLOSING
> will keep until mddev is freed. So MD_CLOSING can be used to check if the
> array is stopping.
> 
> This patch also removes the ->hold_active check in md_clean function.
> ->hold_active is used to avoid mddev is released on the last close before
> adding disks to mddev. udev rule will open/close array once it's created.
> The array can be closed without ->hold_active support. But it should not
> work again when stopping an array. So remove the check ->hold_active in
> md_clean function.
> 
> Signed-off-by: Xiao Ni <xni@redhat.com>
> ---
> v2 removes ->hold_active check in md_clean
>   drivers/md/md.c | 24 +++++++-----------------
>   drivers/md/md.h |  2 --
>   2 files changed, 7 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 9daa78c5fe33..e14253433c49 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -599,7 +599,7 @@ static inline struct mddev *mddev_get(struct mddev *mddev)
>   {
>   	lockdep_assert_held(&all_mddevs_lock);
>   
> -	if (test_bit(MD_DELETED, &mddev->flags))
> +	if (test_bit(MD_CLOSING, &mddev->flags))
>   		return NULL;
>   	atomic_inc(&mddev->active);
>   	return mddev;
> @@ -613,9 +613,6 @@ static void __mddev_put(struct mddev *mddev)
>   	    mddev->ctime || mddev->hold_active)
>   		return;
>   
> -	/* Array is not configured at all, and not held active, so destroy it */
> -	set_bit(MD_DELETED, &mddev->flags);
> -
>   	/*
>   	 * Call queue_work inside the spinlock so that flush_workqueue() after
>   	 * mddev_find will succeed in waiting for the work to be done.
> @@ -3312,7 +3309,7 @@ static bool md_rdev_overlaps(struct md_rdev *rdev)
>   
>   	spin_lock(&all_mddevs_lock);
>   	list_for_each_entry(mddev, &all_mddevs, all_mddevs) {
> -		if (test_bit(MD_DELETED, &mddev->flags))
> +		if (test_bit(MD_CLOSING, &mddev->flags))
>   			continue;
>   		rdev_for_each(rdev2, mddev) {
>   			if (rdev != rdev2 && rdev->bdev == rdev2->bdev &&
> @@ -6360,15 +6357,10 @@ static void md_clean(struct mddev *mddev)
>   	mddev->persistent = 0;
>   	mddev->level = LEVEL_NONE;
>   	mddev->clevel[0] = 0;
> -	/*
> -	 * Don't clear MD_CLOSING, or mddev can be opened again.
> -	 * 'hold_active != 0' means mddev is still in the creation
> -	 * process and will be used later.
> -	 */
> -	if (mddev->hold_active)
> -		mddev->flags = 0;
> -	else
> -		mddev->flags &= BIT_ULL_MASK(MD_CLOSING);
> +	/* UNTIL_STOP is cleared here */
> +	mddev->hold_active = 0;
> +	/* Don't clear MD_CLOSING, or mddev can be opened again. */
> +	mddev->flags &= BIT_ULL_MASK(MD_CLOSING);

This should be a seperate patch, and following code can be removed as
well.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index fce27cec67ad..2594b579a488 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6734,8 +6734,6 @@ static int do_md_stop(struct mddev *mddev, int mode)
                 export_array(mddev);

                 md_clean(mddev);
-               if (mddev->hold_active == UNTIL_STOP)
-                       mddev->hold_active = 0;
         }
         md_new_event();
         sysfs_notify_dirent_safe(mddev->sysfs_state);

Thanks,
Kuai

>   	mddev->sb_flags = 0;
>   	mddev->ro = MD_RDWR;
>   	mddev->metadata_type[0] = 0;
> @@ -6595,8 +6587,6 @@ static int do_md_stop(struct mddev *mddev, int mode)
>   		export_array(mddev);
>   
>   		md_clean(mddev);
> -		if (mddev->hold_active == UNTIL_STOP)
> -			mddev->hold_active = 0;
>   	}
>   	md_new_event();
>   	sysfs_notify_dirent_safe(mddev->sysfs_state);
> @@ -8999,7 +8989,7 @@ void md_do_sync(struct md_thread *thread)
>   			goto skip;
>   		spin_lock(&all_mddevs_lock);
>   		list_for_each_entry(mddev2, &all_mddevs, all_mddevs) {
> -			if (test_bit(MD_DELETED, &mddev2->flags))
> +			if (test_bit(MD_CLOSING, &mddev2->flags))
>   				continue;
>   			if (mddev2 == mddev)
>   				continue;
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 1cf00a04bcdd..a9dccb3d84ed 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -338,7 +338,6 @@ struct md_cluster_operations;
>    * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
>    *		   array is ready yet.
>    * @MD_BROKEN: This is used to stop writes and mark array as failed.
> - * @MD_DELETED: This device is being deleted
>    *
>    * change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added
>    */
> @@ -353,7 +352,6 @@ enum mddev_flags {
>   	MD_HAS_MULTIPLE_PPLS,
>   	MD_NOT_READY,
>   	MD_BROKEN,
> -	MD_DELETED,
>   };
>   
>   enum mddev_sb_flags {
> 


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/3] md: replace ->openers with ->active
  2025-04-28  8:26 ` [PATCH 2/3] md: replace ->openers with ->active Xiao Ni
@ 2025-04-29  2:05   ` Yu Kuai
  0 siblings, 0 replies; 8+ messages in thread
From: Yu Kuai @ 2025-04-29  2:05 UTC (permalink / raw)
  To: Xiao Ni, linux-raid; +Cc: song, yukuai1, ncroxon, mtkaczyk, yukuai (C)

Hi,

在 2025/04/28 16:26, Xiao Ni 写道:
> It only checks openers when stopping an array. But some users still can
> access mddev through sysfs interfaces. Now ->active is added once mddev
> is accessed which is done in function mddev_get protected by lock
> all_mddevs_lock. They are md_open, md_seq_show, md_attr_show/store and
> md_notify_reboot and md_exit.
> 
> ->openers is only added in md_open while mddev_get is called too. So we
> can replace ->openers with ->active. This can guarantee no one access the
> array once MD_CLOSING is set.
> 
> At the same time, ->open_mutex is replaced with all_mddevs_lock. Though
> all_mddevs_lock is a global lock, the only place checks ->active and sets
> MD_CLOSING in ioctl path. So it doesn't affect performance.
> 
> Signed-off-by: Xiao Ni <xni@redhat.com>
> ---
>   drivers/md/md.c | 47 +++++++++++++++++------------------------------
>   drivers/md/md.h | 11 -----------
>   2 files changed, 17 insertions(+), 41 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index e14253433c49..4ca3d04ce13f 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -523,18 +523,27 @@ void mddev_resume(struct mddev *mddev)
>   EXPORT_SYMBOL_GPL(mddev_resume);
>   
>   /* sync bdev before setting device to readonly or stopping raid*/
> -static int mddev_set_closing_and_sync_blockdev(struct mddev *mddev, int opener_num)
> +static int mddev_set_closing_and_sync_blockdev(struct mddev *mddev)
>   {
> -	mutex_lock(&mddev->open_mutex);
> -	if (mddev->pers && atomic_read(&mddev->openers) > opener_num) {
> -		mutex_unlock(&mddev->open_mutex);
> +	spin_lock(&all_mddevs_lock);
> +
> +	/*
> +	 * there are two places that call this function and ->active
> +	 * is added before calling this function. So the array can't
> +	 *  be stopped when ->active is bigger than 1.
> +	 */
> +	if (mddev->pers && atomic_read(&mddev->active) > 1) {

I think this is not a good idea, this will introduce new synchronization
that read/write sysfs APIs will forbid ioctl to stop array. If you
really want to do this, you must also forbid new sysfs APIs and wait for
inflight sysfs APIs to be done, however, this looks more complicated.

Thanks,
Kuai

> +
> +		spin_unlock(&all_mddevs_lock);
>   		return -EBUSY;
>   	}
> +
>   	if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
> -		mutex_unlock(&mddev->open_mutex);
> +		spin_unlock(&all_mddevs_lock);
>   		return -EBUSY;
>   	}
> -	mutex_unlock(&mddev->open_mutex);
> +
> +	spin_unlock(&all_mddevs_lock);
>   
>   	sync_blockdev(mddev->gendisk->part0);
>   	return 0;
> @@ -663,7 +672,6 @@ int mddev_init(struct mddev *mddev)
>   	/* We want to start with the refcount at zero */
>   	percpu_ref_put(&mddev->writes_pending);
>   
> -	mutex_init(&mddev->open_mutex);
>   	mutex_init(&mddev->reconfig_mutex);
>   	mutex_init(&mddev->suspend_mutex);
>   	mutex_init(&mddev->bitmap_info.mutex);
> @@ -672,7 +680,6 @@ int mddev_init(struct mddev *mddev)
>   	INIT_LIST_HEAD(&mddev->deleting);
>   	timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
>   	atomic_set(&mddev->active, 1);
> -	atomic_set(&mddev->openers, 0);
>   	atomic_set(&mddev->sync_seq, 0);
>   	spin_lock_init(&mddev->lock);
>   	init_waitqueue_head(&mddev->sb_wait);
> @@ -4421,8 +4428,7 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
>   	case read_auto:
>   		if (!mddev->pers || !md_is_rdwr(mddev))
>   			break;
> -		/* write sysfs will not open mddev and opener should be 0 */
> -		err = mddev_set_closing_and_sync_blockdev(mddev, 0);
> +		err = mddev_set_closing_and_sync_blockdev(mddev);
>   		if (err)
>   			return err;
>   		break;
> @@ -7738,7 +7744,7 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
>   		/* Need to flush page cache, and ensure no-one else opens
>   		 * and writes
>   		 */
> -		err = mddev_set_closing_and_sync_blockdev(mddev, 1);
> +		err = mddev_set_closing_and_sync_blockdev(mddev);
>   		if (err)
>   			return err;
>   	}
> @@ -7938,7 +7944,6 @@ static int md_set_read_only(struct block_device *bdev, bool ro)
>   static int md_open(struct gendisk *disk, blk_mode_t mode)
>   {
>   	struct mddev *mddev;
> -	int err;
>   
>   	spin_lock(&all_mddevs_lock);
>   	mddev = mddev_get(disk->private_data);
> @@ -7946,25 +7951,8 @@ static int md_open(struct gendisk *disk, blk_mode_t mode)
>   	if (!mddev)
>   		return -ENODEV;
>   
> -	err = mutex_lock_interruptible(&mddev->open_mutex);
> -	if (err)
> -		goto out;
> -
> -	err = -ENODEV;
> -	if (test_bit(MD_CLOSING, &mddev->flags))
> -		goto out_unlock;
> -
> -	atomic_inc(&mddev->openers);
> -	mutex_unlock(&mddev->open_mutex);
> -
>   	disk_check_media_change(disk);
>   	return 0;
> -
> -out_unlock:
> -	mutex_unlock(&mddev->open_mutex);
> -out:
> -	mddev_put(mddev);
> -	return err;
>   }
>   
>   static void md_release(struct gendisk *disk)
> @@ -7972,7 +7960,6 @@ static void md_release(struct gendisk *disk)
>   	struct mddev *mddev = disk->private_data;
>   
>   	BUG_ON(!mddev);
> -	atomic_dec(&mddev->openers);
>   	mddev_put(mddev);
>   }
>   
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index a9dccb3d84ed..60dc0943e05b 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -496,19 +496,8 @@ struct mddev {
>   	int				recovery_disabled;
>   
>   	int				in_sync;	/* know to not need resync */
> -	/* 'open_mutex' avoids races between 'md_open' and 'do_md_stop', so
> -	 * that we are never stopping an array while it is open.
> -	 * 'reconfig_mutex' protects all other reconfiguration.
> -	 * These locks are separate due to conflicting interactions
> -	 * with disk->open_mutex.
> -	 * Lock ordering is:
> -	 *  reconfig_mutex -> disk->open_mutex
> -	 *  disk->open_mutex -> open_mutex:  e.g. __blkdev_get -> md_open
> -	 */
> -	struct mutex			open_mutex;
>   	struct mutex			reconfig_mutex;
>   	atomic_t			active;		/* general refcount */
> -	atomic_t			openers;	/* number of active opens */
>   
>   	int				changed;	/* True if we might need to
>   							 * reread partition info */
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 3/3] md: call del_gendisk in sync way
  2025-05-07  2:14 [PATCH 0/3] " Xiao Ni
@ 2025-05-07  2:14 ` Xiao Ni
  2025-05-07  6:28   ` Yu Kuai
  0 siblings, 1 reply; 8+ messages in thread
From: Xiao Ni @ 2025-05-07  2:14 UTC (permalink / raw)
  To: linux-raid; +Cc: song, yukuai1, ncroxon

Now del_gendisk and put_disk are called asynchronously in workqueue work.
The asynchronous way also has a problem that the device node can still
exist after mdadm --stop command returns in a short window. So udev rule
can open this device node and create the struct mddev in kernel again.

So put del_gendisk in ioctl path and still leave put_disk in
md_kobj_release to avoid uaf.

Signed-off-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/md.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c226747be9e3..a82778f6c31f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5718,19 +5718,30 @@ md_attr_store(struct kobject *kobj, struct attribute *attr,
 	struct md_sysfs_entry *entry = container_of(attr, struct md_sysfs_entry, attr);
 	struct mddev *mddev = container_of(kobj, struct mddev, kobj);
 	ssize_t rv;
+	struct kernfs_node *kn = NULL;
 
 	if (!entry->store)
 		return -EIO;
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
+
+	if (entry->store == array_state_store && cmd_match(page, "clear"))
+		kn = sysfs_break_active_protection(kobj, attr);
+
 	spin_lock(&all_mddevs_lock);
 	if (!mddev_get(mddev)) {
+		if (kn)
+			sysfs_unbreak_active_protection(kn);
 		spin_unlock(&all_mddevs_lock);
 		return -EBUSY;
 	}
 	spin_unlock(&all_mddevs_lock);
 	rv = entry->store(mddev, page, length);
 	mddev_put(mddev);
+
+	if (kn)
+		sysfs_unbreak_active_protection(kn);
+
 	return rv;
 }
 
@@ -5743,7 +5754,6 @@ static void md_kobj_release(struct kobject *ko)
 	if (mddev->sysfs_level)
 		sysfs_put(mddev->sysfs_level);
 
-	del_gendisk(mddev->gendisk);
 	put_disk(mddev->gendisk);
 }
 
@@ -6585,8 +6595,8 @@ static int do_md_stop(struct mddev *mddev, int mode)
 		mddev->bitmap_info.offset = 0;
 
 		export_array(mddev);
-
 		md_clean(mddev);
+		del_gendisk(mddev->gendisk);
 	}
 	md_new_event();
 	sysfs_notify_dirent_safe(mddev->sysfs_state);
-- 
2.32.0 (Apple Git-132)


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/3] md: call del_gendisk in sync way
  2025-05-07  2:14 ` [PATCH 3/3] " Xiao Ni
@ 2025-05-07  6:28   ` Yu Kuai
  0 siblings, 0 replies; 8+ messages in thread
From: Yu Kuai @ 2025-05-07  6:28 UTC (permalink / raw)
  To: Xiao Ni, linux-raid; +Cc: song, yukuai1, ncroxon, yukuai (C)

Hi,

在 2025/05/07 10:14, Xiao Ni 写道:
> Now del_gendisk and put_disk are called asynchronously in workqueue work.
> The asynchronous way also has a problem that the device node can still
> exist after mdadm --stop command returns in a short window. So udev rule
> can open this device node and create the struct mddev in kernel again.
> 
> So put del_gendisk in ioctl path and still leave put_disk in
> md_kobj_release to avoid uaf.
> 
> Signed-off-by: Xiao Ni <xni@redhat.com>
> ---
>   drivers/md/md.c | 14 ++++++++++++--
>   1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index c226747be9e3..a82778f6c31f 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5718,19 +5718,30 @@ md_attr_store(struct kobject *kobj, struct attribute *attr,
>   	struct md_sysfs_entry *entry = container_of(attr, struct md_sysfs_entry, attr);
>   	struct mddev *mddev = container_of(kobj, struct mddev, kobj);
>   	ssize_t rv;
> +	struct kernfs_node *kn = NULL;
>   
>   	if (!entry->store)
>   		return -EIO;
>   	if (!capable(CAP_SYS_ADMIN))
>   		return -EACCES;
> +
> +	if (entry->store == array_state_store && cmd_match(page, "clear"))
> +		kn = sysfs_break_active_protection(kobj, attr);
> +
>   	spin_lock(&all_mddevs_lock);
>   	if (!mddev_get(mddev)) {
> +		if (kn)
> +			sysfs_unbreak_active_protection(kn);
>   		spin_unlock(&all_mddevs_lock);
>   		return -EBUSY;
>   	}
>   	spin_unlock(&all_mddevs_lock);
>   	rv = entry->store(mddev, page, length);
>   	mddev_put(mddev);
> +
> +	if (kn)
> +		sysfs_unbreak_active_protection(kn);
> +
>   	return rv;
>   }
>   
> @@ -5743,7 +5754,6 @@ static void md_kobj_release(struct kobject *ko)
>   	if (mddev->sysfs_level)
>   		sysfs_put(mddev->sysfs_level);
>   
> -	del_gendisk(mddev->gendisk);
>   	put_disk(mddev->gendisk);
>   }
>   
> @@ -6585,8 +6595,8 @@ static int do_md_stop(struct mddev *mddev, int mode)
>   		mddev->bitmap_info.offset = 0;
>   
>   		export_array(mddev);
> -
>   		md_clean(mddev);
> +		del_gendisk(mddev->gendisk);
>   	}

Refer to the comments in mddev_unlock, there are some special hacks, to
delay removing md_redundancy_group until releasing reconfig_mutex, and
looks like del_gendisk here will break that.

Thanks,
Kuai

>   	md_new_event();
>   	sysfs_notify_dirent_safe(mddev->sysfs_state);
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-05-07  6:28 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-28  8:26 [PATCH RFC 0/3] md: call del_gendisk in sync way Xiao Ni
2025-04-28  8:26 ` [PATCH v2 1/3] md: replace MD_DELETED with MD_CLOSING Xiao Ni
2025-04-29  1:54   ` Yu Kuai
2025-04-28  8:26 ` [PATCH 2/3] md: replace ->openers with ->active Xiao Ni
2025-04-29  2:05   ` Yu Kuai
2025-04-28  8:26 ` [PATCH 3/3] md: call del_gendisk in sync way Xiao Ni
  -- strict thread matches above, loose matches on Subject: below --
2025-05-07  2:14 [PATCH 0/3] " Xiao Ni
2025-05-07  2:14 ` [PATCH 3/3] " Xiao Ni
2025-05-07  6:28   ` Yu Kuai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).