linux-raid.vger.kernel.org archive mirror
* [PATCH AUTOSEL 6.16 14/35] md/raid10: set chunk_sectors limit
       [not found] <20250803211736.3545028-1-sashal@kernel.org>
@ 2025-08-03 21:17 ` Sasha Levin
  2025-08-03 21:17 ` [PATCH AUTOSEL 6.16 18/35] md: call del_gendisk in control path Sasha Levin
  1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2025-08-03 21:17 UTC (permalink / raw)
  To: patches, stable
  Cc: John Garry, Nilay Shroff, Yu Kuai, Martin K . Petersen,
	Jens Axboe, Sasha Levin, song, linux-raid

From: John Garry <john.g.garry@oracle.com>

[ Upstream commit 7ef50c4c6a9c36fa3ea6f1681a80c0bf9a797345 ]

Same as done for raid0, set the chunk_sectors limit so that the atomic
write size limit is configured appropriately.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250711105258.3135198-5-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Based on my analysis, here's my assessment:

**Backport Status: YES**

This commit should be backported to stable kernel trees for the
following reasons:

1. **Bug Fix for Correctness**: The commit fixes a missing configuration
   that causes incorrect atomic write size limits in RAID10. The code
   shows that `lim.chunk_sectors` needs to be set for the atomic writes
   feature to work correctly. Without it, the
   `blk_stack_atomic_writes_chunk_sectors()` function in
   block/blk-settings.c has no chunk size information to work with,
   leading to incorrect atomic write size calculations (a simplified
   sketch of this clamping follows after the list and summary below).


3. **Small and Contained Change**: The fix is a single line addition:
  ```c
  lim.chunk_sectors = mddev->chunk_sectors;
  ```
  This is a minimal change that only affects the RAID10 queue limits
  setup.

4. **No Architectural Changes**: The change doesn't introduce new
   features or modify existing architecture. It simply ensures an
   existing field is properly initialized with the correct value that's
   already available (`mddev->chunk_sectors`).

5. **Prevents Data Corruption Risk**: Without a proper chunk_sectors
   limit, atomic writes may not work correctly on RAID10 arrays: writes
   that are supposed to be atomic could be split across chunk (and thus
   member device) boundaries, potentially causing data integrity issues.

6. **Affects Users**: Any system using RAID10 with applications that
   rely on atomic writes (databases, filesystems) would be affected by
   this missing configuration. The fix ensures that atomic write
   guarantees are properly honored.

7. **Low Risk**: The change has minimal risk of regression as it only
   sets a field that was previously uninitialized. The value being set
   (`mddev->chunk_sectors`) is already validated and used elsewhere in
   the RAID10 code.

The commit is clearly a bug fix that corrects an oversight: RAID10 was
not setting the chunk_sectors limit even though RAID0 had already been
fixed for the same issue. This makes it an excellent candidate for
stable backporting.
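
To illustrate point 1, here is a minimal sketch of the kind of clamping
that chunk_sectors enables. This is an illustration only, not the actual
logic in block/blk-settings.c:

```c
/*
 * Simplified illustration (not the kernel implementation): once
 * chunk_sectors is non-zero, the stacked atomic write limit can be
 * clamped so that a single atomic write never crosses a chunk
 * boundary, and therefore never straddles RAID10 member devices.
 */
static unsigned int stacked_atomic_write_max(unsigned int chunk_sectors,
                                             unsigned int atomic_max_sectors)
{
        if (chunk_sectors == 0)            /* unset: the pre-patch RAID10 case */
                return atomic_max_sectors; /* no chunk-based clamping possible */

        return chunk_sectors < atomic_max_sectors ?
               chunk_sectors : atomic_max_sectors;
}
```

With the one-line fix, `mddev->chunk_sectors` is visible to the block
layer's limit stacking, so the exported atomic write limit can respect
chunk boundaries.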

 drivers/md/raid10.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c9bd2005bfd0..a7594be2a704 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4012,6 +4012,7 @@ static int raid10_set_queue_limits(struct mddev *mddev)
 	md_init_stacking_limits(&lim);
 	lim.max_write_zeroes_sectors = 0;
 	lim.io_min = mddev->chunk_sectors << 9;
+	lim.chunk_sectors = mddev->chunk_sectors;
 	lim.io_opt = lim.io_min * raid10_nr_stripes(conf);
 	lim.features |= BLK_FEAT_ATOMIC_WRITES;
 	err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
-- 
2.39.5



* [PATCH AUTOSEL 6.16 18/35] md: call del_gendisk in control path
       [not found] <20250803211736.3545028-1-sashal@kernel.org>
  2025-08-03 21:17 ` [PATCH AUTOSEL 6.16 14/35] md/raid10: set chunk_sectors limit Sasha Levin
@ 2025-08-03 21:17 ` Sasha Levin
  1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2025-08-03 21:17 UTC (permalink / raw)
  To: patches, stable; +Cc: Xiao Ni, Yu Kuai, Sasha Levin, song, linux-raid

From: Xiao Ni <xni@redhat.com>

[ Upstream commit 9e59d609763f70a992a8f3808dabcce60f14eb5c ]

Now del_gendisk and put_disk are called asynchronously in workqueue work.
This asynchronous approach has the problem that the device node can still
exist for a short window after the mdadm --stop command returns. A udev
rule can then open this device node and re-create the struct mddev in the
kernel. So call del_gendisk in the control path and leave put_disk in
md_kobj_release to avoid a use-after-free of the gendisk.

del_gendisk can't be called while holding reconfig_mutex: del_gendisk
waits for all sysfs file accesses to finish, and sysfs file accesses wait
for reconfig_mutex, so calling it under the lock can deadlock. So call
del_gendisk after releasing reconfig_mutex.

But there is still a window in which sysfs can be accessed between
mddev_unlock and del_gendisk, so some actions (e.g. adding a disk or
changing the level) can happen and lead to unexpected results. MD_DELETED
is used to resolve this problem: it is set before releasing reconfig_mutex
and should be checked by sysfs accesses that need reconfig_mutex. For
sysfs accesses that don't need reconfig_mutex, del_gendisk waits for them
to finish.

This check is not needed in mddev_lock_nointr. There are ten places that
call it:
* Five of them are in dm-raid, which we don't need to care about;
MD_DELETED is only used for md raid.
* stop_sync_thread, md_do_sync and md_start_sync are related to sync
requests, and stopping an array has to wait for the sync thread to finish.
* md_ioctl: md_open is called before md_ioctl, so ->openers is
incremented and stopping the array will fail, so MD_DELETED doesn't need
to be checked here.
* md_set_readonly: it needs to call mddev_set_closing_and_sync_blockdev
when setting readonly or read_auto, so stopping the array will fail too
because MD_CLOSING is already set.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20250611073108.25463-2-xni@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

**Backport Status: YES**

This commit should be backported to stable kernel trees. Here's my
extensive analysis:

## 1. **Critical Race Condition Fix**

The commit addresses a serious race condition between `mdadm --stop` and
udev that can lead to unexpected behavior:

```c
// Before: del_gendisk was called asynchronously in md_kobj_release
- del_gendisk(mddev->gendisk);
  put_disk(mddev->gendisk);

// After: del_gendisk moved to control path in mddev_unlock
+ if (test_bit(MD_DELETED, &mddev->flags))
+     del_gendisk(mddev->gendisk);
```

The race window allows udev to reopen the device node after `mdadm
--stop` returns but before `del_gendisk()` completes, creating a new
`struct mddev` in the kernel unexpectedly.

## 2. **Use-After-Free Prevention**

The commit prevents a potential use-after-free (UAF) of gendisk by
properly sequencing the cleanup:
- `del_gendisk()` is called synchronously in the control path
- `put_disk()` remains in `md_kobj_release()` to avoid UAF
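
A hedged sketch of the resulting ordering, paraphrasing the diff below
(function bodies elided and simplified, not verbatim kernel code):

```c
/*
 * Rough teardown ordering after this patch:
 *
 *   do_md_stop()                  // runs with reconfig_mutex held
 *       md_clean(mddev);
 *       set_bit(MD_DELETED, &mddev->flags);
 *
 *   mddev_unlock()                // releases reconfig_mutex, then:
 *       if (test_bit(MD_DELETED, &mddev->flags))
 *               del_gendisk(mddev->gendisk);   // device node removed here,
 *                                              // before mdadm --stop returns
 *
 *   md_kobj_release()             // last kobject reference dropped, later
 *       put_disk(mddev->gendisk); // gendisk memory released only here
 */
```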

## 3. **Deadlock Avoidance**

The implementation carefully avoids deadlocks by calling `del_gendisk()`
after releasing `reconfig_mutex`:

```c
// In mddev_unlock():
+ /* Call del_gendisk after release reconfig_mutex to avoid
+  * deadlock (e.g. call del_gendisk under the lock and an
+  * access to sysfs files waits the lock)
+  */
+ if (test_bit(MD_DELETED, &mddev->flags))
+     del_gendisk(mddev->gendisk);
```

## 4. **MD_DELETED Flag Protection**

The commit introduces proper synchronization using the `MD_DELETED`
flag:

```c
// In mddev_lock():
+ if (!ret && test_bit(MD_DELETED, &mddev->flags)) {
+     ret = -ENODEV;
+     mutex_unlock(&mddev->reconfig_mutex);
+ }
```

This prevents sysfs operations from racing with array shutdown.
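
As a hedged illustration of how the check propagates to callers
(`example_store()` below is hypothetical, not a function from the patch),
a typical md sysfs store handler now fails cleanly once the array is
being deleted:

```c
/*
 * Hypothetical store handler, illustration only: with this patch,
 * mddev_lock() returns -ENODEV once MD_DELETED has been set in
 * do_md_stop(), so the handler bails out instead of operating on an
 * array that is being torn down.
 */
static ssize_t example_store(struct mddev *mddev, const char *buf, size_t len)
{
	int err = mddev_lock(mddev);	/* -ENODEV after do_md_stop() */

	if (err)
		return err;

	/* ... safe to modify mddev state under reconfig_mutex here ... */

	mddev_unlock(mddev);
	return len;
}
```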

## 5. **Minimal and Contained Changes**

The fix is:
- Relatively small (2 files changed, 47 insertions and 12 deletions per
  the diffstat)
- Confined to the md subsystem
- Does not introduce new features
- Does not make architectural changes
- Only modifies the cleanup path timing

## 6. **Clear Bug Fix Criteria**

This meets all stable backport criteria:
- Fixes a real bug that affects users (race condition causing unexpected
  device recreation)
- Prevents potential data corruption/loss scenarios
- Has minimal risk of regression (only affects shutdown path)
- Has been reviewed upstream (carries a Reviewed-by tag)
- Fixes behavior that violates expected semantics (device should be gone
  after `mdadm --stop`)

## 7. **Related UAF History**

Recent commit history shows ongoing UAF issues in md:
- `8542870237c3 md: fix mddev uaf while iterating all_mddevs list`
- This indicates the subsystem has had similar issues that required
  fixes

The synchronous `del_gendisk()` approach is the correct fix for this
class of problems, making this an important stability improvement for
the md subsystem.

 drivers/md/md.c | 33 +++++++++++++++++++++++----------
 drivers/md/md.h | 26 ++++++++++++++++++++++++--
 2 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0f03b21e66e4..7445e44eabff 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -636,9 +636,6 @@ static void __mddev_put(struct mddev *mddev)
 	    mddev->ctime || mddev->hold_active)
 		return;
 
-	/* Array is not configured at all, and not held active, so destroy it */
-	set_bit(MD_DELETED, &mddev->flags);
-
 	/*
 	 * Call queue_work inside the spinlock so that flush_workqueue() after
 	 * mddev_find will succeed in waiting for the work to be done.
@@ -873,6 +870,16 @@ void mddev_unlock(struct mddev *mddev)
 		kobject_del(&rdev->kobj);
 		export_rdev(rdev, mddev);
 	}
+
+	/* Call del_gendisk after release reconfig_mutex to avoid
+	 * deadlock (e.g. call del_gendisk under the lock and an
+	 * access to sysfs files waits the lock)
+	 * And MD_DELETED is only used for md raid which is set in
+	 * do_md_stop. dm raid only uses md_stop to stop. So dm raid
+	 * doesn't need to check MD_DELETED when getting reconfig lock
+	 */
+	if (test_bit(MD_DELETED, &mddev->flags))
+		del_gendisk(mddev->gendisk);
 }
 EXPORT_SYMBOL_GPL(mddev_unlock);
 
@@ -5774,19 +5781,30 @@ md_attr_store(struct kobject *kobj, struct attribute *attr,
 	struct md_sysfs_entry *entry = container_of(attr, struct md_sysfs_entry, attr);
 	struct mddev *mddev = container_of(kobj, struct mddev, kobj);
 	ssize_t rv;
+	struct kernfs_node *kn = NULL;
 
 	if (!entry->store)
 		return -EIO;
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
+
+	if (entry->store == array_state_store && cmd_match(page, "clear"))
+		kn = sysfs_break_active_protection(kobj, attr);
+
 	spin_lock(&all_mddevs_lock);
 	if (!mddev_get(mddev)) {
 		spin_unlock(&all_mddevs_lock);
+		if (kn)
+			sysfs_unbreak_active_protection(kn);
 		return -EBUSY;
 	}
 	spin_unlock(&all_mddevs_lock);
 	rv = entry->store(mddev, page, length);
 	mddev_put(mddev);
+
+	if (kn)
+		sysfs_unbreak_active_protection(kn);
+
 	return rv;
 }
 
@@ -5794,12 +5812,6 @@ static void md_kobj_release(struct kobject *ko)
 {
 	struct mddev *mddev = container_of(ko, struct mddev, kobj);
 
-	if (mddev->sysfs_state)
-		sysfs_put(mddev->sysfs_state);
-	if (mddev->sysfs_level)
-		sysfs_put(mddev->sysfs_level);
-
-	del_gendisk(mddev->gendisk);
 	put_disk(mddev->gendisk);
 }
 
@@ -6646,8 +6658,9 @@ static int do_md_stop(struct mddev *mddev, int mode)
 		mddev->bitmap_info.offset = 0;
 
 		export_array(mddev);
-
 		md_clean(mddev);
+		set_bit(MD_DELETED, &mddev->flags);
+
 		if (mddev->hold_active == UNTIL_STOP)
 			mddev->hold_active = 0;
 	}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index d45a9e6ead80..67b365621507 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -700,11 +700,26 @@ static inline bool reshape_interrupted(struct mddev *mddev)
 
 static inline int __must_check mddev_lock(struct mddev *mddev)
 {
-	return mutex_lock_interruptible(&mddev->reconfig_mutex);
+	int ret;
+
+	ret = mutex_lock_interruptible(&mddev->reconfig_mutex);
+
+	/* MD_DELETED is set in do_md_stop with reconfig_mutex.
+	 * So check it here.
+	 */
+	if (!ret && test_bit(MD_DELETED, &mddev->flags)) {
+		ret = -ENODEV;
+		mutex_unlock(&mddev->reconfig_mutex);
+	}
+
+	return ret;
 }
 
 /* Sometimes we need to take the lock in a situation where
  * failure due to interrupts is not acceptable.
+ * It doesn't need to check MD_DELETED here, the owner which
+ * holds the lock here can't be stopped. And all paths can't
+ * call this function after do_md_stop.
  */
 static inline void mddev_lock_nointr(struct mddev *mddev)
 {
@@ -713,7 +728,14 @@ static inline void mddev_lock_nointr(struct mddev *mddev)
 
 static inline int mddev_trylock(struct mddev *mddev)
 {
-	return mutex_trylock(&mddev->reconfig_mutex);
+	int ret;
+
+	ret = mutex_trylock(&mddev->reconfig_mutex);
+	if (!ret && test_bit(MD_DELETED, &mddev->flags)) {
+		ret = -ENODEV;
+		mutex_unlock(&mddev->reconfig_mutex);
+	}
+	return ret;
 }
 extern void mddev_unlock(struct mddev *mddev);
 
-- 
2.39.5



end of thread

Thread overview: 2+ messages
     [not found] <20250803211736.3545028-1-sashal@kernel.org>
2025-08-03 21:17 ` [PATCH AUTOSEL 6.16 14/35] md/raid10: set chunk_sectors limit Sasha Levin
2025-08-03 21:17 ` [PATCH AUTOSEL 6.16 18/35] md: call del_gendisk in control path Sasha Levin
