* [PATCH RFC 0/3] md/raid1: data corruption with serialization
@ 2026-02-04 14:58 Xiao Ni
2026-02-04 14:58 ` [PATCH RFC 1/3] md: add return value of mddev_create_serial_pool Xiao Ni
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Xiao Ni @ 2026-02-04 14:58 UTC (permalink / raw)
To: yukuai; +Cc: linux-raid
Data corruption can happen when using serialization for raid1.
Serialization is not enabled by default, but there also seems to be a
data corruption risk when serialization is disabled: the lower driver
can't guarantee which of two overlapping ios is written first, so the
member disks (for example nvme devices) can end up with different
data. This patch set still doesn't enable serialization by default.
Xiao Ni (3):
md: add return value of mddev_create_serial_pool
md/raid1: fix data corruption by moving serialization to mddev level
md/raid1: fix incorrect sector range in serialization
drivers/md/md-bitmap.c | 28 +++++--
drivers/md/md.c | 171 ++++++++++++++---------------------------
drivers/md/md.h | 30 ++++----
drivers/md/raid1.c | 47 ++++++-----
4 files changed, 115 insertions(+), 161 deletions(-)
--
2.50.1 (Apple Git-155)
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH RFC 1/3] md: add return value of mddev_create_serial_pool
2026-02-04 14:58 [PATCH RFC 0/3] md/raid1: data corruption with serialization Xiao Ni
@ 2026-02-04 14:58 ` Xiao Ni
2026-02-04 14:58 ` [PATCH RFC 2/3] md/raid1: fix data corruption by moving serialization to mddev level Xiao Ni
` (2 subsequent siblings)
3 siblings, 0 replies; 11+ messages in thread
From: Xiao Ni @ 2026-02-04 14:58 UTC (permalink / raw)
To: yukuai; +Cc: linux-raid
Add a return value to mddev_create_serial_pool() so callers can detect
allocation failures. This prepares for the next patch.
Signed-off-by: Xiao Ni <xni@redhat.com>
---
drivers/md/md-bitmap.c | 28 ++++++++++++++++++++++------
drivers/md/md.c | 30 +++++++++++++++++++++---------
drivers/md/md.h | 2 +-
3 files changed, 44 insertions(+), 16 deletions(-)
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 84b7e2af6dba..1c82961833ba 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -2213,8 +2213,11 @@ static int bitmap_load(struct mddev *mddev)
if (!bitmap)
goto out;
- rdev_for_each(rdev, mddev)
- mddev_create_serial_pool(mddev, rdev);
+ rdev_for_each(rdev, mddev) {
+ err = mddev_create_serial_pool(mddev, rdev);
+ if (err)
+ goto out;
+ }
if (mddev_is_clustered(mddev))
mddev->cluster_ops->load_bitmaps(mddev, mddev->bitmap_info.nodes);
@@ -2253,9 +2256,15 @@ static int bitmap_load(struct mddev *mddev)
bitmap_update_sb(bitmap);
- if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags))
+ if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags)) {
err = -EIO;
+ goto out;
+ }
+
+ return err;
+
out:
+ mddev_destroy_serial_pool(mddev, NULL);
return err;
}
@@ -2806,19 +2815,26 @@ backlog_store(struct mddev *mddev, const char *buf, size_t len)
return -EINVAL;
}
- mddev->bitmap_info.max_write_behind = backlog;
if (!backlog && mddev->serial_info_pool) {
/* serial_info_pool is not needed if backlog is zero */
if (!mddev->serialize_policy)
mddev_destroy_serial_pool(mddev, NULL);
} else if (backlog && !mddev->serial_info_pool) {
/* serial_info_pool is needed since backlog is not zero */
- rdev_for_each(rdev, mddev)
- mddev_create_serial_pool(mddev, rdev);
+ rdev_for_each(rdev, mddev) {
+ rv = mddev_create_serial_pool(mddev, rdev);
+ if (rv) {
+ mddev_destroy_serial_pool(mddev, NULL);
+ goto unlock;
+ }
+ }
}
+
+ mddev->bitmap_info.max_write_behind = backlog;
if (old_mwb != backlog)
bitmap_update_sb(mddev->bitmap);
+unlock:
mddev_unlock_and_resume(mddev);
return len;
}
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 6d73f6e196a9..115df70354ed 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -232,20 +232,20 @@ static int rdev_need_serial(struct md_rdev *rdev)
* 1. rdev is the first device which return true from rdev_enable_serial.
* 2. rdev is NULL, means we want to enable serialization for all rdevs.
*/
-void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev)
+int mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev)
{
int ret = 0;
if (rdev && !rdev_need_serial(rdev) &&
!test_bit(CollisionCheck, &rdev->flags))
- return;
+ return ret;
if (!rdev)
ret = rdevs_init_serial(mddev);
else
ret = rdev_init_serial(rdev);
if (ret)
- return;
+ return ret;
if (mddev->serial_info_pool == NULL) {
/*
@@ -258,8 +258,11 @@ void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev)
if (!mddev->serial_info_pool) {
rdevs_uninit_serial(mddev);
pr_err("can't alloc memory pool for serialization\n");
+ ret = -ENOMEM;
}
}
+
+ return ret;
}
/*
@@ -2600,8 +2603,13 @@ static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev)
rdev->mddev = mddev;
pr_debug("md: bind<%s>\n", b);
- if (mddev->raid_disks)
- mddev_create_serial_pool(mddev, rdev);
+ if (mddev->raid_disks) {
+ err = mddev_create_serial_pool(mddev, rdev);
+ if (err) {
+ pr_err("failed to create serial pool\n");
+ return err;
+ }
+ }
if ((err = kobject_add(&rdev->kobj, &mddev->kobj, "dev-%s", b)))
goto fail;
@@ -3114,7 +3122,9 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
}
} else if (cmd_match(buf, "writemostly")) {
set_bit(WriteMostly, &rdev->flags);
- mddev_create_serial_pool(rdev->mddev, rdev);
+ err = mddev_create_serial_pool(rdev->mddev, rdev);
+ if (err)
+ return err;
need_update_sb = true;
err = 0;
} else if (cmd_match(buf, "-writemostly")) {
@@ -5924,9 +5934,11 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len)
goto unlock;
}
- if (value)
- mddev_create_serial_pool(mddev, NULL);
- else
+ if (value) {
+ err = mddev_create_serial_pool(mddev, NULL);
+ if (err)
+ goto unlock;
+ } else
mddev_destroy_serial_pool(mddev, NULL);
mddev->serialize_policy = value;
unlock:
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 6985f2829bbd..3c4edc6ed50e 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -956,7 +956,7 @@ extern void md_frozen_sync_thread(struct mddev *mddev);
extern void md_unfrozen_sync_thread(struct mddev *mddev);
extern void md_update_sb(struct mddev *mddev, int force);
-extern void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev);
+extern int mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev);
extern void mddev_destroy_serial_pool(struct mddev *mddev,
struct md_rdev *rdev);
struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr);
--
2.50.1 (Apple Git-155)
* [PATCH RFC 2/3] md/raid1: fix data corruption by moving serialization to mddev level
2026-02-04 14:58 [PATCH RFC 0/3] md/raid1: data corruption with serialization Xiao Ni
2026-02-04 14:58 ` [PATCH RFC 1/3] md: add return value of mddev_create_serial_pool Xiao Ni
@ 2026-02-04 14:58 ` Xiao Ni
2026-02-04 14:58 ` [PATCH RFC 3/3] md/raid1: fix incorrect sector range in serialization Xiao Ni
2026-02-04 15:58 ` [PATCH RFC 0/3] md/raid1: data corruption with serialization Yu Kuai
3 siblings, 0 replies; 11+ messages in thread
From: Xiao Ni @ 2026-02-04 14:58 UTC (permalink / raw)
To: yukuai; +Cc: linux-raid
Overlapping io requests are currently handled by per-rdev
serialization. An overlapping io request has to wait and is woken up
when the io it overlaps with returns. But the order in which the woken
waiters reacquire the lock may differ between member disks.
For example:
bio1 (100,200)
bio2 (150,200)
bio3 (160,200)
bio1 is writing and bio2/bio3 are waiting. When bio1 finishes, it
wakes up bio2/bio3. On disk0, bio2 gets the lock and is submitted
first; on disk1, bio3 gets the lock and is submitted first. In the
end, the member disks hold different data. Even if the waiters get the
locks in the same order, the lower drivers don't guarantee the order
in which the bios land on the disks, so there is still a risk that
member disks end up with different data.
So per-rdev serialization cannot guarantee ordering across different
member disks when multiple requests are queued: each disk's wake_up()
independently wakes its waiters, and the order in which they reacquire
the lock is undefined.
Fix by moving serialization from per-rdev to per-mddev level:
- Change serial data structure from rdev->serial to mddev->serial
- Call wait_for_serialization() once per r1_bio at mddev level
before submitting to any disk
- Call remove_serial() once in r1_bio_write_done() after all disks
complete, instead of per-rdev in raid1_end_write_request()
This ensures:
- All member disks see identical write ordering
- Only one wait queue exists for each sector range
- Wake-up ordering issues cannot cause cross-disk inconsistency
Signed-off-by: Xiao Ni <xni@redhat.com>
---
drivers/md/md.c | 143 ++++++++++++---------------------------------
drivers/md/md.h | 28 ++++-----
drivers/md/raid1.c | 45 +++++++-------
3 files changed, 71 insertions(+), 145 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 115df70354ed..c113af3865ac 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -152,69 +152,37 @@ static int sync_io_depth(struct mddev *mddev)
mddev->sync_io_depth : sysctl_sync_io_depth;
}
-static void rdev_uninit_serial(struct md_rdev *rdev)
+static void mddev_uninit_serial(struct mddev *mddev)
{
- if (!test_and_clear_bit(CollisionCheck, &rdev->flags))
- return;
-
- kvfree(rdev->serial);
- rdev->serial = NULL;
+ kvfree(mddev->serial);
+ mddev->serial = NULL;
}
-static void rdevs_uninit_serial(struct mddev *mddev)
-{
- struct md_rdev *rdev;
-
- rdev_for_each(rdev, mddev)
- rdev_uninit_serial(rdev);
-}
-
-static int rdev_init_serial(struct md_rdev *rdev)
+static int mddev_init_serial(struct mddev *mddev)
{
/* serial_nums equals with BARRIER_BUCKETS_NR */
int i, serial_nums = 1 << ((PAGE_SHIFT - ilog2(sizeof(atomic_t))));
- struct serial_in_rdev *serial = NULL;
-
- if (test_bit(CollisionCheck, &rdev->flags))
- return 0;
+ struct serial *serial = NULL;
- serial = kvmalloc(sizeof(struct serial_in_rdev) * serial_nums,
+ serial = kvmalloc_array(serial_nums, sizeof(struct serial),
GFP_KERNEL);
- if (!serial)
+ if (!serial) {
+ pr_err("failed to alloc serial\n");
return -ENOMEM;
+ }
for (i = 0; i < serial_nums; i++) {
- struct serial_in_rdev *serial_tmp = &serial[i];
+ struct serial *serial_tmp = &serial[i];
spin_lock_init(&serial_tmp->serial_lock);
serial_tmp->serial_rb = RB_ROOT_CACHED;
init_waitqueue_head(&serial_tmp->serial_io_wait);
}
- rdev->serial = serial;
- set_bit(CollisionCheck, &rdev->flags);
-
+ mddev->serial = serial;
return 0;
}
-static int rdevs_init_serial(struct mddev *mddev)
-{
- struct md_rdev *rdev;
- int ret = 0;
-
- rdev_for_each(rdev, mddev) {
- ret = rdev_init_serial(rdev);
- if (ret)
- break;
- }
-
- /* Free all resources if pool is not existed */
- if (ret && !mddev->serial_info_pool)
- rdevs_uninit_serial(mddev);
-
- return ret;
-}
-
/*
* rdev needs to enable serial stuffs if it meets the conditions:
* 1. it is multi-queue device flaged with writemostly.
@@ -236,72 +204,40 @@ int mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev)
{
int ret = 0;
- if (rdev && !rdev_need_serial(rdev) &&
- !test_bit(CollisionCheck, &rdev->flags))
+ if (mddev->serial_info_pool)
return ret;
- if (!rdev)
- ret = rdevs_init_serial(mddev);
- else
- ret = rdev_init_serial(rdev);
- if (ret)
+ /* return success if rdev doesn't support serial */
+ if (rdev && !rdev_need_serial(rdev))
return ret;
- if (mddev->serial_info_pool == NULL) {
- /*
- * already in memalloc noio context by
- * mddev_suspend()
- */
- mddev->serial_info_pool =
- mempool_create_kmalloc_pool(NR_SERIAL_INFOS,
- sizeof(struct serial_info));
- if (!mddev->serial_info_pool) {
- rdevs_uninit_serial(mddev);
- pr_err("can't alloc memory pool for serialization\n");
- ret = -ENOMEM;
- }
+ ret = mddev_init_serial(mddev);
+ if (ret)
+ return ret;
+ /*
+ * already in memalloc noio context by
+ * mddev_suspend()
+ */
+ mddev->serial_info_pool = mempool_create_kmalloc_pool(
+ NR_SERIAL_INFOS, sizeof(struct serial_info));
+ if (!mddev->serial_info_pool) {
+ mddev_uninit_serial(mddev);
+ pr_err("can't alloc memory pool for serialization\n");
+ ret = -ENOMEM;
}
+ set_bit(MD_SERIAL, &mddev->flags);
return ret;
}
-/*
- * Free resource from rdev(s), and destroy serial_info_pool under conditions:
- * 1. rdev is the last device flaged with CollisionCheck.
- * 2. when bitmap is destroyed while policy is not enabled.
- * 3. for disable policy, the pool is destroyed only when no rdev needs it.
- */
void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev)
{
- if (rdev && !test_bit(CollisionCheck, &rdev->flags))
+ if (!mddev->serial_info_pool)
return;
-
- if (mddev->serial_info_pool) {
- struct md_rdev *temp;
- int num = 0; /* used to track if other rdevs need the pool */
-
- rdev_for_each(temp, mddev) {
- if (!rdev) {
- if (!mddev->serialize_policy ||
- !rdev_need_serial(temp))
- rdev_uninit_serial(temp);
- else
- num++;
- } else if (temp != rdev &&
- test_bit(CollisionCheck, &temp->flags))
- num++;
- }
-
- if (rdev)
- rdev_uninit_serial(rdev);
-
- if (num)
- pr_info("The mempool could be used by other devices\n");
- else {
- mempool_destroy(mddev->serial_info_pool);
- mddev->serial_info_pool = NULL;
- }
- }
+ mddev_uninit_serial(mddev);
+ mempool_destroy(mddev->serial_info_pool);
+ mddev->serial_info_pool = NULL;
+ clear_bit(MD_SERIAL, &mddev->flags);
}
static struct ctl_table_header *raid_table_header;
@@ -6633,18 +6569,13 @@ int md_run(struct mddev *mddev)
bool create_pool = false;
rdev_for_each(rdev, mddev) {
- if (test_bit(WriteMostly, &rdev->flags) &&
- rdev_init_serial(rdev))
+ if (test_bit(WriteMostly, &rdev->flags))
create_pool = true;
}
- if (create_pool && mddev->serial_info_pool == NULL) {
- mddev->serial_info_pool =
- mempool_create_kmalloc_pool(NR_SERIAL_INFOS,
- sizeof(struct serial_info));
- if (!mddev->serial_info_pool) {
- err = -ENOMEM;
+ if (create_pool) {
+ err = mddev_create_serial_pool(mddev, NULL);
+ if (err)
goto bitmap_abort;
- }
}
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 3c4edc6ed50e..7fc6d03b0c33 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -116,15 +116,6 @@ enum sync_action {
NR_SYNC_ACTIONS,
};
-/*
- * The struct embedded in rdev is used to serialize IO.
- */
-struct serial_in_rdev {
- struct rb_root_cached serial_rb;
- spinlock_t serial_lock;
- wait_queue_head_t serial_io_wait;
-};
-
/*
* MD's 'extended' device
*/
@@ -204,8 +195,6 @@ struct md_rdev {
* in superblock.
*/
- struct serial_in_rdev *serial; /* used for raid1 io serialization */
-
struct kernfs_node *sysfs_state; /* handle for 'state'
* sysfs entry */
/* handle for 'unacknowledged_bad_blocks' sysfs dentry */
@@ -286,10 +275,6 @@ enum flag_bits {
* it didn't fail, so don't use FailFast
* any more for metadata
*/
- CollisionCheck, /*
- * check if there is collision between raid1
- * serial bios.
- */
Nonrot, /* non-rotational device (SSD) */
};
@@ -356,6 +341,7 @@ enum mddev_flags {
MD_BROKEN,
MD_DO_DELETE,
MD_DELETED,
+ MD_SERIAL,
};
enum mddev_sb_flags {
@@ -389,6 +375,15 @@ enum {
MD_RESYNC_ACTIVE = 3,
};
+/*
+ * The struct embedded is used to serialize IO.
+ */
+struct serial {
+ struct rb_root_cached serial_rb;
+ spinlock_t serial_lock;
+ wait_queue_head_t serial_io_wait;
+};
+
struct mddev {
void *private;
struct md_personality *pers;
@@ -607,7 +602,8 @@ struct mddev {
struct bio_set io_clone_set;
struct work_struct event_work; /* used by dm to report failure event */
- mempool_t *serial_info_pool;
+ mempool_t *serial_info_pool;
+ struct serial *serial; /* used for raid1 io serialization */
void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
struct md_cluster_info *cluster_info;
struct md_cluster_operations *cluster_ops;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 57d50465eed1..029aa458ffb2 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -56,14 +56,14 @@ static void raid1_free(struct mddev *mddev, void *priv);
INTERVAL_TREE_DEFINE(struct serial_info, node, sector_t, _subtree_last,
START, LAST, static inline, raid1_rb);
-static int check_and_add_serial(struct md_rdev *rdev, struct r1bio *r1_bio,
+static int check_and_add_serial(struct mddev *mddev, struct r1bio *r1_bio,
struct serial_info *si, int idx)
{
unsigned long flags;
int ret = 0;
sector_t lo = r1_bio->sector;
sector_t hi = lo + r1_bio->sectors;
- struct serial_in_rdev *serial = &rdev->serial[idx];
+ struct serial *serial = &mddev->serial[idx];
spin_lock_irqsave(&serial->serial_lock, flags);
/* collision happened */
@@ -79,28 +79,33 @@ static int check_and_add_serial(struct md_rdev *rdev, struct r1bio *r1_bio,
return ret;
}
-static void wait_for_serialization(struct md_rdev *rdev, struct r1bio *r1_bio)
+static void wait_for_serialization(struct mddev *mddev, struct r1bio *r1_bio)
{
- struct mddev *mddev = rdev->mddev;
struct serial_info *si;
int idx = sector_to_idx(r1_bio->sector);
- struct serial_in_rdev *serial = &rdev->serial[idx];
+ struct serial *serial = &mddev->serial[idx];
- if (WARN_ON(!mddev->serial_info_pool))
+ if (!test_bit(MD_SERIAL, &mddev->flags))
return;
+
si = mempool_alloc(mddev->serial_info_pool, GFP_NOIO);
wait_event(serial->serial_io_wait,
- check_and_add_serial(rdev, r1_bio, si, idx) == 0);
+ check_and_add_serial(mddev, r1_bio, si, idx) == 0);
}
-static void remove_serial(struct md_rdev *rdev, sector_t lo, sector_t hi)
+static void remove_serial(struct r1bio *r1_bio)
{
struct serial_info *si;
unsigned long flags;
int found = 0;
- struct mddev *mddev = rdev->mddev;
+ struct mddev *mddev = r1_bio->mddev;
+ sector_t lo = r1_bio->sector;
+ sector_t hi = r1_bio->sector + r1_bio->sectors;
int idx = sector_to_idx(lo);
- struct serial_in_rdev *serial = &rdev->serial[idx];
+ struct serial *serial = &mddev->serial[idx];
+
+ if (!test_bit(MD_SERIAL, &mddev->flags))
+ return;
spin_lock_irqsave(&serial->serial_lock, flags);
for (si = raid1_rb_iter_first(&serial->serial_rb, lo, hi);
@@ -433,6 +438,8 @@ static void r1_bio_write_done(struct r1bio *r1_bio)
if (!atomic_dec_and_test(&r1_bio->remaining))
return;
+ remove_serial(r1_bio);
+
if (test_bit(R1BIO_WriteError, &r1_bio->state))
reschedule_retry(r1_bio);
else {
@@ -452,8 +459,6 @@ static void raid1_end_write_request(struct bio *bio)
struct bio *to_put = NULL;
int mirror = find_bio_disk(r1_bio, bio);
struct md_rdev *rdev = conf->mirrors[mirror].rdev;
- sector_t lo = r1_bio->sector;
- sector_t hi = r1_bio->sector + r1_bio->sectors;
bool ignore_error = !raid1_should_handle_error(bio) ||
(bio->bi_status && bio_op(bio) == REQ_OP_DISCARD);
@@ -518,8 +523,6 @@ static void raid1_end_write_request(struct bio *bio)
}
if (behind) {
- if (test_bit(CollisionCheck, &rdev->flags))
- remove_serial(rdev, lo, hi);
if (test_bit(WriteMostly, &rdev->flags))
atomic_dec(&r1_bio->behind_remaining);
@@ -542,8 +545,8 @@ static void raid1_end_write_request(struct bio *bio)
call_bio_endio(r1_bio);
}
}
- } else if (rdev->mddev->serialize_policy)
- remove_serial(rdev, lo, hi);
+ }
+
if (r1_bio->bios[mirror] == NULL)
rdev_dec_pending(rdev, conf->mddev);
@@ -1620,6 +1623,8 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
first_clone = 1;
+ wait_for_serialization(mddev, r1_bio);
+
for (i = 0; i < disks; i++) {
struct bio *mbio = NULL;
struct md_rdev *rdev = conf->mirrors[i].rdev;
@@ -1636,18 +1641,12 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
mbio = bio_alloc_clone(rdev->bdev,
r1_bio->behind_master_bio,
GFP_NOIO, &mddev->bio_set);
- if (test_bit(CollisionCheck, &rdev->flags))
- wait_for_serialization(rdev, r1_bio);
if (test_bit(WriteMostly, &rdev->flags))
atomic_inc(&r1_bio->behind_remaining);
- } else {
+ } else
mbio = bio_alloc_clone(rdev->bdev, bio, GFP_NOIO,
&mddev->bio_set);
- if (mddev->serialize_policy)
- wait_for_serialization(rdev, r1_bio);
- }
-
mbio->bi_opf &= ~REQ_NOWAIT;
r1_bio->bios[i] = mbio;
--
2.50.1 (Apple Git-155)
* [PATCH RFC 3/3] md/raid1: fix incorrect sector range in serialization
2026-02-04 14:58 [PATCH RFC 0/3] md/raid1: data corruption with serialization Xiao Ni
2026-02-04 14:58 ` [PATCH RFC 1/3] md: add return value of mddev_create_serial_pool Xiao Ni
2026-02-04 14:58 ` [PATCH RFC 2/3] md/raid1: fix data corruption by moving serialization to mddev level Xiao Ni
@ 2026-02-04 14:58 ` Xiao Ni
2026-02-04 15:58 ` [PATCH RFC 0/3] md/raid1: data corruption with serialization Yu Kuai
3 siblings, 0 replies; 11+ messages in thread
From: Xiao Ni @ 2026-02-04 14:58 UTC (permalink / raw)
To: yukuai; +Cc: linux-raid
Fix off-by-one error in sector range calculation for serialization.
The range should be [lo, hi] inclusive, where hi = lo + sectors - 1,
not lo + sectors.
The current implementation causes false collision detection for
consecutive writes. For example, when writing:
- R1 to sectors 0-1023 (hi = 1024)
- R2 to sectors 1024-2047 (lo = 1024)
R1's hi equals R2's lo, so raid1_rb_iter_first() reports a collision
and triggers unnecessary serialization, even though these requests
don't actually overlap.
Signed-off-by: Xiao Ni <xni@redhat.com>
---
drivers/md/raid1.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 029aa458ffb2..e3d9ba71796f 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -62,7 +62,7 @@ static int check_and_add_serial(struct mddev *mddev, struct r1bio *r1_bio,
unsigned long flags;
int ret = 0;
sector_t lo = r1_bio->sector;
- sector_t hi = lo + r1_bio->sectors;
+ sector_t hi = lo + r1_bio->sectors - 1;
struct serial *serial = &mddev->serial[idx];
spin_lock_irqsave(&serial->serial_lock, flags);
@@ -100,7 +100,7 @@ static void remove_serial(struct r1bio *r1_bio)
int found = 0;
struct mddev *mddev = r1_bio->mddev;
sector_t lo = r1_bio->sector;
- sector_t hi = r1_bio->sector + r1_bio->sectors;
+ sector_t hi = r1_bio->sector + r1_bio->sectors - 1;
int idx = sector_to_idx(lo);
struct serial *serial = &mddev->serial[idx];
--
2.50.1 (Apple Git-155)
* Re: [PATCH RFC 0/3] md/raid1: data corruption with serialization
2026-02-04 14:58 [PATCH RFC 0/3] md/raid1: data corruption with serialization Xiao Ni
` (2 preceding siblings ...)
2026-02-04 14:58 ` [PATCH RFC 3/3] md/raid1: fix incorrect sector range in serialization Xiao Ni
@ 2026-02-04 15:58 ` Yu Kuai
2026-02-05 0:34 ` Xiao Ni
3 siblings, 1 reply; 11+ messages in thread
From: Yu Kuai @ 2026-02-04 15:58 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid, yukuai
Hi,
On 2026/2/4 22:58, Xiao Ni wrote:
> Data corruption can happen when using serialization for raid1.
> Serialization is not enabled by default, but there also seems to be a
> data corruption risk when serialization is disabled: the lower driver
> can't guarantee which of two overlapping ios is written first, so the
> member disks (for example nvme devices) can end up with different
> data. This patch set still doesn't enable serialization by default.
An important point is that if there is a filesystem on top of
raid1/10/5, there is no need to consider overlapping bios. The only
case that needs serialization is WriteMostly, where a bio can be
returned to the user before it has completed on the slow disks. That
is why serialization is per-rdev.
We definitely will not enable serialization by default, because it
causes a huge performance degradation. And I don't think we need to
handle the case where a user manages data on the raw disk and writes
to the same area concurrently. Finally, if there is still a problem in
the WriteMostly case, please rebase and keep serialization per-rdev.
>
> Xiao Ni (3):
> md: add return value of mddev_create_serial_pool
> md/raid1: fix data corruption by moving serialization to mddev level
> md/raid1: fix incorrect sector range in serialization
>
> drivers/md/md-bitmap.c | 28 +++++--
> drivers/md/md.c | 171 ++++++++++++++---------------------------
> drivers/md/md.h | 30 ++++----
> drivers/md/raid1.c | 47 ++++++-----
> 4 files changed, 115 insertions(+), 161 deletions(-)
>
--
Thanks,
Kuai
* Re: [PATCH RFC 0/3] md/raid1: data corruption with serialization
2026-02-04 15:58 ` [PATCH RFC 0/3] md/raid1: data corruption with serialization Yu Kuai
@ 2026-02-05 0:34 ` Xiao Ni
2026-02-05 1:06 ` Xiao Ni
2026-02-05 1:44 ` Yu Kuai
0 siblings, 2 replies; 11+ messages in thread
From: Xiao Ni @ 2026-02-05 0:34 UTC (permalink / raw)
To: yukuai; +Cc: linux-raid, Guoqing Jiang
On Wed, Feb 4, 2026 at 11:58 PM Yu Kuai <yukuai@fnnas.com> wrote:
>
> Hi,
>
> On 2026/2/4 22:58, Xiao Ni wrote:
> > Data corruption can happen when using serialization for raid1.
> > Serialization is not enabled by default, but there also seems to be a
> > data corruption risk when serialization is disabled: the lower driver
> > can't guarantee which of two overlapping ios is written first, so the
> > member disks (for example nvme devices) can end up with different
> > data. This patch set still doesn't enable serialization by default.
>
> An important point is that if there is a filesystem on top of
> raid1/10/5, there is no need to consider overlapping bios. The only
> case that needs serialization is WriteMostly, where a bio can be
> returned to the user before it has completed on the slow disks. That
> is why serialization is per-rdev.
I know why the WriteMostly case needs serialization. At first I asked
the same question: do we need to handle overlap when a filesystem sits
on top of the raid? The answer I got (from an AI) was that a
filesystem can submit overlapping bios simultaneously; maybe that is
why serialize_policy was introduced? And a database is another case
that can write to raid1 directly, without a filesystem. So I started
this job. If we don't need to consider overlap, why do we need
serialize_policy at all? cc Guoqing, the author of serialize_policy.
>
> We definitely will not enable serialization by default, because it
> causes a huge performance degradation.
Yes, this is the reason why I don't enable it.
> And I don't think we need to handle the case where a user manages
> data on the raw disk and writes to the same area concurrently.
> Finally, if there is still a problem in the WriteMostly case, please
> rebase and keep serialization per-rdev.
The data corruption is:
R1 issued (split into LO and HI)
(LO) (HI)
R1 write to high-speed disk begins R1 write to high-speed
disk waits (waits while LO is being written due to an incorrect sector
range)
R1 write to low-speed disk begins
R1 write to high-speed disk completes
High-speed disk wakes up R1 write to
high-speed disk begins
R1 write to low-speed disk waits (waits while LO is being written due
to an incorrect sector range)
R1 write to high-speed disk completes
R1 completion notification
R2 issued (split into LO and HI)
R2 write to high-speed disk begins R2 write to high-speed
disk waits (waits while LO is being written due to an incorrect sector
range)
R2 write to low-speed disk waits (due to R1 being written)
R2 write to high-speed disk completes
High-speed disk wakes up R2 write to
high-speed disk begins
R2 write to low-speed disk Wait (R1 is still waiting)
R2 write completed on high-speed disk
R2 completion notification
R1 write completed on slow disk
Slow disk wake-up
R2 write started on slow disk
R2 write completed on slow disk
Slow disk wake-up R2 write
started on slow disk *R2 will be written before R1 here.
R2 write completed on slow disk
Slow disk wake-up
R1 write started on slow disk
Patch03 can fix this problem, but the root cause is the order in which
waiters get the lock after being woken up. So besides patch03, it is
necessary to ensure that the order of processing io matches the order
in which the io was added. If we don't need to consider the upper
layer submitting overlapping bios simultaneously, I'll fix this
per-rdev rather than per-mddev.
Best Regards
Xiao
>
> >
> > Xiao Ni (3):
> > md: add return value of mddev_create_serial_pool
> > md/raid1: fix data corruption by moving serialization to mddev level
> > md/raid1: fix incorrect sector range in serialization
> >
> > drivers/md/md-bitmap.c | 28 +++++--
> > drivers/md/md.c | 171 ++++++++++++++---------------------------
> > drivers/md/md.h | 30 ++++----
> > drivers/md/raid1.c | 47 ++++++-----
> > 4 files changed, 115 insertions(+), 161 deletions(-)
> >
> --
> Thanks,
> Kuai
>
* Re: [PATCH RFC 0/3] md/raid1: data corruption with serialization
2026-02-05 0:34 ` Xiao Ni
@ 2026-02-05 1:06 ` Xiao Ni
2026-02-05 1:48 ` Yu Kuai
2026-02-05 1:44 ` Yu Kuai
1 sibling, 1 reply; 11+ messages in thread
From: Xiao Ni @ 2026-02-05 1:06 UTC (permalink / raw)
To: yukuai; +Cc: linux-raid, Guoqing Jiang
On Thu, Feb 5, 2026 at 8:34 AM Xiao Ni <xni@redhat.com> wrote:
>
> On Wed, Feb 4, 2026 at 11:58 PM Yu Kuai <yukuai@fnnas.com> wrote:
> >
> > Hi,
> >
> > On 2026/2/4 22:58, Xiao Ni wrote:
> > > Data corruption can happen when using serialization for raid1.
> > > Serialization is not enabled by default, but there also seems to be a
> > > data corruption risk when serialization is disabled: the lower driver
> > > can't guarantee which of two overlapping ios is written first, so the
> > > member disks (for example nvme devices) can end up with different
> > > data. This patch set still doesn't enable serialization by default.
> >
> > An important point is that if there is a filesystem on top of
> > raid1/10/5, there is no need to consider overlapping bios. The only
> > case that needs serialization is WriteMostly, where a bio can be
> > returned to the user before it has completed on the slow disks.
> > That is why serialization is per-rdev.
>
> I know why the WriteMostly case needs serialization. At first I asked
> the same question: do we need to handle overlap when a filesystem sits
> on top of the raid? The answer I got (from an AI) was that a
> filesystem can submit overlapping bios simultaneously; maybe that is
> why serialize_policy was introduced? And a database is another case
> that can write to raid1 directly, without a filesystem. So I started
> this job. If we don't need to consider overlap, why do we need
> serialize_policy at all? cc Guoqing, the author of serialize_policy.
>
> >
> > We definitely will not enable serialization by default, because it
> > causes a huge performance degradation.
>
> Yes, this is the reason why I don't enable it.
>
> > And I don't think we need to handle the case where a user manages
> > data on the raw disk and writes to the same area concurrently.
> > Finally, if there is still a problem in the WriteMostly case, please
> > rebase and keep serialization per-rdev.
>
> The data corruption is:
>
> R1 issued (split into LO and HI)
> (LO) (HI)
> R1 write to high-speed disk begins R1 write to high-speed
> disk waits (waits while LO is being written due to an incorrect sector
> range)
> R1 write to low-speed disk begins
> R1 write to high-speed disk completes
> High-speed disk wakes up R1 write to
> high-speed disk begins
>
> R1 write to low-speed disk waits (waits while LO is being written due
> to an incorrect sector range)
>
> R1 write to high-speed disk completes
> R1 completion notification
>
> R2 issued (split into LO and HI)
> R2 write to high-speed disk begins R2 write to high-speed
> disk waits (waits while LO is being written due to an incorrect sector
> range)
> R2 write to low-speed disk waits (due to R1 being written)
> R2 write to high-speed disk completes
> High-speed disk wakes up R2 write to
> high-speed disk begins
>
> R2 write to low-speed disk Wait (R1 is still waiting)
>
> R2 write completed on high-speed disk
> R2 completion notification
>
> R1 write completed on slow disk
> Slow disk wake-up
> R2 write started on slow disk
> R2 write completed on slow disk
> Slow disk wake-up R2 write
> started on slow disk *R2 will be written before R1 here.
>
> R2 write completed on slow disk
>
> Slow disk wake-up
>
> R1 write started on slow disk
Sorry, the formatting was a mess. Let me redo it like this:
Setup: RAID1 with one fast disk and one slow disk. Two requests R1 and
R2, each split into LO and HI parts
Step 1-10: Request R1 Processing
- R1-LO: writes to fast disk → completes
- R1-LO: writes to slow disk → starts (slow)
- R1-HI: incorrectly waits for R1-LO (wrong sector range check)
- R1-HI: R1-LO finishes and wakes up serial_io_wait. R1-HI finally
writes to fast disk → completes
- R1 completion notified (but R1-LO is still in flight to the slow disk,
and R1-HI is still pending on the slow disk)
Step 11-19: Request R2 Processing
- R2-LO: writes to fast disk → completes
- R2-LO: waits for R1-LO on slow disk (correct behavior)
- R2-HI: incorrectly waits for R2-LO (wrong sector range check)
- R2-HI: writes to fast disk → completes
- R2 completion notified
Now, R1-LO is writing to the slow disk; R2-LO, R1-HI, and R2-HI are all
pending on the slow disk.
Step 20-28: Slow Disk Completes (OUT OF ORDER!)
- R1-LO completes on slow disk
- R2-LO starts and completes on slow disk
- R2-HI starts and completes on slow disk ← R2 BEFORE R1!
- R1-HI finally starts and completes on slow disk
RESULT:
=======
Fast disk write order: R1-LO → R1-HI → R2-LO → R2-HI (CORRECT)
Slow disk write order: R1-LO → R2-LO → R2-HI → R1-HI (WRONG!)
The slow disk has R2-HI written before R1-HI, causing data corruption.
Best Regards
Xiao
>
> Patch03 can fix this problem. But the root cause is the order in
> which waiters acquire the lock after being woken up. So besides patch03,
> it is necessary to ensure that the order of processing I/O matches the
> order of adding I/O. If we don't need to consider the upper layer
> submitting overlapping bios simultaneously, I'll fix this based on rdev
> rather than mddev.
>
> Best Regards
> Xiao
>
> >
> > >
> > > Xiao Ni (3):
> > > md: add return value of mddev_create_serial_pool
> > > md/raid1: fix data corruption by moving serialization to mddev level
> > > md/raid1: fix incorrect sector range in serialization
> > >
> > > drivers/md/md-bitmap.c | 28 +++++--
> > > drivers/md/md.c | 171 ++++++++++++++---------------------------
> > > drivers/md/md.h | 30 ++++----
> > > drivers/md/raid1.c | 47 ++++++-----
> > > 4 files changed, 115 insertions(+), 161 deletions(-)
> > >
> > --
> > Thanks,
> > Kuai
> >
* Re: [PATCH RFC 0/3] md/raid1: data corruption with serialization
2026-02-05 0:34 ` Xiao Ni
2026-02-05 1:06 ` Xiao Ni
@ 2026-02-05 1:44 ` Yu Kuai
2026-02-05 1:56 ` Xiao Ni
1 sibling, 1 reply; 11+ messages in thread
From: Yu Kuai @ 2026-02-05 1:44 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid, Guoqing Jiang, yukuai
On 2026/2/5 8:34, Xiao Ni wrote:
> I know why the writemostly case needs serialization. At first, I asked
> the same question: do we need to handle overlap if a filesystem is on
> top of raid? I got the answer from AI that filesystems can submit
> overlapping bios simultaneously; maybe that's the reason why
> serialize_policy was introduced? And a database is another case which
> can write to raid1 directly without a filesystem. So I started this
> job. If we don't need to consider overlap, why do we need
> serialize_policy? cc Guoqing, who is the author of serialize_policy.
Please don't just trust answers from AI; filesystems can't submit
overlapping bios concurrently. And I already said serialize_policy is for
the writemostly case.
And I also said we don't need to consider users managing data on a raw
disk and writing to the same area concurrently; a database will not do
this.
The reason is not complicated: if you write to the same LBA concurrently,
it's pointless because you don't know what you'll read after that.
--
Thanks,
Kuai
* Re: [PATCH RFC 0/3] md/raid1: data corruption with serialization
2026-02-05 1:06 ` Xiao Ni
@ 2026-02-05 1:48 ` Yu Kuai
2026-02-05 2:03 ` Xiao Ni
0 siblings, 1 reply; 11+ messages in thread
From: Yu Kuai @ 2026-02-05 1:48 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid, Guoqing Jiang, yukuai
Hi,
On 2026/2/5 9:06, Xiao Ni wrote:
> And let me change to this:
> Setup: RAID1 with one fast disk and one slow disk. Two requests R1 and
> R2, each split into LO and HI parts
>
> Step 1-10: Request R1 Processing
> - R1-LO: writes to fast disk → completes
> - R1-LO: writes to slow disk → starts (slow)
> - R1-HI: incorrectly waits for R1-LO (wrong sector range check)
> - R1-HI: R1-LO finishes and wakes up serial_io_wait. R1-HI finally
> writes to fast disk → completes
R1-hi should not wait for r1-lo, as they don't overlap. And as I said in
the first reply, if there are still problems with the WriteMostly case,
please rebase and keep serialization for rdev.
--
Thanks,
Kuai
* Re: [PATCH RFC 0/3] md/raid1: data corruption with serialization
2026-02-05 1:44 ` Yu Kuai
@ 2026-02-05 1:56 ` Xiao Ni
0 siblings, 0 replies; 11+ messages in thread
From: Xiao Ni @ 2026-02-05 1:56 UTC (permalink / raw)
To: yukuai; +Cc: linux-raid, Guoqing Jiang
On Thu, Feb 5, 2026 at 9:44 AM Yu Kuai <yukuai@fnnas.com> wrote:
>
> On 2026/2/5 8:34, Xiao Ni wrote:
> > I know why the writemostly case needs serialization. At first, I asked
> > the same question: do we need to handle overlap if a filesystem is on
> > top of raid? I got the answer from AI that filesystems can submit
> > overlapping bios simultaneously; maybe that's the reason why
> > serialize_policy was introduced? And a database is another case which
> > can write to raid1 directly without a filesystem. So I started this
> > job. If we don't need to consider overlap, why do we need
> > serialize_policy? cc Guoqing, who is the author of serialize_policy.
>
> Please don't just trust answers from AI; filesystems can't submit overlapping
Thanks for pointing this out.
> bios concurrently. And I already said serialize_policy is for the writemostly
> case.
See commit de31ee949739aba9ce7dbb8b10e72c6fce0e76c7 ("md: reorgnize
mddev_create/destroy_serial_pool"); serialize_policy is not for
writemostly.
>
> And I also said we don't need to consider users managing data on a raw
> disk and writing to the same area concurrently; a database will not do this.
>
> The reason is not complicated: if you write to the same LBA concurrently,
> it's pointless because you don't know what you'll read after that.
Thanks for the explanation.
>
> --
> Thanks,
> Kuai
>
* Re: [PATCH RFC 0/3] md/raid1: data corruption with serialization
2026-02-05 1:48 ` Yu Kuai
@ 2026-02-05 2:03 ` Xiao Ni
0 siblings, 0 replies; 11+ messages in thread
From: Xiao Ni @ 2026-02-05 2:03 UTC (permalink / raw)
To: yukuai; +Cc: linux-raid, Guoqing Jiang
On Thu, Feb 5, 2026 at 9:48 AM Yu Kuai <yukuai@fnnas.com> wrote:
>
> Hi,
>
> On 2026/2/5 9:06, Xiao Ni wrote:
> > And let me change to this:
> > Setup: RAID1 with one fast disk and one slow disk. Two requests R1 and
> > R2, each split into LO and HI parts
> >
> > Step 1-10: Request R1 Processing
> > - R1-LO: writes to fast disk → completes
> > - R1-LO: writes to slow disk → starts (slow)
> > - R1-HI: incorrectly waits for R1-LO (wrong sector range check)
> > - R1-HI: R1-LO finishes and wakes up serial_io_wait. R1-HI finally
> > writes to fast disk → completes
>
> R1-hi should not wait for r1-lo, as they don't overlap. And as I said
> in the first reply, if there are still problems with the WriteMostly case,
> please rebase and keep serialization for rdev.
Yes, it's a bug and patch03 fixes it. I'll re-send the patches based on
rdev serialization.
>
> --
> Thanks,
> Kuai
>
end of thread, other threads:[~2026-02-05 2:03 UTC | newest]
Thread overview: 11+ messages
2026-02-04 14:58 [PATCH RFC 0/3] md/raid1: data corruption with serialization Xiao Ni
2026-02-04 14:58 ` [PATCH RFC 1/3] md: add return value of mddev_create_serial_pool Xiao Ni
2026-02-04 14:58 ` [PATCH RFC 2/3] md/raid1: fix data corruption by moving serialization to mddev level Xiao Ni
2026-02-04 14:58 ` [PATCH RFC 3/3] md/raid1: fix incorrect sector range in serialization Xiao Ni
2026-02-04 15:58 ` [PATCH RFC 0/3] md/raid1: data corruption with serialization Yu Kuai
2026-02-05 0:34 ` Xiao Ni
2026-02-05 1:06 ` Xiao Ni
2026-02-05 1:48 ` Yu Kuai
2026-02-05 2:03 ` Xiao Ni
2026-02-05 1:44 ` Yu Kuai
2026-02-05 1:56 ` Xiao Ni