linux-doc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap
@ 2025-05-24  6:12 Yu Kuai
  2025-05-24  6:12 ` [PATCH 01/23] md: add a new parameter 'offset' to md_super_write() Yu Kuai
                   ` (25 more replies)
  0 siblings, 26 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:12 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

This is the formal version after previous RFC version:

https://lore.kernel.org/all/20250512011927.2809400-1-yukuai1@huaweicloud.com/

#### Background

Redundant data is used to enhance data fault tolerance, and the storage
method for redundant data vary depending on the RAID levels. And it's
important to maintain the consistency of redundant data.

Bitmap is used to record which data blocks have been synchronized and which
ones need to be resynchronized or recovered. Each bit in the bitmap
represents a segment of data in the array. When a bit is set, it indicates
that the multiple redundant copies of that data segment may not be
consistent. Data synchronization can be performed based on the bitmap after
power failure or readding a disk. If there is no bitmap, a full disk
synchronization is required.

#### Key Features

 - IO fastpath is lockless, if user issues lots of write IO to the same
 bitmap bit in a short time, only the first write have additional overhead
 to update bitmap bit, no additional overhead for the following writes;
 - support only resync or recover written data, means in the case creating
 new array or replacing with a new disk, there is no need to do a full disk
 resync/recovery;

#### Key Concept

##### State Machine

Each bit is one byte, contain 6 difference state, see llbitmap_state. And
there are total 8 differenct actions, see llbitmap_action, can change state:

llbitmap state machine: transitions between states

|           | Startwrite | Startsync | Endsync | Abortsync|
| --------- | ---------- | --------- | ------- | -------  |
| Unwritten | Dirty      | x         | x       | x        |
| Clean     | Dirty      | x         | x       | x        |
| Dirty     | x          | x         | x       | x        |
| NeedSync  | x          | Syncing   | x       | x        |
| Syncing   | x          | Syncing   | Dirty   | NeedSync |

|           | Reload   | Daemon | Discard   | Stale     |
| --------- | -------- | ------ | --------- | --------- |
| Unwritten | x        | x      | x         | x         |
| Clean     | x        | x      | Unwritten | NeedSync  |
| Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
| NeedSync  | x        | x      | Unwritten | x         |
| Syncing   | NeedSync | x      | Unwritten | NeedSync  |

Typical scenarios:

1) Create new array
All bits will be set to Unwritten by default, if --assume-clean is set,
all bits will be set to Clean instead.

2) write data, raid1/raid10 have full copy of data, while raid456 doesn't and
rely on xor data

2.1) write new data to raid1/raid10:
Unwritten --StartWrite--> Dirty

2.2) write new data to raid456:
Unwritten --StartWrite--> NeedSync

Because the initial recover for raid456 is skipped, the xor data is not build
yet, the bit must set to NeedSync first and after lazy initial recover is
finished, the bit will finially set to Dirty(see 5.1 and 5.4);

2.3) cover write
Clean --StartWrite--> Dirty

3) daemon, if the array is not degraded:
Dirty --Daemon--> Clean

For degraded array, the Dirty bit will never be cleared, prevent full disk
recovery while readding a removed disk.

4) discard
{Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten

5) resync and recover

5.1) common process
NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean

5.2) resync after power failure
Dirty --Reload--> NeedSync

5.3) recover while replacing with a new disk
By default, the old bitmap framework will recover all data, and llbitmap
implement this by a new helper, see llbitmap_skip_sync_blocks:

skip recover for bits other than dirty or clean;

5.4) lazy initial recover for raid5:
By default, the old bitmap framework will only allow new recover when there
are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER is add
to perform raid456 lazy recover for set bits(from 2.2).

##### Bitmap IO

##### Chunksize

The default bitmap size is 128k, incluing 1k bitmap super block, and
the default size of segment of data in the array each bit(chunksize) is 64k,
and chunksize will adjust to twice the old size each time if the total number
bits is not less than 127k.(see llbitmap_init)

##### READ

While creating bitmap, all pages will be allocated and read for llbitmap,
there won't be read afterwards

##### WRITE

WRITE IO is divided into logical_block_size of the array, the dirty state
of each block is tracked independently, for example:

each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 bit;

| page0 | page1 | ... | page 31 |
|       |
|        \-----------------------\
|                                |
| block0 | block1 | ... | block 8|
|        |
|         \-----------------\
|                            |
| bit0 | bit1 | ... | bit511 |

From IO path, if one bit is changed to Dirty or NeedSync, the corresponding
subpage will be marked dirty, such block must write first before the IO is
issued. This behaviour will affect IO performance, to reduce the impact, if
multiple bits are changed in the same block in a short time, all bits in this
block will be changed to Dirty/NeedSync, so that there won't be any overhead
until daemon clears dirty bits.

##### Dirty Bits syncronization

IO fast path will set bits to dirty, and those dirty bits will be cleared
by daemon after IO is done. llbitmap_page_ctl is used to synchronize between
IO path and daemon;

IO path:
 1) try to grab a reference, if succeed, set expire time after 5s and return;
 2) if failed to grab a reference, wait for daemon to finish clearing dirty
 bits;

Daemon(Daemon will be waken up every daemon_sleep seconds):
For each page:
 1) check if page expired, if not skip this page; for expired page:
 2) suspend the page and wait for inflight write IO to be done;
 3) change dirty page to clean;
 4) resume the page;

Performance Test:
Simple fio randwrite test to build array with 20GB ramdisk in my VM:

|                      | none      | bitmap    | llbitmap  |
| -------------------- | --------- | --------- | --------- |
| raid1                | 13.7MiB/s | 9696KiB/s | 19.5MiB/s |
| raid1(assume clean)  | 19.5MiB/s | 11.9MiB/s | 19.5MiB/s |
| raid10               | 21.9MiB/s | 11.6MiB/s | 27.8MiB/s |
| raid10(assume clean) | 27.8MiB/s | 15.4MiB/s | 27.8MiB/s |
| raid5                | 14.0MiB/s | 11.6MiB/s | 12.9MiB/s |
| raid5(assume clean)  | 17.8MiB/s | 13.4MiB/s | 13.9MiB/s |

For raid1/raid10 llbitmap can be better than none bitmap with background
initial resync, and it's the same as none bitmap without it.

Noted that llbitmap performance improvement for raid5 is not obvious,
this is due to raid5 has many other performance bottleneck, perf
results still shows that bitmap overhead will be much less.

following branch for review or test:
https://git.kernel.org/pub/scm/linux/kernel/git/yukuai/linux.git/log/?h=yukuai/md-llbitmap

Yu Kuai (23):
  md: add a new parameter 'offset' to md_super_write()
  md: factor out a helper raid_is_456()
  md/md-bitmap: cleanup bitmap_ops->startwrite()
  md/md-bitmap: support discard for bitmap ops
  md/md-bitmap: remove parameter slot from bitmap_create()
  md/md-bitmap: add a new sysfs api bitmap_type
  md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
  md/md-bitmap: add a new method blocks_synced() in bitmap_operations
  md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
  md/md-bitmap: make method bitmap_ops->daemon_work optional
  md/md-bitmap: add macros for lockless bitmap
  md/md-bitmap: fix dm-raid max_write_behind setting
  md/dm-raid: remove max_write_behind setting limit
  md/md-llbitmap: implement llbitmap IO
  md/md-llbitmap: implement bit state machine
  md/md-llbitmap: implement APIs for page level dirty bits
    synchronization
  md/md-llbitmap: implement APIs to mange bitmap lifetime
  md/md-llbitmap: implement APIs to dirty bits and clear bits
  md/md-llbitmap: implement APIs for sync_thread
  md/md-llbitmap: implement all bitmap operations
  md/md-llbitmap: implement sysfs APIs
  md/md-llbitmap: add Kconfig

 Documentation/admin-guide/md.rst |   80 +-
 drivers/md/Kconfig               |   11 +
 drivers/md/Makefile              |    2 +-
 drivers/md/dm-raid.c             |    6 +-
 drivers/md/md-bitmap.c           |   50 +-
 drivers/md/md-bitmap.h           |   55 +-
 drivers/md/md-llbitmap.c         | 1556 ++++++++++++++++++++++++++++++
 drivers/md/md.c                  |  247 +++--
 drivers/md/md.h                  |   20 +-
 drivers/md/raid5.c               |    6 +
 10 files changed, 1901 insertions(+), 132 deletions(-)
 create mode 100644 drivers/md/md-llbitmap.c

-- 
2.39.2


^ permalink raw reply	[flat|nested] 108+ messages in thread

* [PATCH 01/23] md: add a new parameter 'offset' to md_super_write()
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
@ 2025-05-24  6:12 ` Yu Kuai
  2025-05-25 15:50   ` Xiao Ni
                     ` (2 more replies)
  2025-05-24  6:12 ` [PATCH 02/23] md: factor out a helper raid_is_456() Yu Kuai
                   ` (24 subsequent siblings)
  25 siblings, 3 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:12 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

The parameter is always set to 0 for now, following patches will use
this helper to write llbitmap to underlying disks, allow writing
dirty sectors instead of the whole page.

Also rename md_super_write to md_write_metadata since there is nothing
super-block specific.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md-bitmap.c |  3 ++-
 drivers/md/md.c        | 28 ++++++++++++++--------------
 drivers/md/md.h        |  5 +++--
 3 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 431a3ab2e449..168eea6595b3 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -470,7 +470,8 @@ static int __write_sb_page(struct md_rdev *rdev, struct bitmap *bitmap,
 			return -EINVAL;
 	}
 
-	md_super_write(mddev, rdev, sboff + ps, (int)min(size, bitmap_limit), page);
+	md_write_metadata(mddev, rdev, sboff + ps, (int)min(size, bitmap_limit),
+			  page, 0);
 	return 0;
 }
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 32b997dfe6f4..18e03f651f6b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1021,8 +1021,9 @@ static void super_written(struct bio *bio)
 		wake_up(&mddev->sb_wait);
 }
 
-void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
-		   sector_t sector, int size, struct page *page)
+void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
+		       sector_t sector, int size, struct page *page,
+		       unsigned int offset)
 {
 	/* write first size bytes of page to sector of rdev
 	 * Increment mddev->pending_writes before returning
@@ -1047,7 +1048,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
 	atomic_inc(&rdev->nr_pending);
 
 	bio->bi_iter.bi_sector = sector;
-	__bio_add_page(bio, page, size, 0);
+	__bio_add_page(bio, page, size, offset);
 	bio->bi_private = rdev;
 	bio->bi_end_io = super_written;
 
@@ -1657,8 +1658,8 @@ super_90_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
 	if ((u64)num_sectors >= (2ULL << 32) && rdev->mddev->level >= 1)
 		num_sectors = (sector_t)(2ULL << 32) - 2;
 	do {
-		md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
-		       rdev->sb_page);
+		md_write_metadata(rdev->mddev, rdev, rdev->sb_start,
+				  rdev->sb_size, rdev->sb_page, 0);
 	} while (md_super_wait(rdev->mddev) < 0);
 	return num_sectors;
 }
@@ -2306,8 +2307,8 @@ super_1_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
 	sb->super_offset = cpu_to_le64(rdev->sb_start);
 	sb->sb_csum = calc_sb_1_csum(sb);
 	do {
-		md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
-			       rdev->sb_page);
+		md_write_metadata(rdev->mddev, rdev, rdev->sb_start,
+				  rdev->sb_size, rdev->sb_page, 0);
 	} while (md_super_wait(rdev->mddev) < 0);
 	return num_sectors;
 
@@ -2816,18 +2817,17 @@ void md_update_sb(struct mddev *mddev, int force_change)
 			continue; /* no noise on spare devices */
 
 		if (!test_bit(Faulty, &rdev->flags)) {
-			md_super_write(mddev,rdev,
-				       rdev->sb_start, rdev->sb_size,
-				       rdev->sb_page);
+			md_write_metadata(mddev, rdev, rdev->sb_start,
+					  rdev->sb_size, rdev->sb_page, 0);
 			pr_debug("md: (write) %pg's sb offset: %llu\n",
 				 rdev->bdev,
 				 (unsigned long long)rdev->sb_start);
 			rdev->sb_events = mddev->events;
 			if (rdev->badblocks.size) {
-				md_super_write(mddev, rdev,
-					       rdev->badblocks.sector,
-					       rdev->badblocks.size << 9,
-					       rdev->bb_page);
+				md_write_metadata(mddev, rdev,
+						  rdev->badblocks.sector,
+						  rdev->badblocks.size << 9,
+						  rdev->bb_page, 0);
 				rdev->badblocks.size = 0;
 			}
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 6eb5dfdf2f55..5ba4a9093a92 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -886,8 +886,9 @@ void md_account_bio(struct mddev *mddev, struct bio **bio);
 void md_free_cloned_bio(struct bio *bio);
 
 extern bool __must_check md_flush_request(struct mddev *mddev, struct bio *bio);
-extern void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
-			   sector_t sector, int size, struct page *page);
+extern void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
+			      sector_t sector, int size, struct page *page,
+			      unsigned int offset);
 extern int md_super_wait(struct mddev *mddev);
 extern int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
 		struct page *page, blk_opf_t opf, bool metadata_op);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 02/23] md: factor out a helper raid_is_456()
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
  2025-05-24  6:12 ` [PATCH 01/23] md: add a new parameter 'offset' to md_super_write() Yu Kuai
@ 2025-05-24  6:12 ` Yu Kuai
  2025-05-25 15:50   ` Xiao Ni
                     ` (2 more replies)
  2025-05-24  6:13 ` [PATCH 03/23] md/md-bitmap: cleanup bitmap_ops->startwrite() Yu Kuai
                   ` (23 subsequent siblings)
  25 siblings, 3 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:12 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

There are no functional changes, the helper will be used by llbitmap in
following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md.c | 9 +--------
 drivers/md/md.h | 6 ++++++
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 18e03f651f6b..b0468e795d94 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9037,19 +9037,12 @@ static sector_t md_sync_position(struct mddev *mddev, enum sync_action action)
 
 static bool sync_io_within_limit(struct mddev *mddev)
 {
-	int io_sectors;
-
 	/*
 	 * For raid456, sync IO is stripe(4k) per IO, for other levels, it's
 	 * RESYNC_PAGES(64k) per IO.
 	 */
-	if (mddev->level == 4 || mddev->level == 5 || mddev->level == 6)
-		io_sectors = 8;
-	else
-		io_sectors = 128;
-
 	return atomic_read(&mddev->recovery_active) <
-		io_sectors * sync_io_depth(mddev);
+	       (raid_is_456(mddev) ? 8 : 128) * sync_io_depth(mddev);
 }
 
 #define SYNC_MARKS	10
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 5ba4a9093a92..c241119e6ef3 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -1011,6 +1011,12 @@ static inline bool mddev_is_dm(struct mddev *mddev)
 	return !mddev->gendisk;
 }
 
+static inline bool raid_is_456(struct mddev *mddev)
+{
+	return mddev->level == ID_RAID4 || mddev->level == ID_RAID5 ||
+	       mddev->level == ID_RAID6;
+}
+
 static inline void mddev_trace_remap(struct mddev *mddev, struct bio *bio,
 		sector_t sector)
 {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 03/23] md/md-bitmap: cleanup bitmap_ops->startwrite()
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
  2025-05-24  6:12 ` [PATCH 01/23] md: add a new parameter 'offset' to md_super_write() Yu Kuai
  2025-05-24  6:12 ` [PATCH 02/23] md: factor out a helper raid_is_456() Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-25 15:51   ` Xiao Ni
                     ` (2 more replies)
  2025-05-24  6:13 ` [PATCH 04/23] md/md-bitmap: support discard for bitmap ops Yu Kuai
                   ` (22 subsequent siblings)
  25 siblings, 3 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

bitmap_startwrite() always return 0, and the caller doesn't check return
value as well, hence change the method to void.

Also rename startwrite/endwrite to start_write/end_write, which is more in
line with the usual naming convention.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md-bitmap.c | 17 ++++++++---------
 drivers/md/md-bitmap.h |  6 +++---
 drivers/md/md.c        |  8 ++++----
 3 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 168eea6595b3..2997e09d463d 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1669,13 +1669,13 @@ __acquires(bitmap->lock)
 			&(bitmap->bp[page].map[pageoff]);
 }
 
-static int bitmap_startwrite(struct mddev *mddev, sector_t offset,
-			     unsigned long sectors)
+static void bitmap_start_write(struct mddev *mddev, sector_t offset,
+			       unsigned long sectors)
 {
 	struct bitmap *bitmap = mddev->bitmap;
 
 	if (!bitmap)
-		return 0;
+		return;
 
 	while (sectors) {
 		sector_t blocks;
@@ -1685,7 +1685,7 @@ static int bitmap_startwrite(struct mddev *mddev, sector_t offset,
 		bmc = md_bitmap_get_counter(&bitmap->counts, offset, &blocks, 1);
 		if (!bmc) {
 			spin_unlock_irq(&bitmap->counts.lock);
-			return 0;
+			return;
 		}
 
 		if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
@@ -1721,11 +1721,10 @@ static int bitmap_startwrite(struct mddev *mddev, sector_t offset,
 		else
 			sectors = 0;
 	}
-	return 0;
 }
 
-static void bitmap_endwrite(struct mddev *mddev, sector_t offset,
-			    unsigned long sectors)
+static void bitmap_end_write(struct mddev *mddev, sector_t offset,
+			     unsigned long sectors)
 {
 	struct bitmap *bitmap = mddev->bitmap;
 
@@ -2990,8 +2989,8 @@ static struct bitmap_operations bitmap_ops = {
 	.end_behind_write	= bitmap_end_behind_write,
 	.wait_behind_writes	= bitmap_wait_behind_writes,
 
-	.startwrite		= bitmap_startwrite,
-	.endwrite		= bitmap_endwrite,
+	.start_write		= bitmap_start_write,
+	.end_write		= bitmap_end_write,
 	.start_sync		= bitmap_start_sync,
 	.end_sync		= bitmap_end_sync,
 	.cond_end_sync		= bitmap_cond_end_sync,
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index d3d50629af91..9474e0d86fc6 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -90,10 +90,10 @@ struct bitmap_operations {
 	void (*end_behind_write)(struct mddev *mddev);
 	void (*wait_behind_writes)(struct mddev *mddev);
 
-	int (*startwrite)(struct mddev *mddev, sector_t offset,
+	void (*start_write)(struct mddev *mddev, sector_t offset,
+			    unsigned long sectors);
+	void (*end_write)(struct mddev *mddev, sector_t offset,
 			  unsigned long sectors);
-	void (*endwrite)(struct mddev *mddev, sector_t offset,
-			 unsigned long sectors);
 	bool (*start_sync)(struct mddev *mddev, sector_t offset,
 			   sector_t *blocks, bool degraded);
 	void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index b0468e795d94..04a659f40cd6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8849,14 +8849,14 @@ static void md_bitmap_start(struct mddev *mddev,
 		mddev->pers->bitmap_sector(mddev, &md_io_clone->offset,
 					   &md_io_clone->sectors);
 
-	mddev->bitmap_ops->startwrite(mddev, md_io_clone->offset,
-				      md_io_clone->sectors);
+	mddev->bitmap_ops->start_write(mddev, md_io_clone->offset,
+				       md_io_clone->sectors);
 }
 
 static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone)
 {
-	mddev->bitmap_ops->endwrite(mddev, md_io_clone->offset,
-				    md_io_clone->sectors);
+	mddev->bitmap_ops->end_write(mddev, md_io_clone->offset,
+				     md_io_clone->sectors);
 }
 
 static void md_end_clone_io(struct bio *bio)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 04/23] md/md-bitmap: support discard for bitmap ops
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (2 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 03/23] md/md-bitmap: cleanup bitmap_ops->startwrite() Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-25 15:53   ` Xiao Ni
                     ` (3 more replies)
  2025-05-24  6:13 ` [PATCH 05/23] md/md-bitmap: remove parameter slot from bitmap_create() Yu Kuai
                   ` (21 subsequent siblings)
  25 siblings, 4 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Use two new methods {start, end}_discard to handle discard IO, prepare
to support new md bitmap.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md-bitmap.c |  3 +++
 drivers/md/md-bitmap.h | 12 ++++++++----
 drivers/md/md.c        | 15 +++++++++++----
 drivers/md/md.h        |  1 +
 4 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 2997e09d463d..848626049dea 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -2991,6 +2991,9 @@ static struct bitmap_operations bitmap_ops = {
 
 	.start_write		= bitmap_start_write,
 	.end_write		= bitmap_end_write,
+	.start_discard		= bitmap_start_write,
+	.end_discard		= bitmap_end_write,
+
 	.start_sync		= bitmap_start_sync,
 	.end_sync		= bitmap_end_sync,
 	.cond_end_sync		= bitmap_cond_end_sync,
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 9474e0d86fc6..4d804c07dbdd 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -70,6 +70,9 @@ struct md_bitmap_stats {
 	struct file	*file;
 };
 
+typedef void (md_bitmap_fn)(struct mddev *mddev, sector_t offset,
+			    unsigned long sectors);
+
 struct bitmap_operations {
 	struct md_submodule_head head;
 
@@ -90,10 +93,11 @@ struct bitmap_operations {
 	void (*end_behind_write)(struct mddev *mddev);
 	void (*wait_behind_writes)(struct mddev *mddev);
 
-	void (*start_write)(struct mddev *mddev, sector_t offset,
-			    unsigned long sectors);
-	void (*end_write)(struct mddev *mddev, sector_t offset,
-			  unsigned long sectors);
+	md_bitmap_fn *start_write;
+	md_bitmap_fn *end_write;
+	md_bitmap_fn *start_discard;
+	md_bitmap_fn *end_discard;
+
 	bool (*start_sync)(struct mddev *mddev, sector_t offset,
 			   sector_t *blocks, bool degraded);
 	void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 04a659f40cd6..466087cef4f9 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8845,18 +8845,24 @@ EXPORT_SYMBOL_GPL(md_submit_discard_bio);
 static void md_bitmap_start(struct mddev *mddev,
 			    struct md_io_clone *md_io_clone)
 {
+	md_bitmap_fn *fn = unlikely(md_io_clone->rw == STAT_DISCARD) ?
+			   mddev->bitmap_ops->start_discard :
+			   mddev->bitmap_ops->start_write;
+
 	if (mddev->pers->bitmap_sector)
 		mddev->pers->bitmap_sector(mddev, &md_io_clone->offset,
 					   &md_io_clone->sectors);
 
-	mddev->bitmap_ops->start_write(mddev, md_io_clone->offset,
-				       md_io_clone->sectors);
+	fn(mddev, md_io_clone->offset, md_io_clone->sectors);
 }
 
 static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone)
 {
-	mddev->bitmap_ops->end_write(mddev, md_io_clone->offset,
-				     md_io_clone->sectors);
+	md_bitmap_fn *fn = unlikely(md_io_clone->rw == STAT_DISCARD) ?
+			   mddev->bitmap_ops->end_discard :
+			   mddev->bitmap_ops->end_write;
+
+	fn(mddev, md_io_clone->offset, md_io_clone->sectors);
 }
 
 static void md_end_clone_io(struct bio *bio)
@@ -8895,6 +8901,7 @@ static void md_clone_bio(struct mddev *mddev, struct bio **bio)
 	if (bio_data_dir(*bio) == WRITE && md_bitmap_enabled(mddev)) {
 		md_io_clone->offset = (*bio)->bi_iter.bi_sector;
 		md_io_clone->sectors = bio_sectors(*bio);
+		md_io_clone->rw = op_stat_group(bio_op(*bio));
 		md_bitmap_start(mddev, md_io_clone);
 	}
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index c241119e6ef3..13e3f9ce1b79 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -850,6 +850,7 @@ struct md_io_clone {
 	unsigned long	start_time;
 	sector_t	offset;
 	unsigned long	sectors;
+	enum stat_group	rw;
 	struct bio	bio_clone;
 };
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 05/23] md/md-bitmap: remove parameter slot from bitmap_create()
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (3 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 04/23] md/md-bitmap: support discard for bitmap ops Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-25 16:09   ` Xiao Ni
                     ` (2 more replies)
  2025-05-24  6:13 ` [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type Yu Kuai
                   ` (20 subsequent siblings)
  25 siblings, 3 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

All callers pass in '-1' for 'slot', hence it can be removed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/md-bitmap.c | 6 +++---
 drivers/md/md-bitmap.h | 2 +-
 drivers/md/md.c        | 6 +++---
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 848626049dea..17d41a7b30ce 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -2185,9 +2185,9 @@ static struct bitmap *__bitmap_create(struct mddev *mddev, int slot)
 	return ERR_PTR(err);
 }
 
-static int bitmap_create(struct mddev *mddev, int slot)
+static int bitmap_create(struct mddev *mddev)
 {
-	struct bitmap *bitmap = __bitmap_create(mddev, slot);
+	struct bitmap *bitmap = __bitmap_create(mddev, -1);
 
 	if (IS_ERR(bitmap))
 		return PTR_ERR(bitmap);
@@ -2649,7 +2649,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
 			}
 
 			mddev->bitmap_info.offset = offset;
-			rv = bitmap_create(mddev, -1);
+			rv = bitmap_create(mddev);
 			if (rv)
 				goto out;
 
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 4d804c07dbdd..2b99ddef7a41 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -77,7 +77,7 @@ struct bitmap_operations {
 	struct md_submodule_head head;
 
 	bool (*enabled)(void *data);
-	int (*create)(struct mddev *mddev, int slot);
+	int (*create)(struct mddev *mddev);
 	int (*resize)(struct mddev *mddev, sector_t blocks, int chunksize);
 
 	int (*load)(struct mddev *mddev);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 466087cef4f9..311e52d5173d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6255,7 +6255,7 @@ int md_run(struct mddev *mddev)
 	}
 	if (err == 0 && pers->sync_request && md_bitmap_registered(mddev) &&
 	    (mddev->bitmap_info.file || mddev->bitmap_info.offset)) {
-		err = mddev->bitmap_ops->create(mddev, -1);
+		err = mddev->bitmap_ops->create(mddev);
 		if (err)
 			pr_warn("%s: failed to create bitmap (%d)\n",
 				mdname(mddev), err);
@@ -7324,7 +7324,7 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
 	err = 0;
 	if (mddev->pers) {
 		if (fd >= 0) {
-			err = mddev->bitmap_ops->create(mddev, -1);
+			err = mddev->bitmap_ops->create(mddev);
 			if (!err)
 				err = mddev->bitmap_ops->load(mddev);
 
@@ -7648,7 +7648,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
 				mddev->bitmap_info.default_offset;
 			mddev->bitmap_info.space =
 				mddev->bitmap_info.default_space;
-			rv = mddev->bitmap_ops->create(mddev, -1);
+			rv = mddev->bitmap_ops->create(mddev);
 			if (!rv)
 				rv = mddev->bitmap_ops->load(mddev);
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (4 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 05/23] md/md-bitmap: remove parameter slot from bitmap_create() Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-25 16:32   ` Xiao Ni
                     ` (2 more replies)
  2025-05-24  6:13 ` [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap Yu Kuai
                   ` (19 subsequent siblings)
  25 siblings, 3 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

The api will be used by mdadm to set bitmap_ops while creating new array
or assemble array, prepare to add a new bitmap.

Currently available options are:

cat /sys/block/md0/md/bitmap_type
none [bitmap]

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 Documentation/admin-guide/md.rst | 73 ++++++++++++++----------
 drivers/md/md.c                  | 96 ++++++++++++++++++++++++++++++--
 drivers/md/md.h                  |  2 +
 3 files changed, 135 insertions(+), 36 deletions(-)

diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index 4ff2cc291d18..356d2a344f08 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -347,6 +347,49 @@ All md devices contain:
      active-idle
          like active, but no writes have been seen for a while (safe_mode_delay).
 
+  consistency_policy
+     This indicates how the array maintains consistency in case of unexpected
+     shutdown. It can be:
+
+     none
+       Array has no redundancy information, e.g. raid0, linear.
+
+     resync
+       Full resync is performed and all redundancy is regenerated when the
+       array is started after unclean shutdown.
+
+     bitmap
+       Resync assisted by a write-intent bitmap.
+
+     journal
+       For raid4/5/6, journal device is used to log transactions and replay
+       after unclean shutdown.
+
+     ppl
+       For raid5 only, Partial Parity Log is used to close the write hole and
+       eliminate resync.
+
+     The accepted values when writing to this file are ``ppl`` and ``resync``,
+     used to enable and disable PPL.
+
+  uuid
+     This indicates the UUID of the array in the following format:
+     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
+
+  bitmap_type
+     [RW] When read, this file will display the current and available
+     bitmap for this array. The currently active bitmap will be enclosed
+     in [] brackets. Writing an bitmap name or ID to this file will switch
+     control of this array to that new bitmap. Note that writing a new
+     bitmap for created array is forbidden.
+
+     none
+         No bitmap
+     bitmap
+         The default internal bitmap
+
+If bitmap_type is bitmap, then the md device will also contain:
+
   bitmap/location
      This indicates where the write-intent bitmap for the array is
      stored.
@@ -401,36 +444,6 @@ All md devices contain:
      once the array becomes non-degraded, and this fact has been
      recorded in the metadata.
 
-  consistency_policy
-     This indicates how the array maintains consistency in case of unexpected
-     shutdown. It can be:
-
-     none
-       Array has no redundancy information, e.g. raid0, linear.
-
-     resync
-       Full resync is performed and all redundancy is regenerated when the
-       array is started after unclean shutdown.
-
-     bitmap
-       Resync assisted by a write-intent bitmap.
-
-     journal
-       For raid4/5/6, journal device is used to log transactions and replay
-       after unclean shutdown.
-
-     ppl
-       For raid5 only, Partial Parity Log is used to close the write hole and
-       eliminate resync.
-
-     The accepted values when writing to this file are ``ppl`` and ``resync``,
-     used to enable and disable PPL.
-
-  uuid
-     This indicates the UUID of the array in the following format:
-     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
-
-
 As component devices are added to an md array, they appear in the ``md``
 directory as new directories named::
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 311e52d5173d..4eb0c6effd5b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -672,13 +672,18 @@ static void active_io_release(struct percpu_ref *ref)
 
 static void no_op(struct percpu_ref *r) {}
 
-static void mddev_set_bitmap_ops(struct mddev *mddev, enum md_submodule_id id)
+static bool mddev_set_bitmap_ops(struct mddev *mddev)
 {
 	xa_lock(&md_submodule);
-	mddev->bitmap_ops = xa_load(&md_submodule, id);
+	mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
 	xa_unlock(&md_submodule);
-	if (!mddev->bitmap_ops)
-		pr_warn_once("md: can't find bitmap id %d\n", id);
+
+	if (!mddev->bitmap_ops) {
+		pr_warn_once("md: can't find bitmap id %d\n", mddev->bitmap_id);
+		return false;
+	}
+
+	return true;
 }
 
 static void mddev_clear_bitmap_ops(struct mddev *mddev)
@@ -688,8 +693,10 @@ static void mddev_clear_bitmap_ops(struct mddev *mddev)
 
 int mddev_init(struct mddev *mddev)
 {
-	/* TODO: support more versions */
-	mddev_set_bitmap_ops(mddev, ID_BITMAP);
+	mddev->bitmap_id = ID_BITMAP;
+
+	if (!mddev_set_bitmap_ops(mddev))
+		return -EINVAL;
 
 	if (percpu_ref_init(&mddev->active_io, active_io_release,
 			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
@@ -4155,6 +4162,82 @@ new_level_store(struct mddev *mddev, const char *buf, size_t len)
 static struct md_sysfs_entry md_new_level =
 __ATTR(new_level, 0664, new_level_show, new_level_store);
 
+static ssize_t
+bitmap_type_show(struct mddev *mddev, char *page)
+{
+	struct md_submodule_head *head;
+	unsigned long i;
+	ssize_t len = 0;
+
+	if (mddev->bitmap_id == ID_BITMAP_NONE)
+		len += sprintf(page + len, "[none] ");
+	else
+		len += sprintf(page + len, "none ");
+
+	xa_lock(&md_submodule);
+	xa_for_each(&md_submodule, i, head) {
+		if (head->type != MD_BITMAP)
+			continue;
+
+		if (mddev->bitmap_id == head->id)
+			len += sprintf(page + len, "[%s] ", head->name);
+		else
+			len += sprintf(page + len, "%s ", head->name);
+	}
+	xa_unlock(&md_submodule);
+
+	len += sprintf(page + len, "\n");
+	return len;
+}
+
+static ssize_t
+bitmap_type_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	struct md_submodule_head *head;
+	enum md_submodule_id id;
+	unsigned long i;
+	int err;
+
+	if (mddev->bitmap_ops)
+		return -EBUSY;
+
+	err = kstrtoint(buf, 10, &id);
+	if (!err) {
+		if (id == ID_BITMAP_NONE) {
+			mddev->bitmap_id = id;
+			return len;
+		}
+
+		xa_lock(&md_submodule);
+		head = xa_load(&md_submodule, id);
+		xa_unlock(&md_submodule);
+
+		if (head && head->type == MD_BITMAP) {
+			mddev->bitmap_id = id;
+			return len;
+		}
+	}
+
+	if (cmd_match(buf, "none")) {
+		mddev->bitmap_id = ID_BITMAP_NONE;
+		return len;
+	}
+
+	xa_lock(&md_submodule);
+	xa_for_each(&md_submodule, i, head) {
+		if (head->type == MD_BITMAP && cmd_match(buf, head->name)) {
+			mddev->bitmap_id = head->id;
+			xa_unlock(&md_submodule);
+			return len;
+		}
+	}
+	xa_unlock(&md_submodule);
+	return -ENOENT;
+}
+
+static struct md_sysfs_entry md_bitmap_type =
+__ATTR(bitmap_type, 0664, bitmap_type_show, bitmap_type_store);
+
 static ssize_t
 layout_show(struct mddev *mddev, char *page)
 {
@@ -5719,6 +5802,7 @@ __ATTR(serialize_policy, S_IRUGO | S_IWUSR, serialize_policy_show,
 static struct attribute *md_default_attrs[] = {
 	&md_level.attr,
 	&md_new_level.attr,
+	&md_bitmap_type.attr,
 	&md_layout.attr,
 	&md_raid_disks.attr,
 	&md_uuid.attr,
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 13e3f9ce1b79..bf34c0a36551 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -40,6 +40,7 @@ enum md_submodule_id {
 	ID_CLUSTER,
 	ID_BITMAP,
 	ID_LLBITMAP,	/* TODO */
+	ID_BITMAP_NONE,
 };
 
 struct md_submodule_head {
@@ -565,6 +566,7 @@ struct mddev {
 	struct percpu_ref		writes_pending;
 	int				sync_checkers;	/* # of threads checking writes_pending */
 
+	enum md_submodule_id		bitmap_id;
 	void				*bitmap; /* the bitmap for the device */
 	struct bitmap_operations	*bitmap_ops;
 	struct {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (5 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-26  6:32   ` Christoph Hellwig
                     ` (2 more replies)
  2025-05-24  6:13 ` [PATCH 08/23] md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations Yu Kuai
                   ` (18 subsequent siblings)
  25 siblings, 3 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Currently bitmap_ops is registered while allocating mddev, this is fine
when there is only one bitmap_ops, however, after introduing a new
bitmap_ops, user space need a time window to choose which bitmap_ops to
use while creating new array.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md.c | 86 +++++++++++++++++++++++++++++++------------------
 1 file changed, 55 insertions(+), 31 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4eb0c6effd5b..dc4b85f30e13 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -674,39 +674,50 @@ static void no_op(struct percpu_ref *r) {}
 
 static bool mddev_set_bitmap_ops(struct mddev *mddev)
 {
+	struct bitmap_operations *old = mddev->bitmap_ops;
+	struct md_submodule_head *head;
+
+	if (mddev->bitmap_id == ID_BITMAP_NONE ||
+	    (old && old->head.id == mddev->bitmap_id))
+		return true;
+
 	xa_lock(&md_submodule);
-	mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
+	head = xa_load(&md_submodule, mddev->bitmap_id);
 	xa_unlock(&md_submodule);
 
-	if (!mddev->bitmap_ops) {
-		pr_warn_once("md: can't find bitmap id %d\n", mddev->bitmap_id);
+	if (WARN_ON_ONCE(!head || head->type != MD_BITMAP)) {
+		pr_err("md: can't find bitmap id %d\n", mddev->bitmap_id);
 		return false;
 	}
 
+	if (old && old->group)
+		sysfs_remove_group(&mddev->kobj, old->group);
+
+	mddev->bitmap_ops = (void *)head;
+	if (mddev->bitmap_ops && mddev->bitmap_ops->group &&
+	    sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group))
+		pr_warn("md: cannot register extra bitmap attributes for %s\n",
+			mdname(mddev));
+
 	return true;
 }
 
 static void mddev_clear_bitmap_ops(struct mddev *mddev)
 {
+	if (mddev->bitmap_ops && mddev->bitmap_ops->group)
+		sysfs_remove_group(&mddev->kobj, mddev->bitmap_ops->group);
+
 	mddev->bitmap_ops = NULL;
 }
 
 int mddev_init(struct mddev *mddev)
 {
-	mddev->bitmap_id = ID_BITMAP;
-
-	if (!mddev_set_bitmap_ops(mddev))
-		return -EINVAL;
-
 	if (percpu_ref_init(&mddev->active_io, active_io_release,
-			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
-		mddev_clear_bitmap_ops(mddev);
+			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
 		return -ENOMEM;
-	}
 
 	if (percpu_ref_init(&mddev->writes_pending, no_op,
 			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
-		mddev_clear_bitmap_ops(mddev);
 		percpu_ref_exit(&mddev->active_io);
 		return -ENOMEM;
 	}
@@ -734,6 +745,7 @@ int mddev_init(struct mddev *mddev)
 	mddev->resync_min = 0;
 	mddev->resync_max = MaxSector;
 	mddev->level = LEVEL_NONE;
+	mddev->bitmap_id = ID_BITMAP;
 
 	INIT_WORK(&mddev->sync_work, md_start_sync);
 	INIT_WORK(&mddev->del_work, mddev_delayed_delete);
@@ -744,7 +756,6 @@ EXPORT_SYMBOL_GPL(mddev_init);
 
 void mddev_destroy(struct mddev *mddev)
 {
-	mddev_clear_bitmap_ops(mddev);
 	percpu_ref_exit(&mddev->active_io);
 	percpu_ref_exit(&mddev->writes_pending);
 }
@@ -6093,11 +6104,6 @@ struct mddev *md_alloc(dev_t dev, char *name)
 		return ERR_PTR(error);
 	}
 
-	if (md_bitmap_registered(mddev) && mddev->bitmap_ops->group)
-		if (sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group))
-			pr_warn("md: cannot register extra bitmap attributes for %s\n",
-				mdname(mddev));
-
 	kobject_uevent(&mddev->kobj, KOBJ_ADD);
 	mddev->sysfs_state = sysfs_get_dirent_safe(mddev->kobj.sd, "array_state");
 	mddev->sysfs_level = sysfs_get_dirent_safe(mddev->kobj.sd, "level");
@@ -6173,6 +6179,26 @@ static void md_safemode_timeout(struct timer_list *t)
 
 static int start_dirty_degraded;
 
+static int md_bitmap_create(struct mddev *mddev)
+{
+	if (mddev->bitmap_id == ID_BITMAP_NONE)
+		return -EINVAL;
+
+	if (!mddev_set_bitmap_ops(mddev))
+		return -ENOENT;
+
+	return mddev->bitmap_ops->create(mddev);
+}
+
+static void md_bitmap_destroy(struct mddev *mddev)
+{
+	if (!md_bitmap_registered(mddev))
+		return;
+
+	mddev->bitmap_ops->destroy(mddev);
+	mddev_clear_bitmap_ops(mddev);
+}
+
 int md_run(struct mddev *mddev)
 {
 	int err;
@@ -6337,9 +6363,9 @@ int md_run(struct mddev *mddev)
 			(unsigned long long)pers->size(mddev, 0, 0) / 2);
 		err = -EINVAL;
 	}
-	if (err == 0 && pers->sync_request && md_bitmap_registered(mddev) &&
+	if (err == 0 && pers->sync_request &&
 	    (mddev->bitmap_info.file || mddev->bitmap_info.offset)) {
-		err = mddev->bitmap_ops->create(mddev);
+		err = md_bitmap_create(mddev);
 		if (err)
 			pr_warn("%s: failed to create bitmap (%d)\n",
 				mdname(mddev), err);
@@ -6412,8 +6438,7 @@ int md_run(struct mddev *mddev)
 		pers->free(mddev, mddev->private);
 	mddev->private = NULL;
 	put_pers(pers);
-	if (md_bitmap_registered(mddev))
-		mddev->bitmap_ops->destroy(mddev);
+	md_bitmap_destroy(mddev);
 abort:
 	bioset_exit(&mddev->io_clone_set);
 exit_sync_set:
@@ -6436,7 +6461,7 @@ int do_md_run(struct mddev *mddev)
 	if (md_bitmap_registered(mddev)) {
 		err = mddev->bitmap_ops->load(mddev);
 		if (err) {
-			mddev->bitmap_ops->destroy(mddev);
+			md_bitmap_destroy(mddev);
 			goto out;
 		}
 	}
@@ -6627,8 +6652,7 @@ static void __md_stop(struct mddev *mddev)
 {
 	struct md_personality *pers = mddev->pers;
 
-	if (md_bitmap_registered(mddev))
-		mddev->bitmap_ops->destroy(mddev);
+	md_bitmap_destroy(mddev);
 	mddev_detach(mddev);
 	spin_lock(&mddev->lock);
 	mddev->pers = NULL;
@@ -7408,16 +7432,16 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
 	err = 0;
 	if (mddev->pers) {
 		if (fd >= 0) {
-			err = mddev->bitmap_ops->create(mddev);
+			err = md_bitmap_create(mddev);
 			if (!err)
 				err = mddev->bitmap_ops->load(mddev);
 
 			if (err) {
-				mddev->bitmap_ops->destroy(mddev);
+				md_bitmap_destroy(mddev);
 				fd = -1;
 			}
 		} else if (fd < 0) {
-			mddev->bitmap_ops->destroy(mddev);
+			md_bitmap_destroy(mddev);
 		}
 	}
 
@@ -7732,12 +7756,12 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
 				mddev->bitmap_info.default_offset;
 			mddev->bitmap_info.space =
 				mddev->bitmap_info.default_space;
-			rv = mddev->bitmap_ops->create(mddev);
+			rv = md_bitmap_create(mddev);
 			if (!rv)
 				rv = mddev->bitmap_ops->load(mddev);
 
 			if (rv)
-				mddev->bitmap_ops->destroy(mddev);
+				md_bitmap_destroy(mddev);
 		} else {
 			struct md_bitmap_stats stats;
 
@@ -7763,7 +7787,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
 				put_cluster_ops(mddev);
 				mddev->safemode_delay = DEFAULT_SAFEMODE_DELAY;
 			}
-			mddev->bitmap_ops->destroy(mddev);
+			md_bitmap_destroy(mddev);
 			mddev->bitmap_info.offset = 0;
 		}
 	}
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 08/23] md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (6 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-26  7:03   ` Xiao Ni
  2025-05-27  6:14   ` Hannes Reinecke
  2025-05-24  6:13 ` [PATCH 09/23] md/md-bitmap: add a new method blocks_synced() " Yu Kuai
                   ` (17 subsequent siblings)
  25 siblings, 2 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

This method is used to check if blocks can be skipped before calling
into pers->sync_request(), llbiltmap will use this method to skip
resync for unwritten/clean data blocks, and recovery/check/repair for
unwritten data blocks;

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/md-bitmap.h | 1 +
 drivers/md/md.c        | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 2b99ddef7a41..0de14d475ad3 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -98,6 +98,7 @@ struct bitmap_operations {
 	md_bitmap_fn *start_discard;
 	md_bitmap_fn *end_discard;
 
+	sector_t (*skip_sync_blocks)(struct mddev *mddev, sector_t offset);
 	bool (*start_sync)(struct mddev *mddev, sector_t offset,
 			   sector_t *blocks, bool degraded);
 	void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index dc4b85f30e13..890c8da43b3b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9362,6 +9362,12 @@ void md_do_sync(struct md_thread *thread)
 		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
 			break;
 
+		if (mddev->bitmap_ops && mddev->bitmap_ops->skip_sync_blocks) {
+			sectors = mddev->bitmap_ops->skip_sync_blocks(mddev, j);
+			if (sectors)
+				goto update;
+		}
+
 		sectors = mddev->pers->sync_request(mddev, j, max_sectors,
 						    &skipped);
 		if (sectors == 0) {
@@ -9377,6 +9383,7 @@ void md_do_sync(struct md_thread *thread)
 		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
 			break;
 
+update:
 		j += sectors;
 		if (j > max_sectors)
 			/* when skipping, extra large numbers can be returned. */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 09/23] md/md-bitmap: add a new method blocks_synced() in bitmap_operations
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (7 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 08/23] md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-27  2:35   ` Xiao Ni
  2025-05-27  6:16   ` Hannes Reinecke
  2025-05-24  6:13 ` [PATCH 10/23] md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER Yu Kuai
                   ` (16 subsequent siblings)
  25 siblings, 2 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Currently, raid456 must perform a whole array initial recovery to build
initail xor data, then IO to the array won't have to read all the blocks
in underlying disks.

This behavior will affect IO performance a lot, and nowadays there are
huge disks and the initial recovery can take a long time. Hence llbitmap
will support lazy initial recovery in following patches. This method is
used to check if data blocks is synced or not, if not then IO will still
have to read all blocks for raid456.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/md-bitmap.h | 1 +
 drivers/md/raid5.c     | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 0de14d475ad3..f2d79c8a23b7 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -99,6 +99,7 @@ struct bitmap_operations {
 	md_bitmap_fn *end_discard;
 
 	sector_t (*skip_sync_blocks)(struct mddev *mddev, sector_t offset);
+	bool (*blocks_synced)(struct mddev *mddev, sector_t offset);
 	bool (*start_sync)(struct mddev *mddev, sector_t offset,
 			   sector_t *blocks, bool degraded);
 	void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7e66a99f29af..e5d3d8facb4b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3748,6 +3748,7 @@ static int want_replace(struct stripe_head *sh, int disk_idx)
 static int need_this_block(struct stripe_head *sh, struct stripe_head_state *s,
 			   int disk_idx, int disks)
 {
+	struct mddev *mddev = sh->raid_conf->mddev;
 	struct r5dev *dev = &sh->dev[disk_idx];
 	struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]],
 				  &sh->dev[s->failed_num[1]] };
@@ -3762,6 +3763,11 @@ static int need_this_block(struct stripe_head *sh, struct stripe_head_state *s,
 		 */
 		return 0;
 
+	/* The initial recover is not done, must read everything */
+	if (mddev->bitmap_ops && mddev->bitmap_ops->blocks_synced &&
+	    !mddev->bitmap_ops->blocks_synced(mddev, sh->sector))
+		return 1;
+
 	if (dev->toread ||
 	    (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)))
 		/* We need this block to directly satisfy a request */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 10/23] md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (8 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 09/23] md/md-bitmap: add a new method blocks_synced() " Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-27  6:17   ` Hannes Reinecke
  2025-05-24  6:13 ` [PATCH 11/23] md/md-bitmap: make method bitmap_ops->daemon_work optional Yu Kuai
                   ` (15 subsequent siblings)
  25 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

This flag is used by llbitmap in later patches to skip raid456 initial
recover and delay building initial xor data to first write.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/md.c | 12 +++++++++++-
 drivers/md/md.h |  2 ++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 890c8da43b3b..737fdb6474bd 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9132,6 +9132,14 @@ static sector_t md_sync_position(struct mddev *mddev, enum sync_action action)
 				start = rdev->recovery_offset;
 		rcu_read_unlock();
 
+		/*
+		 * If there are no spares, and raid456 lazy initial recover is
+		 * requested.
+		 */
+		if (test_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery) &&
+		    start == MaxSector)
+			start = 0;
+
 		/* If there is a bitmap, we need to make sure all
 		 * writes that started before we added a spare
 		 * complete before we start doing a recovery.
@@ -9689,6 +9697,7 @@ static bool md_choose_sync_action(struct mddev *mddev, int *spares)
 	if (mddev->recovery_cp < MaxSector) {
 		set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 		clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
+		clear_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery);
 		return true;
 	}
 
@@ -9698,7 +9707,7 @@ static bool md_choose_sync_action(struct mddev *mddev, int *spares)
 	 * re-add.
 	 */
 	*spares = remove_and_add_spares(mddev, NULL);
-	if (*spares) {
+	if (*spares || test_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery)) {
 		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
 		clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
@@ -10021,6 +10030,7 @@ void md_reap_sync_thread(struct mddev *mddev)
 	clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
 	clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
 	clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
+	clear_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery);
 	/*
 	 * We call mddev->cluster_ops->update_size here because sync_size could
 	 * be changed by md_update_sb, and MD_RECOVERY_RESHAPE is cleared,
diff --git a/drivers/md/md.h b/drivers/md/md.h
index bf34c0a36551..3adb1660c7ed 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -667,6 +667,8 @@ enum recovery_flags {
 	MD_RECOVERY_RESHAPE,
 	/* remote node is running resync thread */
 	MD_RESYNCING_REMOTE,
+	/* raid456 lazy initial recover */
+	MD_RECOVERY_LAZY_RECOVER,
 };
 
 enum md_ro_state {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 11/23] md/md-bitmap: make method bitmap_ops->daemon_work optional
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (9 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 10/23] md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-26  6:34   ` Christoph Hellwig
  2025-05-27  6:19   ` Hannes Reinecke
  2025-05-24  6:13 ` [PATCH 12/23] md/md-bitmap: add macros for lockless bitmap Yu Kuai
                   ` (14 subsequent siblings)
  25 siblings, 2 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

daemon_work() will be called by daemon thread, on the one hand, daemon
thread doesn't have strict wake-up time; on the other hand, too much
work are put to daemon thread, like handle sync IO, handle failed
or specail normal IO, handle recovery, and so on. Hence daemon thread
may be too busy to clear dirty bits in time.

Make bitmap_ops->daemon_work() optional and following patches will use
separate async work to clear dirty bits for the new bitmap.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 737fdb6474bd..c7f7914b7452 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9853,7 +9853,7 @@ static void unregister_sync_thread(struct mddev *mddev)
  */
 void md_check_recovery(struct mddev *mddev)
 {
-	if (md_bitmap_enabled(mddev))
+	if (md_bitmap_enabled(mddev) && mddev->bitmap_ops->daemon_work)
 		mddev->bitmap_ops->daemon_work(mddev);
 
 	if (signal_pending(current)) {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 12/23] md/md-bitmap: add macros for lockless bitmap
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (10 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 11/23] md/md-bitmap: make method bitmap_ops->daemon_work optional Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-26  6:40   ` Christoph Hellwig
                     ` (2 more replies)
  2025-05-24  6:13 ` [PATCH 13/23] md/md-bitmap: fix dm-raid max_write_behind setting Yu Kuai
                   ` (13 subsequent siblings)
  25 siblings, 3 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Also move other values to md-bitmap.h and update comments.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md-bitmap.c |  9 ---------
 drivers/md/md-bitmap.h | 17 +++++++++++++++++
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 17d41a7b30ce..689d5dba9328 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -36,15 +36,6 @@
 #include "md-bitmap.h"
 #include "md-cluster.h"
 
-#define BITMAP_MAJOR_LO 3
-/* version 4 insists the bitmap is in little-endian order
- * with version 3, it is host-endian which is non-portable
- * Version 5 is currently set only for clustered devices
- */
-#define BITMAP_MAJOR_HI 4
-#define BITMAP_MAJOR_CLUSTERED 5
-#define	BITMAP_MAJOR_HOSTENDIAN 3
-
 /*
  * in-memory bitmap:
  *
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index f2d79c8a23b7..d2cdf831ef1a 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -18,10 +18,27 @@ typedef __u16 bitmap_counter_t;
 #define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2)))
 #define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)
 
+/*
+ * version 3 is host-endian order, this is deprecated and not used for new
+ * array
+ */
+#define BITMAP_MAJOR_LO		3
+#define BITMAP_MAJOR_HOSTENDIAN	3
+/* version 4 is little-endian order, the default value */
+#define BITMAP_MAJOR_HI		4
+/* version 5 is only used for cluster */
+#define BITMAP_MAJOR_CLUSTERED	5
+/* version 6 is only used for lockless bitmap */
+#define BITMAP_MAJOR_LOCKLESS	6
+
+#define BITMAP_SB_SIZE 1024
 /* use these for bitmap->flags and bitmap->sb->state bit-fields */
 enum bitmap_state {
 	BITMAP_STALE	   = 1,  /* the bitmap file is out of date or had -EIO */
 	BITMAP_WRITE_ERROR = 2, /* A write error has occurred */
+	BITMAP_FIRST_USE   = 3, /* llbitmap is just created */
+	BITMAP_CLEAN       = 4, /* llbitmap is created with assume_clean */
+	BITMAP_DAEMON_BUSY = 5, /* llbitmap daemon is not finished after daemon_sleep */
 	BITMAP_HOSTENDIAN  =15,
 };
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 13/23] md/md-bitmap: fix dm-raid max_write_behind setting
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (11 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 12/23] md/md-bitmap: add macros for lockless bitmap Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-26  6:40   ` Christoph Hellwig
  2025-05-27  6:21   ` Hannes Reinecke
  2025-05-24  6:13 ` [PATCH 14/23] md/dm-raid: remove max_write_behind setting limit Yu Kuai
                   ` (12 subsequent siblings)
  25 siblings, 2 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

It's supposed to be COUNTER_MAX / 2, not COUNTER_MAX.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md-bitmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 689d5dba9328..535bc1888e8c 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -777,7 +777,7 @@ static int md_bitmap_new_disk_sb(struct bitmap *bitmap)
 	 * is a good choice?  We choose COUNTER_MAX / 2 arbitrarily.
 	 */
 	write_behind = bitmap->mddev->bitmap_info.max_write_behind;
-	if (write_behind > COUNTER_MAX)
+	if (write_behind > COUNTER_MAX / 2)
 		write_behind = COUNTER_MAX / 2;
 	sb->write_behind = cpu_to_le32(write_behind);
 	bitmap->mddev->bitmap_info.max_write_behind = write_behind;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 14/23] md/dm-raid: remove max_write_behind setting limit
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (12 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 13/23] md/md-bitmap: fix dm-raid max_write_behind setting Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-26  6:41   ` Christoph Hellwig
                     ` (2 more replies)
  2025-05-24  6:13 ` [PATCH 15/23] md/md-llbitmap: implement llbitmap IO Yu Kuai
                   ` (11 subsequent siblings)
  25 siblings, 3 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

The comments said 'vaule in kB', while the value actually means the
number of write_behind IOs. And since md-bitmap will automatically
adjust the value to max COUNTER_MAX / 2, there is no need to fail
early.

Also move some macros that is only used md-bitmap.c.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/dm-raid.c   |  6 +-----
 drivers/md/md-bitmap.c | 10 ++++++++++
 drivers/md/md-bitmap.h |  9 ---------
 3 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 088cfe6e0f98..9757c32ea1f5 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -1356,11 +1356,7 @@ static int parse_raid_params(struct raid_set *rs, struct dm_arg_set *as,
 				return -EINVAL;
 			}
 
-			/*
-			 * In device-mapper, we specify things in sectors, but
-			 * MD records this value in kB
-			 */
-			if (value < 0 || value / 2 > COUNTER_MAX) {
+			if (value < 0) {
 				rs->ti->error = "Max write-behind limit out of range";
 				return -EINVAL;
 			}
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 535bc1888e8c..098e7b6cd187 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -98,9 +98,19 @@
  *
  */
 
+typedef __u16 bitmap_counter_t;
+
 #define PAGE_BITS (PAGE_SIZE << 3)
 #define PAGE_BIT_SHIFT (PAGE_SHIFT + 3)
 
+#define COUNTER_BITS 16
+#define COUNTER_BIT_SHIFT 4
+#define COUNTER_BYTE_SHIFT (COUNTER_BIT_SHIFT - 3)
+
+#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1)))
+#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2)))
+#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)
+
 #define NEEDED(x) (((bitmap_counter_t) x) & NEEDED_MASK)
 #define RESYNC(x) (((bitmap_counter_t) x) & RESYNC_MASK)
 #define COUNTER(x) (((bitmap_counter_t) x) & COUNTER_MAX)
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index d2cdf831ef1a..a9a0f6a8d96d 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -9,15 +9,6 @@
 
 #define BITMAP_MAGIC 0x6d746962
 
-typedef __u16 bitmap_counter_t;
-#define COUNTER_BITS 16
-#define COUNTER_BIT_SHIFT 4
-#define COUNTER_BYTE_SHIFT (COUNTER_BIT_SHIFT - 3)
-
-#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1)))
-#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2)))
-#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)
-
 /*
  * version 3 is host-endian order, this is deprecated and not used for new
  * array
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 15/23] md/md-llbitmap: implement llbitmap IO
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (13 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 14/23] md/dm-raid: remove max_write_behind setting limit Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-27  8:27   ` Christoph Hellwig
                     ` (2 more replies)
  2025-05-24  6:13 ` [PATCH 16/23] md/md-llbitmap: implement bit state machine Yu Kuai
                   ` (10 subsequent siblings)
  25 siblings, 3 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

READ

While creating bitmap, all pages will be allocated and read for llbitmap,
there won't be read afterwards

WRITE

WRITE IO is divided into logical_block_size of the page, the dirty state
of each block is tracked independently, for example:

each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 bit;

| page0 | page1 | ... | page 31 |
|       |
|        \-----------------------\
|                                |
| block0 | block1 | ... | block 8|
|        |
|         \-----------------\
|                            |
| bit0 | bit1 | ... | bit511 |

From IO path, if one bit is changed to Dirty or NeedSync, the corresponding
subpage will be marked dirty, such block must write first before the IO is
issued. This behaviour will affect IO performance, to reduce the impact, if
multiple bits are changed in the same block in a short time, all bits in
this block will be changed to Dirty/NeedSync, so that there won't be any
overhead until daemon clears dirty bits.

Also add data structure definition and comments.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md-llbitmap.c | 571 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 571 insertions(+)
 create mode 100644 drivers/md/md-llbitmap.c

diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
new file mode 100644
index 000000000000..1a01b6777527
--- /dev/null
+++ b/drivers/md/md-llbitmap.c
@@ -0,0 +1,571 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#ifdef CONFIG_MD_LLBITMAP
+
+#include <linux/blkdev.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/timer.h>
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/file.h>
+#include <linux/seq_file.h>
+#include <trace/events/block.h>
+
+#include "md.h"
+#include "md-bitmap.h"
+
+/*
+ * #### Background
+ *
+ * Redundant data is used to enhance data fault tolerance, and the storage
+ * method for redundant data vary depending on the RAID levels. And it's
+ * important to maintain the consistency of redundant data.
+ *
+ * Bitmap is used to record which data blocks have been synchronized and which
+ * ones need to be resynchronized or recovered. Each bit in the bitmap
+ * represents a segment of data in the array. When a bit is set, it indicates
+ * that the multiple redundant copies of that data segment may not be
+ * consistent. Data synchronization can be performed based on the bitmap after
+ * power failure or readding a disk. If there is no bitmap, a full disk
+ * synchronization is required.
+ *
+ * #### Key Features
+ *
+ *  - IO fastpath is lockless, if user issues lots of write IO to the same
+ *  bitmap bit in a short time, only the first write have additional overhead
+ *  to update bitmap bit, no additional overhead for the following writes;
+ *  - support only resync or recover written data, means in the case creating
+ *  new array or replacing with a new disk, there is no need to do a full disk
+ *  resync/recovery;
+ *
+ * #### Key Concept
+ *
+ * ##### State Machine
+ *
+ * Each bit is one byte, contain 6 difference state, see llbitmap_state. And
+ * there are total 8 differenct actions, see llbitmap_action, can change state:
+ *
+ * llbitmap state machine: transitions between states
+ *
+ * |           | Startwrite | Startsync | Endsync | Abortsync|
+ * | --------- | ---------- | --------- | ------- | -------  |
+ * | Unwritten | Dirty      | x         | x       | x        |
+ * | Clean     | Dirty      | x         | x       | x        |
+ * | Dirty     | x          | x         | x       | x        |
+ * | NeedSync  | x          | Syncing   | x       | x        |
+ * | Syncing   | x          | Syncing   | Dirty   | NeedSync |
+ *
+ * |           | Reload   | Daemon | Discard   | Stale     |
+ * | --------- | -------- | ------ | --------- | --------- |
+ * | Unwritten | x        | x      | x         | x         |
+ * | Clean     | x        | x      | Unwritten | NeedSync  |
+ * | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
+ * | NeedSync  | x        | x      | Unwritten | x         |
+ * | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
+ *
+ * Typical scenarios:
+ *
+ * 1) Create new array
+ * All bits will be set to Unwritten by default, if --assume-clean is set,
+ * all bits will be set to Clean instead.
+ *
+ * 2) write data, raid1/raid10 have full copy of data, while raid456 doesn't and
+ * rely on xor data
+ *
+ * 2.1) write new data to raid1/raid10:
+ * Unwritten --StartWrite--> Dirty
+ *
+ * 2.2) write new data to raid456:
+ * Unwritten --StartWrite--> NeedSync
+ *
+ * Because the initial recover for raid456 is skipped, the xor data is not build
+ * yet, the bit must set to NeedSync first and after lazy initial recover is
+ * finished, the bit will finially set to Dirty(see 5.1 and 5.4);
+ *
+ * 2.3) cover write
+ * Clean --StartWrite--> Dirty
+ *
+ * 3) daemon, if the array is not degraded:
+ * Dirty --Daemon--> Clean
+ *
+ * For degraded array, the Dirty bit will never be cleared, prevent full disk
+ * recovery while readding a removed disk.
+ *
+ * 4) discard
+ * {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
+ *
+ * 5) resync and recover
+ *
+ * 5.1) common process
+ * NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
+ *
+ * 5.2) resync after power failure
+ * Dirty --Reload--> NeedSync
+ *
+ * 5.3) recover while replacing with a new disk
+ * By default, the old bitmap framework will recover all data, and llbitmap
+ * implement this by a new helper, see llbitmap_skip_sync_blocks:
+ *
+ * skip recover for bits other than dirty or clean;
+ *
+ * 5.4) lazy initial recover for raid5:
+ * By default, the old bitmap framework will only allow new recover when there
+ * are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER is add
+ * to perform raid456 lazy recover for set bits(from 2.2).
+ *
+ * ##### Bitmap IO
+ *
+ * ##### Chunksize
+ *
+ * The default bitmap size is 128k, incluing 1k bitmap super block, and
+ * the default size of segment of data in the array each bit(chunksize) is 64k,
+ * and chunksize will adjust to twice the old size each time if the total number
+ * bits is not less than 127k.(see llbitmap_init)
+ *
+ * ##### READ
+ *
+ * While creating bitmap, all pages will be allocated and read for llbitmap,
+ * there won't be read afterwards
+ *
+ * ##### WRITE
+ *
+ * WRITE IO is divided into logical_block_size of the array, the dirty state
+ * of each block is tracked independently, for example:
+ *
+ * each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 bit;
+ *
+ * | page0 | page1 | ... | page 31 |
+ * |       |
+ * |        \-----------------------\
+ * |                                |
+ * | block0 | block1 | ... | block 8|
+ * |        |
+ * |         \-----------------\
+ * |                            |
+ * | bit0 | bit1 | ... | bit511 |
+ *
+ * From IO path, if one bit is changed to Dirty or NeedSync, the corresponding
+ * subpage will be marked dirty, such block must write first before the IO is
+ * issued. This behaviour will affect IO performance, to reduce the impact, if
+ * multiple bits are changed in the same block in a short time, all bits in this
+ * block will be changed to Dirty/NeedSync, so that there won't be any overhead
+ * until daemon clears dirty bits.
+ *
+ * ##### Dirty Bits syncronization
+ *
+ * IO fast path will set bits to dirty, and those dirty bits will be cleared
+ * by daemon after IO is done. llbitmap_page_ctl is used to synchronize between
+ * IO path and daemon;
+ *
+ * IO path:
+ *  1) try to grab a reference, if succeed, set expire time after 5s and return;
+ *  2) if failed to grab a reference, wait for daemon to finish clearing dirty
+ *  bits;
+ *
+ * Daemon(Daemon will be waken up every daemon_sleep seconds):
+ * For each page:
+ *  1) check if page expired, if not skip this page; for expired page:
+ *  2) suspend the page and wait for inflight write IO to be done;
+ *  3) change dirty page to clean;
+ *  4) resume the page;
+ */
+
+#define BITMAP_SB_SIZE 1024
+
+/* 64k is the max IO size of sync IO for raid1/raid10 */
+#define MIN_CHUNK_SIZE (64 * 2)
+
+/* By default, daemon will be waken up every 30s */
+#define DEFAULT_DAEMON_SLEEP 30
+
+/*
+ * Dirtied bits that have not been accessed for more than 5s will be cleared
+ * by daemon.
+ */
+#define BARRIER_IDLE 5
+
+enum llbitmap_state {
+	/* No valid data, init state after assemble the array */
+	BitUnwritten = 0,
+	/* data is consistent */
+	BitClean,
+	/* data will be consistent after IO is done, set directly for writes */
+	BitDirty,
+	/*
+	 * data need to be resynchronized:
+	 * 1) set directly for writes if array is degraded, prevent full disk
+	 * synchronization after readding a disk;
+	 * 2) reassemble the array after power failure, and dirty bits are
+	 * found after reloading the bitmap;
+	 * 3) set for first write for raid5, to build initial xor data lazily
+	 */
+	BitNeedSync,
+	/* data is synchronizing */
+	BitSyncing,
+	nr_llbitmap_state,
+	BitNone = 0xff,
+};
+
+enum llbitmap_action {
+	/* User write new data, this is the only action from IO fast path */
+	BitmapActionStartwrite = 0,
+	/* Start recovery */
+	BitmapActionStartsync,
+	/* Finish recovery */
+	BitmapActionEndsync,
+	/* Failed recovery */
+	BitmapActionAbortsync,
+	/* Reassemble the array */
+	BitmapActionReload,
+	/* Daemon thread is trying to clear dirty bits */
+	BitmapActionDaemon,
+	/* Data is deleted */
+	BitmapActionDiscard,
+	/*
+	 * Bitmap is stale, mark all bits in addition to BitUnwritten to
+	 * BitNeedSync.
+	 */
+	BitmapActionStale,
+	nr_llbitmap_action,
+	/* Init state is BitUnwritten */
+	BitmapActionInit,
+};
+
+enum llbitmap_page_state {
+	LLPageFlush = 0,
+	LLPageDirty,
+};
+
+struct llbitmap_page_ctl {
+	char *state;
+	struct page *page;
+	unsigned long expire;
+	unsigned long flags;
+	wait_queue_head_t wait;
+	struct percpu_ref active;
+	/* Per block size dirty state, maximum 64k page / 1 sector = 128 */
+	unsigned long dirty[];
+};
+
+struct llbitmap {
+	struct mddev *mddev;
+	struct llbitmap_page_ctl **pctl;
+
+	unsigned int nr_pages;
+	unsigned int io_size;
+	unsigned int bits_per_page;
+
+	/* shift of one chunk */
+	unsigned long chunkshift;
+	/* size of one chunk in sector */
+	unsigned long chunksize;
+	/* total number of chunks */
+	unsigned long chunks;
+	unsigned long last_end_sync;
+	/* fires on first BitDirty state */
+	struct timer_list pending_timer;
+	struct work_struct daemon_work;
+
+	unsigned long flags;
+	__u64	events_cleared;
+
+	/* for slow disks */
+	atomic_t behind_writes;
+	wait_queue_head_t behind_wait;
+};
+
+struct llbitmap_unplug_work {
+	struct work_struct work;
+	struct llbitmap *llbitmap;
+	struct completion *done;
+};
+
+static struct workqueue_struct *md_llbitmap_io_wq;
+static struct workqueue_struct *md_llbitmap_unplug_wq;
+
+static char state_machine[nr_llbitmap_state][nr_llbitmap_action] = {
+	[BitUnwritten] = {
+		[BitmapActionStartwrite]	= BitDirty,
+		[BitmapActionStartsync]		= BitNone,
+		[BitmapActionEndsync]		= BitNone,
+		[BitmapActionAbortsync]		= BitNone,
+		[BitmapActionReload]		= BitNone,
+		[BitmapActionDaemon]		= BitNone,
+		[BitmapActionDiscard]		= BitNone,
+		[BitmapActionStale]		= BitNone,
+	},
+	[BitClean] = {
+		[BitmapActionStartwrite]	= BitDirty,
+		[BitmapActionStartsync]		= BitNone,
+		[BitmapActionEndsync]		= BitNone,
+		[BitmapActionAbortsync]		= BitNone,
+		[BitmapActionReload]		= BitNone,
+		[BitmapActionDaemon]		= BitNone,
+		[BitmapActionDiscard]		= BitUnwritten,
+		[BitmapActionStale]		= BitNeedSync,
+	},
+	[BitDirty] = {
+		[BitmapActionStartwrite]	= BitNone,
+		[BitmapActionStartsync]		= BitNone,
+		[BitmapActionEndsync]		= BitNone,
+		[BitmapActionAbortsync]		= BitNone,
+		[BitmapActionReload]		= BitNeedSync,
+		[BitmapActionDaemon]		= BitClean,
+		[BitmapActionDiscard]		= BitUnwritten,
+		[BitmapActionStale]		= BitNeedSync,
+	},
+	[BitNeedSync] = {
+		[BitmapActionStartwrite]	= BitNone,
+		[BitmapActionStartsync]		= BitSyncing,
+		[BitmapActionEndsync]		= BitNone,
+		[BitmapActionAbortsync]		= BitNone,
+		[BitmapActionReload]		= BitNone,
+		[BitmapActionDaemon]		= BitNone,
+		[BitmapActionDiscard]		= BitUnwritten,
+		[BitmapActionStale]		= BitNone,
+	},
+	[BitSyncing] = {
+		[BitmapActionStartwrite]	= BitNone,
+		[BitmapActionStartsync]		= BitSyncing,
+		[BitmapActionEndsync]		= BitDirty,
+		[BitmapActionAbortsync]		= BitNeedSync,
+		[BitmapActionReload]		= BitNeedSync,
+		[BitmapActionDaemon]		= BitNone,
+		[BitmapActionDiscard]		= BitUnwritten,
+		[BitmapActionStale]		= BitNeedSync,
+	},
+};
+
+static enum llbitmap_state llbitmap_read(struct llbitmap *llbitmap, loff_t pos)
+{
+	unsigned int idx;
+	unsigned int offset;
+
+	pos += BITMAP_SB_SIZE;
+	idx = pos >> PAGE_SHIFT;
+	offset = offset_in_page(pos);
+
+	return llbitmap->pctl[idx]->state[offset];
+}
+
+/* set all the bits in the subpage as dirty */
+static void llbitmap_infect_dirty_bits(struct llbitmap *llbitmap,
+				       struct llbitmap_page_ctl *pctl,
+				       unsigned int bit, unsigned int offset)
+{
+	bool level_456 = raid_is_456(llbitmap->mddev);
+	unsigned int io_size = llbitmap->io_size;
+	int pos;
+
+	for (pos = bit * io_size; pos < (bit + 1) * io_size; pos++) {
+		if (pos == offset)
+			continue;
+
+		switch (pctl->state[pos]) {
+		case BitUnwritten:
+			pctl->state[pos] = level_456 ? BitNeedSync : BitDirty;
+			break;
+		case BitClean:
+			pctl->state[pos] = BitDirty;
+			break;
+		};
+	}
+
+}
+
+static void llbitmap_set_page_dirty(struct llbitmap *llbitmap, int idx,
+				    int offset)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
+	unsigned int io_size = llbitmap->io_size;
+	int bit = offset / io_size;
+	int pos;
+
+	if (!test_bit(LLPageDirty, &pctl->flags))
+		set_bit(LLPageDirty, &pctl->flags);
+
+	/*
+	 * The subpage usually contains a total of 512 bits. If any single bit
+	 * within the subpage is marked as dirty, the entire sector will be
+	 * written. To avoid impacting write performance, when multiple bits
+	 * within the same sector are modified within a short time frame, all
+	 * bits in the sector will be collectively marked as dirty at once.
+	 */
+	if (test_and_set_bit(bit, pctl->dirty)) {
+		llbitmap_infect_dirty_bits(llbitmap, pctl, bit, offset);
+		return;
+	}
+
+	for (pos = bit * io_size; pos < (bit + 1) * io_size; pos++) {
+		if (pos == offset)
+			continue;
+		if (pctl->state[pos] == BitDirty ||
+		    pctl->state[pos] == BitNeedSync) {
+			llbitmap_infect_dirty_bits(llbitmap, pctl, bit, offset);
+			return;
+		}
+	}
+}
+
+static void llbitmap_write(struct llbitmap *llbitmap, enum llbitmap_state state,
+			   loff_t pos)
+{
+	unsigned int idx;
+	unsigned int offset;
+
+	pos += BITMAP_SB_SIZE;
+	idx = pos >> PAGE_SHIFT;
+	offset = offset_in_page(pos);
+
+	llbitmap->pctl[idx]->state[offset] = state;
+	if (state == BitDirty || state == BitNeedSync)
+		llbitmap_set_page_dirty(llbitmap, idx, offset);
+}
+
+static struct page *llbitmap_read_page(struct llbitmap *llbitmap, int idx)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	struct page *page = NULL;
+	struct md_rdev *rdev;
+
+	if (llbitmap->pctl && llbitmap->pctl[idx])
+		page = llbitmap->pctl[idx]->page;
+	if (page)
+		return page;
+
+	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!page)
+		return ERR_PTR(-ENOMEM);
+
+	rdev_for_each(rdev, mddev) {
+		sector_t sector;
+
+		if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
+			continue;
+
+		sector = mddev->bitmap_info.offset +
+			 (idx << PAGE_SECTORS_SHIFT);
+
+		if (sync_page_io(rdev, sector, PAGE_SIZE, page, REQ_OP_READ,
+				 true))
+			return page;
+
+		md_error(mddev, rdev);
+	}
+
+	__free_page(page);
+	return ERR_PTR(-EIO);
+}
+
+static void llbitmap_write_page(struct llbitmap *llbitmap, int idx)
+{
+	struct page *page = llbitmap->pctl[idx]->page;
+	struct mddev *mddev = llbitmap->mddev;
+	struct md_rdev *rdev;
+	int bit;
+
+	for (bit = 0; bit < llbitmap->bits_per_page; bit++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
+
+		if (!test_and_clear_bit(bit, pctl->dirty))
+			continue;
+
+		rdev_for_each(rdev, mddev) {
+			sector_t sector;
+			sector_t bit_sector = llbitmap->io_size >> SECTOR_SHIFT;
+
+			if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
+				continue;
+
+			sector = mddev->bitmap_info.offset + rdev->sb_start +
+				 (idx << PAGE_SECTORS_SHIFT) +
+				 bit * bit_sector;
+			md_write_metadata(mddev, rdev, sector,
+					  llbitmap->io_size, page,
+					  bit * llbitmap->io_size);
+		}
+	}
+}
+
+static void active_release(struct percpu_ref *ref)
+{
+	struct llbitmap_page_ctl *pctl =
+		container_of(ref, struct llbitmap_page_ctl, active);
+
+	wake_up(&pctl->wait);
+}
+
+static void llbitmap_free_pages(struct llbitmap *llbitmap)
+{
+	int i;
+
+	if (!llbitmap->pctl)
+		return;
+
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
+
+		if (!pctl || !pctl->page)
+			break;
+
+		__free_page(pctl->page);
+		percpu_ref_exit(&pctl->active);
+	}
+
+	kfree(llbitmap->pctl[0]);
+	kfree(llbitmap->pctl);
+	llbitmap->pctl = NULL;
+}
+
+static int llbitmap_cache_pages(struct llbitmap *llbitmap)
+{
+	struct llbitmap_page_ctl *pctl;
+	unsigned int nr_pages = DIV_ROUND_UP(llbitmap->chunks + BITMAP_SB_SIZE,
+					     PAGE_SIZE);
+	unsigned int size = struct_size(pctl, dirty,
+					BITS_TO_LONGS(llbitmap->bits_per_page));
+	int i;
+
+	llbitmap->pctl = kmalloc_array(nr_pages, sizeof(void *),
+				       GFP_KERNEL | __GFP_ZERO);
+	if (!llbitmap->pctl)
+		return -ENOMEM;
+
+	size = round_up(size, cache_line_size());
+	pctl = kmalloc_array(nr_pages, size, GFP_KERNEL | __GFP_ZERO);
+	if (!pctl) {
+		kfree(llbitmap->pctl);
+		return -ENOMEM;
+	}
+
+	llbitmap->nr_pages = nr_pages;
+
+	for (i = 0; i < nr_pages; i++, pctl = (void *)pctl + size) {
+		struct page *page = llbitmap_read_page(llbitmap, i);
+
+		llbitmap->pctl[i] = pctl;
+
+		if (IS_ERR(page)) {
+			llbitmap_free_pages(llbitmap);
+			return PTR_ERR(page);
+		}
+
+		if (percpu_ref_init(&pctl->active, active_release,
+				    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
+			__free_page(page);
+			llbitmap_free_pages(llbitmap);
+			return -ENOMEM;
+		}
+
+		pctl->page = page;
+		pctl->state = page_address(page);
+		init_waitqueue_head(&pctl->wait);
+	}
+
+	return 0;
+}
+
+#endif /* CONFIG_MD_LLBITMAP */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 16/23] md/md-llbitmap: implement bit state machine
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (14 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 15/23] md/md-llbitmap: implement llbitmap IO Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-06-30  2:14   ` Xiao Ni
  2025-05-24  6:13 ` [PATCH 17/23] md/md-llbitmap: implement APIs for page level dirty bits synchronization Yu Kuai
                   ` (9 subsequent siblings)
  25 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Each bit is one byte, contain 6 difference state, see llbitmap_state. And
there are total 8 differenct actions, see llbitmap_action, can change
state:

llbitmap state machine: transitions between states

|           | Startwrite | Startsync | Endsync | Abortsync| Reload   | Daemon | Discard   | Stale     |
| --------- | ---------- | --------- | ------- | -------  | -------- | ------ | --------- | --------- |
| Unwritten | Dirty      | x         | x       | x        | x        | x      | x         | x         |
| Clean     | Dirty      | x         | x       | x        | x        | x      | Unwritten | NeedSync  |
| Dirty     | x          | x         | x       | x        | NeedSync | Clean  | Unwritten | NeedSync  |
| NeedSync  | x          | Syncing   | x       | x        | x        | x      | Unwritten | x         |
| Syncing   | x          | Syncing   | Dirty   | NeedSync | NeedSync | x      | Unwritten | NeedSync  |

Typical scenarios:

1) Create new array
All bits will be set to Unwritten by default, if --assume-clean is set,
All bits will be set to Clean instead.

2) write data, raid1/raid10 have full copy of data, while raid456 donen't and
rely on xor data

2.1) write new data to raid1/raid10:
Unwritten --StartWrite--> Dirty

2.2) write new data to raid456:
Unwritten --StartWrite--> NeedSync

Because the initial recover for raid456 is skipped, the xor data is not build
yet, the bit must set to NeedSync first and after lazy initial recover is
finished, the bit will finially set to Dirty(see 5.1 and 5.4);

2.3) cover write
Clean --StartWrite--> Dirty

3) daemon, if the array is not degraded:
Dirty --Daemon--> Clean

For degraded array, the Dirty bit will never be cleared, prevent full disk
recovery while readding a removed disk.

4) discard
{Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten

5) resync and recover

5.1) common process
NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean

5.2) resync after power failure
Dirty --Reload--> NeedSync

5.3) recover while replacing with a new disk
By default, the old bitmap framework will recover all data, and llbitmap
implement this by a new helper llbitmap_skip_sync_blocks:

skip recover for bits other than dirty or clean;

5.4) lazy initial recover for raid5:
By default, the old bitmap framework will only allow new recover when there
are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER is add
to perform raid456 lazy recover for set bits(from 2.2).

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md-llbitmap.c | 83 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index 1a01b6777527..f782f092ab5d 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -568,4 +568,87 @@ static int llbitmap_cache_pages(struct llbitmap *llbitmap)
 	return 0;
 }
 
+static void llbitmap_init_state(struct llbitmap *llbitmap)
+{
+	enum llbitmap_state state = BitUnwritten;
+	unsigned long i;
+
+	if (test_and_clear_bit(BITMAP_CLEAN, &llbitmap->flags))
+		state = BitClean;
+
+	for (i = 0; i < llbitmap->chunks; i++)
+		llbitmap_write(llbitmap, state, i);
+}
+
+/* The return value is only used from resync, where @start == @end. */
+static enum llbitmap_state llbitmap_state_machine(struct llbitmap *llbitmap,
+						  unsigned long start,
+						  unsigned long end,
+						  enum llbitmap_action action)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	enum llbitmap_state state = BitNone;
+	bool need_resync = false;
+	bool need_recovery = false;
+
+	if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags))
+		return BitNone;
+
+	if (action == BitmapActionInit) {
+		llbitmap_init_state(llbitmap);
+		return BitNone;
+	}
+
+	while (start <= end) {
+		enum llbitmap_state c = llbitmap_read(llbitmap, start);
+
+		if (c < 0 || c >= nr_llbitmap_state) {
+			pr_err("%s: invalid bit %lu state %d action %d, forcing resync\n",
+			       __func__, start, c, action);
+			state = BitNeedSync;
+			goto write_bitmap;
+		}
+
+		if (c == BitNeedSync)
+			need_resync = true;
+
+		state = state_machine[c][action];
+		if (state == BitNone) {
+			start++;
+			continue;
+		}
+
+write_bitmap:
+		/* Delay raid456 initial recovery to first write. */
+		if (c == BitUnwritten && state == BitDirty &&
+		    action == BitmapActionStartwrite && raid_is_456(mddev)) {
+			state = BitNeedSync;
+			need_recovery = true;
+		}
+
+		llbitmap_write(llbitmap, state, start);
+
+		if (state == BitNeedSync)
+			need_resync = true;
+		else if (state == BitDirty &&
+			 !timer_pending(&llbitmap->pending_timer))
+			mod_timer(&llbitmap->pending_timer,
+				  jiffies + mddev->bitmap_info.daemon_sleep * HZ);
+
+		start++;
+	}
+
+	if (need_recovery) {
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		set_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
+	} else if (need_resync) {
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
+	}
+
+	return state;
+}
+
 #endif /* CONFIG_MD_LLBITMAP */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 17/23] md/md-llbitmap: implement APIs for page level dirty bits synchronization
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (15 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 16/23] md/md-llbitmap: implement bit state machine Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-24  6:13 ` [PATCH 18/23] md/md-llbitmap: implement APIs to mange bitmap lifetime Yu Kuai
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

IO fast path will set bits to dirty, and those dirty bits will be cleared
by daemon after IO is done. llbitmap_barrier is used to synchronize between
IO path and daemon;

IO path:
 1) try to grab a reference, if succeed, set expire time after 5s and
 return;
 2) if failed to grab a reference, wait for daemon to finish clearing dirty
 bits;

Daemon(Daemon will be waken up every daemon_sleep seconds):
For each page:
 1) check if page expired, if not skip this page; for expired page:
 2) suspend the page and wait for inflight write IO to be done;
 3) change dirty page to clean;
 4) resume the page;

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md-llbitmap.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index f782f092ab5d..4d5f9a139a25 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -651,4 +651,42 @@ static enum llbitmap_state llbitmap_state_machine(struct llbitmap *llbitmap,
 	return state;
 }
 
+static void llbitmap_raise_barrier(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+retry:
+	if (likely(percpu_ref_tryget_live(&pctl->active))) {
+		WRITE_ONCE(pctl->expire, jiffies + BARRIER_IDLE * HZ);
+		return;
+	}
+
+	wait_event(pctl->wait, !percpu_ref_is_dying(&pctl->active));
+	goto retry;
+}
+
+static void llbitmap_release_barrier(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+	percpu_ref_put(&pctl->active);
+}
+
+static void llbitmap_suspend(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+	percpu_ref_kill(&pctl->active);
+	wait_event(pctl->wait, percpu_ref_is_zero(&pctl->active));
+}
+
+static void llbitmap_resume(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+	pctl->expire = LONG_MAX;
+	percpu_ref_resurrect(&pctl->active);
+	wake_up(&pctl->wait);
+}
+
 #endif /* CONFIG_MD_LLBITMAP */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 18/23] md/md-llbitmap: implement APIs to mange bitmap lifetime
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (16 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 17/23] md/md-llbitmap: implement APIs for page level dirty bits synchronization Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-29  7:03   ` Xiao Ni
  2025-05-24  6:13 ` [PATCH 19/23] md/md-llbitmap: implement APIs to dirty bits and clear bits Yu Kuai
                   ` (7 subsequent siblings)
  25 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Include following APIs:
 - llbitmap_create
 - llbitmap_resize
 - llbitmap_load
 - llbitmap_destroy

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md-llbitmap.c | 322 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 322 insertions(+)

diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index 4d5f9a139a25..23283c4f7263 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -689,4 +689,326 @@ static void llbitmap_resume(struct llbitmap *llbitmap, int page_idx)
 	wake_up(&pctl->wait);
 }
 
+static int llbitmap_check_support(struct mddev *mddev)
+{
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
+		pr_notice("md/llbitmap: %s: array with journal cannot have bitmap\n",
+			  mdname(mddev));
+		return -EBUSY;
+	}
+
+	if (mddev->bitmap_info.space == 0) {
+		if (mddev->bitmap_info.default_space == 0) {
+			pr_notice("md/llbitmap: %s: no space for bitmap\n",
+				  mdname(mddev));
+			return -ENOSPC;
+		}
+	}
+
+	if (!mddev->persistent) {
+		pr_notice("md/llbitmap: %s: array must be persistent\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	if (mddev->bitmap_info.file) {
+		pr_notice("md/llbitmap: %s: doesn't support bitmap file\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	if (mddev->bitmap_info.external) {
+		pr_notice("md/llbitmap: %s: doesn't support external metadata\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	if (mddev_is_dm(mddev)) {
+		pr_notice("md/llbitmap: %s: doesn't support dm-raid\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static int llbitmap_init(struct llbitmap *llbitmap)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	sector_t blocks = mddev->resync_max_sectors;
+	unsigned long chunksize = MIN_CHUNK_SIZE;
+	unsigned long chunks = DIV_ROUND_UP(blocks, chunksize);
+	unsigned long space = mddev->bitmap_info.space << SECTOR_SHIFT;
+	int ret;
+
+	while (chunks > space) {
+		chunksize = chunksize << 1;
+		chunks = DIV_ROUND_UP(blocks, chunksize);
+	}
+
+	llbitmap->chunkshift = ffz(~chunksize);
+	llbitmap->chunksize = chunksize;
+	llbitmap->chunks = chunks;
+	mddev->bitmap_info.daemon_sleep = DEFAULT_DAEMON_SLEEP;
+
+	ret = llbitmap_cache_pages(llbitmap);
+	if (ret)
+		return ret;
+
+	llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, BitmapActionInit);
+	return 0;
+}
+
+static int llbitmap_read_sb(struct llbitmap *llbitmap)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	unsigned long daemon_sleep;
+	unsigned long chunksize;
+	unsigned long events;
+	struct page *sb_page;
+	bitmap_super_t *sb;
+	int ret = -EINVAL;
+
+	if (!mddev->bitmap_info.offset) {
+		pr_err("md/llbitmap: %s: no super block found", mdname(mddev));
+		return -EINVAL;
+	}
+
+	sb_page = llbitmap_read_page(llbitmap, 0);
+	if (IS_ERR(sb_page)) {
+		pr_err("md/llbitmap: %s: read super block failed",
+		       mdname(mddev));
+		ret = -EIO;
+		goto out;
+	}
+
+	sb = kmap_local_page(sb_page);
+	if (sb->magic != cpu_to_le32(BITMAP_MAGIC)) {
+		pr_err("md/llbitmap: %s: invalid super block magic number",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (sb->version != cpu_to_le32(BITMAP_MAJOR_LOCKLESS)) {
+		pr_err("md/llbitmap: %s: invalid super block version",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (memcmp(sb->uuid, mddev->uuid, 16)) {
+		pr_err("md/llbitmap: %s: bitmap superblock UUID mismatch\n",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (mddev->bitmap_info.space == 0) {
+		int room = le32_to_cpu(sb->sectors_reserved);
+
+		if (room)
+			mddev->bitmap_info.space = room;
+		else
+			mddev->bitmap_info.space = mddev->bitmap_info.default_space;
+	}
+	llbitmap->flags = le32_to_cpu(sb->state);
+	if (test_and_clear_bit(BITMAP_FIRST_USE, &llbitmap->flags)) {
+		ret = llbitmap_init(llbitmap);
+		goto out_put_page;
+	}
+
+	chunksize = le32_to_cpu(sb->chunksize);
+	if (!is_power_of_2(chunksize)) {
+		pr_err("md/llbitmap: %s: chunksize not a power of 2",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (chunksize < DIV_ROUND_UP(mddev->resync_max_sectors,
+				     mddev->bitmap_info.space << SECTOR_SHIFT)) {
+		pr_err("md/llbitmap: %s: chunksize too small %lu < %llu / %lu",
+		       mdname(mddev), chunksize, mddev->resync_max_sectors,
+		       mddev->bitmap_info.space);
+		goto out_put_page;
+	}
+
+	daemon_sleep = le32_to_cpu(sb->daemon_sleep);
+	if (daemon_sleep < 1 || daemon_sleep > MAX_SCHEDULE_TIMEOUT / HZ) {
+		pr_err("md/llbitmap: %s: daemon sleep %lu period out of range",
+		       mdname(mddev), daemon_sleep);
+		goto out_put_page;
+	}
+
+	events = le64_to_cpu(sb->events);
+	if (events < mddev->events) {
+		pr_warn("md/llbitmap :%s: bitmap file is out of date (%lu < %llu) -- forcing full recovery",
+			mdname(mddev), events, mddev->events);
+		set_bit(BITMAP_STALE, &llbitmap->flags);
+	}
+
+	sb->sync_size = cpu_to_le64(mddev->resync_max_sectors);
+	mddev->bitmap_info.chunksize = chunksize;
+	mddev->bitmap_info.daemon_sleep = daemon_sleep;
+
+	llbitmap->chunksize = chunksize;
+	llbitmap->chunks = DIV_ROUND_UP(mddev->resync_max_sectors, chunksize);
+	llbitmap->chunkshift = ffz(~chunksize);
+	ret = llbitmap_cache_pages(llbitmap);
+
+out_put_page:
+	__free_page(sb_page);
+out:
+	kunmap_local(sb);
+	return ret;
+}
+
+static void llbitmap_pending_timer_fn(struct timer_list *t)
+{
+	struct llbitmap *llbitmap = from_timer(llbitmap, t, pending_timer);
+
+	if (work_busy(&llbitmap->daemon_work)) {
+		pr_warn("daemon_work not finished\n");
+		set_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags);
+		return;
+	}
+
+	queue_work(md_llbitmap_io_wq, &llbitmap->daemon_work);
+}
+
+static void md_llbitmap_daemon_fn(struct work_struct *work)
+{
+	struct llbitmap *llbitmap =
+		container_of(work, struct llbitmap, daemon_work);
+	unsigned long start;
+	unsigned long end;
+	bool restart;
+	int idx;
+
+	if (llbitmap->mddev->degraded)
+		return;
+
+retry:
+	start = 0;
+	end = min(llbitmap->chunks, PAGE_SIZE - BITMAP_SB_SIZE) - 1;
+	restart = false;
+
+	for (idx = 0; idx < llbitmap->nr_pages; idx++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
+
+		if (idx > 0) {
+			start = end + 1;
+			end = min(end + PAGE_SIZE, llbitmap->chunks - 1);
+		}
+
+		if (!test_bit(LLPageFlush, &pctl->flags) &&
+		    time_before(jiffies, pctl->expire)) {
+			restart = true;
+			continue;
+		}
+
+		llbitmap_suspend(llbitmap, idx);
+		llbitmap_state_machine(llbitmap, start, end, BitmapActionDaemon);
+		llbitmap_resume(llbitmap, idx);
+	}
+
+	/*
+	 * If the daemon took a long time to finish, retry to prevent missing
+	 * clearing dirty bits.
+	 */
+	if (test_and_clear_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags))
+		goto retry;
+
+	/* If some page is dirty but not expired, setup timer again */
+	if (restart)
+		mod_timer(&llbitmap->pending_timer,
+			  jiffies + llbitmap->mddev->bitmap_info.daemon_sleep * HZ);
+}
+
+static int llbitmap_create(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap;
+	int ret;
+
+	ret = llbitmap_check_support(mddev);
+	if (ret)
+		return ret;
+
+	llbitmap = kzalloc(sizeof(*llbitmap), GFP_KERNEL);
+	if (!llbitmap)
+		return -ENOMEM;
+
+	llbitmap->mddev = mddev;
+	llbitmap->io_size = bdev_logical_block_size(mddev->gendisk->part0);
+	llbitmap->bits_per_page = PAGE_SIZE / llbitmap->io_size;
+
+	timer_setup(&llbitmap->pending_timer, llbitmap_pending_timer_fn, 0);
+	INIT_WORK(&llbitmap->daemon_work, md_llbitmap_daemon_fn);
+	atomic_set(&llbitmap->behind_writes, 0);
+	init_waitqueue_head(&llbitmap->behind_wait);
+
+	mutex_lock(&mddev->bitmap_info.mutex);
+	mddev->bitmap = llbitmap;
+	ret = llbitmap_read_sb(llbitmap);
+	mutex_unlock(&mddev->bitmap_info.mutex);
+	if (ret)
+		goto err_out;
+
+	return 0;
+
+err_out:
+	kfree(llbitmap);
+	return ret;
+}
+
+static int llbitmap_resize(struct mddev *mddev, sector_t blocks, int chunksize)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long chunks;
+
+	if (chunksize == 0)
+		chunksize = llbitmap->chunksize;
+
+	/* If there is enough space, leave the chunksize unchanged. */
+	chunks = DIV_ROUND_UP(blocks, chunksize);
+	while (chunks > mddev->bitmap_info.space << SECTOR_SHIFT) {
+		chunksize = chunksize << 1;
+		chunks = DIV_ROUND_UP(blocks, chunksize);
+	}
+
+	llbitmap->chunkshift = ffz(~chunksize);
+	llbitmap->chunksize = chunksize;
+	llbitmap->chunks = chunks;
+
+	return 0;
+}
+
+static int llbitmap_load(struct mddev *mddev)
+{
+	enum llbitmap_action action = BitmapActionReload;
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (test_and_clear_bit(BITMAP_STALE, &llbitmap->flags))
+		action = BitmapActionStale;
+
+	llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, action);
+	return 0;
+}
+
+static void llbitmap_destroy(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (!llbitmap)
+		return;
+
+	mutex_lock(&mddev->bitmap_info.mutex);
+
+	timer_delete_sync(&llbitmap->pending_timer);
+	flush_workqueue(md_llbitmap_io_wq);
+	flush_workqueue(md_llbitmap_unplug_wq);
+
+	mddev->bitmap = NULL;
+	llbitmap_free_pages(llbitmap);
+	kfree(llbitmap);
+	mutex_unlock(&mddev->bitmap_info.mutex);
+}
+
 #endif /* CONFIG_MD_LLBITMAP */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 19/23] md/md-llbitmap: implement APIs to dirty bits and clear bits
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (17 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 18/23] md/md-llbitmap: implement APIs to mange bitmap lifetime Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-24  6:13 ` [PATCH 20/23] md/md-llbitmap: implement APIs for sync_thread Yu Kuai
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Include following APIs:
 - llbitmap_startwrite
 - llbitmap_endwrite
 - llbitmap_start_discard
 - llbitmap_end_discard
 - llbitmap_unplug
 - llbitmap_flush

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md-llbitmap.c | 162 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 162 insertions(+)

diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index 23283c4f7263..37e72885dbdb 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -1011,4 +1011,166 @@ static void llbitmap_destroy(struct mddev *mddev)
 	mutex_unlock(&mddev->bitmap_info.mutex);
 }
 
+static void llbitmap_start_write(struct mddev *mddev, sector_t offset,
+				 unsigned long sectors)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long start = offset >> llbitmap->chunkshift;
+	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
+	int page_start = (start + BITMAP_SB_SIZE) >> PAGE_SHIFT;
+	int page_end = (end + BITMAP_SB_SIZE) >> PAGE_SHIFT;
+
+	llbitmap_state_machine(llbitmap, start, end, BitmapActionStartwrite);
+
+
+	while (page_start <= page_end) {
+		llbitmap_raise_barrier(llbitmap, page_start);
+		page_start++;
+	}
+}
+
+static void llbitmap_end_write(struct mddev *mddev, sector_t offset,
+			       unsigned long sectors)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long start = offset >> llbitmap->chunkshift;
+	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
+	int page_start = (start + BITMAP_SB_SIZE) >> PAGE_SHIFT;
+	int page_end = (end + BITMAP_SB_SIZE) >> PAGE_SHIFT;
+
+	while (page_start <= page_end) {
+		llbitmap_release_barrier(llbitmap, page_start);
+		page_start++;
+	}
+}
+
+static void llbitmap_start_discard(struct mddev *mddev, sector_t offset,
+				   unsigned long sectors)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long start = DIV_ROUND_UP(offset, llbitmap->chunksize);
+	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
+	int page_start = (start + BITMAP_SB_SIZE) >> PAGE_SHIFT;
+	int page_end = (end + BITMAP_SB_SIZE) >> PAGE_SHIFT;
+
+	llbitmap_state_machine(llbitmap, start, end, BitmapActionDiscard);
+
+	while (page_start <= page_end) {
+		llbitmap_raise_barrier(llbitmap, page_start);
+		page_start++;
+	}
+}
+
+static void llbitmap_end_discard(struct mddev *mddev, sector_t offset,
+				 unsigned long sectors)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long start = DIV_ROUND_UP(offset, llbitmap->chunksize);
+	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
+	int page_start = (start + BITMAP_SB_SIZE) >> PAGE_SHIFT;
+	int page_end = (end + BITMAP_SB_SIZE) >> PAGE_SHIFT;
+
+	while (page_start <= page_end) {
+		llbitmap_release_barrier(llbitmap, page_start);
+		page_start++;
+	}
+}
+
+static void llbitmap_unplug_fn(struct work_struct *work)
+{
+	struct llbitmap_unplug_work *unplug_work =
+		container_of(work, struct llbitmap_unplug_work, work);
+	struct llbitmap *llbitmap = unplug_work->llbitmap;
+	struct blk_plug plug;
+	int i;
+
+	blk_start_plug(&plug);
+
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		if (!test_bit(LLPageDirty, &llbitmap->pctl[i]->flags) ||
+		    !test_and_clear_bit(LLPageDirty, &llbitmap->pctl[i]->flags))
+			continue;
+
+		llbitmap_write_page(llbitmap, i);
+	}
+
+	blk_finish_plug(&plug);
+	md_super_wait(llbitmap->mddev);
+	complete(unplug_work->done);
+}
+
+static bool llbitmap_dirty(struct llbitmap *llbitmap)
+{
+	int i;
+
+	for (i = 0; i < llbitmap->nr_pages; i++)
+		if (test_bit(LLPageDirty, &llbitmap->pctl[i]->flags))
+			return true;
+
+	return false;
+}
+
+static void llbitmap_unplug(struct mddev *mddev, bool sync)
+{
+	DECLARE_COMPLETION_ONSTACK(done);
+	struct llbitmap *llbitmap = mddev->bitmap;
+	struct llbitmap_unplug_work unplug_work = {
+		.llbitmap = llbitmap,
+		.done = &done,
+	};
+
+	if (!llbitmap_dirty(llbitmap))
+		return;
+
+	/*
+	 * Issue new bitmap IO under submit_bio() context will deadlock:
+	 *  - the bio will wait for bitmap bio to be done, before it can be
+	 *  issued;
+	 *  - bitmap bio will be added to current->bio_list and wait for this
+	 *  bio to be issued;
+	 */
+	INIT_WORK_ONSTACK(&unplug_work.work, llbitmap_unplug_fn);
+	queue_work(md_llbitmap_unplug_wq, &unplug_work.work);
+	wait_for_completion(&done);
+	destroy_work_on_stack(&unplug_work.work);
+}
+
+/*
+ * Force to write all bitmap pages to disk, called when stopping the array, or
+ * every daemon_sleep seconds when sync_thread is running.
+ */
+static void __llbitmap_flush(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	struct blk_plug plug;
+	int i;
+
+	blk_start_plug(&plug);
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
+
+		/* mark all bits as dirty */
+		set_bit(LLPageDirty, &pctl->flags);
+		bitmap_fill(pctl->dirty, llbitmap->bits_per_page);
+		llbitmap_write_page(llbitmap, i);
+	}
+	blk_finish_plug(&plug);
+	md_super_wait(llbitmap->mddev);
+}
+
+static void llbitmap_flush(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	int i;
+
+	for (i = 0; i < llbitmap->nr_pages; i++)
+		set_bit(LLPageFlush, &llbitmap->pctl[i]->flags);
+
+	timer_delete_sync(&llbitmap->pending_timer);
+	queue_work(md_llbitmap_io_wq, &llbitmap->daemon_work);
+	flush_work(&llbitmap->daemon_work);
+
+	__llbitmap_flush(mddev);
+}
+
 #endif /* CONFIG_MD_LLBITMAP */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 20/23] md/md-llbitmap: implement APIs for sync_thread
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (18 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 19/23] md/md-llbitmap: implement APIs to dirty bits and clear bits Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-24  6:13 ` [PATCH 21/23] md/md-llbitmap: implement all bitmap operations Yu Kuai
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Include following APIs:
 - llbitmap_blocks_synced
 - llbitmap_skip_sync_blocks
 - llbitmap_start_sync
 - llbitmap_end_sync
 - llbitmap_close_sync
 - llbitmap_cond_end_sync

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md-llbitmap.c | 104 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 104 insertions(+)

diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index 37e72885dbdb..1b7625d3e2ed 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -1173,4 +1173,108 @@ static void llbitmap_flush(struct mddev *mddev)
 	__llbitmap_flush(mddev);
 }
 
+/* This is used for raid5 lazy initial recovery */
+static bool llbitmap_blocks_synced(struct mddev *mddev, sector_t offset)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long p = offset >> llbitmap->chunkshift;
+	enum llbitmap_state c = llbitmap_read(llbitmap, p);
+
+	return c == BitClean || c == BitDirty;
+}
+
+static sector_t llbitmap_skip_sync_blocks(struct mddev *mddev, sector_t offset)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long p = offset >> llbitmap->chunkshift;
+	int blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1));
+	enum llbitmap_state c = llbitmap_read(llbitmap, p);
+
+	/* always skip unwritten blocks */
+	if (c == BitUnwritten)
+		return blocks;
+
+	/* For resync also skip clean/dirty blocks */
+	if ((c == BitClean || c == BitDirty) &&
+	    test_bit(MD_RECOVERY_SYNC, &mddev->recovery) &&
+	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
+		return blocks;
+
+	return 0;
+}
+
+static bool llbitmap_start_sync(struct mddev *mddev, sector_t offset,
+				sector_t *blocks, bool degraded)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long p = offset >> llbitmap->chunkshift;
+
+	/*
+	 * Handle one bit at a time, this is much simpler. And it doesn't matter
+	 * if md_do_sync() loop more times.
+	 */
+	*blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1));
+	return llbitmap_state_machine(llbitmap, p, p,
+				      BitmapActionStartsync) == BitSyncing;
+}
+
+/* Something is wrong, sync_thread stop at @offset */
+static void llbitmap_end_sync(struct mddev *mddev, sector_t offset,
+			      sector_t *blocks)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long p = offset >> llbitmap->chunkshift;
+
+	*blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1));
+	llbitmap_state_machine(llbitmap, p, llbitmap->chunks - 1,
+			       BitmapActionAbortsync);
+}
+
+/* A full sync_thread is finished */
+static void llbitmap_close_sync(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	int i;
+
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
+
+		/* let daemon_fn clear dirty bits immediately */
+		WRITE_ONCE(pctl->expire, jiffies);
+	}
+
+	llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1,
+			       BitmapActionEndsync);
+}
+
+/*
+ * sync_thread have reached @sector, update metadata every daemon_sleep seconds,
+ * just in case sync_thread have to restart after power failure.
+ */
+static void llbitmap_cond_end_sync(struct mddev *mddev, sector_t sector,
+				   bool force)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (sector == 0) {
+		llbitmap->last_end_sync = jiffies;
+		return;
+	}
+
+	if (time_before(jiffies, llbitmap->last_end_sync +
+				 HZ * mddev->bitmap_info.daemon_sleep))
+		return;
+
+	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
+
+	mddev->curr_resync_completed = sector;
+	set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+	llbitmap_state_machine(llbitmap, 0, sector >> llbitmap->chunkshift,
+			       BitmapActionEndsync);
+	__llbitmap_flush(mddev);
+
+	llbitmap->last_end_sync = jiffies;
+	sysfs_notify_dirent_safe(mddev->sysfs_completed);
+}
+
 #endif /* CONFIG_MD_LLBITMAP */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 21/23] md/md-llbitmap: implement all bitmap operations
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (19 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 20/23] md/md-llbitmap: implement APIs for sync_thread Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-24  6:13 ` [PATCH 22/23] md/md-llbitmap: implement sysfs APIs Yu Kuai
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Include following left APIs
 - llbitmap_enabled
 - llbitmap_dirty_bits
 - llbitmap_update_sb
 - llbitmap_write_all
 - llbitmap_start_behind_write
 - llbitmap_end_behind_write
 - llbitmap_wait_behind_writes

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md-llbitmap.c | 114 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)

diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index 1b7625d3e2ed..ae664aa110a8 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -1277,4 +1277,118 @@ static void llbitmap_cond_end_sync(struct mddev *mddev, sector_t sector,
 	sysfs_notify_dirent_safe(mddev->sysfs_completed);
 }
 
+static bool llbitmap_enabled(void *data)
+{
+	struct llbitmap *llbitmap = data;
+
+	return llbitmap && !test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
+}
+
+static void llbitmap_dirty_bits(struct mddev *mddev, unsigned long s,
+				unsigned long e)
+{
+	llbitmap_state_machine(mddev->bitmap, s, e, BitmapActionStartwrite);
+}
+
+static void llbitmap_write_sb(struct llbitmap *llbitmap)
+{
+	int nr_bits = DIV_ROUND_UP(BITMAP_SB_SIZE, llbitmap->io_size);
+
+	bitmap_fill(llbitmap->pctl[0]->dirty, nr_bits);
+	llbitmap_write_page(llbitmap, 0);
+	md_super_wait(llbitmap->mddev);
+}
+
+static void llbitmap_update_sb(void *data)
+{
+	struct llbitmap *llbitmap = data;
+	struct mddev *mddev = llbitmap->mddev;
+	struct page *sb_page;
+	bitmap_super_t *sb;
+
+	if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags))
+		return;
+
+	sb_page = llbitmap_read_page(llbitmap, 0);
+	if (IS_ERR(sb_page)) {
+		pr_err("%s: %s: read super block failed", __func__,
+		       mdname(mddev));
+		set_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
+		return;
+	}
+
+	if (mddev->events < llbitmap->events_cleared)
+		llbitmap->events_cleared = mddev->events;
+
+	sb = kmap_local_page(sb_page);
+	sb->events = cpu_to_le64(mddev->events);
+	sb->state = cpu_to_le32(llbitmap->flags);
+	sb->chunksize = cpu_to_le32(llbitmap->chunksize);
+	sb->sync_size = cpu_to_le64(mddev->resync_max_sectors);
+	sb->events_cleared = cpu_to_le64(llbitmap->events_cleared);
+	sb->sectors_reserved = cpu_to_le32(mddev->bitmap_info.space);
+	sb->daemon_sleep = cpu_to_le32(mddev->bitmap_info.daemon_sleep);
+
+	kunmap_local(sb);
+	llbitmap_write_sb(llbitmap);
+}
+
+static int llbitmap_get_stats(void *data, struct md_bitmap_stats *stats)
+{
+	struct llbitmap *llbitmap = data;
+
+	memset(stats, 0, sizeof(*stats));
+
+	stats->missing_pages = 0;
+	stats->pages = llbitmap->nr_pages;
+	stats->file_pages = llbitmap->nr_pages;
+
+	stats->behind_writes = atomic_read(&llbitmap->behind_writes);
+	stats->behind_wait = wq_has_sleeper(&llbitmap->behind_wait);
+	stats->events_cleared = llbitmap->events_cleared;
+
+	return 0;
+}
+
+/* just flag all pages as needing to be written */
+static void llbitmap_write_all(struct mddev *mddev)
+{
+	int i;
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
+
+		set_bit(LLPageDirty, &pctl->flags);
+		bitmap_fill(pctl->dirty, llbitmap->bits_per_page);
+	}
+}
+
+static void llbitmap_start_behind_write(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	atomic_inc(&llbitmap->behind_writes);
+}
+
+static void llbitmap_end_behind_write(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (atomic_dec_and_test(&llbitmap->behind_writes))
+		wake_up(&llbitmap->behind_wait);
+}
+
+static void llbitmap_wait_behind_writes(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (!llbitmap)
+		return;
+
+	wait_event(llbitmap->behind_wait,
+		   atomic_read(&llbitmap->behind_writes) == 0);
+
+}
+
 #endif /* CONFIG_MD_LLBITMAP */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 22/23] md/md-llbitmap: implement sysfs APIs
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (20 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 21/23] md/md-llbitmap: implement all bitmap operations Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-24  6:13 ` [PATCH 23/23] md/md-llbitmap: add Kconfig Yu Kuai
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

There are 3 APIs for now:
 - bits: readonly, show status of bitmap bits, the number of each value;
 - metadata: readonly show bitmap metadata, include chunksize, chunkshift,
 chunks, offset and daemon_sleep;
 - daemon_sleep: read-write, default value is 30;

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 Documentation/admin-guide/md.rst | 13 +++++
 drivers/md/md-llbitmap.c         | 96 ++++++++++++++++++++++++++++++++
 2 files changed, 109 insertions(+)

diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index 356d2a344f08..2030772075b5 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -444,6 +444,19 @@ If bitmap_type is bitmap, then the md device will also contain:
      once the array becomes non-degraded, and this fact has been
      recorded in the metadata.
 
+If bitmap_type is llbitmap, then the md device will also contain:
+
+  llbitmap/bits
+     This is readonly, show status of bitmap bits, the number of each
+     value.
+
+  llbitmap/metadata
+     This is readonly, show bitmap metadata, include chunksize, chunkshift,
+     chunks, offset and daemon_sleep.
+
+  llbitmap/daemon_sleep
+     This is readwrite, time in seconds that daemon function will be
+     triggered to clear dirty bits.
 As component devices are added to an md array, they appear in the ``md``
 directory as new directories named::
 
diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index ae664aa110a8..38e67d4582ad 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -1391,4 +1391,100 @@ static void llbitmap_wait_behind_writes(struct mddev *mddev)
 
 }
 
+static ssize_t bits_show(struct mddev *mddev, char *page)
+{
+	struct llbitmap *llbitmap;
+	int bits[nr_llbitmap_state] = {0};
+	loff_t start = 0;
+
+	mutex_lock(&mddev->bitmap_info.mutex);
+	llbitmap = mddev->bitmap;
+	if (!llbitmap || !llbitmap->pctl) {
+		mutex_unlock(&mddev->bitmap_info.mutex);
+		return sprintf(page, "no bitmap\n");
+	}
+
+	if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags)) {
+		mutex_unlock(&mddev->bitmap_info.mutex);
+		return sprintf(page, "bitmap io error\n");
+	}
+
+	while (start < llbitmap->chunks) {
+		enum llbitmap_state c = llbitmap_read(llbitmap, start);
+
+		if (c < 0 || c >= nr_llbitmap_state)
+			pr_err("%s: invalid bit %llu state %d\n",
+			       __func__, start, c);
+		else
+			bits[c]++;
+		start++;
+	}
+
+	mutex_unlock(&mddev->bitmap_info.mutex);
+	return sprintf(page, "unwritten %d\nclean %d\ndirty %d\nneed sync %d\nsyncing %d\n",
+		       bits[BitUnwritten], bits[BitClean], bits[BitDirty],
+		       bits[BitNeedSync], bits[BitSyncing]);
+}
+
+static struct md_sysfs_entry llbitmap_bits =
+__ATTR_RO(bits);
+
+static ssize_t metadata_show(struct mddev *mddev, char *page)
+{
+	struct llbitmap *llbitmap;
+	ssize_t ret;
+
+	mutex_lock(&mddev->bitmap_info.mutex);
+	llbitmap = mddev->bitmap;
+	if (!llbitmap) {
+		mutex_unlock(&mddev->bitmap_info.mutex);
+		return sprintf(page, "no bitmap\n");
+	}
+
+	ret =  sprintf(page, "chunksize %lu\nchunkshift %lu\nchunks %lu\noffset %llu\ndaemon_sleep %lu\n",
+		       llbitmap->chunksize, llbitmap->chunkshift,
+		       llbitmap->chunks, mddev->bitmap_info.offset,
+		       llbitmap->mddev->bitmap_info.daemon_sleep);
+	mutex_unlock(&mddev->bitmap_info.mutex);
+
+	return ret;
+}
+
+static struct md_sysfs_entry llbitmap_metadata =
+__ATTR_RO(metadata);
+
+static ssize_t
+daemon_sleep_show(struct mddev *mddev, char *page)
+{
+	return sprintf(page, "%lu\n", mddev->bitmap_info.daemon_sleep);
+}
+
+static ssize_t
+daemon_sleep_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	unsigned long timeout;
+	int rv = kstrtoul(buf, 10, &timeout);
+
+	if (rv)
+		return rv;
+
+	mddev->bitmap_info.daemon_sleep = timeout;
+	return len;
+}
+
+static struct md_sysfs_entry llbitmap_daemon_sleep =
+__ATTR_RW(daemon_sleep);
+
+static struct attribute *md_llbitmap_attrs[] = {
+	&llbitmap_bits.attr,
+	&llbitmap_metadata.attr,
+	&llbitmap_daemon_sleep.attr,
+	NULL
+};
+
+static struct attribute_group md_llbitmap_group = {
+	.name = "llbitmap",
+	.attrs = md_llbitmap_attrs,
+};
+
 #endif /* CONFIG_MD_LLBITMAP */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 23/23] md/md-llbitmap: add Kconfig
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (21 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 22/23] md/md-llbitmap: implement sysfs APIs Yu Kuai
@ 2025-05-24  6:13 ` Yu Kuai
  2025-05-27  8:29   ` Christoph Hellwig
  2025-05-24  7:07 ` [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (2 subsequent siblings)
  25 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  6:13 UTC (permalink / raw)
  To: hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

A new config MD_LLBITMAP is added, user can now using llbitmap to
replace the old bitmap.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/Kconfig       | 11 +++++++
 drivers/md/Makefile      |  2 +-
 drivers/md/md-bitmap.h   | 13 ++++++++
 drivers/md/md-llbitmap.c | 66 ++++++++++++++++++++++++++++++++++++++++
 drivers/md/md.c          |  6 ++++
 drivers/md/md.h          |  4 +--
 6 files changed, 99 insertions(+), 3 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index f913579e731c..07c19b2182ca 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -52,6 +52,17 @@ config MD_BITMAP
 
 	  If unsure, say Y.
 
+config MD_LLBITMAP
+	bool "MD RAID lockless bitmap support"
+	depends on BLK_DEV_MD
+	help
+	  If you say Y here, support for the lockless write intent bitmap will
+	  be enabled.
+
+	  Note, this is an experimental feature.
+
+	  If unsure, say N.
+
 config MD_AUTODETECT
 	bool "Autodetect RAID arrays during kernel boot"
 	depends on BLK_DEV_MD=y
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 87bdfc9fe14c..f1ca25cc1408 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -27,7 +27,7 @@ dm-clone-y	+= dm-clone-target.o dm-clone-metadata.o
 dm-verity-y	+= dm-verity-target.o
 dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
 
-md-mod-y	+= md.o md-bitmap.o
+md-mod-y	+= md.o md-bitmap.o md-llbitmap.o
 raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
 linear-y       += md-linear.o
 
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index a9a0f6a8d96d..8b4f2068931e 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -183,4 +183,17 @@ static inline void md_bitmap_exit(void)
 }
 #endif
 
+#ifdef CONFIG_MD_LLBITMAP
+int md_llbitmap_init(void);
+void md_llbitmap_exit(void);
+#else
+static inline int md_llbitmap_init(void)
+{
+	return 0;
+}
+static inline void md_llbitmap_exit(void)
+{
+}
+#endif
+
 #endif
diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index 38e67d4582ad..8321dcbf1ce2 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -1487,4 +1487,70 @@ static struct attribute_group md_llbitmap_group = {
 	.attrs = md_llbitmap_attrs,
 };
 
+static struct bitmap_operations llbitmap_ops = {
+	.head = {
+		.type	= MD_BITMAP,
+		.id	= ID_LLBITMAP,
+		.name	= "llbitmap",
+	},
+
+	.enabled		= llbitmap_enabled,
+	.create			= llbitmap_create,
+	.resize			= llbitmap_resize,
+	.load			= llbitmap_load,
+	.destroy		= llbitmap_destroy,
+
+	.start_write		= llbitmap_start_write,
+	.end_write		= llbitmap_end_write,
+	.start_discard		= llbitmap_start_discard,
+	.end_discard		= llbitmap_end_discard,
+	.unplug			= llbitmap_unplug,
+	.flush			= llbitmap_flush,
+
+	.start_behind_write	= llbitmap_start_behind_write,
+	.end_behind_write	= llbitmap_end_behind_write,
+	.wait_behind_writes	= llbitmap_wait_behind_writes,
+
+	.blocks_synced		= llbitmap_blocks_synced,
+	.skip_sync_blocks	= llbitmap_skip_sync_blocks,
+	.start_sync		= llbitmap_start_sync,
+	.end_sync		= llbitmap_end_sync,
+	.close_sync		= llbitmap_close_sync,
+	.cond_end_sync		= llbitmap_cond_end_sync,
+
+	.update_sb		= llbitmap_update_sb,
+	.get_stats		= llbitmap_get_stats,
+	.dirty_bits		= llbitmap_dirty_bits,
+	.write_all		= llbitmap_write_all,
+
+	.group			= &md_llbitmap_group,
+};
+
+int md_llbitmap_init(void)
+{
+	md_llbitmap_io_wq = alloc_workqueue("md_llbitmap_io",
+					 WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
+	if (!md_llbitmap_io_wq)
+		return -ENOMEM;
+
+	md_llbitmap_unplug_wq = alloc_workqueue("md_llbitmap_unplug",
+					 WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
+	if (!md_llbitmap_unplug_wq) {
+		destroy_workqueue(md_llbitmap_io_wq);
+		md_llbitmap_io_wq = NULL;
+		return -ENOMEM;
+	}
+
+	return register_md_submodule(&llbitmap_ops.head);
+}
+
+void md_llbitmap_exit(void)
+{
+	destroy_workqueue(md_llbitmap_io_wq);
+	md_llbitmap_io_wq = NULL;
+	destroy_workqueue(md_llbitmap_unplug_wq);
+	md_llbitmap_unplug_wq = NULL;
+	unregister_md_submodule(&llbitmap_ops.head);
+}
+
 #endif /* CONFIG_MD_LLBITMAP */
diff --git a/drivers/md/md.c b/drivers/md/md.c
index c7f7914b7452..52e19344b73e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -10183,6 +10183,10 @@ static int __init md_init(void)
 	if (ret)
 		return ret;
 
+	ret = md_llbitmap_init();
+	if (ret)
+		goto err_bitmap;
+
 	ret = -ENOMEM;
 	md_wq = alloc_workqueue("md", WQ_MEM_RECLAIM, 0);
 	if (!md_wq)
@@ -10214,6 +10218,8 @@ static int __init md_init(void)
 err_misc_wq:
 	destroy_workqueue(md_wq);
 err_wq:
+	md_llbitmap_exit();
+err_bitmap:
 	md_bitmap_exit();
 	return ret;
 }
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 3adb1660c7ed..aba5f1ffcdfd 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -26,7 +26,7 @@
 enum md_submodule_type {
 	MD_PERSONALITY = 0,
 	MD_CLUSTER,
-	MD_BITMAP, /* TODO */
+	MD_BITMAP,
 };
 
 enum md_submodule_id {
@@ -39,7 +39,7 @@ enum md_submodule_id {
 	ID_RAID10	= 10,
 	ID_CLUSTER,
 	ID_BITMAP,
-	ID_LLBITMAP,	/* TODO */
+	ID_LLBITMAP,
 	ID_BITMAP_NONE,
 };
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (22 preceding siblings ...)
  2025-05-24  6:13 ` [PATCH 23/23] md/md-llbitmap: add Kconfig Yu Kuai
@ 2025-05-24  7:07 ` Yu Kuai
  2025-05-30  6:45 ` Yu Kuai
  2025-06-30  1:59 ` Xiao Ni
  25 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-24  7:07 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

在 2025/05/24 14:12, Yu Kuai 写道:
> following branch for review or test:
> https://git.kernel.org/pub/scm/linux/kernel/git/yukuai/linux.git/log/?h=yukuai/md-llbitmap

The correct branch is:
https://git.kernel.org/pub/scm/linux/kernel/git/yukuai/linux.git/log/?h=yukuai/llbitmap

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 01/23] md: add a new parameter 'offset' to md_super_write()
  2025-05-24  6:12 ` [PATCH 01/23] md: add a new parameter 'offset' to md_super_write() Yu Kuai
@ 2025-05-25 15:50   ` Xiao Ni
  2025-05-26  6:28   ` Christoph Hellwig
  2025-05-27  5:54   ` Hannes Reinecke
  2 siblings, 0 replies; 108+ messages in thread
From: Xiao Ni @ 2025-05-25 15:50 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, yukuai3, linux-doc, linux-kernel, linux-raid,
	yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> The parameter is always set to 0 for now, following patches will use
> this helper to write llbitmap to underlying disks, allow writing
> dirty sectors instead of the whole page.
>
> Also rename md_super_write to md_write_metadata since there is nothing
> super-block specific.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/md-bitmap.c |  3 ++-
>  drivers/md/md.c        | 28 ++++++++++++++--------------
>  drivers/md/md.h        |  5 +++--
>  3 files changed, 19 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 431a3ab2e449..168eea6595b3 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -470,7 +470,8 @@ static int __write_sb_page(struct md_rdev *rdev, struct bitmap *bitmap,
>                         return -EINVAL;
>         }
>
> -       md_super_write(mddev, rdev, sboff + ps, (int)min(size, bitmap_limit), page);
> +       md_write_metadata(mddev, rdev, sboff + ps, (int)min(size, bitmap_limit),
> +                         page, 0);
>         return 0;
>  }
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 32b997dfe6f4..18e03f651f6b 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1021,8 +1021,9 @@ static void super_written(struct bio *bio)
>                 wake_up(&mddev->sb_wait);
>  }
>
> -void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
> -                  sector_t sector, int size, struct page *page)
> +void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
> +                      sector_t sector, int size, struct page *page,
> +                      unsigned int offset)
>  {
>         /* write first size bytes of page to sector of rdev
>          * Increment mddev->pending_writes before returning
> @@ -1047,7 +1048,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
>         atomic_inc(&rdev->nr_pending);
>
>         bio->bi_iter.bi_sector = sector;
> -       __bio_add_page(bio, page, size, 0);
> +       __bio_add_page(bio, page, size, offset);
>         bio->bi_private = rdev;
>         bio->bi_end_io = super_written;
>
> @@ -1657,8 +1658,8 @@ super_90_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
>         if ((u64)num_sectors >= (2ULL << 32) && rdev->mddev->level >= 1)
>                 num_sectors = (sector_t)(2ULL << 32) - 2;
>         do {
> -               md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
> -                      rdev->sb_page);
> +               md_write_metadata(rdev->mddev, rdev, rdev->sb_start,
> +                                 rdev->sb_size, rdev->sb_page, 0);
>         } while (md_super_wait(rdev->mddev) < 0);
>         return num_sectors;
>  }
> @@ -2306,8 +2307,8 @@ super_1_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
>         sb->super_offset = cpu_to_le64(rdev->sb_start);
>         sb->sb_csum = calc_sb_1_csum(sb);
>         do {
> -               md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
> -                              rdev->sb_page);
> +               md_write_metadata(rdev->mddev, rdev, rdev->sb_start,
> +                                 rdev->sb_size, rdev->sb_page, 0);
>         } while (md_super_wait(rdev->mddev) < 0);
>         return num_sectors;
>
> @@ -2816,18 +2817,17 @@ void md_update_sb(struct mddev *mddev, int force_change)
>                         continue; /* no noise on spare devices */
>
>                 if (!test_bit(Faulty, &rdev->flags)) {
> -                       md_super_write(mddev,rdev,
> -                                      rdev->sb_start, rdev->sb_size,
> -                                      rdev->sb_page);
> +                       md_write_metadata(mddev, rdev, rdev->sb_start,
> +                                         rdev->sb_size, rdev->sb_page, 0);
>                         pr_debug("md: (write) %pg's sb offset: %llu\n",
>                                  rdev->bdev,
>                                  (unsigned long long)rdev->sb_start);
>                         rdev->sb_events = mddev->events;
>                         if (rdev->badblocks.size) {
> -                               md_super_write(mddev, rdev,
> -                                              rdev->badblocks.sector,
> -                                              rdev->badblocks.size << 9,
> -                                              rdev->bb_page);
> +                               md_write_metadata(mddev, rdev,
> +                                                 rdev->badblocks.sector,
> +                                                 rdev->badblocks.size << 9,
> +                                                 rdev->bb_page, 0);
>                                 rdev->badblocks.size = 0;
>                         }
>
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 6eb5dfdf2f55..5ba4a9093a92 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -886,8 +886,9 @@ void md_account_bio(struct mddev *mddev, struct bio **bio);
>  void md_free_cloned_bio(struct bio *bio);
>
>  extern bool __must_check md_flush_request(struct mddev *mddev, struct bio *bio);
> -extern void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
> -                          sector_t sector, int size, struct page *page);
> +extern void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
> +                             sector_t sector, int size, struct page *page,
> +                             unsigned int offset);
>  extern int md_super_wait(struct mddev *mddev);
>  extern int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
>                 struct page *page, blk_opf_t opf, bool metadata_op);
> --
> 2.39.2
>

Reviewed-by: Xiao Ni <xni@redhat.com>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 02/23] md: factor out a helper raid_is_456()
  2025-05-24  6:12 ` [PATCH 02/23] md: factor out a helper raid_is_456() Yu Kuai
@ 2025-05-25 15:50   ` Xiao Ni
  2025-05-26  6:28   ` Christoph Hellwig
  2025-05-27  5:55   ` Hannes Reinecke
  2 siblings, 0 replies; 108+ messages in thread
From: Xiao Ni @ 2025-05-25 15:50 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, yukuai3, linux-doc, linux-kernel, linux-raid,
	yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> There are no functional changes, the helper will be used by llbitmap in
> following patches.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/md.c | 9 +--------
>  drivers/md/md.h | 6 ++++++
>  2 files changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 18e03f651f6b..b0468e795d94 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9037,19 +9037,12 @@ static sector_t md_sync_position(struct mddev *mddev, enum sync_action action)
>
>  static bool sync_io_within_limit(struct mddev *mddev)
>  {
> -       int io_sectors;
> -
>         /*
>          * For raid456, sync IO is stripe(4k) per IO, for other levels, it's
>          * RESYNC_PAGES(64k) per IO.
>          */
> -       if (mddev->level == 4 || mddev->level == 5 || mddev->level == 6)
> -               io_sectors = 8;
> -       else
> -               io_sectors = 128;
> -
>         return atomic_read(&mddev->recovery_active) <
> -               io_sectors * sync_io_depth(mddev);
> +              (raid_is_456(mddev) ? 8 : 128) * sync_io_depth(mddev);
>  }
>
>  #define SYNC_MARKS     10
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 5ba4a9093a92..c241119e6ef3 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -1011,6 +1011,12 @@ static inline bool mddev_is_dm(struct mddev *mddev)
>         return !mddev->gendisk;
>  }
>
> +static inline bool raid_is_456(struct mddev *mddev)
> +{
> +       return mddev->level == ID_RAID4 || mddev->level == ID_RAID5 ||
> +              mddev->level == ID_RAID6;
> +}
> +
>  static inline void mddev_trace_remap(struct mddev *mddev, struct bio *bio,
>                 sector_t sector)
>  {
> --
> 2.39.2
>

Reviewed-by: Xiao Ni <xni@redhat.com>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 03/23] md/md-bitmap: cleanup bitmap_ops->startwrite()
  2025-05-24  6:13 ` [PATCH 03/23] md/md-bitmap: cleanup bitmap_ops->startwrite() Yu Kuai
@ 2025-05-25 15:51   ` Xiao Ni
  2025-05-26  6:29   ` Christoph Hellwig
  2025-05-27  5:56   ` Hannes Reinecke
  2 siblings, 0 replies; 108+ messages in thread
From: Xiao Ni @ 2025-05-25 15:51 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, yukuai3, linux-doc, linux-kernel, linux-raid,
	yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> bitmap_startwrite() always return 0, and the caller doesn't check return
> value as well, hence change the method to void.
>
> Also rename startwrite/endwrite to start_write/end_write, which is more in
> line with the usual naming convention.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/md-bitmap.c | 17 ++++++++---------
>  drivers/md/md-bitmap.h |  6 +++---
>  drivers/md/md.c        |  8 ++++----
>  3 files changed, 15 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 168eea6595b3..2997e09d463d 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -1669,13 +1669,13 @@ __acquires(bitmap->lock)
>                         &(bitmap->bp[page].map[pageoff]);
>  }
>
> -static int bitmap_startwrite(struct mddev *mddev, sector_t offset,
> -                            unsigned long sectors)
> +static void bitmap_start_write(struct mddev *mddev, sector_t offset,
> +                              unsigned long sectors)
>  {
>         struct bitmap *bitmap = mddev->bitmap;
>
>         if (!bitmap)
> -               return 0;
> +               return;
>
>         while (sectors) {
>                 sector_t blocks;
> @@ -1685,7 +1685,7 @@ static int bitmap_startwrite(struct mddev *mddev, sector_t offset,
>                 bmc = md_bitmap_get_counter(&bitmap->counts, offset, &blocks, 1);
>                 if (!bmc) {
>                         spin_unlock_irq(&bitmap->counts.lock);
> -                       return 0;
> +                       return;
>                 }
>
>                 if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
> @@ -1721,11 +1721,10 @@ static int bitmap_startwrite(struct mddev *mddev, sector_t offset,
>                 else
>                         sectors = 0;
>         }
> -       return 0;
>  }
>
> -static void bitmap_endwrite(struct mddev *mddev, sector_t offset,
> -                           unsigned long sectors)
> +static void bitmap_end_write(struct mddev *mddev, sector_t offset,
> +                            unsigned long sectors)
>  {
>         struct bitmap *bitmap = mddev->bitmap;
>
> @@ -2990,8 +2989,8 @@ static struct bitmap_operations bitmap_ops = {
>         .end_behind_write       = bitmap_end_behind_write,
>         .wait_behind_writes     = bitmap_wait_behind_writes,
>
> -       .startwrite             = bitmap_startwrite,
> -       .endwrite               = bitmap_endwrite,
> +       .start_write            = bitmap_start_write,
> +       .end_write              = bitmap_end_write,
>         .start_sync             = bitmap_start_sync,
>         .end_sync               = bitmap_end_sync,
>         .cond_end_sync          = bitmap_cond_end_sync,
> diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
> index d3d50629af91..9474e0d86fc6 100644
> --- a/drivers/md/md-bitmap.h
> +++ b/drivers/md/md-bitmap.h
> @@ -90,10 +90,10 @@ struct bitmap_operations {
>         void (*end_behind_write)(struct mddev *mddev);
>         void (*wait_behind_writes)(struct mddev *mddev);
>
> -       int (*startwrite)(struct mddev *mddev, sector_t offset,
> +       void (*start_write)(struct mddev *mddev, sector_t offset,
> +                           unsigned long sectors);
> +       void (*end_write)(struct mddev *mddev, sector_t offset,
>                           unsigned long sectors);
> -       void (*endwrite)(struct mddev *mddev, sector_t offset,
> -                        unsigned long sectors);
>         bool (*start_sync)(struct mddev *mddev, sector_t offset,
>                            sector_t *blocks, bool degraded);
>         void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index b0468e795d94..04a659f40cd6 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8849,14 +8849,14 @@ static void md_bitmap_start(struct mddev *mddev,
>                 mddev->pers->bitmap_sector(mddev, &md_io_clone->offset,
>                                            &md_io_clone->sectors);
>
> -       mddev->bitmap_ops->startwrite(mddev, md_io_clone->offset,
> -                                     md_io_clone->sectors);
> +       mddev->bitmap_ops->start_write(mddev, md_io_clone->offset,
> +                                      md_io_clone->sectors);
>  }
>
>  static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone)
>  {
> -       mddev->bitmap_ops->endwrite(mddev, md_io_clone->offset,
> -                                   md_io_clone->sectors);
> +       mddev->bitmap_ops->end_write(mddev, md_io_clone->offset,
> +                                    md_io_clone->sectors);
>  }
>
>  static void md_end_clone_io(struct bio *bio)
> --
> 2.39.2
>

Reviewed-by: Xiao Ni <xni@redhat.com>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 04/23] md/md-bitmap: support discard for bitmap ops
  2025-05-24  6:13 ` [PATCH 04/23] md/md-bitmap: support discard for bitmap ops Yu Kuai
@ 2025-05-25 15:53   ` Xiao Ni
  2025-05-26  6:29   ` Christoph Hellwig
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 108+ messages in thread
From: Xiao Ni @ 2025-05-25 15:53 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, yukuai3, linux-doc, linux-kernel, linux-raid,
	yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 2:19 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Use two new methods {start, end}_discard to handle discard IO, prepare
> to support new md bitmap.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/md-bitmap.c |  3 +++
>  drivers/md/md-bitmap.h | 12 ++++++++----
>  drivers/md/md.c        | 15 +++++++++++----
>  drivers/md/md.h        |  1 +
>  4 files changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 2997e09d463d..848626049dea 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -2991,6 +2991,9 @@ static struct bitmap_operations bitmap_ops = {
>
>         .start_write            = bitmap_start_write,
>         .end_write              = bitmap_end_write,
> +       .start_discard          = bitmap_start_write,
> +       .end_discard            = bitmap_end_write,
> +
>         .start_sync             = bitmap_start_sync,
>         .end_sync               = bitmap_end_sync,
>         .cond_end_sync          = bitmap_cond_end_sync,
> diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
> index 9474e0d86fc6..4d804c07dbdd 100644
> --- a/drivers/md/md-bitmap.h
> +++ b/drivers/md/md-bitmap.h
> @@ -70,6 +70,9 @@ struct md_bitmap_stats {
>         struct file     *file;
>  };
>
> +typedef void (md_bitmap_fn)(struct mddev *mddev, sector_t offset,
> +                           unsigned long sectors);
> +
>  struct bitmap_operations {
>         struct md_submodule_head head;
>
> @@ -90,10 +93,11 @@ struct bitmap_operations {
>         void (*end_behind_write)(struct mddev *mddev);
>         void (*wait_behind_writes)(struct mddev *mddev);
>
> -       void (*start_write)(struct mddev *mddev, sector_t offset,
> -                           unsigned long sectors);
> -       void (*end_write)(struct mddev *mddev, sector_t offset,
> -                         unsigned long sectors);
> +       md_bitmap_fn *start_write;
> +       md_bitmap_fn *end_write;
> +       md_bitmap_fn *start_discard;
> +       md_bitmap_fn *end_discard;
> +
>         bool (*start_sync)(struct mddev *mddev, sector_t offset,
>                            sector_t *blocks, bool degraded);
>         void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 04a659f40cd6..466087cef4f9 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8845,18 +8845,24 @@ EXPORT_SYMBOL_GPL(md_submit_discard_bio);
>  static void md_bitmap_start(struct mddev *mddev,
>                             struct md_io_clone *md_io_clone)
>  {
> +       md_bitmap_fn *fn = unlikely(md_io_clone->rw == STAT_DISCARD) ?
> +                          mddev->bitmap_ops->start_discard :
> +                          mddev->bitmap_ops->start_write;
> +
>         if (mddev->pers->bitmap_sector)
>                 mddev->pers->bitmap_sector(mddev, &md_io_clone->offset,
>                                            &md_io_clone->sectors);
>
> -       mddev->bitmap_ops->start_write(mddev, md_io_clone->offset,
> -                                      md_io_clone->sectors);
> +       fn(mddev, md_io_clone->offset, md_io_clone->sectors);
>  }
>
>  static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone)
>  {
> -       mddev->bitmap_ops->end_write(mddev, md_io_clone->offset,
> -                                    md_io_clone->sectors);
> +       md_bitmap_fn *fn = unlikely(md_io_clone->rw == STAT_DISCARD) ?
> +                          mddev->bitmap_ops->end_discard :
> +                          mddev->bitmap_ops->end_write;
> +
> +       fn(mddev, md_io_clone->offset, md_io_clone->sectors);
>  }
>
>  static void md_end_clone_io(struct bio *bio)
> @@ -8895,6 +8901,7 @@ static void md_clone_bio(struct mddev *mddev, struct bio **bio)
>         if (bio_data_dir(*bio) == WRITE && md_bitmap_enabled(mddev)) {
>                 md_io_clone->offset = (*bio)->bi_iter.bi_sector;
>                 md_io_clone->sectors = bio_sectors(*bio);
> +               md_io_clone->rw = op_stat_group(bio_op(*bio));
>                 md_bitmap_start(mddev, md_io_clone);
>         }
>
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index c241119e6ef3..13e3f9ce1b79 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -850,6 +850,7 @@ struct md_io_clone {
>         unsigned long   start_time;
>         sector_t        offset;
>         unsigned long   sectors;
> +       enum stat_group rw;
>         struct bio      bio_clone;
>  };
>
> --
> 2.39.2
>
>

Reviewed-by: Xiao Ni <xni@redhat.com>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 05/23] md/md-bitmap: remove parameter slot from bitmap_create()
  2025-05-24  6:13 ` [PATCH 05/23] md/md-bitmap: remove parameter slot from bitmap_create() Yu Kuai
@ 2025-05-25 16:09   ` Xiao Ni
  2025-05-26  6:30   ` Christoph Hellwig
  2025-05-27  6:01   ` Hannes Reinecke
  2 siblings, 0 replies; 108+ messages in thread
From: Xiao Ni @ 2025-05-25 16:09 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, yukuai3, linux-doc, linux-kernel, linux-raid,
	yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> All callers pass in '-1' for 'slot', hence it can be removed.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/md/md-bitmap.c | 6 +++---
>  drivers/md/md-bitmap.h | 2 +-
>  drivers/md/md.c        | 6 +++---
>  3 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 848626049dea..17d41a7b30ce 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -2185,9 +2185,9 @@ static struct bitmap *__bitmap_create(struct mddev *mddev, int slot)
>         return ERR_PTR(err);
>  }
>
> -static int bitmap_create(struct mddev *mddev, int slot)
> +static int bitmap_create(struct mddev *mddev)
>  {
> -       struct bitmap *bitmap = __bitmap_create(mddev, slot);
> +       struct bitmap *bitmap = __bitmap_create(mddev, -1);
>
>         if (IS_ERR(bitmap))
>                 return PTR_ERR(bitmap);
> @@ -2649,7 +2649,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
>                         }
>
>                         mddev->bitmap_info.offset = offset;
> -                       rv = bitmap_create(mddev, -1);
> +                       rv = bitmap_create(mddev);
>                         if (rv)
>                                 goto out;
>
> diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
> index 4d804c07dbdd..2b99ddef7a41 100644
> --- a/drivers/md/md-bitmap.h
> +++ b/drivers/md/md-bitmap.h
> @@ -77,7 +77,7 @@ struct bitmap_operations {
>         struct md_submodule_head head;
>
>         bool (*enabled)(void *data);
> -       int (*create)(struct mddev *mddev, int slot);
> +       int (*create)(struct mddev *mddev);
>         int (*resize)(struct mddev *mddev, sector_t blocks, int chunksize);
>
>         int (*load)(struct mddev *mddev);
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 466087cef4f9..311e52d5173d 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6255,7 +6255,7 @@ int md_run(struct mddev *mddev)
>         }
>         if (err == 0 && pers->sync_request && md_bitmap_registered(mddev) &&
>             (mddev->bitmap_info.file || mddev->bitmap_info.offset)) {
> -               err = mddev->bitmap_ops->create(mddev, -1);
> +               err = mddev->bitmap_ops->create(mddev);
>                 if (err)
>                         pr_warn("%s: failed to create bitmap (%d)\n",
>                                 mdname(mddev), err);
> @@ -7324,7 +7324,7 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
>         err = 0;
>         if (mddev->pers) {
>                 if (fd >= 0) {
> -                       err = mddev->bitmap_ops->create(mddev, -1);
> +                       err = mddev->bitmap_ops->create(mddev);
>                         if (!err)
>                                 err = mddev->bitmap_ops->load(mddev);
>
> @@ -7648,7 +7648,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
>                                 mddev->bitmap_info.default_offset;
>                         mddev->bitmap_info.space =
>                                 mddev->bitmap_info.default_space;
> -                       rv = mddev->bitmap_ops->create(mddev, -1);
> +                       rv = mddev->bitmap_ops->create(mddev);
>                         if (!rv)
>                                 rv = mddev->bitmap_ops->load(mddev);
>
> --
> 2.39.2
>

Reviewed-by: Xiao Ni <xni@redhat.com>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type
  2025-05-24  6:13 ` [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type Yu Kuai
@ 2025-05-25 16:32   ` Xiao Ni
  2025-05-26  1:13     ` Yu Kuai
  2025-05-26  6:32   ` Christoph Hellwig
  2025-05-27  6:10   ` Hannes Reinecke
  2 siblings, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-05-25 16:32 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, yukuai3, linux-doc, linux-kernel, linux-raid,
	yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> The api will be used by mdadm to set bitmap_ops while creating new array

Hi Kuai

Maybe you want to say "set bitmap type" here? And can you explain more
here, why does it need this sys file while creating a new array? The
reason I ask is that it doesn't use a sys file when creating an array
with bitmap.

And if it really needs this, can this be gotten by superblock?

Best Regards
Xiao

> or assemble array, prepare to add a new bitmap.
>
> Currently available options are:
>
> cat /sys/block/md0/md/bitmap_type
> none [bitmap]
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  Documentation/admin-guide/md.rst | 73 ++++++++++++++----------
>  drivers/md/md.c                  | 96 ++++++++++++++++++++++++++++++--
>  drivers/md/md.h                  |  2 +
>  3 files changed, 135 insertions(+), 36 deletions(-)
>
> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> index 4ff2cc291d18..356d2a344f08 100644
> --- a/Documentation/admin-guide/md.rst
> +++ b/Documentation/admin-guide/md.rst
> @@ -347,6 +347,49 @@ All md devices contain:
>       active-idle
>           like active, but no writes have been seen for a while (safe_mode_delay).
>
> +  consistency_policy
> +     This indicates how the array maintains consistency in case of unexpected
> +     shutdown. It can be:
> +
> +     none
> +       Array has no redundancy information, e.g. raid0, linear.
> +
> +     resync
> +       Full resync is performed and all redundancy is regenerated when the
> +       array is started after unclean shutdown.
> +
> +     bitmap
> +       Resync assisted by a write-intent bitmap.
> +
> +     journal
> +       For raid4/5/6, journal device is used to log transactions and replay
> +       after unclean shutdown.
> +
> +     ppl
> +       For raid5 only, Partial Parity Log is used to close the write hole and
> +       eliminate resync.
> +
> +     The accepted values when writing to this file are ``ppl`` and ``resync``,
> +     used to enable and disable PPL.
> +
> +  uuid
> +     This indicates the UUID of the array in the following format:
> +     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
> +
> +  bitmap_type
> +     [RW] When read, this file will display the current and available
> +     bitmap for this array. The currently active bitmap will be enclosed
> +     in [] brackets. Writing an bitmap name or ID to this file will switch
> +     control of this array to that new bitmap. Note that writing a new
> +     bitmap for created array is forbidden.
> +
> +     none
> +         No bitmap
> +     bitmap
> +         The default internal bitmap
> +
> +If bitmap_type is bitmap, then the md device will also contain:
> +
>    bitmap/location
>       This indicates where the write-intent bitmap for the array is
>       stored.
> @@ -401,36 +444,6 @@ All md devices contain:
>       once the array becomes non-degraded, and this fact has been
>       recorded in the metadata.
>
> -  consistency_policy
> -     This indicates how the array maintains consistency in case of unexpected
> -     shutdown. It can be:
> -
> -     none
> -       Array has no redundancy information, e.g. raid0, linear.
> -
> -     resync
> -       Full resync is performed and all redundancy is regenerated when the
> -       array is started after unclean shutdown.
> -
> -     bitmap
> -       Resync assisted by a write-intent bitmap.
> -
> -     journal
> -       For raid4/5/6, journal device is used to log transactions and replay
> -       after unclean shutdown.
> -
> -     ppl
> -       For raid5 only, Partial Parity Log is used to close the write hole and
> -       eliminate resync.
> -
> -     The accepted values when writing to this file are ``ppl`` and ``resync``,
> -     used to enable and disable PPL.
> -
> -  uuid
> -     This indicates the UUID of the array in the following format:
> -     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
> -
> -
>  As component devices are added to an md array, they appear in the ``md``
>  directory as new directories named::
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 311e52d5173d..4eb0c6effd5b 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -672,13 +672,18 @@ static void active_io_release(struct percpu_ref *ref)
>
>  static void no_op(struct percpu_ref *r) {}
>
> -static void mddev_set_bitmap_ops(struct mddev *mddev, enum md_submodule_id id)
> +static bool mddev_set_bitmap_ops(struct mddev *mddev)
>  {
>         xa_lock(&md_submodule);
> -       mddev->bitmap_ops = xa_load(&md_submodule, id);
> +       mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
>         xa_unlock(&md_submodule);
> -       if (!mddev->bitmap_ops)
> -               pr_warn_once("md: can't find bitmap id %d\n", id);
> +
> +       if (!mddev->bitmap_ops) {
> +               pr_warn_once("md: can't find bitmap id %d\n", mddev->bitmap_id);
> +               return false;
> +       }
> +
> +       return true;
>  }
>
>  static void mddev_clear_bitmap_ops(struct mddev *mddev)
> @@ -688,8 +693,10 @@ static void mddev_clear_bitmap_ops(struct mddev *mddev)
>
>  int mddev_init(struct mddev *mddev)
>  {
> -       /* TODO: support more versions */
> -       mddev_set_bitmap_ops(mddev, ID_BITMAP);
> +       mddev->bitmap_id = ID_BITMAP;
> +
> +       if (!mddev_set_bitmap_ops(mddev))
> +               return -EINVAL;
>
>         if (percpu_ref_init(&mddev->active_io, active_io_release,
>                             PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
> @@ -4155,6 +4162,82 @@ new_level_store(struct mddev *mddev, const char *buf, size_t len)
>  static struct md_sysfs_entry md_new_level =
>  __ATTR(new_level, 0664, new_level_show, new_level_store);
>
> +static ssize_t
> +bitmap_type_show(struct mddev *mddev, char *page)
> +{
> +       struct md_submodule_head *head;
> +       unsigned long i;
> +       ssize_t len = 0;
> +
> +       if (mddev->bitmap_id == ID_BITMAP_NONE)
> +               len += sprintf(page + len, "[none] ");
> +       else
> +               len += sprintf(page + len, "none ");
> +
> +       xa_lock(&md_submodule);
> +       xa_for_each(&md_submodule, i, head) {
> +               if (head->type != MD_BITMAP)
> +                       continue;
> +
> +               if (mddev->bitmap_id == head->id)
> +                       len += sprintf(page + len, "[%s] ", head->name);
> +               else
> +                       len += sprintf(page + len, "%s ", head->name);
> +       }
> +       xa_unlock(&md_submodule);
> +
> +       len += sprintf(page + len, "\n");
> +       return len;
> +}
> +
> +static ssize_t
> +bitmap_type_store(struct mddev *mddev, const char *buf, size_t len)
> +{
> +       struct md_submodule_head *head;
> +       enum md_submodule_id id;
> +       unsigned long i;
> +       int err;
> +
> +       if (mddev->bitmap_ops)
> +               return -EBUSY;
> +
> +       err = kstrtoint(buf, 10, &id);
> +       if (!err) {
> +               if (id == ID_BITMAP_NONE) {
> +                       mddev->bitmap_id = id;
> +                       return len;
> +               }
> +
> +               xa_lock(&md_submodule);
> +               head = xa_load(&md_submodule, id);
> +               xa_unlock(&md_submodule);
> +
> +               if (head && head->type == MD_BITMAP) {
> +                       mddev->bitmap_id = id;
> +                       return len;
> +               }
> +       }
> +
> +       if (cmd_match(buf, "none")) {
> +               mddev->bitmap_id = ID_BITMAP_NONE;
> +               return len;
> +       }
> +
> +       xa_lock(&md_submodule);
> +       xa_for_each(&md_submodule, i, head) {
> +               if (head->type == MD_BITMAP && cmd_match(buf, head->name)) {
> +                       mddev->bitmap_id = head->id;
> +                       xa_unlock(&md_submodule);
> +                       return len;
> +               }
> +       }
> +       xa_unlock(&md_submodule);
> +       return -ENOENT;
> +}
> +
> +static struct md_sysfs_entry md_bitmap_type =
> +__ATTR(bitmap_type, 0664, bitmap_type_show, bitmap_type_store);
> +
>  static ssize_t
>  layout_show(struct mddev *mddev, char *page)
>  {
> @@ -5719,6 +5802,7 @@ __ATTR(serialize_policy, S_IRUGO | S_IWUSR, serialize_policy_show,
>  static struct attribute *md_default_attrs[] = {
>         &md_level.attr,
>         &md_new_level.attr,
> +       &md_bitmap_type.attr,
>         &md_layout.attr,
>         &md_raid_disks.attr,
>         &md_uuid.attr,
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 13e3f9ce1b79..bf34c0a36551 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -40,6 +40,7 @@ enum md_submodule_id {
>         ID_CLUSTER,
>         ID_BITMAP,
>         ID_LLBITMAP,    /* TODO */
> +       ID_BITMAP_NONE,
>  };
>
>  struct md_submodule_head {
> @@ -565,6 +566,7 @@ struct mddev {
>         struct percpu_ref               writes_pending;
>         int                             sync_checkers;  /* # of threads checking writes_pending */
>
> +       enum md_submodule_id            bitmap_id;
>         void                            *bitmap; /* the bitmap for the device */
>         struct bitmap_operations        *bitmap_ops;
>         struct {
> --
> 2.39.2
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type
  2025-05-25 16:32   ` Xiao Ni
@ 2025-05-26  1:13     ` Yu Kuai
  2025-05-26  5:11       ` Xiao Ni
  0 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-05-26  1:13 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/05/26 0:32, Xiao Ni 写道:
>> The api will be used by mdadm to set bitmap_ops while creating new array
> Hi Kuai
> 
> Maybe you want to say "set bitmap type" here? And can you explain more
> here, why does it need this sys file while creating a new array? The
> reason I ask is that it doesn't use a sys file when creating an array
> with bitmap.

I do mean mddev->bitmap_ops here, this is the same as mddev->pers and
the md/level api. The mdadm patch will write the new helper before
running array.
> 
> And if it really needs this, can this be gotten by superblock?

Theoretically, I can, however, the bitmap superblock is read by
bitmap_ops->create method, and we need to set the bitmap_ops
first. And changing the framwork will be much complex.

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type
  2025-05-26  1:13     ` Yu Kuai
@ 2025-05-26  5:11       ` Xiao Ni
  2025-05-26  8:02         ` Yu Kuai
  0 siblings, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-05-26  5:11 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

On Mon, May 26, 2025 at 9:14 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2025/05/26 0:32, Xiao Ni 写道:
> >> The api will be used by mdadm to set bitmap_ops while creating new array
> > Hi Kuai
> >
> > Maybe you want to say "set bitmap type" here? And can you explain more
> > here, why does it need this sys file while creating a new array? The
> > reason I ask is that it doesn't use a sys file when creating an array
> > with bitmap.
>
> I do mean mddev->bitmap_ops here, this is the same as mddev->pers and
> the md/level api. The mdadm patch will write the new helper before
> running array.

+ if (s->btype == BitmapLockless &&
+    sysfs_set_str(&info, NULL, "bitmap_type", "llbitmap") < 0)
+ goto abort_locked;

The three lines of code are in the Create function. From an intuitive
perspective, it's used to set bitmap type to llbitmap rather than
bitmap ops. And in this patch, it adds the bitmap_type sysfs api to
set mddev->bitmap_id. After adding some debug logs, I understand you.
It's better to describe here more. Because the sysfs file api is used
to set bitmap type. Then it can be used to choose the bitmap ops when
creating array in md_create_bitmap


> >
> > And if it really needs this, can this be gotten by superblock?
>
> Theoretically, I can, however, the bitmap superblock is read by
> bitmap_ops->create method, and we need to set the bitmap_ops
> first. And changing the framwork will be much complex.

After adding some debug logs, I understand you. Now the default bitmap
is "bitmap", so it can set bitmap ops in md_run->md_bitmap_create. If
it wants to use llbitmap, it needs to set bitmap type first. Then it
can set bitmap ops in md_run->md_bitmap_create.

And it's better to explain why it's a better choice to use bitmap_type
sys rather than reading from superblock. So in future, developers can
understand the design easily.

Regards
Xiao
>
> Thanks,
> Kuai
>
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 01/23] md: add a new parameter 'offset' to md_super_write()
  2025-05-24  6:12 ` [PATCH 01/23] md: add a new parameter 'offset' to md_super_write() Yu Kuai
  2025-05-25 15:50   ` Xiao Ni
@ 2025-05-26  6:28   ` Christoph Hellwig
  2025-05-26  7:28     ` Yu Kuai
  2025-05-27  5:54   ` Hannes Reinecke
  2 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-26  6:28 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, xni, colyli, song, yukuai3, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 02:12:58PM +0800, Yu Kuai wrote:
> -void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
> -		   sector_t sector, int size, struct page *page)
> +void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
> +		       sector_t sector, int size, struct page *page,
> +		       unsigned int offset)

Maybe add a little command explaining what it does?

> +extern void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
> +			      sector_t sector, int size, struct page *page,
> +			      unsigned int offset);

No need for the extern.  Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 02/23] md: factor out a helper raid_is_456()
  2025-05-24  6:12 ` [PATCH 02/23] md: factor out a helper raid_is_456() Yu Kuai
  2025-05-25 15:50   ` Xiao Ni
@ 2025-05-26  6:28   ` Christoph Hellwig
  2025-05-27  5:55   ` Hannes Reinecke
  2 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-26  6:28 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, xni, colyli, song, yukuai3, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 03/23] md/md-bitmap: cleanup bitmap_ops->startwrite()
  2025-05-24  6:13 ` [PATCH 03/23] md/md-bitmap: cleanup bitmap_ops->startwrite() Yu Kuai
  2025-05-25 15:51   ` Xiao Ni
@ 2025-05-26  6:29   ` Christoph Hellwig
  2025-05-27  5:56   ` Hannes Reinecke
  2 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-26  6:29 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, xni, colyli, song, yukuai3, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 04/23] md/md-bitmap: support discard for bitmap ops
  2025-05-24  6:13 ` [PATCH 04/23] md/md-bitmap: support discard for bitmap ops Yu Kuai
  2025-05-25 15:53   ` Xiao Ni
@ 2025-05-26  6:29   ` Christoph Hellwig
  2025-05-27  6:01   ` Hannes Reinecke
  2025-05-28  7:04   ` Glass Su
  3 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-26  6:29 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, xni, colyli, song, yukuai3, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi

> +typedef void (md_bitmap_fn)(struct mddev *mddev, sector_t offset,
> +			    unsigned long sectors);

Does this typedef really add any value?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 05/23] md/md-bitmap: remove parameter slot from bitmap_create()
  2025-05-24  6:13 ` [PATCH 05/23] md/md-bitmap: remove parameter slot from bitmap_create() Yu Kuai
  2025-05-25 16:09   ` Xiao Ni
@ 2025-05-26  6:30   ` Christoph Hellwig
  2025-05-27  6:01   ` Hannes Reinecke
  2 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-26  6:30 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, xni, colyli, song, yukuai3, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type
  2025-05-24  6:13 ` [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type Yu Kuai
  2025-05-25 16:32   ` Xiao Ni
@ 2025-05-26  6:32   ` Christoph Hellwig
  2025-05-26  7:45     ` Yu Kuai
  2025-05-27  6:10   ` Hannes Reinecke
  2 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-26  6:32 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, xni, colyli, song, yukuai3, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 02:13:03PM +0800, Yu Kuai wrote:
> +  consistency_policy

.. these doc changes look unrelated, or am I missing something?

> -static void mddev_set_bitmap_ops(struct mddev *mddev, enum md_submodule_id id)
> +static bool mddev_set_bitmap_ops(struct mddev *mddev)
>  {
>  	xa_lock(&md_submodule);
> -	mddev->bitmap_ops = xa_load(&md_submodule, id);
> +	mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
>  	xa_unlock(&md_submodule);
> -	if (!mddev->bitmap_ops)
> -		pr_warn_once("md: can't find bitmap id %d\n", id);
> +
> +	if (!mddev->bitmap_ops) {
> +		pr_warn_once("md: can't find bitmap id %d\n", mddev->bitmap_id);
> +		return false;
> +	}
> +
> +	return true;

This also looks unrelated and like another prep patch?


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  2025-05-24  6:13 ` [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap Yu Kuai
@ 2025-05-26  6:32   ` Christoph Hellwig
  2025-05-26  6:52   ` Xiao Ni
  2025-05-27  6:13   ` Hannes Reinecke
  2 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-26  6:32 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, xni, colyli, song, yukuai3, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 11/23] md/md-bitmap: make method bitmap_ops->daemon_work optional
  2025-05-24  6:13 ` [PATCH 11/23] md/md-bitmap: make method bitmap_ops->daemon_work optional Yu Kuai
@ 2025-05-26  6:34   ` Christoph Hellwig
  2025-05-27  6:19   ` Hannes Reinecke
  1 sibling, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-26  6:34 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, xni, colyli, song, yukuai3, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 12/23] md/md-bitmap: add macros for lockless bitmap
  2025-05-24  6:13 ` [PATCH 12/23] md/md-bitmap: add macros for lockless bitmap Yu Kuai
@ 2025-05-26  6:40   ` Christoph Hellwig
  2025-05-26  8:12     ` Yu Kuai
  2025-05-27  6:21   ` Hannes Reinecke
  2025-05-28  4:53   ` Xiao Ni
  2 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-26  6:40 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, xni, colyli, song, yukuai3, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 02:13:09PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Also move other values to md-bitmap.h and update comments.

Hmm.  The commit message looks very confusing to me.

I think this should be two patches:

 1) move defines relevant to the disk format from md-bitmap.c to md-bitmap.h
 2) add new bits for llbitmap (and explain what they are).

> +#define BITMAP_SB_SIZE 1024

And while we're at it: this is still duplicated in llbitmap.c later.
But shouldn't it simply be replaced with a sizeof on struct bitmap_super_s?

(and when cleaning thing up, rename that to bitmap_super without
the _s and use it instead of the typedef at least for all new code)?

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 13/23] md/md-bitmap: fix dm-raid max_write_behind setting
  2025-05-24  6:13 ` [PATCH 13/23] md/md-bitmap: fix dm-raid max_write_behind setting Yu Kuai
@ 2025-05-26  6:40   ` Christoph Hellwig
  2025-05-27  6:21   ` Hannes Reinecke
  1 sibling, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-26  6:40 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, xni, colyli, song, yukuai3, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 02:13:10PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> It's supposed to be COUNTER_MAX / 2, not COUNTER_MAX.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

And maybe move this to the front of the series and/or submit it ASAP
for 6.16?


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 14/23] md/dm-raid: remove max_write_behind setting limit
  2025-05-24  6:13 ` [PATCH 14/23] md/dm-raid: remove max_write_behind setting limit Yu Kuai
@ 2025-05-26  6:41   ` Christoph Hellwig
  2025-05-27  6:26   ` Hannes Reinecke
  2025-05-28  4:58   ` Xiao Ni
  2 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-26  6:41 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, xni, colyli, song, yukuai3, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  2025-05-24  6:13 ` [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap Yu Kuai
  2025-05-26  6:32   ` Christoph Hellwig
@ 2025-05-26  6:52   ` Xiao Ni
  2025-05-26  7:57     ` Yu Kuai
  2025-05-27  6:13   ` Hannes Reinecke
  2 siblings, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-05-26  6:52 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, yukuai3, linux-doc, linux-kernel, linux-raid,
	yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Currently bitmap_ops is registered while allocating mddev, this is fine
> when there is only one bitmap_ops, however, after introduing a new
> bitmap_ops, user space need a time window to choose which bitmap_ops to
> use while creating new array.

Could you give more explanation about what the time window is? Is it
between setting llbitmap by bitmap_type and md_bitmap_create?

>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/md.c | 86 +++++++++++++++++++++++++++++++------------------
>  1 file changed, 55 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 4eb0c6effd5b..dc4b85f30e13 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -674,39 +674,50 @@ static void no_op(struct percpu_ref *r) {}
>
>  static bool mddev_set_bitmap_ops(struct mddev *mddev)
>  {
> +       struct bitmap_operations *old = mddev->bitmap_ops;
> +       struct md_submodule_head *head;
> +
> +       if (mddev->bitmap_id == ID_BITMAP_NONE ||
> +           (old && old->head.id == mddev->bitmap_id))
> +               return true;
> +
>         xa_lock(&md_submodule);
> -       mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
> +       head = xa_load(&md_submodule, mddev->bitmap_id);
>         xa_unlock(&md_submodule);
>
> -       if (!mddev->bitmap_ops) {
> -               pr_warn_once("md: can't find bitmap id %d\n", mddev->bitmap_id);
> +       if (WARN_ON_ONCE(!head || head->type != MD_BITMAP)) {
> +               pr_err("md: can't find bitmap id %d\n", mddev->bitmap_id);
>                 return false;
>         }
>
> +       if (old && old->group)
> +               sysfs_remove_group(&mddev->kobj, old->group);

I think you're handling a competition problem here. But I don't know
how the old/old->group is already created when creating an array.
Could you explain this?

Regards
Xiao

> +
> +       mddev->bitmap_ops = (void *)head;
> +       if (mddev->bitmap_ops && mddev->bitmap_ops->group &&
> +           sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group))
> +               pr_warn("md: cannot register extra bitmap attributes for %s\n",
> +                       mdname(mddev));
> +
>         return true;
>  }
>
>  static void mddev_clear_bitmap_ops(struct mddev *mddev)
>  {
> +       if (mddev->bitmap_ops && mddev->bitmap_ops->group)
> +               sysfs_remove_group(&mddev->kobj, mddev->bitmap_ops->group);
> +
>         mddev->bitmap_ops = NULL;
>  }
>
>  int mddev_init(struct mddev *mddev)
>  {
> -       mddev->bitmap_id = ID_BITMAP;
> -
> -       if (!mddev_set_bitmap_ops(mddev))
> -               return -EINVAL;
> -
>         if (percpu_ref_init(&mddev->active_io, active_io_release,
> -                           PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
> -               mddev_clear_bitmap_ops(mddev);
> +                           PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
>                 return -ENOMEM;
> -       }
>
>         if (percpu_ref_init(&mddev->writes_pending, no_op,
>                             PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
> -               mddev_clear_bitmap_ops(mddev);
>                 percpu_ref_exit(&mddev->active_io);
>                 return -ENOMEM;
>         }
> @@ -734,6 +745,7 @@ int mddev_init(struct mddev *mddev)
>         mddev->resync_min = 0;
>         mddev->resync_max = MaxSector;
>         mddev->level = LEVEL_NONE;
> +       mddev->bitmap_id = ID_BITMAP;
>
>         INIT_WORK(&mddev->sync_work, md_start_sync);
>         INIT_WORK(&mddev->del_work, mddev_delayed_delete);
> @@ -744,7 +756,6 @@ EXPORT_SYMBOL_GPL(mddev_init);
>
>  void mddev_destroy(struct mddev *mddev)
>  {
> -       mddev_clear_bitmap_ops(mddev);
>         percpu_ref_exit(&mddev->active_io);
>         percpu_ref_exit(&mddev->writes_pending);
>  }
> @@ -6093,11 +6104,6 @@ struct mddev *md_alloc(dev_t dev, char *name)
>                 return ERR_PTR(error);
>         }
>
> -       if (md_bitmap_registered(mddev) && mddev->bitmap_ops->group)
> -               if (sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group))
> -                       pr_warn("md: cannot register extra bitmap attributes for %s\n",
> -                               mdname(mddev));
> -
>         kobject_uevent(&mddev->kobj, KOBJ_ADD);
>         mddev->sysfs_state = sysfs_get_dirent_safe(mddev->kobj.sd, "array_state");
>         mddev->sysfs_level = sysfs_get_dirent_safe(mddev->kobj.sd, "level");
> @@ -6173,6 +6179,26 @@ static void md_safemode_timeout(struct timer_list *t)
>
>  static int start_dirty_degraded;
>
> +static int md_bitmap_create(struct mddev *mddev)
> +{
> +       if (mddev->bitmap_id == ID_BITMAP_NONE)
> +               return -EINVAL;
> +
> +       if (!mddev_set_bitmap_ops(mddev))
> +               return -ENOENT;
> +
> +       return mddev->bitmap_ops->create(mddev);
> +}
> +
> +static void md_bitmap_destroy(struct mddev *mddev)
> +{
> +       if (!md_bitmap_registered(mddev))
> +               return;
> +
> +       mddev->bitmap_ops->destroy(mddev);
> +       mddev_clear_bitmap_ops(mddev);
> +}
> +
>  int md_run(struct mddev *mddev)
>  {
>         int err;
> @@ -6337,9 +6363,9 @@ int md_run(struct mddev *mddev)
>                         (unsigned long long)pers->size(mddev, 0, 0) / 2);
>                 err = -EINVAL;
>         }
> -       if (err == 0 && pers->sync_request && md_bitmap_registered(mddev) &&
> +       if (err == 0 && pers->sync_request &&
>             (mddev->bitmap_info.file || mddev->bitmap_info.offset)) {
> -               err = mddev->bitmap_ops->create(mddev);
> +               err = md_bitmap_create(mddev);
>                 if (err)
>                         pr_warn("%s: failed to create bitmap (%d)\n",
>                                 mdname(mddev), err);
> @@ -6412,8 +6438,7 @@ int md_run(struct mddev *mddev)
>                 pers->free(mddev, mddev->private);
>         mddev->private = NULL;
>         put_pers(pers);
> -       if (md_bitmap_registered(mddev))
> -               mddev->bitmap_ops->destroy(mddev);
> +       md_bitmap_destroy(mddev);
>  abort:
>         bioset_exit(&mddev->io_clone_set);
>  exit_sync_set:
> @@ -6436,7 +6461,7 @@ int do_md_run(struct mddev *mddev)
>         if (md_bitmap_registered(mddev)) {
>                 err = mddev->bitmap_ops->load(mddev);
>                 if (err) {
> -                       mddev->bitmap_ops->destroy(mddev);
> +                       md_bitmap_destroy(mddev);
>                         goto out;
>                 }
>         }
> @@ -6627,8 +6652,7 @@ static void __md_stop(struct mddev *mddev)
>  {
>         struct md_personality *pers = mddev->pers;
>
> -       if (md_bitmap_registered(mddev))
> -               mddev->bitmap_ops->destroy(mddev);
> +       md_bitmap_destroy(mddev);
>         mddev_detach(mddev);
>         spin_lock(&mddev->lock);
>         mddev->pers = NULL;
> @@ -7408,16 +7432,16 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
>         err = 0;
>         if (mddev->pers) {
>                 if (fd >= 0) {
> -                       err = mddev->bitmap_ops->create(mddev);
> +                       err = md_bitmap_create(mddev);
>                         if (!err)
>                                 err = mddev->bitmap_ops->load(mddev);
>
>                         if (err) {
> -                               mddev->bitmap_ops->destroy(mddev);
> +                               md_bitmap_destroy(mddev);
>                                 fd = -1;
>                         }
>                 } else if (fd < 0) {
> -                       mddev->bitmap_ops->destroy(mddev);
> +                       md_bitmap_destroy(mddev);
>                 }
>         }
>
> @@ -7732,12 +7756,12 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
>                                 mddev->bitmap_info.default_offset;
>                         mddev->bitmap_info.space =
>                                 mddev->bitmap_info.default_space;
> -                       rv = mddev->bitmap_ops->create(mddev);
> +                       rv = md_bitmap_create(mddev);
>                         if (!rv)
>                                 rv = mddev->bitmap_ops->load(mddev);
>
>                         if (rv)
> -                               mddev->bitmap_ops->destroy(mddev);
> +                               md_bitmap_destroy(mddev);
>                 } else {
>                         struct md_bitmap_stats stats;
>
> @@ -7763,7 +7787,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
>                                 put_cluster_ops(mddev);
>                                 mddev->safemode_delay = DEFAULT_SAFEMODE_DELAY;
>                         }
> -                       mddev->bitmap_ops->destroy(mddev);
> +                       md_bitmap_destroy(mddev);
>                         mddev->bitmap_info.offset = 0;
>                 }
>         }
> --
> 2.39.2
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 08/23] md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
  2025-05-24  6:13 ` [PATCH 08/23] md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations Yu Kuai
@ 2025-05-26  7:03   ` Xiao Ni
  2025-05-27  6:14   ` Hannes Reinecke
  1 sibling, 0 replies; 108+ messages in thread
From: Xiao Ni @ 2025-05-26  7:03 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, yukuai3, linux-doc, linux-kernel, linux-raid,
	yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> This method is used to check if blocks can be skipped before calling
> into pers->sync_request(), llbiltmap will use this method to skip

type error: s/llbiltmap/llbitmap/g

> resync for unwritten/clean data blocks, and recovery/check/repair for
> unwritten data blocks;
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/md/md-bitmap.h | 1 +
>  drivers/md/md.c        | 7 +++++++
>  2 files changed, 8 insertions(+)
>
> diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
> index 2b99ddef7a41..0de14d475ad3 100644
> --- a/drivers/md/md-bitmap.h
> +++ b/drivers/md/md-bitmap.h
> @@ -98,6 +98,7 @@ struct bitmap_operations {
>         md_bitmap_fn *start_discard;
>         md_bitmap_fn *end_discard;
>
> +       sector_t (*skip_sync_blocks)(struct mddev *mddev, sector_t offset);
>         bool (*start_sync)(struct mddev *mddev, sector_t offset,
>                            sector_t *blocks, bool degraded);
>         void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index dc4b85f30e13..890c8da43b3b 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9362,6 +9362,12 @@ void md_do_sync(struct md_thread *thread)
>                 if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
>                         break;
>
> +               if (mddev->bitmap_ops && mddev->bitmap_ops->skip_sync_blocks) {
> +                       sectors = mddev->bitmap_ops->skip_sync_blocks(mddev, j);
> +                       if (sectors)
> +                               goto update;
> +               }
> +
>                 sectors = mddev->pers->sync_request(mddev, j, max_sectors,
>                                                     &skipped);
>                 if (sectors == 0) {
> @@ -9377,6 +9383,7 @@ void md_do_sync(struct md_thread *thread)
>                 if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
>                         break;
>
> +update:
>                 j += sectors;
>                 if (j > max_sectors)
>                         /* when skipping, extra large numbers can be returned. */
> --
> 2.39.2
>

Reviewed-by: Xiao Ni <xni@redhat.com>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 01/23] md: add a new parameter 'offset' to md_super_write()
  2025-05-26  6:28   ` Christoph Hellwig
@ 2025-05-26  7:28     ` Yu Kuai
  0 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-26  7:28 UTC (permalink / raw)
  To: Christoph Hellwig, Yu Kuai
  Cc: xni, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/05/26 14:28, Christoph Hellwig 写道:
> On Sat, May 24, 2025 at 02:12:58PM +0800, Yu Kuai wrote:
>> -void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
>> -		   sector_t sector, int size, struct page *page)
>> +void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
>> +		       sector_t sector, int size, struct page *page,
>> +		       unsigned int offset)
> 
> Maybe add a little command explaining what it does?

OK.
> 
>> +extern void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
>> +			      sector_t sector, int size, struct page *page,
>> +			      unsigned int offset);
> 
> No need for the extern.  Otherwise looks good:

Got it.
Thanks for the review!

Kuai

> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> .
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type
  2025-05-26  6:32   ` Christoph Hellwig
@ 2025-05-26  7:45     ` Yu Kuai
  2025-05-27  8:21       ` Christoph Hellwig
  0 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-05-26  7:45 UTC (permalink / raw)
  To: Christoph Hellwig, Yu Kuai
  Cc: xni, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/05/26 14:32, Christoph Hellwig 写道:
> On Sat, May 24, 2025 at 02:13:03PM +0800, Yu Kuai wrote:
>> +  consistency_policy
> 
> .. these doc changes look unrelated, or am I missing something?

The position are moved to the front of the bitmap fields, because now
bitmap/xxx is not always here.

Before:

All md devices contain:
	level
	...
	bitmap/xxx
	bitmap/xxx
	consistency_policy
	uuid

After:
All md devices contain:
	level
	...
	consistency_policy
	uuid
	bitmap_type
		none xxx
		bitmap xxx
If bitmap_type is bitmap, then the md device will also contain:
	bitmap/xxx
	bitmap/xxx

> 
>> -static void mddev_set_bitmap_ops(struct mddev *mddev, enum md_submodule_id id)
>> +static bool mddev_set_bitmap_ops(struct mddev *mddev)
>>   {
>>   	xa_lock(&md_submodule);
>> -	mddev->bitmap_ops = xa_load(&md_submodule, id);
>> +	mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
>>   	xa_unlock(&md_submodule);
>> -	if (!mddev->bitmap_ops)
>> -		pr_warn_once("md: can't find bitmap id %d\n", id);
>> +
>> +	if (!mddev->bitmap_ops) {
>> +		pr_warn_once("md: can't find bitmap id %d\n", mddev->bitmap_id);
>> +		return false;
>> +	}
>> +
>> +	return true;
> 
> This also looks unrelated and like another prep patch?

The new api will set mddev->bitmap_id, and the above change switch to
use mddev->bitmap_id to register bitmap_ops, perhaps I can factor the
change to a new prep patch, like:

md: add a new field mddev->bitmap_id

Before:
mddev_set_bitmap_ops(mddev, ID_BITMAP);

After:
mddev->bitmap_id = ID_BITMAP;
if (!mddev_set_bitmap_ops(mddev))
	return -EINVAL;

Thanks,
Kuai

> 
> .
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  2025-05-26  6:52   ` Xiao Ni
@ 2025-05-26  7:57     ` Yu Kuai
  2025-05-27  2:15       ` Xiao Ni
  0 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-05-26  7:57 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/05/26 14:52, Xiao Ni 写道:
> On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Currently bitmap_ops is registered while allocating mddev, this is fine
>> when there is only one bitmap_ops, however, after introduing a new
>> bitmap_ops, user space need a time window to choose which bitmap_ops to
>> use while creating new array.
> 
> Could you give more explanation about what the time window is? Is it
> between setting llbitmap by bitmap_type and md_bitmap_create?

The window after this patch is that user can write the new sysfs after
allocating mddev, and before running the array.
> 
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   drivers/md/md.c | 86 +++++++++++++++++++++++++++++++------------------
>>   1 file changed, 55 insertions(+), 31 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 4eb0c6effd5b..dc4b85f30e13 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -674,39 +674,50 @@ static void no_op(struct percpu_ref *r) {}
>>
>>   static bool mddev_set_bitmap_ops(struct mddev *mddev)
>>   {
>> +       struct bitmap_operations *old = mddev->bitmap_ops;
>> +       struct md_submodule_head *head;
>> +
>> +       if (mddev->bitmap_id == ID_BITMAP_NONE ||
>> +           (old && old->head.id == mddev->bitmap_id))
>> +               return true;
>> +
>>          xa_lock(&md_submodule);
>> -       mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
>> +       head = xa_load(&md_submodule, mddev->bitmap_id);
>>          xa_unlock(&md_submodule);
>>
>> -       if (!mddev->bitmap_ops) {
>> -               pr_warn_once("md: can't find bitmap id %d\n", mddev->bitmap_id);
>> +       if (WARN_ON_ONCE(!head || head->type != MD_BITMAP)) {
>> +               pr_err("md: can't find bitmap id %d\n", mddev->bitmap_id);
>>                  return false;
>>          }
>>
>> +       if (old && old->group)
>> +               sysfs_remove_group(&mddev->kobj, old->group);
> 
> I think you're handling a competition problem here. But I don't know
> how the old/old->group is already created when creating an array.
> Could you explain this?

It's not possible now, this is because I think we want to be able to
switch existing array with old bitmap to new bitmap.

Thanks,
Kuai

> 
> Regards
> Xiao


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type
  2025-05-26  5:11       ` Xiao Ni
@ 2025-05-26  8:02         ` Yu Kuai
  0 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-26  8:02 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/05/26 13:11, Xiao Ni 写道:
> On Mon, May 26, 2025 at 9:14 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2025/05/26 0:32, Xiao Ni 写道:
>>>> The api will be used by mdadm to set bitmap_ops while creating new array
>>> Hi Kuai
>>>
>>> Maybe you want to say "set bitmap type" here? And can you explain more
>>> here, why does it need this sys file while creating a new array? The
>>> reason I ask is that it doesn't use a sys file when creating an array
>>> with bitmap.
>>
>> I do mean mddev->bitmap_ops here, this is the same as mddev->pers and
>> the md/level api. The mdadm patch will write the new helper before
>> running array.
> 
> + if (s->btype == BitmapLockless &&
> +    sysfs_set_str(&info, NULL, "bitmap_type", "llbitmap") < 0)
> + goto abort_locked;
> 
> The three lines of code are in the Create function. From an intuitive
> perspective, it's used to set bitmap type to llbitmap rather than
> bitmap ops. And in this patch, it adds the bitmap_type sysfs api to
> set mddev->bitmap_id. After adding some debug logs, I understand you.
> It's better to describe here more. Because the sysfs file api is used
> to set bitmap type. Then it can be used to choose the bitmap ops when
> creating array in md_create_bitmap
> 

Yes, sorry about the misleading, we're setting mddev->bitmap_id by
sysfs, and mddev->bitmap_ops by mddev->bitmap_id later in kernel(the
next patch).

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 12/23] md/md-bitmap: add macros for lockless bitmap
  2025-05-26  6:40   ` Christoph Hellwig
@ 2025-05-26  8:12     ` Yu Kuai
  2025-05-27  8:22       ` Christoph Hellwig
  0 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-05-26  8:12 UTC (permalink / raw)
  To: Christoph Hellwig, Yu Kuai
  Cc: xni, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/05/26 14:40, Christoph Hellwig 写道:
> On Sat, May 24, 2025 at 02:13:09PM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Also move other values to md-bitmap.h and update comments.
> 
> Hmm.  The commit message looks very confusing to me.
> 
> I think this should be two patches:
> 
>   1) move defines relevant to the disk format from md-bitmap.c to md-bitmap.h
>   2) add new bits for llbitmap (and explain what they are).

OK.

> 
>> +#define BITMAP_SB_SIZE 1024
> 
> And while we're at it: this is still duplicated in llbitmap.c later.
> But shouldn't it simply be replaced with a sizeof on struct bitmap_super_s?

Sorry that I forgot to explain why it's still in .c

sizeof(struct bitmap_super_s) is actually 256 bytes, while by default,
1k is reserved, perhaps I can name it as BITMAP_DATA_OFFSET ?

0-255B		bitmap_super_s
256-1023B	hole
1024-[space]	bitmap
[space] - 128k	hole

BTW, the new bitmap only support the default offset and space, user
can't configure it manually.

Thanks,
Kuai

> 
> (and when cleaning thing up, rename that to bitmap_super without
> the _s and use it instead of the typedef at least for all new code)?
> .
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  2025-05-26  7:57     ` Yu Kuai
@ 2025-05-27  2:15       ` Xiao Ni
  2025-05-27  2:49         ` Yu Kuai
  0 siblings, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-05-27  2:15 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)


在 2025/5/26 下午3:57, Yu Kuai 写道:
> Hi,
>
> 在 2025/05/26 14:52, Xiao Ni 写道:
>> On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> Currently bitmap_ops is registered while allocating mddev, this is fine
>>> when there is only one bitmap_ops, however, after introduing a new
>>> bitmap_ops, user space need a time window to choose which bitmap_ops to
>>> use while creating new array.
>>
>> Could you give more explanation about what the time window is? Is it
>> between setting llbitmap by bitmap_type and md_bitmap_create?
>
> The window after this patch is that user can write the new sysfs after
> allocating mddev, and before running the array.


Thanks for the explanation. Is it ok to add it in the commit log message?

>>
>>>
>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>> ---
>>>   drivers/md/md.c | 86 
>>> +++++++++++++++++++++++++++++++------------------
>>>   1 file changed, 55 insertions(+), 31 deletions(-)
>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index 4eb0c6effd5b..dc4b85f30e13 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -674,39 +674,50 @@ static void no_op(struct percpu_ref *r) {}
>>>
>>>   static bool mddev_set_bitmap_ops(struct mddev *mddev)
>>>   {
>>> +       struct bitmap_operations *old = mddev->bitmap_ops;
>>> +       struct md_submodule_head *head;
>>> +
>>> +       if (mddev->bitmap_id == ID_BITMAP_NONE ||
>>> +           (old && old->head.id == mddev->bitmap_id))
>>> +               return true;
>>> +
>>>          xa_lock(&md_submodule);
>>> -       mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
>>> +       head = xa_load(&md_submodule, mddev->bitmap_id);
>>>          xa_unlock(&md_submodule);
>>>
>>> -       if (!mddev->bitmap_ops) {
>>> -               pr_warn_once("md: can't find bitmap id %d\n", 
>>> mddev->bitmap_id);
>>> +       if (WARN_ON_ONCE(!head || head->type != MD_BITMAP)) {
>>> +               pr_err("md: can't find bitmap id %d\n", 
>>> mddev->bitmap_id);
>>>                  return false;
>>>          }
>>>
>>> +       if (old && old->group)
>>> +               sysfs_remove_group(&mddev->kobj, old->group);
>>
>> I think you're handling a competition problem here. But I don't know
>> how the old/old->group is already created when creating an array.
>> Could you explain this?
>
> It's not possible now, this is because I think we want to be able to
> switch existing array with old bitmap to new bitmap.


Can we add the check of old when we really want it?

Regards

Xiao

>
> Thanks,
> Kuai
>
>>
>> Regards
>> Xiao
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 09/23] md/md-bitmap: add a new method blocks_synced() in bitmap_operations
  2025-05-24  6:13 ` [PATCH 09/23] md/md-bitmap: add a new method blocks_synced() " Yu Kuai
@ 2025-05-27  2:35   ` Xiao Ni
  2025-05-27  2:48     ` Yu Kuai
  2025-05-27  6:16   ` Hannes Reinecke
  1 sibling, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-05-27  2:35 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, yukuai3, linux-doc, linux-kernel, linux-raid,
	yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Currently, raid456 must perform a whole array initial recovery to build
> initail xor data, then IO to the array won't have to read all the blocks
> in underlying disks.
>
> This behavior will affect IO performance a lot, and nowadays there are
> huge disks and the initial recovery can take a long time. Hence llbitmap
> will support lazy initial recovery in following patches. This method is
> used to check if data blocks is synced or not, if not then IO will still
> have to read all blocks for raid456.

Hi Kuai

In function handle_stripe_dirtying, if the io is behind resync, it
will force rcw. Does this interface have the same function?

Regards
Xiao
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/md/md-bitmap.h | 1 +
>  drivers/md/raid5.c     | 6 ++++++
>  2 files changed, 7 insertions(+)
>
> diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
> index 0de14d475ad3..f2d79c8a23b7 100644
> --- a/drivers/md/md-bitmap.h
> +++ b/drivers/md/md-bitmap.h
> @@ -99,6 +99,7 @@ struct bitmap_operations {
>         md_bitmap_fn *end_discard;
>
>         sector_t (*skip_sync_blocks)(struct mddev *mddev, sector_t offset);
> +       bool (*blocks_synced)(struct mddev *mddev, sector_t offset);
>         bool (*start_sync)(struct mddev *mddev, sector_t offset,
>                            sector_t *blocks, bool degraded);
>         void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 7e66a99f29af..e5d3d8facb4b 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3748,6 +3748,7 @@ static int want_replace(struct stripe_head *sh, int disk_idx)
>  static int need_this_block(struct stripe_head *sh, struct stripe_head_state *s,
>                            int disk_idx, int disks)
>  {
> +       struct mddev *mddev = sh->raid_conf->mddev;
>         struct r5dev *dev = &sh->dev[disk_idx];
>         struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]],
>                                   &sh->dev[s->failed_num[1]] };
> @@ -3762,6 +3763,11 @@ static int need_this_block(struct stripe_head *sh, struct stripe_head_state *s,
>                  */
>                 return 0;
>
> +       /* The initial recover is not done, must read everything */
> +       if (mddev->bitmap_ops && mddev->bitmap_ops->blocks_synced &&
> +           !mddev->bitmap_ops->blocks_synced(mddev, sh->sector))
> +               return 1;
> +
>         if (dev->toread ||
>             (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)))
>                 /* We need this block to directly satisfy a request */
> --
> 2.39.2
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 09/23] md/md-bitmap: add a new method blocks_synced() in bitmap_operations
  2025-05-27  2:35   ` Xiao Ni
@ 2025-05-27  2:48     ` Yu Kuai
  0 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-27  2:48 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/05/27 10:35, Xiao Ni 写道:
> On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Currently, raid456 must perform a whole array initial recovery to build
>> initail xor data, then IO to the array won't have to read all the blocks
>> in underlying disks.
>>
>> This behavior will affect IO performance a lot, and nowadays there are
>> huge disks and the initial recovery can take a long time. Hence llbitmap
>> will support lazy initial recovery in following patches. This method is
>> used to check if data blocks is synced or not, if not then IO will still
>> have to read all blocks for raid456.
> 
> Hi Kuai
> 
> In function handle_stripe_dirtying, if the io is behind resync, it
> will force rcw. Does this interface have the same function?

This api is not the same, this api is used by lazy initial recovery
for the raid5, means initial recovery is skipped and resync is not
in progress, handle_stripe_dirtying can't handle this case.

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  2025-05-27  2:15       ` Xiao Ni
@ 2025-05-27  2:49         ` Yu Kuai
  0 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-27  2:49 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/05/27 10:15, Xiao Ni 写道:
> 
> 在 2025/5/26 下午3:57, Yu Kuai 写道:
>> Hi,
>>
>> 在 2025/05/26 14:52, Xiao Ni 写道:
>>> On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>>
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> Currently bitmap_ops is registered while allocating mddev, this is fine
>>>> when there is only one bitmap_ops, however, after introduing a new
>>>> bitmap_ops, user space need a time window to choose which bitmap_ops to
>>>> use while creating new array.
>>>
>>> Could you give more explanation about what the time window is? Is it
>>> between setting llbitmap by bitmap_type and md_bitmap_create?
>>
>> The window after this patch is that user can write the new sysfs after
>> allocating mddev, and before running the array.
> 
> 
> Thanks for the explanation. Is it ok to add it in the commit log message?

ok
> 
>>>
>>>>
>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>> ---
>>>>   drivers/md/md.c | 86 
>>>> +++++++++++++++++++++++++++++++------------------
>>>>   1 file changed, 55 insertions(+), 31 deletions(-)
>>>>
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index 4eb0c6effd5b..dc4b85f30e13 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -674,39 +674,50 @@ static void no_op(struct percpu_ref *r) {}
>>>>
>>>>   static bool mddev_set_bitmap_ops(struct mddev *mddev)
>>>>   {
>>>> +       struct bitmap_operations *old = mddev->bitmap_ops;
>>>> +       struct md_submodule_head *head;
>>>> +
>>>> +       if (mddev->bitmap_id == ID_BITMAP_NONE ||
>>>> +           (old && old->head.id == mddev->bitmap_id))
>>>> +               return true;
>>>> +
>>>>          xa_lock(&md_submodule);
>>>> -       mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
>>>> +       head = xa_load(&md_submodule, mddev->bitmap_id);
>>>>          xa_unlock(&md_submodule);
>>>>
>>>> -       if (!mddev->bitmap_ops) {
>>>> -               pr_warn_once("md: can't find bitmap id %d\n", 
>>>> mddev->bitmap_id);
>>>> +       if (WARN_ON_ONCE(!head || head->type != MD_BITMAP)) {
>>>> +               pr_err("md: can't find bitmap id %d\n", 
>>>> mddev->bitmap_id);
>>>>                  return false;
>>>>          }
>>>>
>>>> +       if (old && old->group)
>>>> +               sysfs_remove_group(&mddev->kobj, old->group);
>>>
>>> I think you're handling a competition problem here. But I don't know
>>> how the old/old->group is already created when creating an array.
>>> Could you explain this?
>>
>> It's not possible now, this is because I think we want to be able to
>> switch existing array with old bitmap to new bitmap.
> 
> 
> Can we add the check of old when we really want it?

I'm fine, and there is no doubt we will want it.

Thanks,
Kuai

> 
> Regards
> 
> Xiao
> 
>>
>> Thanks,
>> Kuai
>>
>>>
>>> Regards
>>> Xiao
>>
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 01/23] md: add a new parameter 'offset' to md_super_write()
  2025-05-24  6:12 ` [PATCH 01/23] md: add a new parameter 'offset' to md_super_write() Yu Kuai
  2025-05-25 15:50   ` Xiao Ni
  2025-05-26  6:28   ` Christoph Hellwig
@ 2025-05-27  5:54   ` Hannes Reinecke
  2 siblings, 0 replies; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  5:54 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

On 5/24/25 08:12, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> The parameter is always set to 0 for now, following patches will use
> this helper to write llbitmap to underlying disks, allow writing
> dirty sectors instead of the whole page.
> 
> Also rename md_super_write to md_write_metadata since there is nothing
> super-block specific.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   drivers/md/md-bitmap.c |  3 ++-
>   drivers/md/md.c        | 28 ++++++++++++++--------------
>   drivers/md/md.h        |  5 +++--
>   3 files changed, 19 insertions(+), 17 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 02/23] md: factor out a helper raid_is_456()
  2025-05-24  6:12 ` [PATCH 02/23] md: factor out a helper raid_is_456() Yu Kuai
  2025-05-25 15:50   ` Xiao Ni
  2025-05-26  6:28   ` Christoph Hellwig
@ 2025-05-27  5:55   ` Hannes Reinecke
  2 siblings, 0 replies; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  5:55 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

On 5/24/25 08:12, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> There are no functional changes, the helper will be used by llbitmap in
> following patches.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   drivers/md/md.c | 9 +--------
>   drivers/md/md.h | 6 ++++++
>   2 files changed, 7 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 03/23] md/md-bitmap: cleanup bitmap_ops->startwrite()
  2025-05-24  6:13 ` [PATCH 03/23] md/md-bitmap: cleanup bitmap_ops->startwrite() Yu Kuai
  2025-05-25 15:51   ` Xiao Ni
  2025-05-26  6:29   ` Christoph Hellwig
@ 2025-05-27  5:56   ` Hannes Reinecke
  2 siblings, 0 replies; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  5:56 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

On 5/24/25 08:13, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> bitmap_startwrite() always return 0, and the caller doesn't check return
> value as well, hence change the method to void.
> 
> Also rename startwrite/endwrite to start_write/end_write, which is more in
> line with the usual naming convention.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   drivers/md/md-bitmap.c | 17 ++++++++---------
>   drivers/md/md-bitmap.h |  6 +++---
>   drivers/md/md.c        |  8 ++++----
>   3 files changed, 15 insertions(+), 16 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 04/23] md/md-bitmap: support discard for bitmap ops
  2025-05-24  6:13 ` [PATCH 04/23] md/md-bitmap: support discard for bitmap ops Yu Kuai
  2025-05-25 15:53   ` Xiao Ni
  2025-05-26  6:29   ` Christoph Hellwig
@ 2025-05-27  6:01   ` Hannes Reinecke
  2025-05-28  7:04   ` Glass Su
  3 siblings, 0 replies; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  6:01 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

On 5/24/25 08:13, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Use two new methods {start, end}_discard to handle discard IO, prepare
> to support new md bitmap.
> 
This actually does more than just add new methods; but not sure if one
should list all of that. Maybe just 'Add typedef for bitmap functions
and new methods to support md bitmap'.

 > Signed-off-by: Yu Kuai <yukuai3@huawei.com>> ---
>   drivers/md/md-bitmap.c |  3 +++
>   drivers/md/md-bitmap.h | 12 ++++++++----
>   drivers/md/md.c        | 15 +++++++++++----
>   drivers/md/md.h        |  1 +
>   4 files changed, 23 insertions(+), 8 deletions(-)
> 
Other than that:

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 05/23] md/md-bitmap: remove parameter slot from bitmap_create()
  2025-05-24  6:13 ` [PATCH 05/23] md/md-bitmap: remove parameter slot from bitmap_create() Yu Kuai
  2025-05-25 16:09   ` Xiao Ni
  2025-05-26  6:30   ` Christoph Hellwig
@ 2025-05-27  6:01   ` Hannes Reinecke
  2 siblings, 0 replies; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  6:01 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

On 5/24/25 08:13, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> All callers pass in '-1' for 'slot', hence it can be removed.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/md/md-bitmap.c | 6 +++---
>   drivers/md/md-bitmap.h | 2 +-
>   drivers/md/md.c        | 6 +++---
>   3 files changed, 7 insertions(+), 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type
  2025-05-24  6:13 ` [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type Yu Kuai
  2025-05-25 16:32   ` Xiao Ni
  2025-05-26  6:32   ` Christoph Hellwig
@ 2025-05-27  6:10   ` Hannes Reinecke
  2025-05-27  7:43     ` Yu Kuai
  2 siblings, 1 reply; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  6:10 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

On 5/24/25 08:13, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> The api will be used by mdadm to set bitmap_ops while creating new array
> or assemble array, prepare to add a new bitmap.
> 
> Currently available options are:
> 
> cat /sys/block/md0/md/bitmap_type
> none [bitmap]
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   Documentation/admin-guide/md.rst | 73 ++++++++++++++----------
>   drivers/md/md.c                  | 96 ++++++++++++++++++++++++++++++--
>   drivers/md/md.h                  |  2 +
>   3 files changed, 135 insertions(+), 36 deletions(-)
> 
[ .. ]
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 311e52d5173d..4eb0c6effd5b 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -672,13 +672,18 @@ static void active_io_release(struct percpu_ref *ref)
>   
>   static void no_op(struct percpu_ref *r) {}
>   
> -static void mddev_set_bitmap_ops(struct mddev *mddev, enum md_submodule_id id)
> +static bool mddev_set_bitmap_ops(struct mddev *mddev)
>   {
>   	xa_lock(&md_submodule);
> -	mddev->bitmap_ops = xa_load(&md_submodule, id);
> +	mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
>   	xa_unlock(&md_submodule);
> -	if (!mddev->bitmap_ops)
> -		pr_warn_once("md: can't find bitmap id %d\n", id);
> +
> +	if (!mddev->bitmap_ops) {
> +		pr_warn_once("md: can't find bitmap id %d\n", mddev->bitmap_id);
> +		return false;
> +	}
> +
> +	return true;
>   }
>   
>   static void mddev_clear_bitmap_ops(struct mddev *mddev)
> @@ -688,8 +693,10 @@ static void mddev_clear_bitmap_ops(struct mddev *mddev)
>   
>   int mddev_init(struct mddev *mddev)
>   {
> -	/* TODO: support more versions */
> -	mddev_set_bitmap_ops(mddev, ID_BITMAP);
> +	mddev->bitmap_id = ID_BITMAP;
> +
> +	if (!mddev_set_bitmap_ops(mddev))
> +		return -EINVAL;
>   
>   	if (percpu_ref_init(&mddev->active_io, active_io_release,
>   			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
> @@ -4155,6 +4162,82 @@ new_level_store(struct mddev *mddev, const char *buf, size_t len)
>   static struct md_sysfs_entry md_new_level =
>   __ATTR(new_level, 0664, new_level_show, new_level_store);
>   
> +static ssize_t
> +bitmap_type_show(struct mddev *mddev, char *page)
> +{
> +	struct md_submodule_head *head;
> +	unsigned long i;
> +	ssize_t len = 0;
> +
> +	if (mddev->bitmap_id == ID_BITMAP_NONE)
> +		len += sprintf(page + len, "[none] ");
> +	else
> +		len += sprintf(page + len, "none ");
> +
> +	xa_lock(&md_submodule);
> +	xa_for_each(&md_submodule, i, head) {
> +		if (head->type != MD_BITMAP)
> +			continue;
> +
> +		if (mddev->bitmap_id == head->id)
> +			len += sprintf(page + len, "[%s] ", head->name);
> +		else
> +			len += sprintf(page + len, "%s ", head->name);
> +	}
> +	xa_unlock(&md_submodule);
> +
> +	len += sprintf(page + len, "\n");
> +	return len;
> +}
> +
> +static ssize_t
> +bitmap_type_store(struct mddev *mddev, const char *buf, size_t len)
> +{
> +	struct md_submodule_head *head;
> +	enum md_submodule_id id;
> +	unsigned long i;
> +	int err;
> +
> +	if (mddev->bitmap_ops)
> +		return -EBUSY;
> +
Why isn't this protected by md_submodule lock?
The lock is taken when updating ->bitmap_ops, so I would
have expected it to be taken when checking it ...

> +	err = kstrtoint(buf, 10, &id);
> +	if (!err) {
> +		if (id == ID_BITMAP_NONE) {
> +			mddev->bitmap_id = id;
> +			return len;
> +		}
> +
> +		xa_lock(&md_submodule);
> +		head = xa_load(&md_submodule, id);
> +		xa_unlock(&md_submodule);
> +
> +		if (head && head->type == MD_BITMAP) {
> +			mddev->bitmap_id = id;
> +			return len;
> +		}
> +	}
> +
> +	if (cmd_match(buf, "none")) {
> +		mddev->bitmap_id = ID_BITMAP_NONE;
> +		return len;
> +	}
> +
That is odd coding. The 'if (!err)' condition above might
fall through to here, but then we already now that it cannot
match 'none'.
Please invert the logic, first check for 'none', and only
call kstroint if the match failed.

> +	xa_lock(&md_submodule);
> +	xa_for_each(&md_submodule, i, head) {
> +		if (head->type == MD_BITMAP && cmd_match(buf, head->name)) {
> +			mddev->bitmap_id = head->id;
> +			xa_unlock(&md_submodule);
> +			return len;
> +		}
> +	}
> +	xa_unlock(&md_submodule);
> +	return -ENOENT;
> +}
> +
> +static struct md_sysfs_entry md_bitmap_type =
> +__ATTR(bitmap_type, 0664, bitmap_type_show, bitmap_type_store);
> +
>   static ssize_t
>   layout_show(struct mddev *mddev, char *page)
>   {
> @@ -5719,6 +5802,7 @@ __ATTR(serialize_policy, S_IRUGO | S_IWUSR, serialize_policy_show,
>   static struct attribute *md_default_attrs[] = {
>   	&md_level.attr,
>   	&md_new_level.attr,
> +	&md_bitmap_type.attr,
>   	&md_layout.attr,
>   	&md_raid_disks.attr,
>   	&md_uuid.attr,
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 13e3f9ce1b79..bf34c0a36551 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -40,6 +40,7 @@ enum md_submodule_id {
>   	ID_CLUSTER,
>   	ID_BITMAP,
>   	ID_LLBITMAP,	/* TODO */
> +	ID_BITMAP_NONE,
>   };
>   
>   struct md_submodule_head {
> @@ -565,6 +566,7 @@ struct mddev {
>   	struct percpu_ref		writes_pending;
>   	int				sync_checkers;	/* # of threads checking writes_pending */
>   
> +	enum md_submodule_id		bitmap_id;
>   	void				*bitmap; /* the bitmap for the device */
>   	struct bitmap_operations	*bitmap_ops;
>   	struct {

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  2025-05-24  6:13 ` [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap Yu Kuai
  2025-05-26  6:32   ` Christoph Hellwig
  2025-05-26  6:52   ` Xiao Ni
@ 2025-05-27  6:13   ` Hannes Reinecke
  2025-05-27  7:53     ` Yu Kuai
  2 siblings, 1 reply; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  6:13 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

On 5/24/25 08:13, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Currently bitmap_ops is registered while allocating mddev, this is fine
> when there is only one bitmap_ops, however, after introduing a new
> bitmap_ops, user space need a time window to choose which bitmap_ops to
> use while creating new array.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   drivers/md/md.c | 86 +++++++++++++++++++++++++++++++------------------
>   1 file changed, 55 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 4eb0c6effd5b..dc4b85f30e13 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -674,39 +674,50 @@ static void no_op(struct percpu_ref *r) {}
>   
>   static bool mddev_set_bitmap_ops(struct mddev *mddev)
>   {
> +	struct bitmap_operations *old = mddev->bitmap_ops;
> +	struct md_submodule_head *head;
> +
> +	if (mddev->bitmap_id == ID_BITMAP_NONE ||
> +	    (old && old->head.id == mddev->bitmap_id))
> +		return true;
> +
>   	xa_lock(&md_submodule);
> -	mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
> +	head = xa_load(&md_submodule, mddev->bitmap_id);
>   	xa_unlock(&md_submodule);
>   
> -	if (!mddev->bitmap_ops) {
> -		pr_warn_once("md: can't find bitmap id %d\n", mddev->bitmap_id);
> +	if (WARN_ON_ONCE(!head || head->type != MD_BITMAP)) {
> +		pr_err("md: can't find bitmap id %d\n", mddev->bitmap_id);
>   		return false;
>   	}
>   
> +	if (old && old->group)
> +		sysfs_remove_group(&mddev->kobj, old->group);
> +
> +	mddev->bitmap_ops = (void *)head;
> +	if (mddev->bitmap_ops && mddev->bitmap_ops->group &&
> +	    sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group))
> +		pr_warn("md: cannot register extra bitmap attributes for %s\n",
> +			mdname(mddev));
> +
>   	return true;
>   }
>   
>   static void mddev_clear_bitmap_ops(struct mddev *mddev)
>   {
> +	if (mddev->bitmap_ops && mddev->bitmap_ops->group)
> +		sysfs_remove_group(&mddev->kobj, mddev->bitmap_ops->group);
> +
>   	mddev->bitmap_ops = NULL;
>   }
>   
>   int mddev_init(struct mddev *mddev)
>   {
> -	mddev->bitmap_id = ID_BITMAP;
> -
> -	if (!mddev_set_bitmap_ops(mddev))
> -		return -EINVAL;
> -
>   	if (percpu_ref_init(&mddev->active_io, active_io_release,
> -			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
> -		mddev_clear_bitmap_ops(mddev);
> +			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
>   		return -ENOMEM;
> -	}
>   
>   	if (percpu_ref_init(&mddev->writes_pending, no_op,
>   			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
> -		mddev_clear_bitmap_ops(mddev);
>   		percpu_ref_exit(&mddev->active_io);
>   		return -ENOMEM;
>   	}
> @@ -734,6 +745,7 @@ int mddev_init(struct mddev *mddev)
>   	mddev->resync_min = 0;
>   	mddev->resync_max = MaxSector;
>   	mddev->level = LEVEL_NONE;
> +	mddev->bitmap_id = ID_BITMAP;
>   
>   	INIT_WORK(&mddev->sync_work, md_start_sync);
>   	INIT_WORK(&mddev->del_work, mddev_delayed_delete);
> @@ -744,7 +756,6 @@ EXPORT_SYMBOL_GPL(mddev_init);
>   
>   void mddev_destroy(struct mddev *mddev)
>   {
> -	mddev_clear_bitmap_ops(mddev);
>   	percpu_ref_exit(&mddev->active_io);
>   	percpu_ref_exit(&mddev->writes_pending);
>   }
> @@ -6093,11 +6104,6 @@ struct mddev *md_alloc(dev_t dev, char *name)
>   		return ERR_PTR(error);
>   	}
>   
> -	if (md_bitmap_registered(mddev) && mddev->bitmap_ops->group)
> -		if (sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group))
> -			pr_warn("md: cannot register extra bitmap attributes for %s\n",
> -				mdname(mddev));
> -
>   	kobject_uevent(&mddev->kobj, KOBJ_ADD);
>   	mddev->sysfs_state = sysfs_get_dirent_safe(mddev->kobj.sd, "array_state");
>   	mddev->sysfs_level = sysfs_get_dirent_safe(mddev->kobj.sd, "level");

But now you've killed udev event processing.
Once the 'add' event is sent _all_ sysfs attributes must be present,
otherwise you'll have a race condition where udev is checking for
attributes which are present only later.

So when moving things around ensure to move the kobject_uevent() call, too.

(ideally you would set the sysfs attributes when calling 'add_device()',
but not sure if that's possible here.)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 08/23] md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
  2025-05-24  6:13 ` [PATCH 08/23] md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations Yu Kuai
  2025-05-26  7:03   ` Xiao Ni
@ 2025-05-27  6:14   ` Hannes Reinecke
  1 sibling, 0 replies; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  6:14 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

On 5/24/25 08:13, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> This method is used to check if blocks can be skipped before calling
> into pers->sync_request(), llbiltmap will use this method to skip
> resync for unwritten/clean data blocks, and recovery/check/repair for
> unwritten data blocks;
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/md/md-bitmap.h | 1 +
>   drivers/md/md.c        | 7 +++++++
>   2 files changed, 8 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 09/23] md/md-bitmap: add a new method blocks_synced() in bitmap_operations
  2025-05-24  6:13 ` [PATCH 09/23] md/md-bitmap: add a new method blocks_synced() " Yu Kuai
  2025-05-27  2:35   ` Xiao Ni
@ 2025-05-27  6:16   ` Hannes Reinecke
  1 sibling, 0 replies; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  6:16 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

On 5/24/25 08:13, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Currently, raid456 must perform a whole array initial recovery to build
> initail xor data, then IO to the array won't have to read all the blocks
> in underlying disks.
> 
> This behavior will affect IO performance a lot, and nowadays there are
> huge disks and the initial recovery can take a long time. Hence llbitmap
> will support lazy initial recovery in following patches. This method is
> used to check if data blocks is synced or not, if not then IO will still
> have to read all blocks for raid456.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/md/md-bitmap.h | 1 +
>   drivers/md/raid5.c     | 6 ++++++
>   2 files changed, 7 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 10/23] md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
  2025-05-24  6:13 ` [PATCH 10/23] md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER Yu Kuai
@ 2025-05-27  6:17   ` Hannes Reinecke
  2025-05-27  8:00     ` Yu Kuai
  0 siblings, 1 reply; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  6:17 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

On 5/24/25 08:13, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> This flag is used by llbitmap in later patches to skip raid456 initial
> recover and delay building initial xor data to first write.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/md/md.c | 12 +++++++++++-
>   drivers/md/md.h |  2 ++
>   2 files changed, 13 insertions(+), 1 deletion(-)
> 
Wouldn't it be enough to check for the 'blocks_synced' callback to check
if the array supports lazy recovery?

Otherwise:

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 11/23] md/md-bitmap: make method bitmap_ops->daemon_work optional
  2025-05-24  6:13 ` [PATCH 11/23] md/md-bitmap: make method bitmap_ops->daemon_work optional Yu Kuai
  2025-05-26  6:34   ` Christoph Hellwig
@ 2025-05-27  6:19   ` Hannes Reinecke
  2025-05-27  8:03     ` Yu Kuai
  1 sibling, 1 reply; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  6:19 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

On 5/24/25 08:13, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> daemon_work() will be called by daemon thread, on the one hand, daemon
> thread doesn't have strict wake-up time; on the other hand, too much
> work are put to daemon thread, like handle sync IO, handle failed
> or specail normal IO, handle recovery, and so on. Hence daemon thread
> may be too busy to clear dirty bits in time.
> 
> Make bitmap_ops->daemon_work() optional and following patches will use
> separate async work to clear dirty bits for the new bitmap.
> 
Why not move it to a workqueue in general?
The above argument is valid even for the current implementation, no?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 12/23] md/md-bitmap: add macros for lockless bitmap
  2025-05-24  6:13 ` [PATCH 12/23] md/md-bitmap: add macros for lockless bitmap Yu Kuai
  2025-05-26  6:40   ` Christoph Hellwig
@ 2025-05-27  6:21   ` Hannes Reinecke
  2025-05-28  4:53   ` Xiao Ni
  2 siblings, 0 replies; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  6:21 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

On 5/24/25 08:13, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Also move other values to md-bitmap.h and update comments.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   drivers/md/md-bitmap.c |  9 ---------
>   drivers/md/md-bitmap.h | 17 +++++++++++++++++
>   2 files changed, 17 insertions(+), 9 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 13/23] md/md-bitmap: fix dm-raid max_write_behind setting
  2025-05-24  6:13 ` [PATCH 13/23] md/md-bitmap: fix dm-raid max_write_behind setting Yu Kuai
  2025-05-26  6:40   ` Christoph Hellwig
@ 2025-05-27  6:21   ` Hannes Reinecke
  1 sibling, 0 replies; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  6:21 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

On 5/24/25 08:13, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> It's supposed to be COUNTER_MAX / 2, not COUNTER_MAX.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   drivers/md/md-bitmap.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 689d5dba9328..535bc1888e8c 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -777,7 +777,7 @@ static int md_bitmap_new_disk_sb(struct bitmap *bitmap)
>   	 * is a good choice?  We choose COUNTER_MAX / 2 arbitrarily.
>   	 */
>   	write_behind = bitmap->mddev->bitmap_info.max_write_behind;
> -	if (write_behind > COUNTER_MAX)
> +	if (write_behind > COUNTER_MAX / 2)
>   		write_behind = COUNTER_MAX / 2;
>   	sb->write_behind = cpu_to_le32(write_behind);
>   	bitmap->mddev->bitmap_info.max_write_behind = write_behind;

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 14/23] md/dm-raid: remove max_write_behind setting limit
  2025-05-24  6:13 ` [PATCH 14/23] md/dm-raid: remove max_write_behind setting limit Yu Kuai
  2025-05-26  6:41   ` Christoph Hellwig
@ 2025-05-27  6:26   ` Hannes Reinecke
  2025-05-28  4:58   ` Xiao Ni
  2 siblings, 0 replies; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  6:26 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

On 5/24/25 08:13, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> The comments said 'vaule in kB', while the value actually means the
> number of write_behind IOs. And since md-bitmap will automatically
> adjust the value to max COUNTER_MAX / 2, there is no need to fail
> early.
> 
> Also move some macros that is only used md-bitmap.c.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   drivers/md/dm-raid.c   |  6 +-----
>   drivers/md/md-bitmap.c | 10 ++++++++++
>   drivers/md/md-bitmap.h |  9 ---------
>   3 files changed, 11 insertions(+), 14 deletions(-)
> 

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type
  2025-05-27  6:10   ` Hannes Reinecke
@ 2025-05-27  7:43     ` Yu Kuai
  0 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-27  7:43 UTC (permalink / raw)
  To: Hannes Reinecke, Yu Kuai, hch, xni, colyli, song
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

在 2025/05/27 14:10, Hannes Reinecke 写道:
> On 5/24/25 08:13, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> The api will be used by mdadm to set bitmap_ops while creating new array
>> or assemble array, prepare to add a new bitmap.
>>
>> Currently available options are:
>>
>> cat /sys/block/md0/md/bitmap_type
>> none [bitmap]
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   Documentation/admin-guide/md.rst | 73 ++++++++++++++----------
>>   drivers/md/md.c                  | 96 ++++++++++++++++++++++++++++++--
>>   drivers/md/md.h                  |  2 +
>>   3 files changed, 135 insertions(+), 36 deletions(-)
>>
> [ .. ]
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 311e52d5173d..4eb0c6effd5b 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -672,13 +672,18 @@ static void active_io_release(struct percpu_ref 
>> *ref)
>>   static void no_op(struct percpu_ref *r) {}
>> -static void mddev_set_bitmap_ops(struct mddev *mddev, enum 
>> md_submodule_id id)
>> +static bool mddev_set_bitmap_ops(struct mddev *mddev)
>>   {
>>       xa_lock(&md_submodule);
>> -    mddev->bitmap_ops = xa_load(&md_submodule, id);
>> +    mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
>>       xa_unlock(&md_submodule);
>> -    if (!mddev->bitmap_ops)
>> -        pr_warn_once("md: can't find bitmap id %d\n", id);
>> +
>> +    if (!mddev->bitmap_ops) {
>> +        pr_warn_once("md: can't find bitmap id %d\n", mddev->bitmap_id);
>> +        return false;
>> +    }
>> +
>> +    return true;
>>   }
>>   static void mddev_clear_bitmap_ops(struct mddev *mddev)
>> @@ -688,8 +693,10 @@ static void mddev_clear_bitmap_ops(struct mddev 
>> *mddev)
>>   int mddev_init(struct mddev *mddev)
>>   {
>> -    /* TODO: support more versions */
>> -    mddev_set_bitmap_ops(mddev, ID_BITMAP);
>> +    mddev->bitmap_id = ID_BITMAP;
>> +
>> +    if (!mddev_set_bitmap_ops(mddev))
>> +        return -EINVAL;
>>       if (percpu_ref_init(&mddev->active_io, active_io_release,
>>                   PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
>> @@ -4155,6 +4162,82 @@ new_level_store(struct mddev *mddev, const char 
>> *buf, size_t len)
>>   static struct md_sysfs_entry md_new_level =
>>   __ATTR(new_level, 0664, new_level_show, new_level_store);
>> +static ssize_t
>> +bitmap_type_show(struct mddev *mddev, char *page)
>> +{
>> +    struct md_submodule_head *head;
>> +    unsigned long i;
>> +    ssize_t len = 0;
>> +
>> +    if (mddev->bitmap_id == ID_BITMAP_NONE)
>> +        len += sprintf(page + len, "[none] ");
>> +    else
>> +        len += sprintf(page + len, "none ");
>> +
>> +    xa_lock(&md_submodule);
>> +    xa_for_each(&md_submodule, i, head) {
>> +        if (head->type != MD_BITMAP)
>> +            continue;
>> +
>> +        if (mddev->bitmap_id == head->id)
>> +            len += sprintf(page + len, "[%s] ", head->name);
>> +        else
>> +            len += sprintf(page + len, "%s ", head->name);
>> +    }
>> +    xa_unlock(&md_submodule);
>> +
>> +    len += sprintf(page + len, "\n");
>> +    return len;
>> +}
>> +
>> +static ssize_t
>> +bitmap_type_store(struct mddev *mddev, const char *buf, size_t len)
>> +{
>> +    struct md_submodule_head *head;
>> +    enum md_submodule_id id;
>> +    unsigned long i;
>> +    int err;
>> +
>> +    if (mddev->bitmap_ops)
>> +        return -EBUSY;
>> +
> Why isn't this protected by md_submodule lock?
> The lock is taken when updating ->bitmap_ops, so I would
> have expected it to be taken when checking it ...

The design is that when bitmap is created, user can no longer set
bitmap_id, and it's right without the protecting there will be race
window.

> 
>> +    err = kstrtoint(buf, 10, &id);
>> +    if (!err) {
>> +        if (id == ID_BITMAP_NONE) {
>> +            mddev->bitmap_id = id;
>> +            return len;
>> +        }
>> +
>> +        xa_lock(&md_submodule);
>> +        head = xa_load(&md_submodule, id);
>> +        xa_unlock(&md_submodule);
>> +
>> +        if (head && head->type == MD_BITMAP) {
>> +            mddev->bitmap_id = id;
>> +            return len;
>> +        }
>> +    }
>> +
>> +    if (cmd_match(buf, "none")) {
>> +        mddev->bitmap_id = ID_BITMAP_NONE;
>> +        return len;
>> +    }
>> +
> That is odd coding. The 'if (!err)' condition above might
> fall through to here, but then we already now that it cannot
> match 'none'.

The first kstrtoint() is trying to convert the input string to int id,
looks like I missed return -EINVAL if the id can't be found in
md_submodule.

> Please invert the logic, first check for 'none', and only
> call kstroint if the match failed.

Sure, this sounds better.

Thanks for the review!
Kuai

> 
>> +    xa_lock(&md_submodule);
>> +    xa_for_each(&md_submodule, i, head) {
>> +        if (head->type == MD_BITMAP && cmd_match(buf, head->name)) {
>> +            mddev->bitmap_id = head->id;
>> +            xa_unlock(&md_submodule);
>> +            return len;
>> +        }
>> +    }
>> +    xa_unlock(&md_submodule);
>> +    return -ENOENT;
>> +}
>> +
>> +static struct md_sysfs_entry md_bitmap_type =
>> +__ATTR(bitmap_type, 0664, bitmap_type_show, bitmap_type_store);
>> +
>>   static ssize_t
>>   layout_show(struct mddev *mddev, char *page)
>>   {
>> @@ -5719,6 +5802,7 @@ __ATTR(serialize_policy, S_IRUGO | S_IWUSR, 
>> serialize_policy_show,
>>   static struct attribute *md_default_attrs[] = {
>>       &md_level.attr,
>>       &md_new_level.attr,
>> +    &md_bitmap_type.attr,
>>       &md_layout.attr,
>>       &md_raid_disks.attr,
>>       &md_uuid.attr,
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 13e3f9ce1b79..bf34c0a36551 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -40,6 +40,7 @@ enum md_submodule_id {
>>       ID_CLUSTER,
>>       ID_BITMAP,
>>       ID_LLBITMAP,    /* TODO */
>> +    ID_BITMAP_NONE,
>>   };
>>   struct md_submodule_head {
>> @@ -565,6 +566,7 @@ struct mddev {
>>       struct percpu_ref        writes_pending;
>>       int                sync_checkers;    /* # of threads checking 
>> writes_pending */
>> +    enum md_submodule_id        bitmap_id;
>>       void                *bitmap; /* the bitmap for the device */
>>       struct bitmap_operations    *bitmap_ops;
>>       struct {
> 
> Cheers,
> 
> Hannes


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  2025-05-27  6:13   ` Hannes Reinecke
@ 2025-05-27  7:53     ` Yu Kuai
  2025-05-27  8:54       ` Hannes Reinecke
  0 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-05-27  7:53 UTC (permalink / raw)
  To: Hannes Reinecke, Yu Kuai, hch, xni, colyli, song
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

在 2025/05/27 14:13, Hannes Reinecke 写道:
> On 5/24/25 08:13, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Currently bitmap_ops is registered while allocating mddev, this is fine
>> when there is only one bitmap_ops, however, after introduing a new
>> bitmap_ops, user space need a time window to choose which bitmap_ops to
>> use while creating new array.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   drivers/md/md.c | 86 +++++++++++++++++++++++++++++++------------------
>>   1 file changed, 55 insertions(+), 31 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 4eb0c6effd5b..dc4b85f30e13 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -674,39 +674,50 @@ static void no_op(struct percpu_ref *r) {}
>>   static bool mddev_set_bitmap_ops(struct mddev *mddev)
>>   {
>> +    struct bitmap_operations *old = mddev->bitmap_ops;
>> +    struct md_submodule_head *head;
>> +
>> +    if (mddev->bitmap_id == ID_BITMAP_NONE ||
>> +        (old && old->head.id == mddev->bitmap_id))
>> +        return true;
>> +
>>       xa_lock(&md_submodule);
>> -    mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
>> +    head = xa_load(&md_submodule, mddev->bitmap_id);
>>       xa_unlock(&md_submodule);
>> -    if (!mddev->bitmap_ops) {
>> -        pr_warn_once("md: can't find bitmap id %d\n", mddev->bitmap_id);
>> +    if (WARN_ON_ONCE(!head || head->type != MD_BITMAP)) {
>> +        pr_err("md: can't find bitmap id %d\n", mddev->bitmap_id);
>>           return false;
>>       }
>> +    if (old && old->group)
>> +        sysfs_remove_group(&mddev->kobj, old->group);
>> +
>> +    mddev->bitmap_ops = (void *)head;
>> +    if (mddev->bitmap_ops && mddev->bitmap_ops->group &&
>> +        sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group))
>> +        pr_warn("md: cannot register extra bitmap attributes for %s\n",
>> +            mdname(mddev));
>> +
>>       return true;
>>   }
>>   static void mddev_clear_bitmap_ops(struct mddev *mddev)
>>   {
>> +    if (mddev->bitmap_ops && mddev->bitmap_ops->group)
>> +        sysfs_remove_group(&mddev->kobj, mddev->bitmap_ops->group);
>> +
>>       mddev->bitmap_ops = NULL;
>>   }
>>   int mddev_init(struct mddev *mddev)
>>   {
>> -    mddev->bitmap_id = ID_BITMAP;
>> -
>> -    if (!mddev_set_bitmap_ops(mddev))
>> -        return -EINVAL;
>> -
>>       if (percpu_ref_init(&mddev->active_io, active_io_release,
>> -                PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
>> -        mddev_clear_bitmap_ops(mddev);
>> +                PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
>>           return -ENOMEM;
>> -    }
>>       if (percpu_ref_init(&mddev->writes_pending, no_op,
>>                   PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
>> -        mddev_clear_bitmap_ops(mddev);
>>           percpu_ref_exit(&mddev->active_io);
>>           return -ENOMEM;
>>       }
>> @@ -734,6 +745,7 @@ int mddev_init(struct mddev *mddev)
>>       mddev->resync_min = 0;
>>       mddev->resync_max = MaxSector;
>>       mddev->level = LEVEL_NONE;
>> +    mddev->bitmap_id = ID_BITMAP;
>>       INIT_WORK(&mddev->sync_work, md_start_sync);
>>       INIT_WORK(&mddev->del_work, mddev_delayed_delete);
>> @@ -744,7 +756,6 @@ EXPORT_SYMBOL_GPL(mddev_init);
>>   void mddev_destroy(struct mddev *mddev)
>>   {
>> -    mddev_clear_bitmap_ops(mddev);
>>       percpu_ref_exit(&mddev->active_io);
>>       percpu_ref_exit(&mddev->writes_pending);
>>   }
>> @@ -6093,11 +6104,6 @@ struct mddev *md_alloc(dev_t dev, char *name)
>>           return ERR_PTR(error);
>>       }
>> -    if (md_bitmap_registered(mddev) && mddev->bitmap_ops->group)
>> -        if (sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group))
>> -            pr_warn("md: cannot register extra bitmap attributes for 
>> %s\n",
>> -                mdname(mddev));
>> -
>>       kobject_uevent(&mddev->kobj, KOBJ_ADD);
>>       mddev->sysfs_state = sysfs_get_dirent_safe(mddev->kobj.sd, 
>> "array_state");
>>       mddev->sysfs_level = sysfs_get_dirent_safe(mddev->kobj.sd, 
>> "level");
> 
> But now you've killed udev event processing.
> Once the 'add' event is sent _all_ sysfs attributes must be present,
> otherwise you'll have a race condition where udev is checking for
> attributes which are present only later.
> 
> So when moving things around ensure to move the kobject_uevent() call, too.

I do not expect the bitmap entries are checked by udev, otherwise this
set can introduce regressions since the bitmap entries are no longer
existed after using the new biltmap.

And the above KOBJ_ADD uevent is used for mddev->kobj, right? In this
case, we're creating new entries under mddev->kobj, should this be
KOBJ_CHANGE?

Thanks,
Kuai

> 
> (ideally you would set the sysfs attributes when calling 'add_device()',
> but not sure if that's possible here.)
> 
> Cheers,
> 
> Hannes


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 10/23] md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
  2025-05-27  6:17   ` Hannes Reinecke
@ 2025-05-27  8:00     ` Yu Kuai
  0 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-27  8:00 UTC (permalink / raw)
  To: Hannes Reinecke, Yu Kuai, hch, xni, colyli, song
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

在 2025/05/27 14:17, Hannes Reinecke 写道:
> On 5/24/25 08:13, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> This flag is used by llbitmap in later patches to skip raid456 initial
>> recover and delay building initial xor data to first write.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> ---
>>   drivers/md/md.c | 12 +++++++++++-
>>   drivers/md/md.h |  2 ++
>>   2 files changed, 13 insertions(+), 1 deletion(-)
>>
> Wouldn't it be enough to check for the 'blocks_synced' callback to check
> if the array supports lazy recovery?

I think no, just to check the array supports lazy recovery is not
enough, we still have to distinguish the normal recovery and the lazy
recovery by new bitmap. For example:

+		if (test_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery) &&
+		    start == MaxSector)
+			start = 0;

For normal recovery, there is nothing to do, and for lazy recovery,
we'll register a new sync_thread later to recover bits that are
written the first time.

Thanks,
Kuai

> 
> Otherwise:
> 
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> 
> Cheers,
> 
> Hannes


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 11/23] md/md-bitmap: make method bitmap_ops->daemon_work optional
  2025-05-27  6:19   ` Hannes Reinecke
@ 2025-05-27  8:03     ` Yu Kuai
  2025-05-27  8:55       ` Hannes Reinecke
  0 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-05-27  8:03 UTC (permalink / raw)
  To: Hannes Reinecke, Yu Kuai, hch, xni, colyli, song
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

在 2025/05/27 14:19, Hannes Reinecke 写道:
> On 5/24/25 08:13, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> daemon_work() will be called by daemon thread, on the one hand, daemon
>> thread doesn't have strict wake-up time; on the other hand, too much
>> work are put to daemon thread, like handle sync IO, handle failed
>> or specail normal IO, handle recovery, and so on. Hence daemon thread
>> may be too busy to clear dirty bits in time.
>>
>> Make bitmap_ops->daemon_work() optional and following patches will use
>> separate async work to clear dirty bits for the new bitmap.
>>
> Why not move it to a workqueue in general?
> The above argument is valid even for the current implementation, no?

Yes, and however, I'll prefer not to touch current implementaion :(
This is trivial comparing to other flaws like global spinlock.

Thanks,
Kuai

> 
> Cheers,
> 
> Hannes


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type
  2025-05-26  7:45     ` Yu Kuai
@ 2025-05-27  8:21       ` Christoph Hellwig
  0 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-27  8:21 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Christoph Hellwig, xni, colyli, song, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

On Mon, May 26, 2025 at 03:45:39PM +0800, Yu Kuai wrote:
> Hi,
>
> 在 2025/05/26 14:32, Christoph Hellwig 写道:
>> On Sat, May 24, 2025 at 02:13:03PM +0800, Yu Kuai wrote:
>>> +  consistency_policy
>>
>> .. these doc changes look unrelated, or am I missing something?
>
> The position are moved to the front of the bitmap fields, because now
> bitmap/xxx is not always here.

Ah, ok.


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 12/23] md/md-bitmap: add macros for lockless bitmap
  2025-05-26  8:12     ` Yu Kuai
@ 2025-05-27  8:22       ` Christoph Hellwig
  0 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-27  8:22 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Christoph Hellwig, xni, colyli, song, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

On Mon, May 26, 2025 at 04:12:06PM +0800, Yu Kuai wrote:
>>> +#define BITMAP_SB_SIZE 1024
>>
>> And while we're at it: this is still duplicated in llbitmap.c later.
>> But shouldn't it simply be replaced with a sizeof on struct bitmap_super_s?
>
> Sorry that I forgot to explain why it's still in .c
>
> sizeof(struct bitmap_super_s) is actually 256 bytes, while by default,
> 1k is reserved, perhaps I can name it as BITMAP_DATA_OFFSET ?

Sounds good.


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 15/23] md/md-llbitmap: implement llbitmap IO
  2025-05-24  6:13 ` [PATCH 15/23] md/md-llbitmap: implement llbitmap IO Yu Kuai
@ 2025-05-27  8:27   ` Christoph Hellwig
  2025-05-27  8:55     ` Yu Kuai
  2025-06-06  3:21   ` Xiao Ni
  2025-06-30  2:07   ` Xiao Ni
  2 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-27  8:27 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, xni, colyli, song, yukuai3, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi

FYI, I still find splitting the additioon of the new md-llbitmap.c into
multiple patches not helpful for reviewing it.  I'm mostly reviewing
the applied code and hope I didn't forget to place anything into the
right mail.

> diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
> new file mode 100644
> index 000000000000..1a01b6777527
> --- /dev/null
> +++ b/drivers/md/md-llbitmap.c
> @@ -0,0 +1,571 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#ifdef CONFIG_MD_LLBITMAP

Please don't ifdef the entire code in a sourc file, instead just compile
it conditionally:


md-mod-y        += md.o md-bitmap.o 
md-mod-$(CONFIG_MD_LLBITMAP) += md-llbitmap.o

> +	BitNeedSync,
> +	/* data is synchronizing */
> +	BitSyncing,
> +	nr_llbitmap_state,

Any reason nr_llbitmap_state, doesn't follow the naming scheme of the other
bits,?

> +	BitmapActionStale,
> +	nr_llbitmap_action,

Same here?

> +			if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))

Overly long line.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 23/23] md/md-llbitmap: add Kconfig
  2025-05-24  6:13 ` [PATCH 23/23] md/md-llbitmap: add Kconfig Yu Kuai
@ 2025-05-27  8:29   ` Christoph Hellwig
  2025-05-27  9:00     ` Yu Kuai
  0 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2025-05-27  8:29 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, xni, colyli, song, yukuai3, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 02:13:20PM +0800, Yu Kuai wrote:
>  	MD_PERSONALITY = 0,
>  	MD_CLUSTER,
> -	MD_BITMAP, /* TODO */
> +	MD_BITMAP,
>  };
>  
>  enum md_submodule_id {
> @@ -39,7 +39,7 @@ enum md_submodule_id {
>  	ID_RAID10	= 10,
>  	ID_CLUSTER,
>  	ID_BITMAP,
> -	ID_LLBITMAP,	/* TODO */
> +	ID_LLBITMAP,

Please just drop the TODO annotation from the initial patch.


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  2025-05-27  7:53     ` Yu Kuai
@ 2025-05-27  8:54       ` Hannes Reinecke
  0 siblings, 0 replies; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  8:54 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

On 5/27/25 09:53, Yu Kuai wrote:
> Hi,
> 
> 在 2025/05/27 14:13, Hannes Reinecke 写道:
>> On 5/24/25 08:13, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> Currently bitmap_ops is registered while allocating mddev, this is fine
>>> when there is only one bitmap_ops, however, after introduing a new
>>> bitmap_ops, user space need a time window to choose which bitmap_ops to
>>> use while creating new array.
>>>
[ .. ]
>>> @@ -6093,11 +6104,6 @@ struct mddev *md_alloc(dev_t dev, char *name)
>>>           return ERR_PTR(error);
>>>       }
>>> -    if (md_bitmap_registered(mddev) && mddev->bitmap_ops->group)
>>> -        if (sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group))
>>> -            pr_warn("md: cannot register extra bitmap attributes for 
>>> %s\n",
>>> -                mdname(mddev));
>>> -
>>>       kobject_uevent(&mddev->kobj, KOBJ_ADD);
>>>       mddev->sysfs_state = sysfs_get_dirent_safe(mddev->kobj.sd, 
>>> "array_state");
>>>       mddev->sysfs_level = sysfs_get_dirent_safe(mddev->kobj.sd, 
>>> "level");
>>
>> But now you've killed udev event processing.
>> Once the 'add' event is sent _all_ sysfs attributes must be present,
>> otherwise you'll have a race condition where udev is checking for
>> attributes which are present only later.
>>
>> So when moving things around ensure to move the kobject_uevent() call, 
>> too.
> 
> I do not expect the bitmap entries are checked by udev, otherwise this
> set can introduce regressions since the bitmap entries are no longer
> existed after using the new biltmap.
> 
> And the above KOBJ_ADD uevent is used for mddev->kobj, right? In this
> case, we're creating new entries under mddev->kobj, should this be
> KOBJ_CHANGE?
> 
Yes, please.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 11/23] md/md-bitmap: make method bitmap_ops->daemon_work optional
  2025-05-27  8:03     ` Yu Kuai
@ 2025-05-27  8:55       ` Hannes Reinecke
  0 siblings, 0 replies; 108+ messages in thread
From: Hannes Reinecke @ 2025-05-27  8:55 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

On 5/27/25 10:03, Yu Kuai wrote:
> Hi,
> 
> 在 2025/05/27 14:19, Hannes Reinecke 写道:
>> On 5/24/25 08:13, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> daemon_work() will be called by daemon thread, on the one hand, daemon
>>> thread doesn't have strict wake-up time; on the other hand, too much
>>> work are put to daemon thread, like handle sync IO, handle failed
>>> or specail normal IO, handle recovery, and so on. Hence daemon thread
>>> may be too busy to clear dirty bits in time.
>>>
>>> Make bitmap_ops->daemon_work() optional and following patches will use
>>> separate async work to clear dirty bits for the new bitmap.
>>>
>> Why not move it to a workqueue in general?
>> The above argument is valid even for the current implementation, no?
> 
> Yes, and however, I'll prefer not to touch current implementaion :(
> This is trivial comparing to other flaws like global spinlock.
> 
Fair enough.

You can add:

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 15/23] md/md-llbitmap: implement llbitmap IO
  2025-05-27  8:27   ` Christoph Hellwig
@ 2025-05-27  8:55     ` Yu Kuai
  2025-05-27  8:58       ` Yu Kuai
  0 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-05-27  8:55 UTC (permalink / raw)
  To: Christoph Hellwig, Yu Kuai
  Cc: xni, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/05/27 16:27, Christoph Hellwig 写道:
> FYI, I still find splitting the additioon of the new md-llbitmap.c into
> multiple patches not helpful for reviewing it.  I'm mostly reviewing
> the applied code and hope I didn't forget to place anything into the
> right mail.
> 
>> diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
>> new file mode 100644
>> index 000000000000..1a01b6777527
>> --- /dev/null
>> +++ b/drivers/md/md-llbitmap.c
>> @@ -0,0 +1,571 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +
>> +#ifdef CONFIG_MD_LLBITMAP
> 
> Please don't ifdef the entire code in a sourc file, instead just compile
> it conditionally:
> 
> 
> md-mod-y        += md.o md-bitmap.o
> md-mod-$(CONFIG_MD_LLBITMAP) += md-llbitmap.o
> 

Thanks for the suggestion, this is indeed better.

>> +	BitNeedSync,
>> +	/* data is synchronizing */
>> +	BitSyncing,
>> +	nr_llbitmap_state,
> 
> Any reason nr_llbitmap_state, doesn't follow the naming scheme of the other
> bits,?

I'm following the enum name(enum llbitmap_state) here, because this is
the number to total bits, not a meaningful bit.

Do you prefer a name like BitStateCount?

Thanks,
Kuai

> 
>> +	BitmapActionStale,
>> +	nr_llbitmap_action,
> 
> Same here?
> 
>> +			if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
> 
> Overly long line.
> .
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 15/23] md/md-llbitmap: implement llbitmap IO
  2025-05-27  8:55     ` Yu Kuai
@ 2025-05-27  8:58       ` Yu Kuai
  0 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-27  8:58 UTC (permalink / raw)
  To: Yu Kuai, Christoph Hellwig
  Cc: xni, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/05/27 16:55, Yu Kuai 写道:
> FYI, I still find splitting the additioon of the new md-llbitmap.c into
> multiple patches not helpful for reviewing it.  I'm mostly reviewing
> the applied code and hope I didn't forget to place anything into the
> right mail.

And for this, I'll switch to a single patch, I prefer this way as well.

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 23/23] md/md-llbitmap: add Kconfig
  2025-05-27  8:29   ` Christoph Hellwig
@ 2025-05-27  9:00     ` Yu Kuai
  0 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-27  9:00 UTC (permalink / raw)
  To: Christoph Hellwig, Yu Kuai
  Cc: xni, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/05/27 16:29, Christoph Hellwig 写道:
> On Sat, May 24, 2025 at 02:13:20PM +0800, Yu Kuai wrote:
>>   	MD_PERSONALITY = 0,
>>   	MD_CLUSTER,
>> -	MD_BITMAP, /* TODO */
>> +	MD_BITMAP,
>>   };
>>   
>>   enum md_submodule_id {
>> @@ -39,7 +39,7 @@ enum md_submodule_id {
>>   	ID_RAID10	= 10,
>>   	ID_CLUSTER,
>>   	ID_BITMAP,
>> -	ID_LLBITMAP,	/* TODO */
>> +	ID_LLBITMAP,
> 
> Please just drop the TODO annotation from the initial patch.
> 

I guess this is no longer a problem with one single patch.

Thanks,
Kuai

> .
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 12/23] md/md-bitmap: add macros for lockless bitmap
  2025-05-24  6:13 ` [PATCH 12/23] md/md-bitmap: add macros for lockless bitmap Yu Kuai
  2025-05-26  6:40   ` Christoph Hellwig
  2025-05-27  6:21   ` Hannes Reinecke
@ 2025-05-28  4:53   ` Xiao Ni
  2 siblings, 0 replies; 108+ messages in thread
From: Xiao Ni @ 2025-05-28  4:53 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, yukuai3, linux-doc, linux-kernel, linux-raid,
	yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Also move other values to md-bitmap.h and update comments.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/md-bitmap.c |  9 ---------
>  drivers/md/md-bitmap.h | 17 +++++++++++++++++
>  2 files changed, 17 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 17d41a7b30ce..689d5dba9328 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -36,15 +36,6 @@
>  #include "md-bitmap.h"
>  #include "md-cluster.h"
>
> -#define BITMAP_MAJOR_LO 3
> -/* version 4 insists the bitmap is in little-endian order
> - * with version 3, it is host-endian which is non-portable
> - * Version 5 is currently set only for clustered devices
> - */
> -#define BITMAP_MAJOR_HI 4
> -#define BITMAP_MAJOR_CLUSTERED 5
> -#define        BITMAP_MAJOR_HOSTENDIAN 3
> -
>  /*
>   * in-memory bitmap:
>   *
> diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
> index f2d79c8a23b7..d2cdf831ef1a 100644
> --- a/drivers/md/md-bitmap.h
> +++ b/drivers/md/md-bitmap.h
> @@ -18,10 +18,27 @@ typedef __u16 bitmap_counter_t;
>  #define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2)))
>  #define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)
>
> +/*
> + * version 3 is host-endian order, this is deprecated and not used for new
> + * array
> + */
> +#define BITMAP_MAJOR_LO                3
> +#define BITMAP_MAJOR_HOSTENDIAN        3
> +/* version 4 is little-endian order, the default value */
> +#define BITMAP_MAJOR_HI                4
> +/* version 5 is only used for cluster */
> +#define BITMAP_MAJOR_CLUSTERED 5
> +/* version 6 is only used for lockless bitmap */
> +#define BITMAP_MAJOR_LOCKLESS  6
> +
> +#define BITMAP_SB_SIZE 1024

Hi

For super1, the bitmap bits are next to bitmap superblock.
BITMAP_SB_SIZE is only used by md-llbitmap, is it better to define it
in md-llbitmap.c?

Regards
Xiao

>  /* use these for bitmap->flags and bitmap->sb->state bit-fields */
>  enum bitmap_state {
>         BITMAP_STALE       = 1,  /* the bitmap file is out of date or had -EIO */
>         BITMAP_WRITE_ERROR = 2, /* A write error has occurred */
> +       BITMAP_FIRST_USE   = 3, /* llbitmap is just created */
> +       BITMAP_CLEAN       = 4, /* llbitmap is created with assume_clean */
> +       BITMAP_DAEMON_BUSY = 5, /* llbitmap daemon is not finished after daemon_sleep */
>         BITMAP_HOSTENDIAN  =15,
>  };
>
> --
> 2.39.2
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 14/23] md/dm-raid: remove max_write_behind setting limit
  2025-05-24  6:13 ` [PATCH 14/23] md/dm-raid: remove max_write_behind setting limit Yu Kuai
  2025-05-26  6:41   ` Christoph Hellwig
  2025-05-27  6:26   ` Hannes Reinecke
@ 2025-05-28  4:58   ` Xiao Ni
  2 siblings, 0 replies; 108+ messages in thread
From: Xiao Ni @ 2025-05-28  4:58 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, yukuai3, linux-doc, linux-kernel, linux-raid,
	yi.zhang, yangerkun, johnny.chenyi

On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> The comments said 'vaule in kB', while the value actually means the
> number of write_behind IOs. And since md-bitmap will automatically
> adjust the value to max COUNTER_MAX / 2, there is no need to fail
> early.
>
> Also move some macros that is only used md-bitmap.c.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/dm-raid.c   |  6 +-----
>  drivers/md/md-bitmap.c | 10 ++++++++++
>  drivers/md/md-bitmap.h |  9 ---------
>  3 files changed, 11 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index 088cfe6e0f98..9757c32ea1f5 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -1356,11 +1356,7 @@ static int parse_raid_params(struct raid_set *rs, struct dm_arg_set *as,
>                                 return -EINVAL;
>                         }
>
> -                       /*
> -                        * In device-mapper, we specify things in sectors, but
> -                        * MD records this value in kB
> -                        */
> -                       if (value < 0 || value / 2 > COUNTER_MAX) {
> +                       if (value < 0) {
>                                 rs->ti->error = "Max write-behind limit out of range";
>                                 return -EINVAL;
>                         }
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 535bc1888e8c..098e7b6cd187 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -98,9 +98,19 @@
>   *
>   */
>
> +typedef __u16 bitmap_counter_t;
> +
>  #define PAGE_BITS (PAGE_SIZE << 3)
>  #define PAGE_BIT_SHIFT (PAGE_SHIFT + 3)
>
> +#define COUNTER_BITS 16
> +#define COUNTER_BIT_SHIFT 4
> +#define COUNTER_BYTE_SHIFT (COUNTER_BIT_SHIFT - 3)
> +
> +#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1)))
> +#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2)))
> +#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)
> +
>  #define NEEDED(x) (((bitmap_counter_t) x) & NEEDED_MASK)
>  #define RESYNC(x) (((bitmap_counter_t) x) & RESYNC_MASK)
>  #define COUNTER(x) (((bitmap_counter_t) x) & COUNTER_MAX)
> diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
> index d2cdf831ef1a..a9a0f6a8d96d 100644
> --- a/drivers/md/md-bitmap.h
> +++ b/drivers/md/md-bitmap.h
> @@ -9,15 +9,6 @@
>
>  #define BITMAP_MAGIC 0x6d746962
>
> -typedef __u16 bitmap_counter_t;
> -#define COUNTER_BITS 16
> -#define COUNTER_BIT_SHIFT 4
> -#define COUNTER_BYTE_SHIFT (COUNTER_BIT_SHIFT - 3)
> -
> -#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1)))
> -#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2)))
> -#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)
> -
>  /*
>   * version 3 is host-endian order, this is deprecated and not used for new
>   * array
> --
> 2.39.2
>

Reviewed-by: Xiao Ni <xni@redhat.com>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 04/23] md/md-bitmap: support discard for bitmap ops
  2025-05-24  6:13 ` [PATCH 04/23] md/md-bitmap: support discard for bitmap ops Yu Kuai
                     ` (2 preceding siblings ...)
  2025-05-27  6:01   ` Hannes Reinecke
@ 2025-05-28  7:04   ` Glass Su
  3 siblings, 0 replies; 108+ messages in thread
From: Glass Su @ 2025-05-28  7:04 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, xni, colyli, song, yukuai3, linux-doc, linux-kernel,
	linux-raid, yi.zhang, yangerkun, johnny.chenyi



> On May 24, 2025, at 14:13, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Use two new methods {start, end}_discard to handle discard IO, prepare
> to support new md bitmap.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> drivers/md/md-bitmap.c |  3 +++
> drivers/md/md-bitmap.h | 12 ++++++++----
> drivers/md/md.c        | 15 +++++++++++----
> drivers/md/md.h        |  1 +
> 4 files changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 2997e09d463d..848626049dea 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -2991,6 +2991,9 @@ static struct bitmap_operations bitmap_ops = {
> 
> .start_write = bitmap_start_write,
> .end_write = bitmap_end_write,
> + .start_discard = bitmap_start_write,
> + .end_discard = bitmap_end_write,
> +
> .start_sync = bitmap_start_sync,
> .end_sync = bitmap_end_sync,
> .cond_end_sync = bitmap_cond_end_sync,
> diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
> index 9474e0d86fc6..4d804c07dbdd 100644
> --- a/drivers/md/md-bitmap.h
> +++ b/drivers/md/md-bitmap.h
> @@ -70,6 +70,9 @@ struct md_bitmap_stats {
> struct file *file;
> };
> 
> +typedef void (md_bitmap_fn)(struct mddev *mddev, sector_t offset,
> +    unsigned long sectors);
> +
> struct bitmap_operations {
> struct md_submodule_head head;
> 
> @@ -90,10 +93,11 @@ struct bitmap_operations {
> void (*end_behind_write)(struct mddev *mddev);
> void (*wait_behind_writes)(struct mddev *mddev);
> 
> - void (*start_write)(struct mddev *mddev, sector_t offset,
> -    unsigned long sectors);
> - void (*end_write)(struct mddev *mddev, sector_t offset,
> -  unsigned long sectors);
> + md_bitmap_fn *start_write;
> + md_bitmap_fn *end_write;
> + md_bitmap_fn *start_discard;
> + md_bitmap_fn *end_discard;
> +
> bool (*start_sync)(struct mddev *mddev, sector_t offset,
>   sector_t *blocks, bool degraded);
> void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 04a659f40cd6..466087cef4f9 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8845,18 +8845,24 @@ EXPORT_SYMBOL_GPL(md_submit_discard_bio);
> static void md_bitmap_start(struct mddev *mddev,
>    struct md_io_clone *md_io_clone)
> {
> + md_bitmap_fn *fn = unlikely(md_io_clone->rw == STAT_DISCARD) ?
> +   mddev->bitmap_ops->start_discard :
> +   mddev->bitmap_ops->start_write;
> +
> if (mddev->pers->bitmap_sector)
> mddev->pers->bitmap_sector(mddev, &md_io_clone->offset,
>   &md_io_clone->sectors);
> 
> - mddev->bitmap_ops->start_write(mddev, md_io_clone->offset,
> -       md_io_clone->sectors);
> + fn(mddev, md_io_clone->offset, md_io_clone->sectors);
> }
> 
> static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone)
> {
> - mddev->bitmap_ops->end_write(mddev, md_io_clone->offset,
> -     md_io_clone->sectors);
> + md_bitmap_fn *fn = unlikely(md_io_clone->rw == STAT_DISCARD) ?
> +   mddev->bitmap_ops->end_discard :
> +   mddev->bitmap_ops->end_write;
> +
> + fn(mddev, md_io_clone->offset, md_io_clone->sectors);
> }
> 
> static void md_end_clone_io(struct bio *bio)
> @@ -8895,6 +8901,7 @@ static void md_clone_bio(struct mddev *mddev, struct bio **bio)
> if (bio_data_dir(*bio) == WRITE && md_bitmap_enabled(mddev)) {
> md_io_clone->offset = (*bio)->bi_iter.bi_sector;
> md_io_clone->sectors = bio_sectors(*bio);
> + md_io_clone->rw = op_stat_group(bio_op(*bio));
> md_bitmap_start(mddev, md_io_clone);
> }
> 
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index c241119e6ef3..13e3f9ce1b79 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -850,6 +850,7 @@ struct md_io_clone {
> unsigned long start_time;
> sector_t offset;
> unsigned long sectors;
> + enum stat_group rw;

Please also mention the change in commit message.

— 
Su
> struct bio bio_clone;
> };
> 
> -- 
> 2.39.2
> 
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 18/23] md/md-llbitmap: implement APIs to mange bitmap lifetime
  2025-05-24  6:13 ` [PATCH 18/23] md/md-llbitmap: implement APIs to mange bitmap lifetime Yu Kuai
@ 2025-05-29  7:03   ` Xiao Ni
  2025-05-29  9:03     ` Yu Kuai
  0 siblings, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-05-29  7:03 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, yukuai3, linux-doc, linux-kernel, linux-raid,
	yi.zhang, yangerkun, johnny.chenyi

Hi Kuai

Is it better to put this patch before patch15. I'm reading patch15.
But I need to read this patch first to understand how llbitmap is
created and loaded. Then I can go to read the io related part.

Regards
Xiao

On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Include following APIs:
>  - llbitmap_create
>  - llbitmap_resize
>  - llbitmap_load
>  - llbitmap_destroy
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/md-llbitmap.c | 322 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 322 insertions(+)
>
> diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
> index 4d5f9a139a25..23283c4f7263 100644
> --- a/drivers/md/md-llbitmap.c
> +++ b/drivers/md/md-llbitmap.c
> @@ -689,4 +689,326 @@ static void llbitmap_resume(struct llbitmap *llbitmap, int page_idx)
>         wake_up(&pctl->wait);
>  }
>
> +static int llbitmap_check_support(struct mddev *mddev)
> +{
> +       if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
> +               pr_notice("md/llbitmap: %s: array with journal cannot have bitmap\n",
> +                         mdname(mddev));
> +               return -EBUSY;
> +       }
> +
> +       if (mddev->bitmap_info.space == 0) {
> +               if (mddev->bitmap_info.default_space == 0) {
> +                       pr_notice("md/llbitmap: %s: no space for bitmap\n",
> +                                 mdname(mddev));
> +                       return -ENOSPC;
> +               }
> +       }
> +
> +       if (!mddev->persistent) {
> +               pr_notice("md/llbitmap: %s: array must be persistent\n",
> +                         mdname(mddev));
> +               return -EOPNOTSUPP;
> +       }
> +
> +       if (mddev->bitmap_info.file) {
> +               pr_notice("md/llbitmap: %s: doesn't support bitmap file\n",
> +                         mdname(mddev));
> +               return -EOPNOTSUPP;
> +       }
> +
> +       if (mddev->bitmap_info.external) {
> +               pr_notice("md/llbitmap: %s: doesn't support external metadata\n",
> +                         mdname(mddev));
> +               return -EOPNOTSUPP;
> +       }
> +
> +       if (mddev_is_dm(mddev)) {
> +               pr_notice("md/llbitmap: %s: doesn't support dm-raid\n",
> +                         mdname(mddev));
> +               return -EOPNOTSUPP;
> +       }
> +
> +       return 0;
> +}
> +
> +static int llbitmap_init(struct llbitmap *llbitmap)
> +{
> +       struct mddev *mddev = llbitmap->mddev;
> +       sector_t blocks = mddev->resync_max_sectors;
> +       unsigned long chunksize = MIN_CHUNK_SIZE;
> +       unsigned long chunks = DIV_ROUND_UP(blocks, chunksize);
> +       unsigned long space = mddev->bitmap_info.space << SECTOR_SHIFT;
> +       int ret;
> +
> +       while (chunks > space) {
> +               chunksize = chunksize << 1;
> +               chunks = DIV_ROUND_UP(blocks, chunksize);
> +       }
> +
> +       llbitmap->chunkshift = ffz(~chunksize);
> +       llbitmap->chunksize = chunksize;
> +       llbitmap->chunks = chunks;
> +       mddev->bitmap_info.daemon_sleep = DEFAULT_DAEMON_SLEEP;
> +
> +       ret = llbitmap_cache_pages(llbitmap);
> +       if (ret)
> +               return ret;
> +
> +       llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, BitmapActionInit);
> +       return 0;
> +}
> +
> +static int llbitmap_read_sb(struct llbitmap *llbitmap)
> +{
> +       struct mddev *mddev = llbitmap->mddev;
> +       unsigned long daemon_sleep;
> +       unsigned long chunksize;
> +       unsigned long events;
> +       struct page *sb_page;
> +       bitmap_super_t *sb;
> +       int ret = -EINVAL;
> +
> +       if (!mddev->bitmap_info.offset) {
> +               pr_err("md/llbitmap: %s: no super block found", mdname(mddev));
> +               return -EINVAL;
> +       }
> +
> +       sb_page = llbitmap_read_page(llbitmap, 0);
> +       if (IS_ERR(sb_page)) {
> +               pr_err("md/llbitmap: %s: read super block failed",
> +                      mdname(mddev));
> +               ret = -EIO;
> +               goto out;
> +       }
> +
> +       sb = kmap_local_page(sb_page);
> +       if (sb->magic != cpu_to_le32(BITMAP_MAGIC)) {
> +               pr_err("md/llbitmap: %s: invalid super block magic number",
> +                      mdname(mddev));
> +               goto out_put_page;
> +       }
> +
> +       if (sb->version != cpu_to_le32(BITMAP_MAJOR_LOCKLESS)) {
> +               pr_err("md/llbitmap: %s: invalid super block version",
> +                      mdname(mddev));
> +               goto out_put_page;
> +       }
> +
> +       if (memcmp(sb->uuid, mddev->uuid, 16)) {
> +               pr_err("md/llbitmap: %s: bitmap superblock UUID mismatch\n",
> +                      mdname(mddev));
> +               goto out_put_page;
> +       }
> +
> +       if (mddev->bitmap_info.space == 0) {
> +               int room = le32_to_cpu(sb->sectors_reserved);
> +
> +               if (room)
> +                       mddev->bitmap_info.space = room;
> +               else
> +                       mddev->bitmap_info.space = mddev->bitmap_info.default_space;
> +       }
> +       llbitmap->flags = le32_to_cpu(sb->state);
> +       if (test_and_clear_bit(BITMAP_FIRST_USE, &llbitmap->flags)) {
> +               ret = llbitmap_init(llbitmap);
> +               goto out_put_page;
> +       }
> +
> +       chunksize = le32_to_cpu(sb->chunksize);
> +       if (!is_power_of_2(chunksize)) {
> +               pr_err("md/llbitmap: %s: chunksize not a power of 2",
> +                      mdname(mddev));
> +               goto out_put_page;
> +       }
> +
> +       if (chunksize < DIV_ROUND_UP(mddev->resync_max_sectors,
> +                                    mddev->bitmap_info.space << SECTOR_SHIFT)) {
> +               pr_err("md/llbitmap: %s: chunksize too small %lu < %llu / %lu",
> +                      mdname(mddev), chunksize, mddev->resync_max_sectors,
> +                      mddev->bitmap_info.space);
> +               goto out_put_page;
> +       }
> +
> +       daemon_sleep = le32_to_cpu(sb->daemon_sleep);
> +       if (daemon_sleep < 1 || daemon_sleep > MAX_SCHEDULE_TIMEOUT / HZ) {
> +               pr_err("md/llbitmap: %s: daemon sleep %lu period out of range",
> +                      mdname(mddev), daemon_sleep);
> +               goto out_put_page;
> +       }
> +
> +       events = le64_to_cpu(sb->events);
> +       if (events < mddev->events) {
> +               pr_warn("md/llbitmap :%s: bitmap file is out of date (%lu < %llu) -- forcing full recovery",
> +                       mdname(mddev), events, mddev->events);
> +               set_bit(BITMAP_STALE, &llbitmap->flags);
> +       }
> +
> +       sb->sync_size = cpu_to_le64(mddev->resync_max_sectors);
> +       mddev->bitmap_info.chunksize = chunksize;
> +       mddev->bitmap_info.daemon_sleep = daemon_sleep;
> +
> +       llbitmap->chunksize = chunksize;
> +       llbitmap->chunks = DIV_ROUND_UP(mddev->resync_max_sectors, chunksize);
> +       llbitmap->chunkshift = ffz(~chunksize);
> +       ret = llbitmap_cache_pages(llbitmap);
> +
> +out_put_page:
> +       __free_page(sb_page);
> +out:
> +       kunmap_local(sb);
> +       return ret;
> +}
> +
> +static void llbitmap_pending_timer_fn(struct timer_list *t)
> +{
> +       struct llbitmap *llbitmap = from_timer(llbitmap, t, pending_timer);
> +
> +       if (work_busy(&llbitmap->daemon_work)) {
> +               pr_warn("daemon_work not finished\n");
> +               set_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags);
> +               return;
> +       }
> +
> +       queue_work(md_llbitmap_io_wq, &llbitmap->daemon_work);
> +}
> +
> +static void md_llbitmap_daemon_fn(struct work_struct *work)
> +{
> +       struct llbitmap *llbitmap =
> +               container_of(work, struct llbitmap, daemon_work);
> +       unsigned long start;
> +       unsigned long end;
> +       bool restart;
> +       int idx;
> +
> +       if (llbitmap->mddev->degraded)
> +               return;
> +
> +retry:
> +       start = 0;
> +       end = min(llbitmap->chunks, PAGE_SIZE - BITMAP_SB_SIZE) - 1;
> +       restart = false;
> +
> +       for (idx = 0; idx < llbitmap->nr_pages; idx++) {
> +               struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
> +
> +               if (idx > 0) {
> +                       start = end + 1;
> +                       end = min(end + PAGE_SIZE, llbitmap->chunks - 1);
> +               }
> +
> +               if (!test_bit(LLPageFlush, &pctl->flags) &&
> +                   time_before(jiffies, pctl->expire)) {
> +                       restart = true;
> +                       continue;
> +               }
> +
> +               llbitmap_suspend(llbitmap, idx);
> +               llbitmap_state_machine(llbitmap, start, end, BitmapActionDaemon);
> +               llbitmap_resume(llbitmap, idx);
> +       }
> +
> +       /*
> +        * If the daemon took a long time to finish, retry to prevent missing
> +        * clearing dirty bits.
> +        */
> +       if (test_and_clear_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags))
> +               goto retry;
> +
> +       /* If some page is dirty but not expired, setup timer again */
> +       if (restart)
> +               mod_timer(&llbitmap->pending_timer,
> +                         jiffies + llbitmap->mddev->bitmap_info.daemon_sleep * HZ);
> +}
> +
> +static int llbitmap_create(struct mddev *mddev)
> +{
> +       struct llbitmap *llbitmap;
> +       int ret;
> +
> +       ret = llbitmap_check_support(mddev);
> +       if (ret)
> +               return ret;
> +
> +       llbitmap = kzalloc(sizeof(*llbitmap), GFP_KERNEL);
> +       if (!llbitmap)
> +               return -ENOMEM;
> +
> +       llbitmap->mddev = mddev;
> +       llbitmap->io_size = bdev_logical_block_size(mddev->gendisk->part0);
> +       llbitmap->bits_per_page = PAGE_SIZE / llbitmap->io_size;
> +
> +       timer_setup(&llbitmap->pending_timer, llbitmap_pending_timer_fn, 0);
> +       INIT_WORK(&llbitmap->daemon_work, md_llbitmap_daemon_fn);
> +       atomic_set(&llbitmap->behind_writes, 0);
> +       init_waitqueue_head(&llbitmap->behind_wait);
> +
> +       mutex_lock(&mddev->bitmap_info.mutex);
> +       mddev->bitmap = llbitmap;
> +       ret = llbitmap_read_sb(llbitmap);
> +       mutex_unlock(&mddev->bitmap_info.mutex);
> +       if (ret)
> +               goto err_out;
> +
> +       return 0;
> +
> +err_out:
> +       kfree(llbitmap);
> +       return ret;
> +}
> +
> +static int llbitmap_resize(struct mddev *mddev, sector_t blocks, int chunksize)
> +{
> +       struct llbitmap *llbitmap = mddev->bitmap;
> +       unsigned long chunks;
> +
> +       if (chunksize == 0)
> +               chunksize = llbitmap->chunksize;
> +
> +       /* If there is enough space, leave the chunksize unchanged. */
> +       chunks = DIV_ROUND_UP(blocks, chunksize);
> +       while (chunks > mddev->bitmap_info.space << SECTOR_SHIFT) {
> +               chunksize = chunksize << 1;
> +               chunks = DIV_ROUND_UP(blocks, chunksize);
> +       }
> +
> +       llbitmap->chunkshift = ffz(~chunksize);
> +       llbitmap->chunksize = chunksize;
> +       llbitmap->chunks = chunks;
> +
> +       return 0;
> +}
> +
> +static int llbitmap_load(struct mddev *mddev)
> +{
> +       enum llbitmap_action action = BitmapActionReload;
> +       struct llbitmap *llbitmap = mddev->bitmap;
> +
> +       if (test_and_clear_bit(BITMAP_STALE, &llbitmap->flags))
> +               action = BitmapActionStale;
> +
> +       llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, action);
> +       return 0;
> +}
> +
> +static void llbitmap_destroy(struct mddev *mddev)
> +{
> +       struct llbitmap *llbitmap = mddev->bitmap;
> +
> +       if (!llbitmap)
> +               return;
> +
> +       mutex_lock(&mddev->bitmap_info.mutex);
> +
> +       timer_delete_sync(&llbitmap->pending_timer);
> +       flush_workqueue(md_llbitmap_io_wq);
> +       flush_workqueue(md_llbitmap_unplug_wq);
> +
> +       mddev->bitmap = NULL;
> +       llbitmap_free_pages(llbitmap);
> +       kfree(llbitmap);
> +       mutex_unlock(&mddev->bitmap_info.mutex);
> +}
> +
>  #endif /* CONFIG_MD_LLBITMAP */
> --
> 2.39.2
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 18/23] md/md-llbitmap: implement APIs to mange bitmap lifetime
  2025-05-29  7:03   ` Xiao Ni
@ 2025-05-29  9:03     ` Yu Kuai
  0 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-29  9:03 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/05/29 15:03, Xiao Ni 写道:
> Hi Kuai
> 
> Is it better to put this patch before patch15. I'm reading patch15.
> But I need to read this patch first to understand how llbitmap is
> created and loaded. Then I can go to read the io related part.

Never mind, I'll merge patch 15-23 into one single patch in the next
version, it's better for review.

Thanks,
Kuai

> 
> Regards
> Xiao
> 
> On Sat, May 24, 2025 at 2:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Include following APIs:
>>   - llbitmap_create
>>   - llbitmap_resize
>>   - llbitmap_load
>>   - llbitmap_destroy
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   drivers/md/md-llbitmap.c | 322 +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 322 insertions(+)
>>
>> diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
>> index 4d5f9a139a25..23283c4f7263 100644
>> --- a/drivers/md/md-llbitmap.c
>> +++ b/drivers/md/md-llbitmap.c
>> @@ -689,4 +689,326 @@ static void llbitmap_resume(struct llbitmap *llbitmap, int page_idx)
>>          wake_up(&pctl->wait);
>>   }
>>
>> +static int llbitmap_check_support(struct mddev *mddev)
>> +{
>> +       if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
>> +               pr_notice("md/llbitmap: %s: array with journal cannot have bitmap\n",
>> +                         mdname(mddev));
>> +               return -EBUSY;
>> +       }
>> +
>> +       if (mddev->bitmap_info.space == 0) {
>> +               if (mddev->bitmap_info.default_space == 0) {
>> +                       pr_notice("md/llbitmap: %s: no space for bitmap\n",
>> +                                 mdname(mddev));
>> +                       return -ENOSPC;
>> +               }
>> +       }
>> +
>> +       if (!mddev->persistent) {
>> +               pr_notice("md/llbitmap: %s: array must be persistent\n",
>> +                         mdname(mddev));
>> +               return -EOPNOTSUPP;
>> +       }
>> +
>> +       if (mddev->bitmap_info.file) {
>> +               pr_notice("md/llbitmap: %s: doesn't support bitmap file\n",
>> +                         mdname(mddev));
>> +               return -EOPNOTSUPP;
>> +       }
>> +
>> +       if (mddev->bitmap_info.external) {
>> +               pr_notice("md/llbitmap: %s: doesn't support external metadata\n",
>> +                         mdname(mddev));
>> +               return -EOPNOTSUPP;
>> +       }
>> +
>> +       if (mddev_is_dm(mddev)) {
>> +               pr_notice("md/llbitmap: %s: doesn't support dm-raid\n",
>> +                         mdname(mddev));
>> +               return -EOPNOTSUPP;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static int llbitmap_init(struct llbitmap *llbitmap)
>> +{
>> +       struct mddev *mddev = llbitmap->mddev;
>> +       sector_t blocks = mddev->resync_max_sectors;
>> +       unsigned long chunksize = MIN_CHUNK_SIZE;
>> +       unsigned long chunks = DIV_ROUND_UP(blocks, chunksize);
>> +       unsigned long space = mddev->bitmap_info.space << SECTOR_SHIFT;
>> +       int ret;
>> +
>> +       while (chunks > space) {
>> +               chunksize = chunksize << 1;
>> +               chunks = DIV_ROUND_UP(blocks, chunksize);
>> +       }
>> +
>> +       llbitmap->chunkshift = ffz(~chunksize);
>> +       llbitmap->chunksize = chunksize;
>> +       llbitmap->chunks = chunks;
>> +       mddev->bitmap_info.daemon_sleep = DEFAULT_DAEMON_SLEEP;
>> +
>> +       ret = llbitmap_cache_pages(llbitmap);
>> +       if (ret)
>> +               return ret;
>> +
>> +       llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, BitmapActionInit);
>> +       return 0;
>> +}
>> +
>> +static int llbitmap_read_sb(struct llbitmap *llbitmap)
>> +{
>> +       struct mddev *mddev = llbitmap->mddev;
>> +       unsigned long daemon_sleep;
>> +       unsigned long chunksize;
>> +       unsigned long events;
>> +       struct page *sb_page;
>> +       bitmap_super_t *sb;
>> +       int ret = -EINVAL;
>> +
>> +       if (!mddev->bitmap_info.offset) {
>> +               pr_err("md/llbitmap: %s: no super block found", mdname(mddev));
>> +               return -EINVAL;
>> +       }
>> +
>> +       sb_page = llbitmap_read_page(llbitmap, 0);
>> +       if (IS_ERR(sb_page)) {
>> +               pr_err("md/llbitmap: %s: read super block failed",
>> +                      mdname(mddev));
>> +               ret = -EIO;
>> +               goto out;
>> +       }
>> +
>> +       sb = kmap_local_page(sb_page);
>> +       if (sb->magic != cpu_to_le32(BITMAP_MAGIC)) {
>> +               pr_err("md/llbitmap: %s: invalid super block magic number",
>> +                      mdname(mddev));
>> +               goto out_put_page;
>> +       }
>> +
>> +       if (sb->version != cpu_to_le32(BITMAP_MAJOR_LOCKLESS)) {
>> +               pr_err("md/llbitmap: %s: invalid super block version",
>> +                      mdname(mddev));
>> +               goto out_put_page;
>> +       }
>> +
>> +       if (memcmp(sb->uuid, mddev->uuid, 16)) {
>> +               pr_err("md/llbitmap: %s: bitmap superblock UUID mismatch\n",
>> +                      mdname(mddev));
>> +               goto out_put_page;
>> +       }
>> +
>> +       if (mddev->bitmap_info.space == 0) {
>> +               int room = le32_to_cpu(sb->sectors_reserved);
>> +
>> +               if (room)
>> +                       mddev->bitmap_info.space = room;
>> +               else
>> +                       mddev->bitmap_info.space = mddev->bitmap_info.default_space;
>> +       }
>> +       llbitmap->flags = le32_to_cpu(sb->state);
>> +       if (test_and_clear_bit(BITMAP_FIRST_USE, &llbitmap->flags)) {
>> +               ret = llbitmap_init(llbitmap);
>> +               goto out_put_page;
>> +       }
>> +
>> +       chunksize = le32_to_cpu(sb->chunksize);
>> +       if (!is_power_of_2(chunksize)) {
>> +               pr_err("md/llbitmap: %s: chunksize not a power of 2",
>> +                      mdname(mddev));
>> +               goto out_put_page;
>> +       }
>> +
>> +       if (chunksize < DIV_ROUND_UP(mddev->resync_max_sectors,
>> +                                    mddev->bitmap_info.space << SECTOR_SHIFT)) {
>> +               pr_err("md/llbitmap: %s: chunksize too small %lu < %llu / %lu",
>> +                      mdname(mddev), chunksize, mddev->resync_max_sectors,
>> +                      mddev->bitmap_info.space);
>> +               goto out_put_page;
>> +       }
>> +
>> +       daemon_sleep = le32_to_cpu(sb->daemon_sleep);
>> +       if (daemon_sleep < 1 || daemon_sleep > MAX_SCHEDULE_TIMEOUT / HZ) {
>> +               pr_err("md/llbitmap: %s: daemon sleep %lu period out of range",
>> +                      mdname(mddev), daemon_sleep);
>> +               goto out_put_page;
>> +       }
>> +
>> +       events = le64_to_cpu(sb->events);
>> +       if (events < mddev->events) {
>> +               pr_warn("md/llbitmap :%s: bitmap file is out of date (%lu < %llu) -- forcing full recovery",
>> +                       mdname(mddev), events, mddev->events);
>> +               set_bit(BITMAP_STALE, &llbitmap->flags);
>> +       }
>> +
>> +       sb->sync_size = cpu_to_le64(mddev->resync_max_sectors);
>> +       mddev->bitmap_info.chunksize = chunksize;
>> +       mddev->bitmap_info.daemon_sleep = daemon_sleep;
>> +
>> +       llbitmap->chunksize = chunksize;
>> +       llbitmap->chunks = DIV_ROUND_UP(mddev->resync_max_sectors, chunksize);
>> +       llbitmap->chunkshift = ffz(~chunksize);
>> +       ret = llbitmap_cache_pages(llbitmap);
>> +
>> +out_put_page:
>> +       __free_page(sb_page);
>> +out:
>> +       kunmap_local(sb);
>> +       return ret;
>> +}
>> +
>> +static void llbitmap_pending_timer_fn(struct timer_list *t)
>> +{
>> +       struct llbitmap *llbitmap = from_timer(llbitmap, t, pending_timer);
>> +
>> +       if (work_busy(&llbitmap->daemon_work)) {
>> +               pr_warn("daemon_work not finished\n");
>> +               set_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags);
>> +               return;
>> +       }
>> +
>> +       queue_work(md_llbitmap_io_wq, &llbitmap->daemon_work);
>> +}
>> +
>> +static void md_llbitmap_daemon_fn(struct work_struct *work)
>> +{
>> +       struct llbitmap *llbitmap =
>> +               container_of(work, struct llbitmap, daemon_work);
>> +       unsigned long start;
>> +       unsigned long end;
>> +       bool restart;
>> +       int idx;
>> +
>> +       if (llbitmap->mddev->degraded)
>> +               return;
>> +
>> +retry:
>> +       start = 0;
>> +       end = min(llbitmap->chunks, PAGE_SIZE - BITMAP_SB_SIZE) - 1;
>> +       restart = false;
>> +
>> +       for (idx = 0; idx < llbitmap->nr_pages; idx++) {
>> +               struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
>> +
>> +               if (idx > 0) {
>> +                       start = end + 1;
>> +                       end = min(end + PAGE_SIZE, llbitmap->chunks - 1);
>> +               }
>> +
>> +               if (!test_bit(LLPageFlush, &pctl->flags) &&
>> +                   time_before(jiffies, pctl->expire)) {
>> +                       restart = true;
>> +                       continue;
>> +               }
>> +
>> +               llbitmap_suspend(llbitmap, idx);
>> +               llbitmap_state_machine(llbitmap, start, end, BitmapActionDaemon);
>> +               llbitmap_resume(llbitmap, idx);
>> +       }
>> +
>> +       /*
>> +        * If the daemon took a long time to finish, retry to prevent missing
>> +        * clearing dirty bits.
>> +        */
>> +       if (test_and_clear_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags))
>> +               goto retry;
>> +
>> +       /* If some page is dirty but not expired, setup timer again */
>> +       if (restart)
>> +               mod_timer(&llbitmap->pending_timer,
>> +                         jiffies + llbitmap->mddev->bitmap_info.daemon_sleep * HZ);
>> +}
>> +
>> +static int llbitmap_create(struct mddev *mddev)
>> +{
>> +       struct llbitmap *llbitmap;
>> +       int ret;
>> +
>> +       ret = llbitmap_check_support(mddev);
>> +       if (ret)
>> +               return ret;
>> +
>> +       llbitmap = kzalloc(sizeof(*llbitmap), GFP_KERNEL);
>> +       if (!llbitmap)
>> +               return -ENOMEM;
>> +
>> +       llbitmap->mddev = mddev;
>> +       llbitmap->io_size = bdev_logical_block_size(mddev->gendisk->part0);
>> +       llbitmap->bits_per_page = PAGE_SIZE / llbitmap->io_size;
>> +
>> +       timer_setup(&llbitmap->pending_timer, llbitmap_pending_timer_fn, 0);
>> +       INIT_WORK(&llbitmap->daemon_work, md_llbitmap_daemon_fn);
>> +       atomic_set(&llbitmap->behind_writes, 0);
>> +       init_waitqueue_head(&llbitmap->behind_wait);
>> +
>> +       mutex_lock(&mddev->bitmap_info.mutex);
>> +       mddev->bitmap = llbitmap;
>> +       ret = llbitmap_read_sb(llbitmap);
>> +       mutex_unlock(&mddev->bitmap_info.mutex);
>> +       if (ret)
>> +               goto err_out;
>> +
>> +       return 0;
>> +
>> +err_out:
>> +       kfree(llbitmap);
>> +       return ret;
>> +}
>> +
>> +static int llbitmap_resize(struct mddev *mddev, sector_t blocks, int chunksize)
>> +{
>> +       struct llbitmap *llbitmap = mddev->bitmap;
>> +       unsigned long chunks;
>> +
>> +       if (chunksize == 0)
>> +               chunksize = llbitmap->chunksize;
>> +
>> +       /* If there is enough space, leave the chunksize unchanged. */
>> +       chunks = DIV_ROUND_UP(blocks, chunksize);
>> +       while (chunks > mddev->bitmap_info.space << SECTOR_SHIFT) {
>> +               chunksize = chunksize << 1;
>> +               chunks = DIV_ROUND_UP(blocks, chunksize);
>> +       }
>> +
>> +       llbitmap->chunkshift = ffz(~chunksize);
>> +       llbitmap->chunksize = chunksize;
>> +       llbitmap->chunks = chunks;
>> +
>> +       return 0;
>> +}
>> +
>> +static int llbitmap_load(struct mddev *mddev)
>> +{
>> +       enum llbitmap_action action = BitmapActionReload;
>> +       struct llbitmap *llbitmap = mddev->bitmap;
>> +
>> +       if (test_and_clear_bit(BITMAP_STALE, &llbitmap->flags))
>> +               action = BitmapActionStale;
>> +
>> +       llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, action);
>> +       return 0;
>> +}
>> +
>> +static void llbitmap_destroy(struct mddev *mddev)
>> +{
>> +       struct llbitmap *llbitmap = mddev->bitmap;
>> +
>> +       if (!llbitmap)
>> +               return;
>> +
>> +       mutex_lock(&mddev->bitmap_info.mutex);
>> +
>> +       timer_delete_sync(&llbitmap->pending_timer);
>> +       flush_workqueue(md_llbitmap_io_wq);
>> +       flush_workqueue(md_llbitmap_unplug_wq);
>> +
>> +       mddev->bitmap = NULL;
>> +       llbitmap_free_pages(llbitmap);
>> +       kfree(llbitmap);
>> +       mutex_unlock(&mddev->bitmap_info.mutex);
>> +}
>> +
>>   #endif /* CONFIG_MD_LLBITMAP */
>> --
>> 2.39.2
>>
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (23 preceding siblings ...)
  2025-05-24  7:07 ` [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
@ 2025-05-30  6:45 ` Yu Kuai
  2025-06-30  1:59 ` Xiao Ni
  25 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-05-30  6:45 UTC (permalink / raw)
  To: Yu Kuai, hch, xni, colyli, song
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

在 2025/05/24 14:12, Yu Kuai 写道:
> Yu Kuai (23):
>    md: add a new parameter 'offset' to md_super_write()
>    md: factor out a helper raid_is_456()
>    md/md-bitmap: cleanup bitmap_ops->startwrite()
>    md/md-bitmap: support discard for bitmap ops
>    md/md-bitmap: remove parameter slot from bitmap_create()
>    md/md-bitmap: add a new sysfs api bitmap_type
>    md/md-bitmap: delay registration of bitmap_ops until creating bitmap
>    md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
>    md/md-bitmap: add a new method blocks_synced() in bitmap_operations
>    md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
>    md/md-bitmap: make method bitmap_ops->daemon_work optional
>    md/md-bitmap: add macros for lockless bitmap
>    md/md-bitmap: fix dm-raid max_write_behind setting
>    md/dm-raid: remove max_write_behind setting limit
>    md/md-llbitmap: implement llbitmap IO
>    md/md-llbitmap: implement bit state machine
>    md/md-llbitmap: implement APIs for page level dirty bits
>      synchronization
>    md/md-llbitmap: implement APIs to mange bitmap lifetime
>    md/md-llbitmap: implement APIs to dirty bits and clear bits
>    md/md-llbitmap: implement APIs for sync_thread
>    md/md-llbitmap: implement all bitmap operations
>    md/md-llbitmap: implement sysfs APIs
>    md/md-llbitmap: add Kconfig

Patch 3, 13, 14 are applied to md-6.16, they are not related to
new bitmap:

	md/md-bitmap: cleanup bitmap_ops->startwrite()
	md/md-bitmap: fix dm-raid max_write_behind setting
	md/dm-raid: remove max_write_behind setting limit

Thanks,
Kuai

> 
>   Documentation/admin-guide/md.rst |   80 +-
>   drivers/md/Kconfig               |   11 +
>   drivers/md/Makefile              |    2 +-
>   drivers/md/dm-raid.c             |    6 +-
>   drivers/md/md-bitmap.c           |   50 +-
>   drivers/md/md-bitmap.h           |   55 +-
>   drivers/md/md-llbitmap.c         | 1556 ++++++++++++++++++++++++++++++
>   drivers/md/md.c                  |  247 +++--
>   drivers/md/md.h                  |   20 +-
>   drivers/md/raid5.c               |    6 +
>   10 files changed, 1901 insertions(+), 132 deletions(-)
>   create mode 100644 drivers/md/md-llbitmap.c


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 15/23] md/md-llbitmap: implement llbitmap IO
  2025-05-24  6:13 ` [PATCH 15/23] md/md-llbitmap: implement llbitmap IO Yu Kuai
  2025-05-27  8:27   ` Christoph Hellwig
@ 2025-06-06  3:21   ` Xiao Ni
  2025-06-06  3:48     ` Yu Kuai
  2025-06-30  2:07   ` Xiao Ni
  2 siblings, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-06-06  3:21 UTC (permalink / raw)
  To: Yu Kuai, hch, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi

Hi Kuai

I've read some codes of llbitmap, but I don't figure out the 
relationship of in memory bits and in storage bits. Does llbitmap have 
the two types as old bitmap? For example, in llbitmap_create, there is a 
argument ->bits_per_page which is calculated by 
PAGE_SIZE/logical_block_size. As the graph bellow, bits_per_page is 8 
(4K/512byte). What does the bit mean? And in the graph below, it talks 
512 bits in one block, what does this bit mean?  I haven't walked 
through all codes, maybe I can get the answer myself. If you can give a 
summary of how many types of bit and what's the usage of the bit, it can 
help to understand it easier.

Best Regards

Xiao

在 2025/5/24 下午2:13, Yu Kuai 写道:
> From: Yu Kuai <yukuai3@huawei.com>
>
> READ
>
> While creating bitmap, all pages will be allocated and read for llbitmap,
> there won't be read afterwards
>
> WRITE
>
> WRITE IO is divided into logical_block_size of the page, the dirty state
> of each block is tracked independently, for example:
>
> each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 bit;
>
> | page0 | page1 | ... | page 31 |
> |       |
> |        \-----------------------\
> |                                |
> | block0 | block1 | ... | block 8|
> |        |
> |         \-----------------\
> |                            |
> | bit0 | bit1 | ... | bit511 |
>
>  From IO path, if one bit is changed to Dirty or NeedSync, the corresponding
> subpage will be marked dirty, such block must write first before the IO is
> issued. This behaviour will affect IO performance, to reduce the impact, if
> multiple bits are changed in the same block in a short time, all bits in
> this block will be changed to Dirty/NeedSync, so that there won't be any
> overhead until daemon clears dirty bits.
>
> Also add data structure definition and comments.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   drivers/md/md-llbitmap.c | 571 +++++++++++++++++++++++++++++++++++++++
>   1 file changed, 571 insertions(+)
>   create mode 100644 drivers/md/md-llbitmap.c
>
> diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
> new file mode 100644
> index 000000000000..1a01b6777527
> --- /dev/null
> +++ b/drivers/md/md-llbitmap.c
> @@ -0,0 +1,571 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#ifdef CONFIG_MD_LLBITMAP
> +
> +#include <linux/blkdev.h>
> +#include <linux/module.h>
> +#include <linux/errno.h>
> +#include <linux/slab.h>
> +#include <linux/init.h>
> +#include <linux/timer.h>
> +#include <linux/sched.h>
> +#include <linux/list.h>
> +#include <linux/file.h>
> +#include <linux/seq_file.h>
> +#include <trace/events/block.h>
> +
> +#include "md.h"
> +#include "md-bitmap.h"
> +
> +/*
> + * #### Background
> + *
> + * Redundant data is used to enhance data fault tolerance, and the storage
> + * method for redundant data vary depending on the RAID levels. And it's
> + * important to maintain the consistency of redundant data.
> + *
> + * Bitmap is used to record which data blocks have been synchronized and which
> + * ones need to be resynchronized or recovered. Each bit in the bitmap
> + * represents a segment of data in the array. When a bit is set, it indicates
> + * that the multiple redundant copies of that data segment may not be
> + * consistent. Data synchronization can be performed based on the bitmap after
> + * power failure or readding a disk. If there is no bitmap, a full disk
> + * synchronization is required.
> + *
> + * #### Key Features
> + *
> + *  - IO fastpath is lockless, if user issues lots of write IO to the same
> + *  bitmap bit in a short time, only the first write have additional overhead
> + *  to update bitmap bit, no additional overhead for the following writes;
> + *  - support only resync or recover written data, means in the case creating
> + *  new array or replacing with a new disk, there is no need to do a full disk
> + *  resync/recovery;
> + *
> + * #### Key Concept
> + *
> + * ##### State Machine
> + *
> + * Each bit is one byte, contain 6 difference state, see llbitmap_state. And
> + * there are total 8 differenct actions, see llbitmap_action, can change state:
> + *
> + * llbitmap state machine: transitions between states
> + *
> + * |           | Startwrite | Startsync | Endsync | Abortsync|
> + * | --------- | ---------- | --------- | ------- | -------  |
> + * | Unwritten | Dirty      | x         | x       | x        |
> + * | Clean     | Dirty      | x         | x       | x        |
> + * | Dirty     | x          | x         | x       | x        |
> + * | NeedSync  | x          | Syncing   | x       | x        |
> + * | Syncing   | x          | Syncing   | Dirty   | NeedSync |
> + *
> + * |           | Reload   | Daemon | Discard   | Stale     |
> + * | --------- | -------- | ------ | --------- | --------- |
> + * | Unwritten | x        | x      | x         | x         |
> + * | Clean     | x        | x      | Unwritten | NeedSync  |
> + * | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
> + * | NeedSync  | x        | x      | Unwritten | x         |
> + * | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
> + *
> + * Typical scenarios:
> + *
> + * 1) Create new array
> + * All bits will be set to Unwritten by default, if --assume-clean is set,
> + * all bits will be set to Clean instead.
> + *
> + * 2) write data, raid1/raid10 have full copy of data, while raid456 doesn't and
> + * rely on xor data
> + *
> + * 2.1) write new data to raid1/raid10:
> + * Unwritten --StartWrite--> Dirty
> + *
> + * 2.2) write new data to raid456:
> + * Unwritten --StartWrite--> NeedSync
> + *
> + * Because the initial recover for raid456 is skipped, the xor data is not build
> + * yet, the bit must set to NeedSync first and after lazy initial recover is
> + * finished, the bit will finially set to Dirty(see 5.1 and 5.4);
> + *
> + * 2.3) cover write
> + * Clean --StartWrite--> Dirty
> + *
> + * 3) daemon, if the array is not degraded:
> + * Dirty --Daemon--> Clean
> + *
> + * For degraded array, the Dirty bit will never be cleared, prevent full disk
> + * recovery while readding a removed disk.
> + *
> + * 4) discard
> + * {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
> + *
> + * 5) resync and recover
> + *
> + * 5.1) common process
> + * NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
> + *
> + * 5.2) resync after power failure
> + * Dirty --Reload--> NeedSync
> + *
> + * 5.3) recover while replacing with a new disk
> + * By default, the old bitmap framework will recover all data, and llbitmap
> + * implement this by a new helper, see llbitmap_skip_sync_blocks:
> + *
> + * skip recover for bits other than dirty or clean;
> + *
> + * 5.4) lazy initial recover for raid5:
> + * By default, the old bitmap framework will only allow new recover when there
> + * are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER is add
> + * to perform raid456 lazy recover for set bits(from 2.2).
> + *
> + * ##### Bitmap IO
> + *
> + * ##### Chunksize
> + *
> + * The default bitmap size is 128k, incluing 1k bitmap super block, and
> + * the default size of segment of data in the array each bit(chunksize) is 64k,
> + * and chunksize will adjust to twice the old size each time if the total number
> + * bits is not less than 127k.(see llbitmap_init)
> + *
> + * ##### READ
> + *
> + * While creating bitmap, all pages will be allocated and read for llbitmap,
> + * there won't be read afterwards
> + *
> + * ##### WRITE
> + *
> + * WRITE IO is divided into logical_block_size of the array, the dirty state
> + * of each block is tracked independently, for example:
> + *
> + * each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 bit;
> + *
> + * | page0 | page1 | ... | page 31 |
> + * |       |
> + * |        \-----------------------\
> + * |                                |
> + * | block0 | block1 | ... | block 8|
> + * |        |
> + * |         \-----------------\
> + * |                            |
> + * | bit0 | bit1 | ... | bit511 |
> + *
> + * From IO path, if one bit is changed to Dirty or NeedSync, the corresponding
> + * subpage will be marked dirty, such block must write first before the IO is
> + * issued. This behaviour will affect IO performance, to reduce the impact, if
> + * multiple bits are changed in the same block in a short time, all bits in this
> + * block will be changed to Dirty/NeedSync, so that there won't be any overhead
> + * until daemon clears dirty bits.
> + *
> + * ##### Dirty Bits syncronization
> + *
> + * IO fast path will set bits to dirty, and those dirty bits will be cleared
> + * by daemon after IO is done. llbitmap_page_ctl is used to synchronize between
> + * IO path and daemon;
> + *
> + * IO path:
> + *  1) try to grab a reference, if succeed, set expire time after 5s and return;
> + *  2) if failed to grab a reference, wait for daemon to finish clearing dirty
> + *  bits;
> + *
> + * Daemon(Daemon will be waken up every daemon_sleep seconds):
> + * For each page:
> + *  1) check if page expired, if not skip this page; for expired page:
> + *  2) suspend the page and wait for inflight write IO to be done;
> + *  3) change dirty page to clean;
> + *  4) resume the page;
> + */
> +
> +#define BITMAP_SB_SIZE 1024
> +
> +/* 64k is the max IO size of sync IO for raid1/raid10 */
> +#define MIN_CHUNK_SIZE (64 * 2)
> +
> +/* By default, daemon will be waken up every 30s */
> +#define DEFAULT_DAEMON_SLEEP 30
> +
> +/*
> + * Dirtied bits that have not been accessed for more than 5s will be cleared
> + * by daemon.
> + */
> +#define BARRIER_IDLE 5
> +
> +enum llbitmap_state {
> +	/* No valid data, init state after assemble the array */
> +	BitUnwritten = 0,
> +	/* data is consistent */
> +	BitClean,
> +	/* data will be consistent after IO is done, set directly for writes */
> +	BitDirty,
> +	/*
> +	 * data need to be resynchronized:
> +	 * 1) set directly for writes if array is degraded, prevent full disk
> +	 * synchronization after readding a disk;
> +	 * 2) reassemble the array after power failure, and dirty bits are
> +	 * found after reloading the bitmap;
> +	 * 3) set for first write for raid5, to build initial xor data lazily
> +	 */
> +	BitNeedSync,
> +	/* data is synchronizing */
> +	BitSyncing,
> +	nr_llbitmap_state,
> +	BitNone = 0xff,
> +};
> +
> +enum llbitmap_action {
> +	/* User write new data, this is the only action from IO fast path */
> +	BitmapActionStartwrite = 0,
> +	/* Start recovery */
> +	BitmapActionStartsync,
> +	/* Finish recovery */
> +	BitmapActionEndsync,
> +	/* Failed recovery */
> +	BitmapActionAbortsync,
> +	/* Reassemble the array */
> +	BitmapActionReload,
> +	/* Daemon thread is trying to clear dirty bits */
> +	BitmapActionDaemon,
> +	/* Data is deleted */
> +	BitmapActionDiscard,
> +	/*
> +	 * Bitmap is stale, mark all bits in addition to BitUnwritten to
> +	 * BitNeedSync.
> +	 */
> +	BitmapActionStale,
> +	nr_llbitmap_action,
> +	/* Init state is BitUnwritten */
> +	BitmapActionInit,
> +};
> +
> +enum llbitmap_page_state {
> +	LLPageFlush = 0,
> +	LLPageDirty,
> +};
> +
> +struct llbitmap_page_ctl {
> +	char *state;
> +	struct page *page;
> +	unsigned long expire;
> +	unsigned long flags;
> +	wait_queue_head_t wait;
> +	struct percpu_ref active;
> +	/* Per block size dirty state, maximum 64k page / 1 sector = 128 */
> +	unsigned long dirty[];
> +};
> +
> +struct llbitmap {
> +	struct mddev *mddev;
> +	struct llbitmap_page_ctl **pctl;
> +
> +	unsigned int nr_pages;
> +	unsigned int io_size;
> +	unsigned int bits_per_page;
> +
> +	/* shift of one chunk */
> +	unsigned long chunkshift;
> +	/* size of one chunk in sector */
> +	unsigned long chunksize;
> +	/* total number of chunks */
> +	unsigned long chunks;
> +	unsigned long last_end_sync;
> +	/* fires on first BitDirty state */
> +	struct timer_list pending_timer;
> +	struct work_struct daemon_work;
> +
> +	unsigned long flags;
> +	__u64	events_cleared;
> +
> +	/* for slow disks */
> +	atomic_t behind_writes;
> +	wait_queue_head_t behind_wait;
> +};
> +
> +struct llbitmap_unplug_work {
> +	struct work_struct work;
> +	struct llbitmap *llbitmap;
> +	struct completion *done;
> +};
> +
> +static struct workqueue_struct *md_llbitmap_io_wq;
> +static struct workqueue_struct *md_llbitmap_unplug_wq;
> +
> +static char state_machine[nr_llbitmap_state][nr_llbitmap_action] = {
> +	[BitUnwritten] = {
> +		[BitmapActionStartwrite]	= BitDirty,
> +		[BitmapActionStartsync]		= BitNone,
> +		[BitmapActionEndsync]		= BitNone,
> +		[BitmapActionAbortsync]		= BitNone,
> +		[BitmapActionReload]		= BitNone,
> +		[BitmapActionDaemon]		= BitNone,
> +		[BitmapActionDiscard]		= BitNone,
> +		[BitmapActionStale]		= BitNone,
> +	},
> +	[BitClean] = {
> +		[BitmapActionStartwrite]	= BitDirty,
> +		[BitmapActionStartsync]		= BitNone,
> +		[BitmapActionEndsync]		= BitNone,
> +		[BitmapActionAbortsync]		= BitNone,
> +		[BitmapActionReload]		= BitNone,
> +		[BitmapActionDaemon]		= BitNone,
> +		[BitmapActionDiscard]		= BitUnwritten,
> +		[BitmapActionStale]		= BitNeedSync,
> +	},
> +	[BitDirty] = {
> +		[BitmapActionStartwrite]	= BitNone,
> +		[BitmapActionStartsync]		= BitNone,
> +		[BitmapActionEndsync]		= BitNone,
> +		[BitmapActionAbortsync]		= BitNone,
> +		[BitmapActionReload]		= BitNeedSync,
> +		[BitmapActionDaemon]		= BitClean,
> +		[BitmapActionDiscard]		= BitUnwritten,
> +		[BitmapActionStale]		= BitNeedSync,
> +	},
> +	[BitNeedSync] = {
> +		[BitmapActionStartwrite]	= BitNone,
> +		[BitmapActionStartsync]		= BitSyncing,
> +		[BitmapActionEndsync]		= BitNone,
> +		[BitmapActionAbortsync]		= BitNone,
> +		[BitmapActionReload]		= BitNone,
> +		[BitmapActionDaemon]		= BitNone,
> +		[BitmapActionDiscard]		= BitUnwritten,
> +		[BitmapActionStale]		= BitNone,
> +	},
> +	[BitSyncing] = {
> +		[BitmapActionStartwrite]	= BitNone,
> +		[BitmapActionStartsync]		= BitSyncing,
> +		[BitmapActionEndsync]		= BitDirty,
> +		[BitmapActionAbortsync]		= BitNeedSync,
> +		[BitmapActionReload]		= BitNeedSync,
> +		[BitmapActionDaemon]		= BitNone,
> +		[BitmapActionDiscard]		= BitUnwritten,
> +		[BitmapActionStale]		= BitNeedSync,
> +	},
> +};
> +
> +static enum llbitmap_state llbitmap_read(struct llbitmap *llbitmap, loff_t pos)
> +{
> +	unsigned int idx;
> +	unsigned int offset;
> +
> +	pos += BITMAP_SB_SIZE;
> +	idx = pos >> PAGE_SHIFT;
> +	offset = offset_in_page(pos);
> +
> +	return llbitmap->pctl[idx]->state[offset];
> +}
> +
> +/* set all the bits in the subpage as dirty */
> +static void llbitmap_infect_dirty_bits(struct llbitmap *llbitmap,
> +				       struct llbitmap_page_ctl *pctl,
> +				       unsigned int bit, unsigned int offset)
> +{
> +	bool level_456 = raid_is_456(llbitmap->mddev);
> +	unsigned int io_size = llbitmap->io_size;
> +	int pos;
> +
> +	for (pos = bit * io_size; pos < (bit + 1) * io_size; pos++) {
> +		if (pos == offset)
> +			continue;
> +
> +		switch (pctl->state[pos]) {
> +		case BitUnwritten:
> +			pctl->state[pos] = level_456 ? BitNeedSync : BitDirty;
> +			break;
> +		case BitClean:
> +			pctl->state[pos] = BitDirty;
> +			break;
> +		};
> +	}
> +
> +}
> +
> +static void llbitmap_set_page_dirty(struct llbitmap *llbitmap, int idx,
> +				    int offset)
> +{
> +	struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
> +	unsigned int io_size = llbitmap->io_size;
> +	int bit = offset / io_size;
> +	int pos;
> +
> +	if (!test_bit(LLPageDirty, &pctl->flags))
> +		set_bit(LLPageDirty, &pctl->flags);
> +
> +	/*
> +	 * The subpage usually contains a total of 512 bits. If any single bit
> +	 * within the subpage is marked as dirty, the entire sector will be
> +	 * written. To avoid impacting write performance, when multiple bits
> +	 * within the same sector are modified within a short time frame, all
> +	 * bits in the sector will be collectively marked as dirty at once.
> +	 */
> +	if (test_and_set_bit(bit, pctl->dirty)) {
> +		llbitmap_infect_dirty_bits(llbitmap, pctl, bit, offset);
> +		return;
> +	}
> +
> +	for (pos = bit * io_size; pos < (bit + 1) * io_size; pos++) {
> +		if (pos == offset)
> +			continue;
> +		if (pctl->state[pos] == BitDirty ||
> +		    pctl->state[pos] == BitNeedSync) {
> +			llbitmap_infect_dirty_bits(llbitmap, pctl, bit, offset);
> +			return;
> +		}
> +	}
> +}
> +
> +static void llbitmap_write(struct llbitmap *llbitmap, enum llbitmap_state state,
> +			   loff_t pos)
> +{
> +	unsigned int idx;
> +	unsigned int offset;
> +
> +	pos += BITMAP_SB_SIZE;
> +	idx = pos >> PAGE_SHIFT;
> +	offset = offset_in_page(pos);
> +
> +	llbitmap->pctl[idx]->state[offset] = state;
> +	if (state == BitDirty || state == BitNeedSync)
> +		llbitmap_set_page_dirty(llbitmap, idx, offset);
> +}
> +
> +static struct page *llbitmap_read_page(struct llbitmap *llbitmap, int idx)
> +{
> +	struct mddev *mddev = llbitmap->mddev;
> +	struct page *page = NULL;
> +	struct md_rdev *rdev;
> +
> +	if (llbitmap->pctl && llbitmap->pctl[idx])
> +		page = llbitmap->pctl[idx]->page;
> +	if (page)
> +		return page;
> +
> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!page)
> +		return ERR_PTR(-ENOMEM);
> +
> +	rdev_for_each(rdev, mddev) {
> +		sector_t sector;
> +
> +		if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
> +			continue;
> +
> +		sector = mddev->bitmap_info.offset +
> +			 (idx << PAGE_SECTORS_SHIFT);
> +
> +		if (sync_page_io(rdev, sector, PAGE_SIZE, page, REQ_OP_READ,
> +				 true))
> +			return page;
> +
> +		md_error(mddev, rdev);
> +	}
> +
> +	__free_page(page);
> +	return ERR_PTR(-EIO);
> +}
> +
> +static void llbitmap_write_page(struct llbitmap *llbitmap, int idx)
> +{
> +	struct page *page = llbitmap->pctl[idx]->page;
> +	struct mddev *mddev = llbitmap->mddev;
> +	struct md_rdev *rdev;
> +	int bit;
> +
> +	for (bit = 0; bit < llbitmap->bits_per_page; bit++) {
> +		struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
> +
> +		if (!test_and_clear_bit(bit, pctl->dirty))
> +			continue;
> +
> +		rdev_for_each(rdev, mddev) {
> +			sector_t sector;
> +			sector_t bit_sector = llbitmap->io_size >> SECTOR_SHIFT;
> +
> +			if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
> +				continue;
> +
> +			sector = mddev->bitmap_info.offset + rdev->sb_start +
> +				 (idx << PAGE_SECTORS_SHIFT) +
> +				 bit * bit_sector;
> +			md_write_metadata(mddev, rdev, sector,
> +					  llbitmap->io_size, page,
> +					  bit * llbitmap->io_size);
> +		}
> +	}
> +}
> +
> +static void active_release(struct percpu_ref *ref)
> +{
> +	struct llbitmap_page_ctl *pctl =
> +		container_of(ref, struct llbitmap_page_ctl, active);
> +
> +	wake_up(&pctl->wait);
> +}
> +
> +static void llbitmap_free_pages(struct llbitmap *llbitmap)
> +{
> +	int i;
> +
> +	if (!llbitmap->pctl)
> +		return;
> +
> +	for (i = 0; i < llbitmap->nr_pages; i++) {
> +		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
> +
> +		if (!pctl || !pctl->page)
> +			break;
> +
> +		__free_page(pctl->page);
> +		percpu_ref_exit(&pctl->active);
> +	}
> +
> +	kfree(llbitmap->pctl[0]);
> +	kfree(llbitmap->pctl);
> +	llbitmap->pctl = NULL;
> +}
> +
> +static int llbitmap_cache_pages(struct llbitmap *llbitmap)
> +{
> +	struct llbitmap_page_ctl *pctl;
> +	unsigned int nr_pages = DIV_ROUND_UP(llbitmap->chunks + BITMAP_SB_SIZE,
> +					     PAGE_SIZE);
> +	unsigned int size = struct_size(pctl, dirty,
> +					BITS_TO_LONGS(llbitmap->bits_per_page));
> +	int i;
> +
> +	llbitmap->pctl = kmalloc_array(nr_pages, sizeof(void *),
> +				       GFP_KERNEL | __GFP_ZERO);
> +	if (!llbitmap->pctl)
> +		return -ENOMEM;
> +
> +	size = round_up(size, cache_line_size());
> +	pctl = kmalloc_array(nr_pages, size, GFP_KERNEL | __GFP_ZERO);
> +	if (!pctl) {
> +		kfree(llbitmap->pctl);
> +		return -ENOMEM;
> +	}
> +
> +	llbitmap->nr_pages = nr_pages;
> +
> +	for (i = 0; i < nr_pages; i++, pctl = (void *)pctl + size) {
> +		struct page *page = llbitmap_read_page(llbitmap, i);
> +
> +		llbitmap->pctl[i] = pctl;
> +
> +		if (IS_ERR(page)) {
> +			llbitmap_free_pages(llbitmap);
> +			return PTR_ERR(page);
> +		}
> +
> +		if (percpu_ref_init(&pctl->active, active_release,
> +				    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
> +			__free_page(page);
> +			llbitmap_free_pages(llbitmap);
> +			return -ENOMEM;
> +		}
> +
> +		pctl->page = page;
> +		pctl->state = page_address(page);
> +		init_waitqueue_head(&pctl->wait);
> +	}
> +
> +	return 0;
> +}
> +
> +#endif /* CONFIG_MD_LLBITMAP */


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 15/23] md/md-llbitmap: implement llbitmap IO
  2025-06-06  3:21   ` Xiao Ni
@ 2025-06-06  3:48     ` Yu Kuai
  2025-06-06  6:24       ` Xiao Ni
  0 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-06-06  3:48 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai, hch, colyli, song
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

在 2025/06/06 11:21, Xiao Ni 写道:
> Hi Kuai
> 
> I've read some codes of llbitmap, but I don't figure out the 
> relationship of in memory bits and in storage bits. Does llbitmap have 
> the two types as old bitmap? For example, in llbitmap_create, there is a 
> argument ->bits_per_page which is calculated by 
> PAGE_SIZE/logical_block_size. As the graph bellow, bits_per_page is 8 
> (4K/512byte). What does the bit mean? And in the graph below, it talks 
> 512 bits in one block, what does this bit mean?  I haven't walked 
> through all codes, maybe I can get the answer myself. If you can give a 
> summary of how many types of bit and what's the usage of the bit, it can 
> help to understand it easier.

llbitmap bit is always 1 byte, it's the same in memory and on disk.

bits_per_page bit is used to track dirty sectors in the memory page.

For example, usually 4k page will contain 8 sectors, each sector is 512
bytes, if one llbitmap bit is dirty, then the related bits_per_page bit
will be set as well, and later will write the sector to disk.

Thanks,
kuai

> 
> Best Regards
> 
> Xiao
> 
> 在 2025/5/24 下午2:13, Yu Kuai 写道:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> READ
>>
>> While creating bitmap, all pages will be allocated and read for llbitmap,
>> there won't be read afterwards
>>
>> WRITE
>>
>> WRITE IO is divided into logical_block_size of the page, the dirty state
>> of each block is tracked independently, for example:
>>
>> each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 
>> bit;
>>
>> | page0 | page1 | ... | page 31 |
>> |       |
>> |        \-----------------------\
>> |                                |
>> | block0 | block1 | ... | block 8|
>> |        |
>> |         \-----------------\
>> |                            |
>> | bit0 | bit1 | ... | bit511 |
>>
>>  From IO path, if one bit is changed to Dirty or NeedSync, the 
>> corresponding
>> subpage will be marked dirty, such block must write first before the 
>> IO is
>> issued. This behaviour will affect IO performance, to reduce the 
>> impact, if
>> multiple bits are changed in the same block in a short time, all bits in
>> this block will be changed to Dirty/NeedSync, so that there won't be any
>> overhead until daemon clears dirty bits.
>>
>> Also add data structure definition and comments.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   drivers/md/md-llbitmap.c | 571 +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 571 insertions(+)
>>   create mode 100644 drivers/md/md-llbitmap.c
>>
>> diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
>> new file mode 100644
>> index 000000000000..1a01b6777527
>> --- /dev/null
>> +++ b/drivers/md/md-llbitmap.c
>> @@ -0,0 +1,571 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +
>> +#ifdef CONFIG_MD_LLBITMAP
>> +
>> +#include <linux/blkdev.h>
>> +#include <linux/module.h>
>> +#include <linux/errno.h>
>> +#include <linux/slab.h>
>> +#include <linux/init.h>
>> +#include <linux/timer.h>
>> +#include <linux/sched.h>
>> +#include <linux/list.h>
>> +#include <linux/file.h>
>> +#include <linux/seq_file.h>
>> +#include <trace/events/block.h>
>> +
>> +#include "md.h"
>> +#include "md-bitmap.h"
>> +
>> +/*
>> + * #### Background
>> + *
>> + * Redundant data is used to enhance data fault tolerance, and the 
>> storage
>> + * method for redundant data vary depending on the RAID levels. And it's
>> + * important to maintain the consistency of redundant data.
>> + *
>> + * Bitmap is used to record which data blocks have been synchronized 
>> and which
>> + * ones need to be resynchronized or recovered. Each bit in the bitmap
>> + * represents a segment of data in the array. When a bit is set, it 
>> indicates
>> + * that the multiple redundant copies of that data segment may not be
>> + * consistent. Data synchronization can be performed based on the 
>> bitmap after
>> + * power failure or readding a disk. If there is no bitmap, a full disk
>> + * synchronization is required.
>> + *
>> + * #### Key Features
>> + *
>> + *  - IO fastpath is lockless, if user issues lots of write IO to the 
>> same
>> + *  bitmap bit in a short time, only the first write have additional 
>> overhead
>> + *  to update bitmap bit, no additional overhead for the following 
>> writes;
>> + *  - support only resync or recover written data, means in the case 
>> creating
>> + *  new array or replacing with a new disk, there is no need to do a 
>> full disk
>> + *  resync/recovery;
>> + *
>> + * #### Key Concept
>> + *
>> + * ##### State Machine
>> + *
>> + * Each bit is one byte, contain 6 difference state, see 
>> llbitmap_state. And
>> + * there are total 8 differenct actions, see llbitmap_action, can 
>> change state:
>> + *
>> + * llbitmap state machine: transitions between states
>> + *
>> + * |           | Startwrite | Startsync | Endsync | Abortsync|
>> + * | --------- | ---------- | --------- | ------- | -------  |
>> + * | Unwritten | Dirty      | x         | x       | x        |
>> + * | Clean     | Dirty      | x         | x       | x        |
>> + * | Dirty     | x          | x         | x       | x        |
>> + * | NeedSync  | x          | Syncing   | x       | x        |
>> + * | Syncing   | x          | Syncing   | Dirty   | NeedSync |
>> + *
>> + * |           | Reload   | Daemon | Discard   | Stale     |
>> + * | --------- | -------- | ------ | --------- | --------- |
>> + * | Unwritten | x        | x      | x         | x         |
>> + * | Clean     | x        | x      | Unwritten | NeedSync  |
>> + * | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
>> + * | NeedSync  | x        | x      | Unwritten | x         |
>> + * | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
>> + *
>> + * Typical scenarios:
>> + *
>> + * 1) Create new array
>> + * All bits will be set to Unwritten by default, if --assume-clean is 
>> set,
>> + * all bits will be set to Clean instead.
>> + *
>> + * 2) write data, raid1/raid10 have full copy of data, while raid456 
>> doesn't and
>> + * rely on xor data
>> + *
>> + * 2.1) write new data to raid1/raid10:
>> + * Unwritten --StartWrite--> Dirty
>> + *
>> + * 2.2) write new data to raid456:
>> + * Unwritten --StartWrite--> NeedSync
>> + *
>> + * Because the initial recover for raid456 is skipped, the xor data 
>> is not build
>> + * yet, the bit must set to NeedSync first and after lazy initial 
>> recover is
>> + * finished, the bit will finially set to Dirty(see 5.1 and 5.4);
>> + *
>> + * 2.3) cover write
>> + * Clean --StartWrite--> Dirty
>> + *
>> + * 3) daemon, if the array is not degraded:
>> + * Dirty --Daemon--> Clean
>> + *
>> + * For degraded array, the Dirty bit will never be cleared, prevent 
>> full disk
>> + * recovery while readding a removed disk.
>> + *
>> + * 4) discard
>> + * {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
>> + *
>> + * 5) resync and recover
>> + *
>> + * 5.1) common process
>> + * NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
>> + *
>> + * 5.2) resync after power failure
>> + * Dirty --Reload--> NeedSync
>> + *
>> + * 5.3) recover while replacing with a new disk
>> + * By default, the old bitmap framework will recover all data, and 
>> llbitmap
>> + * implement this by a new helper, see llbitmap_skip_sync_blocks:
>> + *
>> + * skip recover for bits other than dirty or clean;
>> + *
>> + * 5.4) lazy initial recover for raid5:
>> + * By default, the old bitmap framework will only allow new recover 
>> when there
>> + * are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER 
>> is add
>> + * to perform raid456 lazy recover for set bits(from 2.2).
>> + *
>> + * ##### Bitmap IO
>> + *
>> + * ##### Chunksize
>> + *
>> + * The default bitmap size is 128k, incluing 1k bitmap super block, and
>> + * the default size of segment of data in the array each 
>> bit(chunksize) is 64k,
>> + * and chunksize will adjust to twice the old size each time if the 
>> total number
>> + * bits is not less than 127k.(see llbitmap_init)
>> + *
>> + * ##### READ
>> + *
>> + * While creating bitmap, all pages will be allocated and read for 
>> llbitmap,
>> + * there won't be read afterwards
>> + *
>> + * ##### WRITE
>> + *
>> + * WRITE IO is divided into logical_block_size of the array, the 
>> dirty state
>> + * of each block is tracked independently, for example:
>> + *
>> + * each page is 4k, contain 8 blocks; each block is 512 bytes contain 
>> 512 bit;
>> + *
>> + * | page0 | page1 | ... | page 31 |
>> + * |       |
>> + * |        \-----------------------\
>> + * |                                |
>> + * | block0 | block1 | ... | block 8|
>> + * |        |
>> + * |         \-----------------\
>> + * |                            |
>> + * | bit0 | bit1 | ... | bit511 |
>> + *
>> + * From IO path, if one bit is changed to Dirty or NeedSync, the 
>> corresponding
>> + * subpage will be marked dirty, such block must write first before 
>> the IO is
>> + * issued. This behaviour will affect IO performance, to reduce the 
>> impact, if
>> + * multiple bits are changed in the same block in a short time, all 
>> bits in this
>> + * block will be changed to Dirty/NeedSync, so that there won't be 
>> any overhead
>> + * until daemon clears dirty bits.
>> + *
>> + * ##### Dirty Bits syncronization
>> + *
>> + * IO fast path will set bits to dirty, and those dirty bits will be 
>> cleared
>> + * by daemon after IO is done. llbitmap_page_ctl is used to 
>> synchronize between
>> + * IO path and daemon;
>> + *
>> + * IO path:
>> + *  1) try to grab a reference, if succeed, set expire time after 5s 
>> and return;
>> + *  2) if failed to grab a reference, wait for daemon to finish 
>> clearing dirty
>> + *  bits;
>> + *
>> + * Daemon(Daemon will be waken up every daemon_sleep seconds):
>> + * For each page:
>> + *  1) check if page expired, if not skip this page; for expired page:
>> + *  2) suspend the page and wait for inflight write IO to be done;
>> + *  3) change dirty page to clean;
>> + *  4) resume the page;
>> + */
>> +
>> +#define BITMAP_SB_SIZE 1024
>> +
>> +/* 64k is the max IO size of sync IO for raid1/raid10 */
>> +#define MIN_CHUNK_SIZE (64 * 2)
>> +
>> +/* By default, daemon will be waken up every 30s */
>> +#define DEFAULT_DAEMON_SLEEP 30
>> +
>> +/*
>> + * Dirtied bits that have not been accessed for more than 5s will be 
>> cleared
>> + * by daemon.
>> + */
>> +#define BARRIER_IDLE 5
>> +
>> +enum llbitmap_state {
>> +    /* No valid data, init state after assemble the array */
>> +    BitUnwritten = 0,
>> +    /* data is consistent */
>> +    BitClean,
>> +    /* data will be consistent after IO is done, set directly for 
>> writes */
>> +    BitDirty,
>> +    /*
>> +     * data need to be resynchronized:
>> +     * 1) set directly for writes if array is degraded, prevent full 
>> disk
>> +     * synchronization after readding a disk;
>> +     * 2) reassemble the array after power failure, and dirty bits are
>> +     * found after reloading the bitmap;
>> +     * 3) set for first write for raid5, to build initial xor data 
>> lazily
>> +     */
>> +    BitNeedSync,
>> +    /* data is synchronizing */
>> +    BitSyncing,
>> +    nr_llbitmap_state,
>> +    BitNone = 0xff,
>> +};
>> +
>> +enum llbitmap_action {
>> +    /* User write new data, this is the only action from IO fast path */
>> +    BitmapActionStartwrite = 0,
>> +    /* Start recovery */
>> +    BitmapActionStartsync,
>> +    /* Finish recovery */
>> +    BitmapActionEndsync,
>> +    /* Failed recovery */
>> +    BitmapActionAbortsync,
>> +    /* Reassemble the array */
>> +    BitmapActionReload,
>> +    /* Daemon thread is trying to clear dirty bits */
>> +    BitmapActionDaemon,
>> +    /* Data is deleted */
>> +    BitmapActionDiscard,
>> +    /*
>> +     * Bitmap is stale, mark all bits in addition to BitUnwritten to
>> +     * BitNeedSync.
>> +     */
>> +    BitmapActionStale,
>> +    nr_llbitmap_action,
>> +    /* Init state is BitUnwritten */
>> +    BitmapActionInit,
>> +};
>> +
>> +enum llbitmap_page_state {
>> +    LLPageFlush = 0,
>> +    LLPageDirty,
>> +};
>> +
>> +struct llbitmap_page_ctl {
>> +    char *state;
>> +    struct page *page;
>> +    unsigned long expire;
>> +    unsigned long flags;
>> +    wait_queue_head_t wait;
>> +    struct percpu_ref active;
>> +    /* Per block size dirty state, maximum 64k page / 1 sector = 128 */
>> +    unsigned long dirty[];
>> +};
>> +
>> +struct llbitmap {
>> +    struct mddev *mddev;
>> +    struct llbitmap_page_ctl **pctl;
>> +
>> +    unsigned int nr_pages;
>> +    unsigned int io_size;
>> +    unsigned int bits_per_page;
>> +
>> +    /* shift of one chunk */
>> +    unsigned long chunkshift;
>> +    /* size of one chunk in sector */
>> +    unsigned long chunksize;
>> +    /* total number of chunks */
>> +    unsigned long chunks;
>> +    unsigned long last_end_sync;
>> +    /* fires on first BitDirty state */
>> +    struct timer_list pending_timer;
>> +    struct work_struct daemon_work;
>> +
>> +    unsigned long flags;
>> +    __u64    events_cleared;
>> +
>> +    /* for slow disks */
>> +    atomic_t behind_writes;
>> +    wait_queue_head_t behind_wait;
>> +};
>> +
>> +struct llbitmap_unplug_work {
>> +    struct work_struct work;
>> +    struct llbitmap *llbitmap;
>> +    struct completion *done;
>> +};
>> +
>> +static struct workqueue_struct *md_llbitmap_io_wq;
>> +static struct workqueue_struct *md_llbitmap_unplug_wq;
>> +
>> +static char state_machine[nr_llbitmap_state][nr_llbitmap_action] = {
>> +    [BitUnwritten] = {
>> +        [BitmapActionStartwrite]    = BitDirty,
>> +        [BitmapActionStartsync]        = BitNone,
>> +        [BitmapActionEndsync]        = BitNone,
>> +        [BitmapActionAbortsync]        = BitNone,
>> +        [BitmapActionReload]        = BitNone,
>> +        [BitmapActionDaemon]        = BitNone,
>> +        [BitmapActionDiscard]        = BitNone,
>> +        [BitmapActionStale]        = BitNone,
>> +    },
>> +    [BitClean] = {
>> +        [BitmapActionStartwrite]    = BitDirty,
>> +        [BitmapActionStartsync]        = BitNone,
>> +        [BitmapActionEndsync]        = BitNone,
>> +        [BitmapActionAbortsync]        = BitNone,
>> +        [BitmapActionReload]        = BitNone,
>> +        [BitmapActionDaemon]        = BitNone,
>> +        [BitmapActionDiscard]        = BitUnwritten,
>> +        [BitmapActionStale]        = BitNeedSync,
>> +    },
>> +    [BitDirty] = {
>> +        [BitmapActionStartwrite]    = BitNone,
>> +        [BitmapActionStartsync]        = BitNone,
>> +        [BitmapActionEndsync]        = BitNone,
>> +        [BitmapActionAbortsync]        = BitNone,
>> +        [BitmapActionReload]        = BitNeedSync,
>> +        [BitmapActionDaemon]        = BitClean,
>> +        [BitmapActionDiscard]        = BitUnwritten,
>> +        [BitmapActionStale]        = BitNeedSync,
>> +    },
>> +    [BitNeedSync] = {
>> +        [BitmapActionStartwrite]    = BitNone,
>> +        [BitmapActionStartsync]        = BitSyncing,
>> +        [BitmapActionEndsync]        = BitNone,
>> +        [BitmapActionAbortsync]        = BitNone,
>> +        [BitmapActionReload]        = BitNone,
>> +        [BitmapActionDaemon]        = BitNone,
>> +        [BitmapActionDiscard]        = BitUnwritten,
>> +        [BitmapActionStale]        = BitNone,
>> +    },
>> +    [BitSyncing] = {
>> +        [BitmapActionStartwrite]    = BitNone,
>> +        [BitmapActionStartsync]        = BitSyncing,
>> +        [BitmapActionEndsync]        = BitDirty,
>> +        [BitmapActionAbortsync]        = BitNeedSync,
>> +        [BitmapActionReload]        = BitNeedSync,
>> +        [BitmapActionDaemon]        = BitNone,
>> +        [BitmapActionDiscard]        = BitUnwritten,
>> +        [BitmapActionStale]        = BitNeedSync,
>> +    },
>> +};
>> +
>> +static enum llbitmap_state llbitmap_read(struct llbitmap *llbitmap, 
>> loff_t pos)
>> +{
>> +    unsigned int idx;
>> +    unsigned int offset;
>> +
>> +    pos += BITMAP_SB_SIZE;
>> +    idx = pos >> PAGE_SHIFT;
>> +    offset = offset_in_page(pos);
>> +
>> +    return llbitmap->pctl[idx]->state[offset];
>> +}
>> +
>> +/* set all the bits in the subpage as dirty */
>> +static void llbitmap_infect_dirty_bits(struct llbitmap *llbitmap,
>> +                       struct llbitmap_page_ctl *pctl,
>> +                       unsigned int bit, unsigned int offset)
>> +{
>> +    bool level_456 = raid_is_456(llbitmap->mddev);
>> +    unsigned int io_size = llbitmap->io_size;
>> +    int pos;
>> +
>> +    for (pos = bit * io_size; pos < (bit + 1) * io_size; pos++) {
>> +        if (pos == offset)
>> +            continue;
>> +
>> +        switch (pctl->state[pos]) {
>> +        case BitUnwritten:
>> +            pctl->state[pos] = level_456 ? BitNeedSync : BitDirty;
>> +            break;
>> +        case BitClean:
>> +            pctl->state[pos] = BitDirty;
>> +            break;
>> +        };
>> +    }
>> +
>> +}
>> +
>> +static void llbitmap_set_page_dirty(struct llbitmap *llbitmap, int idx,
>> +                    int offset)
>> +{
>> +    struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
>> +    unsigned int io_size = llbitmap->io_size;
>> +    int bit = offset / io_size;
>> +    int pos;
>> +
>> +    if (!test_bit(LLPageDirty, &pctl->flags))
>> +        set_bit(LLPageDirty, &pctl->flags);
>> +
>> +    /*
>> +     * The subpage usually contains a total of 512 bits. If any 
>> single bit
>> +     * within the subpage is marked as dirty, the entire sector will be
>> +     * written. To avoid impacting write performance, when multiple bits
>> +     * within the same sector are modified within a short time frame, 
>> all
>> +     * bits in the sector will be collectively marked as dirty at once.
>> +     */
>> +    if (test_and_set_bit(bit, pctl->dirty)) {
>> +        llbitmap_infect_dirty_bits(llbitmap, pctl, bit, offset);
>> +        return;
>> +    }
>> +
>> +    for (pos = bit * io_size; pos < (bit + 1) * io_size; pos++) {
>> +        if (pos == offset)
>> +            continue;
>> +        if (pctl->state[pos] == BitDirty ||
>> +            pctl->state[pos] == BitNeedSync) {
>> +            llbitmap_infect_dirty_bits(llbitmap, pctl, bit, offset);
>> +            return;
>> +        }
>> +    }
>> +}
>> +
>> +static void llbitmap_write(struct llbitmap *llbitmap, enum 
>> llbitmap_state state,
>> +               loff_t pos)
>> +{
>> +    unsigned int idx;
>> +    unsigned int offset;
>> +
>> +    pos += BITMAP_SB_SIZE;
>> +    idx = pos >> PAGE_SHIFT;
>> +    offset = offset_in_page(pos);
>> +
>> +    llbitmap->pctl[idx]->state[offset] = state;
>> +    if (state == BitDirty || state == BitNeedSync)
>> +        llbitmap_set_page_dirty(llbitmap, idx, offset);
>> +}
>> +
>> +static struct page *llbitmap_read_page(struct llbitmap *llbitmap, int 
>> idx)
>> +{
>> +    struct mddev *mddev = llbitmap->mddev;
>> +    struct page *page = NULL;
>> +    struct md_rdev *rdev;
>> +
>> +    if (llbitmap->pctl && llbitmap->pctl[idx])
>> +        page = llbitmap->pctl[idx]->page;
>> +    if (page)
>> +        return page;
>> +
>> +    page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>> +    if (!page)
>> +        return ERR_PTR(-ENOMEM);
>> +
>> +    rdev_for_each(rdev, mddev) {
>> +        sector_t sector;
>> +
>> +        if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
>> +            continue;
>> +
>> +        sector = mddev->bitmap_info.offset +
>> +             (idx << PAGE_SECTORS_SHIFT);
>> +
>> +        if (sync_page_io(rdev, sector, PAGE_SIZE, page, REQ_OP_READ,
>> +                 true))
>> +            return page;
>> +
>> +        md_error(mddev, rdev);
>> +    }
>> +
>> +    __free_page(page);
>> +    return ERR_PTR(-EIO);
>> +}
>> +
>> +static void llbitmap_write_page(struct llbitmap *llbitmap, int idx)
>> +{
>> +    struct page *page = llbitmap->pctl[idx]->page;
>> +    struct mddev *mddev = llbitmap->mddev;
>> +    struct md_rdev *rdev;
>> +    int bit;
>> +
>> +    for (bit = 0; bit < llbitmap->bits_per_page; bit++) {
>> +        struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
>> +
>> +        if (!test_and_clear_bit(bit, pctl->dirty))
>> +            continue;
>> +
>> +        rdev_for_each(rdev, mddev) {
>> +            sector_t sector;
>> +            sector_t bit_sector = llbitmap->io_size >> SECTOR_SHIFT;
>> +
>> +            if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
>> +                continue;
>> +
>> +            sector = mddev->bitmap_info.offset + rdev->sb_start +
>> +                 (idx << PAGE_SECTORS_SHIFT) +
>> +                 bit * bit_sector;
>> +            md_write_metadata(mddev, rdev, sector,
>> +                      llbitmap->io_size, page,
>> +                      bit * llbitmap->io_size);
>> +        }
>> +    }
>> +}
>> +
>> +static void active_release(struct percpu_ref *ref)
>> +{
>> +    struct llbitmap_page_ctl *pctl =
>> +        container_of(ref, struct llbitmap_page_ctl, active);
>> +
>> +    wake_up(&pctl->wait);
>> +}
>> +
>> +static void llbitmap_free_pages(struct llbitmap *llbitmap)
>> +{
>> +    int i;
>> +
>> +    if (!llbitmap->pctl)
>> +        return;
>> +
>> +    for (i = 0; i < llbitmap->nr_pages; i++) {
>> +        struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
>> +
>> +        if (!pctl || !pctl->page)
>> +            break;
>> +
>> +        __free_page(pctl->page);
>> +        percpu_ref_exit(&pctl->active);
>> +    }
>> +
>> +    kfree(llbitmap->pctl[0]);
>> +    kfree(llbitmap->pctl);
>> +    llbitmap->pctl = NULL;
>> +}
>> +
>> +static int llbitmap_cache_pages(struct llbitmap *llbitmap)
>> +{
>> +    struct llbitmap_page_ctl *pctl;
>> +    unsigned int nr_pages = DIV_ROUND_UP(llbitmap->chunks + 
>> BITMAP_SB_SIZE,
>> +                         PAGE_SIZE);
>> +    unsigned int size = struct_size(pctl, dirty,
>> +                    BITS_TO_LONGS(llbitmap->bits_per_page));
>> +    int i;
>> +
>> +    llbitmap->pctl = kmalloc_array(nr_pages, sizeof(void *),
>> +                       GFP_KERNEL | __GFP_ZERO);
>> +    if (!llbitmap->pctl)
>> +        return -ENOMEM;
>> +
>> +    size = round_up(size, cache_line_size());
>> +    pctl = kmalloc_array(nr_pages, size, GFP_KERNEL | __GFP_ZERO);
>> +    if (!pctl) {
>> +        kfree(llbitmap->pctl);
>> +        return -ENOMEM;
>> +    }
>> +
>> +    llbitmap->nr_pages = nr_pages;
>> +
>> +    for (i = 0; i < nr_pages; i++, pctl = (void *)pctl + size) {
>> +        struct page *page = llbitmap_read_page(llbitmap, i);
>> +
>> +        llbitmap->pctl[i] = pctl;
>> +
>> +        if (IS_ERR(page)) {
>> +            llbitmap_free_pages(llbitmap);
>> +            return PTR_ERR(page);
>> +        }
>> +
>> +        if (percpu_ref_init(&pctl->active, active_release,
>> +                    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
>> +            __free_page(page);
>> +            llbitmap_free_pages(llbitmap);
>> +            return -ENOMEM;
>> +        }
>> +
>> +        pctl->page = page;
>> +        pctl->state = page_address(page);
>> +        init_waitqueue_head(&pctl->wait);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +#endif /* CONFIG_MD_LLBITMAP */
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 15/23] md/md-llbitmap: implement llbitmap IO
  2025-06-06  3:48     ` Yu Kuai
@ 2025-06-06  6:24       ` Xiao Ni
  2025-06-06  8:56         ` Yu Kuai
  0 siblings, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-06-06  6:24 UTC (permalink / raw)
  To: Yu Kuai, hch, colyli, song
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)


在 2025/6/6 上午11:48, Yu Kuai 写道:
> Hi,
>
> 在 2025/06/06 11:21, Xiao Ni 写道:
>> Hi Kuai
>>
>> I've read some codes of llbitmap, but I don't figure out the 
>> relationship of in memory bits and in storage bits. Does llbitmap 
>> have the two types as old bitmap? For example, in llbitmap_create, 
>> there is a argument ->bits_per_page which is calculated by 
>> PAGE_SIZE/logical_block_size. As the graph bellow, bits_per_page is 8 
>> (4K/512byte). What does the bit mean? And in the graph below, it 
>> talks 512 bits in one block, what does this bit mean?  I haven't 
>> walked through all codes, maybe I can get the answer myself. If you 
>> can give a summary of how many types of bit and what's the usage of 
>> the bit, it can help to understand it easier.
>
> llbitmap bit is always 1 byte, it's the same in memory and on disk.


I c, thanks for the explanation.

>
> bits_per_page bit is used to track dirty sectors in the memory page.
>
> For example, usually 4k page will contain 8 sectors, each sector is 512
> bytes, if one llbitmap bit is dirty, then the related bits_per_page bit
> will be set as well, and later will write the sector to disk.


Maybe consider another name of bits_per_page? bits_per_page can easily 
let people to think the bitmat bits in one page. Through the graph 
below, maybe blocks_per_page?

Regards

Xiao

>
> Thanks,
> kuai
>
>>
>> Best Regards
>>
>> Xiao
>>
>> 在 2025/5/24 下午2:13, Yu Kuai 写道:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> READ
>>>
>>> While creating bitmap, all pages will be allocated and read for 
>>> llbitmap,
>>> there won't be read afterwards
>>>
>>> WRITE
>>>
>>> WRITE IO is divided into logical_block_size of the page, the dirty 
>>> state
>>> of each block is tracked independently, for example:
>>>
>>> each page is 4k, contain 8 blocks; each block is 512 bytes contain 
>>> 512 bit;
>>>
>>> | page0 | page1 | ... | page 31 |
>>> |       |
>>> |        \-----------------------\
>>> |                                |
>>> | block0 | block1 | ... | block 8|
>>> |        |
>>> |         \-----------------\
>>> |                            |
>>> | bit0 | bit1 | ... | bit511 |
>>>
>>>  From IO path, if one bit is changed to Dirty or NeedSync, the 
>>> corresponding
>>> subpage will be marked dirty, such block must write first before the 
>>> IO is
>>> issued. This behaviour will affect IO performance, to reduce the 
>>> impact, if
>>> multiple bits are changed in the same block in a short time, all 
>>> bits in
>>> this block will be changed to Dirty/NeedSync, so that there won't be 
>>> any
>>> overhead until daemon clears dirty bits.
>>>
>>> Also add data structure definition and comments.
>>>
>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>> ---
>>>   drivers/md/md-llbitmap.c | 571 
>>> +++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 571 insertions(+)
>>>   create mode 100644 drivers/md/md-llbitmap.c
>>>
>>> diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
>>> new file mode 100644
>>> index 000000000000..1a01b6777527
>>> --- /dev/null
>>> +++ b/drivers/md/md-llbitmap.c
>>> @@ -0,0 +1,571 @@
>>> +// SPDX-License-Identifier: GPL-2.0-or-later
>>> +
>>> +#ifdef CONFIG_MD_LLBITMAP
>>> +
>>> +#include <linux/blkdev.h>
>>> +#include <linux/module.h>
>>> +#include <linux/errno.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/init.h>
>>> +#include <linux/timer.h>
>>> +#include <linux/sched.h>
>>> +#include <linux/list.h>
>>> +#include <linux/file.h>
>>> +#include <linux/seq_file.h>
>>> +#include <trace/events/block.h>
>>> +
>>> +#include "md.h"
>>> +#include "md-bitmap.h"
>>> +
>>> +/*
>>> + * #### Background
>>> + *
>>> + * Redundant data is used to enhance data fault tolerance, and the 
>>> storage
>>> + * method for redundant data vary depending on the RAID levels. And 
>>> it's
>>> + * important to maintain the consistency of redundant data.
>>> + *
>>> + * Bitmap is used to record which data blocks have been 
>>> synchronized and which
>>> + * ones need to be resynchronized or recovered. Each bit in the bitmap
>>> + * represents a segment of data in the array. When a bit is set, it 
>>> indicates
>>> + * that the multiple redundant copies of that data segment may not be
>>> + * consistent. Data synchronization can be performed based on the 
>>> bitmap after
>>> + * power failure or readding a disk. If there is no bitmap, a full 
>>> disk
>>> + * synchronization is required.
>>> + *
>>> + * #### Key Features
>>> + *
>>> + *  - IO fastpath is lockless, if user issues lots of write IO to 
>>> the same
>>> + *  bitmap bit in a short time, only the first write have 
>>> additional overhead
>>> + *  to update bitmap bit, no additional overhead for the following 
>>> writes;
>>> + *  - support only resync or recover written data, means in the 
>>> case creating
>>> + *  new array or replacing with a new disk, there is no need to do 
>>> a full disk
>>> + *  resync/recovery;
>>> + *
>>> + * #### Key Concept
>>> + *
>>> + * ##### State Machine
>>> + *
>>> + * Each bit is one byte, contain 6 difference state, see 
>>> llbitmap_state. And
>>> + * there are total 8 differenct actions, see llbitmap_action, can 
>>> change state:
>>> + *
>>> + * llbitmap state machine: transitions between states
>>> + *
>>> + * |           | Startwrite | Startsync | Endsync | Abortsync|
>>> + * | --------- | ---------- | --------- | ------- | ------- |
>>> + * | Unwritten | Dirty      | x         | x       | x |
>>> + * | Clean     | Dirty      | x         | x       | x |
>>> + * | Dirty     | x          | x         | x       | x |
>>> + * | NeedSync  | x          | Syncing   | x       | x |
>>> + * | Syncing   | x          | Syncing   | Dirty   | NeedSync |
>>> + *
>>> + * |           | Reload   | Daemon | Discard   | Stale     |
>>> + * | --------- | -------- | ------ | --------- | --------- |
>>> + * | Unwritten | x        | x      | x         | x         |
>>> + * | Clean     | x        | x      | Unwritten | NeedSync  |
>>> + * | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
>>> + * | NeedSync  | x        | x      | Unwritten | x         |
>>> + * | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
>>> + *
>>> + * Typical scenarios:
>>> + *
>>> + * 1) Create new array
>>> + * All bits will be set to Unwritten by default, if --assume-clean 
>>> is set,
>>> + * all bits will be set to Clean instead.
>>> + *
>>> + * 2) write data, raid1/raid10 have full copy of data, while 
>>> raid456 doesn't and
>>> + * rely on xor data
>>> + *
>>> + * 2.1) write new data to raid1/raid10:
>>> + * Unwritten --StartWrite--> Dirty
>>> + *
>>> + * 2.2) write new data to raid456:
>>> + * Unwritten --StartWrite--> NeedSync
>>> + *
>>> + * Because the initial recover for raid456 is skipped, the xor data 
>>> is not build
>>> + * yet, the bit must set to NeedSync first and after lazy initial 
>>> recover is
>>> + * finished, the bit will finially set to Dirty(see 5.1 and 5.4);
>>> + *
>>> + * 2.3) cover write
>>> + * Clean --StartWrite--> Dirty
>>> + *
>>> + * 3) daemon, if the array is not degraded:
>>> + * Dirty --Daemon--> Clean
>>> + *
>>> + * For degraded array, the Dirty bit will never be cleared, prevent 
>>> full disk
>>> + * recovery while readding a removed disk.
>>> + *
>>> + * 4) discard
>>> + * {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
>>> + *
>>> + * 5) resync and recover
>>> + *
>>> + * 5.1) common process
>>> + * NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> 
>>> Clean
>>> + *
>>> + * 5.2) resync after power failure
>>> + * Dirty --Reload--> NeedSync
>>> + *
>>> + * 5.3) recover while replacing with a new disk
>>> + * By default, the old bitmap framework will recover all data, and 
>>> llbitmap
>>> + * implement this by a new helper, see llbitmap_skip_sync_blocks:
>>> + *
>>> + * skip recover for bits other than dirty or clean;
>>> + *
>>> + * 5.4) lazy initial recover for raid5:
>>> + * By default, the old bitmap framework will only allow new recover 
>>> when there
>>> + * are spares(new disk), a new recovery flag 
>>> MD_RECOVERY_LAZY_RECOVER is add
>>> + * to perform raid456 lazy recover for set bits(from 2.2).
>>> + *
>>> + * ##### Bitmap IO
>>> + *
>>> + * ##### Chunksize
>>> + *
>>> + * The default bitmap size is 128k, incluing 1k bitmap super block, 
>>> and
>>> + * the default size of segment of data in the array each 
>>> bit(chunksize) is 64k,
>>> + * and chunksize will adjust to twice the old size each time if the 
>>> total number
>>> + * bits is not less than 127k.(see llbitmap_init)
>>> + *
>>> + * ##### READ
>>> + *
>>> + * While creating bitmap, all pages will be allocated and read for 
>>> llbitmap,
>>> + * there won't be read afterwards
>>> + *
>>> + * ##### WRITE
>>> + *
>>> + * WRITE IO is divided into logical_block_size of the array, the 
>>> dirty state
>>> + * of each block is tracked independently, for example:
>>> + *
>>> + * each page is 4k, contain 8 blocks; each block is 512 bytes 
>>> contain 512 bit;
>>> + *
>>> + * | page0 | page1 | ... | page 31 |
>>> + * |       |
>>> + * |        \-----------------------\
>>> + * |                                |
>>> + * | block0 | block1 | ... | block 8|
>>> + * |        |
>>> + * |         \-----------------\
>>> + * |                            |
>>> + * | bit0 | bit1 | ... | bit511 |
>>> + *
>>> + * From IO path, if one bit is changed to Dirty or NeedSync, the 
>>> corresponding
>>> + * subpage will be marked dirty, such block must write first before 
>>> the IO is
>>> + * issued. This behaviour will affect IO performance, to reduce the 
>>> impact, if
>>> + * multiple bits are changed in the same block in a short time, all 
>>> bits in this
>>> + * block will be changed to Dirty/NeedSync, so that there won't be 
>>> any overhead
>>> + * until daemon clears dirty bits.
>>> + *
>>> + * ##### Dirty Bits syncronization
>>> + *
>>> + * IO fast path will set bits to dirty, and those dirty bits will 
>>> be cleared
>>> + * by daemon after IO is done. llbitmap_page_ctl is used to 
>>> synchronize between
>>> + * IO path and daemon;
>>> + *
>>> + * IO path:
>>> + *  1) try to grab a reference, if succeed, set expire time after 
>>> 5s and return;
>>> + *  2) if failed to grab a reference, wait for daemon to finish 
>>> clearing dirty
>>> + *  bits;
>>> + *
>>> + * Daemon(Daemon will be waken up every daemon_sleep seconds):
>>> + * For each page:
>>> + *  1) check if page expired, if not skip this page; for expired page:
>>> + *  2) suspend the page and wait for inflight write IO to be done;
>>> + *  3) change dirty page to clean;
>>> + *  4) resume the page;
>>> + */
>>> +
>>> +#define BITMAP_SB_SIZE 1024
>>> +
>>> +/* 64k is the max IO size of sync IO for raid1/raid10 */
>>> +#define MIN_CHUNK_SIZE (64 * 2)
>>> +
>>> +/* By default, daemon will be waken up every 30s */
>>> +#define DEFAULT_DAEMON_SLEEP 30
>>> +
>>> +/*
>>> + * Dirtied bits that have not been accessed for more than 5s will 
>>> be cleared
>>> + * by daemon.
>>> + */
>>> +#define BARRIER_IDLE 5
>>> +
>>> +enum llbitmap_state {
>>> +    /* No valid data, init state after assemble the array */
>>> +    BitUnwritten = 0,
>>> +    /* data is consistent */
>>> +    BitClean,
>>> +    /* data will be consistent after IO is done, set directly for 
>>> writes */
>>> +    BitDirty,
>>> +    /*
>>> +     * data need to be resynchronized:
>>> +     * 1) set directly for writes if array is degraded, prevent 
>>> full disk
>>> +     * synchronization after readding a disk;
>>> +     * 2) reassemble the array after power failure, and dirty bits are
>>> +     * found after reloading the bitmap;
>>> +     * 3) set for first write for raid5, to build initial xor data 
>>> lazily
>>> +     */
>>> +    BitNeedSync,
>>> +    /* data is synchronizing */
>>> +    BitSyncing,
>>> +    nr_llbitmap_state,
>>> +    BitNone = 0xff,
>>> +};
>>> +
>>> +enum llbitmap_action {
>>> +    /* User write new data, this is the only action from IO fast 
>>> path */
>>> +    BitmapActionStartwrite = 0,
>>> +    /* Start recovery */
>>> +    BitmapActionStartsync,
>>> +    /* Finish recovery */
>>> +    BitmapActionEndsync,
>>> +    /* Failed recovery */
>>> +    BitmapActionAbortsync,
>>> +    /* Reassemble the array */
>>> +    BitmapActionReload,
>>> +    /* Daemon thread is trying to clear dirty bits */
>>> +    BitmapActionDaemon,
>>> +    /* Data is deleted */
>>> +    BitmapActionDiscard,
>>> +    /*
>>> +     * Bitmap is stale, mark all bits in addition to BitUnwritten to
>>> +     * BitNeedSync.
>>> +     */
>>> +    BitmapActionStale,
>>> +    nr_llbitmap_action,
>>> +    /* Init state is BitUnwritten */
>>> +    BitmapActionInit,
>>> +};
>>> +
>>> +enum llbitmap_page_state {
>>> +    LLPageFlush = 0,
>>> +    LLPageDirty,
>>> +};
>>> +
>>> +struct llbitmap_page_ctl {
>>> +    char *state;
>>> +    struct page *page;
>>> +    unsigned long expire;
>>> +    unsigned long flags;
>>> +    wait_queue_head_t wait;
>>> +    struct percpu_ref active;
>>> +    /* Per block size dirty state, maximum 64k page / 1 sector = 
>>> 128 */
>>> +    unsigned long dirty[];
>>> +};
>>> +
>>> +struct llbitmap {
>>> +    struct mddev *mddev;
>>> +    struct llbitmap_page_ctl **pctl;
>>> +
>>> +    unsigned int nr_pages;
>>> +    unsigned int io_size;
>>> +    unsigned int bits_per_page;
>>> +
>>> +    /* shift of one chunk */
>>> +    unsigned long chunkshift;
>>> +    /* size of one chunk in sector */
>>> +    unsigned long chunksize;
>>> +    /* total number of chunks */
>>> +    unsigned long chunks;
>>> +    unsigned long last_end_sync;
>>> +    /* fires on first BitDirty state */
>>> +    struct timer_list pending_timer;
>>> +    struct work_struct daemon_work;
>>> +
>>> +    unsigned long flags;
>>> +    __u64    events_cleared;
>>> +
>>> +    /* for slow disks */
>>> +    atomic_t behind_writes;
>>> +    wait_queue_head_t behind_wait;
>>> +};
>>> +
>>> +struct llbitmap_unplug_work {
>>> +    struct work_struct work;
>>> +    struct llbitmap *llbitmap;
>>> +    struct completion *done;
>>> +};
>>> +
>>> +static struct workqueue_struct *md_llbitmap_io_wq;
>>> +static struct workqueue_struct *md_llbitmap_unplug_wq;
>>> +
>>> +static char state_machine[nr_llbitmap_state][nr_llbitmap_action] = {
>>> +    [BitUnwritten] = {
>>> +        [BitmapActionStartwrite]    = BitDirty,
>>> +        [BitmapActionStartsync]        = BitNone,
>>> +        [BitmapActionEndsync]        = BitNone,
>>> +        [BitmapActionAbortsync]        = BitNone,
>>> +        [BitmapActionReload]        = BitNone,
>>> +        [BitmapActionDaemon]        = BitNone,
>>> +        [BitmapActionDiscard]        = BitNone,
>>> +        [BitmapActionStale]        = BitNone,
>>> +    },
>>> +    [BitClean] = {
>>> +        [BitmapActionStartwrite]    = BitDirty,
>>> +        [BitmapActionStartsync]        = BitNone,
>>> +        [BitmapActionEndsync]        = BitNone,
>>> +        [BitmapActionAbortsync]        = BitNone,
>>> +        [BitmapActionReload]        = BitNone,
>>> +        [BitmapActionDaemon]        = BitNone,
>>> +        [BitmapActionDiscard]        = BitUnwritten,
>>> +        [BitmapActionStale]        = BitNeedSync,
>>> +    },
>>> +    [BitDirty] = {
>>> +        [BitmapActionStartwrite]    = BitNone,
>>> +        [BitmapActionStartsync]        = BitNone,
>>> +        [BitmapActionEndsync]        = BitNone,
>>> +        [BitmapActionAbortsync]        = BitNone,
>>> +        [BitmapActionReload]        = BitNeedSync,
>>> +        [BitmapActionDaemon]        = BitClean,
>>> +        [BitmapActionDiscard]        = BitUnwritten,
>>> +        [BitmapActionStale]        = BitNeedSync,
>>> +    },
>>> +    [BitNeedSync] = {
>>> +        [BitmapActionStartwrite]    = BitNone,
>>> +        [BitmapActionStartsync]        = BitSyncing,
>>> +        [BitmapActionEndsync]        = BitNone,
>>> +        [BitmapActionAbortsync]        = BitNone,
>>> +        [BitmapActionReload]        = BitNone,
>>> +        [BitmapActionDaemon]        = BitNone,
>>> +        [BitmapActionDiscard]        = BitUnwritten,
>>> +        [BitmapActionStale]        = BitNone,
>>> +    },
>>> +    [BitSyncing] = {
>>> +        [BitmapActionStartwrite]    = BitNone,
>>> +        [BitmapActionStartsync]        = BitSyncing,
>>> +        [BitmapActionEndsync]        = BitDirty,
>>> +        [BitmapActionAbortsync]        = BitNeedSync,
>>> +        [BitmapActionReload]        = BitNeedSync,
>>> +        [BitmapActionDaemon]        = BitNone,
>>> +        [BitmapActionDiscard]        = BitUnwritten,
>>> +        [BitmapActionStale]        = BitNeedSync,
>>> +    },
>>> +};
>>> +
>>> +static enum llbitmap_state llbitmap_read(struct llbitmap *llbitmap, 
>>> loff_t pos)
>>> +{
>>> +    unsigned int idx;
>>> +    unsigned int offset;
>>> +
>>> +    pos += BITMAP_SB_SIZE;
>>> +    idx = pos >> PAGE_SHIFT;
>>> +    offset = offset_in_page(pos);
>>> +
>>> +    return llbitmap->pctl[idx]->state[offset];
>>> +}
>>> +
>>> +/* set all the bits in the subpage as dirty */
>>> +static void llbitmap_infect_dirty_bits(struct llbitmap *llbitmap,
>>> +                       struct llbitmap_page_ctl *pctl,
>>> +                       unsigned int bit, unsigned int offset)
>>> +{
>>> +    bool level_456 = raid_is_456(llbitmap->mddev);
>>> +    unsigned int io_size = llbitmap->io_size;
>>> +    int pos;
>>> +
>>> +    for (pos = bit * io_size; pos < (bit + 1) * io_size; pos++) {
>>> +        if (pos == offset)
>>> +            continue;
>>> +
>>> +        switch (pctl->state[pos]) {
>>> +        case BitUnwritten:
>>> +            pctl->state[pos] = level_456 ? BitNeedSync : BitDirty;
>>> +            break;
>>> +        case BitClean:
>>> +            pctl->state[pos] = BitDirty;
>>> +            break;
>>> +        };
>>> +    }
>>> +
>>> +}
>>> +
>>> +static void llbitmap_set_page_dirty(struct llbitmap *llbitmap, int 
>>> idx,
>>> +                    int offset)
>>> +{
>>> +    struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
>>> +    unsigned int io_size = llbitmap->io_size;
>>> +    int bit = offset / io_size;
>>> +    int pos;
>>> +
>>> +    if (!test_bit(LLPageDirty, &pctl->flags))
>>> +        set_bit(LLPageDirty, &pctl->flags);
>>> +
>>> +    /*
>>> +     * The subpage usually contains a total of 512 bits. If any 
>>> single bit
>>> +     * within the subpage is marked as dirty, the entire sector 
>>> will be
>>> +     * written. To avoid impacting write performance, when multiple 
>>> bits
>>> +     * within the same sector are modified within a short time 
>>> frame, all
>>> +     * bits in the sector will be collectively marked as dirty at 
>>> once.
>>> +     */
>>> +    if (test_and_set_bit(bit, pctl->dirty)) {
>>> +        llbitmap_infect_dirty_bits(llbitmap, pctl, bit, offset);
>>> +        return;
>>> +    }
>>> +
>>> +    for (pos = bit * io_size; pos < (bit + 1) * io_size; pos++) {
>>> +        if (pos == offset)
>>> +            continue;
>>> +        if (pctl->state[pos] == BitDirty ||
>>> +            pctl->state[pos] == BitNeedSync) {
>>> +            llbitmap_infect_dirty_bits(llbitmap, pctl, bit, offset);
>>> +            return;
>>> +        }
>>> +    }
>>> +}
>>> +
>>> +static void llbitmap_write(struct llbitmap *llbitmap, enum 
>>> llbitmap_state state,
>>> +               loff_t pos)
>>> +{
>>> +    unsigned int idx;
>>> +    unsigned int offset;
>>> +
>>> +    pos += BITMAP_SB_SIZE;
>>> +    idx = pos >> PAGE_SHIFT;
>>> +    offset = offset_in_page(pos);
>>> +
>>> +    llbitmap->pctl[idx]->state[offset] = state;
>>> +    if (state == BitDirty || state == BitNeedSync)
>>> +        llbitmap_set_page_dirty(llbitmap, idx, offset);
>>> +}
>>> +
>>> +static struct page *llbitmap_read_page(struct llbitmap *llbitmap, 
>>> int idx)
>>> +{
>>> +    struct mddev *mddev = llbitmap->mddev;
>>> +    struct page *page = NULL;
>>> +    struct md_rdev *rdev;
>>> +
>>> +    if (llbitmap->pctl && llbitmap->pctl[idx])
>>> +        page = llbitmap->pctl[idx]->page;
>>> +    if (page)
>>> +        return page;
>>> +
>>> +    page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>>> +    if (!page)
>>> +        return ERR_PTR(-ENOMEM);
>>> +
>>> +    rdev_for_each(rdev, mddev) {
>>> +        sector_t sector;
>>> +
>>> +        if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
>>> +            continue;
>>> +
>>> +        sector = mddev->bitmap_info.offset +
>>> +             (idx << PAGE_SECTORS_SHIFT);
>>> +
>>> +        if (sync_page_io(rdev, sector, PAGE_SIZE, page, REQ_OP_READ,
>>> +                 true))
>>> +            return page;
>>> +
>>> +        md_error(mddev, rdev);
>>> +    }
>>> +
>>> +    __free_page(page);
>>> +    return ERR_PTR(-EIO);
>>> +}
>>> +
>>> +static void llbitmap_write_page(struct llbitmap *llbitmap, int idx)
>>> +{
>>> +    struct page *page = llbitmap->pctl[idx]->page;
>>> +    struct mddev *mddev = llbitmap->mddev;
>>> +    struct md_rdev *rdev;
>>> +    int bit;
>>> +
>>> +    for (bit = 0; bit < llbitmap->bits_per_page; bit++) {
>>> +        struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
>>> +
>>> +        if (!test_and_clear_bit(bit, pctl->dirty))
>>> +            continue;
>>> +
>>> +        rdev_for_each(rdev, mddev) {
>>> +            sector_t sector;
>>> +            sector_t bit_sector = llbitmap->io_size >> SECTOR_SHIFT;
>>> +
>>> +            if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
>>> +                continue;
>>> +
>>> +            sector = mddev->bitmap_info.offset + rdev->sb_start +
>>> +                 (idx << PAGE_SECTORS_SHIFT) +
>>> +                 bit * bit_sector;
>>> +            md_write_metadata(mddev, rdev, sector,
>>> +                      llbitmap->io_size, page,
>>> +                      bit * llbitmap->io_size);
>>> +        }
>>> +    }
>>> +}
>>> +
>>> +static void active_release(struct percpu_ref *ref)
>>> +{
>>> +    struct llbitmap_page_ctl *pctl =
>>> +        container_of(ref, struct llbitmap_page_ctl, active);
>>> +
>>> +    wake_up(&pctl->wait);
>>> +}
>>> +
>>> +static void llbitmap_free_pages(struct llbitmap *llbitmap)
>>> +{
>>> +    int i;
>>> +
>>> +    if (!llbitmap->pctl)
>>> +        return;
>>> +
>>> +    for (i = 0; i < llbitmap->nr_pages; i++) {
>>> +        struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
>>> +
>>> +        if (!pctl || !pctl->page)
>>> +            break;
>>> +
>>> +        __free_page(pctl->page);
>>> +        percpu_ref_exit(&pctl->active);
>>> +    }
>>> +
>>> +    kfree(llbitmap->pctl[0]);
>>> +    kfree(llbitmap->pctl);
>>> +    llbitmap->pctl = NULL;
>>> +}
>>> +
>>> +static int llbitmap_cache_pages(struct llbitmap *llbitmap)
>>> +{
>>> +    struct llbitmap_page_ctl *pctl;
>>> +    unsigned int nr_pages = DIV_ROUND_UP(llbitmap->chunks + 
>>> BITMAP_SB_SIZE,
>>> +                         PAGE_SIZE);
>>> +    unsigned int size = struct_size(pctl, dirty,
>>> + BITS_TO_LONGS(llbitmap->bits_per_page));
>>> +    int i;
>>> +
>>> +    llbitmap->pctl = kmalloc_array(nr_pages, sizeof(void *),
>>> +                       GFP_KERNEL | __GFP_ZERO);
>>> +    if (!llbitmap->pctl)
>>> +        return -ENOMEM;
>>> +
>>> +    size = round_up(size, cache_line_size());
>>> +    pctl = kmalloc_array(nr_pages, size, GFP_KERNEL | __GFP_ZERO);
>>> +    if (!pctl) {
>>> +        kfree(llbitmap->pctl);
>>> +        return -ENOMEM;
>>> +    }
>>> +
>>> +    llbitmap->nr_pages = nr_pages;
>>> +
>>> +    for (i = 0; i < nr_pages; i++, pctl = (void *)pctl + size) {
>>> +        struct page *page = llbitmap_read_page(llbitmap, i);
>>> +
>>> +        llbitmap->pctl[i] = pctl;
>>> +
>>> +        if (IS_ERR(page)) {
>>> +            llbitmap_free_pages(llbitmap);
>>> +            return PTR_ERR(page);
>>> +        }
>>> +
>>> +        if (percpu_ref_init(&pctl->active, active_release,
>>> +                    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
>>> +            __free_page(page);
>>> +            llbitmap_free_pages(llbitmap);
>>> +            return -ENOMEM;
>>> +        }
>>> +
>>> +        pctl->page = page;
>>> +        pctl->state = page_address(page);
>>> +        init_waitqueue_head(&pctl->wait);
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +#endif /* CONFIG_MD_LLBITMAP */
>>
>>
>> .
>>
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 15/23] md/md-llbitmap: implement llbitmap IO
  2025-06-06  6:24       ` Xiao Ni
@ 2025-06-06  8:56         ` Yu Kuai
  0 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-06-06  8:56 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai, hch, colyli, song
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

在 2025/06/06 14:24, Xiao Ni 写道:
> Maybe consider another name of bits_per_page? bits_per_page can easily 
> let people to think the bitmat bits in one page. Through the graph 
> below, maybe blocks_per_page?

Sounds good. :)

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap
  2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (24 preceding siblings ...)
  2025-05-30  6:45 ` Yu Kuai
@ 2025-06-30  1:59 ` Xiao Ni
  2025-06-30  2:34   ` Yu Kuai
  25 siblings, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-06-30  1:59 UTC (permalink / raw)
  To: Yu Kuai, hch, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi


在 2025/5/24 下午2:12, Yu Kuai 写道:
> From: Yu Kuai <yukuai3@huawei.com>
>
> This is the formal version after previous RFC version:
>
> https://lore.kernel.org/all/20250512011927.2809400-1-yukuai1@huaweicloud.com/
>
> #### Background
>
> Redundant data is used to enhance data fault tolerance, and the storage
> method for redundant data vary depending on the RAID levels. And it's
> important to maintain the consistency of redundant data.
>
> Bitmap is used to record which data blocks have been synchronized and which
> ones need to be resynchronized or recovered. Each bit in the bitmap
> represents a segment of data in the array. When a bit is set, it indicates
> that the multiple redundant copies of that data segment may not be
> consistent. Data synchronization can be performed based on the bitmap after
> power failure or readding a disk. If there is no bitmap, a full disk
> synchronization is required.


Hi Kuai

>
> #### Key Features
>
>   - IO fastpath is lockless, if user issues lots of write IO to the same
>   bitmap bit in a short time, only the first write have additional overhead
>   to update bitmap bit, no additional overhead for the following writes;

After reading other patches, I want to check if I understand right.

The first write sets the bitmap bit. The second write which hits the 
same block (one sector, 512 bits) will call llbitmap_infect_dirty_bits 
to set all other bits. Then the third write doesn't need to set bitmap 
bits. If I'm right, the comments above should say only the first two 
writes have additional overhead?


>   - support only resync or recover written data, means in the case creating
>   new array or replacing with a new disk, there is no need to do a full disk
>   resync/recovery;
>
> #### Key Concept
>
> ##### State Machine
>
> Each bit is one byte, contain 6 difference state, see llbitmap_state. And
> there are total 8 differenct actions, see llbitmap_action, can change state:
>
> llbitmap state machine: transitions between states
>
> |           | Startwrite | Startsync | Endsync | Abortsync|
> | --------- | ---------- | --------- | ------- | -------  |
> | Unwritten | Dirty      | x         | x       | x        |
> | Clean     | Dirty      | x         | x       | x        |
> | Dirty     | x          | x         | x       | x        |
> | NeedSync  | x          | Syncing   | x       | x        |
> | Syncing   | x          | Syncing   | Dirty   | NeedSync |
>
> |           | Reload   | Daemon | Discard   | Stale     |
> | --------- | -------- | ------ | --------- | --------- |
> | Unwritten | x        | x      | x         | x         |
> | Clean     | x        | x      | Unwritten | NeedSync  |
> | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
> | NeedSync  | x        | x      | Unwritten | x         |
> | Syncing   | NeedSync | x      | Unwritten | NeedSync  |


For Reload action, if the bitmap bit is NeedSync, the changed status 
will be x. It can't trigger resync/recovery.

For example:

cat /sys/block/md127/md/llbitmap/bits
unwritten 3480
clean 2
dirty 0
need sync 510

It doesn't do resync after aseembling the array. Does it need to modify 
the changed status from x to NeedSync?


Best Regards

Xiao

>
> Typical scenarios:
>
> 1) Create new array
> All bits will be set to Unwritten by default, if --assume-clean is set,
> all bits will be set to Clean instead.
>
> 2) write data, raid1/raid10 have full copy of data, while raid456 doesn't and
> rely on xor data
>
> 2.1) write new data to raid1/raid10:
> Unwritten --StartWrite--> Dirty
>
> 2.2) write new data to raid456:
> Unwritten --StartWrite--> NeedSync
>
> Because the initial recover for raid456 is skipped, the xor data is not build
> yet, the bit must set to NeedSync first and after lazy initial recover is
> finished, the bit will finially set to Dirty(see 5.1 and 5.4);
>
> 2.3) cover write
> Clean --StartWrite--> Dirty
>
> 3) daemon, if the array is not degraded:
> Dirty --Daemon--> Clean
>
> For degraded array, the Dirty bit will never be cleared, prevent full disk
> recovery while readding a removed disk.
>
> 4) discard
> {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
>
> 5) resync and recover
>
> 5.1) common process
> NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
>
> 5.2) resync after power failure
> Dirty --Reload--> NeedSync
>
> 5.3) recover while replacing with a new disk
> By default, the old bitmap framework will recover all data, and llbitmap
> implement this by a new helper, see llbitmap_skip_sync_blocks:
>
> skip recover for bits other than dirty or clean;
>
> 5.4) lazy initial recover for raid5:
> By default, the old bitmap framework will only allow new recover when there
> are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER is add
> to perform raid456 lazy recover for set bits(from 2.2).
>
> ##### Bitmap IO
>
> ##### Chunksize
>
> The default bitmap size is 128k, incluing 1k bitmap super block, and
> the default size of segment of data in the array each bit(chunksize) is 64k,
> and chunksize will adjust to twice the old size each time if the total number
> bits is not less than 127k.(see llbitmap_init)
>
> ##### READ
>
> While creating bitmap, all pages will be allocated and read for llbitmap,
> there won't be read afterwards
>
> ##### WRITE
>
> WRITE IO is divided into logical_block_size of the array, the dirty state
> of each block is tracked independently, for example:
>
> each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 bit;
>
> | page0 | page1 | ... | page 31 |
> |       |
> |        \-----------------------\
> |                                |
> | block0 | block1 | ... | block 8|
> |        |
> |         \-----------------\
> |                            |
> | bit0 | bit1 | ... | bit511 |
>
>  From IO path, if one bit is changed to Dirty or NeedSync, the corresponding
> subpage will be marked dirty, such block must write first before the IO is
> issued. This behaviour will affect IO performance, to reduce the impact, if
> multiple bits are changed in the same block in a short time, all bits in this
> block will be changed to Dirty/NeedSync, so that there won't be any overhead
> until daemon clears dirty bits.
>
> ##### Dirty Bits syncronization
>
> IO fast path will set bits to dirty, and those dirty bits will be cleared
> by daemon after IO is done. llbitmap_page_ctl is used to synchronize between
> IO path and daemon;
>
> IO path:
>   1) try to grab a reference, if succeed, set expire time after 5s and return;
>   2) if failed to grab a reference, wait for daemon to finish clearing dirty
>   bits;
>
> Daemon(Daemon will be waken up every daemon_sleep seconds):
> For each page:
>   1) check if page expired, if not skip this page; for expired page:
>   2) suspend the page and wait for inflight write IO to be done;
>   3) change dirty page to clean;
>   4) resume the page;
>
> Performance Test:
> Simple fio randwrite test to build array with 20GB ramdisk in my VM:
>
> |                      | none      | bitmap    | llbitmap  |
> | -------------------- | --------- | --------- | --------- |
> | raid1                | 13.7MiB/s | 9696KiB/s | 19.5MiB/s |
> | raid1(assume clean)  | 19.5MiB/s | 11.9MiB/s | 19.5MiB/s |
> | raid10               | 21.9MiB/s | 11.6MiB/s | 27.8MiB/s |
> | raid10(assume clean) | 27.8MiB/s | 15.4MiB/s | 27.8MiB/s |
> | raid5                | 14.0MiB/s | 11.6MiB/s | 12.9MiB/s |
> | raid5(assume clean)  | 17.8MiB/s | 13.4MiB/s | 13.9MiB/s |
>
> For raid1/raid10 llbitmap can be better than none bitmap with background
> initial resync, and it's the same as none bitmap without it.
>
> Noted that llbitmap performance improvement for raid5 is not obvious,
> this is due to raid5 has many other performance bottleneck, perf
> results still shows that bitmap overhead will be much less.
>
> following branch for review or test:
> https://git.kernel.org/pub/scm/linux/kernel/git/yukuai/linux.git/log/?h=yukuai/md-llbitmap
>
> Yu Kuai (23):
>    md: add a new parameter 'offset' to md_super_write()
>    md: factor out a helper raid_is_456()
>    md/md-bitmap: cleanup bitmap_ops->startwrite()
>    md/md-bitmap: support discard for bitmap ops
>    md/md-bitmap: remove parameter slot from bitmap_create()
>    md/md-bitmap: add a new sysfs api bitmap_type
>    md/md-bitmap: delay registration of bitmap_ops until creating bitmap
>    md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
>    md/md-bitmap: add a new method blocks_synced() in bitmap_operations
>    md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
>    md/md-bitmap: make method bitmap_ops->daemon_work optional
>    md/md-bitmap: add macros for lockless bitmap
>    md/md-bitmap: fix dm-raid max_write_behind setting
>    md/dm-raid: remove max_write_behind setting limit
>    md/md-llbitmap: implement llbitmap IO
>    md/md-llbitmap: implement bit state machine
>    md/md-llbitmap: implement APIs for page level dirty bits
>      synchronization
>    md/md-llbitmap: implement APIs to mange bitmap lifetime
>    md/md-llbitmap: implement APIs to dirty bits and clear bits
>    md/md-llbitmap: implement APIs for sync_thread
>    md/md-llbitmap: implement all bitmap operations
>    md/md-llbitmap: implement sysfs APIs
>    md/md-llbitmap: add Kconfig
>
>   Documentation/admin-guide/md.rst |   80 +-
>   drivers/md/Kconfig               |   11 +
>   drivers/md/Makefile              |    2 +-
>   drivers/md/dm-raid.c             |    6 +-
>   drivers/md/md-bitmap.c           |   50 +-
>   drivers/md/md-bitmap.h           |   55 +-
>   drivers/md/md-llbitmap.c         | 1556 ++++++++++++++++++++++++++++++
>   drivers/md/md.c                  |  247 +++--
>   drivers/md/md.h                  |   20 +-
>   drivers/md/raid5.c               |    6 +
>   10 files changed, 1901 insertions(+), 132 deletions(-)
>   create mode 100644 drivers/md/md-llbitmap.c
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 15/23] md/md-llbitmap: implement llbitmap IO
  2025-05-24  6:13 ` [PATCH 15/23] md/md-llbitmap: implement llbitmap IO Yu Kuai
  2025-05-27  8:27   ` Christoph Hellwig
  2025-06-06  3:21   ` Xiao Ni
@ 2025-06-30  2:07   ` Xiao Ni
  2025-06-30  2:17     ` Yu Kuai
  2 siblings, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-06-30  2:07 UTC (permalink / raw)
  To: Yu Kuai, hch, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi


在 2025/5/24 下午2:13, Yu Kuai 写道:
> From: Yu Kuai <yukuai3@huawei.com>
>
> READ
>
> While creating bitmap, all pages will be allocated and read for llbitmap,
> there won't be read afterwards
>
> WRITE
>
> WRITE IO is divided into logical_block_size of the page, the dirty state
> of each block is tracked independently, for example:
>
> each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 bit;
>
> | page0 | page1 | ... | page 31 |
> |       |
> |        \-----------------------\
> |                                |
> | block0 | block1 | ... | block 8|
> |        |
> |         \-----------------\
> |                            |
> | bit0 | bit1 | ... | bit511 |
>
>  From IO path, if one bit is changed to Dirty or NeedSync, the corresponding
> subpage will be marked dirty, such block must write first before the IO is
> issued. This behaviour will affect IO performance, to reduce the impact, if
> multiple bits are changed in the same block in a short time, all bits in
> this block will be changed to Dirty/NeedSync, so that there won't be any
> overhead until daemon clears dirty bits.
>
> Also add data structure definition and comments.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   drivers/md/md-llbitmap.c | 571 +++++++++++++++++++++++++++++++++++++++
>   1 file changed, 571 insertions(+)
>   create mode 100644 drivers/md/md-llbitmap.c
>
> diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
> new file mode 100644
> index 000000000000..1a01b6777527
> --- /dev/null
> +++ b/drivers/md/md-llbitmap.c
> @@ -0,0 +1,571 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#ifdef CONFIG_MD_LLBITMAP
> +
> +#include <linux/blkdev.h>
> +#include <linux/module.h>
> +#include <linux/errno.h>
> +#include <linux/slab.h>
> +#include <linux/init.h>
> +#include <linux/timer.h>
> +#include <linux/sched.h>
> +#include <linux/list.h>
> +#include <linux/file.h>
> +#include <linux/seq_file.h>
> +#include <trace/events/block.h>
> +
> +#include "md.h"
> +#include "md-bitmap.h"
> +
> +/*
> + * #### Background
> + *
> + * Redundant data is used to enhance data fault tolerance, and the storage
> + * method for redundant data vary depending on the RAID levels. And it's
> + * important to maintain the consistency of redundant data.
> + *
> + * Bitmap is used to record which data blocks have been synchronized and which
> + * ones need to be resynchronized or recovered. Each bit in the bitmap
> + * represents a segment of data in the array. When a bit is set, it indicates
> + * that the multiple redundant copies of that data segment may not be
> + * consistent. Data synchronization can be performed based on the bitmap after
> + * power failure or readding a disk. If there is no bitmap, a full disk
> + * synchronization is required.
> + *
> + * #### Key Features
> + *
> + *  - IO fastpath is lockless, if user issues lots of write IO to the same
> + *  bitmap bit in a short time, only the first write have additional overhead
> + *  to update bitmap bit, no additional overhead for the following writes;
> + *  - support only resync or recover written data, means in the case creating
> + *  new array or replacing with a new disk, there is no need to do a full disk
> + *  resync/recovery;
> + *
> + * #### Key Concept
> + *
> + * ##### State Machine
> + *
> + * Each bit is one byte, contain 6 difference state, see llbitmap_state. And
> + * there are total 8 differenct actions, see llbitmap_action, can change state:
> + *
> + * llbitmap state machine: transitions between states
> + *
> + * |           | Startwrite | Startsync | Endsync | Abortsync|
> + * | --------- | ---------- | --------- | ------- | -------  |
> + * | Unwritten | Dirty      | x         | x       | x        |
> + * | Clean     | Dirty      | x         | x       | x        |
> + * | Dirty     | x          | x         | x       | x        |
> + * | NeedSync  | x          | Syncing   | x       | x        |
> + * | Syncing   | x          | Syncing   | Dirty   | NeedSync |
> + *
> + * |           | Reload   | Daemon | Discard   | Stale     |
> + * | --------- | -------- | ------ | --------- | --------- |
> + * | Unwritten | x        | x      | x         | x         |
> + * | Clean     | x        | x      | Unwritten | NeedSync  |
> + * | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
> + * | NeedSync  | x        | x      | Unwritten | x         |
> + * | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
> + *
> + * Typical scenarios:
> + *
> + * 1) Create new array
> + * All bits will be set to Unwritten by default, if --assume-clean is set,
> + * all bits will be set to Clean instead.
> + *
> + * 2) write data, raid1/raid10 have full copy of data, while raid456 doesn't and
> + * rely on xor data
> + *
> + * 2.1) write new data to raid1/raid10:
> + * Unwritten --StartWrite--> Dirty
> + *
> + * 2.2) write new data to raid456:
> + * Unwritten --StartWrite--> NeedSync
> + *
> + * Because the initial recover for raid456 is skipped, the xor data is not build
> + * yet, the bit must set to NeedSync first and after lazy initial recover is
> + * finished, the bit will finially set to Dirty(see 5.1 and 5.4);
> + *
> + * 2.3) cover write
> + * Clean --StartWrite--> Dirty
> + *
> + * 3) daemon, if the array is not degraded:
> + * Dirty --Daemon--> Clean
> + *
> + * For degraded array, the Dirty bit will never be cleared, prevent full disk
> + * recovery while readding a removed disk.
> + *
> + * 4) discard
> + * {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
> + *
> + * 5) resync and recover
> + *
> + * 5.1) common process
> + * NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
> + *
> + * 5.2) resync after power failure
> + * Dirty --Reload--> NeedSync
> + *
> + * 5.3) recover while replacing with a new disk
> + * By default, the old bitmap framework will recover all data, and llbitmap
> + * implement this by a new helper, see llbitmap_skip_sync_blocks:
> + *
> + * skip recover for bits other than dirty or clean;
> + *
> + * 5.4) lazy initial recover for raid5:
> + * By default, the old bitmap framework will only allow new recover when there
> + * are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER is add
> + * to perform raid456 lazy recover for set bits(from 2.2).
> + *
> + * ##### Bitmap IO
> + *
> + * ##### Chunksize
> + *
> + * The default bitmap size is 128k, incluing 1k bitmap super block, and
> + * the default size of segment of data in the array each bit(chunksize) is 64k,
> + * and chunksize will adjust to twice the old size each time if the total number
> + * bits is not less than 127k.(see llbitmap_init)
> + *
> + * ##### READ
> + *
> + * While creating bitmap, all pages will be allocated and read for llbitmap,
> + * there won't be read afterwards
> + *
> + * ##### WRITE
> + *
> + * WRITE IO is divided into logical_block_size of the array, the dirty state
> + * of each block is tracked independently, for example:
> + *
> + * each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 bit;
> + *
> + * | page0 | page1 | ... | page 31 |
> + * |       |
> + * |        \-----------------------\
> + * |                                |
> + * | block0 | block1 | ... | block 8|
> + * |        |
> + * |         \-----------------\
> + * |                            |
> + * | bit0 | bit1 | ... | bit511 |
> + *
> + * From IO path, if one bit is changed to Dirty or NeedSync, the corresponding
> + * subpage will be marked dirty, such block must write first before the IO is
> + * issued. This behaviour will affect IO performance, to reduce the impact, if
> + * multiple bits are changed in the same block in a short time, all bits in this
> + * block will be changed to Dirty/NeedSync, so that there won't be any overhead
> + * until daemon clears dirty bits.
> + *
> + * ##### Dirty Bits syncronization
> + *
> + * IO fast path will set bits to dirty, and those dirty bits will be cleared
> + * by daemon after IO is done. llbitmap_page_ctl is used to synchronize between
> + * IO path and daemon;
> + *
> + * IO path:
> + *  1) try to grab a reference, if succeed, set expire time after 5s and return;
> + *  2) if failed to grab a reference, wait for daemon to finish clearing dirty
> + *  bits;
> + *
> + * Daemon(Daemon will be waken up every daemon_sleep seconds):
> + * For each page:
> + *  1) check if page expired, if not skip this page; for expired page:
> + *  2) suspend the page and wait for inflight write IO to be done;
> + *  3) change dirty page to clean;
> + *  4) resume the page;
> + */
> +
> +#define BITMAP_SB_SIZE 1024
> +
> +/* 64k is the max IO size of sync IO for raid1/raid10 */
> +#define MIN_CHUNK_SIZE (64 * 2)
> +
> +/* By default, daemon will be waken up every 30s */
> +#define DEFAULT_DAEMON_SLEEP 30
> +
> +/*
> + * Dirtied bits that have not been accessed for more than 5s will be cleared
> + * by daemon.
> + */
> +#define BARRIER_IDLE 5
> +
> +enum llbitmap_state {
> +	/* No valid data, init state after assemble the array */
> +	BitUnwritten = 0,
> +	/* data is consistent */
> +	BitClean,
> +	/* data will be consistent after IO is done, set directly for writes */
> +	BitDirty,
> +	/*
> +	 * data need to be resynchronized:
> +	 * 1) set directly for writes if array is degraded, prevent full disk
> +	 * synchronization after readding a disk;
> +	 * 2) reassemble the array after power failure, and dirty bits are
> +	 * found after reloading the bitmap;
> +	 * 3) set for first write for raid5, to build initial xor data lazily
> +	 */
> +	BitNeedSync,
> +	/* data is synchronizing */
> +	BitSyncing,
> +	nr_llbitmap_state,
> +	BitNone = 0xff,
> +};
> +
> +enum llbitmap_action {
> +	/* User write new data, this is the only action from IO fast path */
> +	BitmapActionStartwrite = 0,
> +	/* Start recovery */
> +	BitmapActionStartsync,
> +	/* Finish recovery */
> +	BitmapActionEndsync,
> +	/* Failed recovery */
> +	BitmapActionAbortsync,
> +	/* Reassemble the array */
> +	BitmapActionReload,
> +	/* Daemon thread is trying to clear dirty bits */
> +	BitmapActionDaemon,
> +	/* Data is deleted */
> +	BitmapActionDiscard,
> +	/*
> +	 * Bitmap is stale, mark all bits in addition to BitUnwritten to
> +	 * BitNeedSync.
> +	 */
> +	BitmapActionStale,
> +	nr_llbitmap_action,
> +	/* Init state is BitUnwritten */
> +	BitmapActionInit,
> +};
> +
> +enum llbitmap_page_state {
> +	LLPageFlush = 0,
> +	LLPageDirty,
> +};
> +
> +struct llbitmap_page_ctl {
> +	char *state;
> +	struct page *page;
> +	unsigned long expire;
> +	unsigned long flags;
> +	wait_queue_head_t wait;
> +	struct percpu_ref active;
> +	/* Per block size dirty state, maximum 64k page / 1 sector = 128 */
> +	unsigned long dirty[];
> +};
> +
> +struct llbitmap {
> +	struct mddev *mddev;
> +	struct llbitmap_page_ctl **pctl;
> +
> +	unsigned int nr_pages;
> +	unsigned int io_size;
> +	unsigned int bits_per_page;
> +
> +	/* shift of one chunk */
> +	unsigned long chunkshift;
> +	/* size of one chunk in sector */
> +	unsigned long chunksize;
> +	/* total number of chunks */
> +	unsigned long chunks;
> +	unsigned long last_end_sync;
> +	/* fires on first BitDirty state */
> +	struct timer_list pending_timer;
> +	struct work_struct daemon_work;
> +
> +	unsigned long flags;
> +	__u64	events_cleared;
> +
> +	/* for slow disks */
> +	atomic_t behind_writes;
> +	wait_queue_head_t behind_wait;
> +};
> +
> +struct llbitmap_unplug_work {
> +	struct work_struct work;
> +	struct llbitmap *llbitmap;
> +	struct completion *done;
> +};
> +
> +static struct workqueue_struct *md_llbitmap_io_wq;
> +static struct workqueue_struct *md_llbitmap_unplug_wq;
> +
> +static char state_machine[nr_llbitmap_state][nr_llbitmap_action] = {
> +	[BitUnwritten] = {
> +		[BitmapActionStartwrite]	= BitDirty,
> +		[BitmapActionStartsync]		= BitNone,
> +		[BitmapActionEndsync]		= BitNone,
> +		[BitmapActionAbortsync]		= BitNone,
> +		[BitmapActionReload]		= BitNone,
> +		[BitmapActionDaemon]		= BitNone,
> +		[BitmapActionDiscard]		= BitNone,
> +		[BitmapActionStale]		= BitNone,
> +	},
> +	[BitClean] = {
> +		[BitmapActionStartwrite]	= BitDirty,
> +		[BitmapActionStartsync]		= BitNone,
> +		[BitmapActionEndsync]		= BitNone,
> +		[BitmapActionAbortsync]		= BitNone,
> +		[BitmapActionReload]		= BitNone,
> +		[BitmapActionDaemon]		= BitNone,
> +		[BitmapActionDiscard]		= BitUnwritten,
> +		[BitmapActionStale]		= BitNeedSync,
> +	},
> +	[BitDirty] = {
> +		[BitmapActionStartwrite]	= BitNone,
> +		[BitmapActionStartsync]		= BitNone,
> +		[BitmapActionEndsync]		= BitNone,
> +		[BitmapActionAbortsync]		= BitNone,
> +		[BitmapActionReload]		= BitNeedSync,
> +		[BitmapActionDaemon]		= BitClean,
> +		[BitmapActionDiscard]		= BitUnwritten,
> +		[BitmapActionStale]		= BitNeedSync,
> +	},
> +	[BitNeedSync] = {
> +		[BitmapActionStartwrite]	= BitNone,
> +		[BitmapActionStartsync]		= BitSyncing,
> +		[BitmapActionEndsync]		= BitNone,
> +		[BitmapActionAbortsync]		= BitNone,
> +		[BitmapActionReload]		= BitNone,
> +		[BitmapActionDaemon]		= BitNone,
> +		[BitmapActionDiscard]		= BitUnwritten,
> +		[BitmapActionStale]		= BitNone,
> +	},
> +	[BitSyncing] = {
> +		[BitmapActionStartwrite]	= BitNone,
> +		[BitmapActionStartsync]		= BitSyncing,
> +		[BitmapActionEndsync]		= BitDirty,
> +		[BitmapActionAbortsync]		= BitNeedSync,
> +		[BitmapActionReload]		= BitNeedSync,
> +		[BitmapActionDaemon]		= BitNone,
> +		[BitmapActionDiscard]		= BitUnwritten,
> +		[BitmapActionStale]		= BitNeedSync,
> +	},
> +};
> +
> +static enum llbitmap_state llbitmap_read(struct llbitmap *llbitmap, loff_t pos)
> +{
> +	unsigned int idx;
> +	unsigned int offset;
> +
> +	pos += BITMAP_SB_SIZE;
> +	idx = pos >> PAGE_SHIFT;
> +	offset = offset_in_page(pos);
> +
> +	return llbitmap->pctl[idx]->state[offset];
> +}
> +
> +/* set all the bits in the subpage as dirty */
> +static void llbitmap_infect_dirty_bits(struct llbitmap *llbitmap,
> +				       struct llbitmap_page_ctl *pctl,
> +				       unsigned int bit, unsigned int offset)
> +{
> +	bool level_456 = raid_is_456(llbitmap->mddev);
> +	unsigned int io_size = llbitmap->io_size;
> +	int pos;
> +
> +	for (pos = bit * io_size; pos < (bit + 1) * io_size; pos++) {
> +		if (pos == offset)
> +			continue;
> +
> +		switch (pctl->state[pos]) {
> +		case BitUnwritten:
> +			pctl->state[pos] = level_456 ? BitNeedSync : BitDirty;
> +			break;
> +		case BitClean:
> +			pctl->state[pos] = BitDirty;
> +			break;
> +		};
> +	}
> +
> +}
> +
> +static void llbitmap_set_page_dirty(struct llbitmap *llbitmap, int idx,
> +				    int offset)
> +{
> +	struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
> +	unsigned int io_size = llbitmap->io_size;
> +	int bit = offset / io_size;
> +	int pos;
> +
> +	if (!test_bit(LLPageDirty, &pctl->flags))
> +		set_bit(LLPageDirty, &pctl->flags);
> +
> +	/*
> +	 * The subpage usually contains a total of 512 bits. If any single bit
> +	 * within the subpage is marked as dirty, the entire sector will be
> +	 * written. To avoid impacting write performance, when multiple bits
> +	 * within the same sector are modified within a short time frame, all
> +	 * bits in the sector will be collectively marked as dirty at once.
> +	 */
> +	if (test_and_set_bit(bit, pctl->dirty)) {
> +		llbitmap_infect_dirty_bits(llbitmap, pctl, bit, offset);
> +		return;
> +	}

Hi Kuai


It's better to change name bit to block
> +
> +	for (pos = bit * io_size; pos < (bit + 1) * io_size; pos++) {
> +		if (pos == offset)
> +			continue;
> +		if (pctl->state[pos] == BitDirty ||
> +		    pctl->state[pos] == BitNeedSync) {
> +			llbitmap_infect_dirty_bits(llbitmap, pctl, bit, offset);
> +			return;
> +		}
> +	}


Can this for loop run? If one bit is dirty, it must set pctl->dirty. So 
the second write comes, it finds pctl->dirty is set and 
llbitmap_infect_dirty_bits function run and return. So it looks like it 
will not run the for loop.

Regards

Xiao

> +}
> +
> +static void llbitmap_write(struct llbitmap *llbitmap, enum llbitmap_state state,
> +			   loff_t pos)
> +{
> +	unsigned int idx;
> +	unsigned int offset;
> +
> +	pos += BITMAP_SB_SIZE;
> +	idx = pos >> PAGE_SHIFT;
> +	offset = offset_in_page(pos);
> +
> +	llbitmap->pctl[idx]->state[offset] = state;
> +	if (state == BitDirty || state == BitNeedSync)
> +		llbitmap_set_page_dirty(llbitmap, idx, offset);
> +}
> +
> +static struct page *llbitmap_read_page(struct llbitmap *llbitmap, int idx)
> +{
> +	struct mddev *mddev = llbitmap->mddev;
> +	struct page *page = NULL;
> +	struct md_rdev *rdev;
> +
> +	if (llbitmap->pctl && llbitmap->pctl[idx])
> +		page = llbitmap->pctl[idx]->page;
> +	if (page)
> +		return page;
> +
> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!page)
> +		return ERR_PTR(-ENOMEM);
> +
> +	rdev_for_each(rdev, mddev) {
> +		sector_t sector;
> +
> +		if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
> +			continue;
> +
> +		sector = mddev->bitmap_info.offset +
> +			 (idx << PAGE_SECTORS_SHIFT);
> +
> +		if (sync_page_io(rdev, sector, PAGE_SIZE, page, REQ_OP_READ,
> +				 true))
> +			return page;
> +
> +		md_error(mddev, rdev);
> +	}
> +
> +	__free_page(page);
> +	return ERR_PTR(-EIO);
> +}
> +
> +static void llbitmap_write_page(struct llbitmap *llbitmap, int idx)
> +{
> +	struct page *page = llbitmap->pctl[idx]->page;
> +	struct mddev *mddev = llbitmap->mddev;
> +	struct md_rdev *rdev;
> +	int bit;
> +
> +	for (bit = 0; bit < llbitmap->bits_per_page; bit++) {
> +		struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
> +
> +		if (!test_and_clear_bit(bit, pctl->dirty))
> +			continue;
> +
> +		rdev_for_each(rdev, mddev) {
> +			sector_t sector;
> +			sector_t bit_sector = llbitmap->io_size >> SECTOR_SHIFT;
> +
> +			if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
> +				continue;
> +
> +			sector = mddev->bitmap_info.offset + rdev->sb_start +
> +				 (idx << PAGE_SECTORS_SHIFT) +
> +				 bit * bit_sector;
> +			md_write_metadata(mddev, rdev, sector,
> +					  llbitmap->io_size, page,
> +					  bit * llbitmap->io_size);
> +		}
> +	}
> +}
> +
> +static void active_release(struct percpu_ref *ref)
> +{
> +	struct llbitmap_page_ctl *pctl =
> +		container_of(ref, struct llbitmap_page_ctl, active);
> +
> +	wake_up(&pctl->wait);
> +}
> +
> +static void llbitmap_free_pages(struct llbitmap *llbitmap)
> +{
> +	int i;
> +
> +	if (!llbitmap->pctl)
> +		return;
> +
> +	for (i = 0; i < llbitmap->nr_pages; i++) {
> +		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
> +
> +		if (!pctl || !pctl->page)
> +			break;
> +
> +		__free_page(pctl->page);
> +		percpu_ref_exit(&pctl->active);
> +	}
> +
> +	kfree(llbitmap->pctl[0]);
> +	kfree(llbitmap->pctl);
> +	llbitmap->pctl = NULL;
> +}
> +
> +static int llbitmap_cache_pages(struct llbitmap *llbitmap)
> +{
> +	struct llbitmap_page_ctl *pctl;
> +	unsigned int nr_pages = DIV_ROUND_UP(llbitmap->chunks + BITMAP_SB_SIZE,
> +					     PAGE_SIZE);
> +	unsigned int size = struct_size(pctl, dirty,
> +					BITS_TO_LONGS(llbitmap->bits_per_page));
> +	int i;
> +
> +	llbitmap->pctl = kmalloc_array(nr_pages, sizeof(void *),
> +				       GFP_KERNEL | __GFP_ZERO);
> +	if (!llbitmap->pctl)
> +		return -ENOMEM;
> +
> +	size = round_up(size, cache_line_size());
> +	pctl = kmalloc_array(nr_pages, size, GFP_KERNEL | __GFP_ZERO);
> +	if (!pctl) {
> +		kfree(llbitmap->pctl);
> +		return -ENOMEM;
> +	}
> +
> +	llbitmap->nr_pages = nr_pages;
> +
> +	for (i = 0; i < nr_pages; i++, pctl = (void *)pctl + size) {
> +		struct page *page = llbitmap_read_page(llbitmap, i);
> +
> +		llbitmap->pctl[i] = pctl;
> +
> +		if (IS_ERR(page)) {
> +			llbitmap_free_pages(llbitmap);
> +			return PTR_ERR(page);
> +		}
> +
> +		if (percpu_ref_init(&pctl->active, active_release,
> +				    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
> +			__free_page(page);
> +			llbitmap_free_pages(llbitmap);
> +			return -ENOMEM;
> +		}
> +
> +		pctl->page = page;
> +		pctl->state = page_address(page);
> +		init_waitqueue_head(&pctl->wait);
> +	}
> +
> +	return 0;
> +}
> +
> +#endif /* CONFIG_MD_LLBITMAP */


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/23] md/md-llbitmap: implement bit state machine
  2025-05-24  6:13 ` [PATCH 16/23] md/md-llbitmap: implement bit state machine Yu Kuai
@ 2025-06-30  2:14   ` Xiao Ni
  2025-06-30  2:25     ` Yu Kuai
  0 siblings, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-06-30  2:14 UTC (permalink / raw)
  To: Yu Kuai, hch, colyli, song, yukuai3
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi


在 2025/5/24 下午2:13, Yu Kuai 写道:
> From: Yu Kuai <yukuai3@huawei.com>
>
> Each bit is one byte, contain 6 difference state, see llbitmap_state. And
> there are total 8 differenct actions, see llbitmap_action, can change
> state:
>
> llbitmap state machine: transitions between states
>
> |           | Startwrite | Startsync | Endsync | Abortsync| Reload   | Daemon | Discard   | Stale     |
> | --------- | ---------- | --------- | ------- | -------  | -------- | ------ | --------- | --------- |
> | Unwritten | Dirty      | x         | x       | x        | x        | x      | x         | x         |
> | Clean     | Dirty      | x         | x       | x        | x        | x      | Unwritten | NeedSync  |
> | Dirty     | x          | x         | x       | x        | NeedSync | Clean  | Unwritten | NeedSync  |
> | NeedSync  | x          | Syncing   | x       | x        | x        | x      | Unwritten | x         |
> | Syncing   | x          | Syncing   | Dirty   | NeedSync | NeedSync | x      | Unwritten | NeedSync  |
>
> Typical scenarios:
>
> 1) Create new array
> All bits will be set to Unwritten by default, if --assume-clean is set,
> All bits will be set to Clean instead.
>
> 2) write data, raid1/raid10 have full copy of data, while raid456 donen't and
> rely on xor data
>
> 2.1) write new data to raid1/raid10:
> Unwritten --StartWrite--> Dirty
>
> 2.2) write new data to raid456:
> Unwritten --StartWrite--> NeedSync
>
> Because the initial recover for raid456 is skipped, the xor data is not build
> yet, the bit must set to NeedSync first and after lazy initial recover is
> finished, the bit will finially set to Dirty(see 5.1 and 5.4);
>
> 2.3) cover write
> Clean --StartWrite--> Dirty
>
> 3) daemon, if the array is not degraded:
> Dirty --Daemon--> Clean
>
> For degraded array, the Dirty bit will never be cleared, prevent full disk
> recovery while readding a removed disk.
>
> 4) discard
> {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
>
> 5) resync and recover
>
> 5.1) common process
> NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
>
> 5.2) resync after power failure
> Dirty --Reload--> NeedSync
>
> 5.3) recover while replacing with a new disk
> By default, the old bitmap framework will recover all data, and llbitmap
> implement this by a new helper llbitmap_skip_sync_blocks:
>
> skip recover for bits other than dirty or clean;
>
> 5.4) lazy initial recover for raid5:
> By default, the old bitmap framework will only allow new recover when there
> are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER is add
> to perform raid456 lazy recover for set bits(from 2.2).
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   drivers/md/md-llbitmap.c | 83 ++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 83 insertions(+)
>
> diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
> index 1a01b6777527..f782f092ab5d 100644
> --- a/drivers/md/md-llbitmap.c
> +++ b/drivers/md/md-llbitmap.c
> @@ -568,4 +568,87 @@ static int llbitmap_cache_pages(struct llbitmap *llbitmap)
>   	return 0;
>   }
>   
> +static void llbitmap_init_state(struct llbitmap *llbitmap)
> +{
> +	enum llbitmap_state state = BitUnwritten;
> +	unsigned long i;
> +
> +	if (test_and_clear_bit(BITMAP_CLEAN, &llbitmap->flags))
> +		state = BitClean;
> +
> +	for (i = 0; i < llbitmap->chunks; i++)
> +		llbitmap_write(llbitmap, state, i);
> +}
> +
> +/* The return value is only used from resync, where @start == @end. */
> +static enum llbitmap_state llbitmap_state_machine(struct llbitmap *llbitmap,
> +						  unsigned long start,
> +						  unsigned long end,
> +						  enum llbitmap_action action)
> +{
> +	struct mddev *mddev = llbitmap->mddev;
> +	enum llbitmap_state state = BitNone;
> +	bool need_resync = false;
> +	bool need_recovery = false;
> +
> +	if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags))
> +		return BitNone;
> +
> +	if (action == BitmapActionInit) {
> +		llbitmap_init_state(llbitmap);
> +		return BitNone;
> +	}
> +
> +	while (start <= end) {
> +		enum llbitmap_state c = llbitmap_read(llbitmap, start);
> +
> +		if (c < 0 || c >= nr_llbitmap_state) {
> +			pr_err("%s: invalid bit %lu state %d action %d, forcing resync\n",
> +			       __func__, start, c, action);
> +			state = BitNeedSync;
> +			goto write_bitmap;
> +		}
> +
> +		if (c == BitNeedSync)
> +			need_resync = true;
> +
> +		state = state_machine[c][action];
> +		if (state == BitNone) {
> +			start++;
> +			continue;
> +		}

For reload action, it runs continue here.

And doesn't it need a lock when reading the state?

> +
> +write_bitmap:
> +		/* Delay raid456 initial recovery to first write. */
> +		if (c == BitUnwritten && state == BitDirty &&
> +		    action == BitmapActionStartwrite && raid_is_456(mddev)) {
> +			state = BitNeedSync;
> +			need_recovery = true;
> +		}
> +
> +		llbitmap_write(llbitmap, state, start);

Same question here, doesn't it need a lock when writing bitmap bits?

Regards

Xiao

> +
> +		if (state == BitNeedSync)
> +			need_resync = true;
> +		else if (state == BitDirty &&
> +			 !timer_pending(&llbitmap->pending_timer))
> +			mod_timer(&llbitmap->pending_timer,
> +				  jiffies + mddev->bitmap_info.daemon_sleep * HZ);
> +
> +		start++;
> +	}
> +
> +	if (need_recovery) {
> +		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> +		set_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery);
> +		md_wakeup_thread(mddev->thread);
> +	} else if (need_resync) {
> +		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> +		set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
> +		md_wakeup_thread(mddev->thread);
> +	}
> +
> +	return state;
> +}
> +
>   #endif /* CONFIG_MD_LLBITMAP */


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 15/23] md/md-llbitmap: implement llbitmap IO
  2025-06-30  2:07   ` Xiao Ni
@ 2025-06-30  2:17     ` Yu Kuai
  0 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-06-30  2:17 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai, hch, colyli, song
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

在 2025/06/30 10:07, Xiao Ni 写道:
> Can this for loop run? If one bit is dirty, it must set pctl->dirty. So 
> the second write comes, it finds pctl->dirty is set and 
> llbitmap_infect_dirty_bits function run and return. So it looks like it 
> will not run the for loop.

I don't quite understand what you mean ...

There are two cases:

1) write two bits int the same block twice, then test_and_set_bit() will
pass;
2) write one bit, set block dirty, after bitmap unplug and before
cleaning the bit, then test_and_set_bit() will fail, and the loop will
run.

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/23] md/md-llbitmap: implement bit state machine
  2025-06-30  2:14   ` Xiao Ni
@ 2025-06-30  2:25     ` Yu Kuai
  2025-06-30  8:25       ` Xiao Ni
  0 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-06-30  2:25 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai, hch, colyli, song
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

在 2025/06/30 10:14, Xiao Ni 写道:
> For reload action, it runs continue here.

No one can concurent with reload.

> 
> And doesn't it need a lock when reading the state?

Notice that from IO path, all concurrent context are doing the same
thing, it doesn't matter if old state or new state are read. If old
state is read, it will write new state in memory again; if new state is
read, it just do nothing.

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap
  2025-06-30  1:59 ` Xiao Ni
@ 2025-06-30  2:34   ` Yu Kuai
  2025-06-30  3:25     ` Xiao Ni
  0 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-06-30  2:34 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai, hch, colyli, song
  Cc: linux-doc, linux-kernel, linux-raid, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

在 2025/06/30 9:59, Xiao Ni 写道:
> 
> After reading other patches, I want to check if I understand right.
> 
> The first write sets the bitmap bit. The second write which hits the 
> same block (one sector, 512 bits) will call llbitmap_infect_dirty_bits 
> to set all other bits. Then the third write doesn't need to set bitmap 
> bits. If I'm right, the comments above should say only the first two 
> writes have additional overhead?

Yes, for the same bit, it's twice; For different bit in the same block,
it's third, by infect all bits in the block in the second.

  For Reload action, if the bitmap bit is
> NeedSync, the changed status will be x. It can't trigger resync/recovery.

This is not expected, see llbitmap_state_machine(), if old or new state
is need_sync, it will trigger a resync.

c = llbitmap_read(llbitmap, start);
if (c == BitNeedSync)
  need_resync = true;
-> for RELOAD case, need_resync is still set.

state = state_machine[c][action];
if (state == BitNone)
  continue
if (state == BitNeedSync)
  need_resync = true;

> 
> For example:
> 
> cat /sys/block/md127/md/llbitmap/bits
> unwritten 3480
> clean 2
> dirty 0
> need sync 510
> 
> It doesn't do resync after aseembling the array. Does it need to modify 
> the changed status from x to NeedSync?

Can you explain in detail how to reporduce this? Aseembling in my VM is
fine.

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap
  2025-06-30  2:34   ` Yu Kuai
@ 2025-06-30  3:25     ` Xiao Ni
  2025-06-30  3:46       ` Yu Kuai
  0 siblings, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-06-30  3:25 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

On Mon, Jun 30, 2025 at 10:34 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2025/06/30 9:59, Xiao Ni 写道:
> >
> > After reading other patches, I want to check if I understand right.
> >
> > The first write sets the bitmap bit. The second write which hits the
> > same block (one sector, 512 bits) will call llbitmap_infect_dirty_bits
> > to set all other bits. Then the third write doesn't need to set bitmap
> > bits. If I'm right, the comments above should say only the first two
> > writes have additional overhead?
>
> Yes, for the same bit, it's twice; For different bit in the same block,
> it's third, by infect all bits in the block in the second.

For different bits in the same block, test_and_set_bit(bit,
pctl->dirty) should be true too, right? So it infects other bits when
second write hits the same block too.

[946761.035079] llbitmap_set_page_dirty:390 page[0] offset 2024, block 3
[946761.035430] llbitmap_state_machine:646 delay raid456 initial recovery
[946761.035802] llbitmap_state_machine:652 bit 1001 state from 0 to 3
[946761.036498] llbitmap_set_page_dirty:390 page[0] offset 2025, block 3
[946761.036856] llbitmap_set_page_dirty:403 call llbitmap_infect_dirty_bits

As the debug logs show, different bits in the same block, the second
write (offset 2025) infects other bits.

>
>   For Reload action, if the bitmap bit is
> > NeedSync, the changed status will be x. It can't trigger resync/recovery.
>
> This is not expected, see llbitmap_state_machine(), if old or new state
> is need_sync, it will trigger a resync.
>
> c = llbitmap_read(llbitmap, start);
> if (c == BitNeedSync)
>   need_resync = true;
> -> for RELOAD case, need_resync is still set.
>
> state = state_machine[c][action];
> if (state == BitNone)
>   continue

If bitmap bit is BitNeedSync,
state_machine[BitNeedSync][BitmapActionReload] returns BitNone, so if
(state == BitNone) is true, it can't set MD_RECOVERY_NEEDED and it
can't start sync after assembling the array.

> if (state == BitNeedSync)
>   need_resync = true;
>
> >
> > For example:
> >
> > cat /sys/block/md127/md/llbitmap/bits
> > unwritten 3480
> > clean 2
> > dirty 0
> > need sync 510
> >
> > It doesn't do resync after aseembling the array. Does it need to modify
> > the changed status from x to NeedSync?
>
> Can you explain in detail how to reporduce this? Aseembling in my VM is
> fine.

I added many debug logs, so the sync request runs slowly. The test I do:
mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --bitmap=lockless -x 1 /dev/loop3
dd if=/dev/zero of=/dev/md0 bs=1M count=1 seek=500 oflag=direct
mdadm --stop /dev/md0 (the sync thread finishes the region that two
bitmap bits represent, so you can see llbitmap/bits has 510 bits (need
sync))
mdadm -As

Regards
Xiao
>
> Thanks,
> Kuai
>
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap
  2025-06-30  3:25     ` Xiao Ni
@ 2025-06-30  3:46       ` Yu Kuai
  2025-06-30  5:38         ` Xiao Ni
  0 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-06-30  3:46 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/06/30 11:25, Xiao Ni 写道:
> On Mon, Jun 30, 2025 at 10:34 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2025/06/30 9:59, Xiao Ni 写道:
>>>
>>> After reading other patches, I want to check if I understand right.
>>>
>>> The first write sets the bitmap bit. The second write which hits the
>>> same block (one sector, 512 bits) will call llbitmap_infect_dirty_bits
>>> to set all other bits. Then the third write doesn't need to set bitmap
>>> bits. If I'm right, the comments above should say only the first two
>>> writes have additional overhead?
>>
>> Yes, for the same bit, it's twice; For different bit in the same block,
>> it's third, by infect all bits in the block in the second.
> 
> For different bits in the same block, test_and_set_bit(bit,
> pctl->dirty) should be true too, right? So it infects other bits when
> second write hits the same block too.

The dirty will be cleared after bitmap_unplug.
> 
> [946761.035079] llbitmap_set_page_dirty:390 page[0] offset 2024, block 3
> [946761.035430] llbitmap_state_machine:646 delay raid456 initial recovery
> [946761.035802] llbitmap_state_machine:652 bit 1001 state from 0 to 3
> [946761.036498] llbitmap_set_page_dirty:390 page[0] offset 2025, block 3
> [946761.036856] llbitmap_set_page_dirty:403 call llbitmap_infect_dirty_bits
> 
> As the debug logs show, different bits in the same block, the second
> write (offset 2025) infects other bits.
> 
>>
>>    For Reload action, if the bitmap bit is
>>> NeedSync, the changed status will be x. It can't trigger resync/recovery.
>>
>> This is not expected, see llbitmap_state_machine(), if old or new state
>> is need_sync, it will trigger a resync.
>>
>> c = llbitmap_read(llbitmap, start);
>> if (c == BitNeedSync)
>>    need_resync = true;
>> -> for RELOAD case, need_resync is still set.
>>
>> state = state_machine[c][action];
>> if (state == BitNone)
>>    continue
> 
> If bitmap bit is BitNeedSync,
> state_machine[BitNeedSync][BitmapActionReload] returns BitNone, so if
> (state == BitNone) is true, it can't set MD_RECOVERY_NEEDED and it
> can't start sync after assembling the array.

You missed what I said above that llbitmap_read() will trigger resync as
well.
> 
>> if (state == BitNeedSync)
>>    need_resync = true;
>>
>>>
>>> For example:
>>>
>>> cat /sys/block/md127/md/llbitmap/bits
>>> unwritten 3480
>>> clean 2
>>> dirty 0
>>> need sync 510
>>>
>>> It doesn't do resync after aseembling the array. Does it need to modify
>>> the changed status from x to NeedSync?
>>
>> Can you explain in detail how to reporduce this? Aseembling in my VM is
>> fine.
> 
> I added many debug logs, so the sync request runs slowly. The test I do:
> mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --bitmap=lockless -x 1 /dev/loop3
> dd if=/dev/zero of=/dev/md0 bs=1M count=1 seek=500 oflag=direct
> mdadm --stop /dev/md0 (the sync thread finishes the region that two
> bitmap bits represent, so you can see llbitmap/bits has 510 bits (need
> sync))
> mdadm -As

I don't quite understand, in my case, mdadm -As works fine.
> 
> Regards
> Xiao
>>
>> Thanks,
>> Kuai
>>
>>
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap
  2025-06-30  3:46       ` Yu Kuai
@ 2025-06-30  5:38         ` Xiao Ni
  2025-06-30  6:09           ` Yu Kuai
  0 siblings, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-06-30  5:38 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

On Mon, Jun 30, 2025 at 11:46 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2025/06/30 11:25, Xiao Ni 写道:
> > On Mon, Jun 30, 2025 at 10:34 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> Hi,
> >>
> >> 在 2025/06/30 9:59, Xiao Ni 写道:
> >>>
> >>> After reading other patches, I want to check if I understand right.
> >>>
> >>> The first write sets the bitmap bit. The second write which hits the
> >>> same block (one sector, 512 bits) will call llbitmap_infect_dirty_bits
> >>> to set all other bits. Then the third write doesn't need to set bitmap
> >>> bits. If I'm right, the comments above should say only the first two
> >>> writes have additional overhead?
> >>
> >> Yes, for the same bit, it's twice; For different bit in the same block,
> >> it's third, by infect all bits in the block in the second.
> >
> > For different bits in the same block, test_and_set_bit(bit,
> > pctl->dirty) should be true too, right? So it infects other bits when
> > second write hits the same block too.
>
> The dirty will be cleared after bitmap_unplug.

I understand you now. The for loop in llbitmap_set_page_dirty is used
for new writes after unplug.
> >
> > [946761.035079] llbitmap_set_page_dirty:390 page[0] offset 2024, block 3
> > [946761.035430] llbitmap_state_machine:646 delay raid456 initial recovery
> > [946761.035802] llbitmap_state_machine:652 bit 1001 state from 0 to 3
> > [946761.036498] llbitmap_set_page_dirty:390 page[0] offset 2025, block 3
> > [946761.036856] llbitmap_set_page_dirty:403 call llbitmap_infect_dirty_bits
> >
> > As the debug logs show, different bits in the same block, the second
> > write (offset 2025) infects other bits.
> >
> >>
> >>    For Reload action, if the bitmap bit is
> >>> NeedSync, the changed status will be x. It can't trigger resync/recovery.
> >>
> >> This is not expected, see llbitmap_state_machine(), if old or new state
> >> is need_sync, it will trigger a resync.
> >>
> >> c = llbitmap_read(llbitmap, start);
> >> if (c == BitNeedSync)
> >>    need_resync = true;
> >> -> for RELOAD case, need_resync is still set.
> >>
> >> state = state_machine[c][action];
> >> if (state == BitNone)
> >>    continue
> >
> > If bitmap bit is BitNeedSync,
> > state_machine[BitNeedSync][BitmapActionReload] returns BitNone, so if
> > (state == BitNone) is true, it can't set MD_RECOVERY_NEEDED and it
> > can't start sync after assembling the array.
>
> You missed what I said above that llbitmap_read() will trigger resync as
> well.
> >
> >> if (state == BitNeedSync)
> >>    need_resync = true;
> >>
> >>>
> >>> For example:
> >>>
> >>> cat /sys/block/md127/md/llbitmap/bits
> >>> unwritten 3480
> >>> clean 2
> >>> dirty 0
> >>> need sync 510
> >>>
> >>> It doesn't do resync after aseembling the array. Does it need to modify
> >>> the changed status from x to NeedSync?
> >>
> >> Can you explain in detail how to reporduce this? Aseembling in my VM is
> >> fine.
> >
> > I added many debug logs, so the sync request runs slowly. The test I do:
> > mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --bitmap=lockless -x 1 /dev/loop3
> > dd if=/dev/zero of=/dev/md0 bs=1M count=1 seek=500 oflag=direct
> > mdadm --stop /dev/md0 (the sync thread finishes the region that two
> > bitmap bits represent, so you can see llbitmap/bits has 510 bits (need
> > sync))
> > mdadm -As
>
> I don't quite understand, in my case, mdadm -As works fine.

Sorry for this, I forgot I removed the codes in function llbitmap_state_machine
        //if (c == BitNeedSync)
        //  need_resync = true;
The reason I do this: I find if the status table changes like this, it
doesn't need to check the original status anymore
-               [BitmapActionReload]            = BitNone,
+               [BitmapActionReload]            = BitNeedSync,//?


Regards
Xiao

Xiao
> >
> > Regards
> > Xiao
> >>
> >> Thanks,
> >> Kuai
> >>
> >>
> >
> >
> > .
> >
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap
  2025-06-30  5:38         ` Xiao Ni
@ 2025-06-30  6:09           ` Yu Kuai
  0 siblings, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-06-30  6:09 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/06/30 13:38, Xiao Ni 写道:
>> I don't quite understand, in my case, mdadm -As works fine.
> Sorry for this, I forgot I removed the codes in function llbitmap_state_machine
>          //if (c == BitNeedSync)
>          //  need_resync = true;
Ok.

> The reason I do this: I find if the status table changes like this, it
> doesn't need to check the original status anymore
> -               [BitmapActionReload]            = BitNone,
> +               [BitmapActionReload]            = BitNeedSync,//?

However, we don't want do dirty the bitmap page in this case, as nothing
chagned in the bitmap. And because of this, we have to check the old
value anyway...

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/23] md/md-llbitmap: implement bit state machine
  2025-06-30  2:25     ` Yu Kuai
@ 2025-06-30  8:25       ` Xiao Ni
  2025-06-30 11:05         ` Yu Kuai
  0 siblings, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-06-30  8:25 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

On Mon, Jun 30, 2025 at 10:25 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2025/06/30 10:14, Xiao Ni 写道:
> > For reload action, it runs continue here.
>
> No one can concurent with reload.
>
> >
> > And doesn't it need a lock when reading the state?
>
> Notice that from IO path, all concurrent context are doing the same
> thing, it doesn't matter if old state or new state are read. If old
> state is read, it will write new state in memory again; if new state is
> read, it just do nothing.

Hi Kuai

This is the last place that I don't understand well. Is it the reason
that it only changes one byte at a time and the system can guarantee
the atomic when updating one byte?

If so, it only needs to concern the old and new data you mentioned
above. For example:
raid1 is created without --assume-clean, so all bits are BitUnwritten.
And a write bio comes, the bit changes to dirty. Then a discard is
submitted in another cpu context and it reads the old status
unwritten. From the status change table, the discard doesn't do
anything. In fact, discard should update dirty to unwritten. Can such
a case happen?

Regards
Xiao
>
> Thanks,
> Kuai
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/23] md/md-llbitmap: implement bit state machine
  2025-06-30  8:25       ` Xiao Ni
@ 2025-06-30 11:05         ` Yu Kuai
  2025-06-30 11:30           ` Yu Kuai
  2025-07-01  1:55           ` Xiao Ni
  0 siblings, 2 replies; 108+ messages in thread
From: Yu Kuai @ 2025-06-30 11:05 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/06/30 16:25, Xiao Ni 写道:
> On Mon, Jun 30, 2025 at 10:25 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2025/06/30 10:14, Xiao Ni 写道:
>>> For reload action, it runs continue here.
>>
>> No one can concurent with reload.
>>
>>>
>>> And doesn't it need a lock when reading the state?
>>
>> Notice that from IO path, all concurrent context are doing the same
>> thing, it doesn't matter if old state or new state are read. If old
>> state is read, it will write new state in memory again; if new state is
>> read, it just do nothing.
> 
> Hi Kuai
> 
> This is the last place that I don't understand well. Is it the reason
> that it only changes one byte at a time and the system can guarantee
> the atomic when updating one byte?
> 
> If so, it only needs to concern the old and new data you mentioned
> above. For example:
> raid1 is created without --assume-clean, so all bits are BitUnwritten.
> And a write bio comes, the bit changes to dirty. Then a discard is
> submitted in another cpu context and it reads the old status
> unwritten. From the status change table, the discard doesn't do
> anything. In fact, discard should update dirty to unwritten. Can such
> a case happen?

This can happen for raw disk, however, if there are filesystem, discard
and write can never race. And for raw disk, if user really issue write
and discard concurrently, the result really is uncertain, and it's fine
the bit result in dirty or unwritten.

Thanks,
Kuai

> 
> Regards
> Xiao
>>
>> Thanks,
>> Kuai
>>
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/23] md/md-llbitmap: implement bit state machine
  2025-06-30 11:05         ` Yu Kuai
@ 2025-06-30 11:30           ` Yu Kuai
  2025-07-01  1:55           ` Xiao Ni
  1 sibling, 0 replies; 108+ messages in thread
From: Yu Kuai @ 2025-06-30 11:30 UTC (permalink / raw)
  To: Yu Kuai, Xiao Ni
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

在 2025/06/30 19:05, Yu Kuai 写道:
> Is it the reason
> that it only changes one byte at a time and the system can guarantee
> the atomic when updating one byte?

I think it's not atomic, I don't use atomic API here, because all mormal
write are always changing the byte to the same value, if old value is
read by concurrent writes, then this byte will be written multiple
times, and I don't see any problems this way, just pure memory
operation.

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/23] md/md-llbitmap: implement bit state machine
  2025-06-30 11:05         ` Yu Kuai
  2025-06-30 11:30           ` Yu Kuai
@ 2025-07-01  1:55           ` Xiao Ni
  2025-07-01  2:02             ` Yu Kuai
  1 sibling, 1 reply; 108+ messages in thread
From: Xiao Ni @ 2025-07-01  1:55 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

On Mon, Jun 30, 2025 at 7:05 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2025/06/30 16:25, Xiao Ni 写道:
> > On Mon, Jun 30, 2025 at 10:25 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> Hi,
> >>
> >> 在 2025/06/30 10:14, Xiao Ni 写道:
> >>> For reload action, it runs continue here.
> >>
> >> No one can concurent with reload.
> >>
> >>>
> >>> And doesn't it need a lock when reading the state?
> >>
> >> Notice that from IO path, all concurrent context are doing the same
> >> thing, it doesn't matter if old state or new state are read. If old
> >> state is read, it will write new state in memory again; if new state is
> >> read, it just do nothing.
> >
> > Hi Kuai
> >
> > This is the last place that I don't understand well. Is it the reason
> > that it only changes one byte at a time and the system can guarantee
> > the atomic when updating one byte?
> >
> > If so, it only needs to concern the old and new data you mentioned
> > above. For example:
> > raid1 is created without --assume-clean, so all bits are BitUnwritten.
> > And a write bio comes, the bit changes to dirty. Then a discard is
> > submitted in another cpu context and it reads the old status
> > unwritten. From the status change table, the discard doesn't do
> > anything. In fact, discard should update dirty to unwritten. Can such
> > a case happen?
>
> This can happen for raw disk, however, if there are filesystem, discard
> and write can never race. And for raw disk, if user really issue write
> and discard concurrently, the result really is uncertain, and it's fine
> the bit result in dirty or unwritten.

Hi Kuai

If there is a filesystem and the write io returns. The discard must
see the memory changes without any memory barrier apis?

Regards
Xiao
>
> Thanks,
> Kuai
>
> >
> > Regards
> > Xiao
> >>
> >> Thanks,
> >> Kuai
> >>
> >
> >
> > .
> >
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/23] md/md-llbitmap: implement bit state machine
  2025-07-01  1:55           ` Xiao Ni
@ 2025-07-01  2:02             ` Yu Kuai
  2025-07-01  2:31               ` Xiao Ni
  0 siblings, 1 reply; 108+ messages in thread
From: Yu Kuai @ 2025-07-01  2:02 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/07/01 9:55, Xiao Ni 写道:
> If there is a filesystem and the write io returns. The discard must
> see the memory changes without any memory barrier apis?

It's the filesystem itself should manage free blocks, and gurantee
discard can only be issued to free blocks that is not used at all.

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/23] md/md-llbitmap: implement bit state machine
  2025-07-01  2:02             ` Yu Kuai
@ 2025-07-01  2:31               ` Xiao Ni
  0 siblings, 0 replies; 108+ messages in thread
From: Xiao Ni @ 2025-07-01  2:31 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, colyli, song, linux-doc, linux-kernel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

On Tue, Jul 1, 2025 at 10:03 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2025/07/01 9:55, Xiao Ni 写道:
> > If there is a filesystem and the write io returns. The discard must
> > see the memory changes without any memory barrier apis?
>
> It's the filesystem itself should manage free blocks, and gurantee
> discard can only be issued to free blocks that is not used at all.

Hi Kuai

Thanks for all the explanations and your patience.

Regards
Xiao
>
> Thanks,
> Kuai
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

end of thread, other threads:[~2025-07-01  2:32 UTC | newest]

Thread overview: 108+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-24  6:12 [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
2025-05-24  6:12 ` [PATCH 01/23] md: add a new parameter 'offset' to md_super_write() Yu Kuai
2025-05-25 15:50   ` Xiao Ni
2025-05-26  6:28   ` Christoph Hellwig
2025-05-26  7:28     ` Yu Kuai
2025-05-27  5:54   ` Hannes Reinecke
2025-05-24  6:12 ` [PATCH 02/23] md: factor out a helper raid_is_456() Yu Kuai
2025-05-25 15:50   ` Xiao Ni
2025-05-26  6:28   ` Christoph Hellwig
2025-05-27  5:55   ` Hannes Reinecke
2025-05-24  6:13 ` [PATCH 03/23] md/md-bitmap: cleanup bitmap_ops->startwrite() Yu Kuai
2025-05-25 15:51   ` Xiao Ni
2025-05-26  6:29   ` Christoph Hellwig
2025-05-27  5:56   ` Hannes Reinecke
2025-05-24  6:13 ` [PATCH 04/23] md/md-bitmap: support discard for bitmap ops Yu Kuai
2025-05-25 15:53   ` Xiao Ni
2025-05-26  6:29   ` Christoph Hellwig
2025-05-27  6:01   ` Hannes Reinecke
2025-05-28  7:04   ` Glass Su
2025-05-24  6:13 ` [PATCH 05/23] md/md-bitmap: remove parameter slot from bitmap_create() Yu Kuai
2025-05-25 16:09   ` Xiao Ni
2025-05-26  6:30   ` Christoph Hellwig
2025-05-27  6:01   ` Hannes Reinecke
2025-05-24  6:13 ` [PATCH 06/23] md/md-bitmap: add a new sysfs api bitmap_type Yu Kuai
2025-05-25 16:32   ` Xiao Ni
2025-05-26  1:13     ` Yu Kuai
2025-05-26  5:11       ` Xiao Ni
2025-05-26  8:02         ` Yu Kuai
2025-05-26  6:32   ` Christoph Hellwig
2025-05-26  7:45     ` Yu Kuai
2025-05-27  8:21       ` Christoph Hellwig
2025-05-27  6:10   ` Hannes Reinecke
2025-05-27  7:43     ` Yu Kuai
2025-05-24  6:13 ` [PATCH 07/23] md/md-bitmap: delay registration of bitmap_ops until creating bitmap Yu Kuai
2025-05-26  6:32   ` Christoph Hellwig
2025-05-26  6:52   ` Xiao Ni
2025-05-26  7:57     ` Yu Kuai
2025-05-27  2:15       ` Xiao Ni
2025-05-27  2:49         ` Yu Kuai
2025-05-27  6:13   ` Hannes Reinecke
2025-05-27  7:53     ` Yu Kuai
2025-05-27  8:54       ` Hannes Reinecke
2025-05-24  6:13 ` [PATCH 08/23] md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations Yu Kuai
2025-05-26  7:03   ` Xiao Ni
2025-05-27  6:14   ` Hannes Reinecke
2025-05-24  6:13 ` [PATCH 09/23] md/md-bitmap: add a new method blocks_synced() " Yu Kuai
2025-05-27  2:35   ` Xiao Ni
2025-05-27  2:48     ` Yu Kuai
2025-05-27  6:16   ` Hannes Reinecke
2025-05-24  6:13 ` [PATCH 10/23] md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER Yu Kuai
2025-05-27  6:17   ` Hannes Reinecke
2025-05-27  8:00     ` Yu Kuai
2025-05-24  6:13 ` [PATCH 11/23] md/md-bitmap: make method bitmap_ops->daemon_work optional Yu Kuai
2025-05-26  6:34   ` Christoph Hellwig
2025-05-27  6:19   ` Hannes Reinecke
2025-05-27  8:03     ` Yu Kuai
2025-05-27  8:55       ` Hannes Reinecke
2025-05-24  6:13 ` [PATCH 12/23] md/md-bitmap: add macros for lockless bitmap Yu Kuai
2025-05-26  6:40   ` Christoph Hellwig
2025-05-26  8:12     ` Yu Kuai
2025-05-27  8:22       ` Christoph Hellwig
2025-05-27  6:21   ` Hannes Reinecke
2025-05-28  4:53   ` Xiao Ni
2025-05-24  6:13 ` [PATCH 13/23] md/md-bitmap: fix dm-raid max_write_behind setting Yu Kuai
2025-05-26  6:40   ` Christoph Hellwig
2025-05-27  6:21   ` Hannes Reinecke
2025-05-24  6:13 ` [PATCH 14/23] md/dm-raid: remove max_write_behind setting limit Yu Kuai
2025-05-26  6:41   ` Christoph Hellwig
2025-05-27  6:26   ` Hannes Reinecke
2025-05-28  4:58   ` Xiao Ni
2025-05-24  6:13 ` [PATCH 15/23] md/md-llbitmap: implement llbitmap IO Yu Kuai
2025-05-27  8:27   ` Christoph Hellwig
2025-05-27  8:55     ` Yu Kuai
2025-05-27  8:58       ` Yu Kuai
2025-06-06  3:21   ` Xiao Ni
2025-06-06  3:48     ` Yu Kuai
2025-06-06  6:24       ` Xiao Ni
2025-06-06  8:56         ` Yu Kuai
2025-06-30  2:07   ` Xiao Ni
2025-06-30  2:17     ` Yu Kuai
2025-05-24  6:13 ` [PATCH 16/23] md/md-llbitmap: implement bit state machine Yu Kuai
2025-06-30  2:14   ` Xiao Ni
2025-06-30  2:25     ` Yu Kuai
2025-06-30  8:25       ` Xiao Ni
2025-06-30 11:05         ` Yu Kuai
2025-06-30 11:30           ` Yu Kuai
2025-07-01  1:55           ` Xiao Ni
2025-07-01  2:02             ` Yu Kuai
2025-07-01  2:31               ` Xiao Ni
2025-05-24  6:13 ` [PATCH 17/23] md/md-llbitmap: implement APIs for page level dirty bits synchronization Yu Kuai
2025-05-24  6:13 ` [PATCH 18/23] md/md-llbitmap: implement APIs to mange bitmap lifetime Yu Kuai
2025-05-29  7:03   ` Xiao Ni
2025-05-29  9:03     ` Yu Kuai
2025-05-24  6:13 ` [PATCH 19/23] md/md-llbitmap: implement APIs to dirty bits and clear bits Yu Kuai
2025-05-24  6:13 ` [PATCH 20/23] md/md-llbitmap: implement APIs for sync_thread Yu Kuai
2025-05-24  6:13 ` [PATCH 21/23] md/md-llbitmap: implement all bitmap operations Yu Kuai
2025-05-24  6:13 ` [PATCH 22/23] md/md-llbitmap: implement sysfs APIs Yu Kuai
2025-05-24  6:13 ` [PATCH 23/23] md/md-llbitmap: add Kconfig Yu Kuai
2025-05-27  8:29   ` Christoph Hellwig
2025-05-27  9:00     ` Yu Kuai
2025-05-24  7:07 ` [PATCH 00/23] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
2025-05-30  6:45 ` Yu Kuai
2025-06-30  1:59 ` Xiao Ni
2025-06-30  2:34   ` Yu Kuai
2025-06-30  3:25     ` Xiao Ni
2025-06-30  3:46       ` Yu Kuai
2025-06-30  5:38         ` Xiao Ni
2025-06-30  6:09           ` Yu Kuai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).