* [PATCH v6 md-6.18 00/11] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap
@ 2025-08-26  8:51 Yu Kuai
  2025-08-26  8:51 ` [PATCH v6 md-6.18 01/11] md: add a new parameter 'offset' to md_super_write() Yu Kuai
                   ` (10 more replies)
  0 siblings, 11 replies; 19+ messages in thread
From: Yu Kuai @ 2025-08-26  8:51 UTC (permalink / raw)
  To: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3, yukuai1,
	yi.zhang, yangerkun, johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Changes from v5:
 - fix wrong place to check if blocks are synced in patch 8; (Xiao)
 - raid5 uses recovery to build the initial xor data; fix an unexpected
 resync being triggered, patch 9 and patch 11; (found by my own testing)
 - flush the bitmap after it is initialized for the first time, patch 11; (Xiao)
 - some coding style fixes, patch 11; (Xiao)
Changes from v4:
 - fix dm-raid regression in patch 6: skip creating bitmap attributes
 when the mddev gendisk is NULL;
 - some minor cleanups and error handling fixes in patch 11;
 - add review tag:
  - patch 1-10 from Li Nan
  - patch 4,5,6,8 from Xiao
  - patch 6 from Hannes
Changes from v3:
 - fix redundant setting of mddev->bitmap_id in patch 6;
 - add explanation of bitmap attributes in Documentation;
 - add llbitmap/barrier_idle in patch 11;
 - add some comments in patch 11;
Changes from v2:
 - add comments about KOBJECT_CHANGE uevent in patch 6;
 - convert llbitmap_suspend() to llbitmap_suspend_timeout() in patch 11;
 - add some comments in patch 11;
 - add review tag:
  - patch 3,4,5,9 from Hannes
Changes from v1:
 - explain md_bitmap_fn in commit message, patch 3;
 - handle the case CONFIG_MD_BITMAP is disabled, patch 4;
 - split patch 7 from v1 into patch 5 + 6;
 - rewrite bitmap_type_store, patch 5;
 - fix dm-raid regression: md-bitmap sysfs entries should not be
 created under the mddev kobject, patch 6;
 - merge llbitmap patches into one patch, with lots of cleanups;
 - add review tag:
  - patch 1,2,7,8,9,10 from Christoph
  - patch 1,2,7,8,10 from Hannes
  - patch 1,2,3,7 from Xiao

v4: https://lore.kernel.org/all/20250721171557.34587-1-yukuai@kernel.org/
v3: https://lore.kernel.org/linux-raid/20250718092336.3346644-1-yukuai1@huaweicloud.com/
v2: https://lore.kernel.org/all/20250707165202.11073-12-yukuai@kernel.org/
v1: https://lore.kernel.org/all/20250524061320.370630-1-yukuai1@huaweicloud.com/
RFC: https://lore.kernel.org/all/20250512011927.2809400-1-yukuai1@huaweicloud.com/

#### Background

Redundant data is used to enhance data fault tolerance, and the storage
method for redundant data varies depending on the RAID level. It is
important to maintain the consistency of redundant data.

A bitmap is used to record which data blocks have been synchronized and which
ones need to be resynchronized or recovered. Each bit in the bitmap
represents a segment of data in the array. When a bit is set, it indicates
that the multiple redundant copies of that data segment may not be
consistent. Data synchronization can be performed based on the bitmap after
a power failure or after re-adding a disk. If there is no bitmap, a full
disk synchronization is required.

#### Key Features

 - The IO fastpath is lockless; if the user issues lots of write IO to the
 same bitmap bit in a short time, only the first write has the additional
 overhead of updating the bitmap bit, and there is no additional overhead
 for the following writes;
 - Support resyncing or recovering only written data, meaning that when
 creating a new array or replacing a disk with a new one, there is no need
 to do a full disk resync/recovery;

#### Key Concepts

##### State Machine

Each bit is one byte and encodes one of 6 different states, see
llbitmap_state. There are 8 different actions in total, see llbitmap_action,
that can change the state:

llbitmap state machine: transitions between states

|           | Startwrite | Startsync | Endsync | Abortsync|
| --------- | ---------- | --------- | ------- | -------  |
| Unwritten | Dirty      | x         | x       | x        |
| Clean     | Dirty      | x         | x       | x        |
| Dirty     | x          | x         | x       | x        |
| NeedSync  | x          | Syncing   | x       | x        |
| Syncing   | x          | Syncing   | Dirty   | NeedSync |

|           | Reload   | Daemon | Discard   | Stale     |
| --------- | -------- | ------ | --------- | --------- |
| Unwritten | x        | x      | x         | x         |
| Clean     | x        | x      | Unwritten | NeedSync  |
| Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
| NeedSync  | x        | x      | Unwritten | x         |
| Syncing   | NeedSync | x      | Unwritten | NeedSync  |
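
As a rough sketch, the two tables above can be encoded as a lookup table
indexed by (state, action). The member names below follow the table headers
and are purely illustrative (the tables show five states; a sixth "none"
value is used here for the disallowed 'x' entries); the authoritative
definitions are llbitmap_state and llbitmap_action in md-llbitmap.c:

/* illustrative sketch only, not the actual implementation */
enum llbitmap_state {
	BitUnwritten = 0,
	BitClean,
	BitDirty,
	BitNeedSync,
	BitSyncing,
	BitStateCount,
	BitNone = 0xff,		/* 'x' in the tables: transition not allowed */
};

enum llbitmap_action {
	ActionStartwrite = 0,
	ActionStartsync,
	ActionEndsync,
	ActionAbortsync,
	ActionReload,
	ActionDaemon,
	ActionDiscard,
	ActionStale,
	ActionCount,
};

static const enum llbitmap_state
state_machine[BitStateCount][ActionCount] = {
	/* Startwrite, Startsync, Endsync, Abortsync, Reload, Daemon, Discard, Stale */
	[BitUnwritten] = { BitDirty, BitNone, BitNone, BitNone,
			   BitNone, BitNone, BitNone, BitNone },
	[BitClean]     = { BitDirty, BitNone, BitNone, BitNone,
			   BitNone, BitNone, BitUnwritten, BitNeedSync },
	[BitDirty]     = { BitNone, BitNone, BitNone, BitNone,
			   BitNeedSync, BitClean, BitUnwritten, BitNeedSync },
	[BitNeedSync]  = { BitNone, BitSyncing, BitNone, BitNone,
			   BitNone, BitNone, BitUnwritten, BitNone },
	[BitSyncing]   = { BitNone, BitSyncing, BitDirty, BitNeedSync,
			   BitNeedSync, BitNone, BitUnwritten, BitNeedSync },
};

/* returns the new state, or BitNone if the action is not allowed */
static inline enum llbitmap_state
llbitmap_transition(enum llbitmap_state state, enum llbitmap_action action)
{
	return state_machine[state][action];
}

For example, an overwrite (scenario 2.3 below) is
llbitmap_transition(BitClean, ActionStartwrite), which yields BitDirty.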

Typical scenarios:

1) Create a new array
All bits are set to Unwritten by default; if --assume-clean is set, all
bits are set to Clean instead.

2) write data; raid1/raid10 have a full copy of the data, while raid456
doesn't and relies on xor data

2.1) write new data to raid1/raid10:
Unwritten --StartWrite--> Dirty

2.2) write new data to raid456:
Unwritten --StartWrite--> NeedSync

Because the initial recovery for raid456 is skipped, the xor data is not
built yet; the bit must be set to NeedSync first, and after the lazy initial
recovery is finished, the bit will finally be set to Dirty (see 5.1 and 5.4);

2.3) overwrite
Clean --StartWrite--> Dirty

3) daemon, if the array is not degraded:
Dirty --Daemon--> Clean

For a degraded array, the Dirty bit will never be cleared, preventing a full
disk recovery when re-adding a removed disk.

4) discard
{Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten

5) resync and recover

5.1) common process
NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean

5.2) resync after power failure
Dirty --Reload--> NeedSync

5.3) recover while replacing a disk with a new one
By default, the old bitmap framework will recover all data; llbitmap
implements this with a new helper, see llbitmap_skip_sync_blocks:

recovery is skipped for bits other than Dirty or Clean;

5.4) lazy initial recovery for raid5:
By default, the old bitmap framework only allows a new recovery when there
are spares (new disks); a new recovery flag MD_RECOVERY_LAZY_RECOVER is added
to perform raid456 lazy recovery for set bits (from 2.2).

##### Bitmap IO

##### Chunksize

The default bitmap size is 128k, including a 1k bitmap super block, and the
default size of the segment of data covered by each bit (the chunksize) is
64k. The chunksize is doubled each time the total number of bits would
otherwise be not less than 127k (see llbitmap_init).
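
A minimal user-space sketch of that sizing rule, assuming the usable payload
is the 128k bitmap minus the 1k super block (the helper name is
hypothetical; the real logic lives in llbitmap_init):

#define LLBITMAP_MAX_BITS	(127 * 1024)	/* 128k total - 1k super, one byte per bit */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* hypothetical helper: pick the smallest chunksize whose bit count fits */
static unsigned long long llbitmap_calc_chunksize(unsigned long long data_bytes)
{
	unsigned long long chunksize = 64 * 1024;	/* default: 64k per bit */

	/* double the chunksize while the bit count is not less than 127k */
	while (DIV_ROUND_UP(data_bytes, chunksize) >= LLBITMAP_MAX_BITS)
		chunksize <<= 1;

	return chunksize;
}

With these defaults, arrays up to roughly 127k * 64k (about 8G) are covered
at 64k per bit; larger arrays keep doubling the chunksize until the bit
count fits.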

##### READ

When the bitmap is created, all pages are allocated and read for llbitmap;
there won't be any reads afterwards.

##### WRITE

WRITE IO is divided by the logical_block_size of the array, and the dirty
state of each block is tracked independently. For example:

each page is 4k and contains 8 blocks; each block is 512 bytes and contains
512 bits;

| page0 | page1 | ... | page 31 |
|       |
|        \-----------------------\
|                                |
| block0 | block1 | ... | block7 |
|        |
|         \-----------------\
|                            |
| bit0 | bit1 | ... | bit511 |

From the IO path, if one bit is changed to Dirty or NeedSync, the
corresponding subpage is marked dirty, and such a block must be written
first, before the IO is issued. This behaviour affects IO performance; to
reduce the impact, if multiple bits are changed in the same block in a short
time, all bits in this block are changed to Dirty/NeedSync, so that there
won't be any further overhead until the daemon clears the dirty bits.
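
As a sketch of the mapping above (names hypothetical), with one byte per
bit, a 4k page holds 4096 bits and a 512-byte block holds 512 bits, so
locating the block that must be flushed for a given bit is simple
arithmetic:

#define BITMAP_PAGE_SIZE	4096	/* bits (bytes) per bitmap page */
#define BITMAP_BLOCK_SIZE	512	/* logical_block_size of the array */

struct bit_location {
	unsigned long page;	/* index of the bitmap page */
	unsigned long block;	/* index of the block inside that page */
};

/* hypothetical helper: find the subpage block that covers @bit */
static struct bit_location llbitmap_locate(unsigned long bit)
{
	struct bit_location loc = {
		.page	= bit / BITMAP_PAGE_SIZE,
		.block	= (bit % BITMAP_PAGE_SIZE) / BITMAP_BLOCK_SIZE,
	};

	return loc;
}

Only that 512-byte block, rather than the whole 4k page, has to be written
out before the data IO is issued.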

##### Dirty Bits synchronization

The IO fast path will set bits to dirty, and those dirty bits will be
cleared by the daemon after the IO is done. llbitmap_page_ctl is used to
synchronize between the IO path and the daemon;

IO path:
 1) try to grab a reference; if that succeeds, set the expire time to 5s
 later and return;
 2) if grabbing a reference fails, wait for the daemon to finish clearing
 the dirty bits;

Daemon (woken up every daemon_sleep seconds):
For each page:
 1) check if the page has expired; if not, skip this page. For an expired page:
 2) suspend the page and wait for inflight write IO to be done;
 3) change the dirty page to clean;
 4) resume the page;
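
A minimal user-space sketch of that handshake using C11 atomics (types and
names are hypothetical; the in-kernel synchronization is llbitmap_page_ctl
and differs in detail):

#include <stdatomic.h>
#include <stdbool.h>

#define HZ	1000	/* assumption: millisecond ticks for this sketch */

struct page_ctl {
	atomic_int refs;	/* inflight writers; -1 while suspended */
	atomic_ulong expire;	/* earliest time the daemon may clean the page */
};

/* IO path, steps 1-2: grab a reference and push the expiry 5s out */
static bool io_path_grab(struct page_ctl *ctl, unsigned long now)
{
	int old = atomic_load(&ctl->refs);

	while (old >= 0) {
		if (atomic_compare_exchange_weak(&ctl->refs, &old, old + 1)) {
			atomic_store(&ctl->expire, now + 5 * HZ);
			return true;	/* proceed with the write */
		}
	}

	return false;	/* suspended: wait for the daemon to resume the page */
}

/* end of write IO: drop the reference grabbed above */
static void io_path_put(struct page_ctl *ctl)
{
	atomic_fetch_sub(&ctl->refs, 1);
}

/* daemon, steps 1-4: run against each page every daemon_sleep interval */
static void daemon_handle_page(struct page_ctl *ctl, unsigned long now)
{
	int idle = 0;

	if (now < atomic_load(&ctl->expire))
		return;		/* 1) not expired yet: skip this page */

	/* 2) suspend; only succeeds once inflight writers have drained */
	if (!atomic_compare_exchange_strong(&ctl->refs, &idle, -1))
		return;

	/* 3) rewrite the page's Dirty bits as Clean (omitted here) */

	atomic_store(&ctl->refs, 0);	/* 4) resume the page */
}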

Performance Test:
A simple fio randwrite test on arrays built from 20GB ramdisks in my VM:

|                      | none      | bitmap    | llbitmap  |
| -------------------- | --------- | --------- | --------- |
| raid1                | 13.7MiB/s | 9696KiB/s | 19.5MiB/s |
| raid1(assume clean)  | 19.5MiB/s | 11.9MiB/s | 19.5MiB/s |
| raid10               | 21.9MiB/s | 11.6MiB/s | 27.8MiB/s |
| raid10(assume clean) | 27.8MiB/s | 15.4MiB/s | 27.8MiB/s |
| raid5                | 14.0MiB/s | 11.6MiB/s | 12.9MiB/s |
| raid5(assume clean)  | 17.8MiB/s | 13.4MiB/s | 13.9MiB/s |

For raid1/raid10, llbitmap can be better than no bitmap when a background
initial resync is running, and it is the same as no bitmap without one.

Note that the llbitmap performance improvement for raid5 is not obvious;
this is because raid5 has many other performance bottlenecks. perf results
still show that the bitmap overhead is much lower.

Yu Kuai (11):
  md: add a new parameter 'offset' to md_super_write()
  md: factor out a helper raid_is_456()
  md/md-bitmap: support discard for bitmap ops
  md: add a new mddev field 'bitmap_id'
  md/md-bitmap: add a new sysfs api bitmap_type
  md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
  md/md-bitmap: add a new method blocks_synced() in bitmap_operations
  md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
  md/md-bitmap: make method bitmap_ops->daemon_work optional
  md/md-llbitmap: introduce new lockless bitmap

 Documentation/admin-guide/md.rst |   86 +-
 drivers/md/Kconfig               |   11 +
 drivers/md/Makefile              |    1 +
 drivers/md/md-bitmap.c           |   15 +-
 drivers/md/md-bitmap.h           |   45 +-
 drivers/md/md-llbitmap.c         | 1600 ++++++++++++++++++++++++++++++
 drivers/md/md.c                  |  332 +++++--
 drivers/md/md.h                  |   20 +-
 drivers/md/raid5.c               |   34 +-
 9 files changed, 2021 insertions(+), 123 deletions(-)
 create mode 100644 drivers/md/md-llbitmap.c

-- 
2.39.2



* [PATCH v6 md-6.18 01/11] md: add a new parameter 'offset' to md_super_write()
  2025-08-26  8:51 [PATCH v6 md-6.18 00/11] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
@ 2025-08-26  8:51 ` Yu Kuai
  2025-08-26  8:51 ` [PATCH v6 md-6.18 02/11] md: factor out a helper raid_is_456() Yu Kuai
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-08-26  8:51 UTC (permalink / raw)
  To: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3, yukuai1,
	yi.zhang, yangerkun, johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

The parameter is always set to 0 for now; following patches will use
this helper to write the llbitmap to underlying disks, allowing writing
of dirty sectors instead of the whole page.

Also rename md_super_write() to md_write_metadata(), since there is nothing
superblock-specific about it.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md-bitmap.c |  3 ++-
 drivers/md/md.c        | 52 +++++++++++++++++++++++++-----------------
 drivers/md/md.h        |  5 ++--
 3 files changed, 36 insertions(+), 24 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 5f62f2fd8f3f..b157119de123 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -485,7 +485,8 @@ static int __write_sb_page(struct md_rdev *rdev, struct bitmap *bitmap,
 			return -EINVAL;
 	}
 
-	md_super_write(mddev, rdev, sboff + ps, (int)min(size, bitmap_limit), page);
+	md_write_metadata(mddev, rdev, sboff + ps, (int)min(size, bitmap_limit),
+			  page, 0);
 	return 0;
 }
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 61a659820779..74f876497c09 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1038,15 +1038,26 @@ static void super_written(struct bio *bio)
 		wake_up(&mddev->sb_wait);
 }
 
-void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
-		   sector_t sector, int size, struct page *page)
-{
-	/* write first size bytes of page to sector of rdev
-	 * Increment mddev->pending_writes before returning
-	 * and decrement it on completion, waking up sb_wait
-	 * if zero is reached.
-	 * If an error occurred, call md_error
-	 */
+/**
+ * md_write_metadata - write metadata to underlying disk, including
+ * array superblock, badblocks, bitmap superblock and bitmap bits.
+ * @mddev:	the array to write
+ * @rdev:	the underlying disk to write
+ * @sector:	the offset to @rdev
+ * @size:	the length of the metadata
+ * @page:	the metadata
+ * @offset:	the offset to @page
+ *
+ * Write @size bytes of @page, starting from @offset, to @sector of @rdev.
+ * Increment mddev->pending_writes before returning, and decrement it on
+ * completion, waking up sb_wait. Caller must call md_super_wait() after
+ * issuing IO to all rdevs. If an error occurs, md_error() will be called,
+ * and the @rdev will be kicked out from @mddev.
+ */
+void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
+		       sector_t sector, int size, struct page *page,
+		       unsigned int offset)
+{
 	struct bio *bio;
 
 	if (!page)
@@ -1064,7 +1075,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
 	atomic_inc(&rdev->nr_pending);
 
 	bio->bi_iter.bi_sector = sector;
-	__bio_add_page(bio, page, size, 0);
+	__bio_add_page(bio, page, size, offset);
 	bio->bi_private = rdev;
 	bio->bi_end_io = super_written;
 
@@ -1674,8 +1685,8 @@ super_90_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
 	if ((u64)num_sectors >= (2ULL << 32) && rdev->mddev->level >= 1)
 		num_sectors = (sector_t)(2ULL << 32) - 2;
 	do {
-		md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
-		       rdev->sb_page);
+		md_write_metadata(rdev->mddev, rdev, rdev->sb_start,
+				  rdev->sb_size, rdev->sb_page, 0);
 	} while (md_super_wait(rdev->mddev) < 0);
 	return num_sectors;
 }
@@ -2323,8 +2334,8 @@ super_1_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
 	sb->super_offset = cpu_to_le64(rdev->sb_start);
 	sb->sb_csum = calc_sb_1_csum(sb);
 	do {
-		md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
-			       rdev->sb_page);
+		md_write_metadata(rdev->mddev, rdev, rdev->sb_start,
+				  rdev->sb_size, rdev->sb_page, 0);
 	} while (md_super_wait(rdev->mddev) < 0);
 	return num_sectors;
 
@@ -2833,18 +2844,17 @@ void md_update_sb(struct mddev *mddev, int force_change)
 			continue; /* no noise on spare devices */
 
 		if (!test_bit(Faulty, &rdev->flags)) {
-			md_super_write(mddev,rdev,
-				       rdev->sb_start, rdev->sb_size,
-				       rdev->sb_page);
+			md_write_metadata(mddev, rdev, rdev->sb_start,
+					  rdev->sb_size, rdev->sb_page, 0);
 			pr_debug("md: (write) %pg's sb offset: %llu\n",
 				 rdev->bdev,
 				 (unsigned long long)rdev->sb_start);
 			rdev->sb_events = mddev->events;
 			if (rdev->badblocks.size) {
-				md_super_write(mddev, rdev,
-					       rdev->badblocks.sector,
-					       rdev->badblocks.size << 9,
-					       rdev->bb_page);
+				md_write_metadata(mddev, rdev,
+						  rdev->badblocks.sector,
+						  rdev->badblocks.size << 9,
+						  rdev->bb_page, 0);
 				rdev->badblocks.size = 0;
 			}
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 081152c8de1f..cadd9bc99938 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -908,8 +908,9 @@ void md_account_bio(struct mddev *mddev, struct bio **bio);
 void md_free_cloned_bio(struct bio *bio);
 
 extern bool __must_check md_flush_request(struct mddev *mddev, struct bio *bio);
-extern void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
-			   sector_t sector, int size, struct page *page);
+void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
+		       sector_t sector, int size, struct page *page,
+		       unsigned int offset);
 extern int md_super_wait(struct mddev *mddev);
 extern int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
 		struct page *page, blk_opf_t opf, bool metadata_op);
-- 
2.39.2



* [PATCH v6 md-6.18 02/11] md: factor out a helper raid_is_456()
  2025-08-26  8:51 [PATCH v6 md-6.18 00/11] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
  2025-08-26  8:51 ` [PATCH v6 md-6.18 01/11] md: add a new parameter 'offset' to md_super_write() Yu Kuai
@ 2025-08-26  8:51 ` Yu Kuai
  2025-08-26  8:51 ` [PATCH v6 md-6.18 03/11] md/md-bitmap: support discard for bitmap ops Yu Kuai
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-08-26  8:51 UTC (permalink / raw)
  To: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3, yukuai1,
	yi.zhang, yangerkun, johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

There are no functional changes; the helper will be used by llbitmap in
following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.c | 9 +--------
 drivers/md/md.h | 6 ++++++
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 74f876497c09..86cf97c0a77b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9121,19 +9121,12 @@ static sector_t md_sync_position(struct mddev *mddev, enum sync_action action)
 
 static bool sync_io_within_limit(struct mddev *mddev)
 {
-	int io_sectors;
-
 	/*
 	 * For raid456, sync IO is stripe(4k) per IO, for other levels, it's
 	 * RESYNC_PAGES(64k) per IO.
 	 */
-	if (mddev->level == 4 || mddev->level == 5 || mddev->level == 6)
-		io_sectors = 8;
-	else
-		io_sectors = 128;
-
 	return atomic_read(&mddev->recovery_active) <
-		io_sectors * sync_io_depth(mddev);
+	       (raid_is_456(mddev) ? 8 : 128) * sync_io_depth(mddev);
 }
 
 #define SYNC_MARKS	10
diff --git a/drivers/md/md.h b/drivers/md/md.h
index cadd9bc99938..5ef73109d14d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -1033,6 +1033,12 @@ static inline bool mddev_is_dm(struct mddev *mddev)
 	return !mddev->gendisk;
 }
 
+static inline bool raid_is_456(struct mddev *mddev)
+{
+	return mddev->level == ID_RAID4 || mddev->level == ID_RAID5 ||
+	       mddev->level == ID_RAID6;
+}
+
 static inline void mddev_trace_remap(struct mddev *mddev, struct bio *bio,
 		sector_t sector)
 {
-- 
2.39.2



* [PATCH v6 md-6.18 03/11] md/md-bitmap: support discard for bitmap ops
  2025-08-26  8:51 [PATCH v6 md-6.18 00/11] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
  2025-08-26  8:51 ` [PATCH v6 md-6.18 01/11] md: add a new parameter 'offset' to md_super_write() Yu Kuai
  2025-08-26  8:51 ` [PATCH v6 md-6.18 02/11] md: factor out a helper raid_is_456() Yu Kuai
@ 2025-08-26  8:51 ` Yu Kuai
  2025-08-26  8:51 ` [PATCH v6 md-6.18 04/11] md: add a new mddev field 'bitmap_id' Yu Kuai
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-08-26  8:51 UTC (permalink / raw)
  To: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3, yukuai1,
	yi.zhang, yangerkun, johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Use two new methods, {start, end}_discard, in bitmap_ops and a new field
'rw' in struct md_io_clone to handle discard IO, preparing to support the
new md bitmap.

Since all bitmap functions that handle write IO have the same signature,
also add a typedef to make the code cleaner.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md-bitmap.c |  3 +++
 drivers/md/md-bitmap.h | 12 ++++++++----
 drivers/md/md.c        | 15 +++++++++++----
 drivers/md/md.h        |  1 +
 4 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index b157119de123..dc050ff94d5b 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -3005,6 +3005,9 @@ static struct bitmap_operations bitmap_ops = {
 
 	.start_write		= bitmap_start_write,
 	.end_write		= bitmap_end_write,
+	.start_discard		= bitmap_start_write,
+	.end_discard		= bitmap_end_write,
+
 	.start_sync		= bitmap_start_sync,
 	.end_sync		= bitmap_end_sync,
 	.cond_end_sync		= bitmap_cond_end_sync,
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 42f91755a341..8616ced49077 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -61,6 +61,9 @@ struct md_bitmap_stats {
 	struct file	*file;
 };
 
+typedef void (md_bitmap_fn)(struct mddev *mddev, sector_t offset,
+			    unsigned long sectors);
+
 struct bitmap_operations {
 	struct md_submodule_head head;
 
@@ -81,10 +84,11 @@ struct bitmap_operations {
 	void (*end_behind_write)(struct mddev *mddev);
 	void (*wait_behind_writes)(struct mddev *mddev);
 
-	void (*start_write)(struct mddev *mddev, sector_t offset,
-			    unsigned long sectors);
-	void (*end_write)(struct mddev *mddev, sector_t offset,
-			  unsigned long sectors);
+	md_bitmap_fn *start_write;
+	md_bitmap_fn *end_write;
+	md_bitmap_fn *start_discard;
+	md_bitmap_fn *end_discard;
+
 	bool (*start_sync)(struct mddev *mddev, sector_t offset,
 			   sector_t *blocks, bool degraded);
 	void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 86cf97c0a77b..2e088196d42c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8933,18 +8933,24 @@ EXPORT_SYMBOL_GPL(md_submit_discard_bio);
 static void md_bitmap_start(struct mddev *mddev,
 			    struct md_io_clone *md_io_clone)
 {
+	md_bitmap_fn *fn = unlikely(md_io_clone->rw == STAT_DISCARD) ?
+			   mddev->bitmap_ops->start_discard :
+			   mddev->bitmap_ops->start_write;
+
 	if (mddev->pers->bitmap_sector)
 		mddev->pers->bitmap_sector(mddev, &md_io_clone->offset,
 					   &md_io_clone->sectors);
 
-	mddev->bitmap_ops->start_write(mddev, md_io_clone->offset,
-				       md_io_clone->sectors);
+	fn(mddev, md_io_clone->offset, md_io_clone->sectors);
 }
 
 static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone)
 {
-	mddev->bitmap_ops->end_write(mddev, md_io_clone->offset,
-				     md_io_clone->sectors);
+	md_bitmap_fn *fn = unlikely(md_io_clone->rw == STAT_DISCARD) ?
+			   mddev->bitmap_ops->end_discard :
+			   mddev->bitmap_ops->end_write;
+
+	fn(mddev, md_io_clone->offset, md_io_clone->sectors);
 }
 
 static void md_end_clone_io(struct bio *bio)
@@ -8983,6 +8989,7 @@ static void md_clone_bio(struct mddev *mddev, struct bio **bio)
 	if (bio_data_dir(*bio) == WRITE && md_bitmap_enabled(mddev, false)) {
 		md_io_clone->offset = (*bio)->bi_iter.bi_sector;
 		md_io_clone->sectors = bio_sectors(*bio);
+		md_io_clone->rw = op_stat_group(bio_op(*bio));
 		md_bitmap_start(mddev, md_io_clone);
 	}
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 5ef73109d14d..1b767b5320cf 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -872,6 +872,7 @@ struct md_io_clone {
 	unsigned long	start_time;
 	sector_t	offset;
 	unsigned long	sectors;
+	enum stat_group	rw;
 	struct bio	bio_clone;
 };
 
-- 
2.39.2



* [PATCH v6 md-6.18 04/11] md: add a new mddev field 'bitmap_id'
  2025-08-26  8:51 [PATCH v6 md-6.18 00/11] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (2 preceding siblings ...)
  2025-08-26  8:51 ` [PATCH v6 md-6.18 03/11] md/md-bitmap: support discard for bitmap ops Yu Kuai
@ 2025-08-26  8:51 ` Yu Kuai
  2025-08-26  8:51 ` [PATCH v6 md-6.18 05/11] md/md-bitmap: add a new sysfs api bitmap_type Yu Kuai
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-08-26  8:51 UTC (permalink / raw)
  To: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3, yukuai1,
	yi.zhang, yangerkun, johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Prepare to store the bitmap id selected by the user; also refactor
mddev_set_bitmap_ops() a bit to handle the case where the value is invalid.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/md.c | 37 +++++++++++++++++++++++++++++++------
 drivers/md/md.h |  2 ++
 2 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2e088196d42c..82c84bdabe79 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -676,13 +676,33 @@ static void active_io_release(struct percpu_ref *ref)
 
 static void no_op(struct percpu_ref *r) {}
 
-static void mddev_set_bitmap_ops(struct mddev *mddev, enum md_submodule_id id)
+static bool mddev_set_bitmap_ops(struct mddev *mddev)
 {
+	struct md_submodule_head *head;
+
+	if (mddev->bitmap_id == ID_BITMAP_NONE)
+		return true;
+
 	xa_lock(&md_submodule);
-	mddev->bitmap_ops = xa_load(&md_submodule, id);
+	head = xa_load(&md_submodule, mddev->bitmap_id);
+
+	if (!head) {
+		pr_warn("md: can't find bitmap id %d\n", mddev->bitmap_id);
+		goto err;
+	}
+
+	if (head->type != MD_BITMAP) {
+		pr_warn("md: invalid bitmap id %d\n", mddev->bitmap_id);
+		goto err;
+	}
+
+	mddev->bitmap_ops = (void *)head;
 	xa_unlock(&md_submodule);
-	if (!mddev->bitmap_ops)
-		pr_warn_once("md: can't find bitmap id %d\n", id);
+	return true;
+
+err:
+	xa_unlock(&md_submodule);
+	return false;
 }
 
 static void mddev_clear_bitmap_ops(struct mddev *mddev)
@@ -692,8 +712,13 @@ static void mddev_clear_bitmap_ops(struct mddev *mddev)
 
 int mddev_init(struct mddev *mddev)
 {
-	/* TODO: support more versions */
-	mddev_set_bitmap_ops(mddev, ID_BITMAP);
+	if (!IS_ENABLED(CONFIG_MD_BITMAP)) {
+		mddev->bitmap_id = ID_BITMAP_NONE;
+	} else {
+		mddev->bitmap_id = ID_BITMAP;
+		if (!mddev_set_bitmap_ops(mddev))
+			return -EINVAL;
+	}
 
 	if (percpu_ref_init(&mddev->active_io, active_io_release,
 			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1b767b5320cf..4fa5a3e68a0c 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -40,6 +40,7 @@ enum md_submodule_id {
 	ID_CLUSTER,
 	ID_BITMAP,
 	ID_LLBITMAP,	/* TODO */
+	ID_BITMAP_NONE,
 };
 
 struct md_submodule_head {
@@ -565,6 +566,7 @@ struct mddev {
 	struct percpu_ref		writes_pending;
 	int				sync_checkers;	/* # of threads checking writes_pending */
 
+	enum md_submodule_id		bitmap_id;
 	void				*bitmap; /* the bitmap for the device */
 	struct bitmap_operations	*bitmap_ops;
 	struct {
-- 
2.39.2



* [PATCH v6 md-6.18 05/11] md/md-bitmap: add a new sysfs api bitmap_type
  2025-08-26  8:51 [PATCH v6 md-6.18 00/11] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (3 preceding siblings ...)
  2025-08-26  8:51 ` [PATCH v6 md-6.18 04/11] md: add a new mddev field 'bitmap_id' Yu Kuai
@ 2025-08-26  8:51 ` Yu Kuai
  2025-08-26  8:52 ` [PATCH v6 md-6.18 06/11] md/md-bitmap: delay registration of bitmap_ops until creating bitmap Yu Kuai
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-08-26  8:51 UTC (permalink / raw)
  To: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3, yukuai1,
	yi.zhang, yangerkun, johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

The API will be used by mdadm to set the bitmap_type while creating a new
array or assembling an array, preparing to add a new bitmap.

Currently available options are:

cat /sys/block/md0/md/bitmap_type
none [bitmap]

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Li Nan <linan122@huawei.com>
---
 Documentation/admin-guide/md.rst | 73 ++++++++++++++++------------
 drivers/md/md.c                  | 81 ++++++++++++++++++++++++++++++++
 2 files changed, 124 insertions(+), 30 deletions(-)

diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index 4ff2cc291d18..356d2a344f08 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -347,6 +347,49 @@ All md devices contain:
      active-idle
          like active, but no writes have been seen for a while (safe_mode_delay).
 
+  consistency_policy
+     This indicates how the array maintains consistency in case of unexpected
+     shutdown. It can be:
+
+     none
+       Array has no redundancy information, e.g. raid0, linear.
+
+     resync
+       Full resync is performed and all redundancy is regenerated when the
+       array is started after unclean shutdown.
+
+     bitmap
+       Resync assisted by a write-intent bitmap.
+
+     journal
+       For raid4/5/6, journal device is used to log transactions and replay
+       after unclean shutdown.
+
+     ppl
+       For raid5 only, Partial Parity Log is used to close the write hole and
+       eliminate resync.
+
+     The accepted values when writing to this file are ``ppl`` and ``resync``,
+     used to enable and disable PPL.
+
+  uuid
+     This indicates the UUID of the array in the following format:
+     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
+
+  bitmap_type
+     [RW] When read, this file will display the current and available
+     bitmap for this array. The currently active bitmap will be enclosed
+     in [] brackets. Writing a bitmap name or ID to this file will switch
+     control of this array to that new bitmap. Note that writing a new
+     bitmap for an already created array is forbidden.
+
+     none
+         No bitmap
+     bitmap
+         The default internal bitmap
+
+If bitmap_type is bitmap, then the md device will also contain:
+
   bitmap/location
      This indicates where the write-intent bitmap for the array is
      stored.
@@ -401,36 +444,6 @@ All md devices contain:
      once the array becomes non-degraded, and this fact has been
      recorded in the metadata.
 
-  consistency_policy
-     This indicates how the array maintains consistency in case of unexpected
-     shutdown. It can be:
-
-     none
-       Array has no redundancy information, e.g. raid0, linear.
-
-     resync
-       Full resync is performed and all redundancy is regenerated when the
-       array is started after unclean shutdown.
-
-     bitmap
-       Resync assisted by a write-intent bitmap.
-
-     journal
-       For raid4/5/6, journal device is used to log transactions and replay
-       after unclean shutdown.
-
-     ppl
-       For raid5 only, Partial Parity Log is used to close the write hole and
-       eliminate resync.
-
-     The accepted values when writing to this file are ``ppl`` and ``resync``,
-     used to enable and disable PPL.
-
-  uuid
-     This indicates the UUID of the array in the following format:
-     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
-
-
 As component devices are added to an md array, they appear in the ``md``
 directory as new directories named::
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 82c84bdabe79..aeae0d4854dc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4207,6 +4207,86 @@ new_level_store(struct mddev *mddev, const char *buf, size_t len)
 static struct md_sysfs_entry md_new_level =
 __ATTR(new_level, 0664, new_level_show, new_level_store);
 
+static ssize_t
+bitmap_type_show(struct mddev *mddev, char *page)
+{
+	struct md_submodule_head *head;
+	unsigned long i;
+	ssize_t len = 0;
+
+	if (mddev->bitmap_id == ID_BITMAP_NONE)
+		len += sprintf(page + len, "[none] ");
+	else
+		len += sprintf(page + len, "none ");
+
+	xa_lock(&md_submodule);
+	xa_for_each(&md_submodule, i, head) {
+		if (head->type != MD_BITMAP)
+			continue;
+
+		if (mddev->bitmap_id == head->id)
+			len += sprintf(page + len, "[%s] ", head->name);
+		else
+			len += sprintf(page + len, "%s ", head->name);
+	}
+	xa_unlock(&md_submodule);
+
+	len += sprintf(page + len, "\n");
+	return len;
+}
+
+static ssize_t
+bitmap_type_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	struct md_submodule_head *head;
+	enum md_submodule_id id;
+	unsigned long i;
+	int err = 0;
+
+	xa_lock(&md_submodule);
+
+	if (mddev->bitmap_ops) {
+		err = -EBUSY;
+		goto out;
+	}
+
+	if (cmd_match(buf, "none")) {
+		mddev->bitmap_id = ID_BITMAP_NONE;
+		goto out;
+	}
+
+	xa_for_each(&md_submodule, i, head) {
+		if (head->type == MD_BITMAP && cmd_match(buf, head->name)) {
+			mddev->bitmap_id = head->id;
+			goto out;
+		}
+	}
+
+	err = kstrtoint(buf, 10, &id);
+	if (err)
+		goto out;
+
+	if (id == ID_BITMAP_NONE) {
+		mddev->bitmap_id = id;
+		goto out;
+	}
+
+	head = xa_load(&md_submodule, id);
+	if (head && head->type == MD_BITMAP) {
+		mddev->bitmap_id = id;
+		goto out;
+	}
+
+	err = -ENOENT;
+
+out:
+	xa_unlock(&md_submodule);
+	return err ? err : len;
+}
+
+static struct md_sysfs_entry md_bitmap_type =
+__ATTR(bitmap_type, 0664, bitmap_type_show, bitmap_type_store);
+
 static ssize_t
 layout_show(struct mddev *mddev, char *page)
 {
@@ -5813,6 +5893,7 @@ __ATTR(serialize_policy, S_IRUGO | S_IWUSR, serialize_policy_show,
 static struct attribute *md_default_attrs[] = {
 	&md_level.attr,
 	&md_new_level.attr,
+	&md_bitmap_type.attr,
 	&md_layout.attr,
 	&md_raid_disks.attr,
 	&md_uuid.attr,
-- 
2.39.2



* [PATCH v6 md-6.18 06/11] md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  2025-08-26  8:51 [PATCH v6 md-6.18 00/11] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (4 preceding siblings ...)
  2025-08-26  8:51 ` [PATCH v6 md-6.18 05/11] md/md-bitmap: add a new sysfs api bitmap_type Yu Kuai
@ 2025-08-26  8:52 ` Yu Kuai
  2025-08-26  8:52 ` [PATCH v6 md-6.18 07/11] md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations Yu Kuai
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-08-26  8:52 UTC (permalink / raw)
  To: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3, yukuai1,
	yi.zhang, yangerkun, johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Currently bitmap_ops is registered while allocating the mddev; this is fine
when there is only one bitmap_ops.

Delay setting bitmap_ops until the bitmap is created, so that the user can
choose which bitmap to use before running the array.

Link: https://lore.kernel.org/linux-raid/20250721171557.34587-7-yukuai@kernel.org
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
---
 Documentation/admin-guide/md.rst |  3 ++
 drivers/md/md.c                  | 90 +++++++++++++++++++-------------
 2 files changed, 56 insertions(+), 37 deletions(-)

diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index 356d2a344f08..001363f81850 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -388,6 +388,9 @@ All md devices contain:
      bitmap
          The default internal bitmap
 
+If bitmap_type is not none, then additional bitmap attributes bitmap/xxx or
+llbitmap/xxx will be created after the md device KOBJ_CHANGE event.
+
 If bitmap_type is bitmap, then the md device will also contain:
 
   bitmap/location
diff --git a/drivers/md/md.c b/drivers/md/md.c
index aeae0d4854dc..6560bd89d0a2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -678,9 +678,11 @@ static void no_op(struct percpu_ref *r) {}
 
 static bool mddev_set_bitmap_ops(struct mddev *mddev)
 {
+	struct bitmap_operations *old = mddev->bitmap_ops;
 	struct md_submodule_head *head;
 
-	if (mddev->bitmap_id == ID_BITMAP_NONE)
+	if (mddev->bitmap_id == ID_BITMAP_NONE ||
+	    (old && old->head.id == mddev->bitmap_id))
 		return true;
 
 	xa_lock(&md_submodule);
@@ -698,6 +700,18 @@ static bool mddev_set_bitmap_ops(struct mddev *mddev)
 
 	mddev->bitmap_ops = (void *)head;
 	xa_unlock(&md_submodule);
+
+	if (!mddev_is_dm(mddev) && mddev->bitmap_ops->group) {
+		if (sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group))
+			pr_warn("md: cannot register extra bitmap attributes for %s\n",
+				mdname(mddev));
+		else
+			/*
+			 * Inform user with KOBJ_CHANGE about new bitmap
+			 * attributes.
+			 */
+			kobject_uevent(&mddev->kobj, KOBJ_CHANGE);
+	}
 	return true;
 
 err:
@@ -707,28 +721,26 @@ static bool mddev_set_bitmap_ops(struct mddev *mddev)
 
 static void mddev_clear_bitmap_ops(struct mddev *mddev)
 {
+	if (!mddev_is_dm(mddev) && mddev->bitmap_ops &&
+	    mddev->bitmap_ops->group)
+		sysfs_remove_group(&mddev->kobj, mddev->bitmap_ops->group);
+
 	mddev->bitmap_ops = NULL;
 }
 
 int mddev_init(struct mddev *mddev)
 {
-	if (!IS_ENABLED(CONFIG_MD_BITMAP)) {
+	if (!IS_ENABLED(CONFIG_MD_BITMAP))
 		mddev->bitmap_id = ID_BITMAP_NONE;
-	} else {
+	else
 		mddev->bitmap_id = ID_BITMAP;
-		if (!mddev_set_bitmap_ops(mddev))
-			return -EINVAL;
-	}
 
 	if (percpu_ref_init(&mddev->active_io, active_io_release,
-			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
-		mddev_clear_bitmap_ops(mddev);
+			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
 		return -ENOMEM;
-	}
 
 	if (percpu_ref_init(&mddev->writes_pending, no_op,
 			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
-		mddev_clear_bitmap_ops(mddev);
 		percpu_ref_exit(&mddev->active_io);
 		return -ENOMEM;
 	}
@@ -766,7 +778,6 @@ EXPORT_SYMBOL_GPL(mddev_init);
 
 void mddev_destroy(struct mddev *mddev)
 {
-	mddev_clear_bitmap_ops(mddev);
 	percpu_ref_exit(&mddev->active_io);
 	percpu_ref_exit(&mddev->writes_pending);
 }
@@ -6196,11 +6207,6 @@ struct mddev *md_alloc(dev_t dev, char *name)
 		return ERR_PTR(error);
 	}
 
-	if (md_bitmap_registered(mddev) && mddev->bitmap_ops->group)
-		if (sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group))
-			pr_warn("md: cannot register extra bitmap attributes for %s\n",
-				mdname(mddev));
-
 	kobject_uevent(&mddev->kobj, KOBJ_ADD);
 	mddev->sysfs_state = sysfs_get_dirent_safe(mddev->kobj.sd, "array_state");
 	mddev->sysfs_level = sysfs_get_dirent_safe(mddev->kobj.sd, "level");
@@ -6279,6 +6285,26 @@ static void md_safemode_timeout(struct timer_list *t)
 
 static int start_dirty_degraded;
 
+static int md_bitmap_create(struct mddev *mddev)
+{
+	if (mddev->bitmap_id == ID_BITMAP_NONE)
+		return -EINVAL;
+
+	if (!mddev_set_bitmap_ops(mddev))
+		return -ENOENT;
+
+	return mddev->bitmap_ops->create(mddev);
+}
+
+static void md_bitmap_destroy(struct mddev *mddev)
+{
+	if (!md_bitmap_registered(mddev))
+		return;
+
+	mddev->bitmap_ops->destroy(mddev);
+	mddev_clear_bitmap_ops(mddev);
+}
+
 int md_run(struct mddev *mddev)
 {
 	int err;
@@ -6443,9 +6469,9 @@ int md_run(struct mddev *mddev)
 			(unsigned long long)pers->size(mddev, 0, 0) / 2);
 		err = -EINVAL;
 	}
-	if (err == 0 && pers->sync_request && md_bitmap_registered(mddev) &&
+	if (err == 0 && pers->sync_request &&
 	    (mddev->bitmap_info.file || mddev->bitmap_info.offset)) {
-		err = mddev->bitmap_ops->create(mddev);
+		err = md_bitmap_create(mddev);
 		if (err)
 			pr_warn("%s: failed to create bitmap (%d)\n",
 				mdname(mddev), err);
@@ -6518,8 +6544,7 @@ int md_run(struct mddev *mddev)
 		pers->free(mddev, mddev->private);
 	mddev->private = NULL;
 	put_pers(pers);
-	if (md_bitmap_registered(mddev))
-		mddev->bitmap_ops->destroy(mddev);
+	md_bitmap_destroy(mddev);
 abort:
 	bioset_exit(&mddev->io_clone_set);
 exit_sync_set:
@@ -6542,7 +6567,7 @@ int do_md_run(struct mddev *mddev)
 	if (md_bitmap_registered(mddev)) {
 		err = mddev->bitmap_ops->load(mddev);
 		if (err) {
-			mddev->bitmap_ops->destroy(mddev);
+			md_bitmap_destroy(mddev);
 			goto out;
 		}
 	}
@@ -6740,8 +6765,7 @@ static void __md_stop(struct mddev *mddev)
 {
 	struct md_personality *pers = mddev->pers;
 
-	if (md_bitmap_registered(mddev))
-		mddev->bitmap_ops->destroy(mddev);
+	md_bitmap_destroy(mddev);
 	mddev_detach(mddev);
 	spin_lock(&mddev->lock);
 	mddev->pers = NULL;
@@ -7518,16 +7542,16 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
 	err = 0;
 	if (mddev->pers) {
 		if (fd >= 0) {
-			err = mddev->bitmap_ops->create(mddev);
+			err = md_bitmap_create(mddev);
 			if (!err)
 				err = mddev->bitmap_ops->load(mddev);
 
 			if (err) {
-				mddev->bitmap_ops->destroy(mddev);
+				md_bitmap_destroy(mddev);
 				fd = -1;
 			}
 		} else if (fd < 0) {
-			mddev->bitmap_ops->destroy(mddev);
+			md_bitmap_destroy(mddev);
 		}
 	}
 
@@ -7812,14 +7836,6 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
 		rv = update_raid_disks(mddev, info->raid_disks);
 
 	if ((state ^ info->state) & (1<<MD_SB_BITMAP_PRESENT)) {
-		/*
-		 * Metadata says bitmap existed, however kernel can't find
-		 * registered bitmap.
-		 */
-		if (WARN_ON_ONCE(!md_bitmap_registered(mddev))) {
-			rv = -EINVAL;
-			goto err;
-		}
 		if (mddev->pers->quiesce == NULL || mddev->thread == NULL) {
 			rv = -EINVAL;
 			goto err;
@@ -7842,12 +7858,12 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
 				mddev->bitmap_info.default_offset;
 			mddev->bitmap_info.space =
 				mddev->bitmap_info.default_space;
-			rv = mddev->bitmap_ops->create(mddev);
+			rv = md_bitmap_create(mddev);
 			if (!rv)
 				rv = mddev->bitmap_ops->load(mddev);
 
 			if (rv)
-				mddev->bitmap_ops->destroy(mddev);
+				md_bitmap_destroy(mddev);
 		} else {
 			struct md_bitmap_stats stats;
 
@@ -7873,7 +7889,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
 				put_cluster_ops(mddev);
 				mddev->safemode_delay = DEFAULT_SAFEMODE_DELAY;
 			}
-			mddev->bitmap_ops->destroy(mddev);
+			md_bitmap_destroy(mddev);
 			mddev->bitmap_info.offset = 0;
 		}
 	}
-- 
2.39.2



* [PATCH v6 md-6.18 07/11] md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
  2025-08-26  8:51 [PATCH v6 md-6.18 00/11] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (5 preceding siblings ...)
  2025-08-26  8:52 ` [PATCH v6 md-6.18 06/11] md/md-bitmap: delay registration of bitmap_ops until creating bitmap Yu Kuai
@ 2025-08-26  8:52 ` Yu Kuai
  2025-08-26  8:52 ` [PATCH v6 md-6.18 08/11] md/md-bitmap: add a new method blocks_synced() " Yu Kuai
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-08-26  8:52 UTC (permalink / raw)
  To: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3, yukuai1,
	yi.zhang, yangerkun, johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

This method is used to check if blocks can be skipped before calling
into pers->sync_request(). llbitmap will use this method to skip
resync for unwritten/clean data blocks, and recovery/check/repair for
unwritten data blocks.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md-bitmap.h | 1 +
 drivers/md/md.c        | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 8616ced49077..95453696c68e 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -89,6 +89,7 @@ struct bitmap_operations {
 	md_bitmap_fn *start_discard;
 	md_bitmap_fn *end_discard;
 
+	sector_t (*skip_sync_blocks)(struct mddev *mddev, sector_t offset);
 	bool (*start_sync)(struct mddev *mddev, sector_t offset,
 			   sector_t *blocks, bool degraded);
 	void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 6560bd89d0a2..7196e7f6b2a4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9460,6 +9460,12 @@ void md_do_sync(struct md_thread *thread)
 		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
 			break;
 
+		if (mddev->bitmap_ops && mddev->bitmap_ops->skip_sync_blocks) {
+			sectors = mddev->bitmap_ops->skip_sync_blocks(mddev, j);
+			if (sectors)
+				goto update;
+		}
+
 		sectors = mddev->pers->sync_request(mddev, j, max_sectors,
 						    &skipped);
 		if (sectors == 0) {
@@ -9475,6 +9481,7 @@ void md_do_sync(struct md_thread *thread)
 		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
 			break;
 
+update:
 		j += sectors;
 		if (j > max_sectors)
 			/* when skipping, extra large numbers can be returned. */
-- 
2.39.2



* [PATCH v6 md-6.18 08/11] md/md-bitmap: add a new method blocks_synced() in bitmap_operations
  2025-08-26  8:51 [PATCH v6 md-6.18 00/11] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (6 preceding siblings ...)
  2025-08-26  8:52 ` [PATCH v6 md-6.18 07/11] md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations Yu Kuai
@ 2025-08-26  8:52 ` Yu Kuai
  2025-08-26  8:52 ` [PATCH v6 md-6.18 09/11] md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER Yu Kuai
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-08-26  8:52 UTC (permalink / raw)
  To: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3, yukuai1,
	yi.zhang, yangerkun, johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Currently, raid456 must perform a whole-array initial recovery to build the
initial xor data; after that, IO to the array won't have to read all the
blocks from the underlying disks.

This behavior affects IO performance a lot, and nowadays there are huge
disks, so the initial recovery can take a long time. Hence llbitmap will
support lazy initial recovery in following patches. This method is used to
check whether data blocks are synced or not; if not, IO will still have to
read all blocks for raid456.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md-bitmap.h |  1 +
 drivers/md/raid5.c     | 15 +++++++++++----
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 95453696c68e..5f41724cbcd8 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -90,6 +90,7 @@ struct bitmap_operations {
 	md_bitmap_fn *end_discard;
 
 	sector_t (*skip_sync_blocks)(struct mddev *mddev, sector_t offset);
+	bool (*blocks_synced)(struct mddev *mddev, sector_t offset);
 	bool (*start_sync)(struct mddev *mddev, sector_t offset,
 			   sector_t *blocks, bool degraded);
 	void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5285e72341a2..672ab226e43c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4097,7 +4097,8 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 				  int disks)
 {
 	int rmw = 0, rcw = 0, i;
-	sector_t resync_offset = conf->mddev->resync_offset;
+	struct mddev *mddev = conf->mddev;
+	sector_t resync_offset = mddev->resync_offset;
 
 	/* Check whether resync is now happening or should start.
 	 * If yes, then the array is dirty (after unclean shutdown or
@@ -4116,6 +4117,12 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 		pr_debug("force RCW rmw_level=%u, resync_offset=%llu sh->sector=%llu\n",
 			 conf->rmw_level, (unsigned long long)resync_offset,
 			 (unsigned long long)sh->sector);
+	} else if (mddev->bitmap_ops && mddev->bitmap_ops->blocks_synced &&
+		   !mddev->bitmap_ops->blocks_synced(mddev, sh->sector)) {
+		/* The initial recover is not done, must read everything */
+		rcw = 1; rmw = 2;
+		pr_debug("force RCW by lazy recovery, sh->sector=%llu\n",
+			 sh->sector);
 	} else for (i = disks; i--; ) {
 		/* would I have to read this buffer for read_modify_write */
 		struct r5dev *dev = &sh->dev[i];
@@ -4148,7 +4155,7 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 	set_bit(STRIPE_HANDLE, &sh->state);
 	if ((rmw < rcw || (rmw == rcw && conf->rmw_level == PARITY_PREFER_RMW)) && rmw > 0) {
 		/* prefer read-modify-write, but need to get some data */
-		mddev_add_trace_msg(conf->mddev, "raid5 rmw %llu %d",
+		mddev_add_trace_msg(mddev, "raid5 rmw %llu %d",
 				sh->sector, rmw);
 
 		for (i = disks; i--; ) {
@@ -4227,8 +4234,8 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 					set_bit(STRIPE_DELAYED, &sh->state);
 			}
 		}
-		if (rcw && !mddev_is_dm(conf->mddev))
-			blk_add_trace_msg(conf->mddev->gendisk->queue,
+		if (rcw && !mddev_is_dm(mddev))
+			blk_add_trace_msg(mddev->gendisk->queue,
 				"raid5 rcw %llu %d %d %d",
 				(unsigned long long)sh->sector, rcw, qread,
 				test_bit(STRIPE_DELAYED, &sh->state));
-- 
2.39.2



* [PATCH v6 md-6.18 09/11] md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
  2025-08-26  8:51 [PATCH v6 md-6.18 00/11] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (7 preceding siblings ...)
  2025-08-26  8:52 ` [PATCH v6 md-6.18 08/11] md/md-bitmap: add a new method blocks_synced() " Yu Kuai
@ 2025-08-26  8:52 ` Yu Kuai
  2025-08-26  8:52 ` [PATCH v6 md-6.18 10/11] md/md-bitmap: make method bitmap_ops->daemon_work optional Yu Kuai
  2025-08-26  8:52 ` [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless bitmap Yu Kuai
  10 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-08-26  8:52 UTC (permalink / raw)
  To: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3, yukuai1,
	yi.zhang, yangerkun, johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

This flag is used by llbitmap in later patches to skip the raid456 initial
recovery and delay building the initial xor data until the first write.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md.c    | 47 +++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/md.h    |  2 ++
 drivers/md/raid5.c | 19 +++++++++++++++----
 3 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 7196e7f6b2a4..199843356449 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9199,6 +9199,39 @@ static sector_t md_sync_max_sectors(struct mddev *mddev,
 	}
 }
 
+/*
+ * If lazy recovery is requested and all rdevs are in sync, select the rdev
+ * with the highest index to perform recovery and build the initial xor data;
+ * this is the same as the old bitmap.
+ */
+static bool mddev_select_lazy_recover_rdev(struct mddev *mddev)
+{
+	struct md_rdev *recover_rdev = NULL;
+	struct md_rdev *rdev;
+	bool ret = false;
+
+	rcu_read_lock();
+	rdev_for_each_rcu(rdev, mddev) {
+		if (rdev->raid_disk < 0)
+			continue;
+
+		if (test_bit(Faulty, &rdev->flags) ||
+		    !test_bit(In_sync, &rdev->flags))
+			break;
+
+		if (!recover_rdev || recover_rdev->raid_disk < rdev->raid_disk)
+			recover_rdev = rdev;
+	}
+
+	if (recover_rdev) {
+		clear_bit(In_sync, &recover_rdev->flags);
+		ret = true;
+	}
+
+	rcu_read_unlock();
+	return ret;
+}
+
 static sector_t md_sync_position(struct mddev *mddev, enum sync_action action)
 {
 	sector_t start = 0;
@@ -9230,6 +9263,14 @@ static sector_t md_sync_position(struct mddev *mddev, enum sync_action action)
 				start = rdev->recovery_offset;
 		rcu_read_unlock();
 
+		/*
+		 * If there are no spares and raid456 lazy initial recovery is
+		 * requested, start the lazy recovery from sector 0.
+		 */
+		if (test_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery) &&
+		    start == MaxSector && mddev_select_lazy_recover_rdev(mddev))
+			start = 0;
+
 		/* If there is a bitmap, we need to make sure all
 		 * writes that started before we added a spare
 		 * complete before we start doing a recovery.
@@ -9791,6 +9832,7 @@ static bool md_choose_sync_action(struct mddev *mddev, int *spares)
 
 		set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
 		clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
+		clear_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery);
 		return true;
 	}
 
@@ -9799,6 +9841,7 @@ static bool md_choose_sync_action(struct mddev *mddev, int *spares)
 		remove_spares(mddev, NULL);
 		set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 		clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
+		clear_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery);
 		return true;
 	}
 
@@ -9808,7 +9851,7 @@ static bool md_choose_sync_action(struct mddev *mddev, int *spares)
 	 * re-add.
 	 */
 	*spares = remove_and_add_spares(mddev, NULL);
-	if (*spares) {
+	if (*spares || test_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery)) {
 		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
 		clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
@@ -10021,6 +10064,7 @@ void md_check_recovery(struct mddev *mddev)
 			}
 
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
+			clear_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery);
 			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
 
@@ -10131,6 +10175,7 @@ void md_reap_sync_thread(struct mddev *mddev)
 	clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
 	clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
 	clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
+	clear_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery);
 	/*
 	 * We call mddev->cluster_ops->update_size here because sync_size could
 	 * be changed by md_update_sb, and MD_RECOVERY_RESHAPE is cleared,
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 4fa5a3e68a0c..7b6357879a84 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -667,6 +667,8 @@ enum recovery_flags {
 	MD_RECOVERY_RESHAPE,
 	/* remote node is running resync thread */
 	MD_RESYNCING_REMOTE,
+	/* raid456 lazy initial recover */
+	MD_RECOVERY_LAZY_RECOVER,
 };
 
 enum md_ro_state {
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 672ab226e43c..5112658ef5f6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4705,10 +4705,21 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
 			}
 		} else if (test_bit(In_sync, &rdev->flags))
 			set_bit(R5_Insync, &dev->flags);
-		else if (sh->sector + RAID5_STRIPE_SECTORS(conf) <= rdev->recovery_offset)
-			/* in sync if before recovery_offset */
-			set_bit(R5_Insync, &dev->flags);
-		else if (test_bit(R5_UPTODATE, &dev->flags) &&
+		else if (sh->sector + RAID5_STRIPE_SECTORS(conf) <=
+			 rdev->recovery_offset) {
+			/*
+			 * in sync if:
+			 *  - normal IO, or
+			 *  - resync IO that is not lazy recovery
+			 *
+			 * For lazy recovery, we have to mark the rdev without
+			 * In_sync as failed, to build initial xor data.
+			 */
+			if (!test_bit(STRIPE_SYNCING, &sh->state) ||
+			    !test_bit(MD_RECOVERY_LAZY_RECOVER,
+				      &conf->mddev->recovery))
+				set_bit(R5_Insync, &dev->flags);
+		} else if (test_bit(R5_UPTODATE, &dev->flags) &&
 			 test_bit(R5_Expanded, &dev->flags))
 			/* If we've reshaped into here, we assume it is Insync.
 			 * We will shortly update recovery_offset to make
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v6 md-6.18 10/11] md/md-bitmap: make method bitmap_ops->daemon_work optional
  2025-08-26  8:51 [PATCH v6 md-6.18 00/11] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (8 preceding siblings ...)
  2025-08-26  8:52 ` [PATCH v6 md-6.18 09/11] md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER Yu Kuai
@ 2025-08-26  8:52 ` Yu Kuai
  2025-08-26  8:52 ` [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless bitmap Yu Kuai
  10 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-08-26  8:52 UTC (permalink / raw)
  To: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3, yukuai1,
	yi.zhang, yangerkun, johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

daemon_work() is called by the daemon thread. On the one hand, the daemon
thread doesn't have a strict wake-up time; on the other hand, too much
work is put on the daemon thread, like handling sync IO, handling failed
or special normal IO, handling recovery, and so on. Hence the daemon
thread may be too busy to clear dirty bits in time.

Make bitmap_ops->daemon_work() optional; the following patches will use
separate async work to clear dirty bits for the new bitmap.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 199843356449..3a3a3fdecfbd 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9997,7 +9997,7 @@ static void unregister_sync_thread(struct mddev *mddev)
  */
 void md_check_recovery(struct mddev *mddev)
 {
-	if (md_bitmap_enabled(mddev, false))
+	if (md_bitmap_enabled(mddev, false) && mddev->bitmap_ops->daemon_work)
 		mddev->bitmap_ops->daemon_work(mddev);
 
 	if (signal_pending(current)) {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless bitmap
  2025-08-26  8:51 [PATCH v6 md-6.18 00/11] md/llbitmap: md/md-llbitmap: introduce a new lockless bitmap Yu Kuai
                   ` (9 preceding siblings ...)
  2025-08-26  8:52 ` [PATCH v6 md-6.18 10/11] md/md-bitmap: make method bitmap_ops->daemon_work optional Yu Kuai
@ 2025-08-26  8:52 ` Yu Kuai
  2025-08-26  9:52   ` Paul Menzel
                     ` (2 more replies)
  10 siblings, 3 replies; 19+ messages in thread
From: Yu Kuai @ 2025-08-26  8:52 UTC (permalink / raw)
  To: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3, yukuai1,
	yi.zhang, yangerkun, johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Redundant data is used to enhance data fault tolerance, and the storage
method for redundant data varies depending on the RAID level. It's
important to maintain the consistency of redundant data.

A bitmap is used to record which data blocks have been synchronized and
which ones need to be resynchronized or recovered. Each bit in the bitmap
represents a segment of data in the array. When a bit is set, it indicates
that the multiple redundant copies of that data segment may not be
consistent. Data synchronization can be performed based on the bitmap after
a power failure or after re-adding a disk. If there is no bitmap, a full
disk synchronization is required.

Key Features:

 - The IO fastpath is lockless: if the user issues lots of write IO to the
 same bitmap bit in a short time, only the first write has the additional
 overhead of updating the bitmap bit; there is no additional overhead for
 the following writes;
 - support resyncing or recovering only written data, meaning that when
 creating a new array or replacing a disk with a new one, there is no need
 to do a full disk resync/recovery;

Key Concept:

 - State Machine:

Each bit is one byte and can be in one of 6 different states, see
llbitmap_state. There are in total 8 different actions, see
llbitmap_action, that can change the state:

llbitmap state machine: transitions between states

|           | Startwrite | Startsync | Endsync | Abortsync|
| --------- | ---------- | --------- | ------- | -------  |
| Unwritten | Dirty      | x         | x       | x        |
| Clean     | Dirty      | x         | x       | x        |
| Dirty     | x          | x         | x       | x        |
| NeedSync  | x          | Syncing   | x       | x        |
| Syncing   | x          | Syncing   | Dirty   | NeedSync |

|           | Reload   | Daemon | Discard   | Stale     |
| --------- | -------- | ------ | --------- | --------- |
| Unwritten | x        | x      | x         | x         |
| Clean     | x        | x      | Unwritten | NeedSync  |
| Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
| NeedSync  | x        | x      | Unwritten | x         |
| Syncing   | NeedSync | x      | Unwritten | NeedSync  |
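
As a rough sketch, the core of llbitmap_state_machine() in this patch is a
plain table lookup, where BitNone means "no transition":

  enum llbitmap_state cur = llbitmap_read(llbitmap, pos);
  enum llbitmap_state next = state_machine[cur][action];

  if (next != BitNone)
          llbitmap_write(llbitmap, next, pos);  /* plain byte store */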

Typical scenarios:

1) Create a new array
All bits will be set to Unwritten by default; if --assume-clean is set,
all bits will be set to Clean instead.

2) write data; raid1/raid10 have a full copy of the data, while raid456
doesn't and relies on xor data

2.1) write new data to raid1/raid10:
Unwritten --StartWrite--> Dirty

2.2) write new data to raid456:
Unwritten --StartWrite--> NeedSync

Because the initial recovery for raid456 is skipped, the xor data is not
built yet; the bit must be set to NeedSync first, and after the lazy
initial recovery is finished, the bit will finally be set to Dirty (see
5.1 and 5.4);

2.3) overwrite existing data
Clean --StartWrite--> Dirty

3) daemon, if the array is not degraded:
Dirty --Daemon--> Clean

For a degraded array, Dirty bits will never be cleared; this prevents a
full disk recovery when re-adding a removed disk.

4) discard
{Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten

5) resync and recover

5.1) common process
NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean

5.2) resync after power failure
Dirty --Reload--> NeedSync

5.3) recover while replacing with a new disk
By default, the old bitmap framework will recover all data; llbitmap
implements this with a new helper, see llbitmap_skip_sync_blocks:

recovery is skipped for bits other than Dirty or Clean;

5.4) lazy initial recovery for raid5:
By default, the old bitmap framework only allows a new recovery when there
are spares (new disks); a new recovery flag, MD_RECOVERY_LAZY_RECOVER, is
added to perform raid456 lazy recovery for set bits (from 2.2).

Bitmap IO:

 - Chunksize

The default bitmap size is 128k, including a 1k bitmap super block, and
the default size of the data segment of the array covered by each bit (the
chunksize) is 64k. The chunksize is doubled repeatedly while the total
number of bits is not less than 127k (see llbitmap_init).
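
A minimal sketch of that sizing loop, assuming 'blocks' is the array size
in sectors and 'space' is the bitmap area in bytes:

  unsigned long chunksize = MIN_CHUNK_SIZE;     /* 64k, in sectors */
  unsigned long chunks = DIV_ROUND_UP(blocks, chunksize);

  while (chunks > space) {
          chunksize <<= 1;
          chunks = DIV_ROUND_UP(blocks, chunksize);
  }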

 - READ

When the bitmap is created, all pages will be allocated and read into
memory; there are no further reads afterwards.

 - WRITE

WRITE IO is divided into blocks of the array's logical_block_size, and the
dirty state of each block is tracked independently. For example, if each
page is 4k it contains 8 blocks, and each block is 512 bytes and contains
512 bits:

| page0 | page1 | ... | page 31 |
|       |
|        \-----------------------\
|                                |
| block0 | block1 | ... | block7 |
|        |
|         \-----------------\
|                            |
| bit0 | bit1 | ... | bit511 |

In the IO path, if one bit is changed to Dirty or NeedSync, the
corresponding subpage will be marked dirty, and that block must be written
out before the IO can be issued. This behaviour will affect IO performance;
to reduce the impact, if multiple bits in the same block are changed within
a short time, all bits in the block will be changed to Dirty/NeedSync, so
that there won't be any further overhead until the daemon clears the dirty
bits.
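
A minimal sketch of how a chunk index is mapped to a page, a byte within
the page, and the block that must be flushed (mirroring llbitmap_read()
and llbitmap_set_page_dirty() in this patch; io_size is the array's
logical block size):

  pos += BITMAP_DATA_OFFSET;            /* skip the 1k super block area */
  idx = pos >> PAGE_SHIFT;              /* which bitmap page */
  offset = offset_in_page(pos);         /* which byte (bit) in that page */
  block = offset / io_size;             /* which block must be written out */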

Dirty bits synchronization:

The IO fast path will set bits to dirty, and those dirty bits will be
cleared by the daemon after the IO is done. llbitmap_page_ctl is used to
synchronize between the IO path and the daemon;

IO path:
 1) try to grab a reference; if this succeeds, set the expire time 5s into
 the future and return;
 2) if grabbing a reference fails, wait for the daemon to finish clearing
 dirty bits;

Daemon (woken up every daemon_sleep seconds):
For each page:
 1) check if the page has expired; if not, skip this page; for an expired
 page:
 2) suspend the page and wait for inflight write IO to be done;
 3) change the dirty page to clean;
 4) resume the page;
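
A condensed sketch of that synchronization, following
llbitmap_raise_barrier() and llbitmap_suspend_timeout()/llbitmap_resume()
in this patch (the real daemon path uses a timeout), assuming one
percpu_ref per bitmap page:

  /* IO path */
  while (!percpu_ref_tryget_live(&pctl->active))
          wait_event(pctl->wait, !percpu_ref_is_dying(&pctl->active));
  WRITE_ONCE(pctl->expire, jiffies + llbitmap->barrier_idle * HZ);

  /* daemon, for each expired page */
  percpu_ref_kill(&pctl->active);                       /* suspend */
  wait_event(pctl->wait, percpu_ref_is_zero(&pctl->active));
  /* ... clear dirty bits and write the page ... */
  percpu_ref_resurrect(&pctl->active);                  /* resume */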

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 Documentation/admin-guide/md.rst |   20 +
 drivers/md/Kconfig               |   11 +
 drivers/md/Makefile              |    1 +
 drivers/md/md-bitmap.c           |    9 -
 drivers/md/md-bitmap.h           |   31 +-
 drivers/md/md-llbitmap.c         | 1600 ++++++++++++++++++++++++++++++
 drivers/md/md.c                  |    6 +
 drivers/md/md.h                  |    4 +-
 8 files changed, 1670 insertions(+), 12 deletions(-)
 create mode 100644 drivers/md/md-llbitmap.c

diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index 001363f81850..47d1347ccd00 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -387,6 +387,8 @@ All md devices contain:
          No bitmap
      bitmap
          The default internal bitmap
+     llbitmap
+         The lockless internal bitmap
 
 If bitmap_type is not none, then additional bitmap attributes bitmap/xxx or
 llbitmap/xxx will be created after md device KOBJ_CHANGE event.
@@ -447,6 +449,24 @@ If bitmap_type is bitmap, then the md device will also contain:
      once the array becomes non-degraded, and this fact has been
      recorded in the metadata.
 
+If bitmap_type is llbitmap, then the md device will also contain:
+
+  llbitmap/bits
+     This is readonly; it shows the status of the bitmap bits, as the
+     number of bits in each state.
+
+  llbitmap/metadata
+     This is readonly; it shows the bitmap metadata, including chunksize,
+     chunkshift, chunks, offset and daemon_sleep.
+
+  llbitmap/daemon_sleep
+     This is readwrite; the time in seconds after which the daemon
+     function will be triggered to clear dirty bits.
+
+  llbitmap/barrier_idle
+     This is readwrite; the time in seconds after which a page barrier is
+     considered idle, meaning the dirty bits in the page will be cleared.
+
 As component devices are added to an md array, they appear in the ``md``
 directory as new directories named::
 
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index f913579e731c..07c19b2182ca 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -52,6 +52,17 @@ config MD_BITMAP
 
 	  If unsure, say Y.
 
+config MD_LLBITMAP
+	bool "MD RAID lockless bitmap support"
+	depends on BLK_DEV_MD
+	help
+	  If you say Y here, support for the lockless write intent bitmap will
+	  be enabled.
+
+	  Note, this is an experimental feature.
+
+	  If unsure, say N.
+
 config MD_AUTODETECT
 	bool "Autodetect RAID arrays during kernel boot"
 	depends on BLK_DEV_MD=y
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 2e18147a9c40..5a51b3408b70 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -29,6 +29,7 @@ dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
 
 md-mod-y	+= md.o
 md-mod-$(CONFIG_MD_BITMAP)	+= md-bitmap.o
+md-mod-$(CONFIG_MD_LLBITMAP)	+= md-llbitmap.o
 raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
 linear-y       += md-linear.o
 
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index dc050ff94d5b..84b7e2af6dba 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -34,15 +34,6 @@
 #include "md-bitmap.h"
 #include "md-cluster.h"
 
-#define BITMAP_MAJOR_LO 3
-/* version 4 insists the bitmap is in little-endian order
- * with version 3, it is host-endian which is non-portable
- * Version 5 is currently set only for clustered devices
- */
-#define BITMAP_MAJOR_HI 4
-#define BITMAP_MAJOR_CLUSTERED 5
-#define	BITMAP_MAJOR_HOSTENDIAN 3
-
 /*
  * in-memory bitmap:
  *
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 5f41724cbcd8..b42a28fa83a0 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -9,10 +9,26 @@
 
 #define BITMAP_MAGIC 0x6d746962
 
+/*
+ * version 3 is host-endian order; it is deprecated and not used for new
+ * arrays
+ */
+#define BITMAP_MAJOR_LO		3
+#define BITMAP_MAJOR_HOSTENDIAN	3
+/* version 4 is little-endian order, the default value */
+#define BITMAP_MAJOR_HI		4
+/* version 5 is only used for cluster */
+#define BITMAP_MAJOR_CLUSTERED	5
+/* version 6 is only used for lockless bitmap */
+#define BITMAP_MAJOR_LOCKLESS	6
+
 /* use these for bitmap->flags and bitmap->sb->state bit-fields */
 enum bitmap_state {
-	BITMAP_STALE	   = 1,  /* the bitmap file is out of date or had -EIO */
+	BITMAP_STALE	   = 1, /* the bitmap file is out of date or had -EIO */
 	BITMAP_WRITE_ERROR = 2, /* A write error has occurred */
+	BITMAP_FIRST_USE   = 3, /* llbitmap is just created */
+	BITMAP_CLEAN       = 4, /* llbitmap is created with assume_clean */
+	BITMAP_DAEMON_BUSY = 5, /* llbitmap daemon is not finished after daemon_sleep */
 	BITMAP_HOSTENDIAN  =15,
 };
 
@@ -166,4 +182,17 @@ static inline void md_bitmap_exit(void)
 }
 #endif
 
+#ifdef CONFIG_MD_LLBITMAP
+int md_llbitmap_init(void);
+void md_llbitmap_exit(void);
+#else
+static inline int md_llbitmap_init(void)
+{
+	return 0;
+}
+static inline void md_llbitmap_exit(void)
+{
+}
+#endif
+
 #endif
diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
new file mode 100644
index 000000000000..88207f31c728
--- /dev/null
+++ b/drivers/md/md-llbitmap.c
@@ -0,0 +1,1600 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <linux/blkdev.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/timer.h>
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/file.h>
+#include <linux/seq_file.h>
+#include <trace/events/block.h>
+
+#include "md.h"
+#include "md-bitmap.h"
+
+/*
+ * #### Background
+ *
+ * Redundant data is used to enhance data fault tolerance, and the storage
+ * method for redundant data varies depending on the RAID level. It's
+ * important to maintain the consistency of redundant data.
+ *
+ * A bitmap is used to record which data blocks have been synchronized and
+ * which ones need to be resynchronized or recovered. Each bit in the bitmap
+ * represents a segment of data in the array. When a bit is set, it indicates
+ * that the multiple redundant copies of that data segment may not be
+ * consistent. Data synchronization can be performed based on the bitmap after
+ * a power failure or after re-adding a disk. If there is no bitmap, a full
+ * disk synchronization is required.
+ *
+ * #### Key Features
+ *
+ *  - The IO fastpath is lockless: if the user issues lots of write IO to
+ *  the same bitmap bit in a short time, only the first write has the
+ *  additional overhead of updating the bitmap bit; there is no additional
+ *  overhead for the following writes;
+ *  - support resyncing or recovering only written data, meaning that when
+ *  creating a new array or replacing a disk with a new one, there is no
+ *  need to do a full disk resync/recovery;
+ *
+ * #### Key Concept
+ *
+ * ##### State Machine
+ *
+ * Each bit is one byte and can be in one of 6 different states, see
+ * llbitmap_state. There are in total 8 different actions, see
+ * llbitmap_action, that can change the state:
+ *
+ * llbitmap state machine: transitions between states
+ *
+ * |           | Startwrite | Startsync | Endsync | Abortsync|
+ * | --------- | ---------- | --------- | ------- | -------  |
+ * | Unwritten | Dirty      | x         | x       | x        |
+ * | Clean     | Dirty      | x         | x       | x        |
+ * | Dirty     | x          | x         | x       | x        |
+ * | NeedSync  | x          | Syncing   | x       | x        |
+ * | Syncing   | x          | Syncing   | Dirty   | NeedSync |
+ *
+ * |           | Reload   | Daemon | Discard   | Stale     |
+ * | --------- | -------- | ------ | --------- | --------- |
+ * | Unwritten | x        | x      | x         | x         |
+ * | Clean     | x        | x      | Unwritten | NeedSync  |
+ * | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
+ * | NeedSync  | x        | x      | Unwritten | x         |
+ * | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
+ *
+ * Typical scenarios:
+ *
+ * 1) Create a new array
+ * All bits will be set to Unwritten by default; if --assume-clean is set,
+ * all bits will be set to Clean instead.
+ *
+ * 2) write data; raid1/raid10 have a full copy of the data, while raid456
+ * doesn't and relies on xor data
+ *
+ * 2.1) write new data to raid1/raid10:
+ * Unwritten --StartWrite--> Dirty
+ *
+ * 2.2) write new data to raid456:
+ * Unwritten --StartWrite--> NeedSync
+ *
+ * Because the initial recovery for raid456 is skipped, the xor data is not
+ * built yet; the bit must be set to NeedSync first, and after the lazy
+ * initial recovery is finished, the bit will finally be set to Dirty (see
+ * 5.1 and 5.4);
+ *
+ * 2.3) overwrite existing data
+ * Clean --StartWrite--> Dirty
+ *
+ * 3) daemon, if the array is not degraded:
+ * Dirty --Daemon--> Clean
+ *
+ * For a degraded array, Dirty bits will never be cleared; this prevents a
+ * full disk recovery when re-adding a removed disk.
+ *
+ * 4) discard
+ * {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
+ *
+ * 5) resync and recover
+ *
+ * 5.1) common process
+ * NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
+ *
+ * 5.2) resync after power failure
+ * Dirty --Reload--> NeedSync
+ *
+ * 5.3) recover while replacing with a new disk
+ * By default, the old bitmap framework will recover all data; llbitmap
+ * implements this with a new helper, see llbitmap_skip_sync_blocks:
+ *
+ * recovery is skipped for bits other than Dirty or Clean;
+ *
+ * 5.4) lazy initial recovery for raid5:
+ * By default, the old bitmap framework only allows a new recovery when
+ * there are spares (new disks); a new recovery flag, MD_RECOVERY_LAZY_RECOVER,
+ * is added to perform raid456 lazy recovery for set bits (from 2.2).
+ *
+ * ##### Bitmap IO
+ *
+ * ##### Chunksize
+ *
+ * The default bitmap size is 128k, including a 1k bitmap super block, and
+ * the default size of the data segment of the array covered by each bit
+ * (the chunksize) is 64k. The chunksize is doubled repeatedly while the
+ * total number of bits is not less than 127k (see llbitmap_init).
+ *
+ * ##### READ
+ *
+ * When the bitmap is created, all pages will be allocated and read into
+ * memory; there are no further reads afterwards.
+ *
+ * ##### WRITE
+ *
+ * WRITE IO is divided into blocks of the array's logical_block_size, and
+ * the dirty state of each block is tracked independently. For example, if
+ * each page is 4k it contains 8 blocks, and each block is 512 bytes and
+ * contains 512 bits:
+ *
+ * | page0 | page1 | ... | page 31 |
+ * |       |
+ * |        \-----------------------\
+ * |                                |
+ * | block0 | block1 | ... | block7 |
+ * |        |
+ * |         \-----------------\
+ * |                            |
+ * | bit0 | bit1 | ... | bit511 |
+ *
+ * In the IO path, if one bit is changed to Dirty or NeedSync, the
+ * corresponding subpage will be marked dirty, and that block must be
+ * written out before the IO can be issued. This behaviour will affect IO
+ * performance; to reduce the impact, if multiple bits in the same block
+ * are changed within a short time, all bits in the block will be changed
+ * to Dirty/NeedSync, so that there won't be any further overhead until
+ * the daemon clears the dirty bits.
+ *
+ * ##### Dirty bits synchronization
+ *
+ * The IO fast path will set bits to dirty, and those dirty bits will be
+ * cleared by the daemon after the IO is done. llbitmap_page_ctl is used to
+ * synchronize between the IO path and the daemon;
+ *
+ * IO path:
+ *  1) try to grab a reference; if this succeeds, set the expire time 5s
+ *  into the future and return;
+ *  2) if grabbing a reference fails, wait for the daemon to finish
+ *  clearing dirty bits;
+ *
+ * Daemon (woken up every daemon_sleep seconds):
+ * For each page:
+ *  1) check if the page has expired; if not, skip this page; for an
+ *  expired page:
+ *  2) suspend the page and wait for inflight write IO to be done;
+ *  3) change the dirty page to clean;
+ *  4) resume the page;
+ */
+
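+/* the first 1k of the bitmap area holds the bitmap super block */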
+#define BITMAP_DATA_OFFSET 1024
+
+/* 64k is the max IO size of sync IO for raid1/raid10 */
+#define MIN_CHUNK_SIZE (64 * 2)
+
+/* By default, the daemon will be woken up every 30s */
+#define DEFAULT_DAEMON_SLEEP 30
+
+/*
+ * Dirtied bits that have not been accessed for more than 5s will be cleared
+ * by the daemon.
+ */
+#define DEFAULT_BARRIER_IDLE 5
+
+enum llbitmap_state {
+	/* No valid data; the init state after assembling the array */
+	BitUnwritten = 0,
+	/* data is consistent */
+	BitClean,
+	/* data will be consistent after IO is done, set directly for writes */
+	BitDirty,
+	/*
+	 * data needs to be resynchronized:
+	 * 1) set directly for writes if the array is degraded, to prevent a
+	 * full disk synchronization after re-adding a disk;
+	 * 2) the array is reassembled after a power failure, and dirty bits
+	 * are found after reloading the bitmap;
+	 * 3) set on the first write for raid5, to build the initial xor data
+	 * lazily
+	 */
+	BitNeedSync,
+	/* data is synchronizing */
+	BitSyncing,
+	BitStateCount,
+	BitNone = 0xff,
+};
+
+enum llbitmap_action {
+	/* User write new data, this is the only action from IO fast path */
+	BitmapActionStartwrite = 0,
+	/* Start recovery */
+	BitmapActionStartsync,
+	/* Finish recovery */
+	BitmapActionEndsync,
+	/* Failed recovery */
+	BitmapActionAbortsync,
+	/* Reassemble the array */
+	BitmapActionReload,
+	/* Daemon thread is trying to clear dirty bits */
+	BitmapActionDaemon,
+	/* Data is deleted */
+	BitmapActionDiscard,
+	/*
+	 * Bitmap is stale: mark all bits other than BitUnwritten as
+	 * BitNeedSync.
+	 */
+	BitmapActionStale,
+	BitmapActionCount,
+	/* Init state is BitUnwritten */
+	BitmapActionInit,
+};
+
+enum llbitmap_page_state {
+	LLPageFlush = 0,
+	LLPageDirty,
+};
+
+struct llbitmap_page_ctl {
+	char *state;
+	struct page *page;
+	unsigned long expire;
+	unsigned long flags;
+	wait_queue_head_t wait;
+	struct percpu_ref active;
+	/* Per block size dirty state, maximum 64k page / 1 sector = 128 */
+	unsigned long dirty[];
+};
+
+struct llbitmap {
+	struct mddev *mddev;
+	struct llbitmap_page_ctl **pctl;
+
+	unsigned int nr_pages;
+	unsigned int io_size;
+	unsigned int blocks_per_page;
+
+	/* shift of one chunk */
+	unsigned long chunkshift;
+	/* size of one chunk in sector */
+	unsigned long chunksize;
+	/* total number of chunks */
+	unsigned long chunks;
+	unsigned long last_end_sync;
+	/*
+	 * time in seconds that dirty bits will be cleared if the page is not
+	 * accessed.
+	 */
+	unsigned long barrier_idle;
+	/* fires on first BitDirty state */
+	struct timer_list pending_timer;
+	struct work_struct daemon_work;
+
+	unsigned long flags;
+	__u64	events_cleared;
+
+	/* for slow disks */
+	atomic_t behind_writes;
+	wait_queue_head_t behind_wait;
+};
+
+struct llbitmap_unplug_work {
+	struct work_struct work;
+	struct llbitmap *llbitmap;
+	struct completion *done;
+};
+
+static struct workqueue_struct *md_llbitmap_io_wq;
+static struct workqueue_struct *md_llbitmap_unplug_wq;
+
+static char state_machine[BitStateCount][BitmapActionCount] = {
+	[BitUnwritten] = {
+		[BitmapActionStartwrite]	= BitDirty,
+		[BitmapActionStartsync]		= BitNone,
+		[BitmapActionEndsync]		= BitNone,
+		[BitmapActionAbortsync]		= BitNone,
+		[BitmapActionReload]		= BitNone,
+		[BitmapActionDaemon]		= BitNone,
+		[BitmapActionDiscard]		= BitNone,
+		[BitmapActionStale]		= BitNone,
+	},
+	[BitClean] = {
+		[BitmapActionStartwrite]	= BitDirty,
+		[BitmapActionStartsync]		= BitNone,
+		[BitmapActionEndsync]		= BitNone,
+		[BitmapActionAbortsync]		= BitNone,
+		[BitmapActionReload]		= BitNone,
+		[BitmapActionDaemon]		= BitNone,
+		[BitmapActionDiscard]		= BitUnwritten,
+		[BitmapActionStale]		= BitNeedSync,
+	},
+	[BitDirty] = {
+		[BitmapActionStartwrite]	= BitNone,
+		[BitmapActionStartsync]		= BitNone,
+		[BitmapActionEndsync]		= BitNone,
+		[BitmapActionAbortsync]		= BitNone,
+		[BitmapActionReload]		= BitNeedSync,
+		[BitmapActionDaemon]		= BitClean,
+		[BitmapActionDiscard]		= BitUnwritten,
+		[BitmapActionStale]		= BitNeedSync,
+	},
+	[BitNeedSync] = {
+		[BitmapActionStartwrite]	= BitNone,
+		[BitmapActionStartsync]		= BitSyncing,
+		[BitmapActionEndsync]		= BitNone,
+		[BitmapActionAbortsync]		= BitNone,
+		[BitmapActionReload]		= BitNone,
+		[BitmapActionDaemon]		= BitNone,
+		[BitmapActionDiscard]		= BitUnwritten,
+		[BitmapActionStale]		= BitNone,
+	},
+	[BitSyncing] = {
+		[BitmapActionStartwrite]	= BitNone,
+		[BitmapActionStartsync]		= BitSyncing,
+		[BitmapActionEndsync]		= BitDirty,
+		[BitmapActionAbortsync]		= BitNeedSync,
+		[BitmapActionReload]		= BitNeedSync,
+		[BitmapActionDaemon]		= BitNone,
+		[BitmapActionDiscard]		= BitUnwritten,
+		[BitmapActionStale]		= BitNeedSync,
+	},
+};
+
+static void __llbitmap_flush(struct mddev *mddev);
+
+static enum llbitmap_state llbitmap_read(struct llbitmap *llbitmap, loff_t pos)
+{
+	unsigned int idx;
+	unsigned int offset;
+
+	pos += BITMAP_DATA_OFFSET;
+	idx = pos >> PAGE_SHIFT;
+	offset = offset_in_page(pos);
+
+	return llbitmap->pctl[idx]->state[offset];
+}
+
+/* set all the bits in the subpage as dirty */
+static void llbitmap_infect_dirty_bits(struct llbitmap *llbitmap,
+				       struct llbitmap_page_ctl *pctl,
+				       unsigned int block)
+{
+	bool level_456 = raid_is_456(llbitmap->mddev);
+	unsigned int io_size = llbitmap->io_size;
+	int pos;
+
+	for (pos = block * io_size; pos < (block + 1) * io_size; pos++) {
+		switch (pctl->state[pos]) {
+		case BitUnwritten:
+			pctl->state[pos] = level_456 ? BitNeedSync : BitDirty;
+			break;
+		case BitClean:
+			pctl->state[pos] = BitDirty;
+			break;
+		}
+	}
+}
+
+static void llbitmap_set_page_dirty(struct llbitmap *llbitmap, int idx,
+				    int offset)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
+	unsigned int io_size = llbitmap->io_size;
+	int block = offset / io_size;
+	int pos;
+
+	if (!test_bit(LLPageDirty, &pctl->flags))
+		set_bit(LLPageDirty, &pctl->flags);
+
+	/*
+	 * The subpage usually contains a total of 512 bits. If any single bit
+	 * within the subpage is marked as dirty, the entire sector will be
+	 * written. To avoid impacting write performance, when multiple bits
+	 * within the same sector are modified within llbitmap->barrier_idle,
+	 * all bits in the sector will be collectively marked as dirty at once.
+	 */
+	if (test_and_set_bit(block, pctl->dirty)) {
+		llbitmap_infect_dirty_bits(llbitmap, pctl, block);
+		return;
+	}
+
+	for (pos = block * io_size; pos < (block + 1) * io_size; pos++) {
+		if (pos == offset)
+			continue;
+		if (pctl->state[pos] == BitDirty ||
+		    pctl->state[pos] == BitNeedSync) {
+			llbitmap_infect_dirty_bits(llbitmap, pctl, block);
+			return;
+		}
+	}
+}
+
+static void llbitmap_write(struct llbitmap *llbitmap, enum llbitmap_state state,
+			   loff_t pos)
+{
+	unsigned int idx;
+	unsigned int bit;
+
+	pos += BITMAP_DATA_OFFSET;
+	idx = pos >> PAGE_SHIFT;
+	bit = offset_in_page(pos);
+
+	llbitmap->pctl[idx]->state[bit] = state;
+	if (state == BitDirty || state == BitNeedSync)
+		llbitmap_set_page_dirty(llbitmap, idx, bit);
+}
+
+static struct page *llbitmap_read_page(struct llbitmap *llbitmap, int idx)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	struct page *page = NULL;
+	struct md_rdev *rdev;
+
+	if (llbitmap->pctl && llbitmap->pctl[idx])
+		page = llbitmap->pctl[idx]->page;
+	if (page)
+		return page;
+
+	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!page)
+		return ERR_PTR(-ENOMEM);
+
+	rdev_for_each(rdev, mddev) {
+		sector_t sector;
+
+		if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
+			continue;
+
+		sector = mddev->bitmap_info.offset +
+			 (idx << PAGE_SECTORS_SHIFT);
+
+		if (sync_page_io(rdev, sector, PAGE_SIZE, page, REQ_OP_READ,
+				 true))
+			return page;
+
+		md_error(mddev, rdev);
+	}
+
+	__free_page(page);
+	return ERR_PTR(-EIO);
+}
+
+static void llbitmap_write_page(struct llbitmap *llbitmap, int idx)
+{
+	struct page *page = llbitmap->pctl[idx]->page;
+	struct mddev *mddev = llbitmap->mddev;
+	struct md_rdev *rdev;
+	int block;
+
+	for (block = 0; block < llbitmap->blocks_per_page; block++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
+
+		if (!test_and_clear_bit(block, pctl->dirty))
+			continue;
+
+		rdev_for_each(rdev, mddev) {
+			sector_t sector;
+			sector_t bit_sector = llbitmap->io_size >> SECTOR_SHIFT;
+
+			if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
+				continue;
+
+			sector = mddev->bitmap_info.offset + rdev->sb_start +
+				 (idx << PAGE_SECTORS_SHIFT) +
+				 block * bit_sector;
+			md_write_metadata(mddev, rdev, sector,
+					  llbitmap->io_size, page,
+					  block * llbitmap->io_size);
+		}
+	}
+}
+
+static void active_release(struct percpu_ref *ref)
+{
+	struct llbitmap_page_ctl *pctl =
+		container_of(ref, struct llbitmap_page_ctl, active);
+
+	wake_up(&pctl->wait);
+}
+
+static void llbitmap_free_pages(struct llbitmap *llbitmap)
+{
+	int i;
+
+	if (!llbitmap->pctl)
+		return;
+
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
+
+		if (!pctl || !pctl->page)
+			break;
+
+		__free_page(pctl->page);
+		percpu_ref_exit(&pctl->active);
+	}
+
+	kfree(llbitmap->pctl[0]);
+	kfree(llbitmap->pctl);
+	llbitmap->pctl = NULL;
+}
+
+static int llbitmap_cache_pages(struct llbitmap *llbitmap)
+{
+	struct llbitmap_page_ctl *pctl;
+	unsigned int nr_pages = DIV_ROUND_UP(llbitmap->chunks +
+					     BITMAP_DATA_OFFSET, PAGE_SIZE);
+	unsigned int size = struct_size(pctl, dirty, BITS_TO_LONGS(
+						llbitmap->blocks_per_page));
+	int i;
+
+	llbitmap->pctl = kmalloc_array(nr_pages, sizeof(void *),
+				       GFP_KERNEL | __GFP_ZERO);
+	if (!llbitmap->pctl)
+		return -ENOMEM;
+
+	size = round_up(size, cache_line_size());
+	pctl = kmalloc_array(nr_pages, size, GFP_KERNEL | __GFP_ZERO);
+	if (!pctl) {
+		kfree(llbitmap->pctl);
+		return -ENOMEM;
+	}
+
+	llbitmap->nr_pages = nr_pages;
+
+	for (i = 0; i < nr_pages; i++, pctl = (void *)pctl + size) {
+		struct page *page = llbitmap_read_page(llbitmap, i);
+
+		llbitmap->pctl[i] = pctl;
+
+		if (IS_ERR(page)) {
+			llbitmap_free_pages(llbitmap);
+			return PTR_ERR(page);
+		}
+
+		if (percpu_ref_init(&pctl->active, active_release,
+				    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
+			__free_page(page);
+			llbitmap_free_pages(llbitmap);
+			return -ENOMEM;
+		}
+
+		pctl->page = page;
+		pctl->state = page_address(page);
+		init_waitqueue_head(&pctl->wait);
+	}
+
+	return 0;
+}
+
+static void llbitmap_init_state(struct llbitmap *llbitmap)
+{
+	enum llbitmap_state state = BitUnwritten;
+	unsigned long i;
+
+	if (test_and_clear_bit(BITMAP_CLEAN, &llbitmap->flags))
+		state = BitClean;
+
+	for (i = 0; i < llbitmap->chunks; i++)
+		llbitmap_write(llbitmap, state, i);
+}
+
+/* The return value is only used from resync, where @start == @end. */
+static enum llbitmap_state llbitmap_state_machine(struct llbitmap *llbitmap,
+						  unsigned long start,
+						  unsigned long end,
+						  enum llbitmap_action action)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	enum llbitmap_state state = BitNone;
+	bool level_456 = raid_is_456(llbitmap->mddev);
+	bool need_resync = false;
+	bool need_recovery = false;
+
+	if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags))
+		return BitNone;
+
+	if (action == BitmapActionInit) {
+		llbitmap_init_state(llbitmap);
+		return BitNone;
+	}
+
+	while (start <= end) {
+		enum llbitmap_state c = llbitmap_read(llbitmap, start);
+
+		if (c < 0 || c >= BitStateCount) {
+			pr_err("%s: invalid bit %lu state %d action %d, forcing resync\n",
+			       __func__, start, c, action);
+			state = BitNeedSync;
+			goto write_bitmap;
+		}
+
+		if (c == BitNeedSync)
+			need_resync = true;
+
+		state = state_machine[c][action];
+		if (state == BitNone) {
+			start++;
+			continue;
+		}
+
+write_bitmap:
+		/* Delay raid456 initial recovery to first write. */
+		if (c == BitUnwritten && state == BitDirty &&
+		    action == BitmapActionStartwrite && level_456) {
+			state = BitNeedSync;
+			need_recovery = true;
+		}
+
+		llbitmap_write(llbitmap, state, start);
+
+		if (state == BitNeedSync)
+			need_resync = true;
+		else if (state == BitDirty &&
+			 !timer_pending(&llbitmap->pending_timer))
+			mod_timer(&llbitmap->pending_timer,
+				  jiffies + mddev->bitmap_info.daemon_sleep * HZ);
+
+		start++;
+	}
+
+	if (need_resync && level_456)
+		need_recovery = true;
+
+	if (need_recovery) {
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		set_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
+	} else if (need_resync) {
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
+	}
+
+	return state;
+}
+
+static void llbitmap_raise_barrier(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+retry:
+	if (likely(percpu_ref_tryget_live(&pctl->active))) {
+		WRITE_ONCE(pctl->expire, jiffies + llbitmap->barrier_idle * HZ);
+		return;
+	}
+
+	wait_event(pctl->wait, !percpu_ref_is_dying(&pctl->active));
+	goto retry;
+}
+
+static void llbitmap_release_barrier(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+	percpu_ref_put(&pctl->active);
+}
+
+static int llbitmap_suspend_timeout(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+	percpu_ref_kill(&pctl->active);
+
+	if (!wait_event_timeout(pctl->wait, percpu_ref_is_zero(&pctl->active),
+			llbitmap->mddev->bitmap_info.daemon_sleep * HZ))
+		return -ETIMEDOUT;
+
+	return 0;
+}
+
+static void llbitmap_resume(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+	pctl->expire = LONG_MAX;
+	percpu_ref_resurrect(&pctl->active);
+	wake_up(&pctl->wait);
+}
+
+static int llbitmap_check_support(struct mddev *mddev)
+{
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
+		pr_notice("md/llbitmap: %s: array with journal cannot have bitmap\n",
+			  mdname(mddev));
+		return -EBUSY;
+	}
+
+	if (mddev->bitmap_info.space == 0) {
+		if (mddev->bitmap_info.default_space == 0) {
+			pr_notice("md/llbitmap: %s: no space for bitmap\n",
+				  mdname(mddev));
+			return -ENOSPC;
+		}
+	}
+
+	if (!mddev->persistent) {
+		pr_notice("md/llbitmap: %s: array must be persistent\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	if (mddev->bitmap_info.file) {
+		pr_notice("md/llbitmap: %s: doesn't support bitmap file\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	if (mddev->bitmap_info.external) {
+		pr_notice("md/llbitmap: %s: doesn't support external metadata\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	if (mddev_is_dm(mddev)) {
+		pr_notice("md/llbitmap: %s: doesn't support dm-raid\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static int llbitmap_init(struct llbitmap *llbitmap)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	sector_t blocks = mddev->resync_max_sectors;
+	unsigned long chunksize = MIN_CHUNK_SIZE;
+	unsigned long chunks = DIV_ROUND_UP(blocks, chunksize);
+	unsigned long space = mddev->bitmap_info.space << SECTOR_SHIFT;
+	int ret;
+
+	while (chunks > space) {
+		chunksize = chunksize << 1;
+		chunks = DIV_ROUND_UP(blocks, chunksize);
+	}
+
+	llbitmap->barrier_idle = DEFAULT_BARRIER_IDLE;
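+	/* ffz(~x) == ilog2(x) for a power-of-2 x */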
+	llbitmap->chunkshift = ffz(~chunksize);
+	llbitmap->chunksize = chunksize;
+	llbitmap->chunks = chunks;
+	mddev->bitmap_info.daemon_sleep = DEFAULT_DAEMON_SLEEP;
+
+	ret = llbitmap_cache_pages(llbitmap);
+	if (ret)
+		return ret;
+
+	llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1,
+			       BitmapActionInit);
+	/* flush initial llbitmap to disk */
+	__llbitmap_flush(mddev);
+
+	return 0;
+}
+
+static int llbitmap_read_sb(struct llbitmap *llbitmap)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	unsigned long daemon_sleep;
+	unsigned long chunksize;
+	unsigned long events;
+	struct page *sb_page;
+	bitmap_super_t *sb;
+	int ret = -EINVAL;
+
+	if (!mddev->bitmap_info.offset) {
+		pr_err("md/llbitmap: %s: no super block found", mdname(mddev));
+		return -EINVAL;
+	}
+
+	sb_page = llbitmap_read_page(llbitmap, 0);
+	if (IS_ERR(sb_page)) {
+		pr_err("md/llbitmap: %s: read super block failed",
+		       mdname(mddev));
+		return -EIO;
+	}
+
+	sb = kmap_local_page(sb_page);
+	if (sb->magic != cpu_to_le32(BITMAP_MAGIC)) {
+		pr_err("md/llbitmap: %s: invalid super block magic number",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (sb->version != cpu_to_le32(BITMAP_MAJOR_LOCKLESS)) {
+		pr_err("md/llbitmap: %s: invalid super block version",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (memcmp(sb->uuid, mddev->uuid, 16)) {
+		pr_err("md/llbitmap: %s: bitmap superblock UUID mismatch\n",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (mddev->bitmap_info.space == 0) {
+		int room = le32_to_cpu(sb->sectors_reserved);
+
+		if (room)
+			mddev->bitmap_info.space = room;
+		else
+			mddev->bitmap_info.space = mddev->bitmap_info.default_space;
+	}
+	llbitmap->flags = le32_to_cpu(sb->state);
+	if (test_and_clear_bit(BITMAP_FIRST_USE, &llbitmap->flags)) {
+		ret = llbitmap_init(llbitmap);
+		goto out_put_page;
+	}
+
+	chunksize = le32_to_cpu(sb->chunksize);
+	if (!is_power_of_2(chunksize)) {
+		pr_err("md/llbitmap: %s: chunksize not a power of 2",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (chunksize < DIV_ROUND_UP(mddev->resync_max_sectors,
+				     mddev->bitmap_info.space << SECTOR_SHIFT)) {
+		pr_err("md/llbitmap: %s: chunksize too small %lu < %llu / %lu",
+		       mdname(mddev), chunksize, mddev->resync_max_sectors,
+		       mddev->bitmap_info.space);
+		goto out_put_page;
+	}
+
+	daemon_sleep = le32_to_cpu(sb->daemon_sleep);
+	if (daemon_sleep < 1 || daemon_sleep > MAX_SCHEDULE_TIMEOUT / HZ) {
+		pr_err("md/llbitmap: %s: daemon sleep %lu period out of range",
+		       mdname(mddev), daemon_sleep);
+		goto out_put_page;
+	}
+
+	events = le64_to_cpu(sb->events);
+	if (events < mddev->events) {
+		pr_warn("md/llbitmap :%s: bitmap file is out of date (%lu < %llu) -- forcing full recovery",
+			mdname(mddev), events, mddev->events);
+		set_bit(BITMAP_STALE, &llbitmap->flags);
+	}
+
+	sb->sync_size = cpu_to_le64(mddev->resync_max_sectors);
+	mddev->bitmap_info.chunksize = chunksize;
+	mddev->bitmap_info.daemon_sleep = daemon_sleep;
+
+	llbitmap->barrier_idle = DEFAULT_BARRIER_IDLE;
+	llbitmap->chunksize = chunksize;
+	llbitmap->chunks = DIV_ROUND_UP(mddev->resync_max_sectors, chunksize);
+	llbitmap->chunkshift = ffz(~chunksize);
+	ret = llbitmap_cache_pages(llbitmap);
+
+out_put_page:
+	kunmap_local(sb);
+	__free_page(sb_page);
+	return ret;
+}
+
+static void llbitmap_pending_timer_fn(struct timer_list *pending_timer)
+{
+	struct llbitmap *llbitmap =
+		container_of(pending_timer, struct llbitmap, pending_timer);
+
+	if (work_busy(&llbitmap->daemon_work)) {
+		pr_warn("md/llbitmap: %s daemon_work not finished in %lu seconds\n",
+			mdname(llbitmap->mddev),
+			llbitmap->mddev->bitmap_info.daemon_sleep);
+		set_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags);
+		return;
+	}
+
+	queue_work(md_llbitmap_io_wq, &llbitmap->daemon_work);
+}
+
+static void md_llbitmap_daemon_fn(struct work_struct *work)
+{
+	struct llbitmap *llbitmap =
+		container_of(work, struct llbitmap, daemon_work);
+	unsigned long start;
+	unsigned long end;
+	bool restart;
+	int idx;
+
+	if (llbitmap->mddev->degraded)
+		return;
+
+retry:
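+	/* page 0 also holds the 1k super block, so it covers fewer chunks */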
+	start = 0;
+	end = min(llbitmap->chunks, PAGE_SIZE - BITMAP_DATA_OFFSET) - 1;
+	restart = false;
+
+	for (idx = 0; idx < llbitmap->nr_pages; idx++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
+
+		if (idx > 0) {
+			start = end + 1;
+			end = min(end + PAGE_SIZE, llbitmap->chunks - 1);
+		}
+
+		if (!test_bit(LLPageFlush, &pctl->flags) &&
+		    time_before(jiffies, pctl->expire)) {
+			restart = true;
+			continue;
+		}
+
+		if (llbitmap_suspend_timeout(llbitmap, idx) < 0) {
+			pr_warn("md/llbitmap: %s: %s waiting for page %d timeout\n",
+				mdname(llbitmap->mddev), __func__, idx);
+			continue;
+		}
+
+		llbitmap_state_machine(llbitmap, start, end, BitmapActionDaemon);
+		llbitmap_resume(llbitmap, idx);
+	}
+
+	/*
+	 * If the daemon took a long time to finish, retry to prevent missing
+	 * clearing dirty bits.
+	 */
+	if (test_and_clear_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags))
+		goto retry;
+
+	/* If some page is dirty but not expired, setup timer again */
+	if (restart)
+		mod_timer(&llbitmap->pending_timer,
+			  jiffies + llbitmap->mddev->bitmap_info.daemon_sleep * HZ);
+}
+
+static int llbitmap_create(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap;
+	int ret;
+
+	ret = llbitmap_check_support(mddev);
+	if (ret)
+		return ret;
+
+	llbitmap = kzalloc(sizeof(*llbitmap), GFP_KERNEL);
+	if (!llbitmap)
+		return -ENOMEM;
+
+	llbitmap->mddev = mddev;
+	llbitmap->io_size = bdev_logical_block_size(mddev->gendisk->part0);
+	llbitmap->blocks_per_page = PAGE_SIZE / llbitmap->io_size;
+
+	timer_setup(&llbitmap->pending_timer, llbitmap_pending_timer_fn, 0);
+	INIT_WORK(&llbitmap->daemon_work, md_llbitmap_daemon_fn);
+	atomic_set(&llbitmap->behind_writes, 0);
+	init_waitqueue_head(&llbitmap->behind_wait);
+
+	mutex_lock(&mddev->bitmap_info.mutex);
+	mddev->bitmap = llbitmap;
+	ret = llbitmap_read_sb(llbitmap);
+	mutex_unlock(&mddev->bitmap_info.mutex);
+	if (ret) {
+		kfree(llbitmap);
+		mddev->bitmap = NULL;
+	}
+
+	return ret;
+}
+
+static int llbitmap_resize(struct mddev *mddev, sector_t blocks, int chunksize)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long chunks;
+
+	if (chunksize == 0)
+		chunksize = llbitmap->chunksize;
+
+	/* If there is enough space, leave the chunksize unchanged. */
+	chunks = DIV_ROUND_UP(blocks, chunksize);
+	while (chunks > mddev->bitmap_info.space << SECTOR_SHIFT) {
+		chunksize = chunksize << 1;
+		chunks = DIV_ROUND_UP(blocks, chunksize);
+	}
+
+	llbitmap->chunkshift = ffz(~chunksize);
+	llbitmap->chunksize = chunksize;
+	llbitmap->chunks = chunks;
+
+	return 0;
+}
+
+static int llbitmap_load(struct mddev *mddev)
+{
+	enum llbitmap_action action = BitmapActionReload;
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (test_and_clear_bit(BITMAP_STALE, &llbitmap->flags))
+		action = BitmapActionStale;
+
+	llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, action);
+	return 0;
+}
+
+static void llbitmap_destroy(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (!llbitmap)
+		return;
+
+	mutex_lock(&mddev->bitmap_info.mutex);
+
+	timer_delete_sync(&llbitmap->pending_timer);
+	flush_workqueue(md_llbitmap_io_wq);
+	flush_workqueue(md_llbitmap_unplug_wq);
+
+	mddev->bitmap = NULL;
+	llbitmap_free_pages(llbitmap);
+	kfree(llbitmap);
+	mutex_unlock(&mddev->bitmap_info.mutex);
+}
+
+static void llbitmap_start_write(struct mddev *mddev, sector_t offset,
+				 unsigned long sectors)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long start = offset >> llbitmap->chunkshift;
+	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
+	int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+	int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+
+	llbitmap_state_machine(llbitmap, start, end, BitmapActionStartwrite);
+
+	while (page_start <= page_end) {
+		llbitmap_raise_barrier(llbitmap, page_start);
+		page_start++;
+	}
+}
+
+static void llbitmap_end_write(struct mddev *mddev, sector_t offset,
+			       unsigned long sectors)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long start = offset >> llbitmap->chunkshift;
+	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
+	int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+	int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+
+	while (page_start <= page_end) {
+		llbitmap_release_barrier(llbitmap, page_start);
+		page_start++;
+	}
+}
+
+static void llbitmap_start_discard(struct mddev *mddev, sector_t offset,
+				   unsigned long sectors)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long start = DIV_ROUND_UP(offset, llbitmap->chunksize);
+	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
+	int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+	int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+
+	llbitmap_state_machine(llbitmap, start, end, BitmapActionDiscard);
+
+	while (page_start <= page_end) {
+		llbitmap_raise_barrier(llbitmap, page_start);
+		page_start++;
+	}
+}
+
+static void llbitmap_end_discard(struct mddev *mddev, sector_t offset,
+				 unsigned long sectors)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long start = DIV_ROUND_UP(offset, llbitmap->chunksize);
+	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
+	int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+	int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+
+	while (page_start <= page_end) {
+		llbitmap_release_barrier(llbitmap, page_start);
+		page_start++;
+	}
+}
+
+static void llbitmap_unplug_fn(struct work_struct *work)
+{
+	struct llbitmap_unplug_work *unplug_work =
+		container_of(work, struct llbitmap_unplug_work, work);
+	struct llbitmap *llbitmap = unplug_work->llbitmap;
+	struct blk_plug plug;
+	int i;
+
+	blk_start_plug(&plug);
+
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		if (!test_bit(LLPageDirty, &llbitmap->pctl[i]->flags) ||
+		    !test_and_clear_bit(LLPageDirty, &llbitmap->pctl[i]->flags))
+			continue;
+
+		llbitmap_write_page(llbitmap, i);
+	}
+
+	blk_finish_plug(&plug);
+	md_super_wait(llbitmap->mddev);
+	complete(unplug_work->done);
+}
+
+static bool llbitmap_dirty(struct llbitmap *llbitmap)
+{
+	int i;
+
+	for (i = 0; i < llbitmap->nr_pages; i++)
+		if (test_bit(LLPageDirty, &llbitmap->pctl[i]->flags))
+			return true;
+
+	return false;
+}
+
+static void llbitmap_unplug(struct mddev *mddev, bool sync)
+{
+	DECLARE_COMPLETION_ONSTACK(done);
+	struct llbitmap *llbitmap = mddev->bitmap;
+	struct llbitmap_unplug_work unplug_work = {
+		.llbitmap = llbitmap,
+		.done = &done,
+	};
+
+	if (!llbitmap_dirty(llbitmap))
+		return;
+
+	/*
+	 * Issuing new bitmap IO under the submit_bio() context will deadlock:
+	 *  - the bio will wait for bitmap bio to be done, before it can be
+	 *  issued;
+	 *  - bitmap bio will be added to current->bio_list and wait for this
+	 *  bio to be issued;
+	 */
+	INIT_WORK_ONSTACK(&unplug_work.work, llbitmap_unplug_fn);
+	queue_work(md_llbitmap_unplug_wq, &unplug_work.work);
+	wait_for_completion(&done);
+	destroy_work_on_stack(&unplug_work.work);
+}
+
+/*
+ * Force writing all bitmap pages to disk; called when stopping the array, or
+ * every daemon_sleep seconds while the sync_thread is running.
+ */
+static void __llbitmap_flush(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	struct blk_plug plug;
+	int i;
+
+	blk_start_plug(&plug);
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
+
+		/* mark all blocks as dirty */
+		set_bit(LLPageDirty, &pctl->flags);
+		bitmap_fill(pctl->dirty, llbitmap->blocks_per_page);
+		llbitmap_write_page(llbitmap, i);
+	}
+	blk_finish_plug(&plug);
+	md_super_wait(llbitmap->mddev);
+}
+
+static void llbitmap_flush(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	int i;
+
+	for (i = 0; i < llbitmap->nr_pages; i++)
+		set_bit(LLPageFlush, &llbitmap->pctl[i]->flags);
+
+	timer_delete_sync(&llbitmap->pending_timer);
+	queue_work(md_llbitmap_io_wq, &llbitmap->daemon_work);
+	flush_work(&llbitmap->daemon_work);
+
+	__llbitmap_flush(mddev);
+}
+
+/* This is used for raid5 lazy initial recovery */
+static bool llbitmap_blocks_synced(struct mddev *mddev, sector_t offset)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long p = offset >> llbitmap->chunkshift;
+	enum llbitmap_state c = llbitmap_read(llbitmap, p);
+
+	return c == BitClean || c == BitDirty;
+}
+
+static sector_t llbitmap_skip_sync_blocks(struct mddev *mddev, sector_t offset)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long p = offset >> llbitmap->chunkshift;
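+	/* number of sectors from @offset to the end of this chunk */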
+	int blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1));
+	enum llbitmap_state c = llbitmap_read(llbitmap, p);
+
+	/* always skip unwritten blocks */
+	if (c == BitUnwritten)
+		return blocks;
+
+	/* For resync also skip clean/dirty blocks */
+	if ((c == BitClean || c == BitDirty) &&
+	    test_bit(MD_RECOVERY_SYNC, &mddev->recovery) &&
+	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
+		return blocks;
+
+	return 0;
+}
+
+static bool llbitmap_start_sync(struct mddev *mddev, sector_t offset,
+				sector_t *blocks, bool degraded)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long p = offset >> llbitmap->chunkshift;
+
+	/*
+	 * Handle one bit at a time; this is much simpler, and it doesn't
+	 * matter if md_do_sync() loops a few more times.
+	 */
+	*blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1));
+	return llbitmap_state_machine(llbitmap, p, p,
+				      BitmapActionStartsync) == BitSyncing;
+}
+
+/* Something is wrong; sync_thread stopped at @offset */
+static void llbitmap_end_sync(struct mddev *mddev, sector_t offset,
+			      sector_t *blocks)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long p = offset >> llbitmap->chunkshift;
+
+	*blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1));
+	llbitmap_state_machine(llbitmap, p, llbitmap->chunks - 1,
+			       BitmapActionAbortsync);
+}
+
+/* A full sync_thread is finished */
+static void llbitmap_close_sync(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	int i;
+
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
+
+		/* let daemon_fn clear dirty bits immediately */
+		WRITE_ONCE(pctl->expire, jiffies);
+	}
+
+	llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1,
+			       BitmapActionEndsync);
+}
+
+/*
+ * sync_thread has reached @sector; update metadata every daemon_sleep seconds,
+ * just in case sync_thread has to restart after a power failure.
+ */
+static void llbitmap_cond_end_sync(struct mddev *mddev, sector_t sector,
+				   bool force)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (sector == 0) {
+		llbitmap->last_end_sync = jiffies;
+		return;
+	}
+
+	if (time_before(jiffies, llbitmap->last_end_sync +
+				 HZ * mddev->bitmap_info.daemon_sleep))
+		return;
+
+	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
+
+	mddev->curr_resync_completed = sector;
+	set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+	llbitmap_state_machine(llbitmap, 0, sector >> llbitmap->chunkshift,
+			       BitmapActionEndsync);
+	__llbitmap_flush(mddev);
+
+	llbitmap->last_end_sync = jiffies;
+	sysfs_notify_dirent_safe(mddev->sysfs_completed);
+}
+
+static bool llbitmap_enabled(void *data, bool flush)
+{
+	struct llbitmap *llbitmap = data;
+
+	return llbitmap && !test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
+}
+
+static void llbitmap_dirty_bits(struct mddev *mddev, unsigned long s,
+				unsigned long e)
+{
+	llbitmap_state_machine(mddev->bitmap, s, e, BitmapActionStartwrite);
+}
+
+static void llbitmap_write_sb(struct llbitmap *llbitmap)
+{
+	int nr_blocks = DIV_ROUND_UP(BITMAP_DATA_OFFSET, llbitmap->io_size);
+
+	bitmap_fill(llbitmap->pctl[0]->dirty, nr_blocks);
+	llbitmap_write_page(llbitmap, 0);
+	md_super_wait(llbitmap->mddev);
+}
+
+static void llbitmap_update_sb(void *data)
+{
+	struct llbitmap *llbitmap = data;
+	struct mddev *mddev = llbitmap->mddev;
+	struct page *sb_page;
+	bitmap_super_t *sb;
+
+	if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags))
+		return;
+
+	sb_page = llbitmap_read_page(llbitmap, 0);
+	if (IS_ERR(sb_page)) {
+		pr_err("%s: %s: read super block failed", __func__,
+		       mdname(mddev));
+		set_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
+		return;
+	}
+
+	if (mddev->events < llbitmap->events_cleared)
+		llbitmap->events_cleared = mddev->events;
+
+	sb = kmap_local_page(sb_page);
+	sb->events = cpu_to_le64(mddev->events);
+	sb->state = cpu_to_le32(llbitmap->flags);
+	sb->chunksize = cpu_to_le32(llbitmap->chunksize);
+	sb->sync_size = cpu_to_le64(mddev->resync_max_sectors);
+	sb->events_cleared = cpu_to_le64(llbitmap->events_cleared);
+	sb->sectors_reserved = cpu_to_le32(mddev->bitmap_info.space);
+	sb->daemon_sleep = cpu_to_le32(mddev->bitmap_info.daemon_sleep);
+
+	kunmap_local(sb);
+	llbitmap_write_sb(llbitmap);
+}
+
+static int llbitmap_get_stats(void *data, struct md_bitmap_stats *stats)
+{
+	struct llbitmap *llbitmap = data;
+
+	memset(stats, 0, sizeof(*stats));
+
+	stats->missing_pages = 0;
+	stats->pages = llbitmap->nr_pages;
+	stats->file_pages = llbitmap->nr_pages;
+
+	stats->behind_writes = atomic_read(&llbitmap->behind_writes);
+	stats->behind_wait = wq_has_sleeper(&llbitmap->behind_wait);
+	stats->events_cleared = llbitmap->events_cleared;
+
+	return 0;
+}
+
+/* just flag all pages as needing to be written */
+static void llbitmap_write_all(struct mddev *mddev)
+{
+	int i;
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
+
+		set_bit(LLPageDirty, &pctl->flags);
+		bitmap_fill(pctl->dirty, llbitmap->blocks_per_page);
+	}
+}
+
+static void llbitmap_start_behind_write(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	atomic_inc(&llbitmap->behind_writes);
+}
+
+static void llbitmap_end_behind_write(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (atomic_dec_and_test(&llbitmap->behind_writes))
+		wake_up(&llbitmap->behind_wait);
+}
+
+static void llbitmap_wait_behind_writes(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (!llbitmap)
+		return;
+
+	wait_event(llbitmap->behind_wait,
+		   atomic_read(&llbitmap->behind_writes) == 0);
+}
+
+static ssize_t bits_show(struct mddev *mddev, char *page)
+{
+	struct llbitmap *llbitmap;
+	int bits[BitStateCount] = {0};
+	loff_t start = 0;
+
+	mutex_lock(&mddev->bitmap_info.mutex);
+	llbitmap = mddev->bitmap;
+	if (!llbitmap || !llbitmap->pctl) {
+		mutex_unlock(&mddev->bitmap_info.mutex);
+		return sprintf(page, "no bitmap\n");
+	}
+
+	if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags)) {
+		mutex_unlock(&mddev->bitmap_info.mutex);
+		return sprintf(page, "bitmap io error\n");
+	}
+
+	while (start < llbitmap->chunks) {
+		enum llbitmap_state c = llbitmap_read(llbitmap, start);
+
+		if (c < 0 || c >= BitStateCount)
+			pr_err("%s: invalid bit %llu state %d\n",
+			       __func__, start, c);
+		else
+			bits[c]++;
+		start++;
+	}
+
+	mutex_unlock(&mddev->bitmap_info.mutex);
+	return sprintf(page, "unwritten %d\nclean %d\ndirty %d\nneed sync %d\nsyncing %d\n",
+		       bits[BitUnwritten], bits[BitClean], bits[BitDirty],
+		       bits[BitNeedSync], bits[BitSyncing]);
+}
+
+static struct md_sysfs_entry llbitmap_bits =
+__ATTR_RO(bits);
+
+static ssize_t metadata_show(struct mddev *mddev, char *page)
+{
+	struct llbitmap *llbitmap;
+	ssize_t ret;
+
+	mutex_lock(&mddev->bitmap_info.mutex);
+	llbitmap = mddev->bitmap;
+	if (!llbitmap) {
+		mutex_unlock(&mddev->bitmap_info.mutex);
+		return sprintf(page, "no bitmap\n");
+	}
+
+	ret = sprintf(page, "chunksize %lu\nchunkshift %lu\nchunks %lu\noffset %llu\ndaemon_sleep %lu\n",
+		       llbitmap->chunksize, llbitmap->chunkshift,
+		       llbitmap->chunks, mddev->bitmap_info.offset,
+		       llbitmap->mddev->bitmap_info.daemon_sleep);
+	mutex_unlock(&mddev->bitmap_info.mutex);
+
+	return ret;
+}
+
+static struct md_sysfs_entry llbitmap_metadata =
+__ATTR_RO(metadata);
+
+static ssize_t
+daemon_sleep_show(struct mddev *mddev, char *page)
+{
+	return sprintf(page, "%lu\n", mddev->bitmap_info.daemon_sleep);
+}
+
+static ssize_t
+daemon_sleep_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	unsigned long timeout;
+	int rv = kstrtoul(buf, 10, &timeout);
+
+	if (rv)
+		return rv;
+
+	mddev->bitmap_info.daemon_sleep = timeout;
+	return len;
+}
+
+static struct md_sysfs_entry llbitmap_daemon_sleep =
+__ATTR_RW(daemon_sleep);
+
+static ssize_t
+barrier_idle_show(struct mddev *mddev, char *page)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	return sprintf(page, "%lu\n", llbitmap->barrier_idle);
+}
+
+static ssize_t
+barrier_idle_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long timeout;
+	int rv = kstrtoul(buf, 10, &timeout);
+
+	if (rv)
+		return rv;
+
+	llbitmap->barrier_idle = timeout;
+	return len;
+}
+
+static struct md_sysfs_entry llbitmap_barrier_idle =
+__ATTR_RW(barrier_idle);
+
+static struct attribute *md_llbitmap_attrs[] = {
+	&llbitmap_bits.attr,
+	&llbitmap_metadata.attr,
+	&llbitmap_daemon_sleep.attr,
+	&llbitmap_barrier_idle.attr,
+	NULL
+};
+
+static struct attribute_group md_llbitmap_group = {
+	.name = "llbitmap",
+	.attrs = md_llbitmap_attrs,
+};
+
+static struct bitmap_operations llbitmap_ops = {
+	.head = {
+		.type	= MD_BITMAP,
+		.id	= ID_LLBITMAP,
+		.name	= "llbitmap",
+	},
+
+	.enabled		= llbitmap_enabled,
+	.create			= llbitmap_create,
+	.resize			= llbitmap_resize,
+	.load			= llbitmap_load,
+	.destroy		= llbitmap_destroy,
+
+	.start_write		= llbitmap_start_write,
+	.end_write		= llbitmap_end_write,
+	.start_discard		= llbitmap_start_discard,
+	.end_discard		= llbitmap_end_discard,
+	.unplug			= llbitmap_unplug,
+	.flush			= llbitmap_flush,
+
+	.start_behind_write	= llbitmap_start_behind_write,
+	.end_behind_write	= llbitmap_end_behind_write,
+	.wait_behind_writes	= llbitmap_wait_behind_writes,
+
+	.blocks_synced		= llbitmap_blocks_synced,
+	.skip_sync_blocks	= llbitmap_skip_sync_blocks,
+	.start_sync		= llbitmap_start_sync,
+	.end_sync		= llbitmap_end_sync,
+	.close_sync		= llbitmap_close_sync,
+	.cond_end_sync		= llbitmap_cond_end_sync,
+
+	.update_sb		= llbitmap_update_sb,
+	.get_stats		= llbitmap_get_stats,
+	.dirty_bits		= llbitmap_dirty_bits,
+	.write_all		= llbitmap_write_all,
+
+	.group			= &md_llbitmap_group,
+};
+
+int md_llbitmap_init(void)
+{
+	md_llbitmap_io_wq = alloc_workqueue("md_llbitmap_io",
+					 WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
+	if (!md_llbitmap_io_wq)
+		return -ENOMEM;
+
+	md_llbitmap_unplug_wq = alloc_workqueue("md_llbitmap_unplug",
+					 WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
+	if (!md_llbitmap_unplug_wq) {
+		destroy_workqueue(md_llbitmap_io_wq);
+		md_llbitmap_io_wq = NULL;
+		return -ENOMEM;
+	}
+
+	return register_md_submodule(&llbitmap_ops.head);
+}
+
+void md_llbitmap_exit(void)
+{
+	destroy_workqueue(md_llbitmap_io_wq);
+	md_llbitmap_io_wq = NULL;
+	destroy_workqueue(md_llbitmap_unplug_wq);
+	md_llbitmap_unplug_wq = NULL;
+	unregister_md_submodule(&llbitmap_ops.head);
+}
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3a3a3fdecfbd..722c76b4fade 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -10328,6 +10328,10 @@ static int __init md_init(void)
 	if (ret)
 		return ret;
 
+	ret = md_llbitmap_init();
+	if (ret)
+		goto err_bitmap;
+
 	ret = -ENOMEM;
 	md_wq = alloc_workqueue("md", WQ_MEM_RECLAIM, 0);
 	if (!md_wq)
@@ -10359,6 +10363,8 @@ static int __init md_init(void)
 err_misc_wq:
 	destroy_workqueue(md_wq);
 err_wq:
+	md_llbitmap_exit();
+err_bitmap:
 	md_bitmap_exit();
 	return ret;
 }
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 7b6357879a84..1979c2d4fe89 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -26,7 +26,7 @@
 enum md_submodule_type {
 	MD_PERSONALITY = 0,
 	MD_CLUSTER,
-	MD_BITMAP, /* TODO */
+	MD_BITMAP,
 };
 
 enum md_submodule_id {
@@ -39,7 +39,7 @@ enum md_submodule_id {
 	ID_RAID10	= 10,
 	ID_CLUSTER,
 	ID_BITMAP,
-	ID_LLBITMAP,	/* TODO */
+	ID_LLBITMAP,
 	ID_BITMAP_NONE,
 };
 
-- 
2.39.2



* Re: [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless bitmap
  2025-08-26  8:52 ` [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless bitmap Yu Kuai
@ 2025-08-26  9:52   ` Paul Menzel
  2025-08-27  3:44     ` Yu Kuai
  2025-08-28  4:15   ` Randy Dunlap
  2025-08-28 11:24   ` Li Nan
  2 siblings, 1 reply; 19+ messages in thread
From: Paul Menzel @ 2025-08-26  9:52 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli, linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3,
	yi.zhang, yangerkun, johnny.chenyi

Dear Kuai,


Thank you for your patch.


Am 26.08.25 um 10:52 schrieb Yu Kuai:
> From: Yu Kuai <yukuai3@huawei.com>

It’d be great if you could motivate why a lockless bitmap is needed 
compared to the current implementation.

> Redundant data is used to enhance data fault tolerance, and the storage
> method for redundant data varies depending on the RAID levels. And it's
> important to maintain the consistency of redundant data.
> 
> Bitmap is used to record which data blocks have been synchronized and which
> ones need to be resynchronized or recovered. Each bit in the bitmap
> represents a segment of data in the array. When a bit is set, it indicates
> that the multiple redundant copies of that data segment may not be
> consistent. Data synchronization can be performed based on the bitmap after
> power failure or readding a disk. If there is no bitmap, a full disk
> synchronization is required.
> 
> Key Features:
> 
>   - IO fastpath is lockless; if the user issues lots of write IO to the same
>   bitmap bit in a short time, only the first write has additional overhead
>   to update the bitmap bit, and there is no additional overhead for the
>   following writes;
>   - support resyncing or recovering only written data, meaning that when
>   creating a new array or replacing with a new disk, there is no need to do a
>   full disk resync/recovery;
> 
> Key Concept:
> 
>   - State Machine:
> 
> Each bit is one byte, contain 6 difference state, see llbitmap_state. And
> there are total 8 differenct actions, see llbitmap_action, can change state:

different

> llbitmap state machine: transitions between states
> 
> |           | Startwrite | Startsync | Endsync | Abortsync|
> | --------- | ---------- | --------- | ------- | -------  |
> | Unwritten | Dirty      | x         | x       | x        |
> | Clean     | Dirty      | x         | x       | x        |
> | Dirty     | x          | x         | x       | x        |
> | NeedSync  | x          | Syncing   | x       | x        |
> | Syncing   | x          | Syncing   | Dirty   | NeedSync |
> 
> |           | Reload   | Daemon | Discard   | Stale     |
> | --------- | -------- | ------ | --------- | --------- |
> | Unwritten | x        | x      | x         | x         |
> | Clean     | x        | x      | Unwritten | NeedSync  |
> | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
> | NeedSync  | x        | x      | Unwritten | x         |
> | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
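
The two tables collapse naturally into one lookup array. As a stand-alone
illustration (state and action names assumed from the patch, with "x" mapped
to a no-change sentinel), a minimal sketch could look like:

#include <stdio.h>

enum state  { Unwritten, Clean, Dirty, NeedSync, Syncing, NrStates };
enum action { Startwrite, Startsync, Endsync, Abortsync,
              Reload, Daemon, Discard, Stale, NrActions };

#define X 0xff	/* "x" in the tables above: no state change */

static const unsigned char next[NrStates][NrActions] = {
	/*            Startwr. Startsync Endsync Abortsync Reload    Daemon Discard    Stale    */
	[Unwritten] = { Dirty,  X,       X,      X,        X,        X,     X,         X        },
	[Clean]     = { Dirty,  X,       X,      X,        X,        X,     Unwritten, NeedSync },
	[Dirty]     = { X,      X,       X,      X,        NeedSync, Clean, Unwritten, NeedSync },
	[NeedSync]  = { X,      Syncing, X,      X,        X,        X,     Unwritten, X        },
	[Syncing]   = { X,      Syncing, Dirty,  NeedSync, NeedSync, X,     Unwritten, NeedSync },
};

int main(void)
{
	/* scenario 5.2 below: dirty bits found after a power failure */
	printf("Dirty + Reload -> %d (NeedSync == %d)\n",
	       next[Dirty][Reload], NeedSync);
	return 0;
}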
> 
> Typical scenarios:
> 
> 1) Create new array
> All bits will be set to Unwritten by default; if --assume-clean is set,
> all bits will be set to Clean instead.
> 
> 2) write data; raid1/raid10 have a full copy of the data, while raid456
> doesn't and relies on the xor data
> 
> 2.1) write new data to raid1/raid10:
> Unwritten --StartWrite--> Dirty
> 
> 2.2) write new data to raid456:
> Unwritten --StartWrite--> NeedSync
> 
> Because the initial recover for raid456 is skipped, the xor data is not build

buil*t*

> yet, the bit must be set to NeedSync first, and after the lazy initial
> recover is finished, the bit will finally be set to Dirty (see 5.1 and 5.4);
> 
> 2.3) cover write
> Clean --StartWrite--> Dirty
> 
> 3) daemon, if the array is not degraded:
> Dirty --Daemon--> Clean
> 
> For a degraded array, the Dirty bit will never be cleared, to prevent full
> disk recovery while readding a removed disk.
> 
> 4) discard
> {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
> 
> 5) resync and recover
> 
> 5.1) common process
> NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
> 
> 5.2) resync after power failure
> Dirty --Reload--> NeedSync
> 
> 5.3) recover while replacing with a new disk
> By default, the old bitmap framework will recover all data, and llbitmap
> implements this with a new helper, see llbitmap_skip_sync_blocks:
> 
> skip recover for bits other than dirty or clean;
> 
> 5.4) lazy initial recover for raid5:
> By default, the old bitmap framework will only allow new recovery when there
> are spares (new disks); a new recovery flag MD_RECOVERY_LAZY_RECOVER is added
> to perform raid456 lazy recovery for set bits (from 2.2).
> 
> Bitmap IO:
> 
>   - Chunksize
> 
> The default bitmap size is 128k, including the 1k bitmap super block, and
> the default size of the data segment covered by each bit (the chunksize) is
> 64k; the chunksize is doubled each time the total number of bits would
> otherwise exceed 127k (see llbitmap_init).
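
With assumed numbers (a 2T array, 127k of bitmap payload, one byte per bit),
the doubling rule can be tried out in user space:

#include <stdio.h>

int main(void)
{
	unsigned long long blocks = 1ULL << 32;	/* 2T array, in sectors */
	unsigned long long space = 127 * 1024;	/* bitmap payload in bytes */
	unsigned long long chunksize = 64 * 2;	/* 64k, in sectors */
	unsigned long long chunks = (blocks + chunksize - 1) / chunksize;

	while (chunks > space) {	/* same loop as llbitmap_init() */
		chunksize <<= 1;
		chunks = (blocks + chunksize - 1) / chunksize;
	}
	printf("chunksize %llu sectors, %llu chunks\n", chunksize, chunks);
	return 0;
}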
> 
>   - READ
> 
> While creating the bitmap, all pages will be allocated and read for llbitmap;
> there won't be any reads afterwards
> 
>   - WRITE
> 
> WRITE IO is divided into units of the array's logical_block_size, and the
> dirty state of each block is tracked independently, for example:
> 
> each page is 4k and contains 8 blocks; each block is 512 bytes and contains
> 512 bits;
> 
> | page0 | page1 | ... | page 31 |
> |       |
> |        \-----------------------\
> |                                |
> | block0 | block1 | ... | block 7|
> |        |
> |         \-----------------\
> |                            |
> | bit0 | bit1 | ... | bit511 |
> 
>  From the IO path, if one bit is changed to Dirty or NeedSync, the
> corresponding subpage will be marked dirty, and that block must be written
> before the IO is issued. This behaviour will affect IO performance; to reduce
> the impact, if multiple bits are changed in the same block in a short time,
> all bits in this block will be changed to Dirty/NeedSync, so that there won't
> be any overhead until the daemon clears the dirty bits.
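
The bit-to-page mapping is plain index arithmetic; a user-space sketch of the
math in llbitmap_read()/llbitmap_set_page_dirty() (4k pages and a 512-byte
logical block size assumed):

#include <stdio.h>

#define BM_PAGE_SIZE	4096UL
#define BM_IO_SIZE	512UL	/* logical block size of the array */
#define BM_DATA_OFFSET	1024UL	/* page 0 starts with the 1k super block */

int main(void)
{
	unsigned long bit = 100000;
	unsigned long pos = bit + BM_DATA_OFFSET;
	unsigned long page = pos / BM_PAGE_SIZE;	/* which bitmap page */
	unsigned long offset = pos % BM_PAGE_SIZE;	/* byte within the page */
	unsigned long block = offset / BM_IO_SIZE;	/* dirty-tracked subpage */

	printf("bit %lu -> page %lu, block %lu, byte %lu\n",
	       bit, page, block, offset);
	return 0;
}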
> 
> Dirty Bits syncronization:

sync*h*ronization

> 
> IO fast path will set bits to dirty, and those dirty bits will be cleared
> by daemon after IO is done. llbitmap_page_ctl is used to synchronize between
> IO path and daemon;
> 
> IO path:
>   1) try to grab a reference; if that succeeds, set the expire time to 5s
>   from now and return;
>   2) if grabbing a reference fails, wait for the daemon to finish clearing
>   the dirty bits;
> 
> Daemon(Daemon will be waken up every daemon_sleep seconds):

Add a space before (.

> For each page:
>   1) check if page expired, if not skip this page; for expired page:
>   2) suspend the page and wait for inflight write IO to be done;
>   3) change dirty page to clean;
>   4) resume the page;
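
This handshake can be modelled in user space. A rough pthread sketch of the
blocking rules (the kernel code uses percpu_ref plus timeouts instead, so this
only mirrors the semantics):

#include <pthread.h>
#include <stdbool.h>

struct page_ctl {
	pthread_mutex_t lock;
	pthread_cond_t wait;
	unsigned int refs;	/* inflight writers */
	bool suspended;		/* daemon is clearing dirty bits */
};

/* IO path: take a reference unless the daemon suspended the page */
static void raise_barrier(struct page_ctl *p)
{
	pthread_mutex_lock(&p->lock);
	while (p->suspended)
		pthread_cond_wait(&p->wait, &p->lock);
	p->refs++;
	pthread_mutex_unlock(&p->lock);
}

static void release_barrier(struct page_ctl *p)
{
	pthread_mutex_lock(&p->lock);
	if (--p->refs == 0)
		pthread_cond_broadcast(&p->wait);
	pthread_mutex_unlock(&p->lock);
}

/* daemon: steps 2) and 4) above */
static void suspend_page(struct page_ctl *p)
{
	pthread_mutex_lock(&p->lock);
	p->suspended = true;
	while (p->refs > 0)
		pthread_cond_wait(&p->wait, &p->lock);
	pthread_mutex_unlock(&p->lock);
}

static void resume_page(struct page_ctl *p)
{
	pthread_mutex_lock(&p->lock);
	p->suspended = false;
	pthread_cond_broadcast(&p->wait);
	pthread_mutex_unlock(&p->lock);
}

int main(void)
{
	struct page_ctl p = { PTHREAD_MUTEX_INITIALIZER,
			      PTHREAD_COND_INITIALIZER, 0, false };

	raise_barrier(&p);
	release_barrier(&p);
	suspend_page(&p);	/* no writers left, returns immediately */
	resume_page(&p);
	return 0;
}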

How can/should this patch be tested/benchmarked?

> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   Documentation/admin-guide/md.rst |   20 +
>   drivers/md/Kconfig               |   11 +
>   drivers/md/Makefile              |    1 +
>   drivers/md/md-bitmap.c           |    9 -
>   drivers/md/md-bitmap.h           |   31 +-
>   drivers/md/md-llbitmap.c         | 1600 ++++++++++++++++++++++++++++++
>   drivers/md/md.c                  |    6 +
>   drivers/md/md.h                  |    4 +-
>   8 files changed, 1670 insertions(+), 12 deletions(-)
>   create mode 100644 drivers/md/md-llbitmap.c
> 
> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> index 001363f81850..47d1347ccd00 100644
> --- a/Documentation/admin-guide/md.rst
> +++ b/Documentation/admin-guide/md.rst
> @@ -387,6 +387,8 @@ All md devices contain:
>            No bitmap
>        bitmap
>            The default internal bitmap
> +     llbitmap
> +         The lockless internal bitmap
>   
>   If bitmap_type is not none, then additional bitmap attributes bitmap/xxx or
>   llbitmap/xxx will be created after md device KOBJ_CHANGE event.
> @@ -447,6 +449,24 @@ If bitmap_type is bitmap, then the md device will also contain:
>        once the array becomes non-degraded, and this fact has been
>        recorded in the metadata.
>   
> +If bitmap_type is llbitmap, then the md device will also contain:
> +
> +  llbitmap/bits
> +     This is readonly; it shows the status of the bitmap bits, i.e. the
> +     number of bits in each state.
> +
> +  llbitmap/metadata
> +     This is readonly; it shows the bitmap metadata, including chunksize,
> +     chunkshift, chunks, offset and daemon_sleep.
> +
> +  llbitmap/daemon_sleep

Add the unit to the name? daemon_sleep_s?

> +     This is readwrite; the time in seconds after which the daemon function
> +     will be triggered to clear dirty bits.
> +
> +  llbitmap/barrier_idle

Ditto.

> +     This is readwrite; the time in seconds a page barrier may stay idle
> +     before the dirty bits in the page are cleared.
> +
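
For a quick look from user space, the new attributes can simply be read like
files (the md0 path below is only an assumption for the example):

#include <stdio.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/sys/block/md0/md/llbitmap/bits", "r");

	if (!f) {
		perror("llbitmap/bits");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);	/* "unwritten N\nclean N\ndirty N\n..." */
	fclose(f);
	return 0;
}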
>   As component devices are added to an md array, they appear in the ``md``
>   directory as new directories named::
>   
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index f913579e731c..07c19b2182ca 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -52,6 +52,17 @@ config MD_BITMAP
>   
>   	  If unsure, say Y.
>   
> +config MD_LLBITMAP
> +	bool "MD RAID lockless bitmap support"
> +	depends on BLK_DEV_MD
> +	help
> +	  If you say Y here, support for the lockless write intent bitmap will
> +	  be enabled.

Maybe elaborate a little on when/why this should be selected?

> +
> +	  Note, this is an experimental feature.
> +
> +	  If unsure, say N.
> +
>   config MD_AUTODETECT
>   	bool "Autodetect RAID arrays during kernel boot"
>   	depends on BLK_DEV_MD=y
> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> index 2e18147a9c40..5a51b3408b70 100644
> --- a/drivers/md/Makefile
> +++ b/drivers/md/Makefile
> @@ -29,6 +29,7 @@ dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
>   
>   md-mod-y	+= md.o
>   md-mod-$(CONFIG_MD_BITMAP)	+= md-bitmap.o
> +md-mod-$(CONFIG_MD_LLBITMAP)	+= md-llbitmap.o
>   raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
>   linear-y       += md-linear.o
>   
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index dc050ff94d5b..84b7e2af6dba 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -34,15 +34,6 @@
>   #include "md-bitmap.h"
>   #include "md-cluster.h"
>   
> -#define BITMAP_MAJOR_LO 3
> -/* version 4 insists the bitmap is in little-endian order
> - * with version 3, it is host-endian which is non-portable
> - * Version 5 is currently set only for clustered devices
> - */
> -#define BITMAP_MAJOR_HI 4
> -#define BITMAP_MAJOR_CLUSTERED 5
> -#define	BITMAP_MAJOR_HOSTENDIAN 3
> -
>   /*
>    * in-memory bitmap:
>    *
> diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
> index 5f41724cbcd8..b42a28fa83a0 100644
> --- a/drivers/md/md-bitmap.h
> +++ b/drivers/md/md-bitmap.h
> @@ -9,10 +9,26 @@
>   
>   #define BITMAP_MAGIC 0x6d746962
>   
> +/*
> + * version 3 is host-endian order, this is deprecated and not used for new
> + * array
> + */
> +#define BITMAP_MAJOR_LO		3
> +#define BITMAP_MAJOR_HOSTENDIAN	3
> +/* version 4 is little-endian order, the default value */
> +#define BITMAP_MAJOR_HI		4
> +/* version 5 is only used for cluster */
> +#define BITMAP_MAJOR_CLUSTERED	5

Move this to the header in a separate patch?

> +/* version 6 is only used for lockless bitmap */
> +#define BITMAP_MAJOR_LOCKLESS	6
> +
>   /* use these for bitmap->flags and bitmap->sb->state bit-fields */
>   enum bitmap_state {
> -	BITMAP_STALE	   = 1,  /* the bitmap file is out of date or had -EIO */
> +	BITMAP_STALE	   = 1, /* the bitmap file is out of date or had -EIO */

Unrelated.

>   	BITMAP_WRITE_ERROR = 2, /* A write error has occurred */
> +	BITMAP_FIRST_USE   = 3, /* llbitmap is just created */
> +	BITMAP_CLEAN       = 4, /* llbitmap is created with assume_clean */
> +	BITMAP_DAEMON_BUSY = 5, /* llbitmap daemon is not finished after daemon_sleep */
>   	BITMAP_HOSTENDIAN  =15,
>   };
>   
> @@ -166,4 +182,17 @@ static inline void md_bitmap_exit(void)
>   }
>   #endif
>   
> +#ifdef CONFIG_MD_LLBITMAP
> +int md_llbitmap_init(void);
> +void md_llbitmap_exit(void);
> +#else
> +static inline int md_llbitmap_init(void)
> +{
> +	return 0;
> +}
> +static inline void md_llbitmap_exit(void)
> +{
> +}
> +#endif
> +
>   #endif
> diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
> new file mode 100644
> index 000000000000..88207f31c728
> --- /dev/null
> +++ b/drivers/md/md-llbitmap.c
> @@ -0,0 +1,1600 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#include <linux/blkdev.h>
> +#include <linux/module.h>
> +#include <linux/errno.h>
> +#include <linux/slab.h>
> +#include <linux/init.h>
> +#include <linux/timer.h>
> +#include <linux/sched.h>
> +#include <linux/list.h>
> +#include <linux/file.h>
> +#include <linux/seq_file.h>
> +#include <trace/events/block.h>
> +
> +#include "md.h"
> +#include "md-bitmap.h"
> +
> +/*
> + * #### Background
> + *
> + * Redundant data is used to enhance data fault tolerance, and the storage
> + * method for redundant data varies depending on the RAID levels. And it's
> + * important to maintain the consistency of redundant data.
> + *
> + * Bitmap is used to record which data blocks have been synchronized and which
> + * ones need to be resynchronized or recovered. Each bit in the bitmap
> + * represents a segment of data in the array. When a bit is set, it indicates
> + * that the multiple redundant copies of that data segment may not be
> + * consistent. Data synchronization can be performed based on the bitmap after
> + * power failure or readding a disk. If there is no bitmap, a full disk
> + * synchronization is required.
> + *
> + * #### Key Features
> + *
> + *  - IO fastpath is lockless; if the user issues lots of write IO to the same
> + *  bitmap bit in a short time, only the first write has additional overhead
> + *  to update the bitmap bit, and there is no additional overhead for the
> + *  following writes;
> + *  - support resyncing or recovering only written data, meaning that when
> + *  creating a new array or replacing with a new disk, there is no need to do
> + *  a full disk resync/recovery;
> + *
> + * #### Key Concept
> + *
> + * ##### State Machine
> + *
> + * Each bit is one byte, contain 6 difference state, see llbitmap_state. And
> + * there are total 8 differenct actions, see llbitmap_action, can change state:

Same findings as in commit message.

> + *
> + * llbitmap state machine: transitions between states
> + *
> + * |           | Startwrite | Startsync | Endsync | Abortsync|
> + * | --------- | ---------- | --------- | ------- | -------  |
> + * | Unwritten | Dirty      | x         | x       | x        |
> + * | Clean     | Dirty      | x         | x       | x        |
> + * | Dirty     | x          | x         | x       | x        |
> + * | NeedSync  | x          | Syncing   | x       | x        |
> + * | Syncing   | x          | Syncing   | Dirty   | NeedSync |
> + *
> + * |           | Reload   | Daemon | Discard   | Stale     |
> + * | --------- | -------- | ------ | --------- | --------- |
> + * | Unwritten | x        | x      | x         | x         |
> + * | Clean     | x        | x      | Unwritten | NeedSync  |
> + * | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
> + * | NeedSync  | x        | x      | Unwritten | x         |
> + * | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
> + *
> + * Typical scenarios:
> + *
> + * 1) Create new array
> + * All bits will be set to Unwritten by default; if --assume-clean is set,
> + * all bits will be set to Clean instead.
> + *
> + * 2) write data; raid1/raid10 have a full copy of the data, while raid456
> + * doesn't and relies on the xor data
> + *
> + * 2.1) write new data to raid1/raid10:
> + * Unwritten --StartWrite--> Dirty
> + *
> + * 2.2) write new data to raid456:
> + * Unwritten --StartWrite--> NeedSync
> + *
> + * Because the initial recover for raid456 is skipped, the xor data is not build
> + * yet, the bit must be set to NeedSync first, and after the lazy initial
> + * recover is finished, the bit will finally be set to Dirty (see 5.1 and 5.4);
> + *
> + * 2.3) cover write
> + * Clean --StartWrite--> Dirty
> + *
> + * 3) daemon, if the array is not degraded:
> + * Dirty --Daemon--> Clean
> + *
> + * For a degraded array, the Dirty bit will never be cleared, to prevent full
> + * disk recovery while readding a removed disk.
> + *
> + * 4) discard
> + * {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
> + *
> + * 5) resync and recover
> + *
> + * 5.1) common process
> + * NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
> + *
> + * 5.2) resync after power failure
> + * Dirty --Reload--> NeedSync
> + *
> + * 5.3) recover while replacing with a new disk
> + * By default, the old bitmap framework will recover all data, and llbitmap
> + * implements this with a new helper, see llbitmap_skip_sync_blocks:
> + *
> + * skip recover for bits other than dirty or clean;
> + *
> + * 5.4) lazy initial recover for raid5:
> + * By default, the old bitmap framework will only allow new recovery when there
> + * are spares (new disks); a new recovery flag MD_RECOVERY_LAZY_RECOVER is added
> + * to perform raid456 lazy recovery for set bits (from 2.2).
> + *
> + * ##### Bitmap IO
> + *
> + * ##### Chunksize
> + *
> + * The default bitmap size is 128k, including the 1k bitmap super block, and
> + * the default size of the data segment covered by each bit (the chunksize) is
> + * 64k; the chunksize is doubled each time the total number of bits would
> + * otherwise exceed 127k (see llbitmap_init).
> + *
> + * ##### READ
> + *
> + * While creating the bitmap, all pages will be allocated and read for
> + * llbitmap; there won't be any reads afterwards
> + *
> + * ##### WRITE
> + *
> + * WRITE IO is divided into units of the array's logical_block_size, and the
> + * dirty state of each block is tracked independently, for example:
> + *
> + * each page is 4k and contains 8 blocks; each block is 512 bytes and contains
> + * 512 bits;
> + *
> + * | page0 | page1 | ... | page 31 |
> + * |       |
> + * |        \-----------------------\
> + * |                                |
> + * | block0 | block1 | ... | block 7|
> + * |        |
> + * |         \-----------------\
> + * |                            |
> + * | bit0 | bit1 | ... | bit511 |
> + *
> + * From the IO path, if one bit is changed to Dirty or NeedSync, the
> + * corresponding subpage will be marked dirty, and that block must be written
> + * before the IO is issued. This behaviour will affect IO performance; to
> + * reduce the impact, if multiple bits are changed in the same block in a
> + * short time, all bits in this block will be changed to Dirty/NeedSync, so
> + * that there won't be any overhead until the daemon clears the dirty bits.
> + *
> + * ##### Dirty Bits syncronization

sync*h*ronization

> + *
> + * IO fast path will set bits to dirty, and those dirty bits will be cleared
> + * by daemon after IO is done. llbitmap_page_ctl is used to synchronize between
> + * IO path and daemon;
> + *
> + * IO path:
> + *  1) try to grab a reference; if that succeeds, set the expire time to 5s
> + *  from now and return;
> + *  2) if grabbing a reference fails, wait for the daemon to finish clearing
> + *  the dirty bits;
> + *
> + * Daemon(Daemon will be waken up every daemon_sleep seconds):
> + * For each page:
> + *  1) check if page expired, if not skip this page; for expired page:
> + *  2) suspend the page and wait for inflight write IO to be done;
> + *  3) change dirty page to clean;
> + *  4) resume the page;
> + */

Instead of the header, should this go under `Documentation/`?

> +
> +#define BITMAP_DATA_OFFSET 1024
> +
> +/* 64k is the max IO size of sync IO for raid1/raid10 */
> +#define MIN_CHUNK_SIZE (64 * 2)

Why is double the maximum IO size chosen?

> +
> +/* By default, the daemon will be woken up every 30s */
> +#define DEFAULT_DAEMON_SLEEP 30

Append the unit?

> +
> +/*
> + * Dirtied bits that have not been accessed for more than 5s will be cleared
> + * by the daemon.
> + */
> +#define DEFAULT_BARRIER_IDLE 5
> +
> +enum llbitmap_state {
> +	/* No valid data, init state after assemble the array */
> +	BitUnwritten = 0,
> +	/* data is consistent */
> +	BitClean,
> +	/* data will be consistent after IO is done, set directly for writes */
> +	BitDirty,
> +	/*
> +	 * data needs to be resynchronized:
> +	 * 1) set directly for writes if array is degraded, prevent full disk
> +	 * synchronization after readding a disk;
> +	 * 2) reassemble the array after power failure, and dirty bits are
> +	 * found after reloading the bitmap;
> +	 * 3) set for first write for raid5, to build initial xor data lazily
> +	 */
> +	BitNeedSync,
> +	/* data is synchronizing */
> +	BitSyncing,
> +	BitStateCount,
> +	BitNone = 0xff,
> +};
> +
> +enum llbitmap_action {
> +	/* User write new data, this is the only action from IO fast path */
> +	BitmapActionStartwrite = 0,
> +	/* Start recovery */
> +	BitmapActionStartsync,
> +	/* Finish recovery */
> +	BitmapActionEndsync,
> +	/* Failed recovery */
> +	BitmapActionAbortsync,
> +	/* Reassemble the array */
> +	BitmapActionReload,
> +	/* Daemon thread is trying to clear dirty bits */
> +	BitmapActionDaemon,
> +	/* Data is deleted */
> +	BitmapActionDiscard,
> +	/*
> +	 * Bitmap is stale, mark all bits other than BitUnwritten as
> +	 * BitNeedSync.
> +	 */
> +	BitmapActionStale,
> +	BitmapActionCount,
> +	/* Init state is BitUnwritten */
> +	BitmapActionInit,
> +};
> +
> +enum llbitmap_page_state {
> +	LLPageFlush = 0,
> +	LLPageDirty,
> +};
> +
> +struct llbitmap_page_ctl {
> +	char *state;
> +	struct page *page;
> +	unsigned long expire;
> +	unsigned long flags;
> +	wait_queue_head_t wait;
> +	struct percpu_ref active;
> +	/* Per block size dirty state, maximum 64k page / 1 sector = 128 */
> +	unsigned long dirty[];
> +};
> +
> +struct llbitmap {
> +	struct mddev *mddev;
> +	struct llbitmap_page_ctl **pctl;
> +
> +	unsigned int nr_pages;
> +	unsigned int io_size;
> +	unsigned int blocks_per_page;
> +
> +	/* shift of one chunk */
> +	unsigned long chunkshift;
> +	/* size of one chunk in sector */
> +	unsigned long chunksize;
> +	/* total number of chunks */
> +	unsigned long chunks;
> +	unsigned long last_end_sync;
> +	/*
> +	 * time in seconds that dirty bits will be cleared if the page is not
> +	 * accessed.
> +	 */
> +	unsigned long barrier_idle;
> +	/* fires on first BitDirty state */
> +	struct timer_list pending_timer;
> +	struct work_struct daemon_work;
> +
> +	unsigned long flags;
> +	__u64	events_cleared;
> +
> +	/* for slow disks */
> +	atomic_t behind_writes;
> +	wait_queue_head_t behind_wait;
> +};
> +
> +struct llbitmap_unplug_work {
> +	struct work_struct work;
> +	struct llbitmap *llbitmap;
> +	struct completion *done;
> +};
> +
> +static struct workqueue_struct *md_llbitmap_io_wq;
> +static struct workqueue_struct *md_llbitmap_unplug_wq;
> +
> +static char state_machine[BitStateCount][BitmapActionCount] = {
> +	[BitUnwritten] = {
> +		[BitmapActionStartwrite]	= BitDirty,
> +		[BitmapActionStartsync]		= BitNone,
> +		[BitmapActionEndsync]		= BitNone,
> +		[BitmapActionAbortsync]		= BitNone,
> +		[BitmapActionReload]		= BitNone,
> +		[BitmapActionDaemon]		= BitNone,
> +		[BitmapActionDiscard]		= BitNone,
> +		[BitmapActionStale]		= BitNone,
> +	},
> +	[BitClean] = {
> +		[BitmapActionStartwrite]	= BitDirty,
> +		[BitmapActionStartsync]		= BitNone,
> +		[BitmapActionEndsync]		= BitNone,
> +		[BitmapActionAbortsync]		= BitNone,
> +		[BitmapActionReload]		= BitNone,
> +		[BitmapActionDaemon]		= BitNone,
> +		[BitmapActionDiscard]		= BitUnwritten,
> +		[BitmapActionStale]		= BitNeedSync,
> +	},
> +	[BitDirty] = {
> +		[BitmapActionStartwrite]	= BitNone,
> +		[BitmapActionStartsync]		= BitNone,
> +		[BitmapActionEndsync]		= BitNone,
> +		[BitmapActionAbortsync]		= BitNone,
> +		[BitmapActionReload]		= BitNeedSync,
> +		[BitmapActionDaemon]		= BitClean,
> +		[BitmapActionDiscard]		= BitUnwritten,
> +		[BitmapActionStale]		= BitNeedSync,
> +	},
> +	[BitNeedSync] = {
> +		[BitmapActionStartwrite]	= BitNone,
> +		[BitmapActionStartsync]		= BitSyncing,
> +		[BitmapActionEndsync]		= BitNone,
> +		[BitmapActionAbortsync]		= BitNone,
> +		[BitmapActionReload]		= BitNone,
> +		[BitmapActionDaemon]		= BitNone,
> +		[BitmapActionDiscard]		= BitUnwritten,
> +		[BitmapActionStale]		= BitNone,
> +	},
> +	[BitSyncing] = {
> +		[BitmapActionStartwrite]	= BitNone,
> +		[BitmapActionStartsync]		= BitSyncing,
> +		[BitmapActionEndsync]		= BitDirty,
> +		[BitmapActionAbortsync]		= BitNeedSync,
> +		[BitmapActionReload]		= BitNeedSync,
> +		[BitmapActionDaemon]		= BitNone,
> +		[BitmapActionDiscard]		= BitUnwritten,
> +		[BitmapActionStale]		= BitNeedSync,
> +	},
> +};
> +
> +static void __llbitmap_flush(struct mddev *mddev);
> +
> +static enum llbitmap_state llbitmap_read(struct llbitmap *llbitmap, loff_t pos)
> +{
> +	unsigned int idx;
> +	unsigned int offset;
> +
> +	pos += BITMAP_DATA_OFFSET;
> +	idx = pos >> PAGE_SHIFT;
> +	offset = offset_in_page(pos);
> +
> +	return llbitmap->pctl[idx]->state[offset];
> +}
> +
> +/* set all the bits in the subpage as dirty */
> +static void llbitmap_infect_dirty_bits(struct llbitmap *llbitmap,
> +				       struct llbitmap_page_ctl *pctl,
> +				       unsigned int block)
> +{
> +	bool level_456 = raid_is_456(llbitmap->mddev);
> +	unsigned int io_size = llbitmap->io_size;
> +	int pos;

`size_t` or `unsigned int`? (Also below.)

> +
> +	for (pos = block * io_size; pos < (block + 1) * io_size; pos++) {
> +		switch (pctl->state[pos]) {
> +		case BitUnwritten:
> +			pctl->state[pos] = level_456 ? BitNeedSync : BitDirty;
> +			break;
> +		case BitClean:
> +			pctl->state[pos] = BitDirty;
> +			break;
> +		}
> +	}
> +}
> +
> +static void llbitmap_set_page_dirty(struct llbitmap *llbitmap, int idx,
> +				    int offset)
> +{
> +	struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
> +	unsigned int io_size = llbitmap->io_size;
> +	int block = offset / io_size;
> +	int pos;
> +
> +	if (!test_bit(LLPageDirty, &pctl->flags))
> +		set_bit(LLPageDirty, &pctl->flags);
> +
> +	/*
> +	 * The subpage usually contains a total of 512 bits. If any single bit
> +	 * within the subpage is marked as dirty, the entire sector will be
> +	 * written. To avoid impacting write performance, when multiple bits
> +	 * within the same sector are modified within llbitmap->barrier_idle,
> +	 * all bits in the sector will be collectively marked as dirty at once.
> +	 */
> +	if (test_and_set_bit(block, pctl->dirty)) {
> +		llbitmap_infect_dirty_bits(llbitmap, pctl, block);
> +		return;
> +	}
> +
> +	for (pos = block * io_size; pos < (block + 1) * io_size; pos++) {
> +		if (pos == offset)
> +			continue;
> +		if (pctl->state[pos] == BitDirty ||
> +		    pctl->state[pos] == BitNeedSync) {
> +			llbitmap_infect_dirty_bits(llbitmap, pctl, block);
> +			return;
> +		}
> +	}
> +}
> +
> +static void llbitmap_write(struct llbitmap *llbitmap, enum llbitmap_state state,
> +			   loff_t pos)
> +{
> +	unsigned int idx;
> +	unsigned int bit;
> +
> +	pos += BITMAP_DATA_OFFSET;
> +	idx = pos >> PAGE_SHIFT;
> +	bit = offset_in_page(pos);
> +
> +	llbitmap->pctl[idx]->state[bit] = state;
> +	if (state == BitDirty || state == BitNeedSync)
> +		llbitmap_set_page_dirty(llbitmap, idx, bit);
> +}
> +
> +static struct page *llbitmap_read_page(struct llbitmap *llbitmap, int idx)
> +{
> +	struct mddev *mddev = llbitmap->mddev;
> +	struct page *page = NULL;
> +	struct md_rdev *rdev;
> +
> +	if (llbitmap->pctl && llbitmap->pctl[idx])
> +		page = llbitmap->pctl[idx]->page;
> +	if (page)
> +		return page;
> +
> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!page)
> +		return ERR_PTR(-ENOMEM);
> +
> +	rdev_for_each(rdev, mddev) {
> +		sector_t sector;
> +
> +		if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
> +			continue;
> +
> +		sector = mddev->bitmap_info.offset +
> +			 (idx << PAGE_SECTORS_SHIFT);
> +
> +		if (sync_page_io(rdev, sector, PAGE_SIZE, page, REQ_OP_READ,
> +				 true))
> +			return page;
> +
> +		md_error(mddev, rdev);
> +	}
> +
> +	__free_page(page);
> +	return ERR_PTR(-EIO);
> +}
> +
> +static void llbitmap_write_page(struct llbitmap *llbitmap, int idx)
> +{
> +	struct page *page = llbitmap->pctl[idx]->page;
> +	struct mddev *mddev = llbitmap->mddev;
> +	struct md_rdev *rdev;
> +	int block;
> +
> +	for (block = 0; block < llbitmap->blocks_per_page; block++) {
> +		struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
> +
> +		if (!test_and_clear_bit(block, pctl->dirty))
> +			continue;
> +
> +		rdev_for_each(rdev, mddev) {
> +			sector_t sector;
> +			sector_t bit_sector = llbitmap->io_size >> SECTOR_SHIFT;
> +
> +			if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
> +				continue;
> +
> +			sector = mddev->bitmap_info.offset + rdev->sb_start +
> +				 (idx << PAGE_SECTORS_SHIFT) +
> +				 block * bit_sector;
> +			md_write_metadata(mddev, rdev, sector,
> +					  llbitmap->io_size, page,
> +					  block * llbitmap->io_size);
> +		}
> +	}
> +}
> +
> +static void active_release(struct percpu_ref *ref)
> +{
> +	struct llbitmap_page_ctl *pctl =
> +		container_of(ref, struct llbitmap_page_ctl, active);
> +
> +	wake_up(&pctl->wait);
> +}
> +
> +static void llbitmap_free_pages(struct llbitmap *llbitmap)
> +{
> +	int i;
> +
> +	if (!llbitmap->pctl)
> +		return;
> +
> +	for (i = 0; i < llbitmap->nr_pages; i++) {
> +		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
> +
> +		if (!pctl || !pctl->page)
> +			break;
> +
> +		__free_page(pctl->page);
> +		percpu_ref_exit(&pctl->active);
> +	}
> +
> +	kfree(llbitmap->pctl[0]);
> +	kfree(llbitmap->pctl);
> +	llbitmap->pctl = NULL;
> +}
> +
> +static int llbitmap_cache_pages(struct llbitmap *llbitmap)
> +{
> +	struct llbitmap_page_ctl *pctl;
> +	unsigned int nr_pages = DIV_ROUND_UP(llbitmap->chunks +
> +					     BITMAP_DATA_OFFSET, PAGE_SIZE);
> +	unsigned int size = struct_size(pctl, dirty, BITS_TO_LONGS(
> +						llbitmap->blocks_per_page));
> +	int i;
> +
> +	llbitmap->pctl = kmalloc_array(nr_pages, sizeof(void *),
> +				       GFP_KERNEL | __GFP_ZERO);
> +	if (!llbitmap->pctl)
> +		return -ENOMEM;
> +
> +	size = round_up(size, cache_line_size());
> +	pctl = kmalloc_array(nr_pages, size, GFP_KERNEL | __GFP_ZERO);
> +	if (!pctl) {
> +		kfree(llbitmap->pctl);
> +		return -ENOMEM;
> +	}
> +
> +	llbitmap->nr_pages = nr_pages;
> +
> +	for (i = 0; i < nr_pages; i++, pctl = (void *)pctl + size) {
> +		struct page *page = llbitmap_read_page(llbitmap, i);
> +
> +		llbitmap->pctl[i] = pctl;
> +
> +		if (IS_ERR(page)) {
> +			llbitmap_free_pages(llbitmap);
> +			return PTR_ERR(page);
> +		}
> +
> +		if (percpu_ref_init(&pctl->active, active_release,
> +				    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
> +			__free_page(page);
> +			llbitmap_free_pages(llbitmap);
> +			return -ENOMEM;
> +		}
> +
> +		pctl->page = page;
> +		pctl->state = page_address(page);
> +		init_waitqueue_head(&pctl->wait);
> +	}
> +
> +	return 0;
> +}
> +
> +static void llbitmap_init_state(struct llbitmap *llbitmap)
> +{
> +	enum llbitmap_state state = BitUnwritten;
> +	unsigned long i;
> +
> +	if (test_and_clear_bit(BITMAP_CLEAN, &llbitmap->flags))
> +		state = BitClean;
> +
> +	for (i = 0; i < llbitmap->chunks; i++)
> +		llbitmap_write(llbitmap, state, i);
> +}
> +
> +/* The return value is only used from resync, where @start == @end. */
> +static enum llbitmap_state llbitmap_state_machine(struct llbitmap *llbitmap,
> +						  unsigned long start,
> +						  unsigned long end,
> +						  enum llbitmap_action action)
> +{
> +	struct mddev *mddev = llbitmap->mddev;
> +	enum llbitmap_state state = BitNone;
> +	bool level_456 = raid_is_456(llbitmap->mddev);
> +	bool need_resync = false;
> +	bool need_recovery = false;
> +
> +	if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags))
> +		return BitNone;
> +
> +	if (action == BitmapActionInit) {
> +		llbitmap_init_state(llbitmap);
> +		return BitNone;
> +	}
> +
> +	while (start <= end) {
> +		enum llbitmap_state c = llbitmap_read(llbitmap, start);
> +
> +		if (c < 0 || c >= BitStateCount) {
> +			pr_err("%s: invalid bit %lu state %d action %d, forcing resync\n",
> +			       __func__, start, c, action);
> +			state = BitNeedSync;
> +			goto write_bitmap;
> +		}
> +
> +		if (c == BitNeedSync)
> +			need_resync = true;
> +
> +		state = state_machine[c][action];
> +		if (state == BitNone) {
> +			start++;
> +			continue;
> +		}
> +
> +write_bitmap:
> +		/* Delay raid456 initial recovery to first write. */
> +		if (c == BitUnwritten && state == BitDirty &&
> +		    action == BitmapActionStartwrite && level_456) {
> +			state = BitNeedSync;
> +			need_recovery = true;
> +		}
> +
> +		llbitmap_write(llbitmap, state, start);
> +
> +		if (state == BitNeedSync)
> +			need_resync = true;
> +		else if (state == BitDirty &&
> +			 !timer_pending(&llbitmap->pending_timer))
> +			mod_timer(&llbitmap->pending_timer,
> +				  jiffies + mddev->bitmap_info.daemon_sleep * HZ);
> +
> +		start++;
> +	}
> +
> +	if (need_resync && level_456)
> +		need_recovery = true;
> +
> +	if (need_recovery) {
> +		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> +		set_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery);
> +		md_wakeup_thread(mddev->thread);
> +	} else if (need_resync) {
> +		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> +		set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
> +		md_wakeup_thread(mddev->thread);
> +	}
> +
> +	return state;
> +}
> +
> +static void llbitmap_raise_barrier(struct llbitmap *llbitmap, int page_idx)
> +{
> +	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
> +
> +retry:
> +	if (likely(percpu_ref_tryget_live(&pctl->active))) {
> +		WRITE_ONCE(pctl->expire, jiffies + llbitmap->barrier_idle * HZ);
> +		return;
> +	}
> +
> +	wait_event(pctl->wait, !percpu_ref_is_dying(&pctl->active));
> +	goto retry;
> +}
> +
> +static void llbitmap_release_barrier(struct llbitmap *llbitmap, int page_idx)
> +{
> +	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
> +
> +	percpu_ref_put(&pctl->active);
> +}
> +
> +static int llbitmap_suspend_timeout(struct llbitmap *llbitmap, int page_idx)
> +{
> +	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
> +
> +	percpu_ref_kill(&pctl->active);
> +
> +	if (!wait_event_timeout(pctl->wait, percpu_ref_is_zero(&pctl->active),
> +			llbitmap->mddev->bitmap_info.daemon_sleep * HZ))
> +		return -ETIMEDOUT;
> +
> +	return 0;
> +}
> +
> +static void llbitmap_resume(struct llbitmap *llbitmap, int page_idx)
> +{
> +	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
> +
> +	pctl->expire = LONG_MAX;
> +	percpu_ref_resurrect(&pctl->active);
> +	wake_up(&pctl->wait);
> +}
> +
> +static int llbitmap_check_support(struct mddev *mddev)
> +{
> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
> +		pr_notice("md/llbitmap: %s: array with journal cannot have bitmap\n",
> +			  mdname(mddev));
> +		return -EBUSY;
> +	}
> +
> +	if (mddev->bitmap_info.space == 0) {
> +		if (mddev->bitmap_info.default_space == 0) {
> +			pr_notice("md/llbitmap: %s: no space for bitmap\n",
> +				  mdname(mddev));
> +			return -ENOSPC;
> +		}
> +	}
> +
> +	if (!mddev->persistent) {
> +		pr_notice("md/llbitmap: %s: array must be persistent\n",
> +			  mdname(mddev));
> +		return -EOPNOTSUPP;
> +	}
> +
> +	if (mddev->bitmap_info.file) {
> +		pr_notice("md/llbitmap: %s: doesn't support bitmap file\n",
> +			  mdname(mddev));
> +		return -EOPNOTSUPP;
> +	}
> +
> +	if (mddev->bitmap_info.external) {
> +		pr_notice("md/llbitmap: %s: doesn't support external metadata\n",
> +			  mdname(mddev));
> +		return -EOPNOTSUPP;
> +	}
> +
> +	if (mddev_is_dm(mddev)) {
> +		pr_notice("md/llbitmap: %s: doesn't support dm-raid\n",
> +			  mdname(mddev));
> +		return -EOPNOTSUPP;
> +	}
> +
> +	return 0;
> +}
> +
> +static int llbitmap_init(struct llbitmap *llbitmap)
> +{
> +	struct mddev *mddev = llbitmap->mddev;
> +	sector_t blocks = mddev->resync_max_sectors;
> +	unsigned long chunksize = MIN_CHUNK_SIZE;
> +	unsigned long chunks = DIV_ROUND_UP(blocks, chunksize);
> +	unsigned long space = mddev->bitmap_info.space << SECTOR_SHIFT;
> +	int ret;
> +
> +	while (chunks > space) {
> +		chunksize = chunksize << 1;
> +		chunks = DIV_ROUND_UP(blocks, chunksize);
> +	}
> +
> +	llbitmap->barrier_idle = DEFAULT_BARRIER_IDLE;
> +	llbitmap->chunkshift = ffz(~chunksize);
> +	llbitmap->chunksize = chunksize;
> +	llbitmap->chunks = chunks;
> +	mddev->bitmap_info.daemon_sleep = DEFAULT_DAEMON_SLEEP;
> +
> +	ret = llbitmap_cache_pages(llbitmap);
> +	if (ret)
> +		return ret;
> +
> +	llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1,
> +			       BitmapActionInit);
> +	/* flush initial llbitmap to disk */
> +	__llbitmap_flush(mddev);
> +
> +	return 0;
> +}
> +
> +static int llbitmap_read_sb(struct llbitmap *llbitmap)
> +{
> +	struct mddev *mddev = llbitmap->mddev;
> +	unsigned long daemon_sleep;
> +	unsigned long chunksize;
> +	unsigned long events;
> +	struct page *sb_page;
> +	bitmap_super_t *sb;
> +	int ret = -EINVAL;
> +
> +	if (!mddev->bitmap_info.offset) {
> +		pr_err("md/llbitmap: %s: no super block found\n", mdname(mddev));
> +		return -EINVAL;
> +	}
> +
> +	sb_page = llbitmap_read_page(llbitmap, 0);
> +	if (IS_ERR(sb_page)) {
> +		pr_err("md/llbitmap: %s: read super block failed\n",
> +		       mdname(mddev));
> +		return -EIO;
> +	}
> +
> +	sb = kmap_local_page(sb_page);
> +	if (sb->magic != cpu_to_le32(BITMAP_MAGIC)) {
> +		pr_err("md/llbitmap: %s: invalid super block magic number\n",
> +		       mdname(mddev));
> +		goto out_put_page;
> +	}
> +
> +	if (sb->version != cpu_to_le32(BITMAP_MAJOR_LOCKLESS)) {
> +		pr_err("md/llbitmap: %s: invalid super block version\n",
> +		       mdname(mddev));
> +		goto out_put_page;
> +	}
> +
> +	if (memcmp(sb->uuid, mddev->uuid, 16)) {
> +		pr_err("md/llbitmap: %s: bitmap superblock UUID mismatch\n",
> +		       mdname(mddev));
> +		goto out_put_page;
> +	}
> +
> +	if (mddev->bitmap_info.space == 0) {
> +		int room = le32_to_cpu(sb->sectors_reserved);
> +
> +		if (room)
> +			mddev->bitmap_info.space = room;
> +		else
> +			mddev->bitmap_info.space = mddev->bitmap_info.default_space;
> +	}
> +	llbitmap->flags = le32_to_cpu(sb->state);
> +	if (test_and_clear_bit(BITMAP_FIRST_USE, &llbitmap->flags)) {
> +		ret = llbitmap_init(llbitmap);
> +		goto out_put_page;
> +	}
> +
> +	chunksize = le32_to_cpu(sb->chunksize);
> +	if (!is_power_of_2(chunksize)) {
> +		pr_err("md/llbitmap: %s: chunksize not a power of 2\n",
> +		       mdname(mddev));
> +		goto out_put_page;
> +	}
> +
> +	if (chunksize < DIV_ROUND_UP(mddev->resync_max_sectors,
> +				     mddev->bitmap_info.space << SECTOR_SHIFT)) {
> +		pr_err("md/llbitmap: %s: chunksize too small %lu < %llu / %lu\n",
> +		       mdname(mddev), chunksize, mddev->resync_max_sectors,
> +		       mddev->bitmap_info.space);
> +		goto out_put_page;
> +	}
> +
> +	daemon_sleep = le32_to_cpu(sb->daemon_sleep);
> +	if (daemon_sleep < 1 || daemon_sleep > MAX_SCHEDULE_TIMEOUT / HZ) {
> +		pr_err("md/llbitmap: %s: daemon sleep %lu period out of range\n",
> +		       mdname(mddev), daemon_sleep);
> +		goto out_put_page;
> +	}
> +
> +	events = le64_to_cpu(sb->events);
> +	if (events < mddev->events) {
> +		pr_warn("md/llbitmap: %s: bitmap file is out of date (%lu < %llu) -- forcing full recovery\n",
> +			mdname(mddev), events, mddev->events);
> +		set_bit(BITMAP_STALE, &llbitmap->flags);
> +	}
> +
> +	sb->sync_size = cpu_to_le64(mddev->resync_max_sectors);
> +	mddev->bitmap_info.chunksize = chunksize;
> +	mddev->bitmap_info.daemon_sleep = daemon_sleep;
> +
> +	llbitmap->barrier_idle = DEFAULT_BARRIER_IDLE;
> +	llbitmap->chunksize = chunksize;
> +	llbitmap->chunks = DIV_ROUND_UP(mddev->resync_max_sectors, chunksize);
> +	llbitmap->chunkshift = ffz(~chunksize);
> +	ret = llbitmap_cache_pages(llbitmap);
> +
> +out_put_page:
> +	kunmap_local(sb);
> +	__free_page(sb_page);
> +	return ret;
> +}
> +
> +static void llbitmap_pending_timer_fn(struct timer_list *pending_timer)
> +{
> +	struct llbitmap *llbitmap =
> +		container_of(pending_timer, struct llbitmap, pending_timer);
> +
> +	if (work_busy(&llbitmap->daemon_work)) {
> +		pr_warn("md/llbitmap: %s daemon_work not finished in %lu seconds\n",
> +			mdname(llbitmap->mddev),
> +			llbitmap->mddev->bitmap_info.daemon_sleep);
> +		set_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags);
> +		return;
> +	}
> +
> +	queue_work(md_llbitmap_io_wq, &llbitmap->daemon_work);
> +}
> +
> +static void md_llbitmap_daemon_fn(struct work_struct *work)
> +{
> +	struct llbitmap *llbitmap =
> +		container_of(work, struct llbitmap, daemon_work);
> +	unsigned long start;
> +	unsigned long end;
> +	bool restart;
> +	int idx;
> +
> +	if (llbitmap->mddev->degraded)
> +		return;
> +
> +retry:
> +	start = 0;
> +	end = min(llbitmap->chunks, PAGE_SIZE - BITMAP_DATA_OFFSET) - 1;
> +	restart = false;
> +
> +	for (idx = 0; idx < llbitmap->nr_pages; idx++) {
> +		struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
> +
> +		if (idx > 0) {
> +			start = end + 1;
> +			end = min(end + PAGE_SIZE, llbitmap->chunks - 1);
> +		}
> +
> +		if (!test_bit(LLPageFlush, &pctl->flags) &&
> +		    time_before(jiffies, pctl->expire)) {
> +			restart = true;
> +			continue;
> +		}
> +
> +		if (llbitmap_suspend_timeout(llbitmap, idx) < 0) {
> +			pr_warn("md/llbitmap: %s: %s waiting for page %d timeout\n",
> +				mdname(llbitmap->mddev), __func__, idx);
> +			continue;
> +		}
> +
> +		llbitmap_state_machine(llbitmap, start, end, BitmapActionDaemon);
> +		llbitmap_resume(llbitmap, idx);
> +	}
> +
> +	/*
> +	 * If the daemon took a long time to finish, retry to prevent missing
> +	 * clearing dirty bits.
> +	 */
> +	if (test_and_clear_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags))
> +		goto retry;
> +
> +	/* If some page is dirty but not expired, setup timer again */
> +	if (restart)
> +		mod_timer(&llbitmap->pending_timer,
> +			  jiffies + llbitmap->mddev->bitmap_info.daemon_sleep * HZ);
> +}
> +
> +static int llbitmap_create(struct mddev *mddev)
> +{
> +	struct llbitmap *llbitmap;
> +	int ret;
> +
> +	ret = llbitmap_check_support(mddev);
> +	if (ret)
> +		return ret;
> +
> +	llbitmap = kzalloc(sizeof(*llbitmap), GFP_KERNEL);
> +	if (!llbitmap)
> +		return -ENOMEM;
> +
> +	llbitmap->mddev = mddev;
> +	llbitmap->io_size = bdev_logical_block_size(mddev->gendisk->part0);
> +	llbitmap->blocks_per_page = PAGE_SIZE / llbitmap->io_size;
> +
> +	timer_setup(&llbitmap->pending_timer, llbitmap_pending_timer_fn, 0);
> +	INIT_WORK(&llbitmap->daemon_work, md_llbitmap_daemon_fn);
> +	atomic_set(&llbitmap->behind_writes, 0);
> +	init_waitqueue_head(&llbitmap->behind_wait);
> +
> +	mutex_lock(&mddev->bitmap_info.mutex);
> +	mddev->bitmap = llbitmap;
> +	ret = llbitmap_read_sb(llbitmap);
> +	mutex_unlock(&mddev->bitmap_info.mutex);
> +	if (ret) {
> +		kfree(llbitmap);
> +		mddev->bitmap = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static int llbitmap_resize(struct mddev *mddev, sector_t blocks, int chunksize)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +	unsigned long chunks;
> +
> +	if (chunksize == 0)
> +		chunksize = llbitmap->chunksize;
> +
> +	/* If there is enough space, leave the chunksize unchanged. */
> +	chunks = DIV_ROUND_UP(blocks, chunksize);
> +	while (chunks > mddev->bitmap_info.space << SECTOR_SHIFT) {
> +		chunksize = chunksize << 1;
> +		chunks = DIV_ROUND_UP(blocks, chunksize);
> +	}
> +
> +	llbitmap->chunkshift = ffz(~chunksize);
> +	llbitmap->chunksize = chunksize;
> +	llbitmap->chunks = chunks;
> +
> +	return 0;
> +}
> +
> +static int llbitmap_load(struct mddev *mddev)
> +{
> +	enum llbitmap_action action = BitmapActionReload;
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +
> +	if (test_and_clear_bit(BITMAP_STALE, &llbitmap->flags))
> +		action = BitmapActionStale;
> +
> +	llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, action);
> +	return 0;
> +}
> +
> +static void llbitmap_destroy(struct mddev *mddev)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +
> +	if (!llbitmap)
> +		return;
> +
> +	mutex_lock(&mddev->bitmap_info.mutex);
> +
> +	timer_delete_sync(&llbitmap->pending_timer);
> +	flush_workqueue(md_llbitmap_io_wq);
> +	flush_workqueue(md_llbitmap_unplug_wq);
> +
> +	mddev->bitmap = NULL;
> +	llbitmap_free_pages(llbitmap);
> +	kfree(llbitmap);
> +	mutex_unlock(&mddev->bitmap_info.mutex);
> +}
> +
> +static void llbitmap_start_write(struct mddev *mddev, sector_t offset,
> +				 unsigned long sectors)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +	unsigned long start = offset >> llbitmap->chunkshift;
> +	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
> +	int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
> +	int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
> +
> +	llbitmap_state_machine(llbitmap, start, end, BitmapActionStartwrite);
> +
> +	while (page_start <= page_end) {
> +		llbitmap_raise_barrier(llbitmap, page_start);
> +		page_start++;
> +	}
> +}
> +
> +static void llbitmap_end_write(struct mddev *mddev, sector_t offset,
> +			       unsigned long sectors)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +	unsigned long start = offset >> llbitmap->chunkshift;
> +	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
> +	int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
> +	int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
> +
> +	while (page_start <= page_end) {
> +		llbitmap_release_barrier(llbitmap, page_start);
> +		page_start++;
> +	}
> +}
> +
> +static void llbitmap_start_discard(struct mddev *mddev, sector_t offset,
> +				   unsigned long sectors)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +	unsigned long start = DIV_ROUND_UP(offset, llbitmap->chunksize);
> +	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
> +	int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
> +	int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
> +
> +	llbitmap_state_machine(llbitmap, start, end, BitmapActionDiscard);
> +
> +	while (page_start <= page_end) {
> +		llbitmap_raise_barrier(llbitmap, page_start);
> +		page_start++;
> +	}
> +}
> +
> +static void llbitmap_end_discard(struct mddev *mddev, sector_t offset,
> +				 unsigned long sectors)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +	unsigned long start = DIV_ROUND_UP(offset, llbitmap->chunksize);
> +	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
> +	int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
> +	int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
> +
> +	while (page_start <= page_end) {
> +		llbitmap_release_barrier(llbitmap, page_start);
> +		page_start++;
> +	}
> +}
> +
> +static void llbitmap_unplug_fn(struct work_struct *work)
> +{
> +	struct llbitmap_unplug_work *unplug_work =
> +		container_of(work, struct llbitmap_unplug_work, work);
> +	struct llbitmap *llbitmap = unplug_work->llbitmap;
> +	struct blk_plug plug;
> +	int i;
> +
> +	blk_start_plug(&plug);
> +
> +	for (i = 0; i < llbitmap->nr_pages; i++) {
> +		if (!test_bit(LLPageDirty, &llbitmap->pctl[i]->flags) ||
> +		    !test_and_clear_bit(LLPageDirty, &llbitmap->pctl[i]->flags))
> +			continue;
> +
> +		llbitmap_write_page(llbitmap, i);
> +	}
> +
> +	blk_finish_plug(&plug);
> +	md_super_wait(llbitmap->mddev);
> +	complete(unplug_work->done);
> +}
> +
> +static bool llbitmap_dirty(struct llbitmap *llbitmap)
> +{
> +	int i;
> +
> +	for (i = 0; i < llbitmap->nr_pages; i++)
> +		if (test_bit(LLPageDirty, &llbitmap->pctl[i]->flags))
> +			return true;
> +
> +	return false;
> +}
> +
> +static void llbitmap_unplug(struct mddev *mddev, bool sync)
> +{
> +	DECLARE_COMPLETION_ONSTACK(done);
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +	struct llbitmap_unplug_work unplug_work = {
> +		.llbitmap = llbitmap,
> +		.done = &done,
> +	};
> +
> +	if (!llbitmap_dirty(llbitmap))
> +		return;
> +
> +	/*
> +	 * Issue new bitmap IO under submit_bio() context will deadlock:
> +	 *  - the bio will wait for bitmap bio to be done, before it can be
> +	 *  issued;
> +	 *  - bitmap bio will be added to current->bio_list and wait for this
> +	 *  bio to be issued;
> +	 */
> +	INIT_WORK_ONSTACK(&unplug_work.work, llbitmap_unplug_fn);
> +	queue_work(md_llbitmap_unplug_wq, &unplug_work.work);
> +	wait_for_completion(&done);
> +	destroy_work_on_stack(&unplug_work.work);
> +}
> +
> +/*
> + * Force to write all bitmap pages to disk, called when stopping the array, or
> + * every daemon_sleep seconds when sync_thread is running.
> + */
> +static void __llbitmap_flush(struct mddev *mddev)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +	struct blk_plug plug;
> +	int i;
> +
> +	blk_start_plug(&plug);
> +	for (i = 0; i < llbitmap->nr_pages; i++) {
> +		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
> +
> +		/* mark all blocks as dirty */
> +		set_bit(LLPageDirty, &pctl->flags);
> +		bitmap_fill(pctl->dirty, llbitmap->blocks_per_page);
> +		llbitmap_write_page(llbitmap, i);
> +	}
> +	blk_finish_plug(&plug);
> +	md_super_wait(llbitmap->mddev);
> +}
> +
> +static void llbitmap_flush(struct mddev *mddev)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +	int i;
> +
> +	for (i = 0; i < llbitmap->nr_pages; i++)
> +		set_bit(LLPageFlush, &llbitmap->pctl[i]->flags);
> +
> +	timer_delete_sync(&llbitmap->pending_timer);
> +	queue_work(md_llbitmap_io_wq, &llbitmap->daemon_work);
> +	flush_work(&llbitmap->daemon_work);
> +
> +	__llbitmap_flush(mddev);
> +}
> +
> +/* This is used for raid5 lazy initial recovery */
> +static bool llbitmap_blocks_synced(struct mddev *mddev, sector_t offset)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +	unsigned long p = offset >> llbitmap->chunkshift;
> +	enum llbitmap_state c = llbitmap_read(llbitmap, p);
> +
> +	return c == BitClean || c == BitDirty;
> +}
> +
> +static sector_t llbitmap_skip_sync_blocks(struct mddev *mddev, sector_t offset)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +	unsigned long p = offset >> llbitmap->chunkshift;
> +	int blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1));
> +	enum llbitmap_state c = llbitmap_read(llbitmap, p);
> +
> +	/* always skip unwritten blocks */
> +	if (c == BitUnwritten)
> +		return blocks;
> +
> +	/* For resync also skip clean/dirty blocks */
> +	if ((c == BitClean || c == BitDirty) &&
> +	    test_bit(MD_RECOVERY_SYNC, &mddev->recovery) &&
> +	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
> +		return blocks;
> +
> +	return 0;
> +}
> +
> +static bool llbitmap_start_sync(struct mddev *mddev, sector_t offset,
> +				sector_t *blocks, bool degraded)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +	unsigned long p = offset >> llbitmap->chunkshift;
> +
> +	/*
> +	 * Handle one bit at a time, this is much simpler. And it doesn't matter
> +	 * if md_do_sync() loops more times.
> +	 */
> +	*blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1));
> +	return llbitmap_state_machine(llbitmap, p, p,
> +				      BitmapActionStartsync) == BitSyncing;
> +}
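
The remaining-sectors-in-chunk arithmetic shared by these sync helpers is the
usual power-of-2 remainder trick; with assumed numbers:

#include <stdio.h>

int main(void)
{
	unsigned long long chunksize = 128;	/* sectors, a power of 2 */
	unsigned long long offset = 1000;	/* current sync position */
	unsigned long long blocks = chunksize - (offset & (chunksize - 1));

	/* 1000 % 128 == 104, so 24 sectors remain in this chunk */
	printf("%llu sectors to the next chunk boundary\n", blocks);
	return 0;
}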
> +
> +/* Something is wrong, sync_thread stops at @offset */
> +static void llbitmap_end_sync(struct mddev *mddev, sector_t offset,
> +			      sector_t *blocks)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +	unsigned long p = offset >> llbitmap->chunkshift;
> +
> +	*blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1));
> +	llbitmap_state_machine(llbitmap, p, llbitmap->chunks - 1,
> +			       BitmapActionAbortsync);
> +}
> +
> +/* A full sync_thread is finished */
> +static void llbitmap_close_sync(struct mddev *mddev)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +	int i;
> +
> +	for (i = 0; i < llbitmap->nr_pages; i++) {
> +		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
> +
> +		/* let daemon_fn clear dirty bits immediately */
> +		WRITE_ONCE(pctl->expire, jiffies);
> +	}
> +
> +	llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1,
> +			       BitmapActionEndsync);
> +}
> +
> +/*
> + * sync_thread has reached @sector, update metadata every daemon_sleep seconds,
> + * just in case sync_thread has to restart after a power failure.
> + */
> +static void llbitmap_cond_end_sync(struct mddev *mddev, sector_t sector,
> +				   bool force)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +
> +	if (sector == 0) {
> +		llbitmap->last_end_sync = jiffies;
> +		return;
> +	}
> +
> +	if (time_before(jiffies, llbitmap->last_end_sync +
> +				 HZ * mddev->bitmap_info.daemon_sleep))
> +		return;
> +
> +	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
> +
> +	mddev->curr_resync_completed = sector;
> +	set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> +	llbitmap_state_machine(llbitmap, 0, sector >> llbitmap->chunkshift,
> +			       BitmapActionEndsync);
> +	__llbitmap_flush(mddev);
> +
> +	llbitmap->last_end_sync = jiffies;
> +	sysfs_notify_dirent_safe(mddev->sysfs_completed);
> +}
> +
> +static bool llbitmap_enabled(void *data, bool flush)
> +{
> +	struct llbitmap *llbitmap = data;
> +
> +	return llbitmap && !test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
> +}
> +
> +static void llbitmap_dirty_bits(struct mddev *mddev, unsigned long s,
> +				unsigned long e)
> +{
> +	llbitmap_state_machine(mddev->bitmap, s, e, BitmapActionStartwrite);
> +}
> +
> +static void llbitmap_write_sb(struct llbitmap *llbitmap)
> +{
> +	int nr_blocks = DIV_ROUND_UP(BITMAP_DATA_OFFSET, llbitmap->io_size);
> +
> +	bitmap_fill(llbitmap->pctl[0]->dirty, nr_blocks);
> +	llbitmap_write_page(llbitmap, 0);
> +	md_super_wait(llbitmap->mddev);
> +}
> +
> +static void llbitmap_update_sb(void *data)
> +{
> +	struct llbitmap *llbitmap = data;
> +	struct mddev *mddev = llbitmap->mddev;
> +	struct page *sb_page;
> +	bitmap_super_t *sb;
> +
> +	if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags))
> +		return;
> +
> +	sb_page = llbitmap_read_page(llbitmap, 0);
> +	if (IS_ERR(sb_page)) {
> +		pr_err("%s: %s: read super block failed\n", __func__,
> +		       mdname(mddev));
> +		set_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
> +		return;
> +	}
> +
> +	if (mddev->events < llbitmap->events_cleared)
> +		llbitmap->events_cleared = mddev->events;
> +
> +	sb = kmap_local_page(sb_page);
> +	sb->events = cpu_to_le64(mddev->events);
> +	sb->state = cpu_to_le32(llbitmap->flags);
> +	sb->chunksize = cpu_to_le32(llbitmap->chunksize);
> +	sb->sync_size = cpu_to_le64(mddev->resync_max_sectors);
> +	sb->events_cleared = cpu_to_le64(llbitmap->events_cleared);
> +	sb->sectors_reserved = cpu_to_le32(mddev->bitmap_info.space);
> +	sb->daemon_sleep = cpu_to_le32(mddev->bitmap_info.daemon_sleep);
> +
> +	kunmap_local(sb);
> +	llbitmap_write_sb(llbitmap);
> +}
> +
> +static int llbitmap_get_stats(void *data, struct md_bitmap_stats *stats)
> +{
> +	struct llbitmap *llbitmap = data;
> +
> +	memset(stats, 0, sizeof(*stats));
> +
> +	stats->missing_pages = 0;
> +	stats->pages = llbitmap->nr_pages;
> +	stats->file_pages = llbitmap->nr_pages;
> +
> +	stats->behind_writes = atomic_read(&llbitmap->behind_writes);
> +	stats->behind_wait = wq_has_sleeper(&llbitmap->behind_wait);
> +	stats->events_cleared = llbitmap->events_cleared;
> +
> +	return 0;
> +}
> +
> +/* just flag all pages as needing to be written */
> +static void llbitmap_write_all(struct mddev *mddev)
> +{
> +	int i;
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +
> +	for (i = 0; i < llbitmap->nr_pages; i++) {
> +		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
> +
> +		set_bit(LLPageDirty, &pctl->flags);
> +		bitmap_fill(pctl->dirty, llbitmap->blocks_per_page);
> +	}
> +}
> +
> +static void llbitmap_start_behind_write(struct mddev *mddev)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +
> +	atomic_inc(&llbitmap->behind_writes);
> +}
> +
> +static void llbitmap_end_behind_write(struct mddev *mddev)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +
> +	if (atomic_dec_and_test(&llbitmap->behind_writes))
> +		wake_up(&llbitmap->behind_wait);
> +}
> +
> +static void llbitmap_wait_behind_writes(struct mddev *mddev)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +
> +	if (!llbitmap)
> +		return;
> +
> +	wait_event(llbitmap->behind_wait,
> +		   atomic_read(&llbitmap->behind_writes) == 0);
> +}
> +
> +static ssize_t bits_show(struct mddev *mddev, char *page)
> +{
> +	struct llbitmap *llbitmap;
> +	int bits[BitStateCount] = {0};
> +	loff_t start = 0;
> +
> +	mutex_lock(&mddev->bitmap_info.mutex);
> +	llbitmap = mddev->bitmap;
> +	if (!llbitmap || !llbitmap->pctl) {
> +		mutex_unlock(&mddev->bitmap_info.mutex);
> +		return sprintf(page, "no bitmap\n");
> +	}
> +
> +	if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags)) {
> +		mutex_unlock(&mddev->bitmap_info.mutex);
> +		return sprintf(page, "bitmap io error\n");
> +	}
> +
> +	while (start < llbitmap->chunks) {
> +		enum llbitmap_state c = llbitmap_read(llbitmap, start);
> +
> +		if (c < 0 || c >= BitStateCount)
> +			pr_err("%s: invalid bit %llu state %d\n",
> +			       __func__, start, c);
> +		else
> +			bits[c]++;
> +		start++;
> +	}
> +
> +	mutex_unlock(&mddev->bitmap_info.mutex);
> +	return sprintf(page, "unwritten %d\nclean %d\ndirty %d\nneed sync %d\nsyncing %d\n",
> +		       bits[BitUnwritten], bits[BitClean], bits[BitDirty],
> +		       bits[BitNeedSync], bits[BitSyncing]);
> +}
> +
> +static struct md_sysfs_entry llbitmap_bits =
> +__ATTR_RO(bits);
> +
> +static ssize_t metadata_show(struct mddev *mddev, char *page)
> +{
> +	struct llbitmap *llbitmap;
> +	ssize_t ret;
> +
> +	mutex_lock(&mddev->bitmap_info.mutex);
> +	llbitmap = mddev->bitmap;
> +	if (!llbitmap) {
> +		mutex_unlock(&mddev->bitmap_info.mutex);
> +		return sprintf(page, "no bitmap\n");
> +	}
> +
> +	ret = sprintf(page, "chunksize %lu\nchunkshift %lu\nchunks %lu\noffset %llu\ndaemon_sleep %lu\n",
> +		       llbitmap->chunksize, llbitmap->chunkshift,
> +		       llbitmap->chunks, mddev->bitmap_info.offset,
> +		       llbitmap->mddev->bitmap_info.daemon_sleep);
> +	mutex_unlock(&mddev->bitmap_info.mutex);
> +
> +	return ret;
> +}
> +
> +static struct md_sysfs_entry llbitmap_metadata =
> +__ATTR_RO(metadata);
> +
> +static ssize_t
> +daemon_sleep_show(struct mddev *mddev, char *page)
> +{
> +	return sprintf(page, "%lu\n", mddev->bitmap_info.daemon_sleep);
> +}
> +
> +static ssize_t
> +daemon_sleep_store(struct mddev *mddev, const char *buf, size_t len)
> +{
> +	unsigned long timeout;
> +	int rv = kstrtoul(buf, 10, &timeout);
> +
> +	if (rv)
> +		return rv;
> +
> +	mddev->bitmap_info.daemon_sleep = timeout;
> +	return len;
> +}
> +
> +static struct md_sysfs_entry llbitmap_daemon_sleep =
> +__ATTR_RW(daemon_sleep);
> +
> +static ssize_t
> +barrier_idle_show(struct mddev *mddev, char *page)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +
> +	return sprintf(page, "%lu\n", llbitmap->barrier_idle);
> +}
> +
> +static ssize_t
> +barrier_idle_store(struct mddev *mddev, const char *buf, size_t len)
> +{
> +	struct llbitmap *llbitmap = mddev->bitmap;
> +	unsigned long timeout;
> +	int rv = kstrtoul(buf, 10, &timeout);
> +
> +	if (rv)
> +		return rv;
> +
> +	llbitmap->barrier_idle = timeout;
> +	return len;
> +}
> +
> +static struct md_sysfs_entry llbitmap_barrier_idle =
> +__ATTR_RW(barrier_idle);
> +
> +static struct attribute *md_llbitmap_attrs[] = {
> +	&llbitmap_bits.attr,
> +	&llbitmap_metadata.attr,
> +	&llbitmap_daemon_sleep.attr,
> +	&llbitmap_barrier_idle.attr,
> +	NULL
> +};
> +
> +static struct attribute_group md_llbitmap_group = {
> +	.name = "llbitmap",
> +	.attrs = md_llbitmap_attrs,
> +};
> +
> +static struct bitmap_operations llbitmap_ops = {
> +	.head = {
> +		.type	= MD_BITMAP,
> +		.id	= ID_LLBITMAP,
> +		.name	= "llbitmap",
> +	},
> +
> +	.enabled		= llbitmap_enabled,
> +	.create			= llbitmap_create,
> +	.resize			= llbitmap_resize,
> +	.load			= llbitmap_load,
> +	.destroy		= llbitmap_destroy,
> +
> +	.start_write		= llbitmap_start_write,
> +	.end_write		= llbitmap_end_write,
> +	.start_discard		= llbitmap_start_discard,
> +	.end_discard		= llbitmap_end_discard,
> +	.unplug			= llbitmap_unplug,
> +	.flush			= llbitmap_flush,
> +
> +	.start_behind_write	= llbitmap_start_behind_write,
> +	.end_behind_write	= llbitmap_end_behind_write,
> +	.wait_behind_writes	= llbitmap_wait_behind_writes,
> +
> +	.blocks_synced		= llbitmap_blocks_synced,
> +	.skip_sync_blocks	= llbitmap_skip_sync_blocks,
> +	.start_sync		= llbitmap_start_sync,
> +	.end_sync		= llbitmap_end_sync,
> +	.close_sync		= llbitmap_close_sync,
> +	.cond_end_sync		= llbitmap_cond_end_sync,
> +
> +	.update_sb		= llbitmap_update_sb,
> +	.get_stats		= llbitmap_get_stats,
> +	.dirty_bits		= llbitmap_dirty_bits,
> +	.write_all		= llbitmap_write_all,
> +
> +	.group			= &md_llbitmap_group,
> +};
> +
> +int md_llbitmap_init(void)
> +{
> +	md_llbitmap_io_wq = alloc_workqueue("md_llbitmap_io",
> +					 WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
> +	if (!md_llbitmap_io_wq)
> +		return -ENOMEM;
> +
> +	md_llbitmap_unplug_wq = alloc_workqueue("md_llbitmap_unplug",
> +					 WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
> +	if (!md_llbitmap_unplug_wq) {
> +		destroy_workqueue(md_llbitmap_io_wq);
> +		md_llbitmap_io_wq = NULL;
> +		return -ENOMEM;
> +	}
> +
> +	return register_md_submodule(&llbitmap_ops.head);
> +}
> +
> +void md_llbitmap_exit(void)
> +{
> +	destroy_workqueue(md_llbitmap_io_wq);
> +	md_llbitmap_io_wq = NULL;
> +	destroy_workqueue(md_llbitmap_unplug_wq);
> +	md_llbitmap_unplug_wq = NULL;
> +	unregister_md_submodule(&llbitmap_ops.head);
> +}
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 3a3a3fdecfbd..722c76b4fade 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -10328,6 +10328,10 @@ static int __init md_init(void)
>   	if (ret)
>   		return ret;
>   
> +	ret = md_llbitmap_init();
> +	if (ret)
> +		goto err_bitmap;
> +
>   	ret = -ENOMEM;
>   	md_wq = alloc_workqueue("md", WQ_MEM_RECLAIM, 0);
>   	if (!md_wq)
> @@ -10359,6 +10363,8 @@ static int __init md_init(void)
>   err_misc_wq:
>   	destroy_workqueue(md_wq);
>   err_wq:
> +	md_llbitmap_exit();
> +err_bitmap:
>   	md_bitmap_exit();
>   	return ret;
>   }
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 7b6357879a84..1979c2d4fe89 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -26,7 +26,7 @@
>   enum md_submodule_type {
>   	MD_PERSONALITY = 0,
>   	MD_CLUSTER,
> -	MD_BITMAP, /* TODO */
> +	MD_BITMAP,
>   };
>   
>   enum md_submodule_id {
> @@ -39,7 +39,7 @@ enum md_submodule_id {
>   	ID_RAID10	= 10,
>   	ID_CLUSTER,
>   	ID_BITMAP,
> -	ID_LLBITMAP,	/* TODO */
> +	ID_LLBITMAP,
>   	ID_BITMAP_NONE,
>   };
>   
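
As an aside, the new sysfs group is easy to poke at from userspace once
bitmap_type is set to llbitmap. A minimal sketch (the array name md0 and
the exact sysfs path are assumptions for illustration based on the
documented llbitmap/bits attribute, not taken from this patch):

/* Minimal sketch: dump the llbitmap bit-state summary from sysfs.
 * Assumes an array named md0 created with bitmap_type=llbitmap; the
 * path below is an assumption, not quoted from the patch. */
#include <stdio.h>

int main(void)
{
	char buf[256];
	FILE *f = fopen("/sys/block/md0/md/llbitmap/bits", "r");

	if (!f) {
		perror("open llbitmap/bits");
		return 1;
	}
	while (fgets(buf, sizeof(buf), f))
		fputs(buf, stdout);
	fclose(f);
	return 0;
}

On a healthy idle array this should report most chunks as clean, matching
the bits_show() output format above.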


Kind regards,

Paul


* Re: [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless bitmap
  2025-08-26  9:52   ` Paul Menzel
@ 2025-08-27  3:44     ` Yu Kuai
  2025-08-27  6:07       ` Paul Menzel
  0 siblings, 1 reply; 19+ messages in thread
From: Yu Kuai @ 2025-08-27  3:44 UTC (permalink / raw)
  To: Paul Menzel, Yu Kuai
  Cc: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli, linux-doc, linux-kernel, dm-devel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

On 2025/08/26 17:52, Paul Menzel wrote:
> It’d be great if you could motivate why a lockless bitmap is needed
> compared to the current implementation.

See the performance test results: the old bitmap has a global spinlock and
performs poorly with fast disks.
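
To illustrate the point with a generic sketch (this is not the actual
md-bitmap or llbitmap code, just the idea): with a global lock every
write IO serializes on the same spinlock to mark its bit, while in a
lockless design only the first writer that actually changes a bit pays
any synchronization cost:

/* Illustrative userspace sketch only -- not kernel code. */
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned char bits_locked[1024];
static _Atomic unsigned char bits_lockless[1024];

/* Old scheme: every writer takes the one lock, even when the bit is
 * already dirty, so fast disks contend on the lock. */
static void startwrite_locked(unsigned long bit)
{
	pthread_mutex_lock(&lock);
	bits_locked[bit] = 1;
	pthread_mutex_unlock(&lock);
}

/* Lockless scheme: writers after the first see the bit already set
 * and return immediately; only the 0 -> 1 transition does real work. */
static void startwrite_lockless(unsigned long bit)
{
	unsigned char expected = 0;

	if (atomic_load_explicit(&bits_lockless[bit], memory_order_relaxed))
		return;
	atomic_compare_exchange_strong(&bits_lockless[bit], &expected, 1);
}

int main(void)
{
	startwrite_locked(7);
	startwrite_lockless(7);
	startwrite_lockless(7);	/* second call returns immediately */
	return 0;
}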

[snip the typo part]

> How can/should this patch be tested/benchmarked?

There is a pending mdadm patch; the RFC version can be used. I will work on
the formal version after this set is applied.

> --- a/drivers/md/md-bitmap.h
> +++ b/drivers/md/md-bitmap.h
> @@ -9,10 +9,26 @@
>    #define BITMAP_MAGIC 0x6d746962
> +/*
> + * version 3 is host-endian order, this is deprecated and not used for new
> + * array
> + */
> +#define BITMAP_MAJOR_LO        3
> +#define BITMAP_MAJOR_HOSTENDIAN    3
> +/* version 4 is little-endian order, the default value */
> +#define BITMAP_MAJOR_HI        4
> +/* version 5 is only used for cluster */
> +#define BITMAP_MAJOR_CLUSTERED    5

> Move this to the header in a separate patch?

I prefer not to, since the old bitmap uses this as well.

Thanks,
Kuai



* Re: [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless bitmap
  2025-08-27  3:44     ` Yu Kuai
@ 2025-08-27  6:07       ` Paul Menzel
  2025-08-28  7:10         ` Yu Kuai
  0 siblings, 1 reply; 19+ messages in thread
From: Paul Menzel @ 2025-08-27  6:07 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli, linux-doc, linux-kernel, dm-devel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Dear Kuai,


Thank you for your reply.

On 27.08.25 at 05:44, Yu Kuai wrote:

> On 2025/08/26 17:52, Paul Menzel wrote:
>> It’d be great if you could motivate why a lockless bitmap is needed
>> compared to the current implementation.
> 
> See the performance test results: the old bitmap has a global spinlock and
> performs poorly with fast disks.

Yes, but it’s at the end, and not explicitly stated. Should you resend, 
it’d be great if you could add that.

> [snip the typo part]
> 
>> How can/should this patch be tested/benchmarked?
> 
> There is a pending mdadm patch; the RFC version can be used. I will work on
> the formal version after this set is applied.

Understood. Maybe add a URL to the mdadm patch. (Sorry, in case I missed it.)

>> --- a/drivers/md/md-bitmap.h
>> +++ b/drivers/md/md-bitmap.h
>> @@ -9,10 +9,26 @@
>>    #define BITMAP_MAGIC 0x6d746962
>> +/*
>> + * version 3 is host-endian order, this is deprecated and not used for new
>> + * array
>> + */
>> +#define BITMAP_MAJOR_LO        3
>> +#define BITMAP_MAJOR_HOSTENDIAN    3
>> +/* version 4 is little-endian order, the default value */
>> +#define BITMAP_MAJOR_HI        4
>> +/* version 5 is only used for cluster */
>> +#define BITMAP_MAJOR_CLUSTERED    5

>> Move this to the header in a separate patch?
> 
> I prefer not to, since the old bitmap uses this as well.

Hmm, I do not understand the answer: as it’s moved in this patch, why
can’t it be moved in another? But it’s not that important.


Kind regards,

Paul


* Re: [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless bitmap
  2025-08-26  8:52 ` [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless bitmap Yu Kuai
  2025-08-26  9:52   ` Paul Menzel
@ 2025-08-28  4:15   ` Randy Dunlap
  2025-08-28 11:24   ` Li Nan
  2 siblings, 0 replies; 19+ messages in thread
From: Randy Dunlap @ 2025-08-28  4:15 UTC (permalink / raw)
  To: Yu Kuai, hch, corbet, agk, snitzer, mpatocka, song, xni, hare,
	linan122, colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3, yi.zhang,
	yangerkun, johnny.chenyi



On 8/26/25 1:52 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Redundant data is used to enhance data fault tolerance, and the storage
> method for redundant data vary depending on the RAID levels. And it's
> important to maintain the consistency of redundant data.
> 
> Bitmap is used to record which data blocks have been synchronized and which
> ones need to be resynchronized or recovered. Each bit in the bitmap
> represents a segment of data in the array. When a bit is set, it indicates
> that the multiple redundant copies of that data segment may not be
> consistent. Data synchronization can be performed based on the bitmap after
> power failure or readding a disk. If there is no bitmap, a full disk

                   reading

> synchronization is required.
> 
> Key Features:
> 
>  - IO fastpath is lockless, if user issues lots of write IO to the same

                    lockless. If the user

>  bitmap bit in a short time, only the first write have additional overhead

                                                    has

>  to update bitmap bit, no additional overhead for the following writes;
>  - support only resync or recover written data, means in the case creating
>  new array or replacing with a new disk, there is no need to do a full disk
>  resync/recovery;
> 
> Key Concept:
> 
>  - State Machine:
> 
> Each bit is one byte, contain 6 difference state, see llbitmap_state. And

                        contains 6 different states,


> there are total 8 differenct actions, see llbitmap_action, can change state:

                    different                                that can change state:

> 
> llbitmap state machine: transitions between states
> 
> |           | Startwrite | Startsync | Endsync | Abortsync|
> | --------- | ---------- | --------- | ------- | -------  |
> | Unwritten | Dirty      | x         | x       | x        |
> | Clean     | Dirty      | x         | x       | x        |
> | Dirty     | x          | x         | x       | x        |
> | NeedSync  | x          | Syncing   | x       | x        |
> | Syncing   | x          | Syncing   | Dirty   | NeedSync |
> 
> |           | Reload   | Daemon | Discard   | Stale     |
> | --------- | -------- | ------ | --------- | --------- |
> | Unwritten | x        | x      | x         | x         |
> | Clean     | x        | x      | Unwritten | NeedSync  |
> | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
> | NeedSync  | x        | x      | Unwritten | x         |
> | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
> 
> Typical scenarios:
> 
> 1) Create new array
> All bits will be set to Unwritten by default, if --assume-clean is set,

                                       default. If

> all bits will be set to Clean instead.
> 
> 2) write data, raid1/raid10 have full copy of data, while raid456 doesn't and
> rely on xor data
> 
> 2.1) write new data to raid1/raid10:
> Unwritten --StartWrite--> Dirty
> 
> 2.2) write new data to raid456:
> Unwritten --StartWrite--> NeedSync
> 
> Because the initial recover for raid456 is skipped, the xor data is not build
> yet, the bit must set to NeedSync first and after lazy initial recover is
> finished, the bit will finially set to Dirty(see 5.1 and 5.4);

                         finally

> 
> 2.3) cover write
> Clean --StartWrite--> Dirty
> 
> 3) daemon, if the array is not degraded:
> Dirty --Daemon--> Clean
> 
> For degraded array, the Dirty bit will never be cleared, prevent full disk

                                                           preventing

> recovery while readding a removed disk.

                 reading

> 
> 4) discard
> {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
> 
> 5) resync and recover
> 
> 5.1) common process
> NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
> 
> 5.2) resync after power failure
> Dirty --Reload--> NeedSync
> 
> 5.3) recover while replacing with a new disk
> By default, the old bitmap framework will recover all data, and llbitmap
> implement this by a new helper, see llbitmap_skip_sync_blocks:

  implements

> 
> skip recover for bits other than dirty or clean;
> 
> 5.4) lazy initial recover for raid5:
> By default, the old bitmap framework will only allow new recover when there
> are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER is add

                                                                        added

> to perform raid456 lazy recover for set bits(from 2.2).
> 
> Bitmap IO:
> 
>  - Chunksize
> 
> The default bitmap size is 128k, incluing 1k bitmap super block, and

                                   including

> the default size of segment of data in the array each bit(chunksize) is 64k,
> and chunksize will adjust to twice the old size each time if the total number
> bits is not less than 127k.(see llbitmap_init)
> 
>  - READ
> 
> While creating bitmap, all pages will be allocated and read for llbitmap,

                                                                  llbitmap.

> there won't be read afterwards

  There          a read afterwards.

> 
>  - WRITE
> 
> WRITE IO is divided into logical_block_size of the array, the dirty state
> of each block is tracked independently, for example:
> 
> each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 bit;

                                                       bytes and contains 512 bits:

> 
> | page0 | page1 | ... | page 31 |
> |       |
> |        \-----------------------\
> |                                |
> | block0 | block1 | ... | block 8|
> |        |
> |         \-----------------\
> |                            |
> | bit0 | bit1 | ... | bit511 |
> 
> From IO path, if one bit is changed to Dirty or NeedSync, the corresponding
> subpage will be marked dirty, such block must write first before the IO is

                         dirty;

> issued. This behaviour will affect IO performance, to reduce the impact, if

                                        performance. To

> multiple bits are changed in the same block in a short time, all bits in this
> block will be changed to Dirty/NeedSync, so that there won't be any overhead
> until daemon clears dirty bits.
> 
> Dirty Bits syncronization:
> 
> IO fast path will set bits to dirty, and those dirty bits will be cleared
> by daemon after IO is done. llbitmap_page_ctl is used to synchronize between
> IO path and daemon;
> 
> IO path:
>  1) try to grab a reference, if succeed, set expire time after 5s and return;
>  2) if failed to grab a reference, wait for daemon to finish clearing dirty
>  bits;
> 
> Daemon(Daemon will be waken up every daemon_sleep seconds):

                will be woken up
or
                will be awakened

> For each page:
>  1) check if page expired, if not skip this page; for expired page:

                    expired; if not, skip this page. For expired page:

>  2) suspend the page and wait for inflight write IO to be done;
>  3) change dirty page to clean;
>  4) resume the page;
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  Documentation/admin-guide/md.rst |   20 +
>  drivers/md/Kconfig               |   11 +
>  drivers/md/Makefile              |    1 +
>  drivers/md/md-bitmap.c           |    9 -
>  drivers/md/md-bitmap.h           |   31 +-
>  drivers/md/md-llbitmap.c         | 1600 ++++++++++++++++++++++++++++++
>  drivers/md/md.c                  |    6 +
>  drivers/md/md.h                  |    4 +-
>  8 files changed, 1670 insertions(+), 12 deletions(-)
>  create mode 100644 drivers/md/md-llbitmap.c
> 
> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> index 001363f81850..47d1347ccd00 100644
> --- a/Documentation/admin-guide/md.rst
> +++ b/Documentation/admin-guide/md.rst
> @@ -387,6 +387,8 @@ All md devices contain:
>           No bitmap
>       bitmap
>           The default internal bitmap
> +     llbitmap
> +         The lockless internal bitmap
>  
>  If bitmap_type is not none, then additional bitmap attributes bitmap/xxx or
>  llbitmap/xxx will be created after md device KOBJ_CHANGE event.
> @@ -447,6 +449,24 @@ If bitmap_type is bitmap, then the md device will also contain:
>       once the array becomes non-degraded, and this fact has been
>       recorded in the metadata.
>  
> +If bitmap_type is llbitmap, then the md device will also contain:
> +
> +  llbitmap/bits
> +     This is readonly, show status of bitmap bits, the number of each

                read-only; it shows the status of bitmap bits,

> +     value.
> +
> +  llbitmap/metadata
> +     This is readonly, show bitmap metadata, include chunksize, chunkshift,

                read-only; it shows bitmap metadata, including

> +     chunks, offset and daemon_sleep.
> +
> +  llbitmap/daemon_sleep
> +     This is readwrite, time in seconds that daemon function will be

                read-write, time in seconds

> +     triggered to clear dirty bits.
> +
> +  llbitmap/barrier_idle
> +     This is readwrite, time in seconds that page barrier will be idled,

                read-write,> +     means dirty bits in the page will be cleared.
> +
>  As component devices are added to an md array, they appear in the ``md``
>  directory as new directories named::
>  

> diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
> new file mode 100644
> index 000000000000..88207f31c728
> --- /dev/null
> +++ b/drivers/md/md-llbitmap.c
> @@ -0,0 +1,1600 @@

> +/*
> + * #### Background
> + *
> + * Redundant data is used to enhance data fault tolerance, and the storage
> + * method for redundant data vary depending on the RAID levels. And it's

      methods

> + * important to maintain the consistency of redundant data.
> + *
> + * Bitmap is used to record which data blocks have been synchronized and which
> + * ones need to be resynchronized or recovered. Each bit in the bitmap
> + * represents a segment of data in the array. When a bit is set, it indicates
> + * that the multiple redundant copies of that data segment may not be
> + * consistent. Data synchronization can be performed based on the bitmap after
> + * power failure or readding a disk. If there is no bitmap, a full disk

                       reading

> + * synchronization is required.
> + *
> + * #### Key Features
> + *
> + *  - IO fastpath is lockless, if user issues lots of write IO to the same

                        lockless. If the user

> + *  bitmap bit in a short time, only the first write have additional overhead

                                                        has

> + *  to update bitmap bit, no additional overhead for the following writes;

                        bit; there is no additional overhead for the following writes;

> + *  - support only resync or recover written data, means in the case creating
> + *  new array or replacing with a new disk, there is no need to do a full disk
> + *  resync/recovery;
> + *
> + * #### Key Concept
> + *
> + * ##### State Machine
> + *
> + * Each bit is one byte, contain 6 difference state, see llbitmap_state. And

                      byte, containing           states,

> + * there are total 8 differenct actions, see llbitmap_action, can change state:

                        different                              , that can change state.

> + *
> + * llbitmap state machine: transitions between states

                                                  states::

Use "::" to maintain the table spacing.

> + *
> + * |           | Startwrite | Startsync | Endsync | Abortsync|
> + * | --------- | ---------- | --------- | ------- | -------  |
> + * | Unwritten | Dirty      | x         | x       | x        |
> + * | Clean     | Dirty      | x         | x       | x        |
> + * | Dirty     | x          | x         | x       | x        |
> + * | NeedSync  | x          | Syncing   | x       | x        |
> + * | Syncing   | x          | Syncing   | Dirty   | NeedSync |
> + *
> + * |           | Reload   | Daemon | Discard   | Stale     |
> + * | --------- | -------- | ------ | --------- | --------- |
> + * | Unwritten | x        | x      | x         | x         |
> + * | Clean     | x        | x      | Unwritten | NeedSync  |
> + * | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
> + * | NeedSync  | x        | x      | Unwritten | x         |
> + * | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
> + *
> + * Typical scenarios:
> + *
> + * 1) Create new array
> + * All bits will be set to Unwritten by default, if --assume-clean is set,

                                           default. If

> + * all bits will be set to Clean instead.
> + *
> + * 2) write data, raid1/raid10 have full copy of data, while raid456 doesn't and
> + * rely on xor data
> + *
> + * 2.1) write new data to raid1/raid10:
> + * Unwritten --StartWrite--> Dirty
> + *
> + * 2.2) write new data to raid456:
> + * Unwritten --StartWrite--> NeedSync
> + *
> + * Because the initial recover for raid456 is skipped, the xor data is not build

                                                                              built

> + * yet, the bit must set to NeedSync first and after lazy initial recover is

                   must be set to

> + * finished, the bit will finially set to Dirty(see 5.1 and 5.4);

                             finally be set to

> + *
> + * 2.3) cover write
> + * Clean --StartWrite--> Dirty
> + *
> + * 3) daemon, if the array is not degraded:
> + * Dirty --Daemon--> Clean
> + *
> + * For degraded array, the Dirty bit will never be cleared, prevent full disk

                                                               preventing

> + * recovery while readding a removed disk.

                     reading

> + *
> + * 4) discard
> + * {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
> + *
> + * 5) resync and recover
> + *
> + * 5.1) common process
> + * NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
> + *
> + * 5.2) resync after power failure
> + * Dirty --Reload--> NeedSync
> + *
> + * 5.3) recover while replacing with a new disk
> + * By default, the old bitmap framework will recover all data, and llbitmap
> + * implement this by a new helper, see llbitmap_skip_sync_blocks:

      implements

> + *
> + * skip recover for bits other than dirty or clean;
> + *
> + * 5.4) lazy initial recover for raid5:
> + * By default, the old bitmap framework will only allow new recover when there
> + * are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER is add

                     disk). A new                                           added

> + * to perform raid456 lazy recover for set bits(from 2.2).
> + *
> + * ##### Bitmap IO
> + *
> + * ##### Chunksize
> + *
> + * The default bitmap size is 128k, incluing 1k bitmap super block, and
> + * the default size of segment of data in the array each bit(chunksize) is 64k,
> + * and chunksize will adjust to twice the old size each time if the total number
> + * bits is not less than 127k.(see llbitmap_init)
> + *
> + * ##### READ
> + *
> + * While creating bitmap, all pages will be allocated and read for llbitmap,
> + * there won't be read afterwards
> + *
> + * ##### WRITE
> + *
> + * WRITE IO is divided into logical_block_size of the array, the dirty state
> + * of each block is tracked independently, for example:
> + *
> + * each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 bit;

                                                                 and contains 512 bits;

> + *
> + * | page0 | page1 | ... | page 31 |
> + * |       |
> + * |        \-----------------------\
> + * |                                |
> + * | block0 | block1 | ... | block 8|
> + * |        |
> + * |         \-----------------\
> + * |                            |
> + * | bit0 | bit1 | ... | bit511 |
> + *
> + * From IO path, if one bit is changed to Dirty or NeedSync, the corresponding
> + * subpage will be marked dirty, such block must write first before the IO is
> + * issued. This behaviour will affect IO performance, to reduce the impact, if
> + * multiple bits are changed in the same block in a short time, all bits in this
> + * block will be changed to Dirty/NeedSync, so that there won't be any overhead
> + * until daemon clears dirty bits.
> + *
> + * ##### Dirty Bits syncronization

                       synchronization

[snip]

> +
> +static struct md_sysfs_entry llbitmap_bits =
> +__ATTR_RO(bits);

One line, or if you feel that it must be 2 lines, the second line
should be indented.

> +
> +static struct md_sysfs_entry llbitmap_metadata =
> +__ATTR_RO(metadata);

One line, or if you feel that it must be 2 lines, the second line
should be indented.

> +
> +static struct md_sysfs_entry llbitmap_daemon_sleep =
> +__ATTR_RW(daemon_sleep);
> +

One line, or if you feel that it must be 2 lines, the second line
should be indented.

> +
> +static struct md_sysfs_entry llbitmap_barrier_idle =
> +__ATTR_RW(barrier_idle);
> +

One line, or if you feel that it must be 2 lines, the second line
should be indented.


-- 
~Randy



* Re: [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless bitmap
  2025-08-27  6:07       ` Paul Menzel
@ 2025-08-28  7:10         ` Yu Kuai
  0 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-08-28  7:10 UTC (permalink / raw)
  To: Paul Menzel, Yu Kuai
  Cc: hch, corbet, agk, snitzer, mpatocka, song, xni, hare, linan122,
	colyli, linux-doc, linux-kernel, dm-devel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

On 2025/08/27 14:07, Paul Menzel wrote:
> Dear Kuai,
> 
> 
> Thank you for your reply.
> 
> On 27.08.25 at 05:44, Yu Kuai wrote:
> 
>> On 2025/08/26 17:52, Paul Menzel wrote:
>>> It’d be great if you could motivate why a lockless bitmap is needed
>>> compared to the current implementation.
>>
>> See the performance test results: the old bitmap has a global spinlock and
>> performs poorly with fast disks.
> 
> Yes, but it’s at the end, and not explicitly stated. Should you resend, 
> it’d be great if you could add that.

If there are no suggestions about functionality, I can add the following
at the beginning when I apply this:

Due to known performance issues with md-bitmap and its problematic
implementation details:

  - self-managed IO submission, like filemap_write_page();
  - global spin_lock

I have decided not to continue optimizing based on the current bitmap
implementation.

And the same goes for fixing those typos.

Thanks,
Kuai

> 
>> [snip the typo part]
>>
>>> How can/should this patch be tested/benchmarked?
>>
>> There is a pending mdadm patch; the RFC version can be used. I will work on
>> the formal version after this set is applied.
> 
> Understood. Maybe add a URL to the mdadm patch. (Sorry, in case I missed it.)
> 
>>> --- a/drivers/md/md-bitmap.h
>>> +++ b/drivers/md/md-bitmap.h
>>> @@ -9,10 +9,26 @@
>>>    #define BITMAP_MAGIC 0x6d746962
>>> +/*
>>> + * version 3 is host-endian order, this is deprecated and not used 
>>> for new
>>> + * array
>>> + */
>>> +#define BITMAP_MAJOR_LO        3
>>> +#define BITMAP_MAJOR_HOSTENDIAN    3
>>> +/* version 4 is little-endian order, the default value */
>>> +#define BITMAP_MAJOR_HI        4
>>> +/* version 5 is only used for cluster */
>>> +#define BITMAP_MAJOR_CLUSTERED    5
> 
>>> Move this to the header in a separate patch?
>>
>> I prefer not to, since the old bitmap uses this as well.
> Hmm, I do not understand the answer: as it’s moved in this patch, why
> can’t it be moved in another? But it’s not that important.
> 
> 
> Kind regards,
> 
> Paul
> .
> 



* Re: [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless bitmap
  2025-08-26  8:52 ` [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless bitmap Yu Kuai
  2025-08-26  9:52   ` Paul Menzel
  2025-08-28  4:15   ` Randy Dunlap
@ 2025-08-28 11:24   ` Li Nan
  2025-08-29  1:03     ` Yu Kuai
  2 siblings, 1 reply; 19+ messages in thread
From: Li Nan @ 2025-08-28 11:24 UTC (permalink / raw)
  To: Yu Kuai, hch, corbet, agk, snitzer, mpatocka, song, xni, hare,
	colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yukuai3, yi.zhang,
	yangerkun, johnny.chenyi



On 2025/8/26 16:52, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Redundant data is used to enhance data fault tolerance, and the storage
> method for redundant data vary depending on the RAID levels. And it's
> important to maintain the consistency of redundant data.
> 
> Bitmap is used to record which data blocks have been synchronized and which
> ones need to be resynchronized or recovered. Each bit in the bitmap
> represents a segment of data in the array. When a bit is set, it indicates
> that the multiple redundant copies of that data segment may not be
> consistent. Data synchronization can be performed based on the bitmap after
> power failure or readding a disk. If there is no bitmap, a full disk
> synchronization is required.
> 
> Key Features:
> 
>   - IO fastpath is lockless, if user issues lots of write IO to the same
>   bitmap bit in a short time, only the first write have additional overhead
>   to update bitmap bit, no additional overhead for the following writes;
>   - support only resync or recover written data, means in the case creating
>   new array or replacing with a new disk, there is no need to do a full disk
>   resync/recovery;
> 
> Key Concept:
> 
>   - State Machine:
> 
> Each bit is one byte, contain 6 difference state, see llbitmap_state. And
> there are total 8 differenct actions, see llbitmap_action, can change state:
> 
> llbitmap state machine: transitions between states
> 
> |           | Startwrite | Startsync | Endsync | Abortsync|
> | --------- | ---------- | --------- | ------- | -------  |
> | Unwritten | Dirty      | x         | x       | x        |
> | Clean     | Dirty      | x         | x       | x        |
> | Dirty     | x          | x         | x       | x        |
> | NeedSync  | x          | Syncing   | x       | x        |
> | Syncing   | x          | Syncing   | Dirty   | NeedSync |
> 
> |           | Reload   | Daemon | Discard   | Stale     |
> | --------- | -------- | ------ | --------- | --------- |
> | Unwritten | x        | x      | x         | x         |
> | Clean     | x        | x      | Unwritten | NeedSync  |
> | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
> | NeedSync  | x        | x      | Unwritten | x         |
> | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
> 
> Typical scenarios:
> 
> 1) Create new array
> All bits will be set to Unwritten by default, if --assume-clean is set,
> all bits will be set to Clean instead.
> 
> 2) write data, raid1/raid10 have full copy of data, while raid456 doesn't and
> rely on xor data
> 
> 2.1) write new data to raid1/raid10:
> Unwritten --StartWrite--> Dirty
> 
> 2.2) write new data to raid456:
> Unwritten --StartWrite--> NeedSync
> 
> Because the initial recover for raid456 is skipped, the xor data is not build
> yet, the bit must set to NeedSync first and after lazy initial recover is
> finished, the bit will finially set to Dirty(see 5.1 and 5.4);
> 
> 2.3) cover write
> Clean --StartWrite--> Dirty
> 
> 3) daemon, if the array is not degraded:
> Dirty --Daemon--> Clean
> 
> For degraded array, the Dirty bit will never be cleared, prevent full disk
> recovery while readding a removed disk.
> 
> 4) discard
> {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
> 
> 5) resync and recover
> 
> 5.1) common process
> NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean

There are some issues with the Dirty state:
1. The Dirty bit will not be synced when a disk is re-added.
2. It remains Dirty even after a full recovery -- it should be Clean.

-- 
Thanks,
Nan



* Re: [PATCH v6 md-6.18 11/11] md/md-llbitmap: introduce new lockless bitmap
  2025-08-28 11:24   ` Li Nan
@ 2025-08-29  1:03     ` Yu Kuai
  0 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-08-29  1:03 UTC (permalink / raw)
  To: Li Nan, Yu Kuai, hch, corbet, agk, snitzer, mpatocka, song, xni,
	hare, colyli
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

On 2025/08/28 19:24, Li Nan wrote:
> 
> 
> On 2025/8/26 16:52, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Redundant data is used to enhance data fault tolerance, and the storage
>> method for redundant data vary depending on the RAID levels. And it's
>> important to maintain the consistency of redundant data.
>>
>> Bitmap is used to record which data blocks have been synchronized and 
>> which
>> ones need to be resynchronized or recovered. Each bit in the bitmap
>> represents a segment of data in the array. When a bit is set, it 
>> indicates
>> that the multiple redundant copies of that data segment may not be
>> consistent. Data synchronization can be performed based on the bitmap 
>> after
>> power failure or readding a disk. If there is no bitmap, a full disk
>> synchronization is required.
>>
>> Key Features:
>>
>>   - IO fastpath is lockless, if user issues lots of write IO to the same
>>   bitmap bit in a short time, only the first write have additional 
>> overhead
>>   to update bitmap bit, no additional overhead for the following writes;
>>   - support only resync or recover written data, means in the case 
>> creating
>>   new array or replacing with a new disk, there is no need to do a 
>> full disk
>>   resync/recovery;
>>
>> Key Concept:
>>
>>   - State Machine:
>>
>> Each bit is one byte, contain 6 difference state, see llbitmap_state. And
>> there are total 8 differenct actions, see llbitmap_action, can change 
>> state:
>>
>> llbitmap state machine: transitions between states
>>
>> |           | Startwrite | Startsync | Endsync | Abortsync|
>> | --------- | ---------- | --------- | ------- | -------  |
>> | Unwritten | Dirty      | x         | x       | x        |
>> | Clean     | Dirty      | x         | x       | x        |
>> | Dirty     | x          | x         | x       | x        |
>> | NeedSync  | x          | Syncing   | x       | x        |
>> | Syncing   | x          | Syncing   | Dirty   | NeedSync |
>>
>> |           | Reload   | Daemon | Discard   | Stale     |
>> | --------- | -------- | ------ | --------- | --------- |
>> | Unwritten | x        | x      | x         | x         |
>> | Clean     | x        | x      | Unwritten | NeedSync  |
>> | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
>> | NeedSync  | x        | x      | Unwritten | x         |
>> | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
>>
>> Typical scenarios:
>>
>> 1) Create new array
>> All bits will be set to Unwritten by default, if --assume-clean is set,
>> all bits will be set to Clean instead.
>>
>> 2) write data, raid1/raid10 have full copy of data, while raid456 
>> doesn't and
>> rely on xor data
>>
>> 2.1) write new data to raid1/raid10:
>> Unwritten --StartWrite--> Dirty
>>
>> 2.2) write new data to raid456:
>> Unwritten --StartWrite--> NeedSync
>>
>> Because the initial recover for raid456 is skipped, the xor data is 
>> not build
>> yet, the bit must set to NeedSync first and after lazy initial recover is
>> finished, the bit will finially set to Dirty(see 5.1 and 5.4);
>>
>> 2.3) cover write
>> Clean --StartWrite--> Dirty
>>
>> 3) daemon, if the array is not degraded:
>> Dirty --Daemon--> Clean
>>
>> For degraded array, the Dirty bit will never be cleared, prevent full 
>> disk
>> recovery while readding a removed disk.
>>
>> 4) discard
>> {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
>>
>> 5) resync and recover
>>
>> 5.1) common process
>> NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
> 
> There are some issues with the Dirty state:
> 1. The Dirty bit will not be synced when a disk is re-added.
> 2. It remains Dirty even after a full recovery -- it should be Clean.

We're setting new bits to Dirty for a degraded array, and there is no
further action to change the state to NeedSync before recovery by the
new disk.

This can be fixed by setting new bits directly to NeedSync for a degraded
array; I will do this in the next version.
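
Roughly like the following sketch (an illustration of the idea only, not
the actual patch; startwrite_target() is a made-up helper, and the enum
just mirrors the state names from the commit message):

#include <stdbool.h>
#include <stdio.h>

enum llbitmap_state {
	BitUnwritten,
	BitClean,
	BitDirty,
	BitNeedSync,
	BitSyncing,
	BitStateCount,
};

/* On a degraded array, route a fresh write to NeedSync so the region
 * is recovered once the missing disk returns, instead of leaving the
 * bit Dirty forever. */
static enum llbitmap_state startwrite_target(bool degraded)
{
	return degraded ? BitNeedSync : BitDirty;
}

int main(void)
{
	printf("healthy write  -> %d (BitDirty)\n", startwrite_target(false));
	printf("degraded write -> %d (BitNeedSync)\n", startwrite_target(true));
	return 0;
}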

Thanks,
Kuai
> 


