* [PATCH RFC v2 00/14] md: introduce a new lockless bitmap
@ 2025-03-28 6:08 Yu Kuai
2025-03-28 6:08 ` [PATCH RFC v2 01/14] block: factor out a helper bdev_file_alloc() Yu Kuai
` (15 more replies)
0 siblings, 16 replies; 27+ messages in thread
From: Yu Kuai @ 2025-03-28 6:08 UTC
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
#### Background
Redundant data is used to enhance data fault tolerance, and the storage
method for redundant data varies depending on the RAID level. It is
important to keep the redundant data consistent.
Bitmap is used to record which data blocks have been synchronized and which
ones need to be resynchronized or recovered. Each bit in the bitmap
represents a segment of data in the array. When a bit is set, it indicates
that the multiple redundant copies of that data segment may not be
consistent. Data synchronization can be performed based on the bitmap after
power failure or re-adding a disk. If there is no bitmap, a full disk
synchronization is required.
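To make the "one bit per segment" idea concrete, here is a minimal user-space sketch, not code from this series. It assumes the segment (chunk) size is a power of two expressed in sectors, so that a shift maps a sector to its bit; the helper names chunk_of_sector() and chunks_for_array() are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: index of the bitmap entry that covers @sector,
 * assuming each entry covers (1 << chunkshift) sectors.
 */
static inline uint64_t chunk_of_sector(uint64_t sector, unsigned int chunkshift)
{
	return sector >> chunkshift;
}

/*
 * Hypothetical helper: number of bitmap entries needed to cover an array
 * of @nr_sectors sectors, rounding up for a partial last chunk.
 */
static inline uint64_t chunks_for_array(uint64_t nr_sectors,
					unsigned int chunkshift)
{
	uint64_t chunksize = 1ULL << chunkshift;

	return (nr_sectors + chunksize - 1) >> chunkshift;
}
```

With chunkshift = 7 (128 sectors per chunk), sectors 0 to 127 share entry 0 and sector 128 starts entry 1, so only the chunks touched by a write need to be resynchronized after a failure.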
#### Key Concept
##### State Machine
Each bit is stored as one byte and can be in one of 6 different states
(including the BitNone placeholder), see llbitmap_state. There are 8
different actions in total, see llbitmap_action, that can change the state:
llbitmap state machine: transitions between states
| | Startwrite | Startsync | Endsync | Abortsync| Reload | Daemon | Discard | Stale |
| --------- | ---------- | --------- | ------- | ------- | -------- | ------ | --------- | --------- |
| Unwritten | Dirty | x | x | x | x | x | x | x |
| Clean | Dirty | x | x | x | x | x | Unwritten | NeedSync |
| Dirty | x | x | x | x | NeedSync | Clean | Unwritten | NeedSync |
| NeedSync | x | Syncing | x | x | x | x | Unwritten | x |
| Syncing | x | Syncing | Dirty | NeedSync | NeedSync | x | Unwritten | NeedSync |
Special notes:
- Unwritten is a special state, which means the user has never written data,
hence there is no need to resync/recover data. This is safe if the user
creates a filesystem on the array, since the filesystem makes sure the user
gets zero data for unwritten blocks.
- After resync is done, the state changes from Syncing to Dirty first, in
case Startwrite happens before the state becomes Clean.
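The table can be read as a two-dimensional lookup indexed by current state and action. Below is a small user-space sketch of that lookup, not the kernel implementation; the shortened enum names and next_state() are made up for illustration, and treating an "x" entry as "leave the state unchanged" is an assumption of this sketch.

```c
#include <stdint.h>

enum state {
	ST_UNWRITTEN, ST_CLEAN, ST_DIRTY, ST_NEEDSYNC, ST_SYNCING,
	NR_STATES,
	ST_NONE = 0xff,		/* the "x" entries in the table */
};

enum action {
	ACT_STARTWRITE, ACT_STARTSYNC, ACT_ENDSYNC, ACT_ABORTSYNC,
	ACT_RELOAD, ACT_DAEMON, ACT_DISCARD, ACT_STALE,
	NR_ACTIONS,
};

#define X ST_NONE
/* Rows are current states, columns are actions in table order. */
static const uint8_t transitions[NR_STATES][NR_ACTIONS] = {
	[ST_UNWRITTEN] = { ST_DIRTY, X, X, X, X, X, X, X },
	[ST_CLEAN]     = { ST_DIRTY, X, X, X, X, X, ST_UNWRITTEN, ST_NEEDSYNC },
	[ST_DIRTY]     = { X, X, X, X, ST_NEEDSYNC, ST_CLEAN, ST_UNWRITTEN,
			   ST_NEEDSYNC },
	[ST_NEEDSYNC]  = { X, ST_SYNCING, X, X, X, X, ST_UNWRITTEN, X },
	[ST_SYNCING]   = { X, ST_SYNCING, ST_DIRTY, ST_NEEDSYNC, ST_NEEDSYNC, X,
			   ST_UNWRITTEN, ST_NEEDSYNC },
};
#undef X

/* Return the state after @act; "x" entries keep the current state. */
static inline enum state next_state(enum state cur, enum action act)
{
	uint8_t next = transitions[cur][act];

	return next == ST_NONE ? cur : (enum state)next;
}
```

For example, next_state(ST_SYNCING, ACT_ENDSYNC) yields ST_DIRTY rather than ST_CLEAN, which matches the second note above: the daemon action is what later moves Dirty to Clean.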
##### Bitmap IO
A hidden disk, named mdxxx_bitmap, is created for the bitmap, see details in
llbitmap_add_disk(). A file is created as well to manage bitmap IO for
this disk, see details in llbitmap_open_disk(). Reading/writing the bitmap is
converted to buffered IO to this file.
The IO fast path sets bits to dirty, and those dirty bits are cleared
by the daemon after IO is done. llbitmap_barrier is used to synchronize
between the IO path and the daemon.
Test result: to be added
Notes:
1) the user must apply the following mdadm patch, and then llbitmap can be
enabled by --bitmap=lockless
https://lore.kernel.org/all/20250327134853.1069356-1-yukuai1@huaweicloud.com/
2) this set is based on top of my other set:
https://lore.kernel.org/all/20250219083456.941760-1-yukuai1@huaweicloud.com/
Yu Kuai (14):
block: factor out a helper bdev_file_alloc()
md/md-bitmap: pass discard information to bitmap_{start, end}write
md/md-bitmap: remove parameter slot from bitmap_create()
md: add a new sysfs api bitmap_version
md: delay registeration of bitmap_ops until creating bitmap
md/md-llbitmap: implement bit state machine
md/md-llbitmap: implement hidden disk to manage bitmap IO
md/md-llbitmap: implement APIs for page level dirty bits
synchronization
md/md-llbitmap: implement APIs to mange bitmap lifetime
md/md-llbitmap: implement APIs to dirty bits and clear bits
md/md-llbitmap: implement APIs for sync_thread
md/md-llbitmap: implement all bitmap operations
md/md-llbitmap: implement sysfs APIs
md/md-llbitmap: add Kconfig
block/bdev.c | 21 +-
drivers/md/Kconfig | 12 +
drivers/md/Makefile | 1 +
drivers/md/md-bitmap.c | 10 +-
drivers/md/md-bitmap.h | 21 +-
drivers/md/md-llbitmap.c | 1410 ++++++++++++++++++++++++++++++++++++++
drivers/md/md.c | 180 ++++-
drivers/md/md.h | 3 +
include/linux/blkdev.h | 1 +
9 files changed, 1614 insertions(+), 45 deletions(-)
create mode 100644 drivers/md/md-llbitmap.c
--
2.39.2
* [PATCH RFC v2 01/14] block: factor out a helper bdev_file_alloc()
From: Yu Kuai @ 2025-03-28 6:08 UTC
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
Factor out bdev_file_alloc() so that a bdev_file can be allocated without
opening the bdev. In following patches mdraid will create a hidden disk to
manage the internal bitmap; the hidden disk can't be opened by the user, so
mdraid will initialize this file itself to manage bitmap IO.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/bdev.c | 21 ++++++++++++++++-----
include/linux/blkdev.h | 1 +
2 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index 9d73a8fbf7f9..6b4ba6cb04c9 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -989,12 +989,26 @@ static unsigned blk_to_file_flags(blk_mode_t mode)
return flags;
}
+struct file *bdev_file_alloc(struct block_device *bdev, blk_mode_t mode)
+{
+ unsigned int flags = blk_to_file_flags(mode);
+ struct file *bdev_file;
+
+ bdev_file = alloc_file_pseudo_noaccount(BD_INODE(bdev),
+ blockdev_mnt, "", flags | O_LARGEFILE, &def_blk_fops);
+
+ if (!IS_ERR(bdev_file))
+ ihold(BD_INODE(bdev));
+
+ return bdev_file;
+}
+EXPORT_SYMBOL_GPL(bdev_file_alloc);
+
struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
const struct blk_holder_ops *hops)
{
struct file *bdev_file;
struct block_device *bdev;
- unsigned int flags;
int ret;
ret = bdev_permission(dev, mode, holder);
@@ -1005,14 +1019,11 @@ struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
if (!bdev)
return ERR_PTR(-ENXIO);
- flags = blk_to_file_flags(mode);
- bdev_file = alloc_file_pseudo_noaccount(BD_INODE(bdev),
- blockdev_mnt, "", flags | O_LARGEFILE, &def_blk_fops);
+ bdev_file = bdev_file_alloc(bdev, mode);
if (IS_ERR(bdev_file)) {
blkdev_put_no_open(bdev);
return bdev_file;
}
- ihold(BD_INODE(bdev));
ret = bdev_open(bdev, mode, holder, hops, bdev_file);
if (ret) {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 248416ecd01c..dede6721374a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1639,6 +1639,7 @@ extern const struct blk_holder_ops fs_holder_ops;
(BLK_OPEN_READ | BLK_OPEN_RESTRICT_WRITES | \
(((flags) & SB_RDONLY) ? 0 : BLK_OPEN_WRITE))
+struct file *bdev_file_alloc(struct block_device *bdev, blk_mode_t mode);
struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
const struct blk_holder_ops *hops);
struct file *bdev_file_open_by_path(const char *path, blk_mode_t mode,
--
2.39.2
* [PATCH RFC v2 02/14] md/md-bitmap: pass discard information to bitmap_{start, end}write
From: Yu Kuai @ 2025-03-28 6:08 UTC
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
The new parameter is not used for now; it prepares for handling discard in
llbitmap in following patches.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/md-bitmap.c | 4 ++--
drivers/md/md-bitmap.h | 4 ++--
drivers/md/md.c | 10 ++++++++--
drivers/md/md.h | 1 +
4 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 733fbb886f67..0cef5c199d32 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1667,7 +1667,7 @@ __acquires(bitmap->lock)
}
static int bitmap_startwrite(struct mddev *mddev, sector_t offset,
- unsigned long sectors)
+ unsigned long sectors, bool is_discard)
{
struct bitmap *bitmap = mddev->bitmap;
@@ -1722,7 +1722,7 @@ static int bitmap_startwrite(struct mddev *mddev, sector_t offset,
}
static void bitmap_endwrite(struct mddev *mddev, sector_t offset,
- unsigned long sectors)
+ unsigned long sectors, bool is_discard)
{
struct bitmap *bitmap = mddev->bitmap;
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index d3d50629af91..504d33d4980b 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -91,9 +91,9 @@ struct bitmap_operations {
void (*wait_behind_writes)(struct mddev *mddev);
int (*startwrite)(struct mddev *mddev, sector_t offset,
- unsigned long sectors);
+ unsigned long sectors, bool is_discard);
void (*endwrite)(struct mddev *mddev, sector_t offset,
- unsigned long sectors);
+ unsigned long sectors, bool is_discard);
bool (*start_sync)(struct mddev *mddev, sector_t offset,
sector_t *blocks, bool degraded);
void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4a9aa6879e98..c06c41e39609 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8805,13 +8805,15 @@ static void md_bitmap_start(struct mddev *mddev,
&md_io_clone->sectors);
mddev->bitmap_ops->startwrite(mddev, md_io_clone->offset,
- md_io_clone->sectors);
+ md_io_clone->sectors,
+ md_io_clone->is_discard);
}
static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone)
{
mddev->bitmap_ops->endwrite(mddev, md_io_clone->offset,
- md_io_clone->sectors);
+ md_io_clone->sectors,
+ md_io_clone->is_discard);
}
static void md_end_clone_io(struct bio *bio)
@@ -8850,6 +8852,10 @@ static void md_clone_bio(struct mddev *mddev, struct bio **bio)
if (bio_data_dir(*bio) == WRITE && md_bitmap_enabled(mddev)) {
md_io_clone->offset = (*bio)->bi_iter.bi_sector;
md_io_clone->sectors = bio_sectors(*bio);
+ if (unlikely(bio_op(*bio) == REQ_OP_DISCARD))
+ md_io_clone->is_discard = true;
+ else
+ md_io_clone->is_discard = false;
md_bitmap_start(mddev, md_io_clone);
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 254bbab6f443..ad18ef9b5061 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -858,6 +858,7 @@ struct md_io_clone {
unsigned long start_time;
sector_t offset;
unsigned long sectors;
+ bool is_discard;
struct bio bio_clone;
};
--
2.39.2
* [PATCH RFC v2 03/14] md/md-bitmap: remove parameter slot from bitmap_create()
From: Yu Kuai @ 2025-03-28 6:08 UTC
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
All callers pass in '-1' for 'slot', hence it can be removed.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/md-bitmap.c | 6 +++---
drivers/md/md-bitmap.h | 2 +-
drivers/md/md.c | 6 +++---
3 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 0cef5c199d32..3d1e9501823d 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -2183,9 +2183,9 @@ static struct bitmap *__bitmap_create(struct mddev *mddev, int slot)
return ERR_PTR(err);
}
-static int bitmap_create(struct mddev *mddev, int slot)
+static int bitmap_create(struct mddev *mddev)
{
- struct bitmap *bitmap = __bitmap_create(mddev, slot);
+ struct bitmap *bitmap = __bitmap_create(mddev, -1);
if (IS_ERR(bitmap))
return PTR_ERR(bitmap);
@@ -2648,7 +2648,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
}
mddev->bitmap_info.offset = offset;
- rv = bitmap_create(mddev, -1);
+ rv = bitmap_create(mddev);
if (rv)
goto out;
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 504d33d4980b..5d579f0b0c3a 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -74,7 +74,7 @@ struct bitmap_operations {
struct md_submodule_head head;
bool (*enabled)(void *data);
- int (*create)(struct mddev *mddev, int slot);
+ int (*create)(struct mddev *mddev);
int (*resize)(struct mddev *mddev, sector_t blocks, int chunksize);
int (*load)(struct mddev *mddev);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index c06c41e39609..c4fbd4d6a9f1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6215,7 +6215,7 @@ int md_run(struct mddev *mddev)
}
if (err == 0 && pers->sync_request && md_bitmap_registered(mddev) &&
(mddev->bitmap_info.file || mddev->bitmap_info.offset)) {
- err = mddev->bitmap_ops->create(mddev, -1);
+ err = mddev->bitmap_ops->create(mddev);
if (err)
pr_warn("%s: failed to create bitmap (%d)\n",
mdname(mddev), err);
@@ -7284,7 +7284,7 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
err = 0;
if (mddev->pers) {
if (fd >= 0) {
- err = mddev->bitmap_ops->create(mddev, -1);
+ err = mddev->bitmap_ops->create(mddev);
if (!err)
err = mddev->bitmap_ops->load(mddev);
@@ -7608,7 +7608,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
mddev->bitmap_info.default_offset;
mddev->bitmap_info.space =
mddev->bitmap_info.default_space;
- rv = mddev->bitmap_ops->create(mddev, -1);
+ rv = mddev->bitmap_ops->create(mddev);
if (!rv)
rv = mddev->bitmap_ops->load(mddev);
--
2.39.2
* [PATCH RFC v2 04/14] md: add a new sysfs api bitmap_version
From: Yu Kuai @ 2025-03-28 6:08 UTC
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
The API will be used by mdadm to set the bitmap version while creating a new
array or assembling an array, in preparation for adding a new bitmap.
Currently available options are:
cat /sys/block/md0/md/bitmap_version
none [bitmap]
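The bracket convention in that output (the active choice in square brackets, each entry followed by a space) can be mimicked in user space. The sketch below is illustrative only: format_bitmap_versions() is a hypothetical helper, not part of this patch, and it uses snprintf() with explicit bounds rather than the kernel's sprintf() into a sysfs page.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write all @names into @buf, putting the entry at
 * index @selected in square brackets. Returns the string length, or -1 if
 * @buf is too small.
 */
static inline int format_bitmap_versions(char *buf, size_t size,
					 const char *const *names,
					 size_t count, size_t selected)
{
	size_t len = 0;

	for (size_t i = 0; i < count; i++) {
		int n = snprintf(buf + len, size - len,
				 i == selected ? "[%s] " : "%s ", names[i]);

		if (n < 0 || (size_t)n >= size - len)
			return -1;	/* entry would not fit */
		len += (size_t)n;
	}

	if (size - len < 2)
		return -1;		/* no room for "\n" and the NUL */
	buf[len++] = '\n';
	buf[len] = '\0';
	return (int)len;
}
```

Called with the names "none" and "bitmap" and selected = 1, the buffer holds the same line as shown above, including the trailing space before the newline.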
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/md.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++---
drivers/md/md.h | 2 ++
2 files changed, 84 insertions(+), 5 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index c4fbd4d6a9f1..6ec8b5311a0a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -649,13 +649,13 @@ static void active_io_release(struct percpu_ref *ref)
static void no_op(struct percpu_ref *r) {}
-static void mddev_set_bitmap_ops(struct mddev *mddev, enum md_submodule_id id)
+static void mddev_set_bitmap_ops(struct mddev *mddev)
{
xa_lock(&md_submodule);
- mddev->bitmap_ops = xa_load(&md_submodule, id);
+ mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
xa_unlock(&md_submodule);
if (!mddev->bitmap_ops)
- pr_warn_once("md: can't find bitmap id %d\n", id);
+ pr_warn_once("md: can't find bitmap id %d\n", mddev->bitmap_id);
}
static void mddev_clear_bitmap_ops(struct mddev *mddev)
@@ -665,8 +665,8 @@ static void mddev_clear_bitmap_ops(struct mddev *mddev)
int mddev_init(struct mddev *mddev)
{
- /* TODO: support more versions */
- mddev_set_bitmap_ops(mddev, ID_BITMAP);
+ mddev->bitmap_id = ID_BITMAP;
+ mddev_set_bitmap_ops(mddev);
if (percpu_ref_init(&mddev->active_io, active_io_release,
PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
@@ -4145,6 +4145,82 @@ new_level_store(struct mddev *mddev, const char *buf, size_t len)
static struct md_sysfs_entry md_new_level =
__ATTR(new_level, 0664, new_level_show, new_level_store);
+static ssize_t
+bitmap_version_show(struct mddev *mddev, char *page)
+{
+ struct md_submodule_head *head;
+ unsigned long i;
+ ssize_t len = 0;
+
+ if (mddev->bitmap_id == ID_BITMAP_NONE)
+ len += sprintf(page + len, "[none] ");
+ else
+ len += sprintf(page + len, "none ");
+
+ xa_lock(&md_submodule);
+ xa_for_each(&md_submodule, i, head) {
+ if (head->type != MD_BITMAP)
+ continue;
+
+ if (mddev->bitmap_id == head->id)
+ len += sprintf(page + len, "[%s] ", head->name);
+ else
+ len += sprintf(page + len, "%s ", head->name);
+ }
+ xa_unlock(&md_submodule);
+
+ len += sprintf(page + len, "\n");
+ return len;
+}
+
+static ssize_t
+bitmap_version_store(struct mddev *mddev, const char *buf, size_t len)
+{
+ struct md_submodule_head *head;
+ enum md_submodule_id id;
+ unsigned long i;
+ int err;
+
+ if (mddev->bitmap_ops)
+ return -EBUSY;
+
+ err = kstrtoint(buf, 10, &id);
+ if (!err) {
+ if (id == ID_BITMAP_NONE) {
+ mddev->bitmap_id = id;
+ return len;
+ }
+
+ xa_lock(&md_submodule);
+ head = xa_load(&md_submodule, id);
+ xa_unlock(&md_submodule);
+
+ if (head && head->type == MD_BITMAP) {
+ mddev->bitmap_id = id;
+ return len;
+ }
+ }
+
+ if (cmd_match(buf, "none")) {
+ mddev->bitmap_id = ID_BITMAP_NONE;
+ return len;
+ }
+
+ xa_lock(&md_submodule);
+ xa_for_each(&md_submodule, i, head) {
+ if (head->type == MD_BITMAP && cmd_match(buf, head->name)) {
+ mddev->bitmap_id = head->id;
+ xa_unlock(&md_submodule);
+ return len;
+ }
+ }
+ xa_unlock(&md_submodule);
+ return -ENOENT;
+}
+
+static struct md_sysfs_entry md_bitmap_version =
+__ATTR(bitmap_version, 0664, bitmap_version_show, bitmap_version_store);
+
static ssize_t
layout_show(struct mddev *mddev, char *page)
{
@@ -5680,6 +5756,7 @@ __ATTR(serialize_policy, S_IRUGO | S_IWUSR, serialize_policy_show,
static struct attribute *md_default_attrs[] = {
&md_level.attr,
&md_new_level.attr,
+ &md_bitmap_version.attr,
&md_layout.attr,
&md_raid_disks.attr,
&md_uuid.attr,
diff --git a/drivers/md/md.h b/drivers/md/md.h
index ad18ef9b5061..3bc367f6bd62 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -40,6 +40,7 @@ enum md_submodule_id {
ID_CLUSTER,
ID_BITMAP,
ID_LLBITMAP, /* TODO */
+ ID_BITMAP_NONE,
};
struct md_submodule_head {
@@ -562,6 +563,7 @@ struct mddev {
struct percpu_ref writes_pending;
int sync_checkers; /* # of threads checking writes_pending */
+ enum md_submodule_id bitmap_id;
void *bitmap; /* the bitmap for the device */
struct bitmap_operations *bitmap_ops;
struct {
--
2.39.2
* [PATCH RFC v2 05/14] md: delay registeration of bitmap_ops until creating bitmap
From: Yu Kuai @ 2025-03-28 6:08 UTC
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
Currently bitmap_ops is registered while allocating mddev. This is fine
when there is only one bitmap_ops; however, after introducing a new
bitmap_ops, user space needs a time window to choose which bitmap_ops to
use while creating a new array.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/md.c | 84 ++++++++++++++++++++++++++++++++-----------------
1 file changed, 55 insertions(+), 29 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 6ec8b5311a0a..c1f13288069a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -651,32 +651,47 @@ static void no_op(struct percpu_ref *r) {}
static void mddev_set_bitmap_ops(struct mddev *mddev)
{
+ struct bitmap_operations *old = mddev->bitmap_ops;
+ struct md_submodule_head *head;
+
+ if (mddev->bitmap_id == ID_BITMAP_NONE ||
+ (old && old->head.id == mddev->bitmap_id))
+ return;
+
xa_lock(&md_submodule);
- mddev->bitmap_ops = xa_load(&md_submodule, mddev->bitmap_id);
+ head = xa_load(&md_submodule, mddev->bitmap_id);
xa_unlock(&md_submodule);
- if (!mddev->bitmap_ops)
- pr_warn_once("md: can't find bitmap id %d\n", mddev->bitmap_id);
+
+ if (WARN_ON_ONCE(!head || head->type != MD_BITMAP)) {
+ pr_err("md: can't find bitmap id %d\n", mddev->bitmap_id);
+ return;
+ }
+
+ if (old && old->group)
+ sysfs_remove_group(&mddev->kobj, old->group);
+
+ mddev->bitmap_ops = (void *)head;
+ if (mddev->bitmap_ops && mddev->bitmap_ops->group &&
+ sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group))
+ pr_warn("md: cannot register extra bitmap attributes for %s\n",
+ mdname(mddev));
}
static void mddev_clear_bitmap_ops(struct mddev *mddev)
{
+ if (mddev->bitmap_ops && mddev->bitmap_ops->group)
+ sysfs_remove_group(&mddev->kobj, mddev->bitmap_ops->group);
mddev->bitmap_ops = NULL;
}
int mddev_init(struct mddev *mddev)
{
- mddev->bitmap_id = ID_BITMAP;
- mddev_set_bitmap_ops(mddev);
-
if (percpu_ref_init(&mddev->active_io, active_io_release,
- PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
- mddev_clear_bitmap_ops(mddev);
+ PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
return -ENOMEM;
- }
if (percpu_ref_init(&mddev->writes_pending, no_op,
PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
- mddev_clear_bitmap_ops(mddev);
percpu_ref_exit(&mddev->active_io);
return -ENOMEM;
}
@@ -714,7 +729,6 @@ EXPORT_SYMBOL_GPL(mddev_init);
void mddev_destroy(struct mddev *mddev)
{
- mddev_clear_bitmap_ops(mddev);
percpu_ref_exit(&mddev->active_io);
percpu_ref_exit(&mddev->writes_pending);
}
@@ -6046,11 +6060,6 @@ struct mddev *md_alloc(dev_t dev, char *name)
return ERR_PTR(error);
}
- if (md_bitmap_registered(mddev) && mddev->bitmap_ops->group)
- if (sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group))
- pr_warn("md: cannot register extra bitmap attributes for %s\n",
- mdname(mddev));
-
kobject_uevent(&mddev->kobj, KOBJ_ADD);
mddev->sysfs_state = sysfs_get_dirent_safe(mddev->kobj.sd, "array_state");
mddev->sysfs_level = sysfs_get_dirent_safe(mddev->kobj.sd, "level");
@@ -6126,6 +6135,25 @@ static void md_safemode_timeout(struct timer_list *t)
static int start_dirty_degraded;
+static int md_bitmap_create(struct mddev *mddev)
+{
+ if (!md_bitmap_registered(mddev))
+ mddev_set_bitmap_ops(mddev);
+ if (!mddev->bitmap_ops)
+ return -ENOENT;
+
+ return mddev->bitmap_ops->create(mddev);
+}
+
+static void md_bitmap_destroy(struct mddev *mddev)
+{
+ if (!md_bitmap_registered(mddev))
+ return;
+
+ mddev->bitmap_ops->destroy(mddev);
+ mddev_clear_bitmap_ops(mddev);
+}
+
int md_run(struct mddev *mddev)
{
int err;
@@ -6290,9 +6318,9 @@ int md_run(struct mddev *mddev)
(unsigned long long)pers->size(mddev, 0, 0) / 2);
err = -EINVAL;
}
- if (err == 0 && pers->sync_request && md_bitmap_registered(mddev) &&
+ if (err == 0 && pers->sync_request &&
(mddev->bitmap_info.file || mddev->bitmap_info.offset)) {
- err = mddev->bitmap_ops->create(mddev);
+ err = md_bitmap_create(mddev);
if (err)
pr_warn("%s: failed to create bitmap (%d)\n",
mdname(mddev), err);
@@ -6365,8 +6393,7 @@ int md_run(struct mddev *mddev)
pers->free(mddev, mddev->private);
mddev->private = NULL;
put_pers(pers);
- if (md_bitmap_registered(mddev))
- mddev->bitmap_ops->destroy(mddev);
+ md_bitmap_destroy(mddev);
abort:
bioset_exit(&mddev->io_clone_set);
exit_sync_set:
@@ -6389,7 +6416,7 @@ int do_md_run(struct mddev *mddev)
if (md_bitmap_registered(mddev)) {
err = mddev->bitmap_ops->load(mddev);
if (err) {
- mddev->bitmap_ops->destroy(mddev);
+ md_bitmap_destroy(mddev);
goto out;
}
}
@@ -6580,8 +6607,7 @@ static void __md_stop(struct mddev *mddev)
{
struct md_personality *pers = mddev->pers;
- if (md_bitmap_registered(mddev))
- mddev->bitmap_ops->destroy(mddev);
+ md_bitmap_destroy(mddev);
mddev_detach(mddev);
spin_lock(&mddev->lock);
mddev->pers = NULL;
@@ -7361,16 +7387,16 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
err = 0;
if (mddev->pers) {
if (fd >= 0) {
- err = mddev->bitmap_ops->create(mddev);
+ err = md_bitmap_create(mddev);
if (!err)
err = mddev->bitmap_ops->load(mddev);
if (err) {
- mddev->bitmap_ops->destroy(mddev);
+ md_bitmap_destroy(mddev);
fd = -1;
}
} else if (fd < 0) {
- mddev->bitmap_ops->destroy(mddev);
+ md_bitmap_destroy(mddev);
}
}
@@ -7685,12 +7711,12 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
mddev->bitmap_info.default_offset;
mddev->bitmap_info.space =
mddev->bitmap_info.default_space;
- rv = mddev->bitmap_ops->create(mddev);
+ rv = md_bitmap_create(mddev);
if (!rv)
rv = mddev->bitmap_ops->load(mddev);
if (rv)
- mddev->bitmap_ops->destroy(mddev);
+ md_bitmap_destroy(mddev);
} else {
struct md_bitmap_stats stats;
@@ -7716,7 +7742,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
put_cluster_ops(mddev);
mddev->safemode_delay = DEFAULT_SAFEMODE_DELAY;
}
- mddev->bitmap_ops->destroy(mddev);
+ md_bitmap_destroy(mddev);
mddev->bitmap_info.offset = 0;
}
}
--
2.39.2
* [PATCH RFC v2 06/14] md/md-llbitmap: implement bit state machine
From: Yu Kuai @ 2025-03-28 6:08 UTC
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
Each bit is one byte and has 6 different states, and there are 8 different
actions in total that can change the state, see details in the following table:
| | Startwrite | Startsync | Endsync | Abortsync| Reload | Daemon | Discard | Stale |
| --------- | ---------- | --------- | ------- | ------- | -------- | ------ | --------- | --------- |
| Unwritten | Dirty | x | x | x | x | x | x | x |
| Clean | Dirty | x | x | x | x | x | Unwritten | NeedSync |
| Dirty | x | x | x | x | NeedSync | Clean | Unwritten | NeedSync |
| NeedSync | x | Syncing | x | x | x | x | Unwritten | x |
| Syncing | x | Syncing | Dirty | NeedSync | NeedSync | x | Unwritten | NeedSync |
This patch implements the state machine first, and following patches will
use it to implement the new llbitmap.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/md-llbitmap.c | 256 +++++++++++++++++++++++++++++++++++++++
1 file changed, 256 insertions(+)
create mode 100644 drivers/md/md-llbitmap.c
diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
new file mode 100644
index 000000000000..1f97b6868279
--- /dev/null
+++ b/drivers/md/md-llbitmap.c
@@ -0,0 +1,256 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <linux/blkdev.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/timer.h>
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/buffer_head.h>
+#include <linux/seq_file.h>
+#include <trace/events/block.h>
+
+#include "md.h"
+#include "md-bitmap.h"
+
+/*
+ * #### Background
+ *
+ * Redundant data is used to enhance data fault tolerance, and the storage
+ * method for redundant data varies depending on the RAID level. It is
+ * important to keep the redundant data consistent.
+ *
+ * Bitmap is used to record which data blocks have been synchronized and which
+ * ones need to be resynchronized or recovered. Each bit in the bitmap
+ * represents a segment of data in the array. When a bit is set, it indicates
+ * that the multiple redundant copies of that data segment may not be
+ * consistent. Data synchronization can be performed based on the bitmap after
+ * power failure or re-adding a disk. If there is no bitmap, a full disk
+ * synchronization is required.
+ *
+ * #### Key Concept
+ *
+ * ##### State Machine
+ *
+ * Each bit is one byte and has 6 different states, see llbitmap_state. And
+ * there are 8 different actions in total, see llbitmap_action, that change state:
+ *
+ * llbitmap state machine: transitions between states
+ *
+ * | | Startwrite | Startsync | Endsync | Abortsync| Reload | Daemon | Discard | Stale |
+ * | --------- | ---------- | --------- | ------- | ------- | -------- | ------ | --------- | --------- |
+ * | Unwritten | Dirty | x | x | x | x | x | x | x |
+ * | Clean | Dirty | x | x | x | x | x | Unwritten | NeedSync |
+ * | Dirty | x | x | x | x | NeedSync | Clean | Unwritten | NeedSync |
+ * | NeedSync | x | Syncing | x | x | x | x | Unwritten | x |
+ * | Syncing | x | Syncing | Dirty | NeedSync | NeedSync | x | Unwritten | NeedSync |
+ *
+ * Special notes:
+ * - Unwritten is a special state, which means the user has never written data,
+ * hence there is no need to resync/recover data. This is safe if the user
+ * creates a filesystem on the array, since the filesystem makes sure the
+ * user gets zero data for unwritten blocks.
+ * - After resync is done, the state changes from Syncing to Dirty first, in
+ * case Startwrite happens before the state becomes Clean.
+ */
+
+#define BITMAP_MAX_PAGES 32
+#define BITMAP_SB_SIZE 1024
+
+enum llbitmap_state {
+ /* No valid data, init state after assembling the array */
+ BitUnwritten = 0,
+ /* data is consistent */
+ BitClean,
+ /* data will be consistent after IO is done, set directly for writes */
+ BitDirty,
+ /*
+ * data needs to be resynchronized:
+ * 1) set directly for writes if the array is degraded, to prevent a full
+ * disk synchronization after re-adding a disk;
+ * 2) the array is reassembled after power failure, and dirty bits are
+ * found after reloading the bitmap;
+ */
+ BitNeedSync,
+ /* data is being synchronized */
+ BitSyncing,
+ nr_llbitmap_state,
+ BitNone = 0xff,
+};
+
+enum llbitmap_action {
+ /* User writes new data, this is the only action from IO fast path */
+ BitmapActionStartwrite = 0,
+ /* Start recovery */
+ BitmapActionStartsync,
+ /* Finish recovery */
+ BitmapActionEndsync,
+ /* Failed recovery */
+ BitmapActionAbortsync,
+ /* Reassemble the array */
+ BitmapActionReload,
+ /* Daemon thread is trying to clear dirty bits */
+ BitmapActionDaemon,
+ /* Data is deleted */
+ BitmapActionDiscard,
+ /*
+ * Bitmap is stale, mark all bits except BitUnwritten as
+ * BitNeedSync.
+ */
+ BitmapActionStale,
+ nr_llbitmap_action,
+ /* Init state is BitUnwritten */
+ BitmapActionInit,
+};
+
+struct llbitmap {
+ struct mddev *mddev;
+ /* hidden disk to manage bitmap IO */
+ struct gendisk *bitmap_disk;
+ /* opened hidden disk */
+ struct file *bitmap_file;
+ int nr_pages;
+ struct page *pages[BITMAP_MAX_PAGES];
+
+ struct bio_set bio_set;
+ struct bio_list retry_list;
+ struct work_struct retry_work;
+ spinlock_t retry_lock;
+
+ /* shift of one chunk */
+ unsigned long chunkshift;
+ /* size of one chunk in sectors */
+ unsigned long chunksize;
+ /* total number of chunks */
+ unsigned long chunks;
+ /* fires on first BitDirty state */
+ struct timer_list pending_timer;
+ struct work_struct daemon_work;
+
+ unsigned long flags;
+ __u64 events_cleared;
+};
+
+static char state_machine[nr_llbitmap_state][nr_llbitmap_action] = {
+ [BitUnwritten] = {BitDirty, BitNone, BitNone, BitNone, BitNone, BitNone, BitNone, BitNone},
+ [BitClean] = {BitDirty, BitNone, BitNone, BitNone, BitNone, BitNone, BitUnwritten, BitNeedSync},
+ [BitDirty] = {BitNone, BitNone, BitNone, BitNone, BitNeedSync, BitClean, BitUnwritten, BitNeedSync},
+ [BitNeedSync] = {BitNone, BitSyncing, BitNone, BitNone, BitNone, BitNone, BitUnwritten, BitNone},
+ [BitSyncing] = {BitNone, BitSyncing, BitDirty, BitNeedSync, BitNeedSync, BitNone, BitUnwritten, BitNeedSync},
+};
+
+static enum llbitmap_state state_from_page(struct page *page, loff_t pos)
+{
+ u8 *p = kmap_local_page(page);
+ enum llbitmap_state state = p[offset_in_page(pos)];
+
+ kunmap_local(p);
+ return state;
+}
+
+static void state_to_page(struct page *page, enum llbitmap_state state,
+ loff_t pos)
+{
+ u8 *p = kmap_local_page(page);
+
+ p[offset_in_page(pos)] = state;
+ set_page_dirty(page);
+ kunmap_local(p);
+}
+
+static int llbitmap_read(struct llbitmap *llbitmap, enum llbitmap_state *state,
+ loff_t pos)
+{
+ pos += BITMAP_SB_SIZE;
+ *state = state_from_page(llbitmap->pages[pos >> PAGE_SHIFT], pos);
+ return 0;
+}
+
+static int llbitmap_write(struct llbitmap *llbitmap, enum llbitmap_state state,
+ loff_t pos)
+{
+ pos += BITMAP_SB_SIZE;
+ state_to_page(llbitmap->pages[pos >> PAGE_SHIFT], state, pos);
+ return 0;
+}
+
+/* The return value is only used from resync, where @start == @end. */
+static enum llbitmap_state llbitmap_state_machine(struct llbitmap *llbitmap,
+ unsigned long start,
+ unsigned long end,
+ enum llbitmap_action action)
+{
+ struct mddev *mddev = llbitmap->mddev;
+ enum llbitmap_state state = BitNone;
+ bool need_recovery = false;
+
+ if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags))
+ return BitNone;
+
+ while (start <= end) {
+ ssize_t ret;
+ enum llbitmap_state c;
+
+ if (action == BitmapActionInit) {
+ state = BitUnwritten;
+ ret = llbitmap_write(llbitmap, state, start);
+ if (ret < 0) {
+ set_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
+ return BitNone;
+ }
+
+ start++;
+ continue;
+ }
+
+ ret = llbitmap_read(llbitmap, &c, start);
+ if (ret < 0) {
+ set_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
+ return BitNone;
+ }
+
+ if (c < 0 || c >= nr_llbitmap_state) {
+ pr_err("%s: invalid bit %lu state %d action %d, forcing resync\n",
+ __func__, start, c, action);
+ state = BitNeedSync;
+ goto write_bitmap;
+ }
+
+ if (c == BitNeedSync)
+ need_recovery = true;
+
+ state = state_machine[c][action];
+ if (state == BitNone) {
+ start++;
+ continue;
+ }
+
+write_bitmap:
+ ret = llbitmap_write(llbitmap, state, start);
+ if (ret < 0) {
+ set_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
+ return BitNone;
+ }
+
+ if (state == BitNeedSync)
+ need_recovery = true;
+ else if (state == BitDirty &&
+ !timer_pending(&llbitmap->pending_timer))
+ mod_timer(&llbitmap->pending_timer,
+ jiffies + mddev->bitmap_info.daemon_sleep * HZ);
+
+ start++;
+ }
+
+ if (need_recovery) {
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+ set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+ md_wakeup_thread(mddev->thread);
+ }
+
+ return state;
+}
--
2.39.2
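For anyone who wants to experiment with the transition table outside the
kernel, the lookup can be sketched in user space roughly as below. This is
only an illustration under stated assumptions: `apply_action()`, `table` and
the shortened enum constants are names made up for the sketch, not part of
the patch, and the timer and recovery side effects of
llbitmap_state_machine() are left out. The table uses uint8_t so that the
0xff "no change" marker compares equal regardless of the signedness of char.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space mirror of llbitmap_state / llbitmap_action. */
enum state { UNWRITTEN, CLEAN, DIRTY, NEED_SYNC, SYNCING, NR_STATE, NONE = 0xff };
enum action {
	STARTWRITE, STARTSYNC, ENDSYNC, ABORTSYNC,
	RELOAD, DAEMON, DISCARD, STALE, NR_ACTION
};

/* Same contents as state_machine[][] in the patch; NONE means "no change". */
static const uint8_t table[NR_STATE][NR_ACTION] = {
	[UNWRITTEN] = { DIRTY, NONE, NONE, NONE, NONE, NONE, NONE, NONE },
	[CLEAN]     = { DIRTY, NONE, NONE, NONE, NONE, NONE, UNWRITTEN, NEED_SYNC },
	[DIRTY]     = { NONE, NONE, NONE, NONE, NEED_SYNC, CLEAN, UNWRITTEN, NEED_SYNC },
	[NEED_SYNC] = { NONE, SYNCING, NONE, NONE, NONE, NONE, UNWRITTEN, NONE },
	[SYNCING]   = { NONE, SYNCING, DIRTY, NEED_SYNC, NEED_SYNC, NONE, UNWRITTEN, NEED_SYNC },
};

/*
 * Apply one action to one state byte. Returns the new state, or NONE if the
 * byte was left unchanged. An out-of-range byte is forced to NEED_SYNC.
 */
static uint8_t apply_action(uint8_t *byte, enum action action)
{
	uint8_t next;

	assert(action < NR_ACTION);

	if (*byte >= NR_STATE) {
		*byte = NEED_SYNC;
		return NEED_SYNC;
	}

	next = table[*byte][action];
	if (next != NONE)
		*byte = next;
	return next;
}
```

Starting from UNWRITTEN, a Startwrite followed by a Daemon pass walks a byte
through DIRTY to CLEAN; a second Startwrite on an already DIRTY byte returns
NONE and leaves the byte alone, which is what keeps the fast path cheap.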
* [PATCH RFC v2 07/14] md/md-llbitmap: implement hidden disk to manage bitmap IO
2025-03-28 6:08 [PATCH RFC v2 00/14] md: introduce a new lockless bitmap Yu Kuai
` (5 preceding siblings ...)
2025-03-28 6:08 ` [PATCH RFC v2 06/14] md/md-llbitmap: implement bit state machine Yu Kuai
@ 2025-03-28 6:08 ` Yu Kuai
2025-03-28 6:08 ` [PATCH RFC v2 08/14] md/md-llbitmap: implement APIs for page level dirty bits synchronization Yu Kuai
` (8 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: Yu Kuai @ 2025-03-28 6:08 UTC (permalink / raw)
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
The bitmap is stored on each member disk. The old bitmap implementation
allocates memory and manages the data by itself: reads and writes attach
the allocated pages to bios for the member disks, and a bitmap-level
spinlock is used for synchronization.
For llbitmap, a hidden disk named mdxxx_bitmap is created for the bitmap;
see llbitmap_add_disk() for details. A file is created as well to manage
bitmap IO for this disk. Bitmap reads and writes are converted to buffered
IO on this file.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/md-llbitmap.c | 238 +++++++++++++++++++++++++++++++++++++++
1 file changed, 238 insertions(+)
diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index 1f97b6868279..bbd8a7c99577 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -56,8 +56,16 @@
* unwritten blocks.
* - After resync is done, change state from Syncing to Dirty first, in case
* Startwrite happen before the state is Clean.
+ *
+ * ##### Bitmap IO
+ *
+ * A hidden disk, named mdxxx_bitmap, is created for bitmap, see details in
+ * llbitmap_add_disk(). And a file is created as well to manage bitmap IO for
+ * this disk, see details in llbitmap_open_disk(). Read/write bitmap is
+ * converted to buffer IO to this file.
*/
+#define BITMAP_MAX_SECTOR (128 * 2)
#define BITMAP_MAX_PAGES 32
#define BITMAP_SB_SIZE 1024
@@ -135,6 +143,13 @@ struct llbitmap {
__u64 events_cleared;
};
+struct llbitmap_bio {
+ struct md_rdev *rdev;
+ struct bio bio;
+};
+
+static struct workqueue_struct *md_llbitmap_io_wq;
+
static char state_machine[nr_llbitmap_state][nr_llbitmap_action] = {
[BitUnwritten] = {BitDirty, BitNone, BitNone, BitNone, BitNone, BitNone, BitNone, BitNone},
[BitClean] = {BitDirty, BitNone, BitNone, BitNone, BitNone, BitNone, BitUnwritten, BitNeedSync},
@@ -254,3 +269,226 @@ static enum llbitmap_state llbitmap_state_machine(struct llbitmap *llbitmap,
return state;
}
+
+static void llbitmap_end_write(struct bio *bio)
+{
+ struct bio *parent = bio->bi_private;
+ struct llbitmap_bio *llbitmap_bio;
+ struct md_rdev *rdev;
+
+ if (bio->bi_status == BLK_STS_OK) {
+ WRITE_ONCE(parent->bi_status, BLK_STS_OK);
+ } else {
+ llbitmap_bio = container_of(bio, struct llbitmap_bio, bio);
+ rdev = llbitmap_bio->rdev;
+
+ pr_err("%s: %s: bitmap write failed for %pg\n", __func__,
+ mdname(rdev->mddev), rdev->bdev);
+ md_error(rdev->mddev, rdev);
+ }
+
+ bio_put(bio);
+ bio_endio(parent);
+}
+
+static void md_llbitmap_retry_read(struct llbitmap *llbitmap, struct bio *bio)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&llbitmap->retry_lock, flags);
+ bio_list_add(&llbitmap->retry_list, bio);
+ queue_work(md_llbitmap_io_wq, &llbitmap->retry_work);
+ spin_unlock_irqrestore(&llbitmap->retry_lock, flags);
+}
+
+static void llbitmap_end_read(struct bio *bio)
+{
+ struct bio *parent = bio->bi_private;
+ struct llbitmap *llbitmap;
+ struct md_rdev *rdev;
+
+ if (bio->bi_status == BLK_STS_OK) {
+ WRITE_ONCE(parent->bi_status, BLK_STS_OK);
+ bio_put(bio);
+ bio_endio(parent);
+ return;
+ }
+
+ rdev = container_of(bio, struct llbitmap_bio, bio)->rdev;
+ /* the bitmap is reachable through the mddev of the failed rdev */
+ llbitmap = rdev->mddev->bitmap;
+ pr_err("%s: %s: bitmap read failed for %pg\n", __func__,
+ mdname(rdev->mddev), rdev->bdev);
+ md_error(rdev->mddev, rdev);
+ bio_put(bio);
+ md_llbitmap_retry_read(llbitmap, parent);
+}
+
+static void md_llbitmap_retry_fn(struct work_struct *work)
+{
+ struct llbitmap *llbitmap =
+ container_of(work, struct llbitmap, retry_work);
+ struct mddev *mddev = llbitmap->mddev;
+ struct md_rdev *rdev;
+ struct bio *bio;
+
+again:
+ spin_lock_irq(&llbitmap->retry_lock);
+ bio = bio_list_pop(&llbitmap->retry_list);
+ spin_unlock_irq(&llbitmap->retry_lock);
+
+ if (!bio)
+ return;
+
+ rdev_for_each(rdev, mddev) {
+ struct llbitmap_bio *llbitmap_bio;
+ struct bio *new;
+
+ if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
+ continue;
+
+ new = bio_alloc_clone(rdev->bdev, bio, GFP_NOIO,
+ &llbitmap->bio_set);
+ new->bi_iter.bi_sector = bio->bi_iter.bi_sector +
+ rdev->sb_start +
+ mddev->bitmap_info.offset;
+ new->bi_opf |= REQ_SYNC | REQ_IDLE | REQ_META;
+ new->bi_private = bio;
+ new->bi_end_io = llbitmap_end_read;
+
+ llbitmap_bio = container_of(new, struct llbitmap_bio, bio);
+ llbitmap_bio->rdev = rdev;
+
+ submit_bio_noacct(new);
+ goto again;
+ }
+}
+
+static void llbitmap_submit_bio(struct bio *bio)
+{
+ struct mddev *mddev = bio->bi_bdev->bd_disk->private_data;
+ struct llbitmap *llbitmap = mddev->bitmap;
+ struct llbitmap_bio *llbitmap_bio;
+ struct md_rdev *rdev;
+ struct bio *new;
+
+ if (unlikely(bio->bi_opf & REQ_PREFLUSH))
+ bio->bi_opf &= ~REQ_PREFLUSH;
+
+ if (!bio_sectors(bio)) {
+ bio_endio(bio);
+ return;
+ }
+
+ /* status will be cleared if any member disk IO succeeds */
+ bio->bi_status = BLK_STS_IOERR;
+
+ rdev_for_each(rdev, mddev) {
+ if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
+ continue;
+
+ new = bio_alloc_clone(rdev->bdev, bio, GFP_NOIO,
+ &llbitmap->bio_set);
+ new->bi_iter.bi_sector = bio->bi_iter.bi_sector +
+ rdev->sb_start +
+ mddev->bitmap_info.offset;
+ new->bi_opf |= REQ_SYNC | REQ_IDLE | REQ_META;
+
+ llbitmap_bio = container_of(new, struct llbitmap_bio, bio);
+ llbitmap_bio->rdev = rdev;
+ bio_inc_remaining(bio);
+ new->bi_private = bio;
+
+ if (bio_data_dir(bio) == WRITE) {
+ new->bi_end_io = llbitmap_end_write;
+ new->bi_opf |= REQ_FUA;
+ submit_bio_noacct(new);
+ continue;
+ }
+
+ new->bi_end_io = llbitmap_end_read;
+ submit_bio_noacct(new);
+ break;
+ }
+
+ bio_endio(bio);
+}
+
+const struct block_device_operations llbitmap_fops = {
+ .owner = THIS_MODULE,
+ .submit_bio = llbitmap_submit_bio,
+};
+
+static int llbitmap_add_disk(struct llbitmap *llbitmap)
+{
+ struct mddev *mddev = llbitmap->mddev;
+ struct gendisk *disk = blk_alloc_disk(&mddev->gendisk->queue->limits,
+ NUMA_NO_NODE);
+ int ret;
+
+ if (IS_ERR(disk))
+ return PTR_ERR(disk);
+
+ sprintf(disk->disk_name, "%s_bitmap", mdname(mddev));
+ disk->flags |= GENHD_FL_HIDDEN;
+ disk->fops = &llbitmap_fops;
+
+ ret = add_disk(disk);
+ if (ret) {
+ put_disk(disk);
+ return ret;
+ }
+
+ set_capacity(disk, BITMAP_MAX_SECTOR);
+ disk->private_data = mddev;
+ llbitmap->bitmap_disk = disk;
+ return 0;
+}
+
+static void llbitmap_del_disk(struct llbitmap *llbitmap)
+{
+ struct gendisk *disk = llbitmap->bitmap_disk;
+
+ if (!disk)
+ return;
+
+ llbitmap->bitmap_disk = NULL;
+ del_gendisk(disk);
+ put_disk(disk);
+}
+
+static int llbitmap_open_disk(struct llbitmap *llbitmap)
+{
+ struct gendisk *disk = llbitmap->bitmap_disk;
+ struct file *bitmap_file;
+
+ bitmap_file = bdev_file_alloc(disk->part0,
+ BLK_OPEN_READ | BLK_OPEN_WRITE);
+ if (IS_ERR(bitmap_file))
+ return PTR_ERR(bitmap_file);
+
+ /* corresponding to the blkdev_put_no_open() from blkdev_release() */
+ get_device(disk_to_dev(disk));
+
+ bitmap_file->f_flags |= O_LARGEFILE;
+ bitmap_file->f_mode |= FMODE_CAN_ODIRECT;
+ bitmap_file->f_mapping = disk->part0->bd_mapping;
+ bitmap_file->f_wb_err = filemap_sample_wb_err(bitmap_file->f_mapping);
+
+ /* not actually opened, let blkdev_release() know */
+ bitmap_file->private_data = ERR_PTR(-ENODEV);
+ llbitmap->bitmap_file = bitmap_file;
+ return 0;
+}
+
+static void llbitmap_close_disk(struct llbitmap *llbitmap)
+{
+ struct file *bitmap_file = llbitmap->bitmap_file;
+
+ if (!bitmap_file)
+ return;
+
+ llbitmap->bitmap_file = NULL;
+ fput(bitmap_file);
+}
+
--
2.39.2
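The fan-out policy of llbitmap_submit_bio() can be modelled in user space
along the following lines. This is a simplified sketch, not the patch's
code: `struct member`, `fan_out_write()`, `read_with_retry()` and
`member_sector()` are hypothetical names, IO is simulated with an `io_fails`
flag, and a failed member is assumed to end up Faulty after md_error(),
which the real md core does not guarantee in every configuration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an md_rdev: only what the sketch needs. */
struct member {
	bool in_array;     /* models rdev->raid_disk >= 0 */
	bool faulty;       /* models test_bit(Faulty, &rdev->flags) */
	bool io_fails;     /* test knob: IO sent to this member fails */
	uint64_t sb_start; /* sector of the md superblock on this member */
};

static bool usable(const struct member *m)
{
	return m->in_array && !m->faulty;
}

/* Sector on the member disk for a given sector of the hidden bitmap disk. */
static uint64_t member_sector(const struct member *m, uint64_t sector,
			      int64_t bitmap_offset)
{
	return sector + m->sb_start + (uint64_t)bitmap_offset;
}

/*
 * Writes go to every usable member. The result starts out as an error and
 * is cleared as soon as one copy lands, mirroring the BLK_STS_IOERR
 * initialisation of the parent bio. Returns 0 on success, -1 otherwise.
 */
static int fan_out_write(struct member *members, size_t n)
{
	int status = -1;

	for (size_t i = 0; i < n; i++) {
		struct member *m = &members[i];

		if (!usable(m))
			continue;
		if (m->io_fails) {
			m->faulty = true; /* assumed effect of md_error() */
			continue;
		}
		status = 0;
	}
	return status;
}

/*
 * Reads go to the first usable member and fall through to the next one on
 * failure. Returns the index of the member that served the read, or -1.
 */
static int read_with_retry(struct member *members, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		struct member *m = &members[i];

		if (!usable(m))
			continue;
		if (m->io_fails) {
			m->faulty = true; /* assumed effect of md_error() */
			continue;
		}
		return (int)i;
	}
	return -1;
}
```

The asymmetry is the point of the design: a bitmap write has to reach every
copy, so it only fails when no member took it, while a read needs just one
good copy and is retried member by member.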
* [PATCH RFC v2 08/14] md/md-llbitmap: implement APIs for page level dirty bits synchronization
2025-03-28 6:08 [PATCH RFC v2 00/14] md: introduce a new lockless bitmap Yu Kuai
` (6 preceding siblings ...)
2025-03-28 6:08 ` [PATCH RFC v2 07/14] md/md-llbitmap: implement hidden disk to manage bitmap IO Yu Kuai
@ 2025-03-28 6:08 ` Yu Kuai
2025-03-28 6:08 ` [PATCH RFC v2 09/14] md/md-llbitmap: implement APIs to manage bitmap lifetime Yu Kuai
` (7 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: Yu Kuai @ 2025-03-28 6:08 UTC (permalink / raw)
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
The IO fast path sets bits to dirty, and those dirty bits must be
cleared after IO is done, to prevent unnecessary data recovery after
power failure.
This patch adds a bitmap page level barrier and related APIs:
- llbitmap_{suspend, resume} will be used by the daemon from the slow path:
1) suspend new write IO;
2) wait for inflight write IO to be done;
3) clear dirty bits;
4) resume write IO;
- llbitmap_{raise, release}_barrier will be used in the IO fast path; the
overhead is just one percpu ref get if the page is not suspended.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/md-llbitmap.c | 119 +++++++++++++++++++++++++++++++++++++++
1 file changed, 119 insertions(+)
diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index bbd8a7c99577..7d4a0e81f8e1 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -63,12 +63,29 @@
* llbitmap_add_disk(). And a file is created as well to manage bitmap IO for
* this disk, see details in llbitmap_open_disk(). Read/write bitmap is
* converted to buffer IO to this file.
+ *
+ * IO fast path will set bits to dirty, and those dirty bits will be cleared
+ * by daemon after IO is done. llbitmap_barrier is used to synchronize between
+ * the IO path and the daemon.
+ *
+ * IO path:
+ * 1) try to grab a reference; on success, set expire time to now + 5s and return;
+ * 2) otherwise wait for the daemon to finish clearing dirty bits, then retry;
+ *
+ * Daemon (woken up every daemon_sleep seconds):
+ * For each page:
+ * 1) check if page expired, if not skip this page; for expired page:
+ * 2) suspend the page and wait for inflight write IO to be done;
+ * 3) change dirty page to clean;
+ * 4) resume the page;
*/
#define BITMAP_MAX_SECTOR (128 * 2)
#define BITMAP_MAX_PAGES 32
#define BITMAP_SB_SIZE 1024
+#define BARRIER_IDLE 5
+
enum llbitmap_state {
/* No valid data, init state after assemble the array */
BitUnwritten = 0,
@@ -115,6 +132,16 @@ enum llbitmap_action {
BitmapActionInit,
};
+/*
+ * Page level barrier to synchronize write IO dirtying bits with the daemon
+ * cleaning them.
+ */
+struct llbitmap_barrier {
+ struct percpu_ref active;
+ unsigned long expire;
+ wait_queue_head_t wait;
+} ____cacheline_aligned_in_smp;
+
struct llbitmap {
struct mddev *mddev;
/* hidden disk to manage bitmap IO */
@@ -123,6 +150,7 @@ struct llbitmap {
struct file *bitmap_file;
int nr_pages;
struct page *pages[BITMAP_MAX_PAGES];
+ struct llbitmap_barrier barrier[BITMAP_MAX_PAGES];
struct bio_set bio_set;
struct bio_list retry_list;
@@ -492,3 +520,94 @@ static void llbitmap_close_disk(struct llbitmap *llbitmap)
fput(bitmap_file);
}
+static void llbitmap_free_pages(struct llbitmap *llbitmap)
+{
+ int i;
+
+ for (i = 0; i < BITMAP_MAX_PAGES; i++) {
+ struct page *page = llbitmap->pages[i];
+
+ if (!page)
+ return;
+
+ llbitmap->pages[i] = NULL;
+ put_page(page);
+ percpu_ref_exit(&llbitmap->barrier[i].active);
+ }
+}
+
+static void llbitmap_raise_barrier(struct llbitmap *llbitmap, int page_idx)
+{
+ struct llbitmap_barrier *barrier = &llbitmap->barrier[page_idx];
+
+retry:
+ if (likely(percpu_ref_tryget_live(&barrier->active))) {
+ WRITE_ONCE(barrier->expire, jiffies + BARRIER_IDLE * HZ);
+ return;
+ }
+
+ wait_event(barrier->wait, !percpu_ref_is_dying(&barrier->active));
+ goto retry;
+}
+
+static void llbitmap_release_barrier(struct llbitmap *llbitmap, int page_idx)
+{
+ struct llbitmap_barrier *barrier = &llbitmap->barrier[page_idx];
+
+ percpu_ref_put(&barrier->active);
+}
+
+static void llbitmap_suspend(struct llbitmap *llbitmap, int page_idx)
+{
+ struct llbitmap_barrier *barrier = &llbitmap->barrier[page_idx];
+
+ percpu_ref_kill(&barrier->active);
+ wait_event(barrier->wait, percpu_ref_is_zero(&barrier->active));
+}
+
+static void llbitmap_resume(struct llbitmap *llbitmap, int page_idx)
+{
+ struct llbitmap_barrier *barrier = &llbitmap->barrier[page_idx];
+
+ barrier->expire = LONG_MAX;
+ percpu_ref_resurrect(&barrier->active);
+ wake_up(&barrier->wait);
+}
+
+static void active_release(struct percpu_ref *ref)
+{
+ struct llbitmap_barrier *barrier =
+ container_of(ref, struct llbitmap_barrier, active);
+
+ wake_up(&barrier->wait);
+}
+
+static int llbitmap_cache_pages(struct llbitmap *llbitmap)
+{
+ int nr_pages = (llbitmap->chunks + BITMAP_SB_SIZE + PAGE_SIZE - 1) / PAGE_SIZE;
+ struct page *page;
+ int i = 0;
+
+ llbitmap->nr_pages = nr_pages;
+ while (i < nr_pages) {
+ page = read_mapping_page(llbitmap->bitmap_file->f_mapping, i, NULL);
+ if (IS_ERR(page)) {
+ int ret = PTR_ERR(page);
+
+ llbitmap_free_pages(llbitmap);
+ return ret;
+ }
+
+ if (percpu_ref_init(&llbitmap->barrier[i].active, active_release,
+ PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
+ put_page(page);
+ return -ENOMEM;
+ }
+
+ init_waitqueue_head(&llbitmap->barrier[i].wait);
+ llbitmap->pages[i++] = page;
+ }
+
+ return 0;
+}
+
--
2.39.2
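The barrier protocol can be followed step by step with a small
single-threaded model. It is a sketch only: `struct barrier_model` and the
`barrier_*` helpers are hypothetical, there is no real percpu_ref or wait
queue behind them, time is counted in seconds rather than jiffies, and a
blocked caller is represented by a `false` return instead of a sleep.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BARRIER_IDLE 5 /* seconds, as in the patch */

/* Single-threaded model of struct llbitmap_barrier. */
struct barrier_model {
	unsigned long inflight; /* writers currently holding a reference */
	bool dying;             /* true between suspend and resume */
	unsigned long expire;   /* time from which the daemon may clean */
};

/* IO path: models llbitmap_raise_barrier(); false means "would sleep". */
static bool barrier_try_raise(struct barrier_model *b, unsigned long now)
{
	if (b->dying)
		return false;
	b->inflight++;
	b->expire = now + BARRIER_IDLE;
	return true;
}

/* IO path: models llbitmap_release_barrier(). */
static void barrier_release(struct barrier_model *b)
{
	assert(b->inflight > 0);
	b->inflight--;
}

/* Daemon: a page written less than BARRIER_IDLE seconds ago is skipped. */
static bool barrier_expired(const struct barrier_model *b, unsigned long now)
{
	return now >= b->expire;
}

/*
 * Daemon: models llbitmap_suspend(). Blocks new writers and reports
 * whether the inflight ones have drained; the kernel code waits instead.
 */
static bool barrier_suspend(struct barrier_model *b)
{
	b->dying = true;
	return b->inflight == 0;
}

/* Daemon: models llbitmap_resume(); the page is idle until the next write. */
static void barrier_resume(struct barrier_model *b)
{
	assert(b->dying && b->inflight == 0);
	b->expire = LONG_MAX;
	b->dying = false;
}
```

In the model, as in the patch, the writer pays for one reference and one
store per page in the common case; only a writer that arrives while the
daemon holds the page suspended has to wait.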
* [PATCH RFC v2 09/14] md/md-llbitmap: implement APIs to manage bitmap lifetime
2025-03-28 6:08 [PATCH RFC v2 00/14] md: introduce a new lockless bitmap Yu Kuai
` (7 preceding siblings ...)
2025-03-28 6:08 ` [PATCH RFC v2 08/14] md/md-llbitmap: implement APIs for page level dirty bits synchronization Yu Kuai
@ 2025-03-28 6:08 ` Yu Kuai
2025-03-28 6:08 ` [PATCH RFC v2 10/14] md/md-llbitmap: implement APIs to dirty bits and clear bits Yu Kuai
` (6 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: Yu Kuai @ 2025-03-28 6:08 UTC (permalink / raw)
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
Include the following APIs:
- llbitmap_create
- llbitmap_resize
- llbitmap_load
- llbitmap_destroy
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/md-bitmap.h | 1 +
drivers/md/md-llbitmap.c | 306 +++++++++++++++++++++++++++++++++++++++
2 files changed, 307 insertions(+)
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 5d579f0b0c3a..9e8bc7895751 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -22,6 +22,7 @@ typedef __u16 bitmap_counter_t;
enum bitmap_state {
BITMAP_STALE = 1, /* the bitmap file is out of date or had -EIO */
BITMAP_WRITE_ERROR = 2, /* A write error has occurred */
+ BITMAP_FIRST_USE = 3, /* llbitmap is just created */
BITMAP_HOSTENDIAN =15,
};
diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index 7d4a0e81f8e1..1452887ffc5d 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -80,9 +80,15 @@
* 4) resume the page;
*/
+#define LLBITMAP_MAJOR_HI 6
+
#define BITMAP_MAX_SECTOR (128 * 2)
#define BITMAP_MAX_PAGES 32
#define BITMAP_SB_SIZE 1024
+/* 64k is the max IO size of sync IO for raid1/raid10 */
+#define MIN_CHUNK_SIZE (64 * 2)
+
+#define DEFAULT_DAEMON_SLEEP 30
#define BARRIER_IDLE 5
@@ -177,6 +183,7 @@ struct llbitmap_bio {
};
static struct workqueue_struct *md_llbitmap_io_wq;
+static struct workqueue_struct *md_llbitmap_unplug_wq;
static char state_machine[nr_llbitmap_state][nr_llbitmap_action] = {
[BitUnwritten] = {BitDirty, BitNone, BitNone, BitNone, BitNone, BitNone, BitNone, BitNone},
@@ -611,3 +618,302 @@ static int llbitmap_cache_pages(struct llbitmap *llbitmap)
return 0;
}
+static int llbitmap_check_support(struct mddev *mddev)
+{
+ if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
+ pr_notice("md/llbitmap: %s: array with journal cannot have bitmap\n",
+ mdname(mddev));
+ return -EBUSY;
+ }
+
+ if (mddev->bitmap_info.space == 0) {
+ if (mddev->bitmap_info.default_space == 0) {
+ pr_notice("md/llbitmap: %s: no space for bitmap\n",
+ mdname(mddev));
+ return -ENOSPC;
+ }
+ }
+
+ if (!mddev->persistent) {
+ pr_notice("md/llbitmap: %s: array must be persistent\n",
+ mdname(mddev));
+ return -EOPNOTSUPP;
+ }
+
+ if (mddev->bitmap_info.file) {
+ pr_notice("md/llbitmap: %s: doesn't support bitmap file\n",
+ mdname(mddev));
+ return -EOPNOTSUPP;
+ }
+
+ if (mddev->bitmap_info.external) {
+ pr_notice("md/llbitmap: %s: doesn't support external metadata\n",
+ mdname(mddev));
+ return -EOPNOTSUPP;
+ }
+
+ if (mddev_is_dm(mddev)) {
+ pr_notice("md/llbitmap: %s: doesn't support dm-raid\n",
+ mdname(mddev));
+ return -EOPNOTSUPP;
+ }
+
+ return 0;
+}
+
+static int llbitmap_init(struct llbitmap *llbitmap)
+{
+ struct mddev *mddev = llbitmap->mddev;
+ sector_t blocks = mddev->resync_max_sectors;
+ unsigned long chunksize = MIN_CHUNK_SIZE;
+ unsigned long chunks = DIV_ROUND_UP(blocks, chunksize);
+ unsigned long space = mddev->bitmap_info.space << SECTOR_SHIFT;
+ int ret;
+
+ while (chunks > space) {
+ chunksize = chunksize << 1;
+ chunks = DIV_ROUND_UP(blocks, chunksize);
+ }
+
+ llbitmap->chunkshift = ffz(~chunksize);
+ llbitmap->chunksize = chunksize;
+ llbitmap->chunks = chunks;
+ mddev->bitmap_info.daemon_sleep = DEFAULT_DAEMON_SLEEP;
+
+ ret = llbitmap_cache_pages(llbitmap);
+ if (ret)
+ return ret;
+
+ llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, BitmapActionInit);
+ return 0;
+}
+
+static int llbitmap_read_sb(struct llbitmap *llbitmap)
+{
+ struct mddev *mddev = llbitmap->mddev;
+ unsigned long daemon_sleep;
+ unsigned long chunksize;
+ unsigned long events;
+ struct page *sb_page;
+ bitmap_super_t *sb;
+ int ret = -EINVAL;
+
+ if (!mddev->bitmap_info.offset) {
+ pr_err("md/llbitmap: %s: no super block found\n", mdname(mddev));
+ return -EINVAL;
+ }
+
+ sb_page = read_mapping_page(llbitmap->bitmap_file->f_mapping, 0, NULL);
+ if (IS_ERR(sb_page)) {
+ pr_err("md/llbitmap: %s: read super block failed\n",
+ mdname(mddev));
+ ret = -EIO;
+ goto out;
+ }
+
+ sb = kmap_local_page(sb_page);
+ if (sb->magic != cpu_to_le32(BITMAP_MAGIC)) {
+ pr_err("md/llbitmap: %s: invalid super block magic number\n",
+ mdname(mddev));
+ goto out_put_page;
+ }
+
+ if (sb->version != cpu_to_le32(LLBITMAP_MAJOR_HI)) {
+ pr_err("md/llbitmap: %s: invalid super block version\n",
+ mdname(mddev));
+ goto out_put_page;
+ }
+
+ if (memcmp(sb->uuid, mddev->uuid, 16)) {
+ pr_err("md/llbitmap: %s: bitmap superblock UUID mismatch\n",
+ mdname(mddev));
+ goto out_put_page;
+ }
+
+ if (mddev->bitmap_info.space == 0) {
+ int room = le32_to_cpu(sb->sectors_reserved);
+
+ if (room)
+ mddev->bitmap_info.space = room;
+ else
+ mddev->bitmap_info.space = mddev->bitmap_info.default_space;
+ }
+ llbitmap->flags = le32_to_cpu(sb->state);
+ if (test_and_clear_bit(BITMAP_FIRST_USE, &llbitmap->flags)) {
+ ret = llbitmap_init(llbitmap);
+ goto out_put_page;
+ }
+
+ chunksize = le32_to_cpu(sb->chunksize);
+ if (!is_power_of_2(chunksize)) {
+ pr_err("md/llbitmap: %s: chunksize not a power of 2\n",
+ mdname(mddev));
+ goto out_put_page;
+ }
+
+ if (chunksize < DIV_ROUND_UP(mddev->resync_max_sectors,
+ mddev->bitmap_info.space << SECTOR_SHIFT)) {
+ pr_err("md/llbitmap: %s: chunksize too small %lu < %llu / %lu\n",
+ mdname(mddev), chunksize, mddev->resync_max_sectors,
+ mddev->bitmap_info.space);
+ goto out_put_page;
+ }
+
+ daemon_sleep = le32_to_cpu(sb->daemon_sleep);
+ if (daemon_sleep < 1 || daemon_sleep > MAX_SCHEDULE_TIMEOUT / HZ) {
+ pr_err("md/llbitmap: %s: daemon sleep period %lu out of range\n",
+ mdname(mddev), daemon_sleep);
+ goto out_put_page;
+ }
+
+ if (le32_to_cpu(sb->write_behind))
+ pr_warn("md/llbitmap: %s: slow disk is not supported\n",
+ mdname(mddev));
+
+ events = le64_to_cpu(sb->events);
+ if (events < mddev->events) {
+ pr_warn("md/llbitmap: %s: bitmap file is out of date (%lu < %llu) -- forcing full recovery\n",
+ mdname(mddev), events, mddev->events);
+ set_bit(BITMAP_STALE, &llbitmap->flags);
+ }
+
+ sb->sync_size = cpu_to_le64(mddev->resync_max_sectors);
+ mddev->bitmap_info.chunksize = chunksize;
+ mddev->bitmap_info.daemon_sleep = daemon_sleep;
+
+ llbitmap->chunksize = chunksize;
+ llbitmap->chunks = DIV_ROUND_UP(mddev->resync_max_sectors, chunksize);
+ llbitmap->chunkshift = ffz(~chunksize);
+ ret = llbitmap_cache_pages(llbitmap);
+
+out_put_page:
+ kunmap_local(sb);
+ put_page(sb_page);
+out:
+ return ret;
+}
+
+static int llbitmap_create(struct mddev *mddev)
+{
+ struct llbitmap *llbitmap;
+ int ret;
+
+ ret = llbitmap_check_support(mddev);
+ if (ret)
+ return ret;
+
+ llbitmap = kzalloc(sizeof(*llbitmap), GFP_KERNEL);
+ if (!llbitmap)
+ return -ENOMEM;
+
+ llbitmap->mddev = mddev;
+ bio_list_init(&llbitmap->retry_list);
+ spin_lock_init(&llbitmap->retry_lock);
+
+ timer_setup(&llbitmap->pending_timer, llbitmap_pending_timer_fn, 0);
+ INIT_WORK(&llbitmap->retry_work, md_llbitmap_retry_fn);
+ INIT_WORK(&llbitmap->daemon_work, md_llbitmap_daemon_fn);
+
+ ret = bioset_init(&llbitmap->bio_set, BIO_POOL_SIZE,
+ offsetof(struct llbitmap_bio, bio), 0);
+ if (ret)
+ goto err_out;
+
+ ret = llbitmap_add_disk(llbitmap);
+ if (ret)
+ goto err_bio_set;
+
+ ret = llbitmap_open_disk(llbitmap);
+ if (ret)
+ goto err_del_disk;
+
+ mutex_lock(&mddev->bitmap_info.mutex);
+ mddev->bitmap = llbitmap;
+ ret = llbitmap_read_sb(llbitmap);
+ mutex_unlock(&mddev->bitmap_info.mutex);
+ if (ret)
+ goto err_close_disk;
+
+ return 0;
+
+err_close_disk:
+ mddev->bitmap = NULL;
+ llbitmap_close_disk(llbitmap);
+err_del_disk:
+ llbitmap_del_disk(llbitmap);
+err_bio_set:
+ bioset_exit(&llbitmap->bio_set);
+err_out:
+ kfree(llbitmap);
+ return ret;
+}
+
+static int llbitmap_resize(struct mddev *mddev, sector_t blocks, int chunksize)
+{
+ struct llbitmap *llbitmap = mddev->bitmap;
+ unsigned long chunks;
+
+ if (chunksize == 0)
+ chunksize = llbitmap->chunksize;
+
+ /* If there is enough space, leave the chunksize unchanged. */
+ chunks = DIV_ROUND_UP(blocks, chunksize);
+ while (chunks > mddev->bitmap_info.space << SECTOR_SHIFT) {
+ chunksize = chunksize << 1;
+ chunks = DIV_ROUND_UP(blocks, chunksize);
+ }
+
+ llbitmap->chunkshift = ffz(~chunksize);
+ llbitmap->chunksize = chunksize;
+ llbitmap->chunks = chunks;
+
+ return 0;
+}
+
+static int llbitmap_load(struct mddev *mddev)
+{
+ enum llbitmap_action action = BitmapActionReload;
+ struct llbitmap *llbitmap = mddev->bitmap;
+
+ if (test_and_clear_bit(BITMAP_STALE, &llbitmap->flags))
+ action = BitmapActionStale;
+
+ llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, action);
+ return 0;
+}
+
+static void llbitmap_destroy(struct mddev *mddev)
+{
+ struct llbitmap *llbitmap = mddev->bitmap;
+
+ if (!llbitmap)
+ return;
+
+ mutex_lock(&mddev->bitmap_info.mutex);
+
+ del_timer_sync(&llbitmap->pending_timer);
+ flush_workqueue(md_llbitmap_io_wq);
+ flush_workqueue(md_llbitmap_unplug_wq);
+
+ llbitmap_del_disk(llbitmap);
+ llbitmap_close_disk(llbitmap);
+ mddev->bitmap = NULL;
+
+ llbitmap_free_pages(llbitmap);
+ bioset_exit(&llbitmap->bio_set);
+ kfree(llbitmap);
+ mutex_unlock(&mddev->bitmap_info.mutex);
+}
+
+static struct bitmap_operations llbitmap_ops = {
+ .head = {
+ .type = MD_BITMAP,
+ .id = ID_LLBITMAP,
+ .name = "llbitmap",
+ },
+
+ .create = llbitmap_create,
+ .resize = llbitmap_resize,
+ .load = llbitmap_load,
+ .destroy = llbitmap_destroy,
+};
--
2.39.2
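The chunk size selection in llbitmap_init() and llbitmap_resize() boils down
to "one byte per chunk, double the chunk size until the bytes fit". A
user-space sketch of that arithmetic, under stated assumptions:
`struct geometry`, `pick_geometry()` and `div_round_up()` are names invented
for the illustration, and like the patch the sketch compares against the
whole reserved space without setting aside room for the superblock.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT   9
#define MIN_CHUNK_SIZE (64 * 2) /* in sectors, i.e. 64 KiB */

struct geometry {
	uint64_t chunksize;      /* sectors per chunk, a power of two */
	uint64_t chunks;         /* number of one-byte states needed */
	unsigned int chunkshift; /* log2(chunksize) */
};

static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/*
 * Pick the smallest power-of-two chunk size, starting at MIN_CHUNK_SIZE,
 * for which one byte per chunk fits into the reserved bitmap space.
 */
static struct geometry pick_geometry(uint64_t array_sectors,
				     uint64_t space_sectors)
{
	uint64_t space_bytes = space_sectors << SECTOR_SHIFT;
	struct geometry g = { .chunksize = MIN_CHUNK_SIZE };

	assert(space_bytes > 0); /* with no space the loop cannot terminate */

	g.chunks = div_round_up(array_sectors, g.chunksize);
	while (g.chunks > space_bytes) {
		g.chunksize <<= 1;
		g.chunks = div_round_up(array_sectors, g.chunksize);
	}

	/* The patch uses ffz(~chunksize); for a power of two that is log2. */
	while ((UINT64_C(1) << g.chunkshift) < g.chunksize)
		g.chunkshift++;

	return g;
}
```

For a 1 TiB array (2^31 sectors) and 256 sectors of bitmap space this gives
a chunk size of 16384 sectors (8 MiB) and 131072 chunks, so each state byte
then covers 8 MiB of the array.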
* [PATCH RFC v2 10/14] md/md-llbitmap: implement APIs to dirty bits and clear bits
2025-03-28 6:08 [PATCH RFC v2 00/14] md: introduce a new lockless bitmap Yu Kuai
` (8 preceding siblings ...)
2025-03-28 6:08 ` [PATCH RFC v2 09/14] md/md-llbitmap: implement APIs to manage bitmap lifetime Yu Kuai
@ 2025-03-28 6:08 ` Yu Kuai
2025-03-28 6:08 ` [PATCH RFC v2 11/14] md/md-llbitmap: implement APIs for sync_thread Yu Kuai
` (5 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: Yu Kuai @ 2025-03-28 6:08 UTC (permalink / raw)
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
Include the following APIs:
- llbitmap_startwrite
- llbitmap_endwrite
- llbitmap_unplug
- llbitmap_flush
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/md-bitmap.h | 1 +
drivers/md/md-llbitmap.c | 171 +++++++++++++++++++++++++++++++++++++++
2 files changed, 172 insertions(+)
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 9e8bc7895751..7a3cd2f70772 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -23,6 +23,7 @@ enum bitmap_state {
BITMAP_STALE = 1, /* the bitmap file is out of date or had -EIO */
BITMAP_WRITE_ERROR = 2, /* A write error has occurred */
BITMAP_FIRST_USE = 3, /* llbitmap is just created */
+ BITMAP_DAEMON_BUSY = 4, /* llbitmap daemon is still not done after daemon_sleep */
BITMAP_HOSTENDIAN =15,
};
diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index 1452887ffc5d..982ec868ce22 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -145,6 +145,7 @@ enum llbitmap_action {
struct llbitmap_barrier {
struct percpu_ref active;
unsigned long expire;
+ bool flush;
wait_queue_head_t wait;
} ____cacheline_aligned_in_smp;
@@ -182,6 +183,12 @@ struct llbitmap_bio {
struct bio bio;
};
+struct llbitmap_unplug_work {
+ struct work_struct work;
+ struct llbitmap *llbitmap;
+ struct completion *done;
+};
+
static struct workqueue_struct *md_llbitmap_io_wq;
static struct workqueue_struct *md_llbitmap_unplug_wq;
@@ -793,6 +800,54 @@ static int llbitmap_read_sb(struct llbitmap *llbitmap)
return ret;
}
+static void llbitmap_pending_timer_fn(struct timer_list *t)
+{
+ struct llbitmap *llbitmap = from_timer(llbitmap, t, pending_timer);
+
+ if (work_busy(&llbitmap->daemon_work)) {
+ pr_warn("daemon_work not finished\n");
+ set_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags);
+ return;
+ }
+
+ queue_work(md_llbitmap_io_wq, &llbitmap->daemon_work);
+}
+
+static void md_llbitmap_daemon_fn(struct work_struct *work)
+{
+ struct llbitmap *llbitmap =
+ container_of(work, struct llbitmap, daemon_work);
+ unsigned long start = 0;
+ unsigned long end = min(llbitmap->chunks, PAGE_SIZE - BITMAP_SB_SIZE) - 1;
+ bool restart = false;
+ int page_idx = 0;
+
+ while (page_idx < llbitmap->nr_pages) {
+ struct llbitmap_barrier *barrier = &llbitmap->barrier[page_idx];
+
+ if (page_idx > 0) {
+ start = end + 1;
+ end = min(end + PAGE_SIZE, llbitmap->chunks - 1);
+ }
+
+ if (!barrier->flush && time_before(jiffies, barrier->expire)) {
+ restart = true;
+ page_idx++;
+ continue;
+ }
+
+ llbitmap_suspend(llbitmap, page_idx);
+ llbitmap_state_machine(llbitmap, start, end, BitmapActionDaemon);
+ llbitmap_resume(llbitmap, page_idx);
+
+ page_idx++;
+ }
+
+ if (test_and_clear_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags) || restart)
+ mod_timer(&llbitmap->pending_timer,
+ jiffies + llbitmap->mddev->bitmap_info.daemon_sleep * HZ);
+}
+
static int llbitmap_create(struct mddev *mddev)
{
struct llbitmap *llbitmap;
@@ -905,6 +960,117 @@ static void llbitmap_destroy(struct mddev *mddev)
mutex_unlock(&mddev->bitmap_info.mutex);
}
+static int llbitmap_startwrite(struct mddev *mddev, sector_t offset,
+ unsigned long sectors, bool is_discard)
+{
+ struct llbitmap *llbitmap = mddev->bitmap;
+ enum llbitmap_action action;
+ unsigned long start;
+ unsigned long end;
+ int page_start;
+ int page_end;
+
+ if (likely(!is_discard)) {
+ start = offset >> llbitmap->chunkshift;
+ end = (offset + sectors - 1) >> llbitmap->chunkshift;
+ action = BitmapActionStartwrite;
+ } else {
+ /*
+ * For discard, the bit can be handled only if the discard range
+ * cover the whole bit, hence round start up, and end down.
+ */
+ start = DIV_ROUND_UP(offset, llbitmap->chunksize);
+ end = (offset + sectors - 1) >> llbitmap->chunkshift;
+ action = BitmapActionDiscard;
+ }
+
+ llbitmap_state_machine(llbitmap, start, end, action);
+
+ page_start = (start + BITMAP_SB_SIZE) >> PAGE_SHIFT;
+ page_end = (end + BITMAP_SB_SIZE) >> PAGE_SHIFT;
+
+ while (page_start <= page_end) {
+ llbitmap_raise_barrier(llbitmap, page_start);
+ page_start++;
+ }
+
+ return 0;
+}
+
+static void llbitmap_endwrite(struct mddev *mddev, sector_t offset,
+ unsigned long sectors, bool is_discard)
+{
+ struct llbitmap *llbitmap = mddev->bitmap;
+ unsigned long start;
+ unsigned long end;
+ int page_start;
+ int page_end;
+
+ if (likely(!is_discard)) {
+ start = offset >> llbitmap->chunkshift;
+ end = (offset + sectors - 1) >> llbitmap->chunkshift;
+ } else {
+ /*
+ * For discard, the bit can be handled only if the discard range
+ * cover the whole bit, hence round start up, and end down.
+ */
+ start = DIV_ROUND_UP(offset, llbitmap->chunksize);
+ end = (offset + sectors - 1) >> llbitmap->chunkshift;
+ }
+
+ page_start = (start + BITMAP_SB_SIZE) >> PAGE_SHIFT;
+ page_end = (end + BITMAP_SB_SIZE) >> PAGE_SHIFT;
+
+ while (page_start <= page_end) {
+ llbitmap_release_barrier(llbitmap, page_start);
+ page_start++;
+ }
+}
+
+static void llbitmap_unplug_fn(struct work_struct *work)
+{
+ struct llbitmap_unplug_work *unplug_work =
+ container_of(work, struct llbitmap_unplug_work, work);
+ struct llbitmap *llbitmap = unplug_work->llbitmap;
+
+ filemap_write_and_wait_range(llbitmap->bitmap_file->f_mapping,
+ BITMAP_SB_SIZE,
+ BITMAP_SB_SIZE + llbitmap->chunks - 1);
+ complete(unplug_work->done);
+}
+
+static void llbitmap_unplug(struct mddev *mddev, bool sync)
+{
+ DECLARE_COMPLETION_ONSTACK(done);
+ struct llbitmap *llbitmap = mddev->bitmap;
+ struct llbitmap_unplug_work unplug_work = {
+ .llbitmap = llbitmap,
+ .done = &done,
+ };
+
+ if (!mapping_tagged(llbitmap->bitmap_file->f_mapping,
+ PAGECACHE_TAG_DIRTY))
+ return;
+
+ INIT_WORK_ONSTACK(&unplug_work.work, llbitmap_unplug_fn);
+ queue_work(md_llbitmap_unplug_wq, &unplug_work.work);
+ wait_for_completion(&done);
+ destroy_work_on_stack(&unplug_work.work);
+}
+
+static void llbitmap_flush(struct mddev *mddev)
+{
+ struct llbitmap *llbitmap = mddev->bitmap;
+ int i;
+
+ for (i = 0; i < llbitmap->nr_pages; i++)
+ llbitmap->barrier[i].flush = true;
+
+ del_timer_sync(&llbitmap->pending_timer);
+ queue_work(md_llbitmap_io_wq, &llbitmap->daemon_work);
+ flush_work(&llbitmap->daemon_work);
+}
+
static struct bitmap_operations llbitmap_ops = {
.head = {
.type = MD_BITMAP,
@@ -916,4 +1082,9 @@ static struct bitmap_operations llbitmap_ops = {
.resize = llbitmap_resize,
.load = llbitmap_load,
.destroy = llbitmap_destroy,
+
+ .startwrite = llbitmap_startwrite,
+ .endwrite = llbitmap_endwrite,
+ .unplug = llbitmap_unplug,
+ .flush = llbitmap_flush,
};
--
2.39.2
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH RFC v2 11/14] md/md-llbitmap: implement APIs for sync_thread
2025-03-28 6:08 [PATCH RFC v2 00/14] md: introduce a new lockless bitmap Yu Kuai
` (9 preceding siblings ...)
2025-03-28 6:08 ` [PATCH RFC v2 10/14] md/md-llbitmap: implement APIs to dirty bits and clear bits Yu Kuai
@ 2025-03-28 6:08 ` Yu Kuai
2025-03-28 6:08 ` [PATCH RFC v2 12/14] md/md-llbitmap: implement all bitmap operations Yu Kuai
` (4 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: Yu Kuai @ 2025-03-28 6:08 UTC (permalink / raw)
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
Include the following APIs:
- llbitmap_start_sync
- llbitmap_end_sync
- llbitmap_close_sync
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
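For reference, the per-chunk stepping used by llbitmap_start_sync() and
llbitmap_end_sync() can be sketched in user space as below. This is only an
illustration, assuming the chunk size is a power of two (which the mask in
the patch implies); sectors_to_chunk_end() and sync_steps() are hypothetical
names that do not exist in this series.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of sectors from `offset` up to the end of
 * the chunk that contains it. Assumes chunksize is a power of two.
 */
uint64_t sectors_to_chunk_end(uint64_t offset, uint64_t chunksize)
{
	return chunksize - (offset & (chunksize - 1));
}

/*
 * Hypothetical helper: how many one-chunk-at-a-time calls are needed to
 * walk the sector range [offset, offset + len).
 */
unsigned int sync_steps(uint64_t offset, uint64_t len, uint64_t chunksize)
{
	uint64_t end = offset + len;
	unsigned int steps = 0;

	while (offset < end) {
		offset += sectors_to_chunk_end(offset, chunksize);
		steps++;
	}
	return steps;
}
```

With a 128-sector chunk, a caller starting at sector 100 is told to advance
28 sectors first and then whole chunks, so the loop in the caller simply
runs once per chunk.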
drivers/md/md-llbitmap.c | 43 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 43 insertions(+)
diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index 982ec868ce22..1692942942ff 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -1071,6 +1071,45 @@ static void llbitmap_flush(struct mddev *mddev)
flush_work(&llbitmap->daemon_work);
}
+static bool llbitmap_start_sync(struct mddev *mddev, sector_t offset,
+ sector_t *blocks, bool degraded)
+{
+ struct llbitmap *llbitmap = mddev->bitmap;
+ unsigned long p = offset >> llbitmap->chunkshift;
+
+ /*
+ * Handle one bit at a time, this is much simpler. And it doesn't matter
+ * if md_do_sync() loop more times.
+ */
+ *blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1));
+ return llbitmap_state_machine(llbitmap, p, p, BitmapActionStartsync) == BitSyncing;
+}
+
+static void llbitmap_end_sync(struct mddev *mddev, sector_t offset,
+ sector_t *blocks)
+{
+ struct llbitmap *llbitmap = mddev->bitmap;
+ unsigned long p = offset >> llbitmap->chunkshift;
+
+ *blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1));
+ llbitmap_state_machine(llbitmap, p, llbitmap->chunks - 1, BitmapActionAbortsync);
+}
+
+static void llbitmap_close_sync(struct mddev *mddev)
+{
+ struct llbitmap *llbitmap = mddev->bitmap;
+ int i;
+
+ for (i = 0; i < llbitmap->nr_pages; i++) {
+ struct llbitmap_barrier *barrier = &llbitmap->barrier[i];
+
+ /* let daemon_fn clear dirty bits immediately */
+ WRITE_ONCE(barrier->expire, jiffies);
+ }
+
+ llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, BitmapActionEndsync);
+}
+
static struct bitmap_operations llbitmap_ops = {
.head = {
.type = MD_BITMAP,
@@ -1087,4 +1126,8 @@ static struct bitmap_operations llbitmap_ops = {
.endwrite = llbitmap_endwrite,
.unplug = llbitmap_unplug,
.flush = llbitmap_flush,
+
+ .start_sync = llbitmap_start_sync,
+ .end_sync = llbitmap_end_sync,
+ .close_sync = llbitmap_close_sync,
};
--
2.39.2
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH RFC v2 12/14] md/md-llbitmap: implement all bitmap operations
2025-03-28 6:08 [PATCH RFC v2 00/14] md: introduce a new lockless bitmap Yu Kuai
` (10 preceding siblings ...)
2025-03-28 6:08 ` [PATCH RFC v2 11/14] md/md-llbitmap: implement APIs for sync_thread Yu Kuai
@ 2025-03-28 6:08 ` Yu Kuai
2025-03-28 6:08 ` [PATCH RFC v2 13/14] md/md-llbitmap: implement sysfs APIs Yu Kuai
` (3 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: Yu Kuai @ 2025-03-28 6:08 UTC (permalink / raw)
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
Include the following remaining APIs:
- llbitmap_enabled
- llbitmap_dirty_bits
- llbitmap_update_sb
And the following APIs that are not needed:
- llbitmap_write_all, used in the old bitmap to mark all pages as needing
writeback;
- llbitmap_daemon_work, used in the old bitmap; llbitmap uses a timer to
trigger the daemon;
- llbitmap_cond_end_sync, used to end sync for completed sectors (TODO,
doesn't affect functionality)
And the following APIs that are not supported:
- llbitmap_start_behind_write
- llbitmap_end_behind_write
- llbitmap_wait_behind_writes
- llbitmap_sync_with_cluster
- llbitmap_get_from_slot
- llbitmap_copy_from_slot
- llbitmap_set_pages
- llbitmap_free
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/md-llbitmap.c | 144 +++++++++++++++++++++++++++++++++++++++
1 file changed, 144 insertions(+)
diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index 1692942942ff..a16bc84d9fa2 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -1110,6 +1110,130 @@ static void llbitmap_close_sync(struct mddev *mddev)
llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, BitmapActionEndsync);
}
+static bool llbitmap_enabled(void *data)
+{
+ struct llbitmap *llbitmap = data;
+
+ return llbitmap && !test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
+}
+
+static void llbitmap_dirty_bits(struct mddev *mddev, unsigned long s,
+ unsigned long e)
+{
+ llbitmap_state_machine(mddev->bitmap, s, e, BitmapActionStartwrite);
+}
+
+static void llbitmap_update_sb(void *data)
+{
+ struct llbitmap *llbitmap = data;
+ struct mddev *mddev = llbitmap->mddev;
+ struct page *sb_page;
+ bitmap_super_t *sb;
+
+ if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags))
+ return;
+
+ sb_page = read_mapping_page(llbitmap->bitmap_file->f_mapping, 0, NULL);
+ if (IS_ERR(sb_page)) {
+ pr_err("%s: %s: read super block failed\n", __func__,
+ mdname(mddev));
+ set_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
+ return;
+ }
+
+ if (mddev->events < llbitmap->events_cleared)
+ llbitmap->events_cleared = mddev->events;
+
+ sb = kmap_local_page(sb_page);
+ sb->events = cpu_to_le64(mddev->events);
+ sb->state = cpu_to_le32(llbitmap->flags);
+ sb->chunksize = cpu_to_le32(llbitmap->chunksize);
+ sb->sync_size = cpu_to_le64(mddev->resync_max_sectors);
+ sb->events_cleared = cpu_to_le64(llbitmap->events_cleared);
+ sb->sectors_reserved = cpu_to_le32(mddev->bitmap_info.space);
+ sb->daemon_sleep = cpu_to_le32(mddev->bitmap_info.daemon_sleep);
+
+ kunmap_local(sb);
+ set_page_dirty(sb_page);
+ put_page(sb_page);
+
+ if (filemap_write_and_wait_range(llbitmap->bitmap_file->f_mapping, 0,
+ BITMAP_SB_SIZE) != 0) {
+ pr_err("%s: %s: write super block failed\n", __func__,
+ mdname(mddev));
+ set_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
+ }
+}
+
+static int llbitmap_get_stats(void *data, struct md_bitmap_stats *stats)
+{
+ struct llbitmap *llbitmap = data;
+
+ memset(stats, 0, sizeof(*stats));
+
+ stats->missing_pages = 0;
+ stats->pages = llbitmap->nr_pages;
+ stats->file_pages = llbitmap->nr_pages;
+
+ return 0;
+}
+
+static void llbitmap_write_all(struct mddev *mddev)
+{
+
+}
+
+static void llbitmap_daemon_work(struct mddev *mddev)
+{
+
+}
+
+static void llbitmap_start_behind_write(struct mddev *mddev)
+{
+
+}
+
+static void llbitmap_end_behind_write(struct mddev *mddev)
+{
+
+}
+
+static void llbitmap_wait_behind_writes(struct mddev *mddev)
+{
+
+}
+
+static void llbitmap_cond_end_sync(struct mddev *mddev, sector_t sector,
+ bool force)
+{
+}
+
+static void llbitmap_sync_with_cluster(struct mddev *mddev,
+ sector_t old_lo, sector_t old_hi,
+ sector_t new_lo, sector_t new_hi)
+{
+
+}
+
+static void *llbitmap_get_from_slot(struct mddev *mddev, int slot)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
+static int llbitmap_copy_from_slot(struct mddev *mddev, int slot, sector_t *low,
+ sector_t *high, bool clear_bits)
+{
+ return -EOPNOTSUPP;
+}
+
+static void llbitmap_set_pages(void *data, unsigned long pages)
+{
+}
+
+static void llbitmap_free(void *data)
+{
+}
+
static struct bitmap_operations llbitmap_ops = {
.head = {
.type = MD_BITMAP,
@@ -1117,6 +1241,7 @@ static struct bitmap_operations llbitmap_ops = {
.name = "llbitmap",
},
+ .enabled = llbitmap_enabled,
.create = llbitmap_create,
.resize = llbitmap_resize,
.load = llbitmap_load,
@@ -1130,4 +1255,23 @@ static struct bitmap_operations llbitmap_ops = {
.start_sync = llbitmap_start_sync,
.end_sync = llbitmap_end_sync,
.close_sync = llbitmap_close_sync,
+
+ .update_sb = llbitmap_update_sb,
+ .get_stats = llbitmap_get_stats,
+ .dirty_bits = llbitmap_dirty_bits,
+
+ /* not needed */
+ .write_all = llbitmap_write_all,
+ .daemon_work = llbitmap_daemon_work,
+ .cond_end_sync = llbitmap_cond_end_sync,
+
+ /* not supported */
+ .start_behind_write = llbitmap_start_behind_write,
+ .end_behind_write = llbitmap_end_behind_write,
+ .wait_behind_writes = llbitmap_wait_behind_writes,
+ .sync_with_cluster = llbitmap_sync_with_cluster,
+ .get_from_slot = llbitmap_get_from_slot,
+ .copy_from_slot = llbitmap_copy_from_slot,
+ .set_pages = llbitmap_set_pages,
+ .free = llbitmap_free,
};
--
2.39.2
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH RFC v2 13/14] md/md-llbitmap: implement sysfs APIs
2025-03-28 6:08 [PATCH RFC v2 00/14] md: introduce a new lockless bitmap Yu Kuai
` (11 preceding siblings ...)
2025-03-28 6:08 ` [PATCH RFC v2 12/14] md/md-llbitmap: implement all bitmap operations Yu Kuai
@ 2025-03-28 6:08 ` Yu Kuai
2025-03-28 6:08 ` [PATCH RFC v2 14/14] md/md-llbitmap: add Kconfig Yu Kuai
` (2 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: Yu Kuai @ 2025-03-28 6:08 UTC (permalink / raw)
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
There are 3 APIs for now:
- bits: read-only, shows the status of the bitmap bits as a count of each state;
- metadata: read-only, shows bitmap metadata, including chunksize, chunkshift,
chunks, offset and daemon_sleep;
- daemon_sleep: read-write, default value is 30;
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
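As a rough user-space illustration of what the "bits" attribute reports, the
sketch below tallies one state byte per chunk into a histogram. The enum
values and the names sketch_state and count_states() are assumptions made
for this example only; the real enum llbitmap_state is defined earlier in
the series and may be numbered differently.

```c
#include <stddef.h>

/* Hypothetical state numbering, for illustration only. */
enum sketch_state {
	ST_UNWRITTEN,
	ST_CLEAN,
	ST_DIRTY,
	ST_NEED_SYNC,
	ST_SYNCING,
	ST_NR,
};

/*
 * Tally the state of every chunk (one byte per chunk) into counts[].
 * Returns the number of bytes that hold an out-of-range state.
 */
size_t count_states(const unsigned char *bits, size_t chunks,
		    unsigned long counts[ST_NR])
{
	size_t invalid = 0;
	size_t i;

	for (i = 0; i < ST_NR; i++)
		counts[i] = 0;

	for (i = 0; i < chunks; i++) {
		if (bits[i] >= ST_NR)
			invalid++;
		else
			counts[bits[i]]++;
	}
	return invalid;
}
```

Out-of-range bytes are counted separately rather than indexed, mirroring the
range check in bits_show().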
drivers/md/md-llbitmap.c | 106 +++++++++++++++++++++++++++++++++++++++
1 file changed, 106 insertions(+)
diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index a16bc84d9fa2..88ba29111e13 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -1234,6 +1234,110 @@ static void llbitmap_free(void *data)
{
}
+static ssize_t bits_show(struct mddev *mddev, char *page)
+{
+ struct llbitmap *llbitmap;
+ int bits[nr_llbitmap_state] = {0};
+ loff_t start = 0;
+
+ mutex_lock(&mddev->bitmap_info.mutex);
+ llbitmap = mddev->bitmap;
+ if (!llbitmap) {
+ mutex_unlock(&mddev->bitmap_info.mutex);
+ return sprintf(page, "no bitmap\n");
+ }
+
+ if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags)) {
+ mutex_unlock(&mddev->bitmap_info.mutex);
+ return sprintf(page, "bitmap io error\n");
+ }
+
+ while (start < llbitmap->chunks) {
+ ssize_t ret;
+ enum llbitmap_state c;
+
+ ret = llbitmap_read(llbitmap, &c, start);
+ if (ret < 0) {
+ set_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
+ mutex_unlock(&mddev->bitmap_info.mutex);
+ return sprintf(page, "bitmap io error\n");
+ }
+
+ if (c < 0 || c >= nr_llbitmap_state)
+ pr_err("%s: invalid bit %llu state %d\n",
+ __func__, start, c);
+ else
+ bits[c]++;
+ start++;
+ }
+
+ mutex_unlock(&mddev->bitmap_info.mutex);
+ return sprintf(page, "unwritten %d\nclean %d\ndirty %d\nneed sync %d\nsyncing %d\n",
+ bits[BitUnwritten], bits[BitClean], bits[BitDirty],
+ bits[BitNeedSync], bits[BitSyncing]);
+}
+
+static struct md_sysfs_entry llbitmap_bits =
+__ATTR_RO(bits);
+
+static ssize_t metadata_show(struct mddev *mddev, char *page)
+{
+ struct llbitmap *llbitmap;
+ ssize_t ret;
+
+ mutex_lock(&mddev->bitmap_info.mutex);
+ llbitmap = mddev->bitmap;
+ if (!llbitmap) {
+ mutex_unlock(&mddev->bitmap_info.mutex);
+ return sprintf(page, "no bitmap\n");
+ }
+
+ ret = sprintf(page, "chunksize %lu\nchunkshift %lu\nchunks %lu\noffset %llu\ndaemon_sleep %lu\n",
+ llbitmap->chunksize, llbitmap->chunkshift,
+ llbitmap->chunks, mddev->bitmap_info.offset,
+ llbitmap->mddev->bitmap_info.daemon_sleep);
+ mutex_unlock(&mddev->bitmap_info.mutex);
+
+ return ret;
+}
+
+static struct md_sysfs_entry llbitmap_metadata =
+__ATTR_RO(metadata);
+
+static ssize_t
+daemon_sleep_show(struct mddev *mddev, char *page)
+{
+ return sprintf(page, "%lu\n", mddev->bitmap_info.daemon_sleep);
+}
+
+static ssize_t
+daemon_sleep_store(struct mddev *mddev, const char *buf, size_t len)
+{
+ unsigned long timeout;
+ int rv = kstrtoul(buf, 10, &timeout);
+
+ if (rv)
+ return rv;
+
+ mddev->bitmap_info.daemon_sleep = timeout;
+ return len;
+}
+
+static struct md_sysfs_entry llbitmap_daemon_sleep =
+__ATTR_RW(daemon_sleep);
+
+static struct attribute *md_llbitmap_attrs[] = {
+ &llbitmap_bits.attr,
+ &llbitmap_metadata.attr,
+ &llbitmap_daemon_sleep.attr,
+ NULL
+};
+
+static struct attribute_group md_llbitmap_group = {
+ .name = "llbitmap",
+ .attrs = md_llbitmap_attrs,
+};
+
static struct bitmap_operations llbitmap_ops = {
.head = {
.type = MD_BITMAP,
@@ -1274,4 +1378,6 @@ static struct bitmap_operations llbitmap_ops = {
.copy_from_slot = llbitmap_copy_from_slot,
.set_pages = llbitmap_set_pages,
.free = llbitmap_free,
+
+ .group = &md_llbitmap_group,
};
--
2.39.2
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH RFC v2 14/14] md/md-llbitmap: add Kconfig
2025-03-28 6:08 [PATCH RFC v2 00/14] md: introduce a new lockless bitmap Yu Kuai
` (12 preceding siblings ...)
2025-03-28 6:08 ` [PATCH RFC v2 13/14] md/md-llbitmap: implement sysfs APIs Yu Kuai
@ 2025-03-28 6:08 ` Yu Kuai
2025-03-28 11:06 ` [PATCH RFC v2 00/14] md: introduce a new lockless bitmap Christoph Hellwig
2025-04-04 9:27 ` Christoph Hellwig
15 siblings, 0 replies; 27+ messages in thread
From: Yu Kuai @ 2025-03-28 6:08 UTC (permalink / raw)
To: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun
From: Yu Kuai <yukuai3@huawei.com>
A new config option, MD_LLBITMAP, is added; users can now use llbitmap to
replace the old bitmap.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
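One possible shape for the init/exit ordering is sketched below in plain C.
As posted, md_llbitmap_init() appears to return the result of
register_md_submodule() directly, without destroying the two workqueues if
registration fails, and md_llbitmap_exit() destroys the workqueues before
unregistering. The sketch models each resource as a flag; struct sketch_res,
sketch_init() and sketch_exit() are hypothetical names that only illustrate
goto-based unwinding in reverse order of setup.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for two workqueues and a registration. */
struct sketch_res {
	bool io_wq;
	bool unplug_wq;
	bool registered;
};

/* fail_at: 0 = no failure, 1..3 = the setup step that fails (test hook). */
int sketch_init(struct sketch_res *r, int fail_at)
{
	if (fail_at == 1)
		return -1;
	r->io_wq = true;

	if (fail_at == 2)
		goto err_io_wq;
	r->unplug_wq = true;

	if (fail_at == 3)
		goto err_unplug_wq;
	r->registered = true;
	return 0;

err_unplug_wq:
	r->unplug_wq = false;
err_io_wq:
	r->io_wq = false;
	return -1;
}

void sketch_exit(struct sketch_res *r)
{
	/* Tear down in the reverse order of sketch_init(). */
	r->registered = false;
	r->unplug_wq = false;
	r->io_wq = false;
}
```

Whichever step fails, nothing set up before it is left behind, and the exit
path stops new users first before releasing what they depend on.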
drivers/md/Kconfig | 12 ++++++++++++
drivers/md/Makefile | 1 +
drivers/md/md-bitmap.h | 13 +++++++++++++
drivers/md/md-llbitmap.c | 27 +++++++++++++++++++++++++++
drivers/md/md.c | 7 +++++++
5 files changed, 60 insertions(+)
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 0da07182494c..e5dc893ad09a 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -52,6 +52,18 @@ config MD_BITMAP
If unsure, say Y.
+config MD_LLBITMAP
+ bool "MD RAID lockless bitmap support"
+ default n
+ depends on BLK_DEV_MD
+ help
+ If you say Y here, support for the lockless write intent bitmap will
+ be enabled.
+
+ Note, this is an experimental feature.
+
+ If unsure, say N.
+
config MD_AUTODETECT
bool "Autodetect RAID arrays during kernel boot"
depends on BLK_DEV_MD=y
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 811731840a5c..e70e4d3cbe29 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -39,6 +39,7 @@ linear-y += md-linear.o
obj-$(CONFIG_MD_LINEAR) += linear.o
obj-$(CONFIG_MD_RAID0) += raid0.o
obj-$(CONFIG_MD_BITMAP) += md-bitmap.o
+obj-$(CONFIG_MD_LLBITMAP) += md-llbitmap.o
obj-$(CONFIG_MD_RAID1) += raid1.o
obj-$(CONFIG_MD_RAID10) += raid10.o
obj-$(CONFIG_MD_RAID456) += raid456.o
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 7a3cd2f70772..ee0c65e87dba 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -171,4 +171,17 @@ static inline void md_bitmap_exit(void)
}
#endif
+#ifdef CONFIG_MD_LLBITMAP
+int md_llbitmap_init(void);
+void md_llbitmap_exit(void);
+#else
+static inline int md_llbitmap_init(void)
+{
+ return 0;
+}
+static inline void md_llbitmap_exit(void)
+{
+}
+#endif
+
#endif
diff --git a/drivers/md/md-llbitmap.c b/drivers/md/md-llbitmap.c
index 88ba29111e13..5e2f89137feb 100644
--- a/drivers/md/md-llbitmap.c
+++ b/drivers/md/md-llbitmap.c
@@ -1381,3 +1381,30 @@ static struct bitmap_operations llbitmap_ops = {
.group = &md_llbitmap_group,
};
+
+int md_llbitmap_init(void)
+{
+ md_llbitmap_io_wq = alloc_workqueue("md_llbitmap_io",
+ WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
+ if (!md_llbitmap_io_wq)
+ return -ENOMEM;
+
+ md_llbitmap_unplug_wq = alloc_workqueue("md_llbitmap_unplug",
+ WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
+ if (!md_llbitmap_unplug_wq) {
+ destroy_workqueue(md_llbitmap_io_wq);
+ md_llbitmap_io_wq = NULL;
+ return -ENOMEM;
+ }
+
+ return register_md_submodule(&llbitmap_ops.head);
+}
+
+void md_llbitmap_exit(void)
+{
+ destroy_workqueue(md_llbitmap_io_wq);
+ md_llbitmap_io_wq = NULL;
+ destroy_workqueue(md_llbitmap_unplug_wq);
+ md_llbitmap_unplug_wq = NULL;
+ unregister_md_submodule(&llbitmap_ops.head);
+}
diff --git a/drivers/md/md.c b/drivers/md/md.c
index c1f13288069a..4d05eae69795 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -10104,6 +10104,10 @@ static int __init md_init(void)
if (ret)
return ret;
+ ret = md_llbitmap_init();
+ if (ret)
+ goto err_bitmap;
+
ret = -ENOMEM;
md_wq = alloc_workqueue("md", WQ_MEM_RECLAIM, 0);
if (!md_wq)
@@ -10135,6 +10139,8 @@ static int __init md_init(void)
err_misc_wq:
destroy_workqueue(md_wq);
err_wq:
+ md_llbitmap_exit();
+err_bitmap:
md_bitmap_exit();
return ret;
}
@@ -10439,6 +10445,7 @@ static __exit void md_exit(void)
destroy_workqueue(md_misc_wq);
destroy_workqueue(md_wq);
+ md_llbitmap_exit();
md_bitmap_exit();
}
--
2.39.2
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH RFC v2 00/14] md: introduce a new lockless bitmap
2025-03-28 6:08 [PATCH RFC v2 00/14] md: introduce a new lockless bitmap Yu Kuai
` (13 preceding siblings ...)
2025-03-28 6:08 ` [PATCH RFC v2 14/14] md/md-llbitmap: add Kconfig Yu Kuai
@ 2025-03-28 11:06 ` Christoph Hellwig
2025-03-29 1:11 ` Yu Kuai
2025-04-04 9:27 ` Christoph Hellwig
15 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-03-28 11:06 UTC (permalink / raw)
To: Yu Kuai
Cc: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3,
linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun
On Fri, Mar 28, 2025 at 02:08:39PM +0800, Yu Kuai wrote:
> A hidden disk, named mdxxx_bitmap, is created for bitmap, see details in
> llbitmap_add_disk(). And a file is created as well to manage bitmap IO for
> this disk, see details in llbitmap_open_disk(). Read/write bitmap is
> converted to buffer IO to this file.
>
> IO fast path will set bits to dirty, and those dirty bits will be cleared
> by daemon after IO is done. llbitmap_barrier is used to synchronize between
> IO path and daemon;
Why do you need a separate gendisk? I'll try to find some time to read
the code to understand what it does, but it would also be really useful
to explain the need for such an unusual concept here.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH RFC v2 00/14] md: introduce a new lockless bitmap
2025-03-28 11:06 ` [PATCH RFC v2 00/14] md: introduce a new lockless bitmap Christoph Hellwig
@ 2025-03-29 1:11 ` Yu Kuai
2025-04-09 8:32 ` Christoph Hellwig
0 siblings, 1 reply; 27+ messages in thread
From: Yu Kuai @ 2025-03-29 1:11 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song,
linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, yukuai (C)
Hi,
On 2025/03/28 19:06, Christoph Hellwig wrote:
> On Fri, Mar 28, 2025 at 02:08:39PM +0800, Yu Kuai wrote:
>> A hidden disk, named mdxxx_bitmap, is created for bitmap, see details in
>> llbitmap_add_disk(). And a file is created as well to manage bitmap IO for
>> this disk, see details in llbitmap_open_disk(). Read/write bitmap is
>> converted to buffer IO to this file.
>>
>> IO fast path will set bits to dirty, and those dirty bits will be cleared
>> by daemon after IO is done. llbitmap_barrier is used to synchronize between
>> IO path and daemon;
>
> Why do you need a separate gendisk? I'll try to find some time to read
> the code to understand what it does, but it would also be really useful
> to explain the need for such an unusual concept here.
The purpose here is to hide the low-level bitmap IO implementation behind
the disk->submit_bio() API, so that bitmap IO can be converted to buffered
IO on the bdev_file. This is the easiest way I can think of to reuse the
page cache, with its natural ability to write back dirty pages. I did
think about creating a new anon file and implementing a new
file_operations, but that would be much more complicated.
Meanwhile, the bitmap file for the old bitmap will be removed sooner or
later, and this bdev_file implementation will be compatible with the bitmap
file as well.
Thanks,
Kuai
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH RFC v2 00/14] md: introduce a new lockless bitmap
2025-03-28 6:08 [PATCH RFC v2 00/14] md: introduce a new lockless bitmap Yu Kuai
` (14 preceding siblings ...)
2025-03-28 11:06 ` [PATCH RFC v2 00/14] md: introduce a new lockless bitmap Christoph Hellwig
@ 2025-04-04 9:27 ` Christoph Hellwig
2025-04-07 1:09 ` Yu Kuai
15 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-04-04 9:27 UTC (permalink / raw)
To: Yu Kuai
Cc: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3,
linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun
On Fri, Mar 28, 2025 at 02:08:39PM +0800, Yu Kuai wrote:
> 1) user must apply the following mdadm patch, and then llbitmap can be
> enabled by --bitmap=lockless
> https://lore.kernel.org/all/20250327134853.1069356-1-yukuai1@huaweicloud.com/
> 2) this set is cooked on the top of my other set:
> https://lore.kernel.org/all/20250219083456.941760-1-yukuai1@huaweicloud.com/
I tried to create a tree to review the entire thing but failed. Can you
please also provide a working git branch?
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH RFC v2 02/14] md/md-bitmap: pass discard information to bitmap_{start, end}write
2025-03-28 6:08 ` [PATCH RFC v2 02/14] md/md-bitmap: pass discard information to bitmap_{start, end}write Yu Kuai
@ 2025-04-04 9:29 ` Christoph Hellwig
2025-04-07 1:19 ` Yu Kuai
0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-04-04 9:29 UTC (permalink / raw)
To: Yu Kuai
Cc: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song, yukuai3,
linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun
> int (*startwrite)(struct mddev *mddev, sector_t offset,
> - unsigned long sectors);
> + unsigned long sectors, bool is_discard);
> void (*endwrite)(struct mddev *mddev, sector_t offset,
> - unsigned long sectors);
> + unsigned long sectors, bool is_discard);
A bool is_discard is not a very good interface. I'd expect an op enum or a set
of flags to properly describe it.
But is start/end write really the right interface for discard, or should it
have its own set of ops?
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH RFC v2 00/14] md: introduce a new lockless bitmap
2025-04-04 9:27 ` Christoph Hellwig
@ 2025-04-07 1:09 ` Yu Kuai
0 siblings, 0 replies; 27+ messages in thread
From: Yu Kuai @ 2025-04-07 1:09 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: xni, colyli, axboe, agk, snitzer, mpatocka, song, linux-block,
linux-kernel, dm-devel, linux-raid, yi.zhang, yangerkun,
yukuai (C)
Hi,
On 2025/04/04 17:27, Christoph Hellwig wrote:
> On Fri, Mar 28, 2025 at 02:08:39PM +0800, Yu Kuai wrote:
>> 1) user must apply the following mdadm patch, and then llbitmap can be
>> enabled by --bitmap=lockless
>> https://lore.kernel.org/all/20250327134853.1069356-1-yukuai1@huaweicloud.com/
>> 2) this set is cooked on the top of my other set:
>> https://lore.kernel.org/all/20250219083456.941760-1-yukuai1@huaweicloud.com/
>
> I tried to create a tree to review the entire thing but failed. Can you
> please also provide a working git branch?
Of course, here is the branch:
https://git.kernel.org/pub/scm/linux/kernel/git/yukuai/linux.git/log/?h=md-6.15
Thanks,
Kuai
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH RFC v2 02/14] md/md-bitmap: pass discard information to bitmap_{start, end}write
2025-04-04 9:29 ` Christoph Hellwig
@ 2025-04-07 1:19 ` Yu Kuai
0 siblings, 0 replies; 27+ messages in thread
From: Yu Kuai @ 2025-04-07 1:19 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: hch, xni, colyli, axboe, agk, snitzer, mpatocka, song,
linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, yukuai (C)
Hi,
On 2025/04/04 17:29, Christoph Hellwig wrote:
>> int (*startwrite)(struct mddev *mddev, sector_t offset,
>> - unsigned long sectors);
>> + unsigned long sectors, bool is_discard);
>> void (*endwrite)(struct mddev *mddev, sector_t offset,
>> - unsigned long sectors);
>> + unsigned long sectors, bool is_discard);
>
> A bool is_discard is not a very good interface. I'd expect an op enum or a set
> of flags to properly describe it.
Will update in the next version.
>
> But is start/end write really the right interface for discard, or should it
> have its own set of ops?
Yes, this is a historical issue. The old bitmap handles discard the same as
a normal write, while the new bitmap handles them differently. And I agree that
adding new ops for discard is better in the long term.
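To make the difference concrete, here is a small user-space sketch of how
the chunk range could be derived per operation, using an op enum rather than
a bool. It follows the comment in the patch ("round start up, and end
down"); note that the posted code computes `end` for discard with the same
expression as for a write, which appears to include a partially covered last
chunk, whereas this sketch rounds the end down strictly. The names
sketch_op, sketch_range and sketch_chunk_range() are hypothetical and not
part of the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical op type standing in for the bool is_discard argument. */
enum sketch_op {
	SKETCH_OP_WRITE,
	SKETCH_OP_DISCARD,
};

struct sketch_range {
	uint64_t start;
	uint64_t end;	/* inclusive chunk index, valid only if !empty */
	bool empty;	/* true if no chunk is affected */
};

/* chunkshift is log2 of the chunk size in sectors; sectors must be > 0. */
struct sketch_range sketch_chunk_range(enum sketch_op op, uint64_t offset,
				       uint64_t sectors,
				       unsigned int chunkshift)
{
	uint64_t chunksize = (uint64_t)1 << chunkshift;
	struct sketch_range r = { 0, 0, false };
	uint64_t end_excl;

	if (op == SKETCH_OP_WRITE) {
		/* Every chunk the write touches, even partially. */
		r.start = offset >> chunkshift;
		r.end = (offset + sectors - 1) >> chunkshift;
		return r;
	}

	/* Discard: only chunks lying entirely inside the range. */
	r.start = (offset + chunksize - 1) >> chunkshift;	/* round up */
	end_excl = (offset + sectors) >> chunkshift;		/* round down */
	if (end_excl <= r.start) {
		r.empty = true;
		return r;
	}
	r.end = end_excl - 1;
	return r;
}
```

With 8-sector chunks, a write of sectors 5..24 dirties chunks 0..3, while a
discard of the same sectors can only mark chunks 1..2, since chunks 0 and 3
are only partially covered.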
Thanks,
Kuai
>
* Re: [PATCH RFC v2 00/14] md: introduce a new lockless bitmap
2025-03-29 1:11 ` Yu Kuai
@ 2025-04-09 8:32 ` Christoph Hellwig
2025-04-09 9:27 ` Yu Kuai
0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-04-09 8:32 UTC (permalink / raw)
To: Yu Kuai
Cc: xni, colyli, axboe, agk, snitzer, mpatocka, song, linux-block,
linux-kernel, dm-devel, linux-raid, yi.zhang, yangerkun,
yukuai (C), kbusch
On Sat, Mar 29, 2025 at 09:11:13AM +0800, Yu Kuai wrote:
> The purpose here is to hide the low level bitmap IO implementation to
> the API disk->submit_bio(), and the bitmap IO can be converted to buffer
> IO to the bdev_file. This is the easiest way that I can think of to
> reuse the pagecache, with natural ability for dirty page writeback. I do
> think about creating a new anon file and implement a new
> file_operations, this will be much more complicated.
I've started looking at this a bit now, sorry for the delay.
As far as I can see you use the bitmap file just so that you have your
own struct address_space and thus page cache instance and then call
read_mapping_page and filemap_write_and_wait_range on it right?
For that you'd be much better off just creating your own trivial
file_system_type with an inode fully controlled by your driver
that has a trivial set of address_space ops instead of oddly
mixing with the block layer.
Note that either way I'm not sure using the page cache here is all
that good an idea, as we're at the bottom of the I/O stack and
thus memory allocations can very easily deadlock.
What speaks against using your own folios explicitly allocated at
probe time and then just doing manual submit_bio on that? That's
probably not much more code but a lot more robust.
Also a high level note: the bitmap_operations aren't a very nice
interface. A lot of methods are empty and should just be called
conditionally. Or even better you'd do away with the expensive
indirect calls and just directly call either the old or new
bitmap code.
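A minimal sketch of the "call conditionally" idea, in plain C: optional
methods are left NULL in the ops table and a thin wrapper checks for them,
so a backend does not need empty stubs. The names sketch_ops, sketch_bump()
and sketch_write_all() are hypothetical and chosen only for this
illustration; they are not the md bitmap_operations interface.

```c
#include <stddef.h>

/* Hypothetical ops table: write_all is optional and may be NULL. */
struct sketch_ops {
	void (*flush)(int *state);
	void (*write_all)(int *state);
};

/* Example method body: just records that it ran. */
void sketch_bump(int *state)
{
	(*state)++;
}

/* Call the optional method only when the backend provides it. */
void sketch_write_all(const struct sketch_ops *ops, int *state)
{
	if (ops->write_all)
		ops->write_all(state);
}
```

A backend that has nothing to do for write_all then simply omits the
pointer, and the caller pays for a NULL check instead of an indirect call
into an empty function.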
> Meanwhile, bitmap file for the old bitmap will be removed sooner or
> later, and this bdev_file implementation will compatible with bitmap
> file as well.
Which would also mean that at that point the operations vector would
be pointless, so we might as well not add it to start with.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH RFC v2 00/14] md: introduce a new lockless bitmap
2025-04-09 8:32 ` Christoph Hellwig
@ 2025-04-09 9:27 ` Yu Kuai
2025-04-09 9:40 ` Christoph Hellwig
0 siblings, 1 reply; 27+ messages in thread
From: Yu Kuai @ 2025-04-09 9:27 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: xni, colyli, axboe, agk, snitzer, mpatocka, song, linux-block,
linux-kernel, dm-devel, linux-raid, yi.zhang, yangerkun, kbusch,
yukuai (C)
Hi,
On 2025/04/09 16:32, Christoph Hellwig wrote:
> On Sat, Mar 29, 2025 at 09:11:13AM +0800, Yu Kuai wrote:
>> The purpose here is to hide the low level bitmap IO implementation to
>> the API disk->submit_bio(), and the bitmap IO can be converted to buffer
>> IO to the bdev_file. This is the easiest way that I can think of to
>> reuse the pagecache, with natural ability for dirty page writeback. I do
>> think about creating a new anon file and implement a new
>> file_operations, this will be much more complicated.
>
> I've started looking at this a bit now, sorry for the delay.
>
> As far as I can see you use the bitmap file just so that you have your
> own struct address_space and thus page cache instance and then call
> read_mapping_page and filemap_write_and_wait_range on it right?
Yes.
>
> For that you'd be much better off just creating your own trivial
> file_system_type with an inode fully controlled by your driver
> that has a trivial set of address_space ops instead of oddly
> mixing with the block layer.
Yes, this is exactly what I meant by implementing a new file_operations (and
address_space ops). I wanted to do this the easy way and just reuse the raw
block device ops; this way I only need to implement the submit_bio op
for the new hidden disk.
I can try a new fs type if we really think this solution is too
hacky; however, it will take many more lines of code. :(
>
> Note that either way I'm not sure using the page cache here is all
> that good an idea, as we're at the bottom of the I/O stack and
> thus memory allocations can very easily deadlock.
Yes. For the bitmap pages, this set takes the easy way: it reads and pins
all related pages while loading the bitmap, for two reasons:
1) We don't need to allocate or read pages from the IO path (in the first
RFC version, I used a worker to do that).
2) In the first RFC version, I looked up and got the page in the IO path; it
turned out that the page reference is an *atomic*, and the overhead was not
acceptable.
The only action from the IO path is that, if a bitmap page is dirty,
filemap_write_and_wait_range() is called from an async worker, the same as
the old bitmap, to flush the dirty bitmap pages.
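To illustrate the load-time idea in isolation, here is a rough userspace
sketch, not kernel code. All sketch_* names and the 4 KiB page size are
assumptions made up for this example. Every page is allocated once at load
time, so the lookup used on the IO path never allocates:

```c
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size for this sketch */

/* Hypothetical page table: every bitmap page is allocated up front. */
struct sketch_cache {
	unsigned char **pages;
	size_t nr_pages;
};

/* Load time: allocation may fail here, where failing is still harmless. */
static int sketch_cache_load(struct sketch_cache *c, size_t nr_pages)
{
	size_t i;

	c->nr_pages = 0;
	c->pages = calloc(nr_pages, sizeof(*c->pages));
	if (!c->pages)
		return -1;

	for (i = 0; i < nr_pages; i++) {
		c->pages[i] = calloc(1, SKETCH_PAGE_SIZE);
		if (!c->pages[i])
			goto fail;
	}
	c->nr_pages = nr_pages;
	return 0;

fail:
	/* Undo the pages allocated so far. */
	while (i--)
		free(c->pages[i]);
	free(c->pages);
	c->pages = NULL;
	return -1;
}

/* IO path: a plain lookup, no allocation; NULL if idx is out of range. */
static unsigned char *sketch_cache_get(struct sketch_cache *c, size_t idx)
{
	return idx < c->nr_pages ? c->pages[idx] : NULL;
}

/* Teardown: release everything that was allocated at load time. */
static void sketch_cache_free(struct sketch_cache *c)
{
	size_t i;

	for (i = 0; i < c->nr_pages; i++)
		free(c->pages[i]);
	free(c->pages);
	c->pages = NULL;
	c->nr_pages = 0;
}
```

The point of the split is that the only function that can fail for lack of
memory runs before any IO depends on the bitmap.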
>
> What speaks against using your own folios explicitly allocated at
> probe time and then just doing manual submit_bio on that? That's
> probably not much more code but a lot more robust.
I'm not quite sure I understand you correctly. Do you mean not using the
pagecache for bitmap IO and manually creating BIOs like the old bitmap,
while inventing a new solution for synchronization instead of the global
spin_lock from the old bitmap?
Thanks,
Kuai
>
> Also a high level note: the bitmap_operations aren't a very nice
> interface. A lot of methods are empty and should just be called
> conditionally. Or even better you'd do away with the expensive
> indirect calls and just directly call either the old or new
> bitmap code.
>
>> Meanwhile, the bitmap file for the old bitmap will be removed sooner or
>> later, and this bdev_file implementation will be compatible with the
>> bitmap file as well.
>
> Which would also mean that at that point the operations vector would
> be pointless, so we might as well not add it to start with.
>
> .
>
* Re: [PATCH RFC v2 00/14] md: introduce a new lockless bitmap
2025-04-09 9:27 ` Yu Kuai
@ 2025-04-09 9:40 ` Christoph Hellwig
2025-04-11 1:36 ` Yu Kuai
0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-04-09 9:40 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, xni, colyli, axboe, agk, snitzer, mpatocka,
song, linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, kbusch, yukuai (C)
On Wed, Apr 09, 2025 at 05:27:11PM +0800, Yu Kuai wrote:
>> For that you'd be much better off just creating your own trivial
>> file_system_type with an inode fully controlled by your driver
>> that has a trivial set of address_space ops instead of oddly
>> mixing with the block layer.
>
> Yes, this is exactly what I meant by implementing a new file_operations (and
> address_space ops). I wanted to do this the easy way and just reuse the raw
> block device ops, so that I only need to implement the submit_bio op for the
> new hidden disk.
>
> I can try a new fs type if we really think this solution is too hacky;
> however, it will take many more lines of code. :(
I don't think it should be much more. It'll also remove the rather
unexpected indirection through submit_bio. Just make sure you use
iomap for your operations, and implement the submit_io hook. That
will also be more efficient than the buffer_head based block ops
for writes.
>>
>> Note that either way I'm not sure using the page cache here is an
>> all that good idea, as we're at the bottom of the I/O stack and
>> thus memory allocations can very easily deadlock.
>
> Yes. For the bitmap pages, this set takes the easy way: it reads and pins
> all related pages while loading the bitmap, for two reasons:
>
> 1) We don't need to allocate or read pages from the IO path (in the first
> RFC version, I used a worker to do that).
You still depend on the worker, which will still deadlock.
>> What speaks against using your own folios explicitly allocated at
>> probe time and then just doing manual submit_bio on that? That's
>> probably not much more code but a lot more robust.
>
> I'm not quite sure I understand you correctly. Do you mean not using the
> pagecache for bitmap IO and manually creating BIOs like the old bitmap,
> while inventing a new solution for synchronization instead of the global
> spin_lock from the old bitmap?
Yes. Alternatively you need to pre-populate the page cache and keep
extra page references.
* Re: [PATCH RFC v2 00/14] md: introduce a new lockless bitmap
2025-04-09 9:40 ` Christoph Hellwig
@ 2025-04-11 1:36 ` Yu Kuai
2025-04-19 8:46 ` Yu Kuai
0 siblings, 1 reply; 27+ messages in thread
From: Yu Kuai @ 2025-04-11 1:36 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: xni, colyli, axboe, agk, snitzer, mpatocka, song, linux-block,
linux-kernel, dm-devel, linux-raid, yi.zhang, yangerkun, kbusch,
yukuai (C)
Hi,
On 2025/04/09 17:40, Christoph Hellwig wrote:
> On Wed, Apr 09, 2025 at 05:27:11PM +0800, Yu Kuai wrote:
>>> For that you'd be much better off just creating your own trivial
>>> file_system_type with an inode fully controlled by your driver
>>> that has a trivial set of address_space ops instead of oddly
>>> mixing with the block layer.
>>
>> Yes, this is exactly what I meant by implementing a new file_operations (and
>> address_space ops). I wanted to do this the easy way and just reuse the raw
>> block device ops, so that I only need to implement the submit_bio op for the
>> new hidden disk.
>>
>> I can try a new fs type if we really think this solution is too hacky;
>> however, it will take many more lines of code. :(
>
> I don't think it should be much more. It'll also remove the rather
> unexpected indirection through submit_bio. Just make sure you use
> iomap for your operations, and implement the submit_io hook. That
> will also be more efficient than the buffer_head based block ops
> for writes.
>
>>>
>>> Note that either way I'm not sure using the page cache here is an
>>> all that good idea, as we're at the bottom of the I/O stack and
>>> thus memory allocations can very easily deadlock.
>>
>> Yes. For the bitmap pages, this set takes the easy way: it reads and pins
>> all related pages while loading the bitmap, for two reasons:
>>
>> 1) We don't need to allocate or read pages from the IO path (in the first
>> RFC version, I used a worker to do that).
>
> You still depend on the worker, which will still deadlock.
>
>>> What speaks against using your own folios explicitly allocated at
>>> probe time and then just doing manual submit_bio on that? That's
>>> probably not much more code but a lot more robust.
>>
>> I'm not quite sure I understand you correctly. Do you mean not using the
>> pagecache for bitmap IO and manually creating BIOs like the old bitmap,
>> while inventing a new solution for synchronization instead of the global
>> spin_lock from the old bitmap?
>
> Yes. Alternatively you need to pre-populate the page cache and keep
> extra page references.
OK, I'll think about self-managed pages and the IO path. Meanwhile, please
let me know if you have questions about other parts.
Thanks,
Kuai
>
> .
>
* Re: [PATCH RFC v2 00/14] md: introduce a new lockless bitmap
2025-04-11 1:36 ` Yu Kuai
@ 2025-04-19 8:46 ` Yu Kuai
2025-04-21 7:39 ` Christoph Hellwig
0 siblings, 1 reply; 27+ messages in thread
From: Yu Kuai @ 2025-04-19 8:46 UTC (permalink / raw)
To: Yu Kuai, Christoph Hellwig
Cc: xni, colyli, axboe, agk, snitzer, mpatocka, song, linux-block,
linux-kernel, dm-devel, linux-raid, yi.zhang, yangerkun, kbusch,
yukuai (C)
Hi, Christoph!
On 2025/04/11 9:36, Yu Kuai wrote:
> Hi,
>
> On 2025/04/09 17:40, Christoph Hellwig wrote:
>> On Wed, Apr 09, 2025 at 05:27:11PM +0800, Yu Kuai wrote:
>>>> For that you'd be much better off just creating your own trivial
>>>> file_system_type with an inode fully controlled by your driver
>>>> that has a trivial set of address_space ops instead of oddly
>>>> mixing with the block layer.
>>>
>>> Yes, this is exactly what I meant by implementing a new file_operations (and
>>> address_space ops). I wanted to do this the easy way and just reuse the raw
>>> block device ops, so that I only need to implement the submit_bio op for the
>>> new hidden disk.
>>>
>>> I can try a new fs type if we really think this solution is too hacky;
>>> however, it will take many more lines of code. :(
>>
>> I don't think it should be much more. It'll also remove the rather
>> unexpected indirection through submit_bio. Just make sure you use
>> iomap for your operations, and implement the submit_io hook. That
>> will also be more efficient than the buffer_head based block ops
>> for writes.
>>
>>>>
>>>> Note that either way I'm not sure using the page cache here is an
>>>> all that good idea, as we're at the bottom of the I/O stack and
>>>> thus memory allocations can very easily deadlock.
>>>
>>> Yes. For the bitmap pages, this set takes the easy way: it reads and pins
>>> all related pages while loading the bitmap, for two reasons:
>>>
>>> 1) We don't need to allocate or read pages from the IO path (in the first
>>> RFC version, I used a worker to do that).
>>
>> You still depend on the worker, which will still deadlock.
>>
>>>> What speaks against using your own folios explicitly allocated at
>>>> probe time and then just doing manual submit_bio on that? That's
>>>> probably not much more code but a lot more robust.
>>>
>>> I'm not quite sure I understand you correctly. Do you mean not using the
>>> pagecache for bitmap IO and manually creating BIOs like the old bitmap,
>>> while inventing a new solution for synchronization instead of the global
>>> spin_lock from the old bitmap?
>>
>> Yes. Alternatively you need to pre-populate the page cache and keep
>> extra page references.
>
> OK, I'll think about self-managed pages and the IO path. Meanwhile, please
> let me know if you have questions about other parts.
So, today I implemented a version, and I do admit this way is much
simpler; it turns out to be 200 fewer lines of code in total. And can you
check the following untested code and see if you agree with the
implementation? I'll start working on a new version if you agree.
Thanks,
Kuai
static int llbitmap_rdev_page_io(struct md_rdev *rdev, struct page *page,
				 int idx, bool rw)
{
	struct bio_vec bvec;
	struct bio bio;
	int ret;

	/* bi_inline_vecs has no storage in an on-stack bio; use a local vec. */
	bio_init(&bio, rdev->bdev, &bvec, 1,
		 (rw ? REQ_OP_WRITE : REQ_OP_READ) |
		 REQ_SYNC | REQ_IDLE | REQ_META);
	/* __bio_add_page() already accounts PAGE_SIZE in bi_iter.bi_size. */
	__bio_add_page(&bio, page, PAGE_SIZE, 0);
	/* Bitmap page idx starts idx pages past the start of the bitmap. */
	bio.bi_iter.bi_sector = rdev->sb_start +
				rdev->mddev->bitmap_info.offset +
				((sector_t)idx << PAGE_SECTORS_SHIFT);

	ret = submit_bio_wait(&bio);
	bio_uninit(&bio);
	if (ret)
		md_error(rdev->mddev, rdev);

	return ret;
}
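As a side note on the sector arithmetic, the start of a given bitmap page is
normally the start of the bitmap plus idx pages' worth of sectors. A minimal
userspace sketch, assuming 4 KiB pages and 512-byte sectors; the SKETCH_* and
sketch_* names are hypothetical and not part of the patch:

```c
#include <stdint.h>

/* Assumed values for illustration: 4 KiB pages, 512-byte sectors. */
#define SKETCH_PAGE_SHIFT		12
#define SKETCH_SECTOR_SHIFT		9
#define SKETCH_PAGE_SECTORS_SHIFT	(SKETCH_PAGE_SHIFT - SKETCH_SECTOR_SHIFT)

/*
 * Hypothetical helper: first sector of bitmap page idx on a member disk,
 * given the superblock start and the bitmap offset, both in sectors.
 */
static uint64_t sketch_bitmap_page_sector(uint64_t sb_start,
					  uint64_t bitmap_offset,
					  unsigned int idx)
{
	/* Widen idx before shifting so large indexes cannot overflow. */
	return sb_start + bitmap_offset +
	       ((uint64_t)idx << SKETCH_PAGE_SECTORS_SHIFT);
}
```

With these assumed sizes each page covers 8 sectors, so consecutive page
indexes map to start sectors that are 8 apart.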
static struct page *llbitmap_read_page(struct llbitmap *llbitmap, int idx)
{
	struct page *page = llbitmap->pages[idx];
	struct mddev *mddev = llbitmap->mddev;
	struct md_rdev *rdev;
	int err = -EIO;

	if (page)
		return page;

	page = alloc_page(GFP_KERNEL);
	if (!page)
		return ERR_PTR(-ENOMEM);

	/* Stop at the first healthy member disk that can be read. */
	rdev_for_each(rdev, mddev) {
		if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
			continue;

		err = llbitmap_rdev_page_io(rdev, page, idx, READ);
		if (!err)
			break;
	}

	if (err) {
		__free_page(page);
		return ERR_PTR(err);
	}

	return page;
}
static int llbitmap_write_page(struct llbitmap *llbitmap, int idx)
{
	struct page *page = llbitmap->pages[idx];
	struct mddev *mddev = llbitmap->mddev;
	struct md_rdev *rdev;
	int err = -EIO;

	if (!page)
		return err;

	/* Write to every healthy member disk; one success is enough. */
	rdev_for_each(rdev, mddev) {
		if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
			continue;

		if (!llbitmap_rdev_page_io(rdev, page, idx, WRITE))
			err = 0;
	}

	return err;
}
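The write policy above (try every healthy member, report success if at least
one copy was written) can be modelled in userspace roughly as follows. This is
only a sketch: struct sketch_disk and its fields are made-up stand-ins for a
member disk, not kernel structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a member disk; not a kernel structure. */
struct sketch_disk {
	bool faulty;	/* skipped entirely, like a Faulty member */
	bool io_fails;	/* simulates an IO error on this disk */
	int writes;	/* number of writes that reached the disk */
};

/* Simulated single-disk write: 0 on success, -1 on error. */
static int sketch_disk_write(struct sketch_disk *d)
{
	if (d->io_fails)
		return -1;
	d->writes++;
	return 0;
}

/*
 * Write to every healthy disk and keep going after a failure.
 * The result is 0 if at least one copy was written, -1 otherwise.
 */
static int sketch_write_all(struct sketch_disk *disks, size_t n)
{
	int err = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (disks[i].faulty)
			continue;
		if (!sketch_disk_write(&disks[i]))
			err = 0;
	}
	return err;
}
```

Unlike the read side, the loop does not stop at the first success, since
every healthy copy should receive the update.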
static bool llbitmap_dirty(struct llbitmap *llbitmap)
{
	int i;

	for (i = 0; i < llbitmap->nr_pages; ++i) {
		struct llbitmap_barrier *barrier = &llbitmap->barrier[i];

		if (test_bit(BitmapPageDirty, &barrier->flags))
			return true;
	}

	return false;
}
static void llbitmap_flush_dirty_page(struct llbitmap *llbitmap)
{
	int i;

	for (i = 0; i < llbitmap->nr_pages; ++i) {
		struct llbitmap_barrier *barrier = &llbitmap->barrier[i];

		if (!test_and_clear_bit(BitmapPageDirty, &barrier->flags))
			continue;

		llbitmap_write_page(llbitmap, i);
	}
}
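The per-page dirty tracking relies on an atomic test-and-clear, so a page that
is dirtied again after the flush has claimed it is picked up by the next pass.
A rough userspace analogue using C11 atomics, where atomic_exchange() plays
the role of test_and_clear_bit(); the sketch_* names and the fixed page count
are assumptions for illustration only:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_PAGES 4	/* assumed page count for this sketch */

/* Hypothetical per-page state; the kernel code uses a flags bit instead. */
struct sketch_bitmap {
	atomic_bool dirty[SKETCH_NR_PAGES];
	int flushed[SKETCH_NR_PAGES];	/* how often each page was written */
};

/* Writer side: mark a page as needing writeback. */
static void sketch_mark_dirty(struct sketch_bitmap *b, size_t idx)
{
	atomic_store(&b->dirty[idx], true);
}

/* True if any page still needs writeback. */
static bool sketch_any_dirty(struct sketch_bitmap *b)
{
	size_t i;

	for (i = 0; i < SKETCH_NR_PAGES; i++)
		if (atomic_load(&b->dirty[i]))
			return true;
	return false;
}

/* Flush pass: claim each dirty page atomically, then "write" it. */
static int sketch_flush_dirty(struct sketch_bitmap *b)
{
	int written = 0;
	size_t i;

	for (i = 0; i < SKETCH_NR_PAGES; i++) {
		/* atomic_exchange() returns the old value and clears the flag. */
		if (!atomic_exchange(&b->dirty[i], false))
			continue;
		b->flushed[i]++;	/* stands in for the real page write */
		written++;
	}
	return written;
}
```

Clearing the flag before the write, rather than after, is what keeps a
concurrent re-dirtying from being lost.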
>
> Thanks,
> Kuai
>
>>
>> .
>>
>
> .
>
* Re: [PATCH RFC v2 00/14] md: introduce a new lockless bitmap
2025-04-19 8:46 ` Yu Kuai
@ 2025-04-21 7:39 ` Christoph Hellwig
0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-04-21 7:39 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, xni, colyli, axboe, agk, snitzer, mpatocka,
song, linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, kbusch, yukuai (C)
On Sat, Apr 19, 2025 at 04:46:03PM +0800, Yu Kuai wrote:
> So, today I implemented a version, and I do admit this way is much
> simpler; it turns out to be 200 fewer lines of code in total.
That's what I thought :)
> And can you check the
> following untested code and see if you agree with the implementation? I'll
> start working on a new version if you agree.
From a very quick look this looks great.