linux-raid.vger.kernel.org archive mirror
* [PATCH 0/2] Bitmap percentage flushing
@ 2022-10-06 22:08 Jonathan Derrick
  2022-10-06 22:08 ` [RFC PATCH] mdadm: Add parameter for bitmap chunk threshold Jonathan Derrick
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Jonathan Derrick @ 2022-10-06 22:08 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, linux-kernel, jonathan.derrick, jonathanx.sk.derrick,
	Jonathan Derrick

This introduces a dirty-chunk-count flushing mechanism that works in
tandem with the delay timer. The threshold argument is expressed as a
number of dirty chunks rather than a percentage, because large drives
would require ever-smaller percentages (e.g., 1% of a 32TB drive is
already 320GB).
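
For example, using the numbers from the test results later in this
thread: at a 4MB bitmap chunk size, a threshold of 512 dirty chunks
corresponds to 512 * 4MB = 2GB of data that may need to be resynced
after an unclean shutdown, independent of the drive capacity.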

The first patch addresses a performance gap observed in RAID1
configurations. With a synchronous qd1 workload, bitmap writes can
easily account for almost half of the I/O. This may be expected
behavior, but it is undesirable. Moving the unplug operation into the
periodic delay work helps the situation.

The second patch adds a new field to the bitmap superblock (and bumps
the bitmap version), allowing a new mdadm argument to specify the
number of chunks allowed to become dirty before the bitmap is flushed.

Accompanying this set is an RFC patch for mdadm. It lacks
documentation, which will be added in v2 if this changeset is deemed
appropriate.

Jonathan Derrick (2):
  md/bitmap: Move unplug to daemon thread
  md/bitmap: Add chunk-count-based bitmap flushing

 drivers/md/md-bitmap.c | 38 +++++++++++++++++++++++++++++++++++---
 drivers/md/md-bitmap.h |  5 ++++-
 drivers/md/md.h        |  1 +
 drivers/md/raid1.c     |  2 --
 drivers/md/raid10.c    |  4 ----
 5 files changed, 40 insertions(+), 10 deletions(-)

-- 
2.31.1



* [RFC PATCH] mdadm: Add parameter for bitmap chunk threshold
  2022-10-06 22:08 [PATCH 0/2] Bitmap percentage flushing Jonathan Derrick
@ 2022-10-06 22:08 ` Jonathan Derrick
  2022-10-12  7:17   ` Mariusz Tkaczyk
  2022-10-06 22:08 ` [PATCH 1/2] md/bitmap: Move unplug to daemon thread Jonathan Derrick
  2022-10-06 22:08 ` [PATCH 2/2] md/bitmap: Add chunk-count-based bitmap flushing Jonathan Derrick
  2 siblings, 1 reply; 10+ messages in thread
From: Jonathan Derrick @ 2022-10-06 22:08 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, linux-kernel, jonathan.derrick, jonathanx.sk.derrick,
	Jonathan Derrick

Add a parameter to mdadm create, grow, and build, similar to the delay
parameter, that specifies a chunk threshold. This value instructs the
kernel, in tandem with the delay timer, to flush the bitmap once N
chunks have been dirtied. It can be used in addition to the delay
parameter and complements it.

This requires an addition to the bitmap superblock and a bitmap version
increment.

Usage: -g <Number of chunks, default 0=off>

Signed-off-by: Jonathan Derrick <jonathan.derrick@linux.dev>
---
This RFC patch lacks documentation
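
Illustrative usage (device names and values here are only examples;
-g takes a chunk count and --bitmap-chunk is in kilobytes):

  # RAID1 with an internal bitmap, 4MB bitmap chunks, 5s delay,
  # and a 512-chunk flush threshold
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --bitmap=internal --bitmap-chunk=4096 --delay=5 -g 512 \
        /dev/nvme3n1 /dev/nvme6n1

  # Add an internal bitmap with a flush threshold to an existing array
  mdadm --grow /dev/md0 --bitmap=internal -g 512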

 Build.c       |  4 ++--
 Create.c      |  8 ++++----
 Grow.c        |  8 ++++----
 ReadMe.c      | 10 +++++++---
 bitmap.c      |  7 +++++--
 bitmap.h      |  5 ++++-
 config.c      | 16 +++++++++++++++-
 mdadm.c       | 10 ++++++++++
 mdadm.h       |  5 +++--
 super-intel.c |  2 +-
 super0.c      |  3 ++-
 super1.c      |  4 +++-
 12 files changed, 60 insertions(+), 22 deletions(-)

diff --git a/Build.c b/Build.c
index 8d6f6f58..9cdf9616 100644
--- a/Build.c
+++ b/Build.c
@@ -157,7 +157,7 @@ int Build(char *mddev, struct mddev_dev *devlist,
 	if (s->bitmap_file) {
 		bitmap_fd = open(s->bitmap_file, O_RDWR);
 		if (bitmap_fd < 0) {
-			int major = BITMAP_MAJOR_HI;
+			int major = c->threshold ? BITMAP_MAJOR_CHUNKFLUSH : BITMAP_MAJOR_HI;
 #if 0
 			if (s->bitmap_chunk == UnSet) {
 				pr_err("%s cannot be opened.\n", s->bitmap_file);
@@ -166,7 +166,7 @@ int Build(char *mddev, struct mddev_dev *devlist,
 #endif
 			bitmapsize = s->size >> 9; /* FIXME wrong for RAID10 */
 			if (CreateBitmap(s->bitmap_file, 1, NULL,
-					 s->bitmap_chunk, c->delay,
+					 s->bitmap_chunk, c->delay, c->threshold,
 					 s->write_behind, bitmapsize, major)) {
 				goto abort;
 			}
diff --git a/Create.c b/Create.c
index 953e7372..9ef24f82 100644
--- a/Create.c
+++ b/Create.c
@@ -143,7 +143,7 @@ int Create(struct supertype *st, char *mddev,
 	unsigned long long newsize;
 	mdu_array_info_t inf;
 
-	int major_num = BITMAP_MAJOR_HI;
+	int major_num = c->threshold ? BITMAP_MAJOR_CHUNKFLUSH : BITMAP_MAJOR_HI;
 	if (s->bitmap_file && strcmp(s->bitmap_file, "clustered") == 0) {
 		major_num = BITMAP_MAJOR_CLUSTERED;
 		if (c->nodes <= 1) {
@@ -798,8 +798,8 @@ int Create(struct supertype *st, char *mddev,
 				st->ss->name);
 			goto abort_locked;
 		}
-		if (st->ss->add_internal_bitmap(st, &s->bitmap_chunk,
-						c->delay, s->write_behind,
+		if (st->ss->add_internal_bitmap(st, &s->bitmap_chunk, c->delay,
+						c->threshold, s->write_behind,
 						bitmapsize, 1, major_num)) {
 			pr_err("Given bitmap chunk size not supported.\n");
 			goto abort_locked;
@@ -852,7 +852,7 @@ int Create(struct supertype *st, char *mddev,
 
 		st->ss->uuid_from_super(st, uuid);
 		if (CreateBitmap(s->bitmap_file, c->force, (char*)uuid, s->bitmap_chunk,
-				 c->delay, s->write_behind,
+				 c->delay, c->threshold, s->write_behind,
 				 bitmapsize,
 				 major_num)) {
 			goto abort_locked;
diff --git a/Grow.c b/Grow.c
index e362403a..5ae91138 100644
--- a/Grow.c
+++ b/Grow.c
@@ -287,7 +287,7 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 	mdu_array_info_t array;
 	struct supertype *st;
 	char *subarray = NULL;
-	int major = BITMAP_MAJOR_HI;
+	int major = c->threshold ? BITMAP_MAJOR_CHUNKFLUSH : BITMAP_MAJOR_HI;
 	unsigned long long bitmapsize, array_size;
 	struct mdinfo *mdi;
 
@@ -441,7 +441,7 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 			if (!rv) {
 				rv = st->ss->add_internal_bitmap(
 					st, &s->bitmap_chunk, c->delay,
-					s->write_behind, bitmapsize,
+					c->threshold, s->write_behind, bitmapsize,
 					offset_setable, major);
 				if (!rv) {
 					st->ss->write_bitmap(st, fd2,
@@ -512,8 +512,8 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 			return 1;
 		}
 		if (CreateBitmap(s->bitmap_file, c->force, (char*)uuid,
-				 s->bitmap_chunk, c->delay, s->write_behind,
-				 bitmapsize, major)) {
+				 s->bitmap_chunk, c->delay, c->threshold,
+				 s->write_behind, bitmapsize, major)) {
 			return 1;
 		}
 		bitmap_fd = open(s->bitmap_file, O_RDWR);
diff --git a/ReadMe.c b/ReadMe.c
index 50a5e36d..87ef4b42 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -81,12 +81,12 @@ char Version[] = "mdadm - v" VERSION " - " VERS_DATE EXTRAVERSION "\n";
  *     found, it is started.
  */
 
-char short_options[]="-ABCDEFGIQhVXYWZ:vqbc:i:l:p:m:n:x:u:c:d:z:U:N:sarfRSow1tye:k:";
+char short_options[]="-ABCDEFGIQhVXYWZ:vqbc:g:i:l:p:m:n:x:u:c:d:z:U:N:sarfRSow1tye:k:";
 char short_monitor_options[]="-ABCDEFGIQhVXYWZ:vqbc:i:l:p:m:r:n:x:u:c:d:z:U:N:safRSow1tye:k:";
 char short_bitmap_options[]=
-		"-ABCDEFGIQhVXYWZ:vqb:c:i:l:p:m:n:x:u:c:d:z:U:N:sarfRSow1tye:k:";
+		"-ABCDEFGIQhVXYWZ:vqb:c:g:i:l:p:m:n:x:u:c:d:z:U:N:sarfRSow1tye:k:";
 char short_bitmap_auto_options[]=
-		"-ABCDEFGIQhVXYWZ:vqb:c:i:l:p:m:n:x:u:c:d:z:U:N:sa:rfRSow1tye:k:";
+		"-ABCDEFGIQhVXYWZ:vqb:c:g:i:l:p:m:n:x:u:c:d:z:U:N:sa:rfRSow1tye:k:";
 
 struct option long_options[] = {
     {"manage",    0, 0, ManageOpt},
@@ -196,6 +196,7 @@ struct option long_options[] = {
     {"alert",     1, 0, ProgramOpt},
     {"increment", 1, 0, Increment},
     {"delay",     1, 0, 'd'},
+    {"threshold", 1, 0, 'g'},
     {"daemonise", 0, 0, Fork},
     {"daemonize", 0, 0, Fork},
     {"oneshot",   0, 0, '1'},
@@ -304,6 +305,7 @@ char OptionHelp[] =
 "  --assume-clean     : Assume the array is already in-sync. This is dangerous for RAID5.\n"
 "  --bitmap-chunk=    : chunksize of bitmap in bitmap file (Kilobytes)\n"
 "  --delay=      -d   : seconds between bitmap updates\n"
+"  --threshold=  -g   : chunks between bitmap updates\n"
 "  --write-behind=    : number of simultaneous write-behind requests to allow (requires bitmap)\n"
 "  --name=       -N   : Textual name for array - max 32 characters\n"
 "\n"
@@ -387,6 +389,7 @@ char Help_create[] =
 "  --name=            -N : Textual name for array - max 32 characters\n"
 "  --bitmap-chunk=       : bitmap chunksize in Kilobytes.\n"
 "  --delay=           -d : bitmap update delay in seconds.\n"
+"  --threshold=       -g : chunks between bitmap updates.\n"
 "  --write-journal=      : Specify journal device for RAID-4/5/6 array\n"
 "  --consistency-policy= : Specify the policy that determines how the array\n"
 "                     -k : maintains consistency in case of unexpected shutdown.\n"
@@ -412,6 +415,7 @@ char Help_build[] =
 "  --raid-devices= -n : number of active devices in array\n"
 "  --bitmap-chunk=    : bitmap chunksize in Kilobytes.\n"
 "  --delay=      -d   : bitmap update delay in seconds.\n"
+"  --threshold=  -g   : chunks between bitmap updates\n"
 ;
 
 char Help_assemble[] =
diff --git a/bitmap.c b/bitmap.c
index 9a7ffe3b..0dfdb9c7 100644
--- a/bitmap.c
+++ b/bitmap.c
@@ -33,6 +33,7 @@ static inline void sb_le_to_cpu(bitmap_super_t *sb)
 	sb->sync_size = __le64_to_cpu(sb->sync_size);
 	sb->write_behind = __le32_to_cpu(sb->write_behind);
 	sb->nodes = __le32_to_cpu(sb->nodes);
+	sb->daemon_flush_chunks = __le32_to_cpu(sb->daemon_flush_chunks);
 	sb->sectors_reserved = __le32_to_cpu(sb->sectors_reserved);
 }
 
@@ -273,7 +274,7 @@ int ExamineBitmap(char *filename, int brief, struct supertype *st)
 	}
 	printf("         Version : %d\n", sb->version);
 	if (sb->version < BITMAP_MAJOR_LO ||
-	    sb->version > BITMAP_MAJOR_CLUSTERED) {
+	    sb->version > BITMAP_MAJOR_CHUNKFLUSH) {
 		pr_err("unknown bitmap version %d, either the bitmap file\n",
 		       sb->version);
 		pr_err("is corrupted or you need to upgrade your tools\n");
@@ -311,7 +312,7 @@ int ExamineBitmap(char *filename, int brief, struct supertype *st)
 	}
 
 	printf("       Chunksize : %s\n", human_chunksize(sb->chunksize));
-	printf("          Daemon : %ds flush period\n", sb->daemon_sleep);
+	printf("          Daemon : %ds flush period, %d chunks\n", sb->daemon_sleep, sb->daemon_flush_chunks);
 	if (sb->write_behind)
 		sprintf(buf, "Allow write behind, max %d", sb->write_behind);
 	else
@@ -427,6 +428,7 @@ out:
 
 int CreateBitmap(char *filename, int force, char uuid[16],
 		 unsigned long chunksize, unsigned long daemon_sleep,
+		 unsigned int daemon_flush_chunks,
 		 unsigned long write_behind,
 		 unsigned long long array_size /* sectors */,
 		 int major)
@@ -472,6 +474,7 @@ int CreateBitmap(char *filename, int force, char uuid[16],
 		memcpy(sb.uuid, uuid, 16);
 	sb.chunksize = chunksize;
 	sb.daemon_sleep = daemon_sleep;
+	sb.daemon_flush_chunks = daemon_flush_chunks;
 	sb.write_behind = write_behind;
 	sb.sync_size = array_size;
 
diff --git a/bitmap.h b/bitmap.h
index 7b1f80f2..48ebc0b9 100644
--- a/bitmap.h
+++ b/bitmap.h
@@ -9,10 +9,12 @@
 #define BITMAP_MAJOR_LO 3
 /* version 4 insists the bitmap is in little-endian order
  * with version 3, it is host-endian which is non-portable
+ * Version 6 supports the flush-chunks threshold
  */
 #define BITMAP_MAJOR_HI 4
 #define	BITMAP_MAJOR_HOSTENDIAN 3
 #define	BITMAP_MAJOR_CLUSTERED 5
+#define	BITMAP_MAJOR_CHUNKFLUSH 6
 
 #define BITMAP_MINOR 39
 
@@ -159,7 +161,8 @@ typedef struct bitmap_super_s {
 				 * reserved for the bitmap. */
 	__u32 nodes;        /* 68 the maximum number of nodes in cluster. */
 	__u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
-	__u8  pad[256 - 136]; /* set to zero */
+	__u32 daemon_flush_chunks; /* 136 dirty chunks between flushes */
+	__u8  pad[256 - 140]; /* set to zero */
 } bitmap_super_t;
 
 /* notes:
diff --git a/config.c b/config.c
index dc1620c1..744d5d4f 100644
--- a/config.c
+++ b/config.c
@@ -81,7 +81,7 @@ char DefaultAltConfDir[] = CONFFILE2 ".d";
 
 enum linetype { Devices, Array, Mailaddr, Mailfrom, Program, CreateDev,
 		Homehost, HomeCluster, AutoMode, Policy, PartPolicy, Sysfs,
-		MonitorDelay, LTEnd };
+		MonitorDelay, Threshold, LTEnd };
 char *keywords[] = {
 	[Devices]  = "devices",
 	[Array]    = "array",
@@ -96,6 +96,7 @@ char *keywords[] = {
 	[PartPolicy]="part-policy",
 	[Sysfs]    = "sysfs",
 	[MonitorDelay] = "monitordelay",
+	[Threshold] = "threshold",
 	[LTEnd]    = NULL
 };
 
@@ -595,6 +596,17 @@ void monitordelayline(char *line)
 	}
 }
 
+static int threshold;
+void thresholdline(char *line)
+{
+	char *w;
+
+	for (w = dl_next(line); w != line; w = dl_next(w)) {
+		if (threshold == 0)
+			threshold = strtol(w, NULL, 10);
+	}
+}
+
 char auto_yes[] = "yes";
 char auto_no[] = "no";
 char auto_homehost[] = "homehost";
@@ -779,6 +791,8 @@ void conf_file(FILE *f)
 		case MonitorDelay:
 			monitordelayline(line);
 			break;
+		case Threshold:
+			thresholdline(line);
 		default:
 			pr_err("Unknown keyword %s\n", line);
 		}
diff --git a/mdadm.c b/mdadm.c
index 972adb52..72c12406 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -912,6 +912,16 @@ int main(int argc, char *argv[])
 				exit(2);
 			}
 			continue;
+		case O(GROW, 'g'):
+		case O(BUILD,'g'): /* flush chunk threshold for bitmap updates */
+		case O(CREATE,'g'):
+			if (c.threshold)
+				pr_err("only specify threshold once. %s ignored.\n", optarg);
+			else if (parse_num(&c.threshold, optarg) != 0) {
+				pr_err("invalid threshold: %s\n", optarg);
+				exit(2);
+			}
+			continue;
 		case O(MONITOR,'f'): /* daemonise */
 		case O(MONITOR,Fork):
 			daemonise = 1;
diff --git a/mdadm.h b/mdadm.h
index 3673494e..d135a55a 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -574,6 +574,7 @@ struct context {
 	int	SparcAdjust;
 	int	autof;
 	int	delay;
+	int	threshold;
 	int	freeze_reshape;
 	char	*backup_file;
 	int	invalid_backup;
@@ -1043,7 +1044,7 @@ extern struct superswitch {
 	 * -Exxxx: On error
 	 */
 	int (*add_internal_bitmap)(struct supertype *st, int *chunkp,
-				   int delay, int write_behind,
+				   int delay, int threshold, int write_behind,
 				   unsigned long long size, int may_change, int major);
 	/* Perform additional setup required to activate a bitmap.
 	 */
@@ -1491,7 +1492,7 @@ extern int IncrementalScan(struct context *c, char *devnm);
 extern int IncrementalRemove(char *devname, char *path, int verbose);
 extern int CreateBitmap(char *filename, int force, char uuid[16],
 			unsigned long chunksize, unsigned long daemon_sleep,
-			unsigned long write_behind,
+			unsigned int daemon_flush_chunks, unsigned long write_behind,
 			unsigned long long array_size,
 			int major);
 extern int ExamineBitmap(char *filename, int brief, struct supertype *st);
diff --git a/super-intel.c b/super-intel.c
index b0565610..aadccdda 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -12645,7 +12645,7 @@ static int validate_internal_bitmap_imsm(struct supertype *st)
  *	-1 : fail
  ******************************************************************************/
 static int add_internal_bitmap_imsm(struct supertype *st, int *chunkp,
-				    int delay, int write_behind,
+				    int delay, int threshold, int write_behind,
 				    unsigned long long size, int may_change,
 				    int amajor)
 {
diff --git a/super0.c b/super0.c
index 93876e2e..369a870d 100644
--- a/super0.c
+++ b/super0.c
@@ -1123,7 +1123,7 @@ static __u64 avail_size0(struct supertype *st, __u64 devsize,
 }
 
 static int add_internal_bitmap0(struct supertype *st, int *chunkp,
-				int delay, int write_behind,
+				int delay, int threshold, int write_behind,
 				unsigned long long size, int may_change,
 				int major)
 {
@@ -1166,6 +1166,7 @@ static int add_internal_bitmap0(struct supertype *st, int *chunkp,
 	memcpy(bms->uuid, uuid, 16);
 	bms->chunksize = __cpu_to_le32(chunk);
 	bms->daemon_sleep = __cpu_to_le32(delay);
+	bms->daemon_flush_chunks = __cpu_to_le32(threshold);
 	bms->sync_size = __cpu_to_le64(size);
 	bms->write_behind = __cpu_to_le32(write_behind);
 	*chunkp = chunk;
diff --git a/super1.c b/super1.c
index 0b505a7e..67068e02 100644
--- a/super1.c
+++ b/super1.c
@@ -2466,7 +2466,8 @@ static __u64 avail_size1(struct supertype *st, __u64 devsize,
 
 static int
 add_internal_bitmap1(struct supertype *st,
-		     int *chunkp, int delay, int write_behind,
+		     int *chunkp, int delay,
+		     int threshold, int write_behind,
 		     unsigned long long size,
 		     int may_change, int major)
 {
@@ -2615,6 +2616,7 @@ add_internal_bitmap1(struct supertype *st,
 	memcpy(bms->uuid, uuid, 16);
 	bms->chunksize = __cpu_to_le32(chunk);
 	bms->daemon_sleep = __cpu_to_le32(delay);
+	bms->daemon_flush_chunks = __cpu_to_le32(threshold);
 	bms->sync_size = __cpu_to_le64(size);
 	bms->write_behind = __cpu_to_le32(write_behind);
 	bms->nodes = __cpu_to_le32(st->nodes);
-- 
2.31.1



* [PATCH 1/2] md/bitmap: Move unplug to daemon thread
  2022-10-06 22:08 [PATCH 0/2] Bitmap percentage flushing Jonathan Derrick
  2022-10-06 22:08 ` [RFC PATCH] mdadm: Add parameter for bitmap chunk threshold Jonathan Derrick
@ 2022-10-06 22:08 ` Jonathan Derrick
  2022-10-06 22:08 ` [PATCH 2/2] md/bitmap: Add chunk-count-based bitmap flushing Jonathan Derrick
  2 siblings, 0 replies; 10+ messages in thread
From: Jonathan Derrick @ 2022-10-06 22:08 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, linux-kernel, jonathan.derrick, jonathanx.sk.derrick,
	Jonathan Derrick

It has been observed in raid1/raid10 configurations that synchronous
I/O workloads can result in bitmap updates making up more than 40% of
the writes. This appears to be because the synchronous workload
requires a bitmap flush with every flush of the I/O list. Instead,
prefer to flush the bitmap from the daemon sleeper thread.

Signed-off-by: Jonathan Derrick <jonathan.derrick@linux.dev>
---
 drivers/md/md-bitmap.c | 1 +
 drivers/md/raid1.c     | 2 --
 drivers/md/raid10.c    | 4 ----
 3 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index bf6dffadbe6f..451259b38d25 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1244,6 +1244,7 @@ void md_bitmap_daemon_work(struct mddev *mddev)
 			+ mddev->bitmap_info.daemon_sleep))
 		goto done;
 
+	md_bitmap_unplug(bitmap);
 	bitmap->daemon_lastrun = jiffies;
 	if (bitmap->allclean) {
 		mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 05d8438cfec8..42ba2d884773 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -793,8 +793,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 
 static void flush_bio_list(struct r1conf *conf, struct bio *bio)
 {
-	/* flush any pending bitmap writes to disk before proceeding w/ I/O */
-	md_bitmap_unplug(conf->mddev->bitmap);
 	wake_up(&conf->wait_barrier);
 
 	while (bio) { /* submit pending writes */
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 9117fcdee1be..e43352aae3c4 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -881,9 +881,6 @@ static void flush_pending_writes(struct r10conf *conf)
 		__set_current_state(TASK_RUNNING);
 
 		blk_start_plug(&plug);
-		/* flush any pending bitmap writes to disk
-		 * before proceeding w/ I/O */
-		md_bitmap_unplug(conf->mddev->bitmap);
 		wake_up(&conf->wait_barrier);
 
 		while (bio) { /* submit pending writes */
@@ -1078,7 +1075,6 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 
 	/* we aren't scheduling, so we can do the write-out directly. */
 	bio = bio_list_get(&plug->pending);
-	md_bitmap_unplug(mddev->bitmap);
 	wake_up(&conf->wait_barrier);
 
 	while (bio) { /* submit pending writes */
-- 
2.31.1



* [PATCH 2/2] md/bitmap: Add chunk-count-based bitmap flushing
  2022-10-06 22:08 [PATCH 0/2] Bitmap percentage flushing Jonathan Derrick
  2022-10-06 22:08 ` [RFC PATCH] mdadm: Add parameter for bitmap chunk threshold Jonathan Derrick
  2022-10-06 22:08 ` [PATCH 1/2] md/bitmap: Move unplug to daemon thread Jonathan Derrick
@ 2022-10-06 22:08 ` Jonathan Derrick
  2022-10-07 17:50   ` Song Liu
  2 siblings, 1 reply; 10+ messages in thread
From: Jonathan Derrick @ 2022-10-06 22:08 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, linux-kernel, jonathan.derrick, jonathanx.sk.derrick,
	Jonathan Derrick

In addition to the timer, allow the bitmap flushing to be controlled by a
counter that tracks the number of dirty chunks and flushes when it exceeds a
user-defined chunk-count threshold.

This introduces a new field in the bitmap superblock and bumps the
bitmap version to 6.

Signed-off-by: Jonathan Derrick <jonathan.derrick@linux.dev>
---
 drivers/md/md-bitmap.c | 37 ++++++++++++++++++++++++++++++++++---
 drivers/md/md-bitmap.h |  5 ++++-
 drivers/md/md.h        |  1 +
 3 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 451259b38d25..fa6b3c71c314 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -499,6 +499,7 @@ void md_bitmap_print_sb(struct bitmap *bitmap)
 	pr_debug("         state: %08x\n", le32_to_cpu(sb->state));
 	pr_debug("     chunksize: %d B\n", le32_to_cpu(sb->chunksize));
 	pr_debug("  daemon sleep: %ds\n", le32_to_cpu(sb->daemon_sleep));
+	pr_debug("  flush chunks: %d\n", le32_to_cpu(sb->daemon_flush_chunks));
 	pr_debug("     sync size: %llu KB\n",
 		 (unsigned long long)le64_to_cpu(sb->sync_size)/2);
 	pr_debug("max write behind: %d\n", le32_to_cpu(sb->write_behind));
@@ -581,6 +582,7 @@ static int md_bitmap_read_sb(struct bitmap *bitmap)
 	bitmap_super_t *sb;
 	unsigned long chunksize, daemon_sleep, write_behind;
 	unsigned long long events;
+	unsigned int daemon_flush_chunks = 0;
 	int nodes = 0;
 	unsigned long sectors_reserved = 0;
 	int err = -EINVAL;
@@ -644,7 +646,7 @@ static int md_bitmap_read_sb(struct bitmap *bitmap)
 	if (sb->magic != cpu_to_le32(BITMAP_MAGIC))
 		reason = "bad magic";
 	else if (le32_to_cpu(sb->version) < BITMAP_MAJOR_LO ||
-		 le32_to_cpu(sb->version) > BITMAP_MAJOR_CLUSTERED)
+		 le32_to_cpu(sb->version) > BITMAP_MAJOR_CHUNKFLUSH)
 		reason = "unrecognized superblock version";
 	else if (chunksize < 512)
 		reason = "bitmap chunksize too small";
@@ -660,6 +662,9 @@ static int md_bitmap_read_sb(struct bitmap *bitmap)
 		goto out;
 	}
 
+	if (sb->version == cpu_to_le32(BITMAP_MAJOR_CHUNKFLUSH))
+		daemon_flush_chunks = le32_to_cpu(sb->daemon_flush_chunks);
+
 	/*
 	 * Setup nodes/clustername only if bitmap version is
 	 * cluster-compatible
@@ -720,6 +725,7 @@ static int md_bitmap_read_sb(struct bitmap *bitmap)
 			bitmap->events_cleared = bitmap->mddev->events;
 		bitmap->mddev->bitmap_info.chunksize = chunksize;
 		bitmap->mddev->bitmap_info.daemon_sleep = daemon_sleep;
+		bitmap->mddev->bitmap_info.daemon_flush_chunks = daemon_flush_chunks;
 		bitmap->mddev->bitmap_info.max_write_behind = write_behind;
 		bitmap->mddev->bitmap_info.nodes = nodes;
 		if (bitmap->mddev->bitmap_info.space == 0 ||
@@ -1218,6 +1224,31 @@ static bitmap_counter_t *md_bitmap_get_counter(struct bitmap_counts *bitmap,
 					       sector_t offset, sector_t *blocks,
 					       int create);
 
+static bool md_daemon_should_sleep(struct mddev *mddev)
+{
+	struct bitmap *bitmap = mddev->bitmap;
+	struct bitmap_page *bp;
+	unsigned long k, pages;
+	unsigned int count = 0;
+
+	if (time_after(jiffies, bitmap->daemon_lastrun
+			+ mddev->bitmap_info.daemon_sleep))
+		return false;
+
+	if (mddev->bitmap_info.daemon_flush_chunks) {
+		bp = bitmap->counts.bp;
+		pages = bitmap->counts.pages;
+		for (k = 0; k < pages; k++)
+			if (bp[k].map && !bp[k].hijacked)
+				count += bp[k].count;
+
+		if (count >= mddev->bitmap_info.daemon_flush_chunks)
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * bitmap daemon -- periodically wakes up to clean bits and flush pages
  *			out to disk
@@ -1240,8 +1271,8 @@ void md_bitmap_daemon_work(struct mddev *mddev)
 		mutex_unlock(&mddev->bitmap_info.mutex);
 		return;
 	}
-	if (time_before(jiffies, bitmap->daemon_lastrun
-			+ mddev->bitmap_info.daemon_sleep))
+
+	if (md_daemon_should_sleep(mddev))
 		goto done;
 
 	md_bitmap_unplug(bitmap);
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index cfd7395de8fd..e0aeedbdde17 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -11,10 +11,12 @@
 /* version 4 insists the bitmap is in little-endian order
  * with version 3, it is host-endian which is non-portable
  * Version 5 is currently set only for clustered devices
++ * Version 6 supports the flush-chunks threshold
  */
 #define BITMAP_MAJOR_HI 4
 #define BITMAP_MAJOR_CLUSTERED 5
 #define	BITMAP_MAJOR_HOSTENDIAN 3
+#define BITMAP_MAJOR_CHUNKFLUSH 6
 
 /*
  * in-memory bitmap:
@@ -135,7 +137,8 @@ typedef struct bitmap_super_s {
 				  * reserved for the bitmap. */
 	__le32 nodes;        /* 68 the maximum number of nodes in cluster. */
 	__u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
-	__u8  pad[256 - 136]; /* set to zero */
+	__le32 daemon_flush_chunks; /* 136 dirty chunks between flushes */
+	__u8  pad[256 - 140]; /* set to zero */
 } bitmap_super_t;
 
 /* notes:
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b4e2d8b87b61..d25574e46283 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -497,6 +497,7 @@ struct mddev {
 		struct mutex		mutex;
 		unsigned long		chunksize;
 		unsigned long		daemon_sleep; /* how many jiffies between updates? */
+		unsigned int		daemon_flush_chunks; /* how many dirty chunks between updates */
 		unsigned long		max_write_behind; /* write-behind mode */
 		int			external;
 		int			nodes; /* Maximum number of nodes in the cluster */
-- 
2.31.1



* Re: [PATCH 2/2] md/bitmap: Add chunk-count-based bitmap flushing
  2022-10-06 22:08 ` [PATCH 2/2] md/bitmap: Add chunk-count-based bitmap flushing Jonathan Derrick
@ 2022-10-07 17:50   ` Song Liu
  2022-10-07 18:58     ` Jonathan Derrick
  0 siblings, 1 reply; 10+ messages in thread
From: Song Liu @ 2022-10-07 17:50 UTC (permalink / raw)
  To: Jonathan Derrick
  Cc: linux-raid, linux-kernel, jonathan.derrick, jonathanx.sk.derrick

On Thu, Oct 6, 2022 at 3:09 PM Jonathan Derrick
<jonathan.derrick@linux.dev> wrote:

[...]

> diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
> index cfd7395de8fd..e0aeedbdde17 100644
> --- a/drivers/md/md-bitmap.h
> +++ b/drivers/md/md-bitmap.h
> @@ -11,10 +11,12 @@
>  /* version 4 insists the bitmap is in little-endian order
>   * with version 3, it is host-endian which is non-portable
>   * Version 5 is currently set only for clustered devices
> ++ * Version 6 supports the flush-chunks threshold
>   */
>  #define BITMAP_MAJOR_HI 4
>  #define BITMAP_MAJOR_CLUSTERED 5
>  #define        BITMAP_MAJOR_HOSTENDIAN 3
> +#define BITMAP_MAJOR_CHUNKFLUSH 6
>
>  /*
>   * in-memory bitmap:
> @@ -135,7 +137,8 @@ typedef struct bitmap_super_s {
>                                   * reserved for the bitmap. */
>         __le32 nodes;        /* 68 the maximum number of nodes in cluster. */
>         __u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
> -       __u8  pad[256 - 136]; /* set to zero */
> +       __le32 daemon_flush_chunks; /* 136 dirty chunks between flushes */
> +       __u8  pad[256 - 140]; /* set to zero */
>  } bitmap_super_t;

Do we really need this to be persistent? How about we configure it at run
time via a sysfs file?

Also, please share more data on the performance benefit of the set.

Thanks,
Song

>
>  /* notes:
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index b4e2d8b87b61..d25574e46283 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -497,6 +497,7 @@ struct mddev {
>                 struct mutex            mutex;
>                 unsigned long           chunksize;
>                 unsigned long           daemon_sleep; /* how many jiffies between updates? */
> +               unsigned int            daemon_flush_chunks; /* how many dirty chunks between updates */
>                 unsigned long           max_write_behind; /* write-behind mode */
>                 int                     external;
>                 int                     nodes; /* Maximum number of nodes in the cluster */
> --
> 2.31.1
>


* Re: [PATCH 2/2] md/bitmap: Add chunk-count-based bitmap flushing
  2022-10-07 17:50   ` Song Liu
@ 2022-10-07 18:58     ` Jonathan Derrick
  2022-10-10 18:18       ` Song Liu
  0 siblings, 1 reply; 10+ messages in thread
From: Jonathan Derrick @ 2022-10-07 18:58 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, linux-kernel, jonathan.derrick, jonathanx.sk.derrick



On 10/7/2022 11:50 AM, Song Liu wrote:
> On Thu, Oct 6, 2022 at 3:09 PM Jonathan Derrick
> <jonathan.derrick@linux.dev> wrote:
> 
> [...]
> 
>> diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
>> index cfd7395de8fd..e0aeedbdde17 100644
>> --- a/drivers/md/md-bitmap.h
>> +++ b/drivers/md/md-bitmap.h
>> @@ -11,10 +11,12 @@
>>  /* version 4 insists the bitmap is in little-endian order
>>   * with version 3, it is host-endian which is non-portable
>>   * Version 5 is currently set only for clustered devices
>> ++ * Version 6 supports the flush-chunks threshold
>>   */
>>  #define BITMAP_MAJOR_HI 4
>>  #define BITMAP_MAJOR_CLUSTERED 5
>>  #define        BITMAP_MAJOR_HOSTENDIAN 3
>> +#define BITMAP_MAJOR_CHUNKFLUSH 6
>>
>>  /*
>>   * in-memory bitmap:
>> @@ -135,7 +137,8 @@ typedef struct bitmap_super_s {
>>                                   * reserved for the bitmap. */
>>         __le32 nodes;        /* 68 the maximum number of nodes in cluster. */
>>         __u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
>> -       __u8  pad[256 - 136]; /* set to zero */
>> +       __le32 daemon_flush_chunks; /* 136 dirty chunks between flushes */
>> +       __u8  pad[256 - 140]; /* set to zero */
>>  } bitmap_super_t;
> 
> Do we really need this to be persistent? How about we configure it at run
> time via a sysfs file?
> 
> Also, please share more data on the performance benefit of the set.
> 
> Thanks,
> Song
> 
Hi Song,

Patch 1/2 changes default behavior, which patch 2/2 tries to address.
I can change it to be configurable via sysfs instead.
Should there be a default?
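
Roughly, a runtime knob could follow the existing md_sysfs_entry pattern
in md-bitmap.c. Minimal sketch only, assuming the daemon_flush_chunks
field from patch 2/2; the "flush_chunks" attribute name is illustrative:

static ssize_t flush_chunks_show(struct mddev *mddev, char *page)
{
	return sprintf(page, "%u\n", mddev->bitmap_info.daemon_flush_chunks);
}

static ssize_t flush_chunks_store(struct mddev *mddev, const char *buf,
				  size_t len)
{
	unsigned int chunks;
	int rv = kstrtouint(buf, 10, &chunks);

	if (rv)
		return rv;
	/* 0 disables the chunk-count trigger; only the delay timer applies */
	mddev->bitmap_info.daemon_flush_chunks = chunks;
	return len;
}

static struct md_sysfs_entry bitmap_flush_chunks =
__ATTR(flush_chunks, S_IRUGO|S_IWUSR, flush_chunks_show, flush_chunks_store);

The entry would also need the usual mddev locking in the store path and
to be wired into the bitmap attribute group so it appears as e.g.
/sys/block/md0/md/bitmap/flush_chunks.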


Here are my observations via biosnoop and RAID1, 4M chunksize, 238436 chunks, bitmap=internal
fio --name=test --direct=1 --filename=/dev/md0 --rw=randwrite --runtime=60
 --percentile_list=1.0:25.0:50.0:75.0:90.0:95.0:99.0:99.9:99.99:99..999999:100.0


Default, bitmap updates happened concurrently with I/O:
   bw (  KiB/s): min=18690, max=30618, per=99.94%, avg=23822.07, stdev=2522.73, samples=119
   iops        : min= 4672, max= 7654, avg=5955.20, stdev=630.71, samples=119

TIME(s)     COMM           PID     DISK      T SECTOR     BYTES  LAT(ms)
38.090366   md0_raid1      4800    nvme6n1   W 40         4096      0.01
38.090423   md0_raid1      4800    nvme3n1   W 40         4096      0.07
38.090442   md0_raid1      4800    nvme3n1   W 1016633184 4096      0.01
38.090439   md0_raid1      4800    nvme6n1   W 1016633184 4096      0.01
38.090479   md0_raid1      4800    nvme6n1   W 56         4096      0.01
38.090493   md0_raid1      4800    nvme6n1   W 1449894256 4096      0.01
38.090477   md0_raid1      4800    nvme3n1   W 56         4096      0.01
38.090496   md0_raid1      4800    nvme3n1   W 1449894256 4096      0.01
38.090530   md0_raid1      4800    nvme3n1   W 16         4096      0.01
38.090555   md0_raid1      4800    nvme3n1   W 110493568  4096      0.01
38.090538   md0_raid1      4800    nvme6n1   W 16         4096      0.01
38.090551   md0_raid1      4800    nvme6n1   W 110493568  4096      0.01
38.090596   md0_raid1      4800    nvme6n1   W 56         4096      0.01
38.090647   md0_raid1      4800    nvme3n1   W 56         4096      0.06
38.090666   md0_raid1      4800    nvme3n1   W 1455846976 4096      0.01
38.090663   md0_raid1      4800    nvme6n1   W 1455846976 4096      0.01
38.090707   md0_raid1      4800    nvme6n1   W 64         4096      0.01
38.090699   md0_raid1      4800    nvme3n1   W 64         4096      0.01
38.090723   md0_raid1      4800    nvme3n1   W 1665013728 4096      0.01
38.090720   md0_raid1      4800    nvme6n1   W 1665013728 4096      0.01
38.090764   md0_raid1      4800    nvme6n1   W 64         4096      0.01
38.090812   md0_raid1      4800    nvme3n1   W 64         4096      0.06
38.090832   md0_raid1      4800    nvme3n1   W 1637994296 4096      0.01
38.090828   md0_raid1      4800    nvme6n1   W 1637994296 4096      0.01




With patch 1/2, bitmaps only update on the 'delay' parameter (default 5s):
   bw (  KiB/s): min=135712, max=230938, per=100.00%, avg=209308.56, stdev=29254.31, samples=119
   iops        : min=33928, max=57734, avg=52326.78, stdev=7313.57, samples=119

TIME(s)     COMM           PID     DISK      T SECTOR     BYTES  LAT(ms)
16.292235   md0_raid1      4841    nvme6n1   W 297367432  4096      0.01
16.292258   md0_raid1      4841    nvme6n1   W 16         4096      0.01
16.292266   md0_raid1      4841    nvme6n1   W 24         4096      0.01
16.292277   md0_raid1      4841    nvme6n1   W 32         4096      0.01
16.292259   md0_raid1      4841    nvme3n1   W 16         4096      0.01
16.292280   md0_raid1      4841    nvme3n1   W 32         4096      0.01
16.292305   md0_raid1      4841    nvme3n1   W 56         4096      0.01
16.292286   md0_raid1      4841    nvme6n1   W 40         4096      0.01
16.292295   md0_raid1      4841    nvme6n1   W 48         4096      0.01
16.292326   md0_raid1      4841    nvme3n1   W 72         1536      0.01
16.292323   md0_raid1      4841    nvme6n1   W 64         4096      0.02
16.292326   md0_raid1      4841    nvme6n1   W 56         4096      0.03
16.292334   md0_raid1      4841    nvme6n1   W 72         1536      0.02
16.300697   md0_raid1      4841    nvme3n1   W 1297533744 4096      0.01
16.300702   md0_raid1      4841    nvme6n1   W 1297533744 4096      0.01
16.300803   md0_raid1      4841    nvme6n1   W 1649080856 4096      0.01
16.300798   md0_raid1      4841    nvme3n1   W 1649080856 4096      0.01
16.300823   md0_raid1      4841    nvme3n1   W 1539317792 4096      0.01
16.300845   md0_raid1      4841    nvme3n1   W 1634570232 4096      0.01
16.300867   md0_raid1      4841    nvme3n1   W 579232208  4096      0.01
16.300889   md0_raid1      4841    nvme3n1   W 1818140424 4096      0.01
16.300922   md0_raid1      4841    nvme3n1   W 412971920  4096      0.02
...
21.293225   md0_raid1      4841    nvme3n1   W 1279122360 4096      0.01
21.293242   md0_raid1      4841    nvme3n1   W 40326272   4096      0.01
21.293223   md0_raid1      4841    nvme6n1   W 1279122360 4096      0.01
21.293243   md0_raid1      4841    nvme6n1   W 40326272   4096      0.01
21.293261   md0_raid1      4841    nvme6n1   W 16         4096      0.01
21.293266   md0_raid1      4841    nvme6n1   W 24         4096      0.01
21.293271   md0_raid1      4841    nvme6n1   W 32         4096      0.01
21.293275   md0_raid1      4841    nvme3n1   W 32         4096      0.01
21.293292   md0_raid1      4841    nvme3n1   W 48         4096      0.01
21.293296   md0_raid1      4841    nvme3n1   W 56         4096      0.01
21.293309   md0_raid1      4841    nvme3n1   W 72         1536      0.01
21.293266   md0_raid1      4841    nvme3n1   W 24         4096      0.01
21.293326   md0_raid1      4841    nvme6n1   W 48         4096      0.05
21.293328   md0_raid1      4841    nvme6n1   W 40         4096      0.06
21.293331   md0_raid1      4841    nvme6n1   W 72         1536      0.03
21.293333   md0_raid1      4841    nvme6n1   W 64         4096      0.04
21.293334   md0_raid1      4841    nvme6n1   W 56         4096      0.05
21.298526   md0_raid1      4841    nvme3n1   W 681973000  4096      0.01




Good, but with a granularity of N seconds, flushing might be too
infrequent. Here is chunk-flush=512 (a 2GB threshold at the 4MB chunk
size):
   bw (  KiB/s): min=92692, max=134904, per=100.00%, avg=125127.43, stdev=6758.51, samples=119
   iops        : min=23173, max=33726, avg=31281.55, stdev=1689.63, samples=119

TIME(s)     COMM           PID     DISK      T SECTOR     BYTES  LAT(ms)
13.193339   md0_raid1      5972    nvme6n1   W 16         4096      0.01
13.193344   md0_raid1      5972    nvme6n1   W 32         4096      0.01
13.193346   md0_raid1      5972    nvme6n1   W 24         4096      0.01
13.193350   md0_raid1      5972    nvme6n1   W 40         4096      0.01
13.193356   md0_raid1      5972    nvme6n1   W 48         4096      0.01
13.193361   md0_raid1      5972    nvme6n1   W 64         4096      0.01
13.193363   md0_raid1      5972    nvme6n1   W 56         4096      0.01
13.193555   md0_raid1      5972    nvme6n1   W 72         1536      0.20
13.193289   md0_raid1      5972    nvme3n1   W 1912285848 4096      0.01
13.193306   md0_raid1      5972    nvme3n1   W 836455896  4096      0.01
13.193323   md0_raid1      5972    nvme3n1   W 233728136  4096      0.01
13.193339   md0_raid1      5972    nvme3n1   W 16         4096      0.01
13.193344   md0_raid1      5972    nvme3n1   W 24         4096      0.01
13.193362   md0_raid1      5972    nvme3n1   W 48         4096      0.01
13.193365   md0_raid1      5972    nvme3n1   W 64         4096      0.01
13.193366   md0_raid1      5972    nvme3n1   W 56         4096      0.01
13.193574   md0_raid1      5972    nvme3n1   W 72         1536      0.21
13.196759   md0_raid1      5972    nvme3n1   W 89571592   4096      0.01
13.196810   md0_raid1      5972    nvme6n1   W 89571592   4096      0.06
13.196913   md0_raid1      5972    nvme6n1   W 16         4096      0.01
13.196910   md0_raid1      5972    nvme3n1   W 16         4096      0.01
13.199444   md0_raid1      5972    nvme3n1   W 64         4096      0.01
13.199447   md0_raid1      5972    nvme3n1   W 137126232  4096      0.01
13.199515   md0_raid1      5972    nvme6n1   W 137126232  4096      0.08
13.199519   md0_raid1      5972    nvme6n1   W 64         4096      0.08
13.199617   md0_raid1      5972    nvme6n1   W 1216062808 4096      0.01
... (508 ios later)
13.208764   md0_raid1      5972    nvme6n1   W 16         4096      0.01
13.208768   md0_raid1      5972    nvme6n1   W 32         4096      0.01
13.208770   md0_raid1      5972    nvme6n1   W 24         4096      0.01
13.208775   md0_raid1      5972    nvme6n1   W 40         4096      0.01
13.208781   md0_raid1      5972    nvme6n1   W 48         4096      0.01
13.208786   md0_raid1      5972    nvme6n1   W 56         4096      0.01
13.208790   md0_raid1      5972    nvme6n1   W 64         4096      0.01
13.208729   md0_raid1      5972    nvme3n1   W 1607847808 4096      0.01
13.208747   md0_raid1      5972    nvme3n1   W 371214368  4096      0.01
13.208770   md0_raid1      5972    nvme3n1   W 32         4096      0.01
13.208789   md0_raid1      5972    nvme3n1   W 64         4096      0.01
13.208952   md0_raid1      5972    nvme6n1   W 72         1536      0.17
13.209079   md0_raid1      5972    nvme3n1   W 72         1536      0.29
13.212216   md0_raid1      5972    nvme3n1   W 1146106480 4096      0.01
13.212269   md0_raid1      5972    nvme6n1   W 1146106480 4096      0.06
13.212368   md0_raid1      5972    nvme6n1   W 16         4096      0.01
13.212365   md0_raid1      5972    nvme3n1   W 16         4096      0.01


Without 1/2: 6k iops
With 1/2: 52k iops
With 2/2 params as above: 31k iops

The count calculation could use some improvement to close the iops gap
relative to delay-based flushing.

>>
>>  /* notes:
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index b4e2d8b87b61..d25574e46283 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -497,6 +497,7 @@ struct mddev {
>>                 struct mutex            mutex;
>>                 unsigned long           chunksize;
>>                 unsigned long           daemon_sleep; /* how many jiffies between updates? */
>> +               unsigned int            daemon_flush_chunks; /* how many dirty chunks between updates */
>>                 unsigned long           max_write_behind; /* write-behind mode */
>>                 int                     external;
>>                 int                     nodes; /* Maximum number of nodes in the cluster */
>> --
>> 2.31.1
>>


* Re: [PATCH 2/2] md/bitmap: Add chunk-count-based bitmap flushing
  2022-10-07 18:58     ` Jonathan Derrick
@ 2022-10-10 18:18       ` Song Liu
  2022-10-13 22:19         ` Jonathan Derrick
  0 siblings, 1 reply; 10+ messages in thread
From: Song Liu @ 2022-10-10 18:18 UTC (permalink / raw)
  To: Jonathan Derrick
  Cc: linux-raid, linux-kernel, jonathan.derrick, jonathanx.sk.derrick

On Fri, Oct 7, 2022 at 11:58 AM Jonathan Derrick
<jonathan.derrick@linux.dev> wrote:
>
>
>
> On 10/7/2022 11:50 AM, Song Liu wrote:
> > On Thu, Oct 6, 2022 at 3:09 PM Jonathan Derrick
> > <jonathan.derrick@linux.dev> wrote:
> >
> > [...]
> >
> >> diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
> >> index cfd7395de8fd..e0aeedbdde17 100644
> >> --- a/drivers/md/md-bitmap.h
> >> +++ b/drivers/md/md-bitmap.h
> >> @@ -11,10 +11,12 @@
> >>  /* version 4 insists the bitmap is in little-endian order
> >>   * with version 3, it is host-endian which is non-portable
> >>   * Version 5 is currently set only for clustered devices
> >> ++ * Version 6 supports the flush-chunks threshold
> >>   */
> >>  #define BITMAP_MAJOR_HI 4
> >>  #define BITMAP_MAJOR_CLUSTERED 5
> >>  #define        BITMAP_MAJOR_HOSTENDIAN 3
> >> +#define BITMAP_MAJOR_CHUNKFLUSH 6
> >>
> >>  /*
> >>   * in-memory bitmap:
> >> @@ -135,7 +137,8 @@ typedef struct bitmap_super_s {
> >>                                   * reserved for the bitmap. */
> >>         __le32 nodes;        /* 68 the maximum number of nodes in cluster. */
> >>         __u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
> >> -       __u8  pad[256 - 136]; /* set to zero */
> >> +       __le32 daemon_flush_chunks; /* 136 dirty chunks between flushes */
> >> +       __u8  pad[256 - 140]; /* set to zero */
> >>  } bitmap_super_t;
> >
> > Do we really need this to be persistent? How about we configure it at run
> > time via a sysfs file?
> >
> > Also, please share more data on the performance benefit of the set.
> >
> > Thanks,
> > Song
> >
> Hi Song,
>
> Patch 1/2 changes default behavior, which patch 2/2 tries to address.

Have you tried to evaluate the impact on the accuracy of the bitmap?
Specifically, if we power off the system during writes, do we see data
or parity mismatch that is not covered by the bitmap?

> I can change it to be configurable via sysfs instead.
> Should there be a default?

If there is any impact on bitmap accuracy, I think the default should
work identically to how it did before the set. IOW, we should not delay
the bitmap update.

Thanks,
Song

>
>
> Here are my observations via biosnoop and RAID1, 4M chunksize, 238436 chunks, bitmap=internal
> fio --name=test --direct=1 --filename=/dev/md0 --rw=randwrite --runtime=60
>  --percentile_list=1.0:25.0:50.0:75.0:90.0:95.0:99.0:99.9:99.99:99..999999:100.0
>
>
> Default, bitmap updates happened concurrently with I/O:
>    bw (  KiB/s): min=18690, max=30618, per=99.94%, avg=23822.07, stdev=2522.73, samples=119
>    iops        : min= 4672, max= 7654, avg=5955.20, stdev=630.71, samples=119
>
> TIME(s)     COMM           PID     DISK      T SECTOR     BYTES  LAT(ms)
> 38.090366   md0_raid1      4800    nvme6n1   W 40         4096      0.01
> 38.090423   md0_raid1      4800    nvme3n1   W 40         4096      0.07
> 38.090442   md0_raid1      4800    nvme3n1   W 1016633184 4096      0.01
> 38.090439   md0_raid1      4800    nvme6n1   W 1016633184 4096      0.01
> 38.090479   md0_raid1      4800    nvme6n1   W 56         4096      0.01
> 38.090493   md0_raid1      4800    nvme6n1   W 1449894256 4096      0.01
> 38.090477   md0_raid1      4800    nvme3n1   W 56         4096      0.01
> 38.090496   md0_raid1      4800    nvme3n1   W 1449894256 4096      0.01
> 38.090530   md0_raid1      4800    nvme3n1   W 16         4096      0.01
> 38.090555   md0_raid1      4800    nvme3n1   W 110493568  4096      0.01
> 38.090538   md0_raid1      4800    nvme6n1   W 16         4096      0.01
> 38.090551   md0_raid1      4800    nvme6n1   W 110493568  4096      0.01
> 38.090596   md0_raid1      4800    nvme6n1   W 56         4096      0.01
> 38.090647   md0_raid1      4800    nvme3n1   W 56         4096      0.06
> 38.090666   md0_raid1      4800    nvme3n1   W 1455846976 4096      0.01
> 38.090663   md0_raid1      4800    nvme6n1   W 1455846976 4096      0.01
> 38.090707   md0_raid1      4800    nvme6n1   W 64         4096      0.01
> 38.090699   md0_raid1      4800    nvme3n1   W 64         4096      0.01
> 38.090723   md0_raid1      4800    nvme3n1   W 1665013728 4096      0.01
> 38.090720   md0_raid1      4800    nvme6n1   W 1665013728 4096      0.01
> 38.090764   md0_raid1      4800    nvme6n1   W 64         4096      0.01
> 38.090812   md0_raid1      4800    nvme3n1   W 64         4096      0.06
> 38.090832   md0_raid1      4800    nvme3n1   W 1637994296 4096      0.01
> 38.090828   md0_raid1      4800    nvme6n1   W 1637994296 4096      0.01
>
>
>
>
> With patch 1/2, bitmaps only update on the 'delay' parameter (default 5s):
>    bw (  KiB/s): min=135712, max=230938, per=100.00%, avg=209308.56, stdev=29254.31, samples=119
>    iops        : min=33928, max=57734, avg=52326.78, stdev=7313.57, samples=119
>
> TIME(s)     COMM           PID     DISK      T SECTOR     BYTES  LAT(ms)
> 16.292235   md0_raid1      4841    nvme6n1   W 297367432  4096      0.01
> 16.292258   md0_raid1      4841    nvme6n1   W 16         4096      0.01
> 16.292266   md0_raid1      4841    nvme6n1   W 24         4096      0.01
> 16.292277   md0_raid1      4841    nvme6n1   W 32         4096      0.01
> 16.292259   md0_raid1      4841    nvme3n1   W 16         4096      0.01
> 16.292280   md0_raid1      4841    nvme3n1   W 32         4096      0.01
> 16.292305   md0_raid1      4841    nvme3n1   W 56         4096      0.01
> 16.292286   md0_raid1      4841    nvme6n1   W 40         4096      0.01
> 16.292295   md0_raid1      4841    nvme6n1   W 48         4096      0.01
> 16.292326   md0_raid1      4841    nvme3n1   W 72         1536      0.01
> 16.292323   md0_raid1      4841    nvme6n1   W 64         4096      0.02
> 16.292326   md0_raid1      4841    nvme6n1   W 56         4096      0.03
> 16.292334   md0_raid1      4841    nvme6n1   W 72         1536      0.02
> 16.300697   md0_raid1      4841    nvme3n1   W 1297533744 4096      0.01
> 16.300702   md0_raid1      4841    nvme6n1   W 1297533744 4096      0.01
> 16.300803   md0_raid1      4841    nvme6n1   W 1649080856 4096      0.01
> 16.300798   md0_raid1      4841    nvme3n1   W 1649080856 4096      0.01
> 16.300823   md0_raid1      4841    nvme3n1   W 1539317792 4096      0.01
> 16.300845   md0_raid1      4841    nvme3n1   W 1634570232 4096      0.01
> 16.300867   md0_raid1      4841    nvme3n1   W 579232208  4096      0.01
> 16.300889   md0_raid1      4841    nvme3n1   W 1818140424 4096      0.01
> 16.300922   md0_raid1      4841    nvme3n1   W 412971920  4096      0.02
> ...
> 21.293225   md0_raid1      4841    nvme3n1   W 1279122360 4096      0.01
> 21.293242   md0_raid1      4841    nvme3n1   W 40326272   4096      0.01
> 21.293223   md0_raid1      4841    nvme6n1   W 1279122360 4096      0.01
> 21.293243   md0_raid1      4841    nvme6n1   W 40326272   4096      0.01
> 21.293261   md0_raid1      4841    nvme6n1   W 16         4096      0.01
> 21.293266   md0_raid1      4841    nvme6n1   W 24         4096      0.01
> 21.293271   md0_raid1      4841    nvme6n1   W 32         4096      0.01
> 21.293275   md0_raid1      4841    nvme3n1   W 32         4096      0.01
> 21.293292   md0_raid1      4841    nvme3n1   W 48         4096      0.01
> 21.293296   md0_raid1      4841    nvme3n1   W 56         4096      0.01
> 21.293309   md0_raid1      4841    nvme3n1   W 72         1536      0.01
> 21.293266   md0_raid1      4841    nvme3n1   W 24         4096      0.01
> 21.293326   md0_raid1      4841    nvme6n1   W 48         4096      0.05
> 21.293328   md0_raid1      4841    nvme6n1   W 40         4096      0.06
> 21.293331   md0_raid1      4841    nvme6n1   W 72         1536      0.03
> 21.293333   md0_raid1      4841    nvme6n1   W 64         4096      0.04
> 21.293334   md0_raid1      4841    nvme6n1   W 56         4096      0.05
> 21.298526   md0_raid1      4841    nvme3n1   W 681973000  4096      0.01
>
>
>
>
> Good, but with the granularity of N seconds, it might be too infrequent.
> Here is chunk-flush=512 (2GB threshold in 4MB chunk size):
>    bw (  KiB/s): min=92692, max=134904, per=100.00%, avg=125127.43, stdev=6758.51, samples=119
>    iops        : min=23173, max=33726, avg=31281.55, stdev=1689.63, samples=119
>
> TIME(s)     COMM           PID     DISK      T SECTOR     BYTES  LAT(ms)
> 13.193339   md0_raid1      5972    nvme6n1   W 16         4096      0.01
> 13.193344   md0_raid1      5972    nvme6n1   W 32         4096      0.01
> 13.193346   md0_raid1      5972    nvme6n1   W 24         4096      0.01
> 13.193350   md0_raid1      5972    nvme6n1   W 40         4096      0.01
> 13.193356   md0_raid1      5972    nvme6n1   W 48         4096      0.01
> 13.193361   md0_raid1      5972    nvme6n1   W 64         4096      0.01
> 13.193363   md0_raid1      5972    nvme6n1   W 56         4096      0.01
> 13.193555   md0_raid1      5972    nvme6n1   W 72         1536      0.20
> 13.193289   md0_raid1      5972    nvme3n1   W 1912285848 4096      0.01
> 13.193306   md0_raid1      5972    nvme3n1   W 836455896  4096      0.01
> 13.193323   md0_raid1      5972    nvme3n1   W 233728136  4096      0.01
> 13.193339   md0_raid1      5972    nvme3n1   W 16         4096      0.01
> 13.193344   md0_raid1      5972    nvme3n1   W 24         4096      0.01
> 13.193362   md0_raid1      5972    nvme3n1   W 48         4096      0.01
> 13.193365   md0_raid1      5972    nvme3n1   W 64         4096      0.01
> 13.193366   md0_raid1      5972    nvme3n1   W 56         4096      0.01
> 13.193574   md0_raid1      5972    nvme3n1   W 72         1536      0.21
> 13.196759   md0_raid1      5972    nvme3n1   W 89571592   4096      0.01
> 13.196810   md0_raid1      5972    nvme6n1   W 89571592   4096      0.06
> 13.196913   md0_raid1      5972    nvme6n1   W 16         4096      0.01
> 13.196910   md0_raid1      5972    nvme3n1   W 16         4096      0.01
> 13.199444   md0_raid1      5972    nvme3n1   W 64         4096      0.01
> 13.199447   md0_raid1      5972    nvme3n1   W 137126232  4096      0.01
> 13.199515   md0_raid1      5972    nvme6n1   W 137126232  4096      0.08
> 13.199519   md0_raid1      5972    nvme6n1   W 64         4096      0.08
> 13.199617   md0_raid1      5972    nvme6n1   W 1216062808 4096      0.01
> ... (508 ios later)
> 13.208764   md0_raid1      5972    nvme6n1   W 16         4096      0.01
> 13.208768   md0_raid1      5972    nvme6n1   W 32         4096      0.01
> 13.208770   md0_raid1      5972    nvme6n1   W 24         4096      0.01
> 13.208775   md0_raid1      5972    nvme6n1   W 40         4096      0.01
> 13.208781   md0_raid1      5972    nvme6n1   W 48         4096      0.01
> 13.208786   md0_raid1      5972    nvme6n1   W 56         4096      0.01
> 13.208790   md0_raid1      5972    nvme6n1   W 64         4096      0.01
> 13.208729   md0_raid1      5972    nvme3n1   W 1607847808 4096      0.01
> 13.208747   md0_raid1      5972    nvme3n1   W 371214368  4096      0.01
> 13.208770   md0_raid1      5972    nvme3n1   W 32         4096      0.01
> 13.208789   md0_raid1      5972    nvme3n1   W 64         4096      0.01
> 13.208952   md0_raid1      5972    nvme6n1   W 72         1536      0.17
> 13.209079   md0_raid1      5972    nvme3n1   W 72         1536      0.29
> 13.212216   md0_raid1      5972    nvme3n1   W 1146106480 4096      0.01
> 13.212269   md0_raid1      5972    nvme6n1   W 1146106480 4096      0.06
> 13.212368   md0_raid1      5972    nvme6n1   W 16         4096      0.01
> 13.212365   md0_raid1      5972    nvme3n1   W 16         4096      0.01
>
>
> Without 1/2: 6k iops
> With 1/2: 52k iops
> With 2/2 params as above: 31k iops
>
> The count calculation could use some improvement to close the iops gap to delay-based flushing
>
> >>
> >>  /* notes:
> >> diff --git a/drivers/md/md.h b/drivers/md/md.h
> >> index b4e2d8b87b61..d25574e46283 100644
> >> --- a/drivers/md/md.h
> >> +++ b/drivers/md/md.h
> >> @@ -497,6 +497,7 @@ struct mddev {
> >>                 struct mutex            mutex;
> >>                 unsigned long           chunksize;
> >>                 unsigned long           daemon_sleep; /* how many jiffies between updates? */
> >> +               unsigned int            daemon_flush_chunks; /* how many dirty chunks between updates */
> >>                 unsigned long           max_write_behind; /* write-behind mode */
> >>                 int                     external;
> >>                 int                     nodes; /* Maximum number of nodes in the cluster */
> >> --
> >> 2.31.1
> >>


* Re: [RFC PATCH] mdadm: Add parameter for bitmap chunk threshold
  2022-10-06 22:08 ` [RFC PATCH] mdadm: Add parameter for bitmap chunk threshold Jonathan Derrick
@ 2022-10-12  7:17   ` Mariusz Tkaczyk
  0 siblings, 0 replies; 10+ messages in thread
From: Mariusz Tkaczyk @ 2022-10-12  7:17 UTC (permalink / raw)
  To: Jonathan Derrick
  Cc: Song Liu, linux-raid, linux-kernel, jonathan.derrick,
	jonathanx.sk.derrick

On Thu,  6 Oct 2022 16:08:38 -0600
Jonathan Derrick <jonathan.derrick@linux.dev> wrote:

> Adds parameter to mdadm create, grow, and build similar to the delay
> parameter, that specifies a chunk threshold. This value will instruct
> the kernel, in-tandem with the delay timer, to flush the bitmap after
> every N chunks have been dirtied. This can be used in-addition to the
> delay parameter and complements it.
> 
> This requires an addition to the bitmap superblock and version increment.

Hello Jonathan,
To provide that parameter to the bitmap we are updating the bitmap superblock, right?
Why do we need to define it in the config then? If someone wants to change it, they
should use --grow. Am I correct?

The "threshold" is not a context property, it should be added to struct shape.
Ideally, we can extract bitmap properties to separate struct and pass it around.

And I would like to have IMSM support if that is possible.

Thanks,
Mariusz


* Re: [PATCH 2/2] md/bitmap: Add chunk-count-based bitmap flushing
  2022-10-10 18:18       ` Song Liu
@ 2022-10-13 22:19         ` Jonathan Derrick
  2022-10-13 22:56           ` Song Liu
  0 siblings, 1 reply; 10+ messages in thread
From: Jonathan Derrick @ 2022-10-13 22:19 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, linux-kernel, jonathan.derrick, jonathanx.sk.derrick,
	Mariusz Tkaczyk



On 10/10/2022 12:18 PM, Song Liu wrote:
> On Fri, Oct 7, 2022 at 11:58 AM Jonathan Derrick
> <jonathan.derrick@linux.dev> wrote:
>>
>>
>>
>> On 10/7/2022 11:50 AM, Song Liu wrote:
>>> On Thu, Oct 6, 2022 at 3:09 PM Jonathan Derrick
>>> <jonathan.derrick@linux.dev> wrote:
>>>
>>> [...]
>>>
>>>> diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
>>>> index cfd7395de8fd..e0aeedbdde17 100644
>>>> --- a/drivers/md/md-bitmap.h
>>>> +++ b/drivers/md/md-bitmap.h
>>>> @@ -11,10 +11,12 @@
>>>>  /* version 4 insists the bitmap is in little-endian order
>>>>   * with version 3, it is host-endian which is non-portable
>>>>   * Version 5 is currently set only for clustered devices
>>>> ++ * Version 6 supports the flush-chunks threshold
>>>>   */
>>>>  #define BITMAP_MAJOR_HI 4
>>>>  #define BITMAP_MAJOR_CLUSTERED 5
>>>>  #define        BITMAP_MAJOR_HOSTENDIAN 3
>>>> +#define BITMAP_MAJOR_CHUNKFLUSH 6
>>>>
>>>>  /*
>>>>   * in-memory bitmap:
>>>> @@ -135,7 +137,8 @@ typedef struct bitmap_super_s {
>>>>                                   * reserved for the bitmap. */
>>>>         __le32 nodes;        /* 68 the maximum number of nodes in cluster. */
>>>>         __u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
>>>> -       __u8  pad[256 - 136]; /* set to zero */
>>>> +       __le32 daemon_flush_chunks; /* 136 dirty chunks between flushes */
>>>> +       __u8  pad[256 - 140]; /* set to zero */
>>>>  } bitmap_super_t;
>>>
>>> Do we really need this to be persistent? How about we configure it at run
>>> time via a sysfs file?
>>>
>>> Also, please share more data on the performance benefit of the set.
>>>
>>> Thanks,
>>> Song
>>>
>> Hi Song,
>>
>> Patch 1/2 changes default behavior, which patch 2/2 tries to address.
> 
> Have you tried to evaluate the impact on the accuracy of the bitmap?
> Specifically, if we power off the system during writes, do we see data
> or parity mismatch that is not covered by the bitmap?
Fair. I'm assuming this has to do with md_bitmap_init_from_disk()'s
outofdate BITMAP_STALE check? And my patch 1/2 would likely guarantee
a full resync unless the system was lost just after the daemon wake
time. However, patch 2/2 increases the likelihood of reading a good
bitmap.
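
For illustration, the threshold check in patch 2/2 amounts to something
like the helper below (placeholder names, not the actual md-bitmap code):

/* Illustrative pseudo-helper: should the daemon flush the bitmap now
 * instead of waiting out the remaining daemon_sleep interval? */
static bool flush_threshold_reached(unsigned int dirty_chunks,
				    unsigned int flush_chunks)
{
	/* flush_chunks == 0 disables the feature: timer-only flushing */
	return flush_chunks != 0 && dirty_chunks >= flush_chunks;
}

With chunk-flush=512 and 4MB chunks that works out to roughly 2GB of
newly dirtied chunks between flushes, which is the configuration in the
benchmark quoted below.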


> 
>> I can change it to be configurable via sysfs instead.
>> Should there be a default?
> 
> If there is any impact on bitmap accuracy. I think the default should
> work identical as before the set. IOW, we should not delay the bitmap
> update.
With results like mine, I'm under the impression bitmap=internal is not
regularly used for write-heavy workloads [1]. 

The thing is that it's not very consistent right now. I've had runs
where the bitmap isn't updated for minutes, until the run ends, and then
most runs where it's updated every other I/O or so. It seems to depend
on the number of chunks relative to the device size (whether the bitmap
can fit in a single page).
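
Back-of-the-envelope, that sizing looks roughly like this (illustrative
numbers matching the 4M-chunksize / 238436-chunk setup quoted below;
assumes one on-disk bit per chunk and 4KiB pages):

/* Rough, illustrative sizing: does the on-disk bitmap fit in one page?
 * Assumes one bit per chunk and 4 KiB pages; ignores the 256-byte
 * bitmap superblock at the start. */
#include <stdio.h>

int main(void)
{
	unsigned long long chunk_mib     = 4;		/* 4 MiB bitmap chunks */
	unsigned long long chunks        = 238436;	/* ~931 GiB device */
	unsigned long long bits_per_page = 4096ULL * 8;
	unsigned long long pages = (chunks + bits_per_page - 1) / bits_per_page;

	/* 238436 bits is ~29 KiB of bitmap, i.e. about 8 pages, not 1 */
	printf("%llu chunks of %llu MiB -> %llu bitmap page(s)\n",
	       chunks, chunk_mib, pages);
	return 0;
}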

I have v2 coming which should help fix a few of these inconsistencies.

[1] Similar results https://blog.liw.fi/posts/write-intent-bitmaps/

> 
> Thanks,
> Song
> 
>>
>>
>> Here are my observations via biosnoop and RAID1, 4M chunksize, 238436 chunks, bitmap=internal
>> fio --name=test --direct=1 --filename=/dev/md0 --rw=randwrite --runtime=60
>>  --percentile_list=1.0:25.0:50.0:75.0:90.0:95.0:99.0:99.9:99.99:99.999999:100.0
>>
>>
>> Default, bitmap updates happened concurrently with I/O:
>>    bw (  KiB/s): min=18690, max=30618, per=99.94%, avg=23822.07, stdev=2522.73, samples=119
>>    iops        : min= 4672, max= 7654, avg=5955.20, stdev=630.71, samples=119
>>
>> TIME(s)     COMM           PID     DISK      T SECTOR     BYTES  LAT(ms)
>> 38.090366   md0_raid1      4800    nvme6n1   W 40         4096      0.01
>> 38.090423   md0_raid1      4800    nvme3n1   W 40         4096      0.07
>> 38.090442   md0_raid1      4800    nvme3n1   W 1016633184 4096      0.01
>> 38.090439   md0_raid1      4800    nvme6n1   W 1016633184 4096      0.01
>> 38.090479   md0_raid1      4800    nvme6n1   W 56         4096      0.01
>> 38.090493   md0_raid1      4800    nvme6n1   W 1449894256 4096      0.01
>> 38.090477   md0_raid1      4800    nvme3n1   W 56         4096      0.01
>> 38.090496   md0_raid1      4800    nvme3n1   W 1449894256 4096      0.01
>> 38.090530   md0_raid1      4800    nvme3n1   W 16         4096      0.01
>> 38.090555   md0_raid1      4800    nvme3n1   W 110493568  4096      0.01
>> 38.090538   md0_raid1      4800    nvme6n1   W 16         4096      0.01
>> 38.090551   md0_raid1      4800    nvme6n1   W 110493568  4096      0.01
>> 38.090596   md0_raid1      4800    nvme6n1   W 56         4096      0.01
>> 38.090647   md0_raid1      4800    nvme3n1   W 56         4096      0.06
>> 38.090666   md0_raid1      4800    nvme3n1   W 1455846976 4096      0.01
>> 38.090663   md0_raid1      4800    nvme6n1   W 1455846976 4096      0.01
>> 38.090707   md0_raid1      4800    nvme6n1   W 64         4096      0.01
>> 38.090699   md0_raid1      4800    nvme3n1   W 64         4096      0.01
>> 38.090723   md0_raid1      4800    nvme3n1   W 1665013728 4096      0.01
>> 38.090720   md0_raid1      4800    nvme6n1   W 1665013728 4096      0.01
>> 38.090764   md0_raid1      4800    nvme6n1   W 64         4096      0.01
>> 38.090812   md0_raid1      4800    nvme3n1   W 64         4096      0.06
>> 38.090832   md0_raid1      4800    nvme3n1   W 1637994296 4096      0.01
>> 38.090828   md0_raid1      4800    nvme6n1   W 1637994296 4096      0.01
>>
>>
>>
>>
>> With patch 1/2, bitmaps only update on the 'delay' parameter (default 5s):
>>    bw (  KiB/s): min=135712, max=230938, per=100.00%, avg=209308.56, stdev=29254.31, samples=119
>>    iops        : min=33928, max=57734, avg=52326.78, stdev=7313.57, samples=119
>>
>> TIME(s)     COMM           PID     DISK      T SECTOR     BYTES  LAT(ms)
>> 16.292235   md0_raid1      4841    nvme6n1   W 297367432  4096      0.01
>> 16.292258   md0_raid1      4841    nvme6n1   W 16         4096      0.01
>> 16.292266   md0_raid1      4841    nvme6n1   W 24         4096      0.01
>> 16.292277   md0_raid1      4841    nvme6n1   W 32         4096      0.01
>> 16.292259   md0_raid1      4841    nvme3n1   W 16         4096      0.01
>> 16.292280   md0_raid1      4841    nvme3n1   W 32         4096      0.01
>> 16.292305   md0_raid1      4841    nvme3n1   W 56         4096      0.01
>> 16.292286   md0_raid1      4841    nvme6n1   W 40         4096      0.01
>> 16.292295   md0_raid1      4841    nvme6n1   W 48         4096      0.01
>> 16.292326   md0_raid1      4841    nvme3n1   W 72         1536      0.01
>> 16.292323   md0_raid1      4841    nvme6n1   W 64         4096      0.02
>> 16.292326   md0_raid1      4841    nvme6n1   W 56         4096      0.03
>> 16.292334   md0_raid1      4841    nvme6n1   W 72         1536      0.02
>> 16.300697   md0_raid1      4841    nvme3n1   W 1297533744 4096      0.01
>> 16.300702   md0_raid1      4841    nvme6n1   W 1297533744 4096      0.01
>> 16.300803   md0_raid1      4841    nvme6n1   W 1649080856 4096      0.01
>> 16.300798   md0_raid1      4841    nvme3n1   W 1649080856 4096      0.01
>> 16.300823   md0_raid1      4841    nvme3n1   W 1539317792 4096      0.01
>> 16.300845   md0_raid1      4841    nvme3n1   W 1634570232 4096      0.01
>> 16.300867   md0_raid1      4841    nvme3n1   W 579232208  4096      0.01
>> 16.300889   md0_raid1      4841    nvme3n1   W 1818140424 4096      0.01
>> 16.300922   md0_raid1      4841    nvme3n1   W 412971920  4096      0.02
>> ...
>> 21.293225   md0_raid1      4841    nvme3n1   W 1279122360 4096      0.01
>> 21.293242   md0_raid1      4841    nvme3n1   W 40326272   4096      0.01
>> 21.293223   md0_raid1      4841    nvme6n1   W 1279122360 4096      0.01
>> 21.293243   md0_raid1      4841    nvme6n1   W 40326272   4096      0.01
>> 21.293261   md0_raid1      4841    nvme6n1   W 16         4096      0.01
>> 21.293266   md0_raid1      4841    nvme6n1   W 24         4096      0.01
>> 21.293271   md0_raid1      4841    nvme6n1   W 32         4096      0.01
>> 21.293275   md0_raid1      4841    nvme3n1   W 32         4096      0.01
>> 21.293292   md0_raid1      4841    nvme3n1   W 48         4096      0.01
>> 21.293296   md0_raid1      4841    nvme3n1   W 56         4096      0.01
>> 21.293309   md0_raid1      4841    nvme3n1   W 72         1536      0.01
>> 21.293266   md0_raid1      4841    nvme3n1   W 24         4096      0.01
>> 21.293326   md0_raid1      4841    nvme6n1   W 48         4096      0.05
>> 21.293328   md0_raid1      4841    nvme6n1   W 40         4096      0.06
>> 21.293331   md0_raid1      4841    nvme6n1   W 72         1536      0.03
>> 21.293333   md0_raid1      4841    nvme6n1   W 64         4096      0.04
>> 21.293334   md0_raid1      4841    nvme6n1   W 56         4096      0.05
>> 21.298526   md0_raid1      4841    nvme3n1   W 681973000  4096      0.01
>>
>>
>>
>>
>> Good, but with the granularity of N seconds, it might be too infrequent.
>> Here is chunk-flush=512 (2GB threshold in 4MB chunk size):
>>    bw (  KiB/s): min=92692, max=134904, per=100.00%, avg=125127.43, stdev=6758.51, samples=119
>>    iops        : min=23173, max=33726, avg=31281.55, stdev=1689.63, samples=119
>>
>> TIME(s)     COMM           PID     DISK      T SECTOR     BYTES  LAT(ms)
>> 13.193339   md0_raid1      5972    nvme6n1   W 16         4096      0.01
>> 13.193344   md0_raid1      5972    nvme6n1   W 32         4096      0.01
>> 13.193346   md0_raid1      5972    nvme6n1   W 24         4096      0.01
>> 13.193350   md0_raid1      5972    nvme6n1   W 40         4096      0.01
>> 13.193356   md0_raid1      5972    nvme6n1   W 48         4096      0.01
>> 13.193361   md0_raid1      5972    nvme6n1   W 64         4096      0.01
>> 13.193363   md0_raid1      5972    nvme6n1   W 56         4096      0.01
>> 13.193555   md0_raid1      5972    nvme6n1   W 72         1536      0.20
>> 13.193289   md0_raid1      5972    nvme3n1   W 1912285848 4096      0.01
>> 13.193306   md0_raid1      5972    nvme3n1   W 836455896  4096      0.01
>> 13.193323   md0_raid1      5972    nvme3n1   W 233728136  4096      0.01
>> 13.193339   md0_raid1      5972    nvme3n1   W 16         4096      0.01
>> 13.193344   md0_raid1      5972    nvme3n1   W 24         4096      0.01
>> 13.193362   md0_raid1      5972    nvme3n1   W 48         4096      0.01
>> 13.193365   md0_raid1      5972    nvme3n1   W 64         4096      0.01
>> 13.193366   md0_raid1      5972    nvme3n1   W 56         4096      0.01
>> 13.193574   md0_raid1      5972    nvme3n1   W 72         1536      0.21
>> 13.196759   md0_raid1      5972    nvme3n1   W 89571592   4096      0.01
>> 13.196810   md0_raid1      5972    nvme6n1   W 89571592   4096      0.06
>> 13.196913   md0_raid1      5972    nvme6n1   W 16         4096      0.01
>> 13.196910   md0_raid1      5972    nvme3n1   W 16         4096      0.01
>> 13.199444   md0_raid1      5972    nvme3n1   W 64         4096      0.01
>> 13.199447   md0_raid1      5972    nvme3n1   W 137126232  4096      0.01
>> 13.199515   md0_raid1      5972    nvme6n1   W 137126232  4096      0.08
>> 13.199519   md0_raid1      5972    nvme6n1   W 64         4096      0.08
>> 13.199617   md0_raid1      5972    nvme6n1   W 1216062808 4096      0.01
>> ... (508 ios later)
>> 13.208764   md0_raid1      5972    nvme6n1   W 16         4096      0.01
>> 13.208768   md0_raid1      5972    nvme6n1   W 32         4096      0.01
>> 13.208770   md0_raid1      5972    nvme6n1   W 24         4096      0.01
>> 13.208775   md0_raid1      5972    nvme6n1   W 40         4096      0.01
>> 13.208781   md0_raid1      5972    nvme6n1   W 48         4096      0.01
>> 13.208786   md0_raid1      5972    nvme6n1   W 56         4096      0.01
>> 13.208790   md0_raid1      5972    nvme6n1   W 64         4096      0.01
>> 13.208729   md0_raid1      5972    nvme3n1   W 1607847808 4096      0.01
>> 13.208747   md0_raid1      5972    nvme3n1   W 371214368  4096      0.01
>> 13.208770   md0_raid1      5972    nvme3n1   W 32         4096      0.01
>> 13.208789   md0_raid1      5972    nvme3n1   W 64         4096      0.01
>> 13.208952   md0_raid1      5972    nvme6n1   W 72         1536      0.17
>> 13.209079   md0_raid1      5972    nvme3n1   W 72         1536      0.29
>> 13.212216   md0_raid1      5972    nvme3n1   W 1146106480 4096      0.01
>> 13.212269   md0_raid1      5972    nvme6n1   W 1146106480 4096      0.06
>> 13.212368   md0_raid1      5972    nvme6n1   W 16         4096      0.01
>> 13.212365   md0_raid1      5972    nvme3n1   W 16         4096      0.01
>>
>>
>> Without 1/2: 6k iops
>> With 1/2: 52k iops
>> With 2/2 params as above: 31k iops
>>
>> The count calculation could use some improvement to close the iops gap to delay-based flushing
>>
>>>>
>>>>  /* notes:
>>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>>> index b4e2d8b87b61..d25574e46283 100644
>>>> --- a/drivers/md/md.h
>>>> +++ b/drivers/md/md.h
>>>> @@ -497,6 +497,7 @@ struct mddev {
>>>>                 struct mutex            mutex;
>>>>                 unsigned long           chunksize;
>>>>                 unsigned long           daemon_sleep; /* how many jiffies between updates? */
>>>> +               unsigned int            daemon_flush_chunks; /* how many dirty chunks between updates */
>>>>                 unsigned long           max_write_behind; /* write-behind mode */
>>>>                 int                     external;
>>>>                 int                     nodes; /* Maximum number of nodes in the cluster */
>>>> --
>>>> 2.31.1
>>>>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] md/bitmap: Add chunk-count-based bitmap flushing
  2022-10-13 22:19         ` Jonathan Derrick
@ 2022-10-13 22:56           ` Song Liu
  0 siblings, 0 replies; 10+ messages in thread
From: Song Liu @ 2022-10-13 22:56 UTC (permalink / raw)
  To: Jonathan Derrick
  Cc: linux-raid, linux-kernel, jonathan.derrick, jonathanx.sk.derrick,
	Mariusz Tkaczyk

On Thu, Oct 13, 2022 at 3:19 PM Jonathan Derrick
<jonathan.derrick@linux.dev> wrote:
>
>
>
> On 10/10/2022 12:18 PM, Song Liu wrote:
> > On Fri, Oct 7, 2022 at 11:58 AM Jonathan Derrick
> > <jonathan.derrick@linux.dev> wrote:
> >>
> >>
> >>
> >> On 10/7/2022 11:50 AM, Song Liu wrote:
> >>> On Thu, Oct 6, 2022 at 3:09 PM Jonathan Derrick
> >>> <jonathan.derrick@linux.dev> wrote:
> >>>
> >>> [...]
> >>>
> >>>> diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
> >>>> index cfd7395de8fd..e0aeedbdde17 100644
> >>>> --- a/drivers/md/md-bitmap.h
> >>>> +++ b/drivers/md/md-bitmap.h
> >>>> @@ -11,10 +11,12 @@
> >>>>  /* version 4 insists the bitmap is in little-endian order
> >>>>   * with version 3, it is host-endian which is non-portable
> >>>>   * Version 5 is currently set only for clustered devices
> >>>> ++ * Version 6 supports the flush-chunks threshold
> >>>>   */
> >>>>  #define BITMAP_MAJOR_HI 4
> >>>>  #define BITMAP_MAJOR_CLUSTERED 5
> >>>>  #define        BITMAP_MAJOR_HOSTENDIAN 3
> >>>> +#define BITMAP_MAJOR_CHUNKFLUSH 6
> >>>>
> >>>>  /*
> >>>>   * in-memory bitmap:
> >>>> @@ -135,7 +137,8 @@ typedef struct bitmap_super_s {
> >>>>                                   * reserved for the bitmap. */
> >>>>         __le32 nodes;        /* 68 the maximum number of nodes in cluster. */
> >>>>         __u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
> >>>> -       __u8  pad[256 - 136]; /* set to zero */
> >>>> +       __le32 daemon_flush_chunks; /* 136 dirty chunks between flushes */
> >>>> +       __u8  pad[256 - 140]; /* set to zero */
> >>>>  } bitmap_super_t;
> >>>
> >>> Do we really need this to be persistent? How about we configure it at run
> >>> time via a sysfs file?
> >>>
> >>> Also, please share more data on the performance benefit of the set.
> >>>
> >>> Thanks,
> >>> Song
> >>>
> >> Hi Song,
> >>
> >> Patch 1/2 changes default behavior, which patch 2/2 tries to address.
> >
> > Have you tried to evaluate the impact on the accuracy of the bitmap?
> > Specifically, if we power off the system during writes, do we see data
> > or parity mismatch that is not covered by the bitmap?
> Fair. I'm assuming this has to do with md_bitmap_init_from_disk()'s
> outofdate BITMAP_STALE check? And my patch 1/2 would likely guarantee
> a full resync unless the system was lost just after the daemon wake
> time. However, patch 2/2 increases the likelihood of reading a good
> bitmap.

The kernel test bot reported a failed mdadm test after patch 1/2. Could you
please check whether that's accurate?

>
>
> >
> >> I can change it to be configurable via sysfs instead.
> >> Should there be a default?
> >
> > If there is any impact on bitmap accuracy. I think the default should
> > work identical as before the set. IOW, we should not delay the bitmap
> > update.
> With results like mine, I'm under the impression bitmap=internal is not
> regularly used for write-heavy workloads [1].

It is pretty bad for really random writes. But it shouldn't be too bad
for normal workloads (where folks already optimize writes to be more
sequential).

>
> The thing is that it's not very consistent right now. I've had runs
> where the bitmap isn't updated for minutes, until the run ends, and then
> most runs where it's updated every other I/O or so. It seems to depend
> on the number of chunks relative to the device size (whether the bitmap
> can fit in a single page).
>
> I have v2 coming which should help fix a few of these inconsistencies.

Sounds great. Thanks!
Song

>
> [1] Similar results https://blog.liw.fi/posts/write-intent-bitmaps/
>
> [...]

^ permalink raw reply	[flat|nested] 10+ messages in thread


Thread overview: 10+ messages
2022-10-06 22:08 [PATCH 0/2] Bitmap percentage flushing Jonathan Derrick
2022-10-06 22:08 ` [RFC PATCH] mdadm: Add parameter for bitmap chunk threshold Jonathan Derrick
2022-10-12  7:17   ` Mariusz Tkaczyk
2022-10-06 22:08 ` [PATCH 1/2] md/bitmap: Move unplug to daemon thread Jonathan Derrick
2022-10-06 22:08 ` [PATCH 2/2] md/bitmap: Add chunk-count-based bitmap flushing Jonathan Derrick
2022-10-07 17:50   ` Song Liu
2022-10-07 18:58     ` Jonathan Derrick
2022-10-10 18:18       ` Song Liu
2022-10-13 22:19         ` Jonathan Derrick
2022-10-13 22:56           ` Song Liu
