* [patch 0/4 v2] optimize raid1 read balance for SSD
@ 2012-06-13 9:09 Shaohua Li
2012-06-13 9:09 ` [patch 1/4 v2] raid1: move distance based read balance to a separate function Shaohua Li
` (4 more replies)
0 siblings, 5 replies; 10+ messages in thread
From: Shaohua Li @ 2012-06-13 9:09 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, axboe
raid1 read balance is an important algorithm for making read performance
optimal. It's a distance-based algorithm: for each dispatched request, choose
the disk whose last completed request is closest to the new request. This is
great for hard disks, but SSDs have some special characteristics:
1. Non-rotational. Distance means nothing for an SSD, though merging small
requests into a big request is still beneficial. If requests can't be merged,
distributing them across as many raid disks as possible is better.
2. A very big request isn't always optimal. For a hard disk, data transfer
overhead is trivial compared to the spindle move, so we always prefer a bigger
request. For an SSD, once the request size exceeds a specific value,
performance no longer increases with request size. An example is readahead: if
readahead merges requests so large that some disks sit idle, performance is
worse than when all disks are busy running smaller requests.
The patches address these issues. The first two patches are cleanups. The
third patch addresses the first item above, and the fourth addresses the
second. The idea can be applied to raid10 too, which is on my todo list.
V1->V2:
rebase to latest kernel.
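To make the idea concrete, here is a minimal userspace sketch of the two
selection rules described in this cover letter (plain C; struct disk_state,
pick_disk and their fields are invented names for illustration, not md code):

#include <stdlib.h>

struct disk_state {
	long long head_position;	/* sector of the last completed request */
	int pending;			/* requests currently in flight */
};

/*
 * Hard disks: pick the disk whose last completed request is closest to
 * the new one.  SSDs: distance is meaningless, so spread the load by
 * picking the disk with the fewest pending requests.
 */
static int pick_disk(const struct disk_state *d, int ndisks,
		     long long sector, int array_is_ssd)
{
	int best = 0;

	for (int i = 1; i < ndisks; i++) {
		if (array_is_ssd) {
			if (d[i].pending < d[best].pending)
				best = i;
		} else if (llabs(sector - d[i].head_position) <
			   llabs(sector - d[best].head_position)) {
			best = i;
		}
	}
	return best;
}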
* [patch 1/4 v2] raid1: move distance based read balance to a separate function
2012-06-13 9:09 [patch 0/4 v2] optimize raid1 read balance for SSD Shaohua Li
@ 2012-06-13 9:09 ` Shaohua Li
2012-06-13 9:09 ` [patch 2/4 v2] raid1: make sequential read detection per disk based Shaohua Li
` (3 subsequent siblings)
4 siblings, 0 replies; 10+ messages in thread
From: Shaohua Li @ 2012-06-13 9:09 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, axboe
Move the distance-based read balance algorithm to a separate function. No
functional change.
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
drivers/md/raid1.c | 42 ++++++++++++++++++++++++++++++------------
1 file changed, 30 insertions(+), 12 deletions(-)
Index: linux/drivers/md/raid1.c
===================================================================
--- linux.orig/drivers/md/raid1.c 2012-05-08 16:36:12.624232473 +0800
+++ linux/drivers/md/raid1.c 2012-05-08 16:36:14.476209200 +0800
@@ -463,6 +463,31 @@ static void raid1_end_write_request(stru
bio_put(to_put);
}
+static int read_balance_measure_distance(struct r1conf *conf,
+ struct r1bio *r1_bio, int disk, int *best_disk, sector_t *best_dist)
+{
+ const sector_t this_sector = r1_bio->sector;
+ struct md_rdev *rdev;
+ sector_t dist;
+
+ rdev = rcu_dereference(conf->mirrors[disk].rdev);
+
+ dist = abs(this_sector - conf->mirrors[disk].head_position);
+ /* Don't change to another disk for sequential reads */
+ if (conf->next_seq_sect == this_sector
+ || dist == 0
+ /* If device is idle, use it */
+ || atomic_read(&rdev->nr_pending) == 0) {
+ *best_disk = disk;
+ return 0;
+ }
+
+ if (dist < *best_dist) {
+ *best_dist = dist;
+ *best_disk = disk;
+ }
+ return 1;
+}
/*
* This routine returns the disk from which the requested read should
@@ -512,7 +537,6 @@ static int read_balance(struct r1conf *c
}
for (i = 0 ; i < conf->raid_disks * 2 ; i++) {
- sector_t dist;
sector_t first_bad;
int bad_sectors;
@@ -577,20 +601,14 @@ static int read_balance(struct r1conf *c
} else
best_good_sectors = sectors;
- dist = abs(this_sector - conf->mirrors[disk].head_position);
- if (choose_first
- /* Don't change to another disk for sequential reads */
- || conf->next_seq_sect == this_sector
- || dist == 0
- /* If device is idle, use it */
- || atomic_read(&rdev->nr_pending) == 0) {
+ if (choose_first) {
best_disk = disk;
break;
}
- if (dist < best_dist) {
- best_dist = dist;
- best_disk = disk;
- }
+
+ if (!read_balance_measure_distance(conf, r1_bio, disk,
+ &best_disk, &best_dist))
+ break;
}
if (best_disk >= 0) {
* [patch 2/4 v2] raid1: make sequential read detection per disk based
2012-06-13 9:09 [patch 0/4 v2] optimize raid1 read balance for SSD Shaohua Li
2012-06-13 9:09 ` [patch 1/4 v2] raid1: move distance based read balance to a separate function Shaohua Li
@ 2012-06-13 9:09 ` Shaohua Li
2012-06-13 9:09 ` [patch 3/4 v2] raid1: read balance chooses idlest disk Shaohua Li
` (2 subsequent siblings)
4 siblings, 0 replies; 10+ messages in thread
From: Shaohua Li @ 2012-06-13 9:09 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, axboe
Currently the sequential read detection is array-wide. It's natural to make it
per-disk, which improves detection for multiple concurrent sequential reads.
The next patch will make SSD read balance stop using the distance-based
algorithm, and this change helps it detect truly sequential reads on SSD.
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
drivers/md/raid1.c | 29 ++++++-----------------------
drivers/md/raid1.h | 11 +++++------
2 files changed, 11 insertions(+), 29 deletions(-)
Index: linux/drivers/md/raid1.c
===================================================================
--- linux.orig/drivers/md/raid1.c 2012-06-13 16:16:44.991391463 +0800
+++ linux/drivers/md/raid1.c 2012-06-13 16:16:46.751369597 +0800
@@ -474,7 +474,7 @@ static int read_balance_measure_distance
dist = abs(this_sector - conf->mirrors[disk].head_position);
/* Don't change to another disk for sequential reads */
- if (conf->next_seq_sect == this_sector
+ if (conf->mirrors[disk].next_seq_sect == this_sector
|| dist == 0
/* If device is idle, use it */
|| atomic_read(&rdev->nr_pending) == 0) {
@@ -508,7 +508,6 @@ static int read_balance(struct r1conf *c
const sector_t this_sector = r1_bio->sector;
int sectors;
int best_good_sectors;
- int start_disk;
int best_disk;
int i;
sector_t best_dist;
@@ -528,19 +527,16 @@ static int read_balance(struct r1conf *c
best_good_sectors = 0;
if (conf->mddev->recovery_cp < MaxSector &&
- (this_sector + sectors >= conf->next_resync)) {
+ (this_sector + sectors >= conf->next_resync))
choose_first = 1;
- start_disk = 0;
- } else {
+ else
choose_first = 0;
- start_disk = conf->last_used;
- }
for (i = 0 ; i < conf->raid_disks * 2 ; i++) {
sector_t first_bad;
int bad_sectors;
- int disk = start_disk + i;
+ int disk = i;
if (disk >= conf->raid_disks)
disk -= conf->raid_disks;
@@ -624,8 +620,7 @@ static int read_balance(struct r1conf *c
goto retry;
}
sectors = best_good_sectors;
- conf->next_seq_sect = this_sector + sectors;
- conf->last_used = best_disk;
+ conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
}
rcu_read_unlock();
*max_sectors = sectors;
@@ -2599,7 +2594,6 @@ static struct r1conf *setup_conf(struct
conf->recovery_disabled = mddev->recovery_disabled - 1;
err = -EIO;
- conf->last_used = -1;
for (i = 0; i < conf->raid_disks * 2; i++) {
disk = conf->mirrors + i;
@@ -2625,19 +2619,9 @@ static struct r1conf *setup_conf(struct
if (disk->rdev &&
(disk->rdev->saved_raid_disk < 0))
conf->fullsync = 1;
- } else if (conf->last_used < 0)
- /*
- * The first working device is used as a
- * starting point to read balancing.
- */
- conf->last_used = i;
+ }
}
- if (conf->last_used < 0) {
- printk(KERN_ERR "md/raid1:%s: no operational mirrors\n",
- mdname(mddev));
- goto abort;
- }
err = -ENOMEM;
conf->thread = md_register_thread(raid1d, mddev, NULL);
if (!conf->thread) {
@@ -2894,7 +2878,6 @@ static int raid1_reshape(struct mddev *m
conf->raid_disks = mddev->raid_disks = raid_disks;
mddev->delta_disks = 0;
- conf->last_used = 0; /* just make sure it is in-range */
lower_barrier(conf);
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
Index: linux/drivers/md/raid1.h
===================================================================
--- linux.orig/drivers/md/raid1.h 2012-06-13 16:16:32.679546214 +0800
+++ linux/drivers/md/raid1.h 2012-06-13 16:16:46.751369597 +0800
@@ -4,6 +4,11 @@
struct mirror_info {
struct md_rdev *rdev;
sector_t head_position;
+
+ /* When choose the best device for a read (read_balance())
+ * we try to keep sequential reads one the same device
+ */
+ sector_t next_seq_sect;
};
/*
@@ -29,12 +34,6 @@ struct r1conf {
*/
int raid_disks;
- /* When choose the best device for a read (read_balance())
- * we try to keep sequential reads one the same device
- * using 'last_used' and 'next_seq_sect'
- */
- int last_used;
- sector_t next_seq_sect;
/* During resync, read_balancing is only allowed on the part
* of the array that has been resynced. 'next_resync' tells us
* where that is.
* [patch 3/4 v2] raid1: read balance chooses idlest disk
2012-06-13 9:09 [patch 0/4 v2] optimize raid1 read balance for SSD Shaohua Li
2012-06-13 9:09 ` [patch 1/4 v2] raid1: move distance based read balance to a separate function Shaohua Li
2012-06-13 9:09 ` [patch 2/4 v2] raid1: make sequential read detection per disk based Shaohua Li
@ 2012-06-13 9:09 ` Shaohua Li
2012-06-13 9:09 ` [patch 4/4 v2] raid1: split large request for SSD Shaohua Li
2012-06-28 1:06 ` [patch 0/4 v2] optimize raid1 read balance " NeilBrown
4 siblings, 0 replies; 10+ messages in thread
From: Shaohua Li @ 2012-06-13 9:09 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, axboe
An SSD has no spindle, so the distance between requests means nothing, and the
original distance-based algorithm can sometimes cause severe performance
problems for an SSD raid.
Consider two thread groups, one accessing file A and the other accessing file
B. The first group will hit one disk and the second the other, because
requests are close together within each group but far apart between groups. In
this case read balance might keep one disk very busy while the other stays
relatively idle. For SSD, we should instead try to distribute requests across
as many disks as possible; there is no seek penalty anyway.
With the patch below I see more than 50% throughput improvement in some cases,
depending on the workload.
The only exception is small requests that can be merged into a big request,
which typically drives higher throughput for SSD too. Such small requests are
sequential reads. Unlike on a hard disk, sequential reads that can't be merged
(for example direct IO, or reads without readahead) can be ignored for SSD;
again there is no seek penalty. Readahead dispatches small requests, and such
requests can be merged.
The previous patch helps detect sequential reads well, at least as long as the
number of concurrent readers doesn't exceed the number of raid disks. In that
case the distance-based algorithm doesn't work well either.
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
drivers/md/raid1.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++---
drivers/md/raid1.h | 2 +
2 files changed, 54 insertions(+), 3 deletions(-)
Index: linux/drivers/md/raid1.c
===================================================================
--- linux.orig/drivers/md/raid1.c 2012-06-13 16:16:46.751369597 +0800
+++ linux/drivers/md/raid1.c 2012-06-13 16:16:53.235287825 +0800
@@ -463,6 +463,43 @@ static void raid1_end_write_request(stru
bio_put(to_put);
}
+static int read_balance_measure_ssd(struct r1conf *conf, struct r1bio *r1_bio,
+ int disk, int *best_disk, unsigned int *min_pending)
+{
+ const sector_t this_sector = r1_bio->sector;
+ struct md_rdev *rdev;
+ unsigned int pending;
+
+ rdev = rcu_dereference(conf->mirrors[disk].rdev);
+ pending = atomic_read(&rdev->nr_pending);
+
+ /* big request IO helps SSD too, allow sequential IO merge */
+ if (conf->mirrors[disk].next_seq_sect == this_sector) {
+ sector_t dist;
+ dist = abs(this_sector - conf->mirrors[disk].head_position);
+ /*
+ * head_position is for finished request, such request can't be
+ * merged with current request, so it means nothing for SSD
+ */
+ if (dist != 0)
+ goto done;
+ }
+
+ /* If device is idle, use it */
+ if (pending == 0)
+ goto done;
+
+ /* find device with less requests pending */
+ if (*min_pending > pending) {
+ *min_pending = pending;
+ *best_disk = disk;
+ }
+ return 1;
+done:
+ *best_disk = disk;
+ return 0;
+}
+
static int read_balance_measure_distance(struct r1conf *conf,
struct r1bio *r1_bio, int disk, int *best_disk, sector_t *best_dist)
{
@@ -511,6 +548,7 @@ static int read_balance(struct r1conf *c
int best_disk;
int i;
sector_t best_dist;
+ unsigned int min_pending;
struct md_rdev *rdev;
int choose_first;
@@ -524,6 +562,7 @@ static int read_balance(struct r1conf *c
sectors = r1_bio->sectors;
best_disk = -1;
best_dist = MaxSector;
+ min_pending = -1;
best_good_sectors = 0;
if (conf->mddev->recovery_cp < MaxSector &&
@@ -602,9 +641,15 @@ static int read_balance(struct r1conf *c
break;
}
- if (!read_balance_measure_distance(conf, r1_bio, disk,
- &best_disk, &best_dist))
- break;
+ if (!conf->nonrotational) {
+ if (!read_balance_measure_distance(conf, r1_bio, disk,
+ &best_disk, &best_dist))
+ break;
+ } else {
+ if (!read_balance_measure_ssd(conf, r1_bio, disk,
+ &best_disk, &min_pending))
+ break;
+ }
}
if (best_disk >= 0) {
@@ -2533,6 +2578,7 @@ static struct r1conf *setup_conf(struct
struct mirror_info *disk;
struct md_rdev *rdev;
int err = -ENOMEM;
+ bool nonrotational = true;
conf = kzalloc(sizeof(struct r1conf), GFP_KERNEL);
if (!conf)
@@ -2581,7 +2627,10 @@ static struct r1conf *setup_conf(struct
mddev->merge_check_needed = 1;
disk->head_position = 0;
+ if (!blk_queue_nonrot(bdev_get_queue(rdev->bdev)))
+ nonrotational = false;
}
+ conf->nonrotational = nonrotational;
conf->raid_disks = mddev->raid_disks;
conf->mddev = mddev;
INIT_LIST_HEAD(&conf->retry_list);
Index: linux/drivers/md/raid1.h
===================================================================
--- linux.orig/drivers/md/raid1.h 2012-06-13 16:16:46.751369597 +0800
+++ linux/drivers/md/raid1.h 2012-06-13 16:16:53.235287825 +0800
@@ -65,6 +65,8 @@ struct r1conf {
int nr_queued;
int barrier;
+ int nonrotational;
+
/* Set to 1 if a full sync is needed, (fresh device added).
* Cleared when a sync completes.
*/
* [patch 4/4 v2] raid1: split large request for SSD
2012-06-13 9:09 [patch 0/4 v2] optimize raid1 read balance for SSD Shaohua Li
` (2 preceding siblings ...)
2012-06-13 9:09 ` [patch 3/4 v2] raid1: read balance chooses idlest disk Shaohua Li
@ 2012-06-13 9:09 ` Shaohua Li
2012-06-28 1:06 ` [patch 0/4 v2] optimize raid1 read balance " NeilBrown
4 siblings, 0 replies; 10+ messages in thread
From: Shaohua Li @ 2012-06-13 9:09 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, axboe
For SSD, once the request size exceeds a specific value (the optimal IO size),
request size no longer matters for bandwidth. In that situation, making the
request even bigger and thereby leaving some disks idle actually drops total
throughput. A good example is readahead in a two-disk raid1 setup.
So when should we split a big request? We absolutely don't want to split big
requests into very small ones; even on SSD, transferring a big request is more
efficient. The patch below only considers requests larger than the optimal IO
size.
If all disks are busy, is it worth splitting? Say the optimal IO size is 16k,
and there are two 32k requests and two disks. We can let each disk run one 32k
request, or split into four 16k requests with each disk running two. It's hard
to say which is better; it depends on the hardware.
So we only consider the case where there are idle disks. For readahead, a
split is always better in this case, and in my test the patch below improves
throughput by more than 30%. Not 100%, because the disks aren't 100% busy.
This can happen outside readahead too, for example with direct IO, but I
assume direct IO usually has a bigger IO depth and keeps all disks busy, so I
ignored it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
drivers/md/raid1.c | 44 +++++++++++++++++++++++++++++++++++++-------
drivers/md/raid1.h | 2 ++
2 files changed, 39 insertions(+), 7 deletions(-)
Index: linux/drivers/md/raid1.c
===================================================================
--- linux.orig/drivers/md/raid1.c 2012-06-13 16:16:53.235287825 +0800
+++ linux/drivers/md/raid1.c 2012-06-13 16:16:56.315249040 +0800
@@ -464,31 +464,49 @@ static void raid1_end_write_request(stru
}
static int read_balance_measure_ssd(struct r1conf *conf, struct r1bio *r1_bio,
- int disk, int *best_disk, unsigned int *min_pending)
+ int disk, int *best_disk, unsigned int *min_pending, int *choose_idle)
{
const sector_t this_sector = r1_bio->sector;
struct md_rdev *rdev;
unsigned int pending;
+ struct mirror_info *mirror = &conf->mirrors[disk];
+ int ret = 0;
- rdev = rcu_dereference(conf->mirrors[disk].rdev);
+ rdev = rcu_dereference(mirror->rdev);
pending = atomic_read(&rdev->nr_pending);
/* big request IO helps SSD too, allow sequential IO merge */
- if (conf->mirrors[disk].next_seq_sect == this_sector) {
+ if (mirror->next_seq_sect == this_sector && *choose_idle == 0) {
sector_t dist;
- dist = abs(this_sector - conf->mirrors[disk].head_position);
+ dist = abs(this_sector - mirror->head_position);
/*
+ * head_position is for finished request, such request can't be
* merged with current request, so it means nothing for SSD
*/
- if (dist != 0)
+ if (dist != 0) {
+ /*
+ * If buffered sequential IO size exceeds optimal
+ * iosize and there is idle disk, choose idle disk
+ */
+ if (mirror->seq_start != MaxSector
+ && conf->opt_iosize > 0
+ && mirror->next_seq_sect > conf->opt_iosize
+ && mirror->next_seq_sect - conf->opt_iosize >=
+ mirror->seq_start) {
+ *choose_idle = 1;
+ ret = 1;
+ }
goto done;
+ }
}
/* If device is idle, use it */
if (pending == 0)
goto done;
+ if (*choose_idle == 1)
+ return 1;
+
/* find device with less requests pending */
if (*min_pending > pending) {
*min_pending = pending;
@@ -497,7 +515,7 @@ static int read_balance_measure_ssd(stru
return 1;
done:
*best_disk = disk;
- return 0;
+ return ret;
}
static int read_balance_measure_distance(struct r1conf *conf,
@@ -551,6 +569,7 @@ static int read_balance(struct r1conf *c
unsigned int min_pending;
struct md_rdev *rdev;
int choose_first;
+ int choose_idle;
rcu_read_lock();
/*
@@ -564,6 +583,7 @@ static int read_balance(struct r1conf *c
best_dist = MaxSector;
min_pending = -1;
best_good_sectors = 0;
+ choose_idle = 0;
if (conf->mddev->recovery_cp < MaxSector &&
(this_sector + sectors >= conf->next_resync))
@@ -647,7 +667,7 @@ static int read_balance(struct r1conf *c
break;
} else {
if (!read_balance_measure_ssd(conf, r1_bio, disk,
- &best_disk, &min_pending))
+ &best_disk, &min_pending, &choose_idle))
break;
}
}
@@ -665,6 +685,10 @@ static int read_balance(struct r1conf *c
goto retry;
}
sectors = best_good_sectors;
+
+ if (conf->mirrors[best_disk].next_seq_sect != this_sector)
+ conf->mirrors[best_disk].seq_start = this_sector;
+
conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
}
rcu_read_unlock();
@@ -2579,6 +2603,7 @@ static struct r1conf *setup_conf(struct
struct md_rdev *rdev;
int err = -ENOMEM;
bool nonrotational = true;
+ int opt_iosize = 0;
conf = kzalloc(sizeof(struct r1conf), GFP_KERNEL);
if (!conf)
@@ -2629,8 +2654,13 @@ static struct r1conf *setup_conf(struct
disk->head_position = 0;
if (!blk_queue_nonrot(bdev_get_queue(rdev->bdev)))
nonrotational = false;
+ else
+ opt_iosize = max(opt_iosize, bdev_io_opt(rdev->bdev));
+ disk->seq_start = MaxSector;
}
conf->nonrotational = nonrotational;
+ if (nonrotational)
+ conf->opt_iosize = opt_iosize >> 9;
conf->raid_disks = mddev->raid_disks;
conf->mddev = mddev;
INIT_LIST_HEAD(&conf->retry_list);
Index: linux/drivers/md/raid1.h
===================================================================
--- linux.orig/drivers/md/raid1.h 2012-06-13 16:16:53.235287825 +0800
+++ linux/drivers/md/raid1.h 2012-06-13 16:16:56.315249040 +0800
@@ -9,6 +9,7 @@ struct mirror_info {
* we try to keep sequential reads one the same device
*/
sector_t next_seq_sect;
+ sector_t seq_start;
};
/*
@@ -66,6 +67,7 @@ struct r1conf {
int barrier;
int nonrotational;
+ sector_t opt_iosize;
/* Set to 1 if a full sync is needed, (fresh device added).
* Cleared when a sync completes.
* Re: [patch 0/4 v2] optimize raid1 read balance for SSD
2012-06-13 9:09 [patch 0/4 v2] optimize raid1 read balance for SSD Shaohua Li
` (3 preceding siblings ...)
2012-06-13 9:09 ` [patch 4/4 v2] raid1: split large request for SSD Shaohua Li
@ 2012-06-28 1:06 ` NeilBrown
2012-06-28 1:47 ` Roberto Spadim
2012-06-28 3:29 ` Shaohua Li
4 siblings, 2 replies; 10+ messages in thread
From: NeilBrown @ 2012-06-28 1:06 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, axboe
On Wed, 13 Jun 2012 17:09:22 +0800 Shaohua Li <shli@kernel.org> wrote:
> raid1 read balance is an important algorithm to make read performance optimal.
> It's a distance based algorithm, eg, for each request dispatch, choose disk
> whose last finished request is close the request. This is great for hard disk.
> But SSD has some special characteristics:
>
> 1. nonrotational. Distance means nothing for SSD, though merging small rquests
> to big request is still optimal for SSD. If no merge, distributing rquests to
> raid disks as more as possible is more optimal.
This, of course, has nothing to do with the devices rotating, and everything
to do with there being a seek-penalty. So why the block-device flag is
called "rotational" I really don't know :-(
Can we make the flags that md uses be "seek_penalty" or "no_seek" or
something, even if the block layer has less meaningful names?
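For illustration, one way md could adopt its own naming while still reading
the block layer's flag might look like the kernel-style sketch below;
rdev_has_no_seek_penalty is a made-up helper name, not an existing md
function:

static inline bool rdev_has_no_seek_penalty(struct md_rdev *rdev)
{
	/* md-facing name for the property we actually care about (seek
	 * penalty), backed by the block layer's "rotational" flag */
	return blk_queue_nonrot(bdev_get_queue(rdev->bdev));
}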
>
> 2. Getting too big request isn't always optimal. For hard disk, compared to
> spindle move, data transfer overhead is trival, so we always prefer bigger
> request. In SSD, request size exceeds specific value, performance isn't always
> increased with request size increased. An example is readahead. If readahead
> merges too big request and causes some disks idle, the performance is less
> optimal than that when all disks are busy and running small requests.
I note that the patch doesn't actually split requests, rather it avoids
allocating adjacent requests to the same device - is that right?
That seems sensible but doesn't seem to match your description so I wanted to
check.
>
> The patches try to address the issues. The first two patches are clean up. The
> third patch addresses the first item above. The forth addresses the second item
> above. The idea can be applied to raid10 too, which is in my todo list.
Thanks for these - there is a lot to like here.
One concern that I have is that they assume that all devices are the same:
i.e. they are all "rotational" or none of them are, and
they all have the same optimal IO size.
md aims to work well on heterogeneous arrays so I'd like to make it more
general if I can.
Whether to "split" adjacent requests or not is decided in the context of a
single device (the device that the first request is queued for), so that
should be able to use the optimal IO size of that device.
General balancing is a little harder as the decision is made in the context
of all active devices. In particular we need to know how to choose between a
seek-penalty device and a no-seek-penalty device, if they both have requests
queued to them and the seek-penalty device is a long way from the target.
Maybe:
- if the last request to some device is within optimal-io-size of this
request, then send this request to that device.
- if either of two possible devices has no seek penalty, choose the one with
the fewest outstanding requests.
- if both of two possible devices have a seek-penalty, then choose the
closest
I think this will do the same as the current code for 'rotational' devices,
and will be close to what your code does for 'non-rotational' devices.
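A rough model of that combined rule, in plain C rather than md code
(mirror_state, choose_mirror and their fields are invented for illustration,
under the assumption that per-device state like pending count and optimal IO
size is available):

#include <stdlib.h>

struct mirror_state {
	long long last_sector;		/* end of the last request sent here */
	long long head_position;	/* position of the last completed request */
	int pending;			/* outstanding requests */
	int no_seek_penalty;		/* SSD-like device */
	long long opt_io_sectors;	/* this device's optimal IO size */
};

static int choose_mirror(const struct mirror_state *m, int n, long long sector)
{
	int best = -1;

	for (int i = 0; i < n; i++) {
		/* within this device's optimal IO size of its last request:
		 * stay on it so the requests can merge */
		if (llabs(sector - m[i].last_sector) <= m[i].opt_io_sectors)
			return i;

		if (best < 0) {
			best = i;
		} else if (m[i].no_seek_penalty || m[best].no_seek_penalty) {
			/* either candidate has no seek penalty: prefer the
			 * one with the fewest outstanding requests */
			if (m[i].pending < m[best].pending)
				best = i;
		} else if (llabs(sector - m[i].head_position) <
			   llabs(sector - m[best].head_position)) {
			/* both pay a seek penalty: prefer the closest */
			best = i;
		}
	}
	return best;
}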
There may well be room to improve this, but a key point is that it won't work
to have two separate balancing routines - one for disks and one for ssds.
I'm certainly happy for raid1_read_balance to be split up if it is too big,
but I don't think that the way the first patch splits it is useful.
Would you be able to try something like that instead?
BTW, in a couple of places you use an 'rdev' taken out of the 'conf'
structure without testing for NULL first. That isn't safe.
We get the 'rdev' at the top of the loop and test for NULL and other things.
After that you need to always use *that* rdev value, don't try to get it out
of the conf->mirrors structure again.
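The safe pattern being described is roughly the following (a sketch, not a
literal excerpt from the driver):

	int i;

	rcu_read_lock();
	for (i = 0; i < conf->raid_disks * 2; i++) {
		struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);

		if (rdev == NULL || test_bit(Faulty, &rdev->flags))
			continue;	/* skip missing/faulty devices */
		/*
		 * From here on use this local 'rdev' only; re-reading
		 * conf->mirrors[i].rdev could return NULL if the device
		 * is hot-removed concurrently.
		 */
		if (atomic_read(&rdev->nr_pending) == 0)
			break;		/* idle device, use it */
	}
	rcu_read_unlock();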
Thanks,
NeilBrown
* Re: [patch 0/4 v2] optimize raid1 read balance for SSD
2012-06-28 1:06 ` [patch 0/4 v2] optimize raid1 read balance " NeilBrown
@ 2012-06-28 1:47 ` Roberto Spadim
2012-06-28 3:29 ` Shaohua Li
1 sibling, 0 replies; 10+ messages in thread
From: Roberto Spadim @ 2012-06-28 1:47 UTC (permalink / raw)
To: NeilBrown; +Cc: Shaohua Li, linux-raid, axboe
hi guys,
some time ago I made some tests on the raid1 md code, and a 'good' algorithm
could improve performance by about 1% on average, with some small variation
(sometimes 3%, sometimes 0%), using a new algorithm instead of disk distance.
In other words, a read of 1GB could be improved to 1.01GB (10MB of
improvement). That was the result (about 1 year ago), and for a busy database
server that's a good improvement (some guys say that even 0.01% of improvement
is an improvement and should be used... well, doesn't matter...)
IMHO, a general algorithm to select the 'best device to read' should work with
device information and device information functions (sata/scsi/etc). I don't
know if that's implemented, but if not, it should be implemented to work with
these ideas...
putting code to select the best distance is nice too, but it's not general
(think about a disk with 2 heads... a hardware raid1 for example...).
the solution should be: get the expected time for each disk to do the read
job, and select the disk with the smallest time, roughly as sketched below.
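(a toy sketch of that 'smallest expected time' selection, with invented names
and a simple queued-requests-times-average-service-time estimate; it is not
based on any real md or elevator code:)

struct dev_estimate {
	double avg_service_ms;	/* running average of recent completion times */
	int queued;		/* requests already queued on this device */
};

/* expected wait = requests already queued plus the new request itself */
static double expected_wait(const struct dev_estimate *d)
{
	return (d->queued + 1) * d->avg_service_ms;
}

static int pick_fastest(const struct dev_estimate *devs, int n)
{
	int best = 0;

	for (int i = 1; i < n; i++)
		if (expected_wait(&devs[i]) < expected_wait(&devs[best]))
			best = i;
	return best;
}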
for example, instead of putting a lot of statistics code in md, put the
statistics in sata, scsi, or another 'under' device (nbd, drbd, ...) and their
elevator functions (noop, cfq, etc...)
note that md should implement a prediction function too. for example we could
put two pairs of raid1 with one raid0 over them (a raid 1+0), and raid0 would
contact the raid1 devices to learn which is the best disk to read from...
well, in this case it would just call the 'under' devices' function to get an
estimated time and select the 'best' disk.
this is a bit 'hard' and a lot of work would be needed, but IMHO that's what
I'd consider the 'ideal general solution'.
the 'under devices' (sata/scsi/drbd/nbd) could today use statistics based on
read/write times without big problems (this keeps cpu usage low when we don't
know how the device is constructed), or maybe in the future this could be
implemented in the disk controller?! why not? see ssd drives, they implement
garbage collection! why not implement a 'time prediction function'?!
just some ideas.... =)
* Re: [patch 0/4 v2] optimize raid1 read balance for SSD
2012-06-28 1:06 ` [patch 0/4 v2] optimize raid1 read balance " NeilBrown
2012-06-28 1:47 ` Roberto Spadim
@ 2012-06-28 3:29 ` Shaohua Li
2012-06-28 3:40 ` NeilBrown
1 sibling, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2012-06-28 3:29 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, axboe
On Thu, Jun 28, 2012 at 11:06:16AM +1000, NeilBrown wrote:
> On Wed, 13 Jun 2012 17:09:22 +0800 Shaohua Li <shli@kernel.org> wrote:
>
> > raid1 read balance is an important algorithm to make read performance optimal.
> > It's a distance based algorithm, eg, for each request dispatch, choose disk
> > whose last finished request is close the request. This is great for hard disk.
> > But SSD has some special characteristics:
> >
> > 1. nonrotational. Distance means nothing for SSD, though merging small rquests
> > to big request is still optimal for SSD. If no merge, distributing rquests to
> > raid disks as more as possible is more optimal.
>
> This, of course, has nothing to do with the devices rotating, and everything
> to do with there being a seek-penalty. So why the block-device flag is
> called "rotational" I really don't know :-(
>
> Can we make the flags that md uses be "seek_penalty" or "no_seek" or
> something, even if the block layer has less meaningful names?
Purely borrowed from the block layer. I'm OK with changing it.
> >
> > 2. Getting too big request isn't always optimal. For hard disk, compared to
> > spindle move, data transfer overhead is trival, so we always prefer bigger
> > request. In SSD, request size exceeds specific value, performance isn't always
> > increased with request size increased. An example is readahead. If readahead
> > merges too big request and causes some disks idle, the performance is less
> > optimal than that when all disks are busy and running small requests.
>
> I note that the patch doesn't actually split requests, rather it avoids
> allocating adjacent requests to the same device - is that right?
> That seems sensible but doesn't seem to match your description so I wanted to
> check.
Yes, it doesn't do a split, sorry for the misleading description. I thought
about splitting but gave up later, and forgot to change the description. There
are two types of IO that drive big request sizes: buffered reads (readahead),
and buffered writes / direct reads and writes. I assume buffered writes and
direct reads/writes can drive a high IO depth, so splitting them is useless.
Splitting buffered reads makes sense, which really means 'prevent merging into
a big request'.
> >
> > The patches try to address the issues. The first two patches are clean up. The
> > third patch addresses the first item above. The forth addresses the second item
> > above. The idea can be applied to raid10 too, which is in my todo list.
>
> Thanks for these - there is a lot to like here.
>
> One concern that I have has that they assume that all devices are the same.
> i.e. they are all "rotational", or none of them are.
> they all have the same optimal IO size.
>
> md aims to work well on heterogeneous arrays so I'll like to make it more
> general if I can.
> Whether to "split" adjacent requests or not is decided in the context of a
> single device (the device that the first request is queued for) so that
> should be able to use the optimal IO size of that devices.
>
> General balancing is a little harder as the decision is made in the context
> of all active devices. In particular we need to know how to choose between a
> seek-penalty device and a no-seek-penalty device, if they both have requests
> queued to them and the seek-penalty device is a long way from the target.
>
> Maybe:
> - if the last request to some device is within optimal-io-size of this
> requests, then send this request to that device.
> - if either of two possible devices has no seek penalty, choose the one with
> the fewest outstanding requests.
> - if both of two possible devices have a seek-penalty, then choose the
> closest
>
> I think this will do the same as the current code for 'rotational' devices,
> and will be close to what your code does for 'non-rotational' devices.
This is close to what I did, except for the case of one hard disk and one SSD.
Talking about heterogeneous arrays, I assume people only build them with two
different hard disks or two different SSDs. Will people really mix a hard disk
and an SSD? A hard disk can only drive 600 IOPS while a low-end SATA SSD can
drive 20k ~ 40k IOPS. Adding 600 to 20k (roughly a 3% increase) doesn't
significantly change IOPS.
> There may well be room to improve this, but a key point is that it won't work
> to have two separate balancing routines - one for disks and one for ssds.
> I'm certainly happy for raid1_read_balance to be split up if it is too big,
> but I don't think that they way the first patch splits it is useful.
> Would you be able to try something like that instead?
Yep, it's not worth a separate function. My point is just to make them
independent. I can try this, but I'm not convinced we need to handle the mixed
hard disk/SSD case.
> BTW, in a couple of places you use an 'rdev' taken out of the 'conf'
> structure without testing for NULL first. That isn't safe.
> We get the 'rdev' at the top of the loop and test for NULL and other things.
> After that you need to always use *that* rdev value, don't try to get it out
> of the conf->mirrors structure again.
Thanks for pointing that out, I'll fix it later.
Thanks,
Shaohua
* Re: [patch 0/4 v2] optimize raid1 read balance for SSD
2012-06-28 3:29 ` Shaohua Li
@ 2012-06-28 3:40 ` NeilBrown
2012-06-28 6:31 ` David Brown
0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2012-06-28 3:40 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, axboe
On Thu, 28 Jun 2012 11:29:01 +0800 Shaohua Li <shli@kernel.org> wrote:
> On Thu, Jun 28, 2012 at 11:06:16AM +1000, NeilBrown wrote:
> > On Wed, 13 Jun 2012 17:09:22 +0800 Shaohua Li <shli@kernel.org> wrote:
> >
> > > raid1 read balance is an important algorithm to make read performance optimal.
> > > It's a distance based algorithm, eg, for each request dispatch, choose disk
> > > whose last finished request is close the request. This is great for hard disk.
> > > But SSD has some special characteristics:
> > >
> > > 1. nonrotational. Distance means nothing for SSD, though merging small rquests
> > > to big request is still optimal for SSD. If no merge, distributing rquests to
> > > raid disks as more as possible is more optimal.
> >
> > This, of course, has nothing to do with the devices rotating, and everything
> > to do with there being a seek-penalty. So why the block-device flag is
> > called "rotational" I really don't know :-(
> >
> > Can we make the flags that md uses be "seek_penalty" or "no_seek" or
> > something, even if the block layer has less meaningful names?
>
> purely borrowed from block layer. I'm ok to change it.
>
> > >
> > > 2. Getting too big request isn't always optimal. For hard disk, compared to
> > > spindle move, data transfer overhead is trival, so we always prefer bigger
> > > request. In SSD, request size exceeds specific value, performance isn't always
> > > increased with request size increased. An example is readahead. If readahead
> > > merges too big request and causes some disks idle, the performance is less
> > > optimal than that when all disks are busy and running small requests.
> >
> > I note that the patch doesn't actually split requests, rather it avoids
> > allocating adjacent requests to the same device - is that right?
> > That seems sensible but doesn't seem to match your description so I wanted to
> > check.
>
> Yes, it doesn't do split, sorry for the misleading. I used to think about
> split, but gave up later, but forgot to change the description. There are two
> types of IO driving big request size: buffered read (readahead) buffered write
> and direct read or write I assume buffered write and direct read/write can
> drive high iodepth, so doing split is useless. buffered read split makes sense,
> which really should be 'prevent merging to big request'
>
> > >
> > > The patches try to address the issues. The first two patches are clean up. The
> > > third patch addresses the first item above. The forth addresses the second item
> > > above. The idea can be applied to raid10 too, which is in my todo list.
> >
> > Thanks for these - there is a lot to like here.
> >
> > One concern that I have has that they assume that all devices are the same.
> > i.e. they are all "rotational", or none of them are.
> > they all have the same optimal IO size.
> >
> > md aims to work well on heterogeneous arrays so I'll like to make it more
> > general if I can.
> > Whether to "split" adjacent requests or not is decided in the context of a
> > single device (the device that the first request is queued for) so that
> > should be able to use the optimal IO size of that devices.
> >
> > General balancing is a little harder as the decision is made in the context
> > of all active devices. In particular we need to know how to choose between a
> > seek-penalty device and a no-seek-penalty device, if they both have requests
> > queued to them and the seek-penalty device is a long way from the target.
> >
> > Maybe:
> > - if the last request to some device is within optimal-io-size of this
> > requests, then send this request to that device.
> > - if either of two possible devices has no seek penalty, choose the one with
> > the fewest outstanding requests.
> > - if both of two possible devices have a seek-penalty, then choose the
> > closest
> >
> > I think this will do the same as the current code for 'rotational' devices,
> > and will be close to what your code does for 'non-rotational' devices.
>
> This is close to what I did except for the case of one hard disk and one SSD.
> Talking about heterogeneous arrary, I assume people only do it with two
> different hard disks or two different ssd. Will people really mix hard disk and
> SSD? A hard disk can only drive 600 IOPS while a lowend SATA SSD can drive 20k
> ~ 40k IOPS. Plusing 600 to 20k doesn't significantly change IOPS.
I'm fairly sure people will try it, even if it doesn't make sense. :-)
It possibly would make sense if the slower device was marked "write-mostly",
but that disables read-balance, so what we do with read balance might be
irrelevant.
I don't want to assume devices have different speeds unless explicitly told
so, and the "rotational" flag doesn't clearly imply anything about the speed
of the device, only about seek latency. So e.g. I wouldn't want to always
assume that a non-rotational device were faster than a rotational device,
even though that may be the case in lots of situations.
I'd be less concerned about heterogeneous arrays with RAID10 as it makes less
sense there. But I certainly want to at least acknowledge the possibility
for RAID1.
thanks,
NeilBrown
* Re: [patch 0/4 v2] optimize raid1 read balance for SSD
2012-06-28 3:40 ` NeilBrown
@ 2012-06-28 6:31 ` David Brown
0 siblings, 0 replies; 10+ messages in thread
From: David Brown @ 2012-06-28 6:31 UTC (permalink / raw)
To: NeilBrown; +Cc: Shaohua Li, linux-raid, axboe
On 28/06/12 05:40, NeilBrown wrote:
> On Thu, 28 Jun 2012 11:29:01 +0800 Shaohua Li<shli@kernel.org> wrote:
>
>> On Thu, Jun 28, 2012 at 11:06:16AM +1000, NeilBrown wrote:
>>> On Wed, 13 Jun 2012 17:09:22 +0800 Shaohua Li<shli@kernel.org> wrote:
<snip>
>>> General balancing is a little harder as the decision is made in the context
>>> of all active devices. In particular we need to know how to choose between a
>>> seek-penalty device and a no-seek-penalty device, if they both have requests
>>> queued to them and the seek-penalty device is a long way from the target.
>>>
>>> Maybe:
>>> - if the last request to some device is within optimal-io-size of this
>>> requests, then send this request to that device.
>>> - if either of two possible devices has no seek penalty, choose the one with
>>> the fewest outstanding requests.
>>> - if both of two possible devices have a seek-penalty, then choose the
>>> closest
>>>
>>> I think this will do the same as the current code for 'rotational' devices,
>>> and will be close to what your code does for 'non-rotational' devices.
>>
>> This is close to what I did except for the case of one hard disk and one SSD.
>> Talking about heterogeneous arrary, I assume people only do it with two
>> different hard disks or two different ssd. Will people really mix hard disk and
>> SSD? A hard disk can only drive 600 IOPS while a lowend SATA SSD can drive 20k
>> ~ 40k IOPS. Plusing 600 to 20k doesn't significantly change IOPS.
>
> I'm fairly sure people will try it, even if it doesn't make sense. :-)
>
> It possibly would make sense if the slower device was marked "write-mostly",
> but that disables read-balance, so what we do with read balance might be
> irrelevant.
I would imagine that an SSD paired with an HD (or a partition on an HD)
in RAID1 would be very common for small setups - at least until the
current experimental flash cache systems solidify and mature. You get
the speed of the SSD, and the safety of RAID1 without the cost of two
SSDs. It is normal today to use write-mostly on the HD - but I'd guess
a fair number of users don't think to do that. And with good balancing
from these patches, there is no need - and you can get the best from
both devices. SSDs are much faster than HDs when there is seeking
involved - but for streamed accesses the difference is smaller, and it's
good to have both devices available for reads.
>
> I don't want to assume devices have different speeds unless explicitly told
> so, and the "rotational" flag doesn't clearly imply anything about the speed
> of the device, only about seek latency. So e.g. I wouldn't want to always
> assume that a non-rotational device were faster than a rotational device,
> even though that may be the case in lots of situations.
>
> I'd be less concerned about heterogeneous arrays with RAID10 as it makes less
> sense there. But I certainly want to at least acknowledge the possibility
> for RAID1.
>
> thanks,
> NeilBrown
>
Thread overview: 10+ messages:
2012-06-13 9:09 [patch 0/4 v2] optimize raid1 read balance for SSD Shaohua Li
2012-06-13 9:09 ` [patch 1/4 v2] raid1: move distance based read balance to a separate function Shaohua Li
2012-06-13 9:09 ` [patch 2/4 v2] raid1: make sequential read detection per disk based Shaohua Li
2012-06-13 9:09 ` [patch 3/4 v2] raid1: read balance chooses idlest disk Shaohua Li
2012-06-13 9:09 ` [patch 4/4 v2] raid1: split large request for SSD Shaohua Li
2012-06-28 1:06 ` [patch 0/4 v2] optimize raid1 read balance " NeilBrown
2012-06-28 1:47 ` Roberto Spadim
2012-06-28 3:29 ` Shaohua Li
2012-06-28 3:40 ` NeilBrown
2012-06-28 6:31 ` David Brown