Linux RAID subsystem development
 help / color / mirror / Atom feed
* [patch 2/3 v4]raid1: read balance chooses idlest disk for SSD
@ 2012-07-05  9:20 Shaohua Li
  2012-07-05 13:04 ` Roberto Spadim
  0 siblings, 1 reply; 4+ messages in thread
From: Shaohua Li @ 2012-07-05  9:20 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb

SSD hasn't spindle, distance between requests means nothing. And the original
distance based algorithm sometimes can cause severe performance issue for SSD
raid.

Considering two thread groups, one accesses file A, the other access file B.
The first group will access one disk and the second will access the other disk,
because requests are near from one group and far between groups. In this case,
read balance might keep one disk very busy but the other relative idle.  For
SSD, we should try best to distribute requests to as more disks as possible.
There isn't spindle move penality anyway.

With below patch, I can see more than 50% throughput improvement sometimes
depending on workloads.

The only exception is small requests can be merged to a big request which
typically can drive higher throughput for SSD too. Such small requests are
sequential reads. Unlike hard disk, sequential read which can't be merged (for
example direct IO, or read without readahead) can be ignored for SSD. Again
there is no spindle move penality. readahead dispatches small requests and such
requests can be merged.

Last patch can help detect sequential read well, at least if concurrent read
number isn't greater than raid disk number. In that case, distance based
algorithm doesn't work well too.

V2: For hard disk and SSD mixed raid, doesn't use distance based algorithm for
random IO too. This makes the algorithm generic for raid with SSD.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid1.c |   34 +++++++++++++++++++++++++++++++---
 1 file changed, 31 insertions(+), 3 deletions(-)

Index: linux/drivers/md/raid1.c
===================================================================
--- linux.orig/drivers/md/raid1.c	2012-07-04 15:25:11.817869519 +0800
+++ linux/drivers/md/raid1.c	2012-07-04 15:42:30.280816275 +0800
@@ -483,9 +483,11 @@ static int read_balance(struct r1conf *c
 	const sector_t this_sector = r1_bio->sector;
 	int sectors;
 	int best_good_sectors;
-	int best_disk;
+	int best_disk, best_dist_disk, best_pending_disk;
+	int has_nonrot_disk;
 	int i;
 	sector_t best_dist;
+	unsigned int min_pending;
 	struct md_rdev *rdev;
 	int choose_first;
 
@@ -498,8 +500,12 @@ static int read_balance(struct r1conf *c
  retry:
 	sectors = r1_bio->sectors;
 	best_disk = -1;
+	best_dist_disk = -1;
 	best_dist = MaxSector;
+	best_pending_disk = -1;
+	min_pending = UINT_MAX;
 	best_good_sectors = 0;
+	has_nonrot_disk = 0;
 
 	if (conf->mddev->recovery_cp < MaxSector &&
 	    (this_sector + sectors >= conf->next_resync))
@@ -511,6 +517,7 @@ static int read_balance(struct r1conf *c
 		sector_t dist;
 		sector_t first_bad;
 		int bad_sectors;
+		unsigned int pending;
 
 		int disk = i;
 		if (disk >= conf->raid_disks * 2)
@@ -573,22 +580,43 @@ static int read_balance(struct r1conf *c
 		} else
 			best_good_sectors = sectors;
 
+		has_nonrot_disk |= blk_queue_nonrot(bdev_get_queue(rdev->bdev));
+		pending = atomic_read(&rdev->nr_pending);
 		dist = abs(this_sector - conf->mirrors[disk].head_position);
 		if (choose_first
 		    /* Don't change to another disk for sequential reads */
 		    || conf->mirrors[disk].next_seq_sect == this_sector
 		    || dist == 0
 		    /* If device is idle, use it */
-		    || atomic_read(&rdev->nr_pending) == 0) {
+		    || pending == 0) {
 			best_disk = disk;
 			break;
 		}
+
+		if (min_pending > pending) {
+			min_pending = pending;
+			best_pending_disk = disk;
+		}
+
 		if (dist < best_dist) {
 			best_dist = dist;
-			best_disk = disk;
+			best_dist_disk = disk;
 		}
 	}
 
+	/*
+	 * If all disks are rotational, choose the closest disk. If any disk is
+	 * non-rotational, choose the disk with less pending request even the
+	 * disk is rotational, which might/might not be optimal for raids with
+	 * mixed ratation/non-rotational disks depending on workload.
+	 */
+	if (best_disk == -1) {
+		if (has_nonrot_disk)
+			best_disk = best_pending_disk;
+		else
+			best_disk = best_dist_disk;
+	}
+
 	if (best_disk >= 0) {
 		rdev = rcu_dereference(conf->mirrors[best_disk].rdev);
 		if (!rdev)

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [patch 2/3 v4]raid1: read balance chooses idlest disk for SSD
  2012-07-05  9:20 [patch 2/3 v4]raid1: read balance chooses idlest disk for SSD Shaohua Li
@ 2012-07-05 13:04 ` Roberto Spadim
  2012-07-06  0:27   ` Shaohua Li
  0 siblings, 1 reply; 4+ messages in thread
From: Roberto Spadim @ 2012-07-05 13:04 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, neilb

nice, just another question...
 since this use mixed raid disks (different types) could we improve
the algorithm to diferent hard disk speed, for example a raid1 with
7200 and 15000 rpm?
the distance continue the same but it will include a 'speed' factor
speed=1/rpm
(distance*speed)
or something to select fastest disk in the array
i don´t want to use write-mostly since it can reduce my total number
of read disks, but with this we could use the fastest disk with more
frequency without lose array 'speed'





2012/7/5 Shaohua Li <shli@kernel.org>
>
> SSD hasn't spindle, distance between requests means nothing. And the
> original
> distance based algorithm sometimes can cause severe performance issue for
> SSD
> raid.
>
> Considering two thread groups, one accesses file A, the other access file
> B.
> The first group will access one disk and the second will access the other
> disk,
> because requests are near from one group and far between groups. In this
> case,
> read balance might keep one disk very busy but the other relative idle.
> For
> SSD, we should try best to distribute requests to as more disks as
> possible.
> There isn't spindle move penality anyway.
>
> With below patch, I can see more than 50% throughput improvement sometimes
> depending on workloads.
>
> The only exception is small requests can be merged to a big request which
> typically can drive higher throughput for SSD too. Such small requests are
> sequential reads. Unlike hard disk, sequential read which can't be merged
> (for
> example direct IO, or read without readahead) can be ignored for SSD.
> Again
> there is no spindle move penality. readahead dispatches small requests and
> such
> requests can be merged.
>
> Last patch can help detect sequential read well, at least if concurrent
> read
> number isn't greater than raid disk number. In that case, distance based
> algorithm doesn't work well too.
>
> V2: For hard disk and SSD mixed raid, doesn't use distance based algorithm
> for
> random IO too. This makes the algorithm generic for raid with SSD.
>
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid1.c |   34 +++++++++++++++++++++++++++++++---
>  1 file changed, 31 insertions(+), 3 deletions(-)
>
> Index: linux/drivers/md/raid1.c
> ===================================================================
> --- linux.orig/drivers/md/raid1.c       2012-07-04 15:25:11.817869519
> +0800
> +++ linux/drivers/md/raid1.c    2012-07-04 15:42:30.280816275 +0800
> @@ -483,9 +483,11 @@ static int read_balance(struct r1conf *c
>         const sector_t this_sector = r1_bio->sector;
>         int sectors;
>         int best_good_sectors;
> -       int best_disk;
> +       int best_disk, best_dist_disk, best_pending_disk;
> +       int has_nonrot_disk;
>         int i;
>         sector_t best_dist;
> +       unsigned int min_pending;
>         struct md_rdev *rdev;
>         int choose_first;
>
> @@ -498,8 +500,12 @@ static int read_balance(struct r1conf *c
>   retry:
>         sectors = r1_bio->sectors;
>         best_disk = -1;
> +       best_dist_disk = -1;
>         best_dist = MaxSector;
> +       best_pending_disk = -1;
> +       min_pending = UINT_MAX;
>         best_good_sectors = 0;
> +       has_nonrot_disk = 0;
>
>         if (conf->mddev->recovery_cp < MaxSector &&
>             (this_sector + sectors >= conf->next_resync))
> @@ -511,6 +517,7 @@ static int read_balance(struct r1conf *c
>                 sector_t dist;
>                 sector_t first_bad;
>                 int bad_sectors;
> +               unsigned int pending;
>
>                 int disk = i;
>                 if (disk >= conf->raid_disks * 2)
> @@ -573,22 +580,43 @@ static int read_balance(struct r1conf *c
>                 } else
>                         best_good_sectors = sectors;
>
> +               has_nonrot_disk |=
> blk_queue_nonrot(bdev_get_queue(rdev->bdev));
> +               pending = atomic_read(&rdev->nr_pending);
>                 dist = abs(this_sector -
> conf->mirrors[disk].head_position);
>                 if (choose_first
>                     /* Don't change to another disk for sequential reads
> */
>                     || conf->mirrors[disk].next_seq_sect == this_sector
>                     || dist == 0
>                     /* If device is idle, use it */
> -                   || atomic_read(&rdev->nr_pending) == 0) {
> +                   || pending == 0) {
>                         best_disk = disk;
>                         break;
>                 }
> +
> +               if (min_pending > pending) {
> +                       min_pending = pending;
> +                       best_pending_disk = disk;
> +               }
> +
>                 if (dist < best_dist) {
>                         best_dist = dist;
> -                       best_disk = disk;
> +                       best_dist_disk = disk;
>                 }
>         }
>
> +       /*
> +        * If all disks are rotational, choose the closest disk. If any
> disk is
> +        * non-rotational, choose the disk with less pending request even
> the
> +        * disk is rotational, which might/might not be optimal for raids
> with
> +        * mixed ratation/non-rotational disks depending on workload.
> +        */
> +       if (best_disk == -1) {
> +               if (has_nonrot_disk)
> +                       best_disk = best_pending_disk;
> +               else
> +                       best_disk = best_dist_disk;
> +       }
> +
>         if (best_disk >= 0) {
>                 rdev = rcu_dereference(conf->mirrors[best_disk].rdev);
>                 if (!rdev)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [patch 2/3 v4]raid1: read balance chooses idlest disk for SSD
  2012-07-05 13:04 ` Roberto Spadim
@ 2012-07-06  0:27   ` Shaohua Li
  2012-07-06  2:45     ` Roberto Spadim
  0 siblings, 1 reply; 4+ messages in thread
From: Shaohua Li @ 2012-07-06  0:27 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: linux-raid, neilb

2012/7/5 Roberto Spadim <roberto@spadim.com.br>:
> nice, just another question...
>  since this use mixed raid disks (different types) could we improve
> the algorithm to diferent hard disk speed, for example a raid1 with
> 7200 and 15000 rpm?
> the distance continue the same but it will include a 'speed' factor
> speed=1/rpm
> (distance*speed)
> or something to select fastest disk in the array
> i don´t want to use write-mostly since it can reduce my total number
> of read disks, but with this we could use the fastest disk with more
> frequency without lose array 'speed'

Legitimate optimization, however detecting hard disk speed is hard.
Or maybe you want to add an option to tell kernel the disk speed?

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [patch 2/3 v4]raid1: read balance chooses idlest disk for SSD
  2012-07-06  0:27   ` Shaohua Li
@ 2012-07-06  2:45     ` Roberto Spadim
  0 siblings, 0 replies; 4+ messages in thread
From: Roberto Spadim @ 2012-07-06  2:45 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, neilb

well, i´m thinking something like elevators, they have some files to
configure options at /sys/block/sda/xxxx, it´s a nice interface...
echo and cat to configure the elevator or sysctl...

check this example of md in a old linux kernel (i don´t remember
version maybe 2.6.18)



#cd /sys/block/md0/md
# ls -l
-rw-r--r-- 1 root root 4096 Jul  5 23:39 array_state
--w------- 1 root root 4096 Jul  5 23:39 bitmap_set_bits
-rw-r--r-- 1 root root 4096 Jul  5 23:39 chunk_size
-rw-r--r-- 1 root root 4096 Jul  5 23:39 component_size
-r--r--r-- 1 root root 4096 Jul  5 23:39 degraded
drwxr-xr-x 2 root root    0 Jul  5 23:39 dev-sda1 <<<<=====
drwxr-xr-x 2 root root    0 Jul  5 23:39 dev-sdb1 <<<<=====
-rw-r--r-- 1 root root 4096 Jul  5 23:39 layout
-rw-r--r-- 1 root root 4096 Jul  5 23:39 level
-rw-r--r-- 1 root root 4096 Jul  5 23:39 metadata_version
-r--r--r-- 1 root root 4096 Jul  5 23:39 mismatch_cnt
--w------- 1 root root 4096 Jul  5 23:39 new_dev
-rw-r--r-- 1 root root 4096 Jul  5 23:39 raid_disks
lrwxrwxrwx 1 root root    0 Jul  5 23:39 rd2 -> dev-sda1
lrwxrwxrwx 1 root root    0 Jul  5 23:39 rd3 -> dev-sdb1
-rw-r--r-- 1 root root 4096 Jul  5 23:39 reshape_position
-rw-r--r-- 1 root root 4096 Jul  5 23:39 resync_start
-rw-r--r-- 1 root root 4096 Jul  5 23:39 safe_mode_delay
-rw-r--r-- 1 root root 4096 Jul  5 23:39 suspend_hi
-rw-r--r-- 1 root root 4096 Jul  5 23:39 suspend_lo
-rw-r--r-- 1 root root 4096 Jul  5 23:39 sync_action
-r--r--r-- 1 root root 4096 Jul  5 23:39 sync_completed
-rw-r--r-- 1 root root 4096 Jul  5 23:39 sync_force_parallel
-rw-r--r-- 1 root root 4096 Jul  5 23:39 sync_max
-r--r--r-- 1 root root 4096 Jul  5 23:39 sync_speed
-rw-r--r-- 1 root root 4096 Jul  5 23:39 sync_speed_max
-rw-r--r-- 1 root root 4096 Jul  5 23:39 sync_speed_min


# ls dev-sda1/ -l
total 0
lrwxrwxrwx 1 root root    0 Jul  5 23:40 block -> ../../../sda/sda1
-rw-r--r-- 1 root root 4096 Jul  5 23:40 errors
-rw-r--r-- 1 root root 4096 Jul  5 23:40 offset
-rw-r--r-- 1 root root 4096 Jul  5 23:40 size
-rw-r--r-- 1 root root 4096 Jul  5 23:40 slot
-rw-r--r-- 1 root root 4096 Jul  5 23:40 state
--->>> maybe here we could put a file called: "raid1_read_balance_disk_speed"


maybe we could save this information at md metadata? i don´t know how
it works, just a idea, instead saving it at sysctl, use md metadata
informations...

-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2012-07-06  2:45 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-07-05  9:20 [patch 2/3 v4]raid1: read balance chooses idlest disk for SSD Shaohua Li
2012-07-05 13:04 ` Roberto Spadim
2012-07-06  0:27   ` Shaohua Li
2012-07-06  2:45     ` Roberto Spadim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox