linux-raid.vger.kernel.org archive mirror
* [PATCH 0/2] Fix near layout I/O distribution
@ 2016-11-03 23:45 Robert LeBlanc
  2016-11-03 23:45 ` [PATCH 1/2] mdadm: raid10.h Fix typo Robert LeBlanc
  2016-11-03 23:45 ` [PATCH 2/2] mdadm: raid10.c Remove near atomic break Robert LeBlanc
  0 siblings, 2 replies; 8+ messages in thread
From: Robert LeBlanc @ 2016-11-03 23:45 UTC (permalink / raw)
  To: linux-raid


This series fixes a small typo from when the code was copied from raid1 and
removes a loop break that is taken, on 'near' layouts only, as soon as a device
with no pending I/O is found. 'Far' layouts never trigger this break and so
they spread read I/O across all available devices. The break is removed because
it contradicts the comment preceding it, and the 'far' layouts already work as
intended without it. After this change, random I/O for the 'near' layout is
distributed to all disks instead of all being serviced by the first disk.
Sequential I/O is not affected as it would not trigger the break.
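
For reference, the slot-selection loop in read_balance() behaves roughly like
the stand-alone model below. This is a simplified user-space sketch for
illustration only (the struct and the sample values are made up, and offset
layouts and bad-block handling are ignored); it is not the kernel code, but it
shows why the removed check sends every low-queue-depth read to the first copy
on 'near' layouts, while other layouts fall through to the distance comparison.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

struct copy {
	long addr;          /* sector of this copy of the data */
	long head_position; /* rough head position of the member disk */
	int  nr_pending;    /* in-flight I/O on the member disk */
};

static int pick_slot(const struct copy *c, int copies,
		     int near_copies, int far_copies, int keep_break)
{
	long best_dist = LONG_MAX;
	int best_slot = -1;

	for (int slot = 0; slot < copies; slot++) {
		long new_distance;

		/* The check removed by patch 2/2: on 'near' layouts, stop at
		 * the first copy whose device has no pending I/O.  With a low
		 * queue depth this is always the first slot. */
		if (keep_break && near_copies > 1 && c[slot].nr_pending == 0)
			return slot;

		if (far_copies > 1)	/* 'far': always use the lowest address */
			new_distance = c[slot].addr;
		else			/* otherwise: shortest head movement */
			new_distance = labs(c[slot].addr - c[slot].head_position);

		if (new_distance < best_dist) {
			best_dist = new_distance;
			best_slot = slot;
		}
	}
	return best_slot;
}

int main(void)
{
	/* Two idle copies of the same sector; the second disk's head is
	 * already close to it. */
	struct copy c[2] = {
		{ .addr = 1000, .head_position = 900000, .nr_pending = 0 },
		{ .addr = 1000, .head_position = 1008,   .nr_pending = 0 },
	};

	printf("n2 with the break:    slot %d\n", pick_slot(c, 2, 2, 1, 1));
	printf("n2 without the break: slot %d\n", pick_slot(c, 2, 2, 1, 0));
	return 0;
}

Running the model picks slot 0 with the break and slot 1 (the closer disk)
without it, which mirrors the one-disk hotspot in the fio output below.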

Before patch:
# mdadm --create /dev/md14 --level 10 --raid-devices 4 -p n4 /dev/loop{11..14}
# fio -rw=randread --size=5G --name=mdadm_test
<snip>
Disk stats (read/write):
   md14: ios=1304718/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=327680/0, aggrmerge=0/0, aggrticks=18719/0, aggrin_queue=18689,
aggrutil=88.37%
 loop13: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
 loop11: ios=1310108/0, merge=0/0, ticks=74856/0, in_queue=74736, util=88.37%
 loop14: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
 loop12: ios=612/0, merge=0/0, ticks=20/0, in_queue=20, util=0.02%

After patch:

<snip>
Disk stats (read/write):
    md14: ios=1309172/1, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=327680/3, aggrmerge=0/0, aggrticks=19866/29, aggrin_queue=19871,
aggrutil=37.54%
  loop13: ios=331904/3, merge=0/0, ticks=21184/24, in_queue=21180, util=24.64%
  loop11: ios=330075/3, merge=0/0, ticks=32252/24, in_queue=32248, util=37.54%
  loop14: ios=327612/3, merge=0/0, ticks=3204/24, in_queue=3200, util=3.73%
  loop12: ios=321129/3, merge=0/0, ticks=22824/44, in_queue=22856, util=26.59%


* [PATCH 1/2] mdadm: raid10.h Fix typo
  2016-11-03 23:45 [PATCH 0/2] Fix near layout I/O distribution Robert LeBlanc
@ 2016-11-03 23:45 ` Robert LeBlanc
  2016-11-03 23:45 ` [PATCH 2/2] mdadm: raid10.c Remove near atomic break Robert LeBlanc
  1 sibling, 0 replies; 8+ messages in thread
From: Robert LeBlanc @ 2016-11-03 23:45 UTC (permalink / raw)
  To: linux-raid; +Cc: Robert LeBlanc

Fix typo from paste of code from RAID1.

Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
---
 drivers/md/raid10.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 18ec1f7..1699a6a 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -53,7 +53,7 @@ struct r10conf {
 	sector_t		offset_diff;
 
 	struct list_head	retry_list;
-	/* A separate list of r1bio which just need raid_end_bio_io called.
+	/* A separate list of r10bio which just need raid_end_bio_io called.
 	 * This mustn't happen for writes which had any errors if the superblock
 	 * needs to be written.
 	 */
-- 
2.10.1



* [PATCH 2/2] mdadm: raid10.c Remove near atomic break
  2016-11-03 23:45 [PATCH 0/2] Fix near layout I/O distribution Robert LeBlanc
  2016-11-03 23:45 ` [PATCH 1/2] mdadm: raid10.h Fix typo Robert LeBlanc
@ 2016-11-03 23:45 ` Robert LeBlanc
  2016-11-04  4:01   ` NeilBrown
  1 sibling, 1 reply; 8+ messages in thread
From: Robert LeBlanc @ 2016-11-03 23:45 UTC (permalink / raw)
  To: linux-raid; +Cc: Robert LeBlanc

This break is always triggered for small reads, preventing the reads from
being spread across all available drives. The comment is also confusing: it
reads as though the check should apply only to 'far' layouts, but it really
only applies to 'near' layouts. Since there aren't problems with the 'far'
layouts, there shouldn't be a problem for the 'near' layouts either. This
change distributes reads fairly across all drives, where before they all
came from the first drive.

Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
---
 drivers/md/raid10.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index be1a9fc..8d83802 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -777,13 +777,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 		if (!do_balance)
 			break;
 
-		/* This optimisation is debatable, and completely destroys
-		 * sequential read speed for 'far copies' arrays.  So only
-		 * keep it for 'near' arrays, and review those later.
-		 */
-		if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
-			break;
-
 		/* for far > 1 always use the lowest address */
 		if (geo->far_copies > 1)
 			new_distance = r10_bio->devs[slot].addr;
-- 
2.10.1



* Re: [PATCH 2/2] mdadm: raid10.c Remove near atomic break
  2016-11-03 23:45 ` [PATCH 2/2] mdadm: raid10.c Remove near atomic break Robert LeBlanc
@ 2016-11-04  4:01   ` NeilBrown
  2016-11-04  5:37     ` Robert LeBlanc
  0 siblings, 1 reply; 8+ messages in thread
From: NeilBrown @ 2016-11-04  4:01 UTC (permalink / raw)
  To: Robert LeBlanc, linux-raid


On Fri, Nov 04 2016, Robert LeBlanc wrote:

> This break is always triggered for small reads, preventing the reads from
> being spread across all available drives. The comment is also confusing: it
> reads as though the check should apply only to 'far' layouts, but it really
> only applies to 'near' layouts. Since there aren't problems with the 'far'
> layouts, there shouldn't be a problem for the 'near' layouts either. This
> change distributes reads fairly across all drives, where before they all
> came from the first drive.

Why is "fairness" an issue?
The current code will use a device if it finds that it is completely
idle. i.e. if nr_pending is 0.
Why is that ever the wrong thing to do?

Does your testing show that overall performance is improved?  If so,
that would certainly be useful.
But it isn't clear (to me) that simply spreading the load more "fairly"
is a worthy goal.

Thanks,
NeilBrown


>
> Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
> ---
>  drivers/md/raid10.c | 7 -------
>  1 file changed, 7 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index be1a9fc..8d83802 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -777,13 +777,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
>  		if (!do_balance)
>  			break;
>  
> -		/* This optimisation is debatable, and completely destroys
> -		 * sequential read speed for 'far copies' arrays.  So only
> -		 * keep it for 'near' arrays, and review those later.
> -		 */
> -		if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
> -			break;
> -
>  		/* for far > 1 always use the lowest address */
>  		if (geo->far_copies > 1)
>  			new_distance = r10_bio->devs[slot].addr;
> -- 
> 2.10.1
>



* Re: [PATCH 2/2] mdadm: raid10.c Remove near atomic break
  2016-11-04  4:01   ` NeilBrown
@ 2016-11-04  5:37     ` Robert LeBlanc
  2016-11-04 23:08       ` Shaohua Li
  2016-11-06 23:03       ` NeilBrown
  0 siblings, 2 replies; 8+ messages in thread
From: Robert LeBlanc @ 2016-11-04  5:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Thu, Nov 3, 2016 at 10:01 PM, NeilBrown <neilb@suse.com> wrote:
> On Fri, Nov 04 2016, Robert LeBlanc wrote:
>
>> This break is always triggered for small reads, preventing the reads from
>> being spread across all available drives. The comment is also confusing: it
>> reads as though the check should apply only to 'far' layouts, but it really
>> only applies to 'near' layouts. Since there aren't problems with the 'far'
>> layouts, there shouldn't be a problem for the 'near' layouts either. This
>> change distributes reads fairly across all drives, where before they all
>> came from the first drive.
>
> Why is "fairness" an issue?
> The current code will use a device if it finds that it is completely
> idle. i.e. if nr_pending is 0.
> Why is that ever the wrong thing to do?

The code also looks for the drive that is closest to the requested
sector, which doesn't get a chance to happen without this patch. The
way this part of the code is written, as soon as it finds a 'good'
disk it breaks out of the loop instead of searching for a better one,
so it doesn't even look at the other disks. In a healthy array with
--raid-devices X and -p nX, this means that the first disk gets all
the reads for small I/O. Where Y in -p nY is less than X, the problem
may be masked because the data is naturally striped, but the code can
still pick a disk that is farther away from the requested sector,
causing extra head seeks.
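
To make the '--raid-devices X with -p nX' case concrete, here is a rough
model of where the 'near' layout places the copies of each chunk (simplified
from __raid10_find_phys(); it assumes far_copies = 1 and ignores offset
layouts, so treat it as an approximation rather than the exact kernel
mapping):

#include <stdio.h>

static void show_layout(int raid_disks, int near_copies, int chunks)
{
	printf("raid_disks=%d near_copies=%d\n", raid_disks, near_copies);
	for (int chunk = 0; chunk < chunks; chunk++) {
		int dev = (chunk * near_copies) % raid_disks; /* slot 0 */

		printf("  chunk %d on disks:", chunk);
		for (int n = 0; n < near_copies; n++)
			printf(" %d", (dev + n) % raid_disks);
		printf("   (slot 0 = disk %d)\n", dev);
	}
}

int main(void)
{
	show_layout(4, 4, 4);	/* -p n4: slot 0 is always disk 0 */
	show_layout(4, 2, 4);	/* -p n2: slot 0 alternates between 0 and 2 */
	return 0;
}

With n4 on four disks slot 0 is always disk 0, so breaking out at the first
idle device funnels every small read to that one disk; with n2 the striping
spreads slot 0 over disks 0 and 2, which hides part of the problem but still
ignores head position.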

> Does your testing show that overall performance is improved?  If so,
> that would certainly be useful.
> But it isn't clear (to me) that simply spreading the load more "fairly"
> is a worthy goal.

I'll see if I have some mechanical drives somewhere to test (I've been
testing four loopback devices on a single NVME drive so you don't see
an improvement). You can see from the fio I posted [1] that before the
patch, one drive had all the I/O and after the patch the I/O was
distributed between all the drives (it doesn't have to be exactly
even, just not as skewed as it was before is good enough). I would
expect similar results to the 'far' tests done here [0]. Based on the
previous tests I did, when I saw this code, it just made complete
sense to me why we had great performance with 'far' and subpar
performance with 'near'. I'll come back with some results tomorrow.

[0] https://raid.wiki.kernel.org/index.php/Performance
[1] http://marc.info/?l=linux-raid&m=147821671817947&w=2

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
> Thanks,
> NeilBrown
>
>
>>
>> Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
>> ---
>>  drivers/md/raid10.c | 7 -------
>>  1 file changed, 7 deletions(-)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index be1a9fc..8d83802 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -777,13 +777,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
>>               if (!do_balance)
>>                       break;
>>
>> -             /* This optimisation is debatable, and completely destroys
>> -              * sequential read speed for 'far copies' arrays.  So only
>> -              * keep it for 'near' arrays, and review those later.
>> -              */
>> -             if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
>> -                     break;
>> -
>>               /* for far > 1 always use the lowest address */
>>               if (geo->far_copies > 1)
>>                       new_distance = r10_bio->devs[slot].addr;
>> --
>> 2.10.1
>>


* Re: [PATCH 2/2] mdadm: raid10.c Remove near atomic break
  2016-11-04  5:37     ` Robert LeBlanc
@ 2016-11-04 23:08       ` Shaohua Li
  2016-11-06 23:03       ` NeilBrown
  1 sibling, 0 replies; 8+ messages in thread
From: Shaohua Li @ 2016-11-04 23:08 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: NeilBrown, linux-raid

On Thu, Nov 03, 2016 at 11:37:48PM -0600, Robert LeBlanc wrote:
> On Thu, Nov 3, 2016 at 10:01 PM, NeilBrown <neilb@suse.com> wrote:
> > On Fri, Nov 04 2016, Robert LeBlanc wrote:
> >
> >> This break is always triggered for small reads, preventing the reads from
> >> being spread across all available drives. The comment is also confusing: it
> >> reads as though the check should apply only to 'far' layouts, but it really
> >> only applies to 'near' layouts. Since there aren't problems with the 'far'
> >> layouts, there shouldn't be a problem for the 'near' layouts either. This
> >> change distributes reads fairly across all drives, where before they all
> >> came from the first drive.
> >
> > Why is "fairness" an issue?
> > The current code will use a device if it finds that it is completely
> > idle. i.e. if nr_pending is 0.
> > Why is that ever the wrong thing to do?
> 
> The code also looks for the drive that is closest to the requested
> sector, which doesn't get a chance to happen without this patch. The
> way this part of the code is written, as soon as it finds a 'good'
> disk it breaks out of the loop instead of searching for a better one,
> so it doesn't even look at the other disks. In a healthy array with
> --raid-devices X and -p nX, this means that the first disk gets all
> the reads for small I/O. Where Y in -p nY is less than X, the problem
> may be masked because the data is naturally striped, but the code can
> still pick a disk that is farther away from the requested sector,
> causing extra head seeks.
> 
> > Does your testing show that overall performance is improved?  If so,
> > that would certainly be useful.
> > But it isn't clear (to me) that simply spreading the load more "fairly"
> > is a worthy goal.
> 
> I'll see if I have some mechanical drives somewhere to test (I've been
> testing four loopback devices on a single NVME drive so you don't see
> an improvement). You can see from the fio I posted [1] that before the
> patch, one drive had all the I/O and after the patch the I/O was
> distributed between all the drives (it doesn't have to be exactly
> even, just not as skewed as it was before is good enough). I would
> expect similar results to the 'far' tests done here [0]. Based on the
> previous tests I did, when I saw this code, it just made complete
> sense to me why we had great performance with 'far' and subpar
> performance with 'near'. I'll come back with some results tomorrow.

But in your test, iodepth is 1, so nr_pending is always 0 when we try to
choose a disk. In this case, always dispatching to one disk doesn't matter.
If your test used a high iodepth, the I/O would be distributed to all disks
because the first disk's nr_pending would not be 0.

That said, the distribution algorithm does have a problem. We should have
different algorithms for SSD and hard disk because seek isn't a problem for
SSD. I fixed this for raid1, but not raid10. I think we should do something
similar for raid10.
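
The idea from the raid1 change is roughly the sketch below. This is a
simplified user-space model, not the actual raid1 code (the struct and field
names are invented for illustration): if any member is non-rotational, pick
the copy with the least pending I/O; otherwise pick the copy closest to the
target sector.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

struct member {
	int  nonrot;        /* 1 for SSD/NVMe, 0 for a spinning disk */
	int  nr_pending;    /* in-flight I/O on this device */
	long head_position; /* last serviced sector */
};

static int choose_read_disk(const struct member *m, int copies, long target)
{
	long best_dist = LONG_MAX;
	int min_pending = INT_MAX;
	int best_dist_disk = -1, best_pending_disk = -1;
	int has_nonrot = 0;

	for (int i = 0; i < copies; i++) {
		long dist = labs(target - m[i].head_position);

		has_nonrot |= m[i].nonrot;
		if (m[i].nr_pending < min_pending) {
			min_pending = m[i].nr_pending;
			best_pending_disk = i;
		}
		if (dist < best_dist) {
			best_dist = dist;
			best_dist_disk = i;
		}
	}
	/* Seeks are free on SSDs, so balance by queue depth there;
	 * on rotational disks minimise head movement instead. */
	return has_nonrot ? best_pending_disk : best_dist_disk;
}

int main(void)
{
	struct member m[2] = {
		{ .nonrot = 0, .nr_pending = 4, .head_position = 100 },
		{ .nonrot = 0, .nr_pending = 0, .head_position = 90000 },
	};

	printf("rotational pair -> disk %d (closest head)\n",
	       choose_read_disk(m, 2, 120));
	m[0].nonrot = m[1].nonrot = 1;
	printf("ssd pair        -> disk %d (least pending)\n",
	       choose_read_disk(m, 2, 120));
	return 0;
}

Something along these lines in raid10's read_balance(), in place of the
removed near_copies check, is roughly what a similar fix would look like.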

Thanks,
Shaohua


* Re: [PATCH 2/2] mdadm: raid10.c Remove near atomic break
  2016-11-04  5:37     ` Robert LeBlanc
  2016-11-04 23:08       ` Shaohua Li
@ 2016-11-06 23:03       ` NeilBrown
  2016-11-08  0:12         ` Robert LeBlanc
  1 sibling, 1 reply; 8+ messages in thread
From: NeilBrown @ 2016-11-06 23:03 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: linux-raid


On Fri, Nov 04 2016, Robert LeBlanc wrote:

> On Thu, Nov 3, 2016 at 10:01 PM, NeilBrown <neilb@suse.com> wrote:
>> On Fri, Nov 04 2016, Robert LeBlanc wrote:
>>
>>> This break is always triggered for small reads, preventing the reads from
>>> being spread across all available drives. The comment is also confusing: it
>>> reads as though the check should apply only to 'far' layouts, but it really
>>> only applies to 'near' layouts. Since there aren't problems with the 'far'
>>> layouts, there shouldn't be a problem for the 'near' layouts either. This
>>> change distributes reads fairly across all drives, where before they all
>>> came from the first drive.
>>
>> Why is "fairness" an issue?
>> The current code will use a device if it finds that it is completely
>> idle. i.e. if nr_pending is 0.
>> Why is that ever the wrong thing to do?
>
> The code also looks for the drive that is closest to the requested
> sector, which doesn't get a chance to happen without this patch. The
> way this part of the code is written, as soon as it finds a 'good'
> disk it breaks out of the loop instead of searching for a better one,
> so it doesn't even look at the other disks. In a healthy array with
> --raid-devices X and -p nX, this means that the first disk gets all
> the reads for small I/O. Where Y in -p nY is less than X, the problem
> may be masked because the data is naturally striped, but the code can
> still pick a disk that is farther away from the requested sector,
> causing extra head seeks.
>
>> Does your testing show that overall performance is improved?  If so,
>> that would certainly be useful.
>> But it isn't clear (to me) that simply spreading the load more "fairly"
>> is a worthy goal.
>
> I'll see if I have some mechanical drives somewhere to test (I've been
> testing four loopback devices on a single NVME drive so you don't see
> an improvement). You can see from the fio I posted [1] that before the
> patch, one drive had all the I/O and after the patch the I/O was
> distributed between all the drives (it doesn't have to be exactly
> even, just not as skewed as it was before is good enough). I would
> expect similar results to the 'far' tests done here [0]. Based on the
> previous tests I did, when I saw this code, it just made complete
> sense to me why we had great performance with 'far' and subpar
> performance with 'near'. I'll come back with some results tomorrow.

The whole point of the "far" layout is to provide better read
performance than "near" because reads get spread out over all devices
like they do with RAID0.  So I wouldn't expect to make "near" close to
"far", but it certainly doesn't hurt to improve it.

As Shaohua notes, we use two different algorithms in RAID1, depending on
whether there are rotational devices or not.  Introducing similar logic
to RAID10 would seem to make sense.

Patches for performance improvements are always more convincing when
they come with numbers.  I'll look forward to your test results.

Thanks,
NeilBrown



* Re: [PATCH 2/2] mdadm: raid10.c Remove near atomic break
  2016-11-06 23:03       ` NeilBrown
@ 2016-11-08  0:12         ` Robert LeBlanc
  0 siblings, 0 replies; 8+ messages in thread
From: Robert LeBlanc @ 2016-11-08  0:12 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

I'm not sure why 'near' performance can't be close to 'far'
performance. Here are the results from some tests I did today. These
are 6TB SAS drives that I filled with 5TB of fio data (it took all day
Friday to fill them) to avoid short-stroking the drives.

The format of the name is
[NVME-]RAID(level)-(num_drives)[-parity_layout]. The NVME test was an
afterthought, so there may be some variance between tests not seen in
the others. I usually run several tests, average the results, and look
at the distribution to see what is significant, but I didn't have a lot
of time.

Pre-patch                                    clat (usec)
Seq                 io(MB)  bw(KB/s)  iops   min      max     avg    stdev
Single               12762   217801  54450     0    60462   17.71   156.30
RAID1-4              12903   220216  55053     0    42778   17.75   160.62
RAID10-4-n4          20057   342298  85574     0    50977   11.52   283.84
RAID10-4-f4          48711   831319 207829     0    74020    4.62   175.52
RAID10-3-n2          18439   314684  78671     0    61328   12.45   340.17
RAID10-3-f2          37169   634293 158573     0    65365    6.10   210.42
NVME-RAID10-4-n4    171950  2934682 733641     0     8016    1.16    14.15
NVME-RAID10-4-f4    172480  2943693 735903     0     7309    1.16    16.78

Post-patch                                   clat (usec)
Seq                 io(MB)  bw(KB/s)  iops   min      max     avg    stdev
Single               12898   220118  55029     0    47805   17.85   159.62
RAID1-4              12895   220067  55016     0    51156   17.85   168.47
RAID10-4-n4          12797   218385  54596     0    65610   18.01   377.55
RAID10-4-f4          48751   832000 208000     0    90652    4.61   183.18
RAID10-3-n2          18656   318388  79596     0    62684   12.30   262.32
RAID10-3-f2          37181   634487 158621     0    72696    6.11   211.63
NVME-RAID10-4-n4    172738  2947174 737001     0     1057    1.16    13.08
NVME-RAID10-4-f4    188423  3215770 803926     0     1242    1.05    16.33

Pre-patch                                    clat (usec)
Random              io(MB)  bw(KB/s)  iops   min      max     avg    stdev
Single                19.5    333.3     83  1000    48000   12000     4010
RAID1-4               19.4    331.6     82  2000    49000   12060     4110
RAID10-4-n4           19.5    332.9     83  1000    38000   12010     4210
RAID10-4-f4           27.2    463.9    115  1000    50000    8620     3190
RAID10-3-n2           22.6    385.6     96  1000    44000   10370     3620
RAID10-3-f2           26.1    444.7    111  1000   366000    8990     5430
NVME-RAID10-4-n4    2458.3  41954.0  10488    77      414      94       10
NVME-RAID10-4-f4    2509.7  42830.0  10707    74      373      93       14

Post-patch                                   clat (usec)
Random              io(MB)  bw(KB/s)  iops   min      max     avg    stdev
Single                19.5    332.5     83  2000    37000   12020     4040
RAID1-4               19.4    331.0     82  2000    34000   12080     4070
RAID10-4-n4           27.0    460.6    115   178    50950    8678     3278
RAID10-4-f4           27.0    460.1    115  1000    43000    8690     3260
RAID10-3-n2           25.3    431.6    107  1000    46000    9260     3330
RAID10-3-f2           26.1    445.4    111  1000    44000    8970     3270
NVME-RAID10-4-n4    2334.5  39840.0   9960    47      308     100       13
NVME-RAID10-4-f4    2376.6  40551.0  10137    73     2675      97       18

With this patch, 'near' performance is almost exactly 'far'
performance for random reads. Sequential reads suffer from this patch,
but are no worse than RAID1 or a bare drive. RAID10-4-n4 shows a 38%
random read performance increase, RAID10-3-n2 shows a 12% random read
performance increase, and RAID10-4-n4 shows a 36% sequential
performance degradation, while RAID10-3-n2 sequential performance
shows a 1% increase (probably insignificant).

Interesting note:
Pre-patch, sequential RAID10-4-n4 split the reads between the drives
pretty well, while random RAID10-4-n4 had all I/O going to one drive.
Post-patch these results are swapped, with sequential RAID10-4-n4
being serviced from a single drive and random RAID10-4-n4 spreading
I/O across all drives.

The patch doesn't really seem to impact NVMe. There is possibly some
error in this test that throws doubt on the results in my mind, since
both 'far' and 'near' show about the same amount of change (~5%).

I hope this helps explain my reasoning. We just need to keep or
improve the original sequential performance while keeping the improved
random performance.

Robert LeBlanc


Thread overview: 8+ messages
2016-11-03 23:45 [PATCH 0/2] Fix near layout I/O distribution Robert LeBlanc
2016-11-03 23:45 ` [PATCH 1/2] mdadm: raid10.h Fix typo Robert LeBlanc
2016-11-03 23:45 ` [PATCH 2/2] mdadm: raid10.c Remove near atomic break Robert LeBlanc
2016-11-04  4:01   ` NeilBrown
2016-11-04  5:37     ` Robert LeBlanc
2016-11-04 23:08       ` Shaohua Li
2016-11-06 23:03       ` NeilBrown
2016-11-08  0:12         ` Robert LeBlanc
