* [PATCH 0/2] Fix near layout I/O distribution
@ 2016-11-03 23:45 Robert LeBlanc
2016-11-03 23:45 ` [PATCH 1/2] mdadm: raid10.h Fix typo Robert LeBlanc
2016-11-03 23:45 ` [PATCH 2/2] mdadm: raid10.c Remove near atomic break Robert LeBlanc
0 siblings, 2 replies; 8+ messages in thread
From: Robert LeBlanc @ 2016-11-03 23:45 UTC (permalink / raw)
To: linux-raid
This series fixes a small typo from when the code was copied from raid1, and
removes an early break in read_balance() that is taken only on 'near' layouts
when atomic_read(&rdev->nr_pending) shows a device is idle. 'Far' layouts never
trigger this break, so their read I/O is spread across all available devices.
The break is removed because it contradicts the comment directly above it, and
because balancing already works as intended in the 'far' layout without it.
After this change, random I/O on a 'near' layout is distributed to all disks
instead of being serviced entirely by the first disk. Sequential I/O is not
affected, as it would not trigger the break.
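To illustrate, below is a small stand-alone toy model of the selection loop
(a simplified sketch only; the names and structure approximate, rather than
reproduce, read_balance() in drivers/md/raid10.c). At queue depth 1 every
device is idle, so with the break in place slot 0 is always chosen before any
head-distance comparison can run:

/*
 * Toy model of the read_balance() selection loop.  Simplified sketch for
 * illustration only; not the actual drivers/md/raid10.c code.
 */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_rdev {
        int nr_pending;        /* in-flight I/Os on this member */
        long head_position;    /* sector where the head was last left */
};

/* Pick a member for a read at 'addr' on a near-layout array. */
static int toy_read_balance(const struct toy_rdev *rdev, int copies,
                            long addr, int keep_idle_break)
{
        long best_dist = LONG_MAX;
        int best_slot = -1;

        for (int slot = 0; slot < copies; slot++) {
                /*
                 * Pre-patch behaviour: the first completely idle member is
                 * taken immediately, so the distance comparison below never
                 * runs and, at low queue depth, slot 0 always wins.
                 */
                if (keep_idle_break && rdev[slot].nr_pending == 0)
                        return slot;

                long dist = labs(addr - rdev[slot].head_position);
                if (dist < best_dist) {
                        best_dist = dist;
                        best_slot = slot;
                }
        }
        return best_slot;
}

int main(void)
{
        const struct toy_rdev rdev[4] = {
                { 0, 900000 }, { 0, 500000 }, { 0, 100000 }, { 0, 700000 },
        };

        printf("with break:    slot %d\n", toy_read_balance(rdev, 4, 123456, 1));
        printf("without break: slot %d\n", toy_read_balance(rdev, 4, 123456, 0));
        return 0;
}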
For example, before the patch:
# mdadm --create /dev/md14 --level 10 --raid-devices 4 -p n4 /dev/loop{11..14}
# fio -rw=randread --size=5G --name=mdadm_test
<snip>
Disk stats (read/write):
md14: ios=1304718/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=327680/0, aggrmerge=0/0, aggrticks=18719/0, aggrin_queue=18689,
aggrutil=88.37%
loop13: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
loop11: ios=1310108/0, merge=0/0, ticks=74856/0, in_queue=74736, util=88.37%
loop14: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
loop12: ios=612/0, merge=0/0, ticks=20/0, in_queue=20, util=0.02%
After patch:
<snip>
Disk stats (read/write):
md14: ios=1309172/1, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=327680/3, aggrmerge=0/0, aggrticks=19866/29, aggrin_queue=19871,
aggrutil=37.54%
loop13: ios=331904/3, merge=0/0, ticks=21184/24, in_queue=21180, util=24.64%
loop11: ios=330075/3, merge=0/0, ticks=32252/24, in_queue=32248, util=37.54%
loop14: ios=327612/3, merge=0/0, ticks=3204/24, in_queue=3200, util=3.73%
loop12: ios=321129/3, merge=0/0, ticks=22824/44, in_queue=22856, util=26.59%
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 1/2] mdadm: raid10.h Fix typo
2016-11-03 23:45 [PATCH 0/2] Fix near layout I/O distribution Robert LeBlanc
@ 2016-11-03 23:45 ` Robert LeBlanc
2016-11-03 23:45 ` [PATCH 2/2] mdadm: raid10.c Remove near atomic break Robert LeBlanc
1 sibling, 0 replies; 8+ messages in thread
From: Robert LeBlanc @ 2016-11-03 23:45 UTC (permalink / raw)
To: linux-raid; +Cc: Robert LeBlanc
Fix a typo left over from pasting code from RAID1.
Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
---
drivers/md/raid10.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 18ec1f7..1699a6a 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -53,7 +53,7 @@ struct r10conf {
sector_t offset_diff;
struct list_head retry_list;
- /* A separate list of r1bio which just need raid_end_bio_io called.
+ /* A separate list of r10bio which just need raid_end_bio_io called.
* This mustn't happen for writes which had any errors if the superblock
* needs to be written.
*/
--
2.10.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 2/2] mdadm: raid10.c Remove near atomic break
2016-11-03 23:45 [PATCH 0/2] Fix near layout I/O distribution Robert LeBlanc
2016-11-03 23:45 ` [PATCH 1/2] mdadm: raid10.h Fix typo Robert LeBlanc
@ 2016-11-03 23:45 ` Robert LeBlanc
2016-11-04 4:01 ` NeilBrown
1 sibling, 1 reply; 8+ messages in thread
From: Robert LeBlanc @ 2016-11-03 23:45 UTC (permalink / raw)
To: linux-raid; +Cc: Robert LeBlanc
This break is always triggered for small reads, preventing the reads from
being spread across all available drives. The comment is also confusing:
it talks about 'far' layouts, but the check really only applies to 'near'
layouts. Since there are no problems with 'far' layouts, there shouldn't
be a problem for 'near' layouts either. This change distributes reads
fairly across all drives, where before they all came from the first drive.
Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
---
drivers/md/raid10.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index be1a9fc..8d83802 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -777,13 +777,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
if (!do_balance)
break;
- /* This optimisation is debatable, and completely destroys
- * sequential read speed for 'far copies' arrays. So only
- * keep it for 'near' arrays, and review those later.
- */
- if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
- break;
-
/* for far > 1 always use the lowest address */
if (geo->far_copies > 1)
new_distance = r10_bio->devs[slot].addr;
--
2.10.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] mdadm: raid10.c Remove near atomic break
2016-11-03 23:45 ` [PATCH 2/2] mdadm: raid10.c Remove near atomic break Robert LeBlanc
@ 2016-11-04 4:01 ` NeilBrown
2016-11-04 5:37 ` Robert LeBlanc
0 siblings, 1 reply; 8+ messages in thread
From: NeilBrown @ 2016-11-04 4:01 UTC (permalink / raw)
To: Robert LeBlanc, linux-raid
On Fri, Nov 04 2016, Robert LeBlanc wrote:
> This break is always triggered for small reads, preventing the reads from
> being spread across all available drives. The comment is also confusing:
> it talks about 'far' layouts, but the check really only applies to 'near'
> layouts. Since there are no problems with 'far' layouts, there shouldn't
> be a problem for 'near' layouts either. This change distributes reads
> fairly across all drives, where before they all came from the first drive.
Why is "fairness" an issue?
The current code will use a device if it finds that it is completely
idle. i.e. if nr_pending is 0.
Why is that ever the wrong thing to do?
Does your testing show that overall performance is improved? If so,
that would certainly be useful.
But it isn't clear (to me) that simply spreading the load more "fairly"
is a worthy goal.
Thanks,
NeilBrown
>
> Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
> ---
> drivers/md/raid10.c | 7 -------
> 1 file changed, 7 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index be1a9fc..8d83802 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -777,13 +777,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
> if (!do_balance)
> break;
>
> - /* This optimisation is debatable, and completely destroys
> - * sequential read speed for 'far copies' arrays. So only
> - * keep it for 'near' arrays, and review those later.
> - */
> - if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
> - break;
> -
> /* for far > 1 always use the lowest address */
> if (geo->far_copies > 1)
> new_distance = r10_bio->devs[slot].addr;
> --
> 2.10.1
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] mdadm: raid10.c Remove near atomic break
2016-11-04 4:01 ` NeilBrown
@ 2016-11-04 5:37 ` Robert LeBlanc
2016-11-04 23:08 ` Shaohua Li
2016-11-06 23:03 ` NeilBrown
0 siblings, 2 replies; 8+ messages in thread
From: Robert LeBlanc @ 2016-11-04 5:37 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On Thu, Nov 3, 2016 at 10:01 PM, NeilBrown <neilb@suse.com> wrote:
> On Fri, Nov 04 2016, Robert LeBlanc wrote:
>
>> This break is always triggered for small reads, preventing the reads from
>> being spread across all available drives. The comment is also confusing:
>> it talks about 'far' layouts, but the check really only applies to 'near'
>> layouts. Since there are no problems with 'far' layouts, there shouldn't
>> be a problem for 'near' layouts either. This change distributes reads
>> fairly across all drives, where before they all came from the first drive.
>
> Why is "fairness" an issue?
> The current code will use a device if it finds that it is completely
> idle. i.e. if nr_pending is 0.
> Why is that ever the wrong thing to do?
The code also looks for the drive whose head is closest to the requested
sector, but that search never gets a chance to run without this patch. The
way this part of the code is written, as soon as it finds a completely idle
disk it breaks out of the loop and stops searching for a better disk, so it
never even looks at the others. In a healthy array created with
--raid-devices X and -p nX, this means the first disk gets all the reads
for small I/O. With nY where Y is less than X, the effect may be masked
because the data is naturally striped, but the code may still pick a disk
whose head is farther from the requested sector, causing extra seeks.
> Does your testing show that overall performance is improved? If so,
> that would certainly be useful.
> But it isn't clear (to me) that simply spreading the load more "fairly"
> is a worthy goal.
I'll see if I have some mechanical drives somewhere to test with (I've been
testing four loopback devices on a single NVMe drive, so you wouldn't see an
improvement there). You can see from the fio run I posted [1] that before
the patch one drive serviced all the I/O, and after the patch the I/O was
distributed across all the drives (it doesn't have to be exactly even; just
not as skewed as before is good enough). I would expect results similar to
the 'far' tests done here [0]. Based on the previous tests I did, this code
made it immediately clear to me why we had great performance with 'far' and
subpar performance with 'near'. I'll come back with some results tomorrow.
[0] https://raid.wiki.kernel.org/index.php/Performance
[1] http://marc.info/?l=linux-raid&m=147821671817947&w=2
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> Thanks,
> NeilBrown
>
>
>>
>> Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
>> ---
>> drivers/md/raid10.c | 7 -------
>> 1 file changed, 7 deletions(-)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index be1a9fc..8d83802 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -777,13 +777,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
>> if (!do_balance)
>> break;
>>
>> - /* This optimisation is debatable, and completely destroys
>> - * sequential read speed for 'far copies' arrays. So only
>> - * keep it for 'near' arrays, and review those later.
>> - */
>> - if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
>> - break;
>> -
>> /* for far > 1 always use the lowest address */
>> if (geo->far_copies > 1)
>> new_distance = r10_bio->devs[slot].addr;
>> --
>> 2.10.1
>>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] mdadm: raid10.c Remove near atomic break
2016-11-04 5:37 ` Robert LeBlanc
@ 2016-11-04 23:08 ` Shaohua Li
2016-11-06 23:03 ` NeilBrown
1 sibling, 0 replies; 8+ messages in thread
From: Shaohua Li @ 2016-11-04 23:08 UTC (permalink / raw)
To: Robert LeBlanc; +Cc: NeilBrown, linux-raid
On Thu, Nov 03, 2016 at 11:37:48PM -0600, Robert LeBlanc wrote:
> On Thu, Nov 3, 2016 at 10:01 PM, NeilBrown <neilb@suse.com> wrote:
> > On Fri, Nov 04 2016, Robert LeBlanc wrote:
> >
> >> This break is always triggered for small reads, preventing the reads from
> >> being spread across all available drives. The comment is also confusing:
> >> it talks about 'far' layouts, but the check really only applies to 'near'
> >> layouts. Since there are no problems with 'far' layouts, there shouldn't
> >> be a problem for 'near' layouts either. This change distributes reads
> >> fairly across all drives, where before they all came from the first drive.
> >
> > Why is "fairness" an issue?
> > The current code will use a device if it finds that it is completely
> > idle. i.e. if nr_pending is 0.
> > Why is that ever the wrong thing to do?
>
> The code also looks for the drive whose head is closest to the requested
> sector, but that search never gets a chance to run without this patch. The
> way this part of the code is written, as soon as it finds a completely idle
> disk it breaks out of the loop and stops searching for a better disk, so it
> never even looks at the others. In a healthy array created with
> --raid-devices X and -p nX, this means the first disk gets all the reads
> for small I/O. With nY where Y is less than X, the effect may be masked
> because the data is naturally striped, but the code may still pick a disk
> whose head is farther from the requested sector, causing extra seeks.
>
> > Does your testing show that overall performance is improved? If so,
> > that would certainly be useful.
> > But it isn't clear (to me) that simply spreading the load more "fairly"
> > is a worthy goal.
>
> I'll see if I have some mechanical drives somewhere to test with (I've been
> testing four loopback devices on a single NVMe drive, so you wouldn't see an
> improvement there). You can see from the fio run I posted [1] that before
> the patch one drive serviced all the I/O, and after the patch the I/O was
> distributed across all the drives (it doesn't have to be exactly even; just
> not as skewed as before is good enough). I would expect results similar to
> the 'far' tests done here [0]. Based on the previous tests I did, this code
> made it immediately clear to me why we had great performance with 'far' and
> subpar performance with 'near'. I'll come back with some results tomorrow.
But in your test the iodepth is 1, so nr_pending is always 0 when we try to
choose a disk; in that case always dispatching to one disk doesn't matter.
If your test used a higher iodepth, the I/O would be distributed to all
disks because the first disk's nr_pending would no longer be 0.
That said, the distribution algorithm does have a problem. We should have
different algorithms for SSDs and hard disks, because seek time isn't an
issue for SSDs. I fixed this for raid1, but not for raid10. I think we
should do something similar for raid10.
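Very roughly, the shape of that idea (a toy sketch of the concept only, not
the actual md/raid1 code) looks like:

/*
 * Concept sketch only, not the md/raid1 implementation: when every member
 * is non-rotational, seek distance is irrelevant, so prefer the idlest
 * member; otherwise prefer the member whose head is closest to the read.
 */
#include <stdio.h>
#include <stdlib.h>

struct member {
        int nonrotational;     /* cf. /sys/block/<dev>/queue/rotational == 0 */
        int nr_pending;        /* in-flight I/Os */
        long head_position;    /* sector where the head was last left */
};

static int choose_member(const struct member *m, int n, long addr)
{
        int all_ssd = 1, best = 0;

        for (int i = 0; i < n; i++)
                all_ssd &= m[i].nonrotational;

        for (int i = 1; i < n; i++) {
                if (all_ssd) {
                        if (m[i].nr_pending < m[best].nr_pending)
                                best = i;       /* idlest member wins */
                } else {
                        if (labs(addr - m[i].head_position) <
                            labs(addr - m[best].head_position))
                                best = i;       /* shortest seek wins */
                }
        }
        return best;
}

int main(void)
{
        const struct member disks[2] = {
                { 0, 0, 800000 },   /* rotational, idle, far from the read  */
                { 0, 3, 100000 },   /* rotational, busy, close to the read  */
        };

        /* rotational case: member 1 wins on seek distance despite being busy */
        printf("chosen: %d\n", choose_member(disks, 2, 120000));
        return 0;
}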
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] mdadm: raid10.c Remove near atomic break
2016-11-04 5:37 ` Robert LeBlanc
2016-11-04 23:08 ` Shaohua Li
@ 2016-11-06 23:03 ` NeilBrown
2016-11-08 0:12 ` Robert LeBlanc
1 sibling, 1 reply; 8+ messages in thread
From: NeilBrown @ 2016-11-06 23:03 UTC (permalink / raw)
To: Robert LeBlanc; +Cc: linux-raid
On Fri, Nov 04 2016, Robert LeBlanc wrote:
> On Thu, Nov 3, 2016 at 10:01 PM, NeilBrown <neilb@suse.com> wrote:
>> On Fri, Nov 04 2016, Robert LeBlanc wrote:
>>
>>> This break is always triggered for small reads, preventing the reads from
>>> being spread across all available drives. The comment is also confusing:
>>> it talks about 'far' layouts, but the check really only applies to 'near'
>>> layouts. Since there are no problems with 'far' layouts, there shouldn't
>>> be a problem for 'near' layouts either. This change distributes reads
>>> fairly across all drives, where before they all came from the first drive.
>>
>> Why is "fairness" an issue?
>> The current code will use a device if it finds that it is completely
>> idle. i.e. if nr_pending is 0.
>> Why is that ever the wrong thing to do?
>
> The code also looks for the drive whose head is closest to the requested
> sector, but that search never gets a chance to run without this patch. The
> way this part of the code is written, as soon as it finds a completely idle
> disk it breaks out of the loop and stops searching for a better disk, so it
> never even looks at the others. In a healthy array created with
> --raid-devices X and -p nX, this means the first disk gets all the reads
> for small I/O. With nY where Y is less than X, the effect may be masked
> because the data is naturally striped, but the code may still pick a disk
> whose head is farther from the requested sector, causing extra seeks.
>
>> Does your testing show that overall performance is improved? If so,
>> that would certainly be useful.
>> But it isn't clear (to me) that simply spreading the load more "fairly"
>> is a worthy goal.
>
> I'll see if I have some mechanical drives somewhere to test with (I've been
> testing four loopback devices on a single NVMe drive, so you wouldn't see an
> improvement there). You can see from the fio run I posted [1] that before
> the patch one drive serviced all the I/O, and after the patch the I/O was
> distributed across all the drives (it doesn't have to be exactly even; just
> not as skewed as before is good enough). I would expect results similar to
> the 'far' tests done here [0]. Based on the previous tests I did, this code
> made it immediately clear to me why we had great performance with 'far' and
> subpar performance with 'near'. I'll come back with some results tomorrow.
The whole point of the "far" layout is to provide better read
performance than "near", because reads get spread out over all devices
much as they do with RAID0. So I wouldn't expect to make "near" match
"far", but it certainly doesn't hurt to improve it.
As Shaohua notes, we use two different algorithms in RAID1, depending on
whether the devices are rotational or not. Introducing similar logic
to RAID10 would seem to make sense.
Patches for performance improvements are always more convincing when
they come with numbers. I'll look forward to your test results.
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] mdadm: raid10.c Remove near atomic break
2016-11-06 23:03 ` NeilBrown
@ 2016-11-08 0:12 ` Robert LeBlanc
0 siblings, 0 replies; 8+ messages in thread
From: Robert LeBlanc @ 2016-11-08 0:12 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
I'm not sure why 'near' performance can't be close to 'far'
performance. Here are the results from some tests I did today. These
are 6TB SAS drives that I filled with 5TB of fio data (it took all day
Friday to fill them) so as to prevent short-stroking the drives.
The format of the name is
[NVME-]RAID(level)-(num_drives)[-parity_layout]. The NVMe test was an
afterthought, so there may be some variance in those results that is
not present in the others. I usually run several tests, average the
results and check the distribution to see what is significant, but I
didn't have a lot of time.
Pre-patch, sequential reads (clat columns in usec)
                 io(MB)   bw(KB/s)    iops     min      max      avg   stdev
Single 12762 217801 54450 0 60462 17.71 156.30
RAID1-4 12903 220216 55053 0 42778 17.75 160.62
RAID10-4-n4 20057 342298 85574 0 50977 11.52 283.84
RAID10-4-f4 48711 831319 207829 0 74020 4.62 175.52
RAID10-3-n2 18439 314684 78671 0 61328 12.45 340.17
RAID10-3-f2 37169 634293 158573 0 65365 6.10 210.42
NVME-RAID10-4-n4 171950 2934682 733641 0 8016 1.16 14.15
NVME-RAID10-4-f4 172480 2943693 735903 0 7309 1.16 16.78
Post-patch, sequential reads (clat columns in usec)
                 io(MB)   bw(KB/s)    iops     min      max      avg   stdev
Single 12898 220118 55029 0 47805 17.85 159.62
RAID1-4 12895 220067 55016 0 51156 17.85 168.47
RAID10-4-n4 12797 218385 54596 0 65610 18.01 377.55
RAID10-4-f4 48751 832000 208000 0 90652 4.61 183.18
RAID10-3-n2 18656 318388 79596 0 62684 12.30 262.32
RAID10-3-f2 37181 634487 158621 0 72696 6.11 211.63
NVME-RAID10-4-n4 172738 2947174 737001 0 1057 1.16 13.08
NVME-RAID10-4-f4 188423 3215770 803926 0 1242 1.05 16.33
Pre-patch, random reads (clat columns in usec)
                 io(MB)   bw(KB/s)    iops     min      max      avg   stdev
Single 19.5 333.3 83 1000 48000 12000 4010
RAID1-4 19.4 331.6 82 2000 49000 12060 4110
RAID10-4-n4 19.5 332.9 83 1000 38000 12010 4210
RAID10-4-f4 27.2 463.9 115 1000 50000 8620 3190
RAID10-3-n2 22.6 385.6 96 1000 44000 10370 3620
RAID10-3-f2 26.1 444.7 111 1000 366000 8990 5430
NVME-RAID10-4-n4 2458.3 41954.0 10488 77 414 94 10
NVME-RAID10-4-f4 2509.7 42830.0 10707 74 373 93 14
Post-patch, random reads (clat columns in usec)
                 io(MB)   bw(KB/s)    iops     min      max      avg   stdev
Single 19.5 332.5 83 2000 37000 12020 4040
RAID1-4 19.4 331.0 82 2000 34000 12080 4070
RAID10-4-n4 27.0 460.6 115 178 50950 8678 3278
RAID10-4-f4 27.0 460.1 115 1000 43000 8690 3260
RAID10-3-n2 25.3 431.6 107 1000 46000 9260 3330
RAID10-3-f2 26.1 445.4 111 1000 44000 8970 3270
NVME-RAID10-4-n4 2334.5 39840.0 9960 47 308 100 13
NVME-RAID10-4-f4 2376.6 40551.0 10137 73 2675 97 18
With this patch, 'near' random read performance is almost exactly the
same as 'far'. Sequential reads suffer from this patch, but are no worse
than RAID1 or the bare drive. RAID10-4-n4 shows a 38% random read
improvement and RAID10-3-n2 a 12% random read improvement, while
RAID10-4-n4 sequential performance degrades by 36% and RAID10-3-n2
sequential performance improves by about 1% (probably insignificant).
Interesting note:
Pre-patch, sequential RAID10-4-n4 split the reads between the drives
fairly well, while random RAID10-4-n4 sent all I/O to one drive.
Post-patch these results are swapped, with sequential RAID10-4-n4 being
serviced from a single drive and random RAID10-4-n4 spreading I/O across
all drives.
The patch doesn't really seem to impact NVMe; there is possibly some
error in this test that makes me doubt the results, since both 'far' and
'near' changed by about the same amount (~5%).
I hope this helps explain my reasoning. I just need to keep (or improve)
the original sequential performance while also getting the improved
random performance.
Robert LeBlanc
^ permalink raw reply [flat|nested] 8+ messages in thread
Thread overview: 8+ messages (newest: 2016-11-08 0:12 UTC)
-- links below jump to the message on this page --
2016-11-03 23:45 [PATCH 0/2] Fix near layout I/O distribution Robert LeBlanc
2016-11-03 23:45 ` [PATCH 1/2] mdadm: raid10.h Fix typo Robert LeBlanc
2016-11-03 23:45 ` [PATCH 2/2] mdadm: raid10.c Remove near atomic break Robert LeBlanc
2016-11-04 4:01 ` NeilBrown
2016-11-04 5:37 ` Robert LeBlanc
2016-11-04 23:08 ` Shaohua Li
2016-11-06 23:03 ` NeilBrown
2016-11-08 0:12 ` Robert LeBlanc