Linux RAID subsystem development
* Re: Issue with growing RAID10
From: Robert LeBlanc @ 2016-11-02 18:13 UTC
  To: Wols Lists; +Cc: linux-raid
In-Reply-To: <581A2BD4.3070109@youngman.org.uk>

Grow on RAID10 does work. Here is my previous attempt at trying to
change --raid-devices and -p separately.
# mdadm --detail /dev/md13
/dev/md13:
       Version : 1.2
 Creation Time : Wed Nov  2 11:25:22 2016
    Raid Level : raid10
    Array Size : 10477568 (9.99 GiB 10.73 GB)
 Used Dev Size : 10477568 (9.99 GiB 10.73 GB)
  Raid Devices : 2
 Total Devices : 2
   Persistence : Superblock is persistent

   Update Time : Wed Nov  2 11:25:22 2016
         State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
 Spare Devices : 0

        Layout : near=2
    Chunk Size : 512K

          Name : rleblanc-pc:13  (local to host rleblanc-pc)
          UUID : 278c5e33:5ac1d25a:241a0cf7:66269542
        Events : 0

   Number   Major   Minor   RaidDevice State
      0       7        2        0      active sync set-A   /dev/loop2
      1       7        3        1      active sync set-B   /dev/loop3

# mdadm /dev/md13 -a /dev/loop4
mdadm: added /dev/loop4

root@rleblanc-pc:/home/rleblanc/Downloads# mdadm --detail /dev/md13
/dev/md13:
       Version : 1.2
 Creation Time : Wed Nov  2 11:25:22 2016
    Raid Level : raid10
    Array Size : 10477568 (9.99 GiB 10.73 GB)
 Used Dev Size : 10477568 (9.99 GiB 10.73 GB)
  Raid Devices : 2
 Total Devices : 3
   Persistence : Superblock is persistent

   Update Time : Wed Nov  2 11:27:33 2016
         State : clean
Active Devices : 2
Working Devices : 3
Failed Devices : 0
 Spare Devices : 1

        Layout : near=2
    Chunk Size : 512K

          Name : rleblanc-pc:13  (local to host rleblanc-pc)
          UUID : 278c5e33:5ac1d25a:241a0cf7:66269542
        Events : 1

   Number   Major   Minor   RaidDevice State
      0       7        2        0      active sync set-A   /dev/loop2
      1       7        3        1      active sync set-B   /dev/loop3

      2       7        4        -      spare   /dev/loop4

# mdadm --grow /dev/md13 --raid-devices 3

# mdadm --detail /dev/md13
/dev/md13:
       Version : 1.2
 Creation Time : Wed Nov  2 11:25:22 2016
    Raid Level : raid10
    Array Size : 10477568 (9.99 GiB 10.73 GB)
 Used Dev Size : 10477568 (9.99 GiB 10.73 GB)
  Raid Devices : 3
 Total Devices : 3
   Persistence : Superblock is persistent

   Update Time : Wed Nov  2 11:28:08 2016
         State : clean, reshaping
Active Devices : 3
Working Devices : 3
Failed Devices : 0
 Spare Devices : 0

        Layout : near=2
    Chunk Size : 512K

Reshape Status : 1% complete
 Delta Devices : 1, (2->3)

          Name : rleblanc-pc:13  (local to host rleblanc-pc)
          UUID : 278c5e33:5ac1d25a:241a0cf7:66269542
        Events : 12

   Number   Major   Minor   RaidDevice State
      0       7        2        0      active sync   /dev/loop2
      1       7        3        1      active sync   /dev/loop3
      2       7        4        2      active sync   /dev/loop4

----Wait for reshape to finish----
# mdadm --detail /dev/md13
/dev/md13:
       Version : 1.2
 Creation Time : Wed Nov  2 11:25:22 2016
    Raid Level : raid10
    Array Size : 15716352 (14.99 GiB 16.09 GB)
 Used Dev Size : 10477568 (9.99 GiB 10.73 GB)
  Raid Devices : 3
 Total Devices : 3
   Persistence : Superblock is persistent

   Update Time : Wed Nov  2 11:33:25 2016
         State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
 Spare Devices : 0

        Layout : near=2
    Chunk Size : 512K

          Name : rleblanc-pc:13  (local to host rleblanc-pc)
          UUID : 278c5e33:5ac1d25a:241a0cf7:66269542
        Events : 49

   Number   Major   Minor   RaidDevice State
      0       7        2        0      active sync   /dev/loop2
      1       7        3        1      active sync   /dev/loop3
      2       7        4        2      active sync   /dev/loop4

# mdadm --grow /dev/md13 -p n3
mdadm: Cannot change number of copies when reshaping RAID10
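
The array sizes in the outputs above are consistent with usable RAID10 capacity being the per-device size times the number of devices, divided by the number of copies. A minimal sketch of that arithmetic follows; the function name raid10_capacity_kib is made up for illustration, and the formula is an assumption that ignores any rounding to chunk boundaries.

```shell
#!/bin/sh
# Hypothetical helper (not part of mdadm): estimate usable RAID10 capacity.
# Assumption: capacity = devices * per-device size / copies, no chunk rounding.
raid10_capacity_kib() {
    devices=$1       # "Raid Devices" from mdadm --detail
    dev_size_kib=$2  # "Used Dev Size" from mdadm --detail, in KiB
    copies=$3        # the N in "Layout : near=N"
    echo $(( devices * dev_size_kib / copies ))
}

raid10_capacity_kib 2 10477568 2   # 2 devices, near=2
raid10_capacity_kib 3 10477568 2   # 3 devices, near=2
raid10_capacity_kib 3 10477568 3   # 3 devices, near=3
```

The first two calls reproduce the Array Size values reported before and after the reshape (10477568 and 15716352). The third suggests that a near=3 layout on three devices would stay at 10477568, i.e. a third copy rather than extra space.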
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Nov 2, 2016 at 12:09 PM, Wols Lists <antlists@youngman.org.uk> wrote:
> On 02/11/16 17:59, Robert LeBlanc wrote:
>> We would like to add read performance to our RAID10 volume by adding
>> another drive (we don't care about space), so I did the following test
>> with poor results.
>
> Quicky reply ...
>
> I don't think you can change the number of raid-devices on a raid10. Are
> you trying to replace a slow drive with a faster one? You can probably
> use the --replace option.
>
> If not that, what do you want to achieve?
>
> Cheers,
> Wol
>


* Re: Issue with growing RAID10
From: keld @ 2016-11-02 18:19 UTC
  To: Robert LeBlanc; +Cc: linux-raid
In-Reply-To: <CAANLjFrK25fUjMpif6m7y3PZ0phNa_zVLFWUVs2Yj1PtbkA_Qg@mail.gmail.com>

There are some speed limits on raid10,n2, as also reported in
https://raid.wiki.kernel.org/index.php/Performance

If you want speed, I suggest you use raid10,f2.

Unfortunately you cannot grow "far" layouts; Neil says it is too complicated.

But in your case you should be able to disable one of your raid10,n2 drives,
then build a raid10,n2 array for 3 disks, but only with the disk you removed from
your n2 array plus your new disk. Then you can copy the contents of the remaining
old disk to the new "far" array, and when complete, add the old raid10,n2 disk to the
new far raid, with 3 disks. This should give you about 3 times the speed
of your old raid10,n2 array.
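
The procedure described above could be sketched roughly as follows. This is a dry run that only prints the commands. It assumes the new array is meant to use the far layout (the message calls it the "far" raid), reuses the loop devices from the test in this thread, and picks /dev/md14 as an arbitrary name for the new array. The old array would need to be unmounted or otherwise idle while its contents are copied.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# Assumptions: /dev/md14 is a free array name, /dev/loop4 is the new disk,
# and the new array uses --layout=f2. Set DRY_RUN=0 only on throwaway devices.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

migrate_to_far() {
    old=/dev/md13
    new=/dev/md14
    # 1. Pull one mirror out of the old array; it keeps running degraded.
    run mdadm "$old" --fail /dev/loop3 --remove /dev/loop3
    # 2. Wipe the old metadata on the removed disk.
    run mdadm --zero-superblock /dev/loop3
    # 3. Create the 3-device array with one slot left empty ("missing").
    run mdadm --create "$new" --level=10 --layout=f2 --raid-devices=3 /dev/loop3 /dev/loop4 missing
    # 4. Copy the data from the degraded old array to the new one.
    run dd if="$old" of="$new" bs=1M
    # 5. Retire the old array and hand its last disk to the new one.
    run mdadm --stop "$old"
    run mdadm --zero-superblock /dev/loop2
    run mdadm "$new" --add /dev/loop2
}

migrate_to_far
```

During steps 1 to 5 the data has less redundancy than before, so this is something to rehearse on loop devices first.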

Best regards
keld



On Wed, Nov 02, 2016 at 11:59:25AM -0600, Robert LeBlanc wrote:
> We would like to add read performance to our RAID10 volume by adding
> another drive (we don't care about space), so I did the following test
> with poor results.
> 
> # mdadm --create /dev/md13 --level 10 --run --assume-clean -p n2
> --raid-devices 2 /dev/loop{2..3}
> mdadm: /dev/loop2 appears to be part of a raid array:
>       level=raid10 devices=3 ctime=Wed Nov  2 11:25:22 2016
> mdadm: /dev/loop3 appears to be part of a raid array:
>       level=raid10 devices=3 ctime=Wed Nov  2 11:25:22 2016
> mdadm: Defaulting to version 1.2 metadata
> mdadm: array /dev/md13 started.
> 
> # mdadm --detail /dev/md13
> /dev/md13:
>        Version : 1.2
>  Creation Time : Wed Nov  2 11:47:48 2016
>     Raid Level : raid10
>     Array Size : 10477568 (9.99 GiB 10.73 GB)
>  Used Dev Size : 10477568 (9.99 GiB 10.73 GB)
>   Raid Devices : 2
>  Total Devices : 2
>    Persistence : Superblock is persistent
> 
>    Update Time : Wed Nov  2 11:47:48 2016
>          State : clean
> Active Devices : 2
> Working Devices : 2
> Failed Devices : 0
>  Spare Devices : 0
> 
>         Layout : near=2
>     Chunk Size : 512K
> 
>           Name : rleblanc-pc:13  (local to host rleblanc-pc)
>           UUID : 1eb66d7c:21308453:1e731c8b:1c43dd55
>         Events : 0
> 
>    Number   Major   Minor   RaidDevice State
>       0       7        2        0      active sync set-A   /dev/loop2
>       1       7        3        1      active sync set-B   /dev/loop3
> 
> # mdadm /dev/md13 -a /dev/loop4
> mdadm: added /dev/loop4
> 
> # mdadm --detail /dev/md13
> /dev/md13:
>        Version : 1.2
>  Creation Time : Wed Nov  2 11:47:48 2016
>     Raid Level : raid10
>     Array Size : 10477568 (9.99 GiB 10.73 GB)
>  Used Dev Size : 10477568 (9.99 GiB 10.73 GB)
>   Raid Devices : 2
>  Total Devices : 3
>    Persistence : Superblock is persistent
> 
>    Update Time : Wed Nov  2 11:48:13 2016
>          State : clean
> Active Devices : 2
> Working Devices : 3
> Failed Devices : 0
>  Spare Devices : 1
> 
>         Layout : near=2
>     Chunk Size : 512K
> 
>           Name : rleblanc-pc:13  (local to host rleblanc-pc)
>           UUID : 1eb66d7c:21308453:1e731c8b:1c43dd55
>         Events : 1
> 
>    Number   Major   Minor   RaidDevice State
>       0       7        2        0      active sync set-A   /dev/loop2
>       1       7        3        1      active sync set-B   /dev/loop3
> 
>       2       7        4        -      spare   /dev/loop4
> 
> # mdadm --grow /dev/md13 -p n3 --raid-devices 3
> mdadm: Cannot change number of copies when reshaping RAID10
> 
> I also tried to add the device, grow raid-devices, let it reshape,
> then try to change the number of copies and it didn't like that
> either. It would be nice to supply -p nX and --raid-devices X at the
> same time to prevent the reshape and only copy the data over to the
> new drive (or drop a drive out completely). I could see changing -p
> separately or at a different rate of drives added/removed could be
> difficult, but for lockstep changes, it seems that it would be rather
> easy.
> 
> Any ideas?
> 
> Thanks,
> 
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Issue with growing RAID10
From: Robert LeBlanc @ 2016-11-02 19:02 UTC
  To: keld; +Cc: linux-raid
In-Reply-To: <20161102181923.GA21325@www5.open-std.org>

My boss basically wants RAID1 with all drives able to be read from. He
has a requirement to have all the drives identical (minus the
superblock) hence the 'near' option being used. From my rudimentary
tests, sequential reads do seem to use all drives, but random reads
don't. I wonder what logic is preventing the spreading out of random
workloads for 'near'. 'far' is using all disks in random read and
getting better performance on both random and sequential. I'm testing
loopbacks on an NVMe drive so seek latency should not be a major
concern.
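
One way to check which member devices actually service reads is to compare the "reads completed" counters in /proc/diskstats (the fourth column, after major, minor and device name) before and after a test run. The helper names below are made up for illustration; this is a sketch, not the method used for the tests described here.

```shell
#!/bin/sh
# Print the "reads completed" counter for one device.
# The second argument is a diskstats-format file (default: /proc/diskstats).
reads_completed() {
    dev=$1
    stats=${2:-/proc/diskstats}
    awk -v d="$dev" '$3 == d { print $4 }' "$stats"
}

# Print how many reads a device completed between two saved snapshots.
# Assumes the device appears in both snapshot files.
reads_delta() {
    dev=$1
    before=$2
    after=$3
    echo $(( $(reads_completed "$dev" "$after") - $(reads_completed "$dev" "$before") ))
}

# Example workflow (file names are arbitrary):
#   cat /proc/diskstats > before.txt
#   ...run the random-read test against /dev/md13...
#   cat /proc/diskstats > after.txt
#   for d in loop2 loop3 loop4; do
#       echo "$d $(reads_delta "$d" before.txt after.txt)"
#   done
```

A member whose delta stays near zero during a random-read test is not taking part in servicing those reads.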
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1




* Re: Issue with growing RAID10
From: keld @ 2016-11-02 19:48 UTC
  To: Robert LeBlanc; +Cc: linux-raid
In-Reply-To: <CAANLjFqqTvs0Uh1R6Bvk9FyQUgQqhj+48uC4iiut_A3Uh6j6DA@mail.gmail.com>

If you want all your disks to be identical, then you can only choose between
raid1 and raid10 near. I believe raid10 near is then the better layout, as some
stats say you will have better random performance. I don't know why; probably a driver issue.
I believe you can have raid1 in a 3-disk solution. You should try it out, and then please report the
stats back to the list; then I will add them to the wiki (it seems inaccessible at the moment, though).

best regards
Keld



* Re: Issue with growing RAID10
From: Robert LeBlanc @ 2016-11-02 19:56 UTC
  To: keld; +Cc: linux-raid
In-Reply-To: <20161102194842.GA25173@www5.open-std.org>

Yes, we can have any number of disks in a RAID1 (we currently have
three), but reads only ever come from the first drive. We want to move
to RAID10 so that all drives can service reads and provide performance
as well. We just need the option to grow a RAID10 like we can with
RAID1. We don't need the "extra" space by growing a RAID10 without
changing '-p n'. Basically, we want to be super paranoid with several
identical copies of the data and get extra read performance. We know
that we will be limited in write performance, which is kind of
counterintuitive for RAID10, but our workload is OK with that.

I hope that makes sense. I could provide some test data on n-disk
RAID1, but my experience says there is little value to it; it is very
similar to 2-disk RAID1. If I have time, I'll supply something.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1




* Re: Issue with growing RAID10
From: keld @ 2016-11-02 20:16 UTC
  To: Robert LeBlanc; +Cc: linux-raid
In-Reply-To: <CAANLjFpQ-s4E3dKmaEEp0DcWaSQx21-Leafukir6idET++zMsg@mail.gmail.com>

I am not sure what the problem is then. If it is growing your raid10,n2
to a raid10,n3 - which may not be doable with mdadm grow - then you could try
creating a raid10,n3 array on your new disk, with only 1 disk, copying the stuff,
and then adding the 2 old drives.
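
As a sketch of what the create step for such a one-disk raid10,n3 array might look like, the helper below only prints an mdadm --create command with the absent slots filled by the keyword "missing". The helper name and /dev/md14 are made up for illustration, and whether mdadm accepts a RAID10 with two of three slots missing has not been tested here.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) an mdadm --create command for a
# near-layout RAID10 where only the first device is present and every
# other slot is given as "missing".
degraded_create_cmd() {
    array=$1      # name for the new array, e.g. /dev/md14 (assumed free)
    copies=$2     # number of copies, also used as the device count
    first_dev=$3  # the one disk that exists at creation time
    cmd="mdadm --create $array --level=10 --layout=n$copies --raid-devices=$copies $first_dev"
    i=1
    while [ "$i" -lt "$copies" ]; do
        cmd="$cmd missing"
        i=$(( i + 1 ))
    done
    echo "$cmd"
}

degraded_create_cmd /dev/md14 3 /dev/loop4
```

After the data has been copied over and the old array stopped, the two old drives could be added with something like "mdadm /dev/md14 --add /dev/loop2 /dev/loop3", at which point they would be rebuilt as the second and third copies.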

I think it is an insight that raid1 only - mostly - performs out of one disk,
regardless of how many disks you have. I have used multi-disk raid1 to
have redundancy for booting, so some use can be found.

Best regards
Keld

On Wed, Nov 02, 2016 at 01:56:02PM -0600, Robert LeBlanc wrote:
> Yes, we can have any number of disks in a RAID1 (we currently have
> three), but reads only ever come from the first drive. We want to move
> to RAID10 so that all drives can service reads and provide performance
> as well. We just need the option to grow a RAID10 like we can with
> RAID1. We don't need the "extra" space by growing a RAID10 without
> changing '-p n'. Basically, we want to be super paranoid with several
> identical copies of the data and get extra read performance. We know
> that we will be limited in write performance, which is kind of
> counterintuitive for RAID10, but our workload is OK with that.
> 
> I hope that makes sense. I could provide some test data on n-disk
> RAID1, but my experience says there is little value to it; it is very
> similar to 2-disk RAID1. If I have time, I'll supply something.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Wed, Nov 2, 2016 at 1:48 PM,  <keld@keldix.com> wrote:
> > If you want all your disks to be identical, then you can only choose between
> > raid1 and raid10 near. I believe raid10 near is then the better layout, as some
> > stats say you will have better random performance. I don't know why; probably a driver issue.
> > I believe you can have raid1 in a 3-disk solution. You should try it out, and then please report the
> > stats back to the list; then I will add them to the wiki (it seems inaccessible at the moment, though).
> >
> > best regards
> > Keld
> >
> > On Wed, Nov 02, 2016 at 01:02:29PM -0600, Robert LeBlanc wrote:
> >> My boss basically wants RAID1 with all drives able to be read from. He
> >> has a requirement to have all the drives identical (minus the
> >> superblock) hence the 'near' option being used. From my rudimentary
> >> tests, sequential reads do seem to use all drives, but random reads
> >> don't. I wonder what logic is preventing the spreading out of random
> >> workloads for 'near'. 'far' is using all disks in random read and
> >> getting better performance on both random and sequential. I'm testing
> >> loopbacks on an NVMe drive so seek latency should not be a major
> >> concern.
> >> ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>
> >>
> >> On Wed, Nov 2, 2016 at 12:19 PM,  <keld@keldix.com> wrote:
> >> > There are some speed limits on raid10,n2, as also reported in
> >> > https://raid.wiki.kernel.org/index.php/Performance
> >> >
> >> > If you want speed, I suggest you use raid10,f2.
> >> >
> >> > Unfortunately you cannot grow "far" layouts; Neil says it is too complicated.
> >> >
> >> > But in your case you should be able to disable one of your raid10,n2 drives,
> >> > then build a raid10,n2 array for 3 disks, but only with the disk you removed from
> >> > your n2 array plus your new disk. Then you can copy the contents of the remaining
> >> > old disk to the new "far" array, and when complete, add the old raid10,n2 disk to the
> >> > new far raid, with 3 disks. This should give you about 3 times the speed
> >> > of your old raid10,n2 array.
> >> >
> >> > Best regards
> >> > keld
> >> >
> >> >
> >> >
> >> > On Wed, Nov 02, 2016 at 11:59:25AM -0600, Robert LeBlanc wrote:
> >> >> We would like to add read performance to our RAID10 volume by adding
> >> >> another drive (we don't care about space), so I did the following test
> >> >> with poor results.
> >> >>
> >> >> # mdadm --create /dev/md13 --level 10 --run --assume-clean -p n2 --raid-devices 2 /dev/loop{2..3}
> >> >> mdadm: /dev/loop2 appears to be part of a raid array:
> >> >>       level=raid10 devices=3 ctime=Wed Nov  2 11:25:22 2016
> >> >> mdadm: /dev/loop3 appears to be part of a raid array:
> >> >>       level=raid10 devices=3 ctime=Wed Nov  2 11:25:22 2016
> >> >> mdadm: Defaulting to version 1.2 metadata
> >> >> mdadm: array /dev/md13 started.
> >> >>
> >> >> # mdadm --detail /dev/md13
> >> >> /dev/md13:
> >> >>        Version : 1.2
> >> >>  Creation Time : Wed Nov  2 11:47:48 2016
> >> >>     Raid Level : raid10
> >> >>     Array Size : 10477568 (9.99 GiB 10.73 GB)
> >> >>  Used Dev Size : 10477568 (9.99 GiB 10.73 GB)
> >> >>   Raid Devices : 2
> >> >>  Total Devices : 2
> >> >>    Persistence : Superblock is persistent
> >> >>
> >> >>    Update Time : Wed Nov  2 11:47:48 2016
> >> >>          State : clean
> >> >> Active Devices : 2
> >> >> Working Devices : 2
> >> >> Failed Devices : 0
> >> >>  Spare Devices : 0
> >> >>
> >> >>         Layout : near=2
> >> >>     Chunk Size : 512K
> >> >>
> >> >>           Name : rleblanc-pc:13  (local to host rleblanc-pc)
> >> >>           UUID : 1eb66d7c:21308453:1e731c8b:1c43dd55
> >> >>         Events : 0
> >> >>
> >> >>    Number   Major   Minor   RaidDevice State
> >> >>       0       7        2        0      active sync set-A   /dev/loop2
> >> >>       1       7        3        1      active sync set-B   /dev/loop3
> >> >>
> >> >> # mdadm /dev/md13 -a /dev/loop4
> >> >> mdadm: added /dev/loop4
> >> >>
> >> >> # mdadm --detail /dev/md13
> >> >> /dev/md13:
> >> >>        Version : 1.2
> >> >>  Creation Time : Wed Nov  2 11:47:48 2016
> >> >>     Raid Level : raid10
> >> >>     Array Size : 10477568 (9.99 GiB 10.73 GB)
> >> >>  Used Dev Size : 10477568 (9.99 GiB 10.73 GB)
> >> >>   Raid Devices : 2
> >> >>  Total Devices : 3
> >> >>    Persistence : Superblock is persistent
> >> >>
> >> >>    Update Time : Wed Nov  2 11:48:13 2016
> >> >>          State : clean
> >> >> Active Devices : 2
> >> >> Working Devices : 3
> >> >> Failed Devices : 0
> >> >>  Spare Devices : 1
> >> >>
> >> >>         Layout : near=2
> >> >>     Chunk Size : 512K
> >> >>
> >> >>           Name : rleblanc-pc:13  (local to host rleblanc-pc)
> >> >>           UUID : 1eb66d7c:21308453:1e731c8b:1c43dd55
> >> >>         Events : 1
> >> >>
> >> >>    Number   Major   Minor   RaidDevice State
> >> >>       0       7        2        0      active sync set-A   /dev/loop2
> >> >>       1       7        3        1      active sync set-B   /dev/loop3
> >> >>
> >> >>       2       7        4        -      spare   /dev/loop4
> >> >>
> >> >> # mdadm --grow /dev/md13 -p n3 --raid-devices 3
> >> >> mdadm: Cannot change number of copies when reshaping RAID10
> >> >>
> >> >> I also tried to add the device, grow raid-devices, let it reshape,
> >> >> then try to change the number of copies and it didn't like that
> >> >> either. It would be nice to supply -p nX and --raid-devices X at the
> >> >> same time to prevent the reshape and only copy the data over to the
> >> >> new drive (or drop a drive out completely). I could see changing -p
> >> >> separately or at a different rate of drives added/removed could be
> >> >> difficult, but for lockstep changes, it seems that it would be rather
> >> >> easy.
> >> >>
> >> >> Any ideas?
> >> >>
> >> >> Thanks,
> >> >>
> >> >> ----------------
> >> >> Robert LeBlanc
> >> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >> --
> >> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Issue with growing RAID10
From: Robert LeBlanc @ 2016-11-02 20:27 UTC
  To: keld; +Cc: linux-raid
In-Reply-To: <20161102201634.GA25517@www5.open-std.org>

Keld,

This is not a 'one-off' issue I'm trying to resolve. In the future we
may have to add disks to thousands of arrays holding hundreds of TBs
of data, and the process has to be automatable. It is also possible
that we never have to add disks, but we can't be backed into that
corner; if growing isn't possible we would just stick with RAID1. We
ran across RAID10 as what seemed to be a solution to the problem we
were having and are exploring the options. We might be able to
'adjust' the code to work for us, but this is a bit out of our realm.
Someone here might be able to say "oh, that should be easy to add" if
it isn't already there, whereas it would take us weeks to understand
the code.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Nov 2, 2016 at 2:16 PM,  <keld@keldix.com> wrote:
> I am not sure what the problem is then. If it is growing your raid10,n2
> to a raid10,n3 - which may not be doable with mdadm grow - then you could try
> creating a raid10,n3 array on your new disk, with only 1 disk, copying the stuff,
> and then adding the 2 old drives.
>
> I think it is an insight that raid1 mostly performs out of only one disk,
> regardless of how many disks you have. I have used multi-disk raid1 to
> have redundancy for booting, so some use can be found.
>
> Best regards
> Keld
>
> On Wed, Nov 02, 2016 at 01:56:02PM -0600, Robert LeBlanc wrote:
>> Yes, we can have any number of disks in a RAID1 (we currently have
>> three), but reads only ever come from the first drive. We want to move
>> to RAID10 so that all drives can service reads and provide performance
>> as well. We just need the option to grow a RAID10 like we can with
>> RAID1. We don't need the "extra" space by growing a RAID10 without
>> changing '-p n'. Basically, we want to be super paranoid with several
>> identical copies of the data and get extra read performance. We know
>> that we will be limited in write performance which is kind of counter
>> intuitive for RAID10, but our workload is OK with that.
>>
>> I hope that makes sense. I could provide some test data on n-disk
>> RAID1, but my experience says there is little value to it, it is very
>> similar to 2 disk RAID1. If I have time, I'll supply something.
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Wed, Nov 2, 2016 at 1:48 PM,  <keld@keldix.com> wrote:
>> > If you want all your disks to be identical, then you can only choose between
>> > raid1 and raid10 near. I believe then the raid10 near is the better layout, as some
>> > stats say you will have better random performance. I don't know why. Probably a driver issue.
>> > I believe you can have raid1 in a 3-disk solution. You should try it out, and then please report the
>> > stats back to the list, then I will add it to the wiki (it seems inaccessible at the moment, though)
>> >
>> > best regards
>> > Keld
>> >
>> > On Wed, Nov 02, 2016 at 01:02:29PM -0600, Robert LeBlanc wrote:
>> >> My boss basically wants RAID1 with all drives able to be read from. He
>> >> has a requirement to have all the drives identical (minus the
>> >> superblock) hence the 'near' option being used. From my rudimentary
>> >> tests, sequential reads do seem to use all drives, but random reads
>> >> don't. I wonder what logic is preventing the spreading out of random
>> >> workloads for 'near'. 'far' is using all disks in random read and
>> >> getting better performance on both random and sequential. I'm testing
>> >> loopbacks on an NVME drive so seek latency should not be a major
>> >> concern.
>> >> ----------------
>> >> Robert LeBlanc
>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>
>> >>
>> >> On Wed, Nov 2, 2016 at 12:19 PM,  <keld@keldix.com> wrote:
>> >> > There are some speed limits on raid10,n2 as also reported in
>> >> > https://raid.wiki.kernel.org/index.php/Performance
>> >> >
>> >> > If you want speed, I suggest you use raid10,f2.
>> >> >
>> >> > Unfortunately you cannot grow "far" layouts, Neil says it is too complicated.
>> >> >
>> >> > But in your case you should be able to disable one of your raid10,N2 drives,
>> >> > then build a raid10,n2 array for 3 disks, but only with the disk you removed from
>> >> > your N2 disk plus your new disk. Then you can copy the contents of the remaining
>> >> > old disk to the new "far" disk, and when complete, add the old raid10,n2 disk to the
>> >> > new Far raid, with 3 disks. This should give you about 3 times the speed
>> >> > of your old raid10,n2 array.
>> >> >
>> >> > Best regards
>> >> > keld
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Nov 02, 2016 at 11:59:25AM -0600, Robert LeBlanc wrote:
>> >> >> We would like to add read performance to our RAID10 volume by adding
>> >> >> another drive (we don't care about space), so I did the following test
>> >> >> with poor results.
>> >> >>
>> >> >> # mdadm --create /dev/md13 --level 10 --run --assume-clean -p n2 --raid-devices 2 /dev/loop{2..3}
>> >> >> mdadm: /dev/loop2 appears to be part of a raid array:
>> >> >>       level=raid10 devices=3 ctime=Wed Nov  2 11:25:22 2016
>> >> >> mdadm: /dev/loop3 appears to be part of a raid array:
>> >> >>       level=raid10 devices=3 ctime=Wed Nov  2 11:25:22 2016
>> >> >> mdadm: Defaulting to version 1.2 metadata
>> >> >> mdadm: array /dev/md13 started.
>> >> >>
>> >> >> # mdadm --detail /dev/md13
>> >> >> /dev/md13:
>> >> >>        Version : 1.2
>> >> >>  Creation Time : Wed Nov  2 11:47:48 2016
>> >> >>     Raid Level : raid10
>> >> >>     Array Size : 10477568 (9.99 GiB 10.73 GB)
>> >> >>  Used Dev Size : 10477568 (9.99 GiB 10.73 GB)
>> >> >>   Raid Devices : 2
>> >> >>  Total Devices : 2
>> >> >>    Persistence : Superblock is persistent
>> >> >>
>> >> >>    Update Time : Wed Nov  2 11:47:48 2016
>> >> >>          State : clean
>> >> >> Active Devices : 2
>> >> >> Working Devices : 2
>> >> >> Failed Devices : 0
>> >> >>  Spare Devices : 0
>> >> >>
>> >> >>         Layout : near=2
>> >> >>     Chunk Size : 512K
>> >> >>
>> >> >>           Name : rleblanc-pc:13  (local to host rleblanc-pc)
>> >> >>           UUID : 1eb66d7c:21308453:1e731c8b:1c43dd55
>> >> >>         Events : 0
>> >> >>
>> >> >>    Number   Major   Minor   RaidDevice State
>> >> >>       0       7        2        0      active sync set-A   /dev/loop2
>> >> >>       1       7        3        1      active sync set-B   /dev/loop3
>> >> >>
>> >> >> # mdadm /dev/md13 -a /dev/loop4
>> >> >> mdadm: added /dev/loop4
>> >> >>
>> >> >> # mdadm --detail /dev/md13
>> >> >> /dev/md13:
>> >> >>        Version : 1.2
>> >> >>  Creation Time : Wed Nov  2 11:47:48 2016
>> >> >>     Raid Level : raid10
>> >> >>     Array Size : 10477568 (9.99 GiB 10.73 GB)
>> >> >>  Used Dev Size : 10477568 (9.99 GiB 10.73 GB)
>> >> >>   Raid Devices : 2
>> >> >>  Total Devices : 3
>> >> >>    Persistence : Superblock is persistent
>> >> >>
>> >> >>    Update Time : Wed Nov  2 11:48:13 2016
>> >> >>          State : clean
>> >> >> Active Devices : 2
>> >> >> Working Devices : 3
>> >> >> Failed Devices : 0
>> >> >>  Spare Devices : 1
>> >> >>
>> >> >>         Layout : near=2
>> >> >>     Chunk Size : 512K
>> >> >>
>> >> >>           Name : rleblanc-pc:13  (local to host rleblanc-pc)
>> >> >>           UUID : 1eb66d7c:21308453:1e731c8b:1c43dd55
>> >> >>         Events : 1
>> >> >>
>> >> >>    Number   Major   Minor   RaidDevice State
>> >> >>       0       7        2        0      active sync set-A   /dev/loop2
>> >> >>       1       7        3        1      active sync set-B   /dev/loop3
>> >> >>
>> >> >>       2       7        4        -      spare   /dev/loop4
>> >> >>
>> >> >> # mdadm --grow /dev/md13 -p n3 --raid-devices 3
>> >> >> mdadm: Cannot change number of copies when reshaping RAID10
>> >> >>
>> >> >> I also tried to add the device, grow raid-devices, let it reshape,
>> >> >> then try to change the number of copies and it didn't like that
>> >> >> either. It would be nice to supply -p nX and --raid-devices X at the
>> >> >> same time to prevent the reshape and only copy the data over to the
>> >> >> new drive (or drop a drive out completely). I could see changing -p
>> >> >> separately or at a different rate of drives added/removed could be
>> >> >> difficult, but for lockstep changes, it seems that it would be rather
>> >> >> easy.
>> >> >>
>> >> >> Any ideas?
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> ----------------
>> >> >> Robert LeBlanc
>> >> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

* Re: Issue with growing RAID10
From: Robin Hill @ 2016-11-02 20:41 UTC
  To: Robert LeBlanc; +Cc: linux-raid
In-Reply-To: <CAANLjFpQ-s4E3dKmaEEp0DcWaSQx21-Leafukir6idET++zMsg@mail.gmail.com>

On Wed Nov 02, 2016 at 01:56:02pm -0600, Robert LeBlanc wrote:

> Yes, we can have any number of disks in a RAID1 (we currently have
> three), but reads only ever come from the first drive.
> 
How are you testing? I use RAID1 on a number of systems and reads
look to be pretty evenly spread across the drives.
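
One way to answer the "how are you testing" question independently of fio or iostat is to read the kernel's own per-device counters. What follows is only a minimal sketch. It assumes the array members appear under /sys/block/<md>/slaves and that the first field of each member's stat file is the number of completed read I/Os. The function name member_reads and the SYSFS_ROOT override are made up for illustration.

```shell
#!/bin/sh
# Print the cumulative "reads completed" counter for each member of an md array.
# $1 = array name without /dev/, e.g. md13.
# SYSFS_ROOT is a hypothetical override so the function can be tried on a fake tree.
member_reads() {
    md="$1"
    root="${SYSFS_ROOT:-/sys}"
    for slave in "$root/block/$md/slaves"/*; do
        # Skip entries without a readable stat file (also covers an unmatched glob).
        [ -r "$slave/stat" ] || continue
        # Field 1 of the stat file is the number of read I/Os completed.
        read -r reads _rest < "$slave/stat"
        printf '%s %s\n' "$(basename "$slave")" "$reads"
    done
}
```

Running `member_reads md13` before and after a test and subtracting the two samples gives the number of reads each member served during the run.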

Cheers,
    Robin

* Re: Issue with growing RAID10
From: Robert LeBlanc @ 2016-11-02 20:59 UTC
  To: Robert LeBlanc, linux-raid
In-Reply-To: <20161102204153.GA23899@cthulhu.home.robinhill.me.uk>

root@rleblanc-pc:~# losetup -l
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE   DIO
/dev/loop1         0      0         0  0 /root/junk1   0
/dev/loop4         0      0         0  0 /root/junk4   0
/dev/loop2         0      0         0  0 /root/junk2   0
/dev/loop5         0      0         0  0 /root/junk5   0
/dev/loop3         0      0         0  0 /root/junk3   0
root@rleblanc-pc:~# mdadm --create /dev/md13 --level 1 --raid-devices 4 --run /dev/loop{1..4}
mdadm: Note: this array has metadata at the start and
   may not be suitable as a boot device.  If you plan to
   store '/boot' on this device please ensure that
   your boot-loader understands md/v1.x metadata, or use
   --metadata=0.90
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md13 started.
root@rleblanc-pc:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md13 : active raid1 loop4[3] loop3[2] loop2[1] loop1[0]
     10477568 blocks super 1.2 [4/4] [UUUU]

unused devices: <none>
root@rleblanc-pc:~# mkfs.ext4 /dev/md13
mke2fs 1.43.3 (04-Sep-2016)
Discarding device blocks: done
Creating filesystem with 2619392 4k blocks and 655360 inodes
Filesystem UUID: 3bb68653-50af-492f-a3d4-8d0a5f2f4ca4
Superblock backups stored on blocks:
       32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

root@rleblanc-pc:~# mkdir junk
root@rleblanc-pc:~# mount /dev/md13 junk
root@rleblanc-pc:~# cd junk
root@rleblanc-pc:~/junk# fio -rw=read --size=5G --name=mdadm_test
mdadm_test: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.10
Starting 1 process
mdadm_test: Laying out IO file(s) (1 file(s) / 5120MB)
Jobs: 1 (f=1): [R(1)] [100.0% done] [338.3MB/0KB/0KB /s] [86.6K/0/0 iops] [eta 00m:00s]
mdadm_test: (groupid=0, jobs=1): err= 0: pid=18198: Wed Nov  2 14:54:20 2016
 read : io=5120.0MB, bw=483750KB/s, iops=120937, runt= 10838msec
   clat (usec): min=0, max=21384, avg= 7.98, stdev=108.10
    lat (usec): min=0, max=21384, avg= 8.02, stdev=108.10
   clat percentiles (usec):
    |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
    | 30.00th=[    0], 40.00th=[    0], 50.00th=[    1], 60.00th=[    1],
    | 70.00th=[    1], 80.00th=[    1], 90.00th=[    1], 95.00th=[    1],
    | 99.00th=[  274], 99.50th=[  386], 99.90th=[  828], 99.95th=[ 2704],
    | 99.99th=[ 4640]
   bw (KB  /s): min=324608, max=748032, per=95.94%, avg=464090.29, stdev=120877.09
   lat (usec) : 2=95.25%, 4=3.09%, 10=0.06%, 20=0.02%, 50=0.09%
   lat (usec) : 100=0.01%, 250=0.35%, 500=0.88%, 750=0.13%, 1000=0.02%
   lat (msec) : 2=0.01%, 4=0.06%, 10=0.01%, 20=0.01%, 50=0.01%
 cpu          : usr=5.02%, sys=12.25%, ctx=19708, majf=0, minf=10
 IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
    latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  READ: io=5120.0MB, aggrb=483749KB/s, minb=483749KB/s, maxb=483749KB/s, mint=10838msec, maxt=10838msec

Disk stats (read/write):
   md13: ios=60029/3, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=15360/6, aggrmerge=0/0, aggrticks=13502/101, aggrin_queue=13600, aggrutil=98.75%
 loop1: ios=61427/6, merge=0/0, ticks=54008/116, in_queue=54112, util=98.75%
 loop4: ios=0/6, merge=0/0, ticks=0/92, in_queue=92, util=0.84%
 loop2: ios=16/6, merge=0/0, ticks=0/104, in_queue=104, util=0.95%
 loop3: ios=0/6, merge=0/0, ticks=0/92, in_queue=92, util=0.84%

Device    rrqm/s   wrqm/s        r/s      w/s      rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
nvme0n1     0.00  1206.50    3517.50  2018.50  446660.00  12878.00    166.02      1.60   0.29     0.42     0.06   0.17   93.00
loop1       0.00     0.00    5233.50     0.00  446536.25      0.00    170.65      5.01   0.96     0.96     0.00   0.19  100.00
loop2       0.00     0.00       1.00     0.00     120.00      0.00    240.00      0.00   0.00     0.00     0.00   0.00    0.00
loop3       0.00     0.00       0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00    0.00
loop4       0.00     0.00       0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00    0.00
loop5       0.00     0.00       0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00    0.00
md13        0.00     0.00    5235.00     0.00  446720.00      0.00    170.67      0.00   0.00     0.00     0.00   0.00    0.00

root@rleblanc-pc:~/junk# fio -rw=randread --size=5G --name=mdadm_test
mdadm_test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.10
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [444.5MB/0KB/0KB /s] [114K/0/0 iops] [eta 00m:00s]
mdadm_test: (groupid=0, jobs=1): err= 0: pid=18924: Wed Nov  2 14:55:16 2016
 read : io=5120.0MB, bw=463890KB/s, iops=115972, runt= 11302msec
   clat (usec): min=4, max=15649, avg= 8.03, stdev=37.76
    lat (usec): min=4, max=15649, avg= 8.07, stdev=37.76
   clat percentiles (usec):
    |  1.00th=[    5],  5.00th=[    5], 10.00th=[    6], 20.00th=[    6],
    | 30.00th=[    6], 40.00th=[    6], 50.00th=[    7], 60.00th=[    7],
    | 70.00th=[    7], 80.00th=[    8], 90.00th=[    9], 95.00th=[   10],
    | 99.00th=[   17], 99.50th=[   95], 99.90th=[  151], 99.95th=[  179],
    | 99.99th=[ 1528]
   bw (KB  /s): min=237416, max=543576, per=99.67%, avg=462350.91, stdev=62842.83
   lat (usec) : 10=93.06%, 20=6.09%, 50=0.25%, 100=0.13%, 250=0.45%
   lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
   lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%
 cpu          : usr=12.39%, sys=46.90%, ctx=1310616, majf=1, minf=9
 IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
    latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  READ: io=5120.0MB, aggrb=463889KB/s, minb=463889KB/s, maxb=463889KB/s, mint=11302msec, maxt=11302msec

Disk stats (read/write):
   md13: ios=1303936/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=327680/0, aggrmerge=0/0, aggrticks=1635/0, aggrin_queue=1621, aggrutil=56.53%
 loop1: ios=1310359/0, merge=0/0, ticks=6504/0, in_queue=6448, util=56.53%
 loop4: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
 loop2: ios=361/0, merge=0/0, ticks=36/0, in_queue=36, util=0.32%
 loop3: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

Device    rrqm/s   wrqm/s        r/s      w/s      rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
nvme0n1     0.00     8.50    1255.00     9.50    7552.00     64.00     12.05      0.23   0.18     0.17     1.68   0.12   15.60
loop1       0.00     0.00  115485.50     0.00  461942.00      0.00      8.00      0.63   0.01     0.01     0.00   0.01   62.80
loop2       0.00     0.00      31.50     0.00     126.00      0.00      8.00      0.00   0.00     0.00     0.00   0.00    0.00
loop3       0.00     0.00       0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00    0.00
loop4       0.00     0.00       0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00    0.00
loop5       0.00     0.00       0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00    0.00
md13        0.00     0.00  115512.50     0.00  462050.00      0.00      8.00      0.00   0.00     0.00     0.00   0.00    0.00

This is indicative of what we see in production as well. As you can
see, fio's per-device stats closely match what iostat shows. I don't
know how you are seeing even reads. I've seen this on both CentOS and
Debian.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Nov 2, 2016 at 2:41 PM, Robin Hill <robin@robinhill.me.uk> wrote:
> On Wed Nov 02, 2016 at 01:56:02pm -0600, Robert LeBlanc wrote:
>
>> Yes, we can have any number of disks in a RAID1 (we currently have
>> three), but reads only ever come from the first drive.
>>
> How are you testing? I use RAID1 on a number of systems and reads
> look to be pretty evenly spread across the drives.
>
> Cheers,
>     Robin

* Re: Issue with growing RAID10
From: Andreas Klauer @ 2016-11-02 21:00 UTC
  To: Robert LeBlanc; +Cc: linux-raid
In-Reply-To: <CAANLjFpQ-s4E3dKmaEEp0DcWaSQx21-Leafukir6idET++zMsg@mail.gmail.com>

On Wed, Nov 02, 2016 at 01:56:02PM -0600, Robert LeBlanc wrote:
> Yes, we can have any number of disks in a RAID1 (we currently have
> three), but reads only ever come from the first drive.

Only if there's only one reader. So it depends on what activity 
there is on the machine. 
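
The single-reader explanation can be checked by repeating the random-read test with several concurrent jobs. The following is a sketch only. It assumes fio's --numjobs, --group_reporting and --direct options; --direct=1 is an addition here to keep the page cache out of the measurement, and the helper name fio_multi_reader_cmd is hypothetical. It prints the command rather than running it, so it can be reviewed first.

```shell
#!/bin/sh
# Print (do not run) a fio command line that uses several concurrent random readers.
# $1 = number of reader jobs (defaults to 4, an arbitrary choice).
fio_multi_reader_cmd() {
    jobs="${1:-4}"
    # --direct=1 is an assumption added here to bypass the page cache.
    printf 'fio --name=mdadm_test --rw=randread --size=5G --numjobs=%s --direct=1 --group_reporting\n' "$jobs"
}

fio_multi_reader_cmd 4
```

If the per-member counters even out with several jobs, the single-reader explanation fits. If one member still does nearly all the work, something else is going on.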

> We just need the option to grow a RAID10 like we can with RAID1.

Patches welcome, I'm sure? ;-)

> Basically, we want to be super paranoid with several identical copies 
> of the data and get extra read performance.

You could put RAID on RAID and thus achieve other modes, but I'm not
sure it's worth the overhead or even applies to your use case, and
non-standard setups always come with their own pitfalls.

RAID 1, with RAID0 on top, three disks ABC, two partitions ab,
different disk order.

  A B C
a 1 2 3
b 3 1 2

Three RAID 1 md1, md2, md3, (and md0 a RAID-0 on top).

You can grow it.

  A B C D
a 1 2 3 ?
b 3 1 2 ?

  A B C D
a 1 2 3 ?
b 3 1 2 3

md3 has 3 disks temporarily here.

  A B C D
a 1 2 3 4
b 4 1 2 3

md4 is new, to be added to md0.

Three copies? Same thing with three partitions.

Will it help any or make things worse? I dunno.
Have to be careful to make md0 assemble last.

Could also be RAID5 on top instead of RAID0.
That's even stranger though.
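
The rotation in the tables above follows a simple rule: each further row is the first row shifted one more disk to the right. A small sketch that prints the table for any number of disks and partitions per disk follows. The function name rotated_layout is made up, and it only prints the layout; it does not create or touch any arrays.

```shell
#!/bin/sh
# Print which RAID1 array each partition would belong to.
# $1 = number of disks, $2 = partitions per disk (= copies of each array).
# Assumes $2 is not larger than $1.
rotated_layout() {
    n="$1"
    copies="$2"
    r=0
    while [ "$r" -lt "$copies" ]; do
        line=""
        d=0
        while [ "$d" -lt "$n" ]; do
            # Shift row r by r disks so the copies of one array land on different disks.
            line="$line$(( (d - r + n) % n + 1 )) "
            d=$((d + 1))
        done
        printf '%s\n' "${line% }"
        r=$((r + 1))
    done
}

# Three disks, two partitions each: prints "1 2 3" then "3 1 2".
rotated_layout 3 2
```

`rotated_layout 4 2` prints the final four-disk table. With hypothetical disks sda, sdb and sdc, array 1 would then be built from sda1 and sdb2.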

Regards
Andreas Klauer

* Re: Issue with growing RAID10
From: Robert LeBlanc @ 2016-11-02 21:11 UTC
  To: Robert LeBlanc, linux-raid
In-Reply-To: <CAANLjFoXWyjBnDyxtAcTYrVxrpSp4YFYFnpWXG-cDkxouw9QGg@mail.gmail.com>

As a comparison, here is a RAID10 n4 with 4 disks....

root@rleblanc-pc:~# mdadm --detail /dev/md14
/dev/md14:
       Version : 1.2
 Creation Time : Wed Nov  2 15:01:09 2016
    Raid Level : raid10
    Array Size : 10477568 (9.99 GiB 10.73 GB)
 Used Dev Size : 10477568 (9.99 GiB 10.73 GB)
  Raid Devices : 4
 Total Devices : 4
   Persistence : Superblock is persistent

   Update Time : Wed Nov  2 15:01:28 2016
         State : clean, resyncing
Active Devices : 4
Working Devices : 4
Failed Devices : 0
 Spare Devices : 0

        Layout : near=4
    Chunk Size : 512K

 Resync Status : 38% complete

          Name : rleblanc-pc:14  (local to host rleblanc-pc)
          UUID : 61114475:19a4404b:07b0a66d:a0e4447a
        Events : 6

   Number   Major   Minor   RaidDevice State
      0       7       11        0      active sync set-A   /dev/loop11
      1       7       12        1      active sync set-B   /dev/loop12
      2       7       13        2      active sync set-C   /dev/loop13
      3       7       14        3      active sync set-D   /dev/loop14

root@rleblanc-pc:~/junk# fio -rw=read --size=5G --name=mdadm_test
mdadm_test: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.10
Starting 1 process
mdadm_test: Laying out IO file(s) (1 file(s) / 5120MB)
Jobs: 1 (f=1): [R(1)] [100.0% done] [238.3MB/0KB/0KB /s] [60.1K/0/0 iops] [eta 00m:00s]
mdadm_test: (groupid=0, jobs=1): err= 0: pid=19925: Wed Nov  2 15:08:15 2016
 read : io=5120.0MB, bw=343278KB/s, iops=85819, runt= 15273msec
   clat (usec): min=0, max=25847, avg=11.16, stdev=237.64
    lat (usec): min=0, max=25847, avg=11.23, stdev=237.64
   clat percentiles (usec):
    |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
    | 30.00th=[    1], 40.00th=[    1], 50.00th=[    1], 60.00th=[    1],
    | 70.00th=[    1], 80.00th=[    1], 90.00th=[    2], 95.00th=[    2],
    | 99.00th=[    4], 99.50th=[    8], 99.90th=[ 2992], 99.95th=[ 4080],
    | 99.99th=[11456]
   bw (KB  /s): min=240136, max=528384, per=100.00%, avg=345144.53, stdev=83065.30
   lat (usec) : 2=82.29%, 4=16.62%, 10=0.63%, 20=0.05%, 50=0.03%
   lat (usec) : 100=0.01%, 250=0.01%, 500=0.04%, 750=0.02%, 1000=0.01%
   lat (msec) : 2=0.08%, 4=0.15%, 10=0.04%, 20=0.01%, 50=0.01%
 cpu          : usr=5.71%, sys=14.59%, ctx=4480, majf=0, minf=11
 IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
    latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  READ: io=5120.0MB, aggrb=343277KB/s, minb=343277KB/s, maxb=343277KB/s, mint=15273msec, maxt=15273msec

Disk stats (read/write):
   md14: ios=46045/3, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=11520/7, aggrmerge=0/0, aggrticks=85659/98, aggrin_queue=85756, aggrutil=80.49%
 loop13: ios=17421/7, merge=0/0, ticks=133600/132, in_queue=133732, util=74.76%
 loop11: ios=4006/7, merge=0/0, ticks=22572/80, in_queue=22648, util=45.68%
 loop14: ios=19532/7, merge=0/0, ticks=154152/112, in_queue=154268, util=80.49%
 loop12: ios=5124/7, merge=0/0, ticks=32312/68, in_queue=32376, util=49.54%

Device    rrqm/s   wrqm/s        r/s      w/s      rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
nvme0n1    46.50  1459.00    4351.00  2990.50  386402.00  17792.00    110.11      3.94   0.53     0.86     0.05   0.13   94.80
loop11      0.00     0.00     252.50     0.00   29785.50      0.00    235.92      1.77   6.82     6.82     0.00   1.89   47.60
loop12      0.00     0.00     260.50     0.00   30805.50      0.00    236.51      2.00   7.66     7.66     0.00   1.88   49.00
loop13      0.00     0.00     905.00     0.00  102173.00      0.00    225.80      8.08   8.95     8.95     0.00   0.80   72.80
loop14      0.00     0.00    1074.50     0.00  120820.25      0.00    224.89     10.61   9.90     9.90     0.00   0.78   83.60
loop15      0.00     0.00       0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00    0.00
md14        0.00     0.00    2493.00     0.00  283648.00      0.00    227.56      0.00   0.00     0.00     0.00   0.00    0.00

root@rleblanc-pc:~/junk# fio -rw=randread --size=5G --name=mdadm_test
mdadm_test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.10
Starting 1 process
Jobs: 1 (f=1): [r(1)] [97.7% done] [195.4MB/0KB/0KB /s] [49.1K/0/0 iops] [eta 00m:02s]
mdadm_test: (groupid=0, jobs=1): err= 0: pid=19953: Wed Nov  2 15:10:18 2016
 read : io=5120.0MB, bw=62013KB/s, iops=15503, runt= 84545msec
   clat (usec): min=4, max=11510, avg=63.40, stdev=96.01
    lat (usec): min=4, max=11510, avg=63.47, stdev=96.03
   clat percentiles (usec):
    |  1.00th=[    6],  5.00th=[    7], 10.00th=[    8], 20.00th=[    8],
    | 30.00th=[    9], 40.00th=[   11], 50.00th=[   17], 60.00th=[   61],
    | 70.00th=[  102], 80.00th=[  122], 90.00th=[  155], 95.00th=[  185],
    | 99.00th=[  258], 99.50th=[  298], 99.90th=[  494], 99.95th=[ 1816],
    | 99.99th=[ 3056]
   bw (KB  /s): min=22992, max=227816, per=99.90%, avg=61952.96, stdev=53309.04
   lat (usec) : 10=33.36%, 20=18.05%, 50=7.94%, 100=9.46%, 250=29.99%
   lat (usec) : 500=1.09%, 750=0.02%, 1000=0.01%
   lat (msec) : 2=0.03%, 4=0.04%, 10=0.01%, 20=0.01%
 cpu          : usr=2.63%, sys=13.01%, ctx=1310641, majf=0, minf=9
 IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
    latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  READ: io=5120.0MB, aggrb=62012KB/s, minb=62012KB/s, maxb=62012KB/s, mint=84545msec, maxt=84545msec

Disk stats (read/write):
   md14: ios=1304718/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=327680/0, aggrmerge=0/0, aggrticks=18719/0, aggrin_queue=18689, aggrutil=88.37%
 loop13: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
 loop11: ios=1310108/0, merge=0/0, ticks=74856/0, in_queue=74736, util=88.37%
 loop14: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
 loop12: ios=612/0, merge=0/0, ticks=20/0, in_queue=20, util=0.02%

Device    rrqm/s   wrqm/s        r/s      w/s      rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
nvme0n1    11.00     0.00    7046.00     0.00   30048.00      0.00      8.53      0.60   0.09     0.09     0.00   0.08   59.20
loop11      0.00     0.00    7953.00     0.00   31812.00      0.00      8.00      0.88   0.11     0.11     0.00   0.11   88.40
loop12      0.00     0.00       3.50     0.00      14.00      0.00      8.00      0.00   0.00     0.00     0.00   0.00    0.00
loop13      0.00     0.00       0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00    0.00
loop14      0.00     0.00       0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00    0.00
loop15      0.00     0.00       0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00    0.00
md14        0.00     0.00    7956.50     0.00   31826.00      0.00      8.00      0.00   0.00     0.00     0.00   0.00    0.00

So sequential reads are being spread out somewhat, though not
completely evenly. Random reads look almost like RAID1, with only one
disk doing all the work.
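
To put a number on that, here is a minimal sketch (not part of the
original test run) that turns the per-device "ios=" counters from the
fio disk stats above into read shares. The regular expression and the
read_shares() helper name are assumptions made for illustration; it
only uses the read half of ios=reads/writes and skips the md* summary
line.

```python
import re

# Device lines copied from the fio "Disk stats" output above
# (the wrapped md14 summary line is shortened to its first fields).
FIO_DISK_STATS = """\
   md14: ios=1304718/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
 loop13: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
 loop11: ios=1310108/0, merge=0/0, ticks=74856/0, in_queue=74736, util=88.37%
 loop14: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
 loop12: ios=612/0, merge=0/0, ticks=20/0, in_queue=20, util=0.02%
"""

# Matches "<device>: ios=<reads>/<writes>" at the start of a line.
LINE_RE = re.compile(r"^\s*(\w+): ios=(\d+)/(\d+)", re.MULTILINE)


def read_shares(text):
    """Return {device: fraction of all member reads}, skipping md* lines."""
    reads = {
        name: int(read_count)
        for name, read_count, _writes in LINE_RE.findall(text)
        if not name.startswith("md")
    }
    total = sum(reads.values())
    return {name: (count / total if total else 0.0) for name, count in reads.items()}


if __name__ == "__main__":
    for name, share in sorted(read_shares(FIO_DISK_STATS).items()):
        print(f"{name}: {share:.2%}")
```

Pasting the disk stats of any other run in this thread into
FIO_DISK_STATS gives the same kind of breakdown for that run.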
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Nov 2, 2016 at 2:59 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> root@rleblanc-pc:~# losetup -l
> NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE   DIO
> /dev/loop1         0      0         0  0 /root/junk1   0
> /dev/loop4         0      0         0  0 /root/junk4   0
> /dev/loop2         0      0         0  0 /root/junk2   0
> /dev/loop5         0      0         0  0 /root/junk5   0
> /dev/loop3         0      0         0  0 /root/junk3   0
> root@rleblanc-pc:~# mdadm --create /dev/md13 --level 1 --raid-devices
> 4 --run /dev/loop{1..4}
> mdadm: Note: this array has metadata at the start and
>    may not be suitable as a boot device.  If you plan to
>    store '/boot' on this device please ensure that
>    your boot-loader understands md/v1.x metadata, or use
>    --metadata=0.90
> mdadm: Defaulting to version 1.2 metadata
> mdadm: array /dev/md13 started.
> root@rleblanc-pc:~# cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
> [raid4] [raid10]
> md13 : active raid1 loop4[3] loop3[2] loop2[1] loop1[0]
>      10477568 blocks super 1.2 [4/4] [UUUU]
>
> unused devices: <none>
> root@rleblanc-pc:~# mkfs.ext4 /dev/md13
> mke2fs 1.43.3 (04-Sep-2016)
> Discarding device blocks: done
> Creating filesystem with 2619392 4k blocks and 655360 inodes
> Filesystem UUID: 3bb68653-50af-492f-a3d4-8d0a5f2f4ca4
> Superblock backups stored on blocks:
>        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632
>
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (16384 blocks): done
> Writing superblocks and filesystem accounting information: done
>
> root@rleblanc-pc:~# mkdir junk
> root@rleblanc-pc:~# mount /dev/md13 junk
> root@rleblanc-pc:~# cd junk
> root@rleblanc-pc:~/junk# fio -rw=read --size=5G --name=mdadm_test
> mdadm_test: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
> fio-2.10
> Starting 1 process
> mdadm_test: Laying out IO file(s) (1 file(s) / 5120MB)
> Jobs: 1 (f=1): [R(1)] [100.0% done] [338.3MB/0KB/0KB /s] [86.6K/0/0
> iops] [eta 00m:00s]
> mdadm_test: (groupid=0, jobs=1): err= 0: pid=18198: Wed Nov  2 14:54:20 2016
>  read : io=5120.0MB, bw=483750KB/s, iops=120937, runt= 10838msec
>    clat (usec): min=0, max=21384, avg= 7.98, stdev=108.10
>     lat (usec): min=0, max=21384, avg= 8.02, stdev=108.10
>    clat percentiles (usec):
>     |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
>     | 30.00th=[    0], 40.00th=[    0], 50.00th=[    1], 60.00th=[    1],
>     | 70.00th=[    1], 80.00th=[    1], 90.00th=[    1], 95.00th=[    1],
>     | 99.00th=[  274], 99.50th=[  386], 99.90th=[  828], 99.95th=[ 2704],
>     | 99.99th=[ 4640]
>    bw (KB  /s): min=324608, max=748032, per=95.94%, avg=464090.29,
> stdev=120877.09
>    lat (usec) : 2=95.25%, 4=3.09%, 10=0.06%, 20=0.02%, 50=0.09%
>    lat (usec) : 100=0.01%, 250=0.35%, 500=0.88%, 750=0.13%, 1000=0.02%
>    lat (msec) : 2=0.01%, 4=0.06%, 10=0.01%, 20=0.01%, 50=0.01%
>  cpu          : usr=5.02%, sys=12.25%, ctx=19708, majf=0, minf=10
>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>     latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   READ: io=5120.0MB, aggrb=483749KB/s, minb=483749KB/s,
> maxb=483749KB/s, mint=10838msec, maxt=10838msec
>
> Disk stats (read/write):
>    md13: ios=60029/3, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=15360/6, aggrmerge=0/0, aggrticks=13502/101,
> aggrin_queue=13600, aggrutil=98.75%
>  loop1: ios=61427/6, merge=0/0, ticks=54008/116, in_queue=54112, util=98.75%
>  loop4: ios=0/6, merge=0/0, ticks=0/92, in_queue=92, util=0.84%
>  loop2: ios=16/6, merge=0/0, ticks=0/104, in_queue=104, util=0.95%
>  loop3: ios=0/6, merge=0/0, ticks=0/92, in_queue=92, util=0.84%
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> nvme0n1           0.00  1206.50 3517.50 2018.50 446660.00 12878.00
> 166.02     1.60    0.29    0.42    0.06   0.17  93.00
> loop1             0.00     0.00 5233.50    0.00 446536.25     0.00
> 170.65     5.01    0.96    0.96    0.00   0.19 100.00
> loop2             0.00     0.00    1.00    0.00   120.00     0.00
> 240.00     0.00    0.00    0.00    0.00   0.00   0.00
> loop3             0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> loop4             0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> loop5             0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md13              0.00     0.00 5235.00    0.00 446720.00     0.00
> 170.67     0.00    0.00    0.00    0.00   0.00   0.00
>
> root@rleblanc-pc:~/junk# fio -rw=randread --size=5G --name=mdadm_test
> mdadm_test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
> fio-2.10
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [444.5MB/0KB/0KB /s] [114K/0/0
> iops] [eta 00m:00s]
> mdadm_test: (groupid=0, jobs=1): err= 0: pid=18924: Wed Nov  2 14:55:16 2016
>  read : io=5120.0MB, bw=463890KB/s, iops=115972, runt= 11302msec
>    clat (usec): min=4, max=15649, avg= 8.03, stdev=37.76
>     lat (usec): min=4, max=15649, avg= 8.07, stdev=37.76
>    clat percentiles (usec):
>     |  1.00th=[    5],  5.00th=[    5], 10.00th=[    6], 20.00th=[    6],
>     | 30.00th=[    6], 40.00th=[    6], 50.00th=[    7], 60.00th=[    7],
>     | 70.00th=[    7], 80.00th=[    8], 90.00th=[    9], 95.00th=[   10],
>     | 99.00th=[   17], 99.50th=[   95], 99.90th=[  151], 99.95th=[  179],
>     | 99.99th=[ 1528]
>    bw (KB  /s): min=237416, max=543576, per=99.67%, avg=462350.91,
> stdev=62842.83
>    lat (usec) : 10=93.06%, 20=6.09%, 50=0.25%, 100=0.13%, 250=0.45%
>    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
>    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%
>  cpu          : usr=12.39%, sys=46.90%, ctx=1310616, majf=1, minf=9
>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>     latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   READ: io=5120.0MB, aggrb=463889KB/s, minb=463889KB/s,
> maxb=463889KB/s, mint=11302msec, maxt=11302msec
>
> Disk stats (read/write):
>    md13: ios=1303936/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=327680/0, aggrmerge=0/0, aggrticks=1635/0, aggrin_queue=1621,
> aggrutil=56.53%
>  loop1: ios=1310359/0, merge=0/0, ticks=6504/0, in_queue=6448, util=56.53%
>  loop4: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  loop2: ios=361/0, merge=0/0, ticks=36/0, in_queue=36, util=0.32%
>  loop3: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> nvme0n1           0.00     8.50 1255.00    9.50  7552.00    64.00
> 12.05     0.23    0.18    0.17    1.68   0.12  15.60
> loop1             0.00     0.00 115485.50    0.00 461942.00     0.00
>   8.00     0.63    0.01    0.01    0.00   0.01  62.80
> loop2             0.00     0.00   31.50    0.00   126.00     0.00
> 8.00     0.00    0.00    0.00    0.00   0.00   0.00
> loop3             0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> loop4             0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> loop5             0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md13              0.00     0.00 115512.50    0.00 462050.00     0.00
>   8.00     0.00    0.00    0.00    0.00   0.00   0.00
>
> This is indicative of what we see in production as well. As you can
> see fio closely matches what iostat shows as far as device work. I
> don't know how you are seeing even reads. I've seen this on both
> CentOS and Debian.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Nov 2, 2016 at 2:41 PM, Robin Hill <robin@robinhill.me.uk> wrote:
>> On Wed Nov 02, 2016 at 01:56:02pm -0600, Robert LeBlanc wrote:
>>
>>> Yes, we can have any number of disks in a RAID1 (we currently have
>>> three), but reads only ever come from the first drive.
>>>
>> How are you testing? I use RAID1 on a number of systems and reads
>> look to be pretty evenly spread across the drives.
>>
>> Cheers,
>>     Robin

* Re: [md PATCH 0/9] replace printk() with pr_*()
From: NeilBrown @ 2016-11-02 21:14 UTC (permalink / raw)
  To: Hannes Reinecke, Shaohua Li; +Cc: linux-raid
In-Reply-To: <5cd6d570-6d72-07b4-34a4-5b75503fcbf4@suse.de>

On Wed, Nov 02 2016, Hannes Reinecke wrote:

> On 11/02/2016 04:16 AM, NeilBrown wrote:
>> This series removes all printk() calls from md code, preferring
>> pr_warn(), pr_err() etc.
>>
>> All strings that were split over multiple lines are now joined
>> back together because being able to search for a message is more
>> important than not having long lines in the code.
>> Lots of printk(KERN_DEBUG... are now pr_debug() which means the
>> messages won't get printed unless they are explicitly enabled.
>>
>> I included some rough guidelines on which pr_* to choose in md.c.
>> Many things became pr_debug(), most of the rest are pr_warn().
>> pr_err() and pr_crit() are used sparingly.
>>
>> I simplified some code in multipath.c too.
>>
>> A particular benefit of this is that the various "print_conf()"
>> functions are now silent by default.
>> On very large arrays (raid10 with hundreds of devices), these can
>> be very noisy.
>>
> Any specific reason why you didn't move to the dev_*() style of errors?
> Generally I prefer having the messages prefixed with the device which
> caused them; it makes debugging _so_ much easier ...

I was already a bit concerned about how much change was in the one set
of patches:
 - rejoining large strings
 - changing printk to pr_*
 - revising log levels

I wouldn't want to add more change.  As something to be added
afterwards, I support the idea in principle, but in practice.....
What device would you pass to dev_*() ??
  disk_to_dev(mddev->gendisk)
  ??

That wouldn't work when dm-raid is in the picture, as mddev->gendisk is
NULL in that case.

In some cases we already include the md_name().  If there are specific
places where it could be added, I would certainly support adding the dev
name.  I just don't think dev_*() is the way to do it for md.

Thanks,
NeilBrown


* Re: Issue with growing RAID10
From: Robert LeBlanc @ 2016-11-02 21:27 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid
In-Reply-To: <20161102210022.GA19707@metamorpher.de>

hmmm....

RAID1
root@rleblanc-pc:~/junk# fio -rw=read --size=1G --numjobs=4 --name=mdadm_test --group_reporting
mdadm_test: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.10
Starting 4 processes
mdadm_test: Laying out IO file(s) (1 file(s) / 1024MB)
mdadm_test: Laying out IO file(s) (1 file(s) / 1024MB)
mdadm_test: Laying out IO file(s) (1 file(s) / 1024MB)
Jobs: 1 (f=1): [R(1),_(3)] [88.9% done] [423.8MB/0KB/0KB /s] [108K/0/0 iops] [eta 00m:01s]
mdadm_test: (groupid=0, jobs=4): err= 0: pid=20564: Wed Nov  2 15:15:40 2016
 read : io=4096.0MB, bw=567642KB/s, iops=141910, runt=  7389msec
   clat (usec): min=0, max=22233, avg=23.02, stdev=288.38
    lat (usec): min=0, max=22233, avg=23.12, stdev=288.38
   clat percentiles (usec):
    |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    1],
    | 30.00th=[    1], 40.00th=[    1], 50.00th=[    1], 60.00th=[    2],
    | 70.00th=[    2], 80.00th=[    2], 90.00th=[    2], 95.00th=[    3],
    | 99.00th=[  644], 99.50th=[ 1144], 99.90th=[ 4128], 99.95th=[ 5600],
    | 99.99th=[11584]
   bw (KB  /s): min=94396, max=469418, per=28.62%, avg=162451.40, stdev=81106.83
   lat (usec) : 2=58.15%, 4=39.21%, 10=0.87%, 20=0.09%, 50=0.16%
   lat (usec) : 100=0.13%, 250=0.14%, 500=0.13%, 750=0.26%, 1000=0.29%
   lat (msec) : 2=0.29%, 4=0.20%, 10=0.09%, 20=0.01%, 50=0.01%
 cpu          : usr=4.14%, sys=10.87%, ctx=15564, majf=0, minf=41
 IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued    : total=r=1048576/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
    latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  READ: io=4096.0MB, aggrb=567641KB/s, minb=567641KB/s, maxb=567641KB/s, mint=7389msec, maxt=7389msec

Disk stats (read/write):
   md13: ios=48375/3, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=12292/6, aggrmerge=0/0, aggrticks=31009/140, aggrin_queue=31145, aggrutil=97.41%
 loop1: ios=14654/6, merge=0/0, ticks=39524/156, in_queue=39672, util=97.41%
 loop4: ios=5791/6, merge=0/0, ticks=13976/100, in_queue=14072, util=45.45%
 loop2: ios=16575/6, merge=0/0, ticks=37360/152, in_queue=37508, util=90.92%
 loop3: ios=12150/6, merge=0/0, ticks=33176/152, in_queue=33328, util=91.08%

Device:     rrqm/s  wrqm/s      r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
nvme0n1       0.50 1387.00  3234.00 2996.50 388746.00 17500.00   130.41     4.44   0.71    1.29    0.09   0.16  98.40
loop1         0.00    0.00  1510.00    2.50 128839.75     6.50   170.38     5.10   3.37    3.34   24.80   0.66 100.00
loop2         0.00    0.00  1570.00    2.50 133952.25     6.50   170.38     5.22   3.31    3.27   25.60   0.64 100.00
loop3         0.00    0.00  1521.50    2.50 129855.75     6.50   170.42     5.00   3.27    3.24   25.60   0.65  98.60
loop4         0.00    0.00     2.50    2.50    248.00     6.50   101.80     0.04   8.40    1.60   15.20   8.00   4.00
loop5         0.00    0.00     0.00    0.00      0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
md13          0.00    0.00  4603.50    1.50 392832.00     6.00   170.61     0.00   0.00    0.00    0.00   0.00   0.00

root@rleblanc-pc:~/junk# fio -rw=randread --size=1G --numjobs=4 --name=mdadm_test --group_reporting
mdadm_test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.10
Starting 4 processes
Jobs: 1 (f=1): [_(3),r(1)] [100.0% done] [35996KB/0KB/0KB /s] [8999/0/0 iops] [eta 00m:00s]
mdadm_test: (groupid=0, jobs=4): err= 0: pid=21036: Wed Nov  2 15:17:47 2016
 read : io=4096.0MB, bw=133254KB/s, iops=33313, runt= 31476msec
   clat (usec): min=4, max=14896, avg=103.19, stdev=123.06
    lat (usec): min=4, max=14896, avg=103.27, stdev=123.06
   clat percentiles (usec):
    |  1.00th=[    7],  5.00th=[    9], 10.00th=[   11], 20.00th=[   90],
    | 30.00th=[   95], 40.00th=[   99], 50.00th=[  104], 60.00th=[  112],
    | 70.00th=[  118], 80.00th=[  125], 90.00th=[  141], 95.00th=[  167],
    | 99.00th=[  247], 99.50th=[  318], 99.90th=[ 2256], 99.95th=[ 2512],
    | 99.99th=[ 4256]
   bw (KB  /s): min=26472, max=57008, per=28.80%, avg=38380.41, stdev=7929.82
   lat (usec) : 10=6.96%, 20=10.26%, 50=1.27%, 100=22.67%, 250=57.86%
   lat (usec) : 500=0.68%, 750=0.04%, 1000=0.02%
   lat (msec) : 2=0.09%, 4=0.12%, 10=0.01%, 20=0.01%
 cpu          : usr=1.51%, sys=7.30%, ctx=1051111, majf=0, minf=38
 IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued    : total=r=1048576/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
    latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  READ: io=4096.0MB, aggrb=133254KB/s, minb=133254KB/s, maxb=133254KB/s, mint=31476msec, maxt=31476msec

Disk stats (read/write):
   md13: ios=1047839/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=262144/0, aggrmerge=0/0, aggrticks=25507/0, aggrin_queue=25490, aggrutil=92.98%
 loop1: ios=342845/0, merge=0/0, ticks=29440/0, in_queue=29424, util=92.98%
 loop4: ios=190900/0, merge=0/0, ticks=20568/0, in_queue=20552, util=65.09%
 loop2: ios=257401/0, merge=0/0, ticks=26512/0, in_queue=26492, util=83.65%
 loop3: ios=257430/0, merge=0/0, ticks=25508/0, in_queue=25492, util=80.67%

Device:     rrqm/s  wrqm/s      r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
nvme0n1       0.00    0.00 34484.50    0.00 141398.00     0.00     8.20     3.02   0.09    0.09    0.00   0.03 100.00
loop11        0.00    0.00     0.00    0.00      0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
loop12        0.00    0.00     0.00    0.00      0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
loop13        0.00    0.00     0.00    0.00      0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
loop14        0.00    0.00     0.00    0.00      0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
loop15        0.00    0.00     0.00    0.00      0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
md14          0.00    0.00     0.00    0.00      0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00

RAID10
root@rleblanc-pc:~/junk# fio -rw=read --size=1G --numjobs=4 --name=mdadm_test --group_reporting
...
Disk stats (read/write):
   md14: ios=36295/19, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=9227/27, aggrmerge=0/0, aggrticks=274586/1967, aggrin_queue=276552, aggrutil=98.05%
 loop13: ios=9006/27, merge=0/0, ticks=253296/1824, in_queue=255120, util=95.31%
 loop11: ios=9171/27, merge=0/0, ticks=260884/1876, in_queue=262760, util=96.57%
 loop14: ios=9593/27, merge=0/0, ticks=313672/2256, in_queue=315924, util=98.05%
 loop12: ios=9141/27, merge=0/0, ticks=270492/1912, in_queue=272404, util=97.20%

root@rleblanc-pc:~/junk# fio -rw=randread --size=1G --numjobs=4 --name=mdadm_test --group_reporting
...
Disk stats (read/write):
   md14: ios=1047470/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=262144/0, aggrmerge=0/0, aggrticks=33242/0, aggrin_queue=33209, aggrutil=92.62%
 loop13: ios=258512/0, merge=0/0, ticks=33188/0, in_queue=33160, util=90.21%
 loop11: ios=275798/0, merge=0/0, ticks=34120/0, in_queue=34088, util=92.62%
 loop14: ios=252031/0, merge=0/0, ticks=31976/0, in_queue=31936, util=87.15%
 loop12: ios=262235/0, merge=0/0, ticks=33684/0, in_queue=33652, util=91.52%

Much better distribution, especially on RAID10. We are running a
single VM on the array, so I wonder whether libvirt is effectively
single-threaded for I/O and that is causing what we are seeing. I think
libvirt can use multiple threads for I/O; we'll have to look into that.
It is obvious that md can split reads from a single thread; I wonder
what is preventing it from doing so more efficiently.

This warrants more probing.
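
As a rough way to compare the runs, the sketch below computes an
"imbalance factor" from the per-device read counts fio reported in this
thread: the busiest device's share of reads multiplied by the number of
devices. The metric and the imbalance() name are assumptions chosen for
illustration, not something fio or md reports; 1.0 would mean perfectly
even reads and 4.0 would mean one of four devices served everything.

```python
# Read counts (the first number of "ios=reads/writes") taken from the fio
# disk stats posted in this thread, in the order fio listed the devices.
RUNS = {
    "RAID1, 1 job, randread": [1310359, 0, 361, 0],
    "RAID1, 4 jobs, randread": [342845, 190900, 257401, 257430],
    "RAID10, 4 jobs, randread": [258512, 275798, 252031, 262235],
}


def imbalance(reads):
    """Busiest device's share of reads times the device count.

    1.0 means perfectly even; len(reads) means one device did all the work.
    Returns 0.0 when there were no reads at all.
    """
    total = sum(reads)
    if total == 0:
        return 0.0
    return max(reads) / total * len(reads)


if __name__ == "__main__":
    for label, reads in RUNS.items():
        print(f"{label}: {imbalance(reads):.2f}")
```

A single number like this hides which device is the busy one, so it is
only meant as a quick comparison between runs, not a replacement for
the per-device stats.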
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Nov 2, 2016 at 3:00 PM, Andreas Klauer
<Andreas.Klauer@metamorpher.de> wrote:
> On Wed, Nov 02, 2016 at 01:56:02PM -0600, Robert LeBlanc wrote:
>> Yes, we can have any number of disks in a RAID1 (we currently have
>> three), but reads only ever come from the first drive.
>
> Only if there's only one reader. So it depends on what activity
> there is on the machine.
>
>> We just need the option to grow a RAID10 like we can with RAID1.
>
> Patches welcome, I'm sure? ;-)
>
>> Basically, we want to be super paranoid with several identical copies
>> of the data and get extra read performance.
>
> You could put RAID on RAID and thus achieve other modes but not sure
> if it's worth the overhead or even applies in any way to your use case
> and using non standard setups always comes with its own pitfalls.
>
> RAID 1, with RAID0 on top, three disks ABC, two partitions ab,
> different disk order.
>
>   A B C
> a 1 2 3
> b 3 1 2
>
> Three RAID 1 md1, md2, md3, (and md0 a RAID-0 on top).
>
> You can grow it.
>
>   A B C D
> a 1 2 3 ?
> b 3 1 2 ?
>
>   A B C D
> a 1 2 3 ?
> b 3 1 2 3
>
> md3 has 3 disks temporarily here.
>
>   A B C D
> a 1 2 3 4
> b 4 1 2 3
>
> md4 is new, to be added to md0.
>
> Three copies? Same thing with three partitions.
>
> Will it help any or make things worse? I dunno.
> Have to be careful to make md0 assemble last.
>
> Could also be RAID5 on top instead of RAID1.
> That's even stranger though.
>
> Regards
> Andreas Klauer
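
The rotated partition tables quoted above follow a simple pattern that
can be sketched in a few lines. This is only an illustration of the
diagrams, with hypothetical helper names (rotated_layout, render); it
does not create any arrays and does not model the intermediate step
where md3 temporarily has three members.

```python
from string import ascii_lowercase, ascii_uppercase


def rotated_layout(disks, partitions=2):
    """Return rows of RAID1 numbers: row r is partition r, column d is disk d.

    Each row is the previous row rotated right by one disk, so the
    members of any one RAID1 always sit on different disks.
    """
    if not 1 <= partitions <= disks:
        raise ValueError("need 1 <= partitions <= disks")
    return [[(d - r) % disks + 1 for d in range(disks)] for r in range(partitions)]


def render(layout):
    """Format a layout like the tables in the quoted message."""
    disks = len(layout[0])
    lines = ["  " + " ".join(ascii_uppercase[:disks])]
    for r, row in enumerate(layout):
        lines.append(ascii_lowercase[r] + " " + " ".join(str(n) for n in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(rotated_layout(3)))  # three disks, two partitions
    print()
    print(render(rotated_layout(4)))  # after growing to a fourth disk
```

Calling rotated_layout(3, partitions=3) gives the three-copy variant
mentioned in the quote, with every RAID1 spanning all three disks.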

* Re: Issue with growing RAID10
From: Robert LeBlanc @ 2016-11-02 22:07 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid
In-Reply-To: <CAANLjFqj44wE_Be8FOaVU9AwM-=n_hquyqD1CHUy5y0W06G=wA@mail.gmail.com>

Oh, and since '-p f4' works so well, it really seems like there is a
bug in the 'near' code. We are going to see if we can find anything in
the code. I could see mechanical drives getting an advantage from
'far', but on SSDs the layout should make little difference.
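
For reference, the difference between the two layouts can be sketched
as a simplified placement model. This follows the md(4) description of
'near' and 'far' and is not lifted from the kernel code; the function
names are hypothetical, device numbers are zero-based, and the model
ignores offsets within a device and says nothing about how the
read-balancing code picks among the copies.

```python
def near_devices(chunk, copies, raid_disks):
    """Devices holding `chunk` in a 'near' layout.

    The copies of a chunk occupy consecutive slots in stripe order, so
    they land on neighbouring devices at similar offsets.
    """
    first_slot = chunk * copies
    return [(first_slot + k) % raid_disks for k in range(copies)]


def far_devices(chunk, copies, raid_disks):
    """Devices holding `chunk` in a 'far' layout, one entry per section.

    The first section of every device is striped RAID0-style; each later
    section repeats that stripe shifted by one more device.
    """
    return [(chunk + k) % raid_disks for k in range(copies)]


if __name__ == "__main__":
    for chunk in range(4):
        print(
            f"chunk {chunk}: near=4 -> {near_devices(chunk, 4, 4)}, "
            f"far=4 -> {far_devices(chunk, 4, 4)}"
        )
```

In this model, with four copies on four devices, the first copy of
every chunk under 'near' sits on device 0, while under 'far' the first
copies rotate across all four devices. Whether that is what the md
read path actually keys on is exactly what needs checking in the code.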

RAID10 f4
# fio -rw=read --size=5G --name=mdadm_test
...
Disk stats (read/write):
   md15: ios=45212/5, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=14064/13, aggrmerge=0/0, aggrticks=290590/893, aggrin_queue=291481, aggrutil=98.95%
 loop23: ios=15328/13, merge=0/0, ticks=337884/928, in_queue=338816, util=98.95%
 loop21: ios=15329/13, merge=0/0, ticks=314396/984, in_queue=315372, util=98.75%
 loop24: ios=12800/13, merge=0/0, ticks=270368/904, in_queue=271268, util=98.59%
 loop22: ios=12800/13, merge=0/0, ticks=239712/756, in_queue=240468, util=98.51%

# fio -rw=randread --size=5G --name=mdadm_test
...
Disk stats (read/write):
   md15: ios=1305867/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=327680/0, aggrmerge=0/0, aggrticks=21163/0, aggrin_queue=21146, aggrutil=23.32%
 loop23: ios=327680/0, merge=0/0, ticks=21512/0, in_queue=21496, util=23.32%
 loop21: ios=327680/0, merge=0/0, ticks=20716/0, in_queue=20692, util=22.44%
 loop24: ios=327680/0, merge=0/0, ticks=21500/0, in_queue=21488, util=23.31%
 loop22: ios=327680/0, merge=0/0, ticks=20924/0, in_queue=20908, util=22.68%
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Nov 2, 2016 at 3:27 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> hmmm....
>
> RAID1
> root@rleblanc-pc:~/junk# fio -rw=read --size=1G --numjobs=4
> --name=mdadm_test --group_reporting
> mdadm_test: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
> ...
> fio-2.10
> Starting 4 processes
> mdadm_test: Laying out IO file(s) (1 file(s) / 1024MB)
> mdadm_test: Laying out IO file(s) (1 file(s) / 1024MB)
> mdadm_test: Laying out IO file(s) (1 file(s) / 1024MB)
> Jobs: 1 (f=1): [R(1),_(3)] [88.9% done] [423.8MB/0KB/0KB /s] [108K/0/0
> iops] [eta 00m:01s]
> mdadm_test: (groupid=0, jobs=4): err= 0: pid=20564: Wed Nov  2 15:15:40 2016
>  read : io=4096.0MB, bw=567642KB/s, iops=141910, runt=  7389msec
>    clat (usec): min=0, max=22233, avg=23.02, stdev=288.38
>     lat (usec): min=0, max=22233, avg=23.12, stdev=288.38
>    clat percentiles (usec):
>     |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    1],
>     | 30.00th=[    1], 40.00th=[    1], 50.00th=[    1], 60.00th=[    2],
>     | 70.00th=[    2], 80.00th=[    2], 90.00th=[    2], 95.00th=[    3],
>     | 99.00th=[  644], 99.50th=[ 1144], 99.90th=[ 4128], 99.95th=[ 5600],
>     | 99.99th=[11584]
>    bw (KB  /s): min=94396, max=469418, per=28.62%, avg=162451.40, stdev=81106.83
>    lat (usec) : 2=58.15%, 4=39.21%, 10=0.87%, 20=0.09%, 50=0.16%
>    lat (usec) : 100=0.13%, 250=0.14%, 500=0.13%, 750=0.26%, 1000=0.29%
>    lat (msec) : 2=0.29%, 4=0.20%, 10=0.09%, 20=0.01%, 50=0.01%
>  cpu          : usr=4.14%, sys=10.87%, ctx=15564, majf=0, minf=41
>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     issued    : total=r=1048576/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>     latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   READ: io=4096.0MB, aggrb=567641KB/s, minb=567641KB/s,
> maxb=567641KB/s, mint=7389msec, maxt=7389msec
>
> Disk stats (read/write):
>    md13: ios=48375/3, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=12292/6, aggrmerge=0/0, aggrticks=31009/140,
> aggrin_queue=31145, aggrutil=97.41%
>  loop1: ios=14654/6, merge=0/0, ticks=39524/156, in_queue=39672, util=97.41%
>  loop4: ios=5791/6, merge=0/0, ticks=13976/100, in_queue=14072, util=45.45%
>  loop2: ios=16575/6, merge=0/0, ticks=37360/152, in_queue=37508, util=90.92%
>  loop3: ios=12150/6, merge=0/0, ticks=33176/152, in_queue=33328, util=91.08%
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> nvme0n1           0.50  1387.00 3234.00 2996.50 388746.00 17500.00
> 130.41     4.44    0.71    1.29    0.09   0.16  98.40
> loop1             0.00     0.00 1510.00    2.50 128839.75     6.50
> 170.38     5.10    3.37    3.34   24.80   0.66 100.00
> loop2             0.00     0.00 1570.00    2.50 133952.25     6.50
> 170.38     5.22    3.31    3.27   25.60   0.64 100.00
> loop3             0.00     0.00 1521.50    2.50 129855.75     6.50
> 170.42     5.00    3.27    3.24   25.60   0.65  98.60
> loop4             0.00     0.00    2.50    2.50   248.00     6.50
> 101.80     0.04    8.40    1.60   15.20   8.00   4.00
> loop5             0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md13              0.00     0.00 4603.50    1.50 392832.00     6.00
> 170.61     0.00    0.00    0.00    0.00   0.00   0.00
>
> root@rleblanc-pc:~/junk# fio -rw=randread --size=1G --numjobs=4
> --name=mdadm_test --group_reporting
> mdadm_test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
> ...
> fio-2.10
> Starting 4 processes
> Jobs: 1 (f=1): [_(3),r(1)] [100.0% done] [35996KB/0KB/0KB /s]
> [8999/0/0 iops] [eta 00m:00s]
> mdadm_test: (groupid=0, jobs=4): err= 0: pid=21036: Wed Nov  2 15:17:47 2016
>  read : io=4096.0MB, bw=133254KB/s, iops=33313, runt= 31476msec
>    clat (usec): min=4, max=14896, avg=103.19, stdev=123.06
>     lat (usec): min=4, max=14896, avg=103.27, stdev=123.06
>    clat percentiles (usec):
>     |  1.00th=[    7],  5.00th=[    9], 10.00th=[   11], 20.00th=[   90],
>     | 30.00th=[   95], 40.00th=[   99], 50.00th=[  104], 60.00th=[  112],
>     | 70.00th=[  118], 80.00th=[  125], 90.00th=[  141], 95.00th=[  167],
>     | 99.00th=[  247], 99.50th=[  318], 99.90th=[ 2256], 99.95th=[ 2512],
>     | 99.99th=[ 4256]
>    bw (KB  /s): min=26472, max=57008, per=28.80%, avg=38380.41, stdev=7929.82
>    lat (usec) : 10=6.96%, 20=10.26%, 50=1.27%, 100=22.67%, 250=57.86%
>    lat (usec) : 500=0.68%, 750=0.04%, 1000=0.02%
>    lat (msec) : 2=0.09%, 4=0.12%, 10=0.01%, 20=0.01%
>  cpu          : usr=1.51%, sys=7.30%, ctx=1051111, majf=0, minf=38
>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     issued    : total=r=1048576/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>     latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   READ: io=4096.0MB, aggrb=133254KB/s, minb=133254KB/s,
> maxb=133254KB/s, mint=31476msec, maxt=31476msec
>
> Disk stats (read/write):
>    md13: ios=1047839/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=262144/0, aggrmerge=0/0, aggrticks=25507/0,
> aggrin_queue=25490, aggrutil=92.98%
>  loop1: ios=342845/0, merge=0/0, ticks=29440/0, in_queue=29424, util=92.98%
>  loop4: ios=190900/0, merge=0/0, ticks=20568/0, in_queue=20552, util=65.09%
>  loop2: ios=257401/0, merge=0/0, ticks=26512/0, in_queue=26492, util=83.65%
>  loop3: ios=257430/0, merge=0/0, ticks=25508/0, in_queue=25492, util=80.67%
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> nvme0n1           0.00     0.00 34484.50    0.00 141398.00     0.00
>  8.20     3.02    0.09    0.09    0.00   0.03 100.00
> loop11            0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> loop12            0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> loop13            0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> loop14            0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> loop15            0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md14              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>
> RAID10
> root@rleblanc-pc:~/junk# fio -rw=read --size=1G --numjobs=4
> --name=mdadm_test --group_reporting
> ...
> Disk stats (read/write):
>    md14: ios=36295/19, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=9227/27, aggrmerge=0/0, aggrticks=274586/1967,
> aggrin_queue=276552, aggrutil=98.05%
>  loop13: ios=9006/27, merge=0/0, ticks=253296/1824, in_queue=255120, util=95.31%
>  loop11: ios=9171/27, merge=0/0, ticks=260884/1876, in_queue=262760, util=96.57%
>  loop14: ios=9593/27, merge=0/0, ticks=313672/2256, in_queue=315924, util=98.05%
>  loop12: ios=9141/27, merge=0/0, ticks=270492/1912, in_queue=272404, util=97.20%
>
> root@rleblanc-pc:~/junk# fio -rw=randread --size=1G --numjobs=4
> --name=mdadm_test --group_reporting
> ...
> Disk stats (read/write):
>    md14: ios=1047470/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=262144/0, aggrmerge=0/0, aggrticks=33242/0,
> aggrin_queue=33209, aggrutil=92.62%
>  loop13: ios=258512/0, merge=0/0, ticks=33188/0, in_queue=33160, util=90.21%
>  loop11: ios=275798/0, merge=0/0, ticks=34120/0, in_queue=34088, util=92.62%
>  loop14: ios=252031/0, merge=0/0, ticks=31976/0, in_queue=31936, util=87.15%
>  loop12: ios=262235/0, merge=0/0, ticks=33684/0, in_queue=33652, util=91.52%
>
> Much better distribution, especially on RAID10. I wonder if because we
> are running a single VM on the array that libvirt is basically single
> threaded causing what we are seeing. I think libvirt can have multiple
> threads for I/O; we'll have to look into that. It is obvious that md
> can split reads from a single thread, so I wonder what is preventing
> it from doing so more efficiently.
>
> This warrants more probing.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Nov 2, 2016 at 3:00 PM, Andreas Klauer
> <Andreas.Klauer@metamorpher.de> wrote:
>> On Wed, Nov 02, 2016 at 01:56:02PM -0600, Robert LeBlanc wrote:
>>> Yes, we can have any number of disks in a RAID1 (we currently have
>>> three), but reads only ever come from the first drive.
>>
>> Only if there's only one reader. So it depends on what activity
>> there is on the machine.
>>
>>> We just need the option to grow a RAID10 like we can with RAID1.
>>
>> Patches welcome, I'm sure? ;-)
>>
>>> Basically, we want to be super paranoid with several identical copies
>>> of the data and get extra read performance.
>>
>> You could put RAID on RAID and thus achieve other modes but not sure
>> if it's worth the overhead or even applies in any way to your use case
>> and using non standard setups always comes with its own pitfalls.
>>
>> RAID 1, with RAID0 on top, three disks ABC, two partitions ab,
>> different disk order.
>>
>>   A B C
>> a 1 2 3
>> b 3 1 2
>>
>> Three RAID 1 md1, md2, md3, (and md0 a RAID-0 on top).
>>
>> You can grow it.
>>
>>   A B C D
>> a 1 2 3 ?
>> b 3 1 2 ?
>>
>>   A B C D
>> a 1 2 3 ?
>> b 3 1 2 3
>>
>> md3 has 3 disks temporarily here.
>>
>>   A B C D
>> a 1 2 3 4
>> b 4 1 2 3
>>
>> md4 is new, to be added to md0.
>>
>> Three copies? Same thing with three partitions.
>>
>> Will it help any or make things worse? I dunno.
>> Have to be careful to make md0 assemble last.
>>
>> Could also be RAID5 on top instead of RAID1.
>> That's even stranger though.
>>
>> Regards
>> Andreas Klauer

^ permalink raw reply

* Re: [PATCH 09/60] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments
From: Ming Lei @ 2016-11-02 23:47 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Kent Overstreet, Christoph Hellwig, Jens Axboe,
	Linux Kernel Mailing List, linux-block, Linux FS Devel,
	Kirill A . Shutemov, Alasdair Kergon,
	maintainer:DEVICE-MAPPER (LVM), Shaohua Li,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <20161102142454.GA9263@redhat.com>

On Wed, Nov 2, 2016 at 10:24 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Wed, Nov 02 2016 at  3:56am -0400,
> Ming Lei <tom.leiming@gmail.com> wrote:
>
>> On Wed, Nov 2, 2016 at 11:09 AM, Kent Overstreet
>> <kent.overstreet@gmail.com> wrote:
>> > On Mon, Oct 31, 2016 at 08:29:01AM -0700, Christoph Hellwig wrote:
>> >> On Sat, Oct 29, 2016 at 04:08:08PM +0800, Ming Lei wrote:
>> >> > Avoid to access .bi_vcnt directly, because it may be not what
>> >> > the driver expected any more after supporting multipage bvec.
>> >> >
>> >> > Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> >>
>> >> It would be really nice to have a comment in the code why it's
>> >> even checking for multiple segments.
>> >
>> > Or ideally refactor the code to not care about multiple segments at all.
>>
>> The check on 'bio->bi_vcnt == 1' is introduced in commit de3ec86dff160(dm:
>> don't start current request if it would've merged with the previous), which
>> fixed one performance issue.[1]
>>
>> Looks the idea of the patch is to delay dispatching the rq if it
>> would've merged with previous request and the rq is small(single bvec).
>> I guess the motivation is to try to increase chance of merging with the delay.
>>
>> But why does the code check on 'bio->bi_vcnt == 1'? Once the bio is
>> submitted, .bi_vcnt isn't changed any more and merging doesn't change
>> it too. So should the check have been on blk_rq_bytes(rq)?
>>
>> Mike, please correct me if my understanding is wrong.
>>
>>
>> [1] https://www.redhat.com/archives/dm-devel/2015-March/msg00014.html
>
> The patch was labored over for quite a while and is based on suggestions I
> got from Jens when discussing a very problematic aspect of old
> .request_fn request-based DM performance for a multi-threaded (64
> threads) sequential IO benchmark (vdbench IIRC).  The issue was reported
> by NetApp.
>
> The patch in question fixed the lack of merging that was seen with this
> interleaved sequential IO benchmark.  The lack of merging was made worse
> if a DM multipath device had more underlying paths (e.g. 4 instead of 2).
>
> As for your question, about using blk_rq_bytes(rq) vs 'bio->bi_vcnt == 1'
> .. not sure how that would be a suitable replacement.  But it has been a
> while since I've delved into these block core merge details of old

It was just last year, which doesn't look like long enough ago. :-)

> .request_fn but _please_ don't change the logic of this code simply

As I explained before, .bi_vcnt is changed neither after submission nor
during merging, so I think the check is wrong. Could you explain your
original motivation for checking 'bio->bi_vcnt == 1'?

> because it is proving itself to be problematic for your current
> patchset's cleanliness.

Could you explain what is problematic for the cleanliness?

Thanks,
Ming Lei

^ permalink raw reply

* [PATCH 0/2] Fix near layout I/O distribution
From: Robert LeBlanc @ 2016-11-03 23:45 UTC (permalink / raw)
  To: linux-raid


This fixes a small typo left over from when the code was copied from raid1,
and removes a loop break in read_balance() that is taken only for 'near'
layouts, when a device's atomic nr_pending count is zero. 'Far' layouts never
triggered this break, so they spread read I/O across all available devices.
The break is removed because it is contrary to the comment preceding it, and
the code works as intended for the 'far' layout. After this change, random
I/O on a 'near' layout is distributed to all disks instead of all being
serviced by the first disk.  Sequential I/O is not affected, as it would not
trigger the break.

For example, before the patch:
# mdadm --create /dev/md14 --level 10 --raid-devices 4 -p n4 /dev/loop{11..14}
# fio -rw=randread --size=5G --name=mdadm_test
<snip>
Disk stats (read/write):
   md14: ios=1304718/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=327680/0, aggrmerge=0/0, aggrticks=18719/0, aggrin_queue=18689,
aggrutil=88.37%
 loop13: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
 loop11: ios=1310108/0, merge=0/0, ticks=74856/0, in_queue=74736, util=88.37%
 loop14: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
 loop12: ios=612/0, merge=0/0, ticks=20/0, in_queue=20, util=0.02%

After patch:

<snip>
Disk stats (read/write):
    md14: ios=1309172/1, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=327680/3, aggrmerge=0/0, aggrticks=19866/29, aggrin_queue=19871,
aggrutil=37.54%
  loop13: ios=331904/3, merge=0/0, ticks=21184/24, in_queue=21180, util=24.64%
  loop11: ios=330075/3, merge=0/0, ticks=32252/24, in_queue=32248, util=37.54%
  loop14: ios=327612/3, merge=0/0, ticks=3204/24, in_queue=3200, util=3.73%
  loop12: ios=321129/3, merge=0/0, ticks=22824/44, in_queue=22856, util=26.59%
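
As a rough check of the two sets of numbers above, the per-device read
counts can be turned into shares of the total. This is only a sketch: the
counts are copied from the fio "Disk stats" output, and read_share() is a
helper name made up for this example. It suggests roughly 99.95% of the
reads hit loop11 before the patch, and about a quarter go to each device
afterwards.

```python
# Per-device read counts copied from the fio "Disk stats" output above.
before = {"loop11": 1310108, "loop12": 612, "loop13": 0, "loop14": 0}
after = {"loop11": 330075, "loop12": 321129, "loop13": 331904, "loop14": 327612}


def read_share(ios):
    """Return each device's fraction of all reads issued to member devices."""
    total = sum(ios.values())
    return {dev: count / total for dev, count in ios.items()}


for label, ios in (("before", before), ("after", after)):
    shares = read_share(ios)
    # Print one line per run, e.g. "before loop11=..% loop12=..% ..."
    print(label, " ".join(f"{dev}={share:.2%}" for dev, share in sorted(shares.items())))
```

Both runs add up to 1310720 reads, which matches 5G split into 4K requests.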

^ permalink raw reply

* [PATCH 1/2] mdadm: raid10.h Fix typo
From: Robert LeBlanc @ 2016-11-03 23:45 UTC (permalink / raw)
  To: linux-raid; +Cc: Robert LeBlanc
In-Reply-To: <20161103234508.12641-1-robert@leblancnet.us>

Fix typo from paste of code from RAID1.

Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
---
 drivers/md/raid10.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 18ec1f7..1699a6a 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -53,7 +53,7 @@ struct r10conf {
 	sector_t		offset_diff;
 
 	struct list_head	retry_list;
-	/* A separate list of r1bio which just need raid_end_bio_io called.
+	/* A separate list of r10bio which just need raid_end_bio_io called.
 	 * This mustn't happen for writes which had any errors if the superblock
 	 * needs to be written.
 	 */
-- 
2.10.1


^ permalink raw reply related

* [PATCH 2/2] mdadm: raid10.c Remove near atomic break
From: Robert LeBlanc @ 2016-11-03 23:45 UTC (permalink / raw)
  To: linux-raid; +Cc: Robert LeBlanc
In-Reply-To: <20161103234508.12641-1-robert@leblancnet.us>

This break is always triggered for small reads, which prevents the reads
from being spread across all available drives. The comment is also
confusing: it reads as if it applies only to 'far' layouts, but the check
really only applies to 'near' layouts. Since there aren't problems with
'far' layouts, there shouldn't be a problem for 'near' layouts either. This
change distributes reads fairly across all drives, where before they only
came from the first drive.

Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
---
 drivers/md/raid10.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index be1a9fc..8d83802 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -777,13 +777,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 		if (!do_balance)
 			break;
 
-		/* This optimisation is debatable, and completely destroys
-		 * sequential read speed for 'far copies' arrays.  So only
-		 * keep it for 'near' arrays, and review those later.
-		 */
-		if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
-			break;
-
 		/* for far > 1 always use the lowest address */
 		if (geo->far_copies > 1)
 			new_distance = r10_bio->devs[slot].addr;
-- 
2.10.1


^ permalink raw reply related

* Re: Resync issue in RAID1
From: NeilBrown @ 2016-11-04  3:33 UTC (permalink / raw)
  To: V; +Cc: linux-raid
In-Reply-To: <CAF9xHmTGFXvQnKX5ZK5+ino4977SpHxMffCc0YPMxJGEhwGLuw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4105 bytes --]

On Fri, Oct 28 2016, V wrote:

> Is there any reason, why this happens in the resync flow. Normally the
> upper layer driver tries to align with device block size for the
> request. So could there be an issue in this path ?


This happens in the resync flow because there is a bug which lets the
number "3" escape and be used incorrectly as a device address.
The same bug wouldn't affect data from any upper level driver.
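
To make the arithmetic concrete, here is a small sketch (plain Python, not
kernel code). It assumes 512-byte md sectors, a 4K logical block size on the
member disks, and the 128-sector stride seen in the kern.log errors; the
function names are invented for this illustration. Starting from a
checkpoint of 3, every request lands on an offset that is not a multiple of
8 sectors, whereas starting from 0 they are all aligned.

```python
SECTOR_BYTES = 512            # md addresses devices in 512-byte sectors
LOGICAL_BLOCK_BYTES = 4096    # disks in the report use 4K logical sectors
SECTORS_PER_BLOCK = LOGICAL_BLOCK_BYTES // SECTOR_BYTES  # 8


def resync_offsets(checkpoint, stride=128, count=4):
    """Start sectors of successive resync requests after resuming at checkpoint.

    The stride of 128 sectors is taken from the spacing of the logged errors.
    """
    return [checkpoint + i * stride for i in range(count)]


def is_aligned(sector):
    """True if the sector offset is a whole number of 4K logical blocks."""
    return sector % SECTORS_PER_BLOCK == 0


for checkpoint in (3, 0):
    offsets = resync_offsets(checkpoint)
    print(checkpoint, offsets, [is_aligned(s) for s in offsets])
```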

NeilBrown


>
> Thanks,
> V
>
> On Thu, Oct 27, 2016 at 11:01 PM, NeilBrown <neilb@suse.com> wrote:
>> On Fri, Oct 28 2016, V wrote:
>>
>>> Hi Neil,
>>>
>>> Thanks for the response. But during this phase, why is the scsi driver
>>> complaining about bad block number ?
>>>
>>> Oct 18 03:52:56  kernel: [  52.869378] sd 0:0:0:0: [sda] Bad block
>>> number requested
>>
>> Because md is asking to read blocks at offsets which are not a multiple
>> of 8 sectors.
>>
>> NeilBrown
>>
>>
>>> Oct 18 03:52:56  kernel: [  52.869414] sd 0:0:0:0: [sda] Bad block
>>> number requested
>>> Oct 18 03:52:56  kernel: [  52.869436] sd 0:0:0:0: [sda] Bad block
>>> number requested
>>> Oct 18 03:52:56  kernel: [  52.869465] sd 0:0:0:0: [sda] Bad block
>>> number requested
>>> Oct 18 03:52:56  kernel: [  52.869503] sd 0:0:1:0: [sdb] Bad block
>>> number requested
>>>
>>> Thanks,
>>> V
>>>
>>> On Thu, Oct 27, 2016 at 9:01 PM, NeilBrown <neilb@suse.com> wrote:
>>>> On Sat, Oct 22 2016, V wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am facing an issue during RAID1 resync. I have an ubuntu
>>>>> 4.4.0-31-generic running with raid1 configured with 2 disks as active
>>>>> and 2 as spares. On the first powercycle, after installing RAID, i see
>>>>> the following messages in kern.log
>>>>>
>>>>>
>>>>> My disks are configured with 4K sector size (both logical and
>>>>> physical) (sda and sdb are active disks for this raid)
>>>>>
>>>>>
>>>>> ===========
>>>>> Oct 18 03:52:56  kernel: [   52.869113] md: using 128k window, over a
>>>>> total of 51167104k.
>>>>> Oct 18 03:52:56  kernel: [   52.869114] md: resuming resync of md2 from checkpoint.
>>>>
>>>> This line (above) combined with ...
>>>>
>>>>> Oct 18 03:52:56  kernel: [   52.869536] md/raid1:md2: sda: unrecoverable I/O read error for block 3
>>>>
>>>> this line suggests that when you shut down, md had already started a
>>>> resync, and it had checkpointed at block '3'.
>>>>
>>>> The subsequent errors are:
>>>>
>>>>> Oct 18 03:52:56  kernel: [   52.869692] md/raid1:md2: sda: unrecoverable I/O read error for block 131
>>>>> Oct 18 03:52:56  kernel: [   52.869837] md/raid1:md2: sda: unrecoverable I/O read error for block 259
>>>>> Oct 18 03:52:56  kernel: [   52.870022] md/raid1:md2: sda: unrecoverable I/O read error for block 387
>>>>
>>>> which are every 128 blocks (aka sectors) from '3'.
>>>> I know what caused that.  The patch below will stop it happening again.
>>>>
>>>> You might be able to get your array working again by stopping it
>>>> and assembling with --update=resync.
>>>> That will reset the checkpoint to 0.
>>>>
>>>> NeilBrown
>>>>
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index 2cf0e1c00b9a..aa2ca23463f4 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -8099,7 +8099,8 @@ void md_do_sync(struct md_thread *thread)
>>>>             mddev->curr_resync > 2) {
>>>>                 if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
>>>>                         if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
>>>> -                               if (mddev->curr_resync >= mddev->recovery_cp) {
>>>> +                               if (mddev->curr_resync >= mddev->recovery_cp &&
>>>> +                                   mddev->curr_resync > 3) {
>>>>                                         printk(KERN_INFO
>>>>                                                "md: checkpointing %s of %s.\n",
>>>>                                                desc, mdname(mddev));
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: RAID boot documentation
From: NeilBrown @ 2016-11-04  3:51 UTC (permalink / raw)
  To: doug; +Cc: linux-raid
In-Reply-To: <CAFx4rwTjDFS-uob=GpCGtOq2dfLWzKCWciA7RxWdORWkh2qn=g@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 322 bytes --]

On Thu, Nov 03 2016, Doug Dumitru wrote:

> Neil,
>
> Thank you very much.  If I chose to write up some documentation on
> this, where/who might want it?  udev documentation in general seems
> quite weak.

Maybe we should have an "md.systemd" man page.  I've been thinking the
same thing about "nfs.systemd"...

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [PATCH 2/2] mdadm: raid10.c Remove near atomic break
From: NeilBrown @ 2016-11-04  4:01 UTC (permalink / raw)
  To: Robert LeBlanc, linux-raid
In-Reply-To: <20161103234508.12641-3-robert@leblancnet.us>

[-- Attachment #1: Type: text/plain, Size: 1948 bytes --]

On Fri, Nov 04 2016, Robert LeBlanc wrote:

> This is always triggered for small reads preventing spreading the reads
> across all available drives. The comments are also confusing as it is
> supposed to apply only to 'far' layouts, but really only applies to 'near'
> layouts. Since there isn't problems with 'far' layouts, there shouldn't
> be a problem for 'near' layouts either. This change fairly distributes
> reads across all drives where before only came from the first drive.

Why is "fairness" an issue?
The current code will use a device if it finds that it is completely
idle. i.e. if nr_pending is 0.
Why is that ever the wrong thing to do?

Does your testing show that overall performance is improved?  If so,
that would certainly be useful.
But it isn't clear (to me) that simply spreading the load more "fairly"
is a worthy goal.

Thanks,
NeilBrown


>
> Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
> ---
>  drivers/md/raid10.c | 7 -------
>  1 file changed, 7 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index be1a9fc..8d83802 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -777,13 +777,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
>  		if (!do_balance)
>  			break;
>  
> -		/* This optimisation is debatable, and completely destroys
> -		 * sequential read speed for 'far copies' arrays.  So only
> -		 * keep it for 'near' arrays, and review those later.
> -		 */
> -		if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
> -			break;
> -
>  		/* for far > 1 always use the lowest address */
>  		if (geo->far_copies > 1)
>  			new_distance = r10_bio->devs[slot].addr;
> -- 
> 2.10.1
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: Panicked and deleted superblock
From: NeilBrown @ 2016-11-04  4:34 UTC (permalink / raw)
  To: Peter Hoffmann, linux-raid
In-Reply-To: <0e68051d-1008-cf9b-1f8f-0a0736b1c58f@gmx.net>

[-- Attachment #1: Type: text/plain, Size: 4181 bytes --]

On Mon, Oct 31 2016, Peter Hoffmann wrote:

> My problem is the result of working late and not informing myself
> previously, I'm fully aware that I should have had a backup, be less
> spontaneous and more cautious.
>
> The initial situation is a RAID-5 array with three disks. I assume it
> looks as follows:
>
> | Disk 1   | Disk 2   | Disk 3   |
> |----------|----------|----------|
> |    out   | Block 2  | P(1,2)   |
> |    of    | P(3,4)   | Block 4  |	degenerated but working
> |   sync   | Block 5  | Block 6  |

The default RAID5 layout (there are 4 to choose from) is
#define ALGORITHM_LEFT_SYMMETRIC	2 /* Rotating Parity N with Data Continuation */

The first data block on a stripe is always located after the parity
block.
So if data is D0 D1 D2 D3.... then

   D0   D1   P01
   D3   P23  D2
   P45  D4   D5
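
If it helps to check other disk counts, the table above can be generated
with a few lines of Python. This is only a sketch of the left-symmetric
rule as described here (parity rotates backwards from the last disk, data
continues on the disk after the parity block); the function name and the
"D"/"P" labels are made up for the illustration and it is not the kernel
code.

```python
def left_symmetric_stripe(stripe, raid_disks):
    """Return the labels stored on each disk for one stripe (left-symmetric)."""
    data_disks = raid_disks - 1
    # Parity starts on the last disk and moves one disk to the left per stripe.
    parity_disk = data_disks - (stripe % raid_disks)
    first = stripe * data_disks  # number of the first data block in this stripe
    row = [None] * raid_disks
    row[parity_disk] = "P" + "".join(str(first + i) for i in range(data_disks))
    for i in range(data_disks):
        # Data continues on the disk after parity, wrapping around.
        row[(parity_disk + 1 + i) % raid_disks] = f"D{first + i}"
    return row


for stripe in range(3):
    print("  ".join(f"{label:<4}" for label in left_symmetric_stripe(stripe, 3)))
```

Calling left_symmetric_stripe(stripe, 4) gives the corresponding rows for a
four-disk array.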

>
>
> Then I started the re-sync:
>
> | Disk 1   | Disk 2   | Disk 3   |
> |----------|----------|----------|
> | Block 1  | Block 2  | P(1,2)   |
> | Block 3  | P(3,4)   | Block 4  |   	already synced
> | P(5,6)   | Block 5  | Block 6  |
>                . . .
> |    out   | Block b  | P(a,b)   |
> |    of    | P(c,d)   | Block d  |	not yet synced
> |   sync   | Block e  | Block f  |
>
> But I didn't wait for it to finish as I actually wanted to add a fourth
> disk and so started a grow process. But I just changed the size of the
> array, I didn't actually add the fourth disk (don't ask why I cannot
> recall it). I assume that both processes - re-sync  and grow - raced
> through the array and did their job.

So you ran
  mdadm --grow /dev/md0 --raid-disks 4 --force

???
You would need --force or mdadm would refuse to do such a silly thing.

Also, the kernel would refuse to let a reshape start while a resync was
on-going, so the reshape attempt should have been rejected anyway.

>
> | Disk 1   | Disk 2   | Disk 3   |
> |----------|----------|----------|
> | Block 1  | Block 2  | Block 3  |
> | Block 4  | Block 5  | P(4,5,6) |	with four disks but degenerated
> | Block 7  | P(7,8,9) | Block 8  |
>                . . .
> | Block a  | Block b  | P(a,b)   |
> | Block c  | P(c,d)   | Block d  |	not yet grown but synced
> | P(e,f)   | Block e  | Block f  |
>                . . .
> |    out   | Block V  | P(U,V)   |
> |    of    | P(W,X)   | Block X  |		not yet synced
> |   sync   | Block Y  | Block Z  |
>
> And after running for a while - my NAS is very slow (partly because all
> disks are LUKS'd), mdstat showed around 1GiB of Data processed - we had
> a blackout. Water dropped in a distribution socket and *poff*. After a
> reboot I wanted to reassemble everything, didn't know what I was doing so
> the RAID superblock is now lost and I failed to reassemble (this is the
> part I really can't recall, I panicked). I never wrote anything to the
> actual array so I assume, better hope that no actual data is lost.

So you deliberately erased the RAID superblock?  Presumably not.
Maybe you ran "mdadm --create ...." to try to create a new array?  That
would do it.

If the reshape hadn't actually started, then you have some chance of
recovering your data.  If it had, then recovery is virtually impossible
because you don't know how far it got.

>
> I have a plan but wanted to check with you before doing anything stupid
> again.
> My idea is to look for that magic number of the ext4-fs to find the
> beginning of Block 1 on Disk 1, then I would copy an reasonable amount
> of data and try to figure out how big Block 1 and hence chunk-size is -
> perhaps fsck.ext4 can help do that? After that I copy another reasonable
> amount of data from Disks 1-3 to figure out the border between the grown
> Stripes and the synced Stripes. And from there on I'd have my data in a
> defined state from which I can save the whole file system.
> One thing I'm wondering is if I got the layout right. And the other
> might be rather a case for the ext4-mailing list but I'd ask it anyway:
> how can I figure where the file system starts to be corrupted?

You might be able to make something like this work .. if reshape hadn't
started.  But if you can live without recovering the data, then that is
probably the more cost effective option.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [PATCH] Change the option from NoUpdate to NodeNumUpdate
From: NeilBrown @ 2016-11-04  5:35 UTC (permalink / raw)
  To: Guoqing Jiang, Jes.Sorensen; +Cc: linux-raid
In-Reply-To: <1458813635-14175-1-git-send-email-gqjiang@suse.com>

[-- Attachment #1: Type: text/plain, Size: 1853 bytes --]

On Thu, Mar 24 2016, Guoqing Jiang wrote:

> Actually, we need to use NodeNumUpdate here to
> ensure there are enough spaces for those nodes.
>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  Grow.c   | 2 +-
>  super1.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/Grow.c b/Grow.c
> index 5953db2..f58c753 100755
> --- a/Grow.c
> +++ b/Grow.c
> @@ -425,7 +425,7 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
>  						    bitmapsize, offset_setable,
>  						    major)
>  						)
> -						st->ss->write_bitmap(st, fd2, NoUpdate);
> +						st->ss->write_bitmap(st, fd2, NodeNumUpdate);
>  					else {
>  						pr_err("failed to create internal bitmap - chunksize problem.\n");
>  						close(fd2);
> diff --git a/super1.c b/super1.c
> index baa9a96..d6f3c93 100644
> --- a/super1.c
> +++ b/super1.c
> @@ -1867,7 +1867,7 @@ static int write_init_super1(struct supertype *st)
>  		}
>  
>  		if (rv == 0 && (__le32_to_cpu(sb->feature_map) & 1))
> -			rv = st->ss->write_bitmap(st, di->fd, NoUpdate);
> +			rv = st->ss->write_bitmap(st, di->fd, NodeNumUpdate);

This is wrong.
It might be correct for a clustered array, but it caused failure for
non-clustered arrays.
If you run the mdadm self tests, several will fail with errors like

mdadm: Warning: cluster md only works with superblock 1.2
mdadm: Warning: cluster md only works with superblock 1.2
mdadm: failed to set internal bitmap.

Reverting this commit made the errors go away.

NeilBrown


>  		close(di->fd);
>  		di->fd = -1;
>  		if (rv)
> -- 
> 2.6.2
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [PATCH 2/2] mdadm: raid10.c Remove near atomic break
From: Robert LeBlanc @ 2016-11-04  5:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87k2cjq4eh.fsf@notabene.neil.brown.name>

On Thu, Nov 3, 2016 at 10:01 PM, NeilBrown <neilb@suse.com> wrote:
> On Fri, Nov 04 2016, Robert LeBlanc wrote:
>
>> This is always triggered for small reads preventing spreading the reads
>> across all available drives. The comments are also confusing as it is
>> supposed to apply only to 'far' layouts, but really only applies to 'near'
>> layouts. Since there isn't problems with 'far' layouts, there shouldn't
>> be a problem for 'near' layouts either. This change fairly distributes
>> reads across all drives where before only came from the first drive.
>
> Why is "fairness" an issue?
> The current code will use a device if it finds that it is completely
> idle. i.e. if nr_pending is 0.
> Why is that ever the wrong thing to do?

The code also looks for the drive that is closest to the requested
sector, but that never gets a chance to happen without this patch. The
way this part of the code is written, as soon as it finds a good (idle)
disk, it breaks out of the loop instead of searching for a better disk,
so it doesn't even look at the other disks. In a healthy array with
--raid-devices X and -p nX, this means that the first disk gets all the
reads for small I/O. Where -p nY is used with Y less than X, the effect
may be masked because the data is naturally striped, but the code may
still pick a disk that is farther away from the requested sector,
causing extra head seeks.
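
Here is a toy model of what I mean, in Python rather than kernel C. It is a
simplified sketch, not the real read_balance(): it ignores bad blocks,
do_balance and 'far' copies, and the dictionary keys and function name are
just chosen for the example. With the early break, the first idle device
wins no matter where its head is; without it, the device closest to the
requested sector is chosen.

```python
def pick_device(devices, sector, early_idle_break):
    """Pick a mirror slot for a read of `sector` in a simplified 'near' layout.

    devices: list of dicts with 'nr_pending' and 'head_position'.
    early_idle_break: model the check that this patch removes.
    """
    best_slot, best_dist = None, None
    for slot, dev in enumerate(devices):
        if early_idle_break and dev["nr_pending"] == 0:
            # Models the removed break: stop at the first idle device.
            return slot
        dist = abs(sector - dev["head_position"])
        if best_dist is None or dist < best_dist:
            best_slot, best_dist = slot, dist
    return best_slot


# Four idle mirrors with their heads in different places (made-up values).
mirrors = [
    {"nr_pending": 0, "head_position": 900000},
    {"nr_pending": 0, "head_position": 5000},
    {"nr_pending": 0, "head_position": 100},
    {"nr_pending": 0, "head_position": 400000},
]

print(pick_device(mirrors, 128, early_idle_break=True))   # slot of first idle device
print(pick_device(mirrors, 128, early_idle_break=False))  # slot of closest head
```

On fast devices where requests complete before the next one is submitted,
nr_pending is nearly always zero on the first device, which would explain
why it ends up serving almost every read.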

> Does your testing show that overall performance is improved?  If so,
> that would certainly be useful.
> But it isn't clear (to me) that simply spreading the load more "fairly"
> is a worthy goal.

I'll see if I have some mechanical drives somewhere to test (I've been
testing with four loopback devices on a single NVMe drive, so you don't
see an improvement there). You can see from the fio output I posted [1]
that before the patch one drive had all the I/O, and after the patch the
I/O was distributed between all the drives (it doesn't have to be exactly
even; not being as skewed as it was before is good enough). I would
expect similar results to the 'far' tests done here [0]. Based on the
previous tests I did, when I saw this code it made complete sense to me
why we had great performance with 'far' and subpar performance with
'near'. I'll come back with some results tomorrow.

[0] https://raid.wiki.kernel.org/index.php/Performance
[1] http://marc.info/?l=linux-raid&m=147821671817947&w=2

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
> Thanks,
> NeilBrown
>
>
>>
>> Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
>> ---
>>  drivers/md/raid10.c | 7 -------
>>  1 file changed, 7 deletions(-)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index be1a9fc..8d83802 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -777,13 +777,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
>>               if (!do_balance)
>>                       break;
>>
>> -             /* This optimisation is debatable, and completely destroys
>> -              * sequential read speed for 'far copies' arrays.  So only
>> -              * keep it for 'near' arrays, and review those later.
>> -              */
>> -             if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
>> -                     break;
>> -
>>               /* for far > 1 always use the lowest address */
>>               if (geo->far_copies > 1)
>>                       new_distance = r10_bio->devs[slot].addr;
>> --
>> 2.10.1
>>

^ permalink raw reply

* [md PATCH 0/4] Assorted minor improvements.
From: NeilBrown @ 2016-11-04  5:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

There is no real pattern to these patches except that they are fairly
boring but occasionally useful.

The first allows --add and --remove commands to succeed without
waiting for a metadata write, which is just wasted time.  This can be
useful when adding/removing hundreds of devices on a large RAID10
array.

The next two abort some writes which have become pointless.  If a
device fails in a way that causes long retries, this can reduce the
total time for recovery.

The last is a small correctness fix: bitmap_daemon_work() doesn't wait
for writes to complete, so they might still be pending when the next
write is sent, and two writes to the same location might not be
handled properly.  So we insert waits in the rare case that they are
needed.

Thanks,
NeilBrown

---

NeilBrown (4):
      md: perform async updates for metadata where possible.
      md/raid1: abort delayed writes when device fails.
      md/raid10: abort delayed writes when device fails.
      md/bitmap: Don't write bitmap while earlier writes might be in-flight


 drivers/md/bitmap.c |   27 ++++++++++++++++++++++-----
 drivers/md/md.c     |   16 ++++++++++++----
 drivers/md/raid1.c  |   20 +++++++++++++++-----
 drivers/md/raid10.c |   22 ++++++++++++++++------
 4 files changed, 65 insertions(+), 20 deletions(-)

--
Signature


^ permalink raw reply

