Linux RAID subsystem development


* Re: kernel checksumming performance vs actual raid device performance
From: Doug Ledford @ 2016-08-23 20:15 UTC (permalink / raw)
  To: Matt Garman, Doug Dumitru; +Cc: Mdadm
In-Reply-To: <CAJvUf-ApkKJXm7Jjiq=gXY9b9RrEvwA5u35xrMUjX2x0btVL4g@mail.gmail.com>



On 8/23/2016 3:26 PM, Matt Garman wrote:
> Doug & Doug,
> 
> Thank you for your helpful replies.  I merged both of your posts into
> one, see inline comments below:
> 
> On Tue, Aug 23, 2016 at 2:10 PM, Doug Ledford <dledford@redhat.com> wrote:
>> Of course.  I didn't mean to imply otherwise.  The read size is the read
>> size.  But, since the OP's test case was to "read random files" and not
>> "read random blocks of random files" I took it to mean it would be
>> sequential IO across a multitude of random files.  That assumption might
>> have been wrong, but I wrote my explanation with that in mind.
> 
> Yes, multiple parallel sequential reads.  Our test program generates a
> bunch of big random files (file size has an approximately normal
> distribution, centered around 500 MB, going down to 100 MB or so, up
> to a few multi-GB outliers).  The file generation is a one-time thing,
> and we don't really care about its performance.
> 
> The read testing program just randomly picks one of those files, then
> reads it start-to-finish using "dd".  But it kicks off several "dd"
> threads at once (currently 50, though this is a run-time parameter).
> This is how we generate the read load, and I use iostat while this is
> running to see how much read throughput I'm getting from the array.

OK, 50 sequential I/Os at a time.  Good point to know.

> 
> On Tue, Aug 23, 2016 at 1:00 PM, Doug Ledford <dledford@redhat.com> wrote:
>> This depends a lot on how you structured your raid array.  I didn't see
>> your earlier emails, so I'm inferring from the "one out of 22 reads will
>> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
>> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
>> as the basis for my next statement even if it's slightly wrong.
> 
> Yes, that is exactly correct, here's the relevant part of /proc/mdstat:
> 
> Personalities : [raid1] [raid6] [raid5] [raid4]
> 
> md0 : active raid6 sdl[11] sdi[8] sdx[23] sdc[2] sdo[14] sdn[13]
> sdm[12] sdr[17] sdk[10] sdb[1] sdu[20] sdp[15] sdq[16] sds[18] sdt[19]
> sdw[22] sdv[21] sda[0](F) sdj[9] sde[4] sdd[3] sdf[5] sdh[7] sdg[6]
> 
>       44005879808 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [24/23] [_UUUUUUUUUUUUUUUUUUUUUUU]
> 
>       bitmap: 0/15 pages [0KB], 65536KB chunk

Your raid device has a good chunk size for your usage pattern.  If you
had a smallish chunk size (like 64k or 32k), I would actually expect
things to behave differently.  But, then again, maybe I'm wrong and that
would help.  With a smaller chunk size, you would be able to fit more
stripes in the stripe cache using less memory.

> 
>> Doug was right in that you will have to read 21 data disks and 1 parity
>> disk to reconstruct reads from the missing block of any given stripe.
>> And while he is also correct that this doubles IO ops needed to get your
>> read data, it doesn't address the XOR load to get your data.  With 19
>> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
>> 20 64k data blocks for 1 result.  If you are getting 200MB/s, you are
>> actually achieving more like 390MB/s of data read, with 190MB/s of it
>> being direct reads, and then you are using XOR on 200MB/s in order to
>> generate the other 10MB/s of results.
> 
> Most of this morning I've been setting/unsetting/changing various
> tunables, to see if I could increase the read speed.  I got a huge
> boost by increasing the /sys/block/md0/md/stripe_cache_size parameter
> from the default (256 IIRC) to 16384.  Doubling it again to 32k didn't
> seem to bring any further benefit.

Makes sense.  I know the stripe cache size is conservative by default
because of the fact that it's not shared with the page cache, so you
might as well consider its memory lost.  When you upped it to 64k, and
you have 22 disks at 512k chunk, that's 11MB per stripe and 65536 total
allowed stripes, which is a maximum memory consumption of around 700GB
RAM.  I doubt you have that much in your machine, so I'm guessing it's
simply using all available RAM that the page cache or something else
isn't already using.  That also explains why setting it higher doesn't
provide any additional benefits ;-).
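
The arithmetic behind that 700GB figure can be reproduced with a short
sketch.  It takes the message's own assumption at face value, namely
that every cache entry holds one full 512k chunk for each of the 22
data disks; how md actually sizes a cache entry may differ, and the
function name here is purely illustrative.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def stripe_cache_estimate(data_disks, chunk_bytes, cache_entries):
    """Return (bytes per stripe, total bytes) under the assumption that
    each cache entry holds one full chunk per data disk."""
    per_stripe = data_disks * chunk_bytes
    return per_stripe, per_stripe * cache_entries


# Figures quoted in the message: 22 data disks, 512k chunk, 65536 entries.
per_stripe, total = stripe_cache_estimate(22, 512 * KIB, 65536)
print(f"per stripe: {per_stripe / MIB:.0f} MiB")
print(f"total:      {total / GIB:.0f} GiB")
```

Under those assumptions the total comes to 704 GiB, in line with the
"around 700GB" estimate.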

>  So with the stripe_cache_size
> increased to 16k, I'm now getting around 1000 MB/s read in the
> degraded state.  When the degraded array was only doing 200 MB/s, the
> md0_raid6 process was taking about 50% CPU according to top.  Now I
> have a 5x increase in read speed, and md0_raid6 is taking 100% CPU.

You probably have maxed out your single CPU performance and won't see
any benefit without having a multi-threaded XOR routine.

> I'm still degraded by a factor of eight, though, where I'd expect only
> two.
> 
>> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
>> routines, you can actually keep a CPU pretty busy with this.  Also, even
>> though the XOR routines try to time their assembly 'just so' so that
>> they can use the cache avoiding instructions, this fails more often than
>> not so you end up blowing CPU caches while doing this work, which of
>> course affects the overall system.
> 
> While 200 MB/s of XOR sounds high, the kernel is "advertising" over
> 8000 MB/s, per dmesg:
> 
> [    6.386820] xor: automatically using best checksumming function:
> [    6.396690]    avx       : 24064.000 MB/sec
> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [    6.499774] raid6: using avx2x2 recovery algorithm
> 
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
> 
> I'm assuming however the kernel does its testing is fairly optimal,

It is *highly* optimal.  What's more, it uses 100% CPU during this time.
The raid6 thread doing your recovery is responsible for lots of stuff:
issuing reads, doing xor, fulfilling write requests, maintaining the
cache, etc.  It has to have time to actually do other work.  So start
with that 8GB/s figure, but immediately start subtracting from that
because the CPU needs to do other things as well.

Then remember that we are under *extreme* memory pressure.  When you
have to bring in 22 reads in order to reconstruct just 1 block of the
same size, then for 100MB/s of degraded reads you are generating
2200MB/s of PCI DMA -> MEM bandwidth consumption, followed by 2200MB/s
of MEM -> register load bandwidth consumption.  I'd have to read the
avx xor routine to know how much write bandwidth it is using, but it's
at least 100MB/s, and likely at least four or five times that much,
because it probably doesn't do all 22 blocks in a single xor pass.  It
likely loads parity, then reads up to maybe four blocks and xors them
together and then stores the parity, so each pass will re-read and
re-store the parity block.

The point of all of this is that people forget to do the math on the
memory bandwidth used by these XOR operations.  The faster they are,
the higher the percentage of main memory bandwidth you are consuming.
Now you have to subtract all of that main memory bandwidth from the
total main memory bandwidth for the CPU, and what's left over is all
you have for doing other productive work.  Even if you aren't blowing
your caches doing all of this XOR work, you are blowing your main
memory bandwidth.  Other threads or other actions end up stalling
waiting on main memory accesses to complete.
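
To put rough numbers on that, here is a minimal sketch of the traffic
estimate.  The 22 blocks per reconstruction and the 100MB/s rate are
the figures used above; the four blocks per xor pass is only the guess
made above, not something read out of the avx routine, and the function
name is hypothetical.

```python
import math


def degraded_read_traffic(rebuilt_mb_s, blocks_read, blocks_per_pass):
    """Estimate memory traffic in MB/s for rebuilding rebuilt_mb_s of data.

    Re-loads of the partial result between passes are not counted, so the
    true figure would be somewhat higher.
    """
    dma_in = rebuilt_mb_s * blocks_read      # device -> memory
    xor_loads = rebuilt_mb_s * blocks_read   # memory -> registers
    passes = math.ceil(blocks_read / blocks_per_pass)
    result_stores = rebuilt_mb_s * passes    # result stored once per pass
    return dma_in, xor_loads, passes, result_stores


dma_in, xor_loads, passes, result_stores = degraded_read_traffic(100, 22, 4)
print(f"DMA in:        {dma_in} MB/s")
print(f"xor loads:     {xor_loads} MB/s")
print(f"xor passes:    {passes}")
print(f"result stores: {result_stores} MB/s")
```

With these inputs the sketch gives 2200MB/s in each direction for the
source blocks and 600MB/s of result stores, so roughly 5GB/s of memory
traffic for 100MB/s of reconstructed data.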

> and probably assumes ideal cache behavior... so maybe actual XOR
> performance won't be as good as what dmesg suggests...

It will never be that good, and you can thank your stars that it isn't,
because if it were, your computer would be ground to a halt with nothing
happening but data XOR computations.

> but still, 200
> MB/s (or even 1000 MB/s, as I'm now getting), is much lower than 8000
> MB/s...

The math fits.  Most quad channel Intel CPUs have memory bandwidths in
the 50GByte/s range theoretical maximum, but it's not bidirectional,
it's not even multi-access, so you have to remember that the usage looks
like this on a good read:

copy 1: DMA from PCI bus to main memory
copy 2: Load from main memory to CPU for copy_to_user
copy 3: Store from CPU to main memory for user

To get 8GB/s of read performance undegraded then requires 24GB/s of
actual memory bandwidth just for the copies.  That's half of your entire
memory bandwidth (unless you have multiple sockets, then things get more
complex, but this is still true for one socket of the multiple socket
machine).  Once you add the XOR routine into the figure, the 3 accesses
are the same for part of it, but for degraded fixups, it is much worse.
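
As a quick check of that figure, a sketch using the three copies listed
above and the 50GByte/s theoretical maximum quoted for a quad channel
CPU (both numbers come from this message; nothing here is measured):

```python
def copy_traffic_gb_s(read_gb_s, copies=3):
    """Memory bandwidth consumed when every byte read is moved `copies` times."""
    return read_gb_s * copies


traffic = copy_traffic_gb_s(8)   # 8GB/s of undegraded reads
share = traffic / 50             # fraction of a 50GB/s theoretical maximum
print(f"copy traffic: {traffic} GB/s ({share:.0%} of 50 GB/s)")
```

That comes to 24GB/s, or 48% of the theoretical maximum, before any XOR
work is added.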

> Is it possible to pin kernel threads to a CPU?  I'm thinking I could
> reboot with isolcpus=2 (for example) and if I can force that md0_raid6
> thread to run on CPU 2, at least the L1/L2 caches should be minimally
> affected...

You could try that, but I doubt it will affect much.

>> Possible fixes for this might include:
>>         c) Creating a consolidated page cache/stripe cache (if we can read more
>> of the blocks needed to get our data from cache instead of disk it helps
>> reduce that IO ops issue)
> 
> I suppose this might be an explanation for why increasing the array's
> stripe_cache_size gave me such a boost?

Yes.  The default setting is conservative; you told it to use as much
memory as it needed.

>>         d) Rearchitecting your arrays into raid50 instead of big raid6 array
> 
> My colleague tested that exact same config with hardware raid5, and
> striped the three raid5 arrays together with software raid1.

That's a huge waste, are you sure he didn't use raid0 for the stripe?

>  So
> clearly not apples-to-apples, but he did get dramatically better
> degraded and rebuild performance.  I do intend to test a pure software
> raid-50 implementation.

I would try it.  If you are OK with single disk failures anyway.

>> (or conversely has the random head seeks just gone so
>> radically through the roof that the problem here really is the time it
>> takes the heads to travel everywhere we are sending them).
> 
> I'm certain head movement time isn't the issue, as these are SSDs.  :)

Fair enough ;-).  And given these are SSDs, I'd be just fine doing
something like four 6-disk raid5s then striped in a raid0 myself.  The
main cause for concern with spinning disks is latent bad sectors causing
a read error on rebuild; with SSDs that's much less of a concern.

> On Tue, Aug 23, 2016 at 1:27 PM, Doug Dumitru <doug@easyco.com> wrote:
>> Can you run an iostat during your degraded test, and also a top run
>> over 20+ seconds with kernel threads showing up.  Even better would be
>> a perf capture, but you might not have all the tools installed.  You
>> can always try:
>>
>> perf record -a sleep 20
>>
>> then
>>
>> perf report
>>
>> should show you the top functions globally over the 20 second sample.
>> If you don't have perf loaded, you might (or might not) be able to
>> load it from the distro.
> 
> Running top for 20 or more seconds, the top processes in terms of CPU
> usage are pretty static:
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  1228 root      20   0       0      0      0 R 100.0  0.0 562:16.83 md0_raid6
>  1315 root      20   0    4372    684    524 S  17.3  0.0  57:20.92 rngd
>   107 root      20   0       0      0      0 S   9.6  0.0  65:16.63 kswapd0
>   108 root      20   0       0      0      0 S   8.6  0.0  65:19.58 kswapd1
> 19424 root      20   0  108972   1676    560 D   3.3  0.0   0:00.52 dd
>  6909 root      20   0  108972   1676    560 D   2.7  0.0   0:01.53 dd
> 18383 root      20   0  108972   1680    560 D   2.7  0.0   0:00.63 dd
> 
> 
> I truncated the output.  The "dd" processes are part of our testing
> tool that generates the huge read load on the array.  Any given "dd"
> process might jump around, but those four kernel processes are always
> the top four.  (Note that before I increased the stripe_cache_size (as
> mentioned above), the md0_raid6 process was only consuming around 50%
> CPU.)

I would try to tune your stripe cache size such that the kswapd?
processes go to sleep.  Those are reading/writing swap.  That won't help
your overall performance.

> Here is a representative view of a non-first iteration of "iostat -mxt 5":
> 
> 
> 08/23/2016 01:37:59 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            4.84    0.00   27.41   67.59    0.00    0.17
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdy               0.00     0.40    0.80    0.60     0.05     0.00
> 83.43     0.00    1.00    0.50    1.67   1.00   0.14
> sdz               0.00     0.40    0.00    0.60     0.00     0.00
> 10.67     0.00    2.00    0.00    2.00   2.00   0.12
> sdd           12927.00     0.00  204.40    0.00    51.00     0.00
> 511.00     5.93   28.75   28.75    0.00   4.31  88.10

I'm not sure how much I trust some of these numbers.  According to this,
you are issuing 200 reads/s, at an average size of 511KB, which should
work out to roughly 100MB/s of data read, but rMB/s is only 51.  I
wonder if the read requests from the raid6 thread are bypassing the
rMB/s accounting because they aren't coming from the VFS or some such?
It would explain why the rMB/s is only half of what it should be based
upon requests and average request size.
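
The discrepancy can be checked against the sdd row quoted above.  This
is a sketch, not a statement about how the raid6 thread's requests are
accounted: it simply compares two readings of the avgrq-sz column.  The
iostat man page describes avgrq-sz as being in sectors, and the sketch
assumes 512-byte sectors and 1MB = 1024 * 1024 bytes.

```python
def read_mb_s(reads_per_s, avg_request_size, unit_bytes):
    """Throughput in MB/s given request rate, average size, and size unit."""
    return reads_per_s * avg_request_size * unit_bytes / (1024 * 1024)


# sdd row: r/s = 204.40, avgrq-sz = 511.00, reported rMB/s = 51.00
as_kb = read_mb_s(204.40, 511.00, 1024)      # avgrq-sz read as KB
as_sectors = read_mb_s(204.40, 511.00, 512)  # avgrq-sz read as 512-byte sectors
print(f"avgrq-sz as KB:      {as_kb:.1f} MB/s")
print(f"avgrq-sz as sectors: {as_sectors:.1f} MB/s")
```

Read as KB the row gives about 102MB/s; read as sectors it gives about
51MB/s, which agrees with the reported rMB/s.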

> sde           13002.60     0.00  205.20    0.00    51.20     0.00
> 511.00     6.29   30.39   30.39    0.00   4.59  94.12
> sdf           12976.80     0.00  205.00    0.00    51.00     0.00
> 509.50     6.17   29.76   29.76    0.00   4.57  93.78
> sdg           12950.20     0.00  205.60    0.00    50.80     0.00
> 506.03     6.20   29.75   29.75    0.00   4.57  93.88
> sdh           12949.00     0.00  207.20    0.00    50.90     0.00
> 503.11     6.36   30.35   30.35    0.00   4.59  95.10
> sdb           12196.40     0.00  192.60    0.00    48.10     0.00
> 511.47     5.48   28.15   28.15    0.00   4.38  84.36
> sda               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdi           12923.00     0.00  208.40    0.00    51.00     0.00
> 501.20     6.79   32.31   32.31    0.00   4.65  96.84
> sdj           12796.20     0.00  206.80    0.00    50.50     0.00
> 500.12     6.62   31.73   31.73    0.00   4.62  95.64
> sdk           12746.60     0.00  204.00    0.00    50.20     0.00
> 503.97     6.38   30.77   30.77    0.00   4.60  93.86
> sdl           12570.00     0.00  202.20    0.00    49.70     0.00
> 503.39     6.39   31.19   31.19    0.00   4.63  93.68
> sdn           12594.00     0.00  204.20    0.00    49.95     0.00
> 500.97     6.40   30.99   30.99    0.00   4.58  93.54
> sdm           12569.00     0.00  203.80    0.00    49.90     0.00
> 501.45     6.30   30.58   30.58    0.00   4.45  90.60
> sdp           12568.80     0.00  205.20    0.00    50.10     0.00
> 500.03     6.37   30.79   30.79    0.00   4.52  92.72
> sdo           12569.20     0.00  204.00    0.00    49.95     0.00
> 501.46     6.40   31.07   31.07    0.00   4.58  93.42
> sdw           12568.60     0.00  206.20    0.00    50.00     0.00
> 496.60     6.34   30.71   30.71    0.00   4.24  87.48
> sdx           12038.60     0.00  197.40    0.00    47.60     0.00
> 493.84     6.01   30.21   30.21    0.00   4.40  86.86
> sdq           12570.20     0.00  204.20    0.00    50.15     0.00
> 502.97     6.23   30.41   30.41    0.00   4.44  90.68
> sdr           12571.00     0.00  204.60    0.00    50.25     0.00
> 502.99     6.15   30.26   30.26    0.00   4.18  85.62
> sds           12495.20     0.00  203.80    0.00    49.95     0.00
> 501.95     6.00   29.62   29.62    0.00   4.24  86.38
> sdu           12695.60     0.00  207.80    0.00    50.65     0.00
> 499.17     6.22   30.00   30.00    0.00   4.16  86.38
> sdv           12619.00     0.00  207.80    0.00    50.35     0.00
> 496.22     6.23   30.03   30.03    0.00   4.20  87.32
> sdt           12671.20     0.00  206.20    0.00    50.50     0.00
> 501.56     6.05   29.30   29.30    0.00   4.24  87.44
> sdc           12851.60     0.00  203.00    0.00    50.70     0.00
> 511.50     5.84   28.49   28.49    0.00   4.17  84.64
> md126             0.00     0.00    0.60    1.00     0.05     0.00
> 71.00     0.00    0.00    0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    0.60    0.80     0.05     0.00
> 81.14     0.00    2.29    0.67    3.50   1.14   0.16
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md0               0.00     0.00 4475.20    0.00  1110.95     0.00
> 508.41     0.00    0.00    0.00    0.00   0.00   0.00
> 
> 
> sdy and sdz are the system drives, so they are uninteresting.
> 
> sda is the md0 drive I failed, that's why it stays at zero.
> 
> And lastly, here's the output of the perf commands you suggested (at
> least the top part):
> 
> Samples: 561K of event 'cycles', Event count (approx.): 318536644203
> Overhead  Command         Shared Object                 Symbol
>   52.85%  swapper         [kernel.kallsyms]             [k] cpu_startup_entry
>    4.47%  md0_raid6       [kernel.kallsyms]             [k] memcpy
>    3.39%  dd              [kernel.kallsyms]             [k] __find_stripe
>    2.50%  md0_raid6       [kernel.kallsyms]             [k] analyse_stripe
>    2.43%  dd              [kernel.kallsyms]             [k] _raw_spin_lock_irq
>    1.75%  rngd            rngd                          [.] 0x000000000000288b
>    1.74%  md0_raid6       [kernel.kallsyms]             [k] xor_avx_5
>    1.49%  dd              [kernel.kallsyms]             [k]
> copy_user_enhanced_fast_string
>    1.33%  md0_raid6       [kernel.kallsyms]             [k] ops_run_io
>    0.65%  dd              [kernel.kallsyms]             [k] raid5_compute_sector
>    0.60%  md0_raid6       [kernel.kallsyms]             [k] _raw_spin_lock_irq
>    0.55%  ps              libc-2.17.so                  [.] _IO_vfscanf
>    0.53%  ps              [kernel.kallsyms]             [k] vsnprintf
>    0.51%  ps              [kernel.kallsyms]             [k] format_decode
>    0.47%  ps              [kernel.kallsyms]             [k] number.isra.2
>    0.41%  md0_raid6       [kernel.kallsyms]             [k] raid_run_ops
>    0.40%  md0_raid6       [kernel.kallsyms]             [k] __blk_segment_map_sg
> 
> 
> That's my first time using the perf tool, so I need a little hand-holding here.

You might get more interesting perf results if you could pin the md
raid6 thread to a single CPU and then filter the perf results to just
that CPU.
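
One way to try the pinning without a reboot is to set the thread's CPU
affinity directly.  A minimal sketch, assuming a Linux host: the helper
name is made up for illustration, and the demo pins the script's own
process so it can be run without privileges.  Pinning md0_raid6 itself
(PID 1228 in the top output above) would need root, and the kernel may
refuse to change the affinity of some kernel threads.

```python
import os


def pin_to_cpu(pid, cpu):
    """Restrict `pid` to a single CPU and return its resulting affinity set.

    pid 0 means the calling process.
    """
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


# Demo on this process only; remember the original mask so it can be restored.
original = os.sched_getaffinity(0)
target = min(original)
print("pinned to:", pin_to_cpu(0, target))
os.sched_setaffinity(0, original)
```

perf record takes a CPU list with -C, so the capture can then be
limited to the CPU the thread was pinned to.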


-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: 0E572FDD



* Re: kernel checksumming performance vs actual raid device performance
From: Doug Dumitru @ 2016-08-23 19:41 UTC (permalink / raw)
  To: Matt Garman; +Cc: Doug Ledford, Mdadm
In-Reply-To: <CAJvUf-ApkKJXm7Jjiq=gXY9b9RrEvwA5u35xrMUjX2x0btVL4g@mail.gmail.com>

Matt,

So you are up at 1GB/sec, which is only 1/4 the degraded speed, but
1/2 the expected speed based on drive data transfers required.  This
is actually pretty good.

I should have mentioned the stripe cache parameter before, but I use
raid "differently" and stripe cache does not impact my use case.
Sorry.

The 1GB/sec saturating a core is probably as good as it gets.  This
core is doing a lot of stripe cache page manipulations which are not
all that fast.

Also, the single parity recovery case should be XOR and not the raid-6
logic, so it should be pretty cheap.  Another point, not important for
this issue, is that the benchmarks measure parity generation, not
recovery.  Recovery with raid-6 (i.e., two drives failed) is more
expensive than the writes.  I am not sure how optimized this is, but
it could be really bad.

If you need this to go faster, then it is either a raid re-design, or
perhaps you should consider cutting your array into two parts.  Two
12-drive raid-6 arrays will give you more bandwidth because the
failures are less "wide", so a single drive will only do 11 reads
instead of 22.  Plus you get the benefit of two raid-6 threads should
you have dead drives on both halves.  You can raid-0 the arrays
together.  Then again, you lose two drives' worth of space.

Doug


On Tue, Aug 23, 2016 at 12:26 PM, Matt Garman <matthew.garman@gmail.com> wrote:
> Doug & Doug,
>
> Thank you for your helpful replies.  I merged both of your posts into
> one, see inline comments below:
>
> On Tue, Aug 23, 2016 at 2:10 PM, Doug Ledford <dledford@redhat.com> wrote:
>> Of course.  I didn't mean to imply otherwise.  The read size is the read
>> size.  But, since the OPs test case was to "read random files" and not
>> "read random blocks of random files" I took it to mean it would be
>> sequential IO across a multitude of random files.  That assumption might
>> have been wrong, but I wrote my explanation with that in mind.
>
> Yes, multiple parallel sequential reads.  Our test program generates a
> bunch of big random files (file size has an approximately normal
> distribution, centered around 500 MB, going down to 100 MB or so, up
> to a few multi-GB outliers).  The file generation is a one-time thing,
> and we don't really care about its performance.
>
> The read testing program just randomly picks one of those files, then
> reads it start-to-finish using "dd".  But it kicks off several "dd"
> threads at once (currently 50, though this is a run-time parameter).
> This is how we generate the read load, and I use iostat while this is
> running to see how much read throughput I'm getting from the array.
>
>
> On Tue, Aug 23, 2016 at 1:00 PM, Doug Ledford <dledford@redhat.com> wrote:
>> This depends a lot on how you structured your raid array.  I didn't see
>> your earlier emails, so I'm inferring from the "one out of 22 reads will
>> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
>> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
>> as the basis for my next statement even if it's slightly wrong.
>
> Yes, that is exactly correct, here's the relevant part of /proc/mdstat:
>
> Personalities : [raid1] [raid6] [raid5] [raid4]
>
> md0 : active raid6 sdl[11] sdi[8] sdx[23] sdc[2] sdo[14] sdn[13]
> sdm[12] sdr[17] sdk[10] sdb[1] sdu[20] sdp[15] sdq[16] sds[18] sdt[19]
> sdw[22] sdv[21] sda[0](F) sdj[9] sde[4] sdd[3] sdf[5] sdh[7] sdg[6]
>
>       44005879808 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [24/23] [_UUUUUUUUUUUUUUUUUUUUUUU]
>
>       bitmap: 0/15 pages [0KB], 65536KB chunk
>
>
>> Doug was right in that you will have to read 21 data disks and 1 parity
>> disk to reconstruct reads from the missing block of any given stripe.
>> And while he is also correct that this doubles IO ops needed to get your
>> read data, it doesn't address the XOR load to get your data.  With 19
>> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
>> 20 64k data blocks for 1 result.  If you are getting 200MB/s, you are
>> actually achieving more like 390MB/s of data read, with 190MB/s of it
>> being direct reads, and then you are using XOR on 200MB/s in order to
>> generate the other 10MB/s of results.
>
> Most of this morning I've been setting/unsetting/changing various
> tunables, to see if I could increase the read speed.  I got a huge
> boost by increasing the /sys/block/md0/md/stripe_cache_size parameter
> from the default (256 IIRC) to 16384.  Doubling it again to 32k didn't
> seem to bring any further benefit.  So with the stripe_cache_size
> increased to 16k, I'm now getting around 1000 MB/s read in the
> degraded state.  When the degraded array was only doing 200 MB/s, the
> md0_raid6 process was taking about 50% CPU according to top.  Now I
> have a 5x increase in read speed, and md0_raid6 is taking 100% CPU.
> I'm still degraded by a factor of eight, though, where I'd expect only
> two.
>
>> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
>> routines, you can actually keep a CPU pretty busy with this.  Also, even
>> though the XOR routines try to time their assembly 'just so' so that
>> they can use the cache avoiding instructions, this fails more often than
>> not so you end up blowing CPU caches while doing this work, which of
>> course effects the overall system.
>
> While 200 MB/s of XOR sounds high, the kernel is "advertising" over
> 8000 MB/s, per dmesg:
>
> [    6.386820] xor: automatically using best checksumming function:
> [    6.396690]    avx       : 24064.000 MB/sec
> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [    6.499774] raid6: using avx2x2 recovery algorithm
>
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
>
> I'm assuming however the kernel does its testing is fairly optimal,
> and probably assumes ideal cache behavior... so maybe actual XOR
> performance won't be as good as what dmesg suggests... but still, 200
> MB/s (or even 1000 MB/s, as I'm now getting), is much lower than 8000
> MB/s...
>
> Is it possible to pin kernel threads to a CPU?  I'm thinking I could
> reboot with isolcpus=2 (for example) and if I can force that md0_raid6
> thread to run on CPU 2, at least the L1/L2 caches should be minimally
> affected...
>
>> Possible fixes for this might include:
>>         c) Creating a consolidated page cache/stripe cache (if we can read more
>> of the blocks needed to get our data from cache instead of disk it helps
>> reduce that IO ops issue)
>
> I suppose this might be an explanation for why increasing the array's
> stripe_cache_size gave me such a boost?
>
>>         d) Rearchitecting your arrays into raid50 instead of big raid6 array
>
> My colleague tested that exact same config with hardware raid5, and
> striped the three raid5 arrays together with software raid1.  So
> clearly not apples-to-apples, but he did get dramatically better
> degraded and rebuild performance.  I do intend to test a pure software
> raid-50 implementation.
>
>> (or conversely has the random head seeks just gone so
>> radically through the roof that the problem here really is the time it
>> takes the heads to travel everywhere we are sending them).
>
> I'm certain head movement time isn't the issue, as these are SSDs.  :)
>
> On Tue, Aug 23, 2016 at 1:27 PM, Doug Dumitru <doug@easyco.com> wrote:
>> Can you run an iostat during your degraded test, and also a top run
>> over 20+ seconds with kernel threads showing up.  Even better would be
>> a perf capture, but you might not have all the tools installed.  You
>> can always try:
>>
>> perf record -a sleep 20
>>
>> then
>>
>> perf report
>>
>> should show you the top functions globally over the 20 second sample.
>> If you don't have perf loaded, you might (or might not) be able to
>> load it from the distro.
>
> Running top for 20 or more seconds, the top processes in terms of CPU
> usage are pretty static:
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  1228 root      20   0       0      0      0 R 100.0  0.0 562:16.83 md0_raid6
>  1315 root      20   0    4372    684    524 S  17.3  0.0  57:20.92 rngd
>   107 root      20   0       0      0      0 S   9.6  0.0  65:16.63 kswapd0
>   108 root      20   0       0      0      0 S   8.6  0.0  65:19.58 kswapd1
> 19424 root      20   0  108972   1676    560 D   3.3  0.0   0:00.52 dd
>  6909 root      20   0  108972   1676    560 D   2.7  0.0   0:01.53 dd
> 18383 root      20   0  108972   1680    560 D   2.7  0.0   0:00.63 dd
>
>
> I truncated the output.  The "dd" processes are part of our testing
> tool that generates the huge read load on the array.  Any given "dd"
> process might jump around, but those four kernel processes are always
> the top four.  (Note that before I increased the stripe_cache_size (as
> mentioned above), the md0_raid6 process was only consuming around 50%
> CPU.)
>
> Here is a representative view of a non-first iteration of "iostat -mxt 5":
>
>
> 08/23/2016 01:37:59 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            4.84    0.00   27.41   67.59    0.00    0.17
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdy               0.00     0.40    0.80    0.60     0.05     0.00
> 83.43     0.00    1.00    0.50    1.67   1.00   0.14
> sdz               0.00     0.40    0.00    0.60     0.00     0.00
> 10.67     0.00    2.00    0.00    2.00   2.00   0.12
> sdd           12927.00     0.00  204.40    0.00    51.00     0.00
> 511.00     5.93   28.75   28.75    0.00   4.31  88.10
> sde           13002.60     0.00  205.20    0.00    51.20     0.00
> 511.00     6.29   30.39   30.39    0.00   4.59  94.12
> sdf           12976.80     0.00  205.00    0.00    51.00     0.00
> 509.50     6.17   29.76   29.76    0.00   4.57  93.78
> sdg           12950.20     0.00  205.60    0.00    50.80     0.00
> 506.03     6.20   29.75   29.75    0.00   4.57  93.88
> sdh           12949.00     0.00  207.20    0.00    50.90     0.00
> 503.11     6.36   30.35   30.35    0.00   4.59  95.10
> sdb           12196.40     0.00  192.60    0.00    48.10     0.00
> 511.47     5.48   28.15   28.15    0.00   4.38  84.36
> sda               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdi           12923.00     0.00  208.40    0.00    51.00     0.00
> 501.20     6.79   32.31   32.31    0.00   4.65  96.84
> sdj           12796.20     0.00  206.80    0.00    50.50     0.00
> 500.12     6.62   31.73   31.73    0.00   4.62  95.64
> sdk           12746.60     0.00  204.00    0.00    50.20     0.00
> 503.97     6.38   30.77   30.77    0.00   4.60  93.86
> sdl           12570.00     0.00  202.20    0.00    49.70     0.00
> 503.39     6.39   31.19   31.19    0.00   4.63  93.68
> sdn           12594.00     0.00  204.20    0.00    49.95     0.00
> 500.97     6.40   30.99   30.99    0.00   4.58  93.54
> sdm           12569.00     0.00  203.80    0.00    49.90     0.00
> 501.45     6.30   30.58   30.58    0.00   4.45  90.60
> sdp           12568.80     0.00  205.20    0.00    50.10     0.00
> 500.03     6.37   30.79   30.79    0.00   4.52  92.72
> sdo           12569.20     0.00  204.00    0.00    49.95     0.00
> 501.46     6.40   31.07   31.07    0.00   4.58  93.42
> sdw           12568.60     0.00  206.20    0.00    50.00     0.00
> 496.60     6.34   30.71   30.71    0.00   4.24  87.48
> sdx           12038.60     0.00  197.40    0.00    47.60     0.00
> 493.84     6.01   30.21   30.21    0.00   4.40  86.86
> sdq           12570.20     0.00  204.20    0.00    50.15     0.00
> 502.97     6.23   30.41   30.41    0.00   4.44  90.68
> sdr           12571.00     0.00  204.60    0.00    50.25     0.00
> 502.99     6.15   30.26   30.26    0.00   4.18  85.62
> sds           12495.20     0.00  203.80    0.00    49.95     0.00
> 501.95     6.00   29.62   29.62    0.00   4.24  86.38
> sdu           12695.60     0.00  207.80    0.00    50.65     0.00
> 499.17     6.22   30.00   30.00    0.00   4.16  86.38
> sdv           12619.00     0.00  207.80    0.00    50.35     0.00
> 496.22     6.23   30.03   30.03    0.00   4.20  87.32
> sdt           12671.20     0.00  206.20    0.00    50.50     0.00
> 501.56     6.05   29.30   29.30    0.00   4.24  87.44
> sdc           12851.60     0.00  203.00    0.00    50.70     0.00
> 511.50     5.84   28.49   28.49    0.00   4.17  84.64
> md126             0.00     0.00    0.60    1.00     0.05     0.00
> 71.00     0.00    0.00    0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    0.60    0.80     0.05     0.00
> 81.14     0.00    2.29    0.67    3.50   1.14   0.16
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md0               0.00     0.00 4475.20    0.00  1110.95     0.00
> 508.41     0.00    0.00    0.00    0.00   0.00   0.00
>
>
> sdy and sdz are the system drives, so they are uninteresting.
>
> sda is the md0 drive I failed, that's why it stays at zero.
>
> And lastly, here's the output of the perf commands you suggested (at
> least the top part):
>
> Samples: 561K of event 'cycles', Event count (approx.): 318536644203
> Overhead  Command         Shared Object                 Symbol
>   52.85%  swapper         [kernel.kallsyms]             [k] cpu_startup_entry
>    4.47%  md0_raid6       [kernel.kallsyms]             [k] memcpy
>    3.39%  dd              [kernel.kallsyms]             [k] __find_stripe
>    2.50%  md0_raid6       [kernel.kallsyms]             [k] analyse_stripe
>    2.43%  dd              [kernel.kallsyms]             [k] _raw_spin_lock_irq
>    1.75%  rngd            rngd                          [.] 0x000000000000288b
>    1.74%  md0_raid6       [kernel.kallsyms]             [k] xor_avx_5
>    1.49%  dd              [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
>    1.33%  md0_raid6       [kernel.kallsyms]             [k] ops_run_io
>    0.65%  dd              [kernel.kallsyms]             [k] raid5_compute_sector
>    0.60%  md0_raid6       [kernel.kallsyms]             [k] _raw_spin_lock_irq
>    0.55%  ps              libc-2.17.so                  [.] _IO_vfscanf
>    0.53%  ps              [kernel.kallsyms]             [k] vsnprintf
>    0.51%  ps              [kernel.kallsyms]             [k] format_decode
>    0.47%  ps              [kernel.kallsyms]             [k] number.isra.2
>    0.41%  md0_raid6       [kernel.kallsyms]             [k] raid_run_ops
>    0.40%  md0_raid6       [kernel.kallsyms]             [k] __blk_segment_map_sg
>
>
> That's my first time using the perf tool, so I need a little hand-holding here.
>
> Thanks again all!
> Matt



-- 
Doug Dumitru
EasyCo LLC


* Re: kernel checksumming performance vs actual raid device performance
From: Doug Ledford @ 2016-08-23 19:26 UTC (permalink / raw)
  To: doug; +Cc: Matt Garman, Mdadm
In-Reply-To: <CAFx4rwTwZPs5gzDXBc+fLVLCi6RW2=uWC1TaK7uNfJ48MxzHWQ@mail.gmail.com>



On 8/23/2016 3:19 PM, Doug Dumitru wrote:
> Mr. Ledford,
> 
> I am glad that we are in agreement.  My issue is that if the customer
> is reading 4GB/sec with a non-degraded array, the degraded array
> should only have 2X the number of IOs and 2X the transfer sizes to the
> drives.  If the data rate falls to 1GB, I can suspect cpu overhead.
> With this case falling to 200MB/sec, then something else is going on.
> 
> SSDs tend to be very "flat" reading from q=1 up to about q=20 assuming
> the HBAs can keep up.

Is he using SSDs?  If so, I missed that bit.  I wrote my response
assuming rotating media.




-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: 0E572FDD




* Re: kernel checksumming performance vs actual raid device performance
From: Matt Garman @ 2016-08-23 19:26 UTC (permalink / raw)
  To: Doug Dumitru; +Cc: Doug Ledford, Mdadm
In-Reply-To: <CAFx4rwT0jt9NCu4imruPUhfAR71=cvHwx-kdZxoTniZaQcPByQ@mail.gmail.com>

Doug & Doug,

Thank you for your helpful replies.  I merged both of your posts into
one, see inline comments below:

On Tue, Aug 23, 2016 at 2:10 PM, Doug Ledford <dledford@redhat.com> wrote:
> Of course.  I didn't mean to imply otherwise.  The read size is the read
> size.  But, since the OP's test case was to "read random files" and not
> "read random blocks of random files" I took it to mean it would be
> sequential IO across a multitude of random files.  That assumption might
> have been wrong, but I wrote my explanation with that in mind.

Yes, multiple parallel sequential reads.  Our test program generates a
bunch of big random files (file size has an approximately normal
distribution, centered around 500 MB, going down to 100 MB or so, up
to a few multi-GB outliers).  The file generation is a one-time thing,
and we don't really care about its performance.

The read testing program just randomly picks one of those files, then
reads it start-to-finish using "dd".  But it kicks off several "dd"
threads at once (currently 50, though this is a run-time parameter).
This is how we generate the read load, and I use iostat while this is
running to see how much read throughput I'm getting from the array.


On Tue, Aug 23, 2016 at 1:00 PM, Doug Ledford <dledford@redhat.com> wrote:
> This depends a lot on how you structured your raid array.  I didn't see
> your earlier emails, so I'm inferring from the "one out of 22 reads will
> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
> as the basis for my next statement even if it's slightly wrong.

Yes, that is exactly correct, here's the relevant part of /proc/mdstat:

Personalities : [raid1] [raid6] [raid5] [raid4]

md0 : active raid6 sdl[11] sdi[8] sdx[23] sdc[2] sdo[14] sdn[13] sdm[12] sdr[17] sdk[10] sdb[1] sdu[20] sdp[15] sdq[16] sds[18] sdt[19] sdw[22] sdv[21] sda[0](F) sdj[9] sde[4] sdd[3] sdf[5] sdh[7] sdg[6]
      44005879808 blocks super 1.2 level 6, 512k chunk, algorithm 2 [24/23] [_UUUUUUUUUUUUUUUUUUUUUUU]
      bitmap: 0/15 pages [0KB], 65536KB chunk


> Doug was right in that you will have to read 21 data disks and 1 parity
> disk to reconstruct reads from the missing block of any given stripe.
> And while he is also correct that this doubles IO ops needed to get your
> read data, it doesn't address the XOR load to get your data.  With 19
> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
> 20 64k data blocks for 1 result.  If you are getting 200MB/s, you are
> actually achieving more like 390MB/s of data read, with 190MB/s of it
> being direct reads, and then you are using XOR on 200MB/s in order to
> generate the other 10MB/s of results.

Most of this morning I've been setting/unsetting/changing various
tunables, to see if I could increase the read speed.  I got a huge
boost by increasing the /sys/block/md0/md/stripe_cache_size parameter
from the default (256 IIRC) to 16384.  Doubling it again to 32k didn't
seem to bring any further benefit.  So with the stripe_cache_size
increased to 16k, I'm now getting around 1000 MB/s read in the
degraded state.  When the degraded array was only doing 200 MB/s, the
md0_raid6 process was taking about 50% CPU according to top.  Now I
have a 5x increase in read speed, and md0_raid6 is taking 100% CPU.
I'm still degraded by a factor of eight, though, where I'd expect only
two.
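
For scale, the md documentation gives the stripe cache's memory footprint
as page size x number of member devices x stripe_cache_size.  A minimal
sketch of that arithmetic, assuming 4 KiB pages (not stated in this
thread) and the 24-device array from the mdstat output above; the
function name is illustrative only:

```python
def stripe_cache_bytes(stripe_cache_size, nr_disks, page_size=4096):
    """Stripe cache memory: one page per member device per cache entry."""
    return stripe_cache_size * nr_disks * page_size

MIB = 1024 * 1024

# 256 is the default recalled above; 16384 and 32768 are the values tried.
for entries in (256, 16384, 32768):
    mib = stripe_cache_bytes(entries, nr_disks=24) // MIB
    print(f"stripe_cache_size={entries:>5}: {mib} MiB")
```

Under those assumptions the jump from 256 to 16384 entries grows the
cache from tens of MiB to about 1.5 GiB.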

> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
> routines, you can actually keep a CPU pretty busy with this.  Also, even
> though the XOR routines try to time their assembly 'just so' so that
> they can use the cache avoiding instructions, this fails more often than
> not so you end up blowing CPU caches while doing this work, which of
> course affects the overall system.

While 200 MB/s of XOR sounds high, the kernel is "advertising" over
8000 MB/s, per dmesg:

[    6.386820] xor: automatically using best checksumming function:
[    6.396690]    avx       : 24064.000 MB/sec
[    6.414706] raid6: sse2x1   gen()  7636 MB/s
[    6.431725] raid6: sse2x2   gen()  3656 MB/s
[    6.448742] raid6: sse2x4   gen()  3917 MB/s
[    6.465753] raid6: avx2x1   gen()  5425 MB/s
[    6.482766] raid6: avx2x2   gen()  7593 MB/s
[    6.499773] raid6: avx2x4   gen()  8648 MB/s
[    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
[    6.499774] raid6: using avx2x2 recovery algorithm

(CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)

I'm assuming the kernel's own benchmark runs under fairly optimal conditions,
and probably assumes ideal cache behavior... so maybe actual XOR
performance won't be as good as what dmesg suggests... but still, 200
MB/s (or even 1000 MB/s, as I'm now getting), is much lower than 8000
MB/s...
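
To put rough numbers on that gap, here is a back-of-envelope sketch of
how much data the XOR routine would have to consume at a given delivered
read rate.  It assumes data chunks are spread evenly over all 24 devices
(so 1/24 of the requested data sat on the failed drive) and that each
rebuilt chunk is the XOR of the 21 surviving data chunks plus P.  The
function name and the model are illustrative, not taken from the md code:

```python
from fractions import Fraction

def xor_input_mb_per_s(delivered_mb_per_s, n_devices=24, n_parity=2):
    """Estimated MB/s fed into XOR to sustain a delivered read rate."""
    # Share of the requested data that lived on the failed drive.
    rebuilt = Fraction(delivered_mb_per_s) / n_devices
    # Each rebuilt chunk XORs the surviving data chunks plus P.
    sources_per_chunk = n_devices - n_parity
    return float(rebuilt * sources_per_chunk)

# Benchmark figures copied from the dmesg output above.
XOR_AVX_MB_S = 24064.0
RAID6_GEN_MB_S = 8648.0

for delivered in (200, 1000):
    need = xor_input_mb_per_s(delivered)
    print(f"{delivered} MB/s delivered -> ~{need:.0f} MB/s of XOR input "
          f"({100 * need / XOR_AVX_MB_S:.1f}% of the avx xor figure, "
          f"{100 * need / RAID6_GEN_MB_S:.1f}% of avx2x4 gen())")
```

Under this model the XOR input stays below 1000 MB/s even at the
improved read rate, so raw XOR speed alone would not explain the drop.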

Is it possible to pin kernel threads to a CPU?  I'm thinking I could
reboot with isolcpus=2 (for example) and if I can force that md0_raid6
thread to run on CPU 2, at least the L1/L2 caches should be minimally
affected...

> Possible fixes for this might include:
>         c) Creating a consolidated page cache/stripe cache (if we can read more
> of the blocks needed to get our data from cache instead of disk it helps
> reduce that IO ops issue)

I suppose this might be an explanation for why increasing the array's
stripe_cache_size gave me such a boost?

>         d) Rearchitecting your arrays into raid50 instead of big raid6 array

My colleague tested that exact same config with hardware raid5, and
striped the three raid5 arrays together with software raid0.  So
clearly not apples-to-apples, but he did get dramatically better
degraded and rebuild performance.  I do intend to test a pure software
raid-50 implementation.

> (or conversely has the random head seeks just gone so
> radically through the roof that the problem here really is the time it
> takes the heads to travel everywhere we are sending them).

I'm certain head movement time isn't the issue, as these are SSDs.  :)

On Tue, Aug 23, 2016 at 1:27 PM, Doug Dumitru <doug@easyco.com> wrote:
> Can you run an iostat during your degraded test, and also a top run
> over 20+ seconds with kernel threads showing up.  Even better would be
> a perf capture, but you might not have all the tools installed.  You
> can always try:
>
> perf record -a sleep 20
>
> then
>
> perf report
>
> should show you the top functions globally over the 20 second sample.
> If you don't have perf loaded, you might (or might not) be able to
> load it from the distro.

Running top for 20 or more seconds, the top processes in terms of CPU
usage are pretty static:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1228 root      20   0       0      0      0 R 100.0  0.0 562:16.83 md0_raid6
 1315 root      20   0    4372    684    524 S  17.3  0.0  57:20.92 rngd
  107 root      20   0       0      0      0 S   9.6  0.0  65:16.63 kswapd0
  108 root      20   0       0      0      0 S   8.6  0.0  65:19.58 kswapd1
19424 root      20   0  108972   1676    560 D   3.3  0.0   0:00.52 dd
 6909 root      20   0  108972   1676    560 D   2.7  0.0   0:01.53 dd
18383 root      20   0  108972   1680    560 D   2.7  0.0   0:00.63 dd


I truncated the output.  The "dd" processes are part of our testing
tool that generates the huge read load on the array.  Any given "dd"
process might jump around, but those four kernel processes are always
the top four.  (Note that before I increased the stripe_cache_size (as
mentioned above), the md0_raid6 process was only consuming around 50%
CPU.)

Here is a representative view of a non-first iteration of "iostat -mxt 5":


08/23/2016 01:37:59 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.84    0.00   27.41   67.59    0.00    0.17

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdy               0.00     0.40    0.80    0.60     0.05     0.00    83.43     0.00    1.00    0.50    1.67   1.00   0.14
sdz               0.00     0.40    0.00    0.60     0.00     0.00    10.67     0.00    2.00    0.00    2.00   2.00   0.12
sdd           12927.00     0.00  204.40    0.00    51.00     0.00   511.00     5.93   28.75   28.75    0.00   4.31  88.10
sde           13002.60     0.00  205.20    0.00    51.20     0.00   511.00     6.29   30.39   30.39    0.00   4.59  94.12
sdf           12976.80     0.00  205.00    0.00    51.00     0.00   509.50     6.17   29.76   29.76    0.00   4.57  93.78
sdg           12950.20     0.00  205.60    0.00    50.80     0.00   506.03     6.20   29.75   29.75    0.00   4.57  93.88
sdh           12949.00     0.00  207.20    0.00    50.90     0.00   503.11     6.36   30.35   30.35    0.00   4.59  95.10
sdb           12196.40     0.00  192.60    0.00    48.10     0.00   511.47     5.48   28.15   28.15    0.00   4.38  84.36
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdi           12923.00     0.00  208.40    0.00    51.00     0.00   501.20     6.79   32.31   32.31    0.00   4.65  96.84
sdj           12796.20     0.00  206.80    0.00    50.50     0.00   500.12     6.62   31.73   31.73    0.00   4.62  95.64
sdk           12746.60     0.00  204.00    0.00    50.20     0.00   503.97     6.38   30.77   30.77    0.00   4.60  93.86
sdl           12570.00     0.00  202.20    0.00    49.70     0.00   503.39     6.39   31.19   31.19    0.00   4.63  93.68
sdn           12594.00     0.00  204.20    0.00    49.95     0.00   500.97     6.40   30.99   30.99    0.00   4.58  93.54
sdm           12569.00     0.00  203.80    0.00    49.90     0.00   501.45     6.30   30.58   30.58    0.00   4.45  90.60
sdp           12568.80     0.00  205.20    0.00    50.10     0.00   500.03     6.37   30.79   30.79    0.00   4.52  92.72
sdo           12569.20     0.00  204.00    0.00    49.95     0.00   501.46     6.40   31.07   31.07    0.00   4.58  93.42
sdw           12568.60     0.00  206.20    0.00    50.00     0.00   496.60     6.34   30.71   30.71    0.00   4.24  87.48
sdx           12038.60     0.00  197.40    0.00    47.60     0.00   493.84     6.01   30.21   30.21    0.00   4.40  86.86
sdq           12570.20     0.00  204.20    0.00    50.15     0.00   502.97     6.23   30.41   30.41    0.00   4.44  90.68
sdr           12571.00     0.00  204.60    0.00    50.25     0.00   502.99     6.15   30.26   30.26    0.00   4.18  85.62
sds           12495.20     0.00  203.80    0.00    49.95     0.00   501.95     6.00   29.62   29.62    0.00   4.24  86.38
sdu           12695.60     0.00  207.80    0.00    50.65     0.00   499.17     6.22   30.00   30.00    0.00   4.16  86.38
sdv           12619.00     0.00  207.80    0.00    50.35     0.00   496.22     6.23   30.03   30.03    0.00   4.20  87.32
sdt           12671.20     0.00  206.20    0.00    50.50     0.00   501.56     6.05   29.30   29.30    0.00   4.24  87.44
sdc           12851.60     0.00  203.00    0.00    50.70     0.00   511.50     5.84   28.49   28.49    0.00   4.17  84.64
md126             0.00     0.00    0.60    1.00     0.05     0.00    71.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.60    0.80     0.05     0.00    81.14     0.00    2.29    0.67    3.50   1.14   0.16
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00 4475.20    0.00  1110.95     0.00   508.41     0.00    0.00    0.00    0.00   0.00   0.00


sdy and sdz are the system drives, so they are uninteresting.

sda is the md0 drive I failed, that's why it stays at zero.
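
Two quick checks can be read off that sample; the sketch below is only
arithmetic on the numbers above.  It assumes avgrq-sz is reported in
512-byte sectors and that rrqm/s + r/s approximates the number of
individual read bios submitted per second (both are the usual reading
of iostat -x, but treat them as assumptions here):

```python
# rMB/s of the 23 surviving md0 members, in the order listed above
# (sdd, sde, sdf, sdg, sdh, sdb, sdi ... sdt, sdc); sda is failed.
member_rmb = [
    51.00, 51.20, 51.00, 50.80, 50.90, 48.10, 51.00, 50.50,
    50.20, 49.70, 49.95, 49.90, 50.10, 49.95, 50.00, 47.60,
    50.15, 50.25, 49.95, 50.65, 50.35, 50.50, 50.70,
]
md0_rmb = 1110.95

total = sum(member_rmb)
ratio = total / md0_rmb
print(f"members: {total:.2f} MB/s, md0: {md0_rmb:.2f} MB/s, ratio {ratio:.3f}")

# sdd row: merged reads/s, issued reads/s, average request size (sectors)
rrqm, rps, avgrq_sectors = 12927.00, 204.40, 511.00
bios_per_request = (rrqm + rps) / rps
bytes_per_bio = avgrq_sectors * 512 / bios_per_request
print(f"~{bios_per_request:.0f} bios merged per request, "
      f"~{bytes_per_bio:.0f} bytes each")
```

On these figures the members together read only about 4% more than md0
delivers, and each request looks like roughly 64 merged pieces of about
4 KiB, which would be consistent with page-sized reads being issued and
then merged by the block layer.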

And lastly, here's the output of the perf commands you suggested (at
least the top part):

Samples: 561K of event 'cycles', Event count (approx.): 318536644203
Overhead  Command         Shared Object                 Symbol
  52.85%  swapper         [kernel.kallsyms]             [k] cpu_startup_entry
   4.47%  md0_raid6       [kernel.kallsyms]             [k] memcpy
   3.39%  dd              [kernel.kallsyms]             [k] __find_stripe
   2.50%  md0_raid6       [kernel.kallsyms]             [k] analyse_stripe
   2.43%  dd              [kernel.kallsyms]             [k] _raw_spin_lock_irq
   1.75%  rngd            rngd                          [.] 0x000000000000288b
   1.74%  md0_raid6       [kernel.kallsyms]             [k] xor_avx_5
   1.49%  dd              [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
   1.33%  md0_raid6       [kernel.kallsyms]             [k] ops_run_io
   0.65%  dd              [kernel.kallsyms]             [k] raid5_compute_sector
   0.60%  md0_raid6       [kernel.kallsyms]             [k] _raw_spin_lock_irq
   0.55%  ps              libc-2.17.so                  [.] _IO_vfscanf
   0.53%  ps              [kernel.kallsyms]             [k] vsnprintf
   0.51%  ps              [kernel.kallsyms]             [k] format_decode
   0.47%  ps              [kernel.kallsyms]             [k] number.isra.2
   0.41%  md0_raid6       [kernel.kallsyms]             [k] raid_run_ops
   0.40%  md0_raid6       [kernel.kallsyms]             [k] __blk_segment_map_sg
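
One way to read a report like this is to total the overhead per command.
The sketch below does that for the rows shown above only, so the sums
are lower bounds; swapper/cpu_startup_entry is the idle loop.  The row
data is copied by hand from the listing and nothing here calls perf:

```python
from collections import defaultdict

# (overhead %, command) for each row of the perf report excerpt above
rows = [
    (52.85, "swapper"), (4.47, "md0_raid6"), (3.39, "dd"),
    (2.50, "md0_raid6"), (2.43, "dd"), (1.75, "rngd"),
    (1.74, "md0_raid6"), (1.49, "dd"), (1.33, "md0_raid6"),
    (0.65, "dd"), (0.60, "md0_raid6"), (0.55, "ps"),
    (0.53, "ps"), (0.51, "ps"), (0.47, "ps"),
    (0.41, "md0_raid6"), (0.40, "md0_raid6"),
]

by_command = defaultdict(float)
for pct, command in rows:
    by_command[command] += pct

# Print commands from the largest listed share to the smallest.
for command, pct in sorted(by_command.items(), key=lambda kv: -kv[1]):
    print(f"{command:<10} {pct:6.2f}%")
```

In this excerpt xor_avx_5 is only 1.74 of the roughly 11.45 points
attributed to md0_raid6; memcpy, analyse_stripe, ops_run_io and lock
handling make up most of the rest.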


That's my first time using the perf tool, so I need a little hand-holding here.

Thanks again all!
Matt


* Re: kernel checksumming performance vs actual raid device performance
From: Doug Dumitru @ 2016-08-23 19:19 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Matt Garman, Mdadm
In-Reply-To: <5416db5c-2d2b-8cc5-b477-604e8ccf0707@redhat.com>

Mr. Ledford,

I am glad that we are in agreement.  My issue is that if the customer
is reading 4GB/sec with a non-degraded array, the degraded array
should only have 2X the number of IOs and 2X the transfer sizes to the
drives.  If the data rate falls to 1GB/sec, I can suspect cpu overhead.
With this case falling to 200MB/sec, then something else is going on.
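
The "2X" figure can be sanity-checked with a small model.  This is a
sketch under stated assumptions, not a description of what md actually
does: reads are independent and chunk-sized, data chunks are spread
evenly over the 24 devices, and a chunk that sat on the failed drive
costs one read from each of the 21 surviving data chunks plus P.  The
function name is illustrative:

```python
from fractions import Fraction

def expected_member_reads(n_devices=24, n_parity=2):
    """Expected member-device reads per requested chunk, one drive failed."""
    p_failed = Fraction(1, n_devices)      # chunk lived on the failed drive
    rebuild_reads = n_devices - n_parity   # 21 surviving data chunks + P
    return (1 - p_failed) * 1 + p_failed * rebuild_reads

reads = expected_member_reads()
print(f"expected reads per chunk: {reads} (about {float(reads):.3f})")
```

For large sequential reads most of those surviving chunks would be read
anyway, so the extra I/O could be well under 2X in that case.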

SSDs tend to be very "flat" reading from q=1 up to about q=20 assuming
the HBAs can keep up.

Then again, 4GB/sec is actually pretty good for a real array with a file system.

In thinking more about this, it is possible that the raid layer is
passing all of the read overhead for the degraded read to the single
raid5 background thread.  200MB/sec after the overhead of populating
stripe pages is then very believable.  My write testing with raid-5
shows that the stripe cache and single thread doing computes can lower
linear write throughput from 10GB/sec (raid-5) or 8GB/sec (raid-6)
down to under 1.5GB/sec.  Getting to 10 or 8 GB/sec requires patches
to raid5.c bypassing the stripe cache and background thread for
"perfect writes" (writes that are exactly an array stripe in a single
BIO).

The whole raid design is intended to keep locks low.  In looking at
SSD performance, perhaps this needs to be rethought so that processing
can more effectively use multi-cores and deep queue depths.



Doug


On Tue, Aug 23, 2016 at 12:10 PM, Doug Ledford <dledford@redhat.com> wrote:
> On 8/23/2016 2:27 PM, Doug Dumitru wrote:
>> Mr. Ledford,
>>
>> I think your explanation of RAID "dirty" read performance is a bit off.
>>
>> If you have 64KB chunks, this describes the layout.  I don't think
>> this also requires 64K reads.  I know that this is true with RAID-5,
>> and I am pretty sure it applies to raid-6 as well.  So if you do 4K
>> reads, you should see 4K reads to all the member drives.
>
> Of course.  I didn't mean to imply otherwise.  The read size is the read
> size.  But, since the OP's test case was to "read random files" and not
> "read random blocks of random files" I took it to mean it would be
> sequential IO across a multitude of random files.  That assumption might
> have been wrong, but I wrote my explanation with that in mind.
>
>> You can verify this pretty easily with iostat.
>>
>> Mr. Garman,
>>
>> Your results are a lot worse than expected.  I always assume that a
>> raid "dirty" read will try to hit the disk hard.  This implies issuing
>> the 22 read requests in parallel.  This is how "SSD" folks think.  It
>> is possible that this code is old enough to be in an HDD "mindset" and
>> that the requests are issued sequentially.  If so, then this is
>> something to "fix" in the raid code (I use the term fix here loosely
>> as this is not really a bug).
>>
>> Can you run an iostat during your degraded test, and also a top run
>> over 20+ seconds with kernel threads showing up.  Even better would be
>> a perf capture, but you might not have all the tools installed.  You
>> can always try:
>>
>> perf record -a sleep 20
>>
>> then
>>
>> perf report
>>
>> should show you the top functions globally over the 20 second sample.
>> If you don't have perf loaded, you might (or might not) be able to
>> load it from the distro.
>>
>> Doug
>>
>>
>> On Tue, Aug 23, 2016 at 11:00 AM, Doug Ledford <dledford@redhat.com> wrote:
>>> On 8/23/2016 10:54 AM, Matt Garman wrote:
>>>> On Tue, Aug 16, 2016 at 11:36 AM, Doug Dumitru <doug@easyco.com> wrote:
>>>>> The RAID rebuild for a single bad drive "should" be an XOR and should run at
>>>>> 200,000 KB/sec (the default speed_limit_max).  I might be wrong on this and
>>>>> this might still need a full RAID-6 syndrome compute, but I don't think so.
>>>>>
>>>>> The rebuild might not hit 200MB/sec if the drive you replaced is
>>>>> "conditioned".  Be sure to secure erase any non-new drive before you replace
>>>>> it.
>>>>>
>>>>> Your read IOPS will compete with now busy drives which may increase the IO
>>>>> latency a lot, and slow you down a lot.
>>>>>
>>>>> One out of 22 read OPS will be to the bad drive, so this will now take 22
>>>>> reads to re-construct the IO.  The reconstruction is XOR, so pretty cheap
>>>>> from a CPU point of view.  Regardless, your IOPS total will double.
>>>>>
>>>>> You can probably mitigate the amount of degradation by lowering the rebuild
>>>>> speed, but this will make the rebuild take longer, so you are messed up
>>>>> either way.  If the server has "down time" at night, you might lower the
>>>>> rebuild to a really small value during the day, and up it at night.
>>>>
>>>> OK, right now I'm looking purely at performance in a degraded state,
>>>> no rebuild taking place.
>>>>
>>>> We have designed a simple read load test to simulate the actual
>>>> production workload.  (It's not perfect of course, but a reasonable
>>>> approximation.  I can share with the list if there's interest.)  But
>>>> basically it just runs multiple threads of reading random files
>>>> continuously.
>>>>
>>>> When the array is in a pristine state, we can achieve read throughput
>>>> of 8000 MB/sec (at the array level, per iostat with 5 second samples).
>>>>
>>>> Now I failed a single drive.  Running the same test, read performance
>>>> drops all the way down to 200 MB/sec.
>>>>
>>>> I understand that IOPS should double, which to me says we should
>>>> expect a roughly 50% read performance drop (napkin math).  But this is
>>>> a drop of over 95%.
>>>>
>>>> Again, this is with no rebuild taking place...
>>>>
>>>> Thoughts?
>>>
>>> This depends a lot on how you structured your raid array.  I didn't see
>>> your earlier emails, so I'm inferring from the "one out of 22 reads will
>>> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
>>> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
>>> as the basis for my next statement even if it's slightly wrong.
>>>
>>> Doug was right in that you will have to read 21 data disks and 1 parity
>>> disk to reconstruct reads from the missing block of any given stripe.
>>> And while he is also correct that this doubles IO ops needed to get your
>>> read data, it doesn't address the XOR load to get your data.  With 19
>>> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
>>> 20 64k data blocks for 1 result.  If you are getting 200MB/s, you are
>>> actually achieving more like 390MB/s of data read, with 190MB/s of it
>>> being direct reads, and then you are using XOR on 200MB/s in order to
>>> generate the other 10MB/s of results.
>>>
>>> The question of why that performance is so bad is probably (and I say
>>> probably because without actually testing it this is just some hand-wavy
>>> explanation based upon what I've tested and found in the past, but may
>>> not be true today) due to a couple factors:
>>>
>>> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
>>> routines, you can actually keep a CPU pretty busy with this.  Also, even
>>> though the XOR routines try to time their assembly 'just so' so that
>>> they can use the cache avoiding instructions, this fails more often than
>>> not so you end up blowing CPU caches while doing this work, which of
>>> course affects the overall system.  Possible fixes for this might include:
>>>         a) Multi-threaded XOR becoming the default (last I knew it wasn't,
>>> correct me if I'm wrong)
>>>         b) Improved XOR routines that deal with cache more intelligently
>>>         c) Creating a consolidated page cache/stripe cache (if we can read more
>>> of the blocks needed to get our data from cache instead of disk it helps
>>> reduce that IO ops issue)
>>>         d) Rearchitecting your arrays into raid50 instead of big raid6 array
>>>
>>> 2) Even though we theoretically doubled IO ops, we haven't addressed
>>> whether or not that doubling is done efficiently.  Testing would be
>>> warranted here to make sure that our reads for reconstruction aren't
>>> negatively impacting overall disk IO op capability.  We might be doing
>>> something that we can fix, such as interfering with merges or with
>>> ordering or with latency sensitive commands.  A person would need to do
>>> some deep inspection of how commands are being created and sent to each
>>> device in order to see if we are keeping them busy or our own latencies
>>> at the kernel level are leaving the disks idle and killing our overall
>>> throughput (or conversely has the random head seeks just gone so
>>> radically through the roof that the problem here really is the time it
>>> takes the heads to travel everywhere we are sending them).
>>>
>>>
>>> --
>>> Doug Ledford <dledford@redhat.com>
>>>     GPG Key ID: 0E572FDD
>>>
>>
>>
>>
>
>
> --
> Doug Ledford <dledford@redhat.com>
>     GPG Key ID: 0E572FDD
>



-- 
Doug Dumitru
EasyCo LLC


* Re: kernel checksumming performance vs actual raid device performance
From: Doug Ledford @ 2016-08-23 19:10 UTC (permalink / raw)
  To: doug; +Cc: Matt Garman, Mdadm
In-Reply-To: <CAFx4rwT0jt9NCu4imruPUhfAR71=cvHwx-kdZxoTniZaQcPByQ@mail.gmail.com>



On 8/23/2016 2:27 PM, Doug Dumitru wrote:
> Mr. Ledford,
> 
> I think your explanation of RAID "dirty" read performance is a bit off.
> 
> If you have 64KB chunks, this describes the layout.  I don't think
> this also requires 64K reads.  I know that this is true with RAID-5,
> and I am pretty sure it applies to raid-6 as well.  So if you do 4K
> reads, you should see 4K reads to all the member drives.

Of course.  I didn't mean to imply otherwise.  The read size is the read
size.  But, since the OP's test case was to "read random files" and not
"read random blocks of random files" I took it to mean it would be
sequential IO across a multitude of random files.  That assumption might
have been wrong, but I wrote my explanation with that in mind.

> You can verify this pretty easily with iostat.
> 
> Mr. Garman,
> 
> Your results are a lot worse than expected.  I always assume that a
> raid "dirty" read will try to hit the disk hard.  This implies issuing
> the 22 read requests in parallel.  This is how "SSD" folks think.  It
> is possible that this code is old enough to be in an HDD "mindset" and
> that the requests are issued sequentially.  If so, then this is
> something to "fix" in the raid code (I use the term fix here loosely
> as this is not really a bug).
> 
> Can you run an iostat during your degraded test, and also a top run
> over 20+ seconds with kernel threads showing up.  Even better would be
> a perf capture, but you might not have all the tools installed.  You
> can always try:
> 
> perf record -a sleep 20
> 
> then
> 
> perf report
> 
> should show you the top functions globally over the 20 second sample.
> If you don't have perf loaded, you might (or might not) be able to
> load it from the distro.
> 
> Doug
> 
> 
> On Tue, Aug 23, 2016 at 11:00 AM, Doug Ledford <dledford@redhat.com> wrote:
>> On 8/23/2016 10:54 AM, Matt Garman wrote:
>>> On Tue, Aug 16, 2016 at 11:36 AM, Doug Dumitru <doug@easyco.com> wrote:
>>>> The RAID rebuild for a single bad drive "should" be an XOR and should run at
>>>> 200,000 KB/sec (the default speed_limit_max).  I might be wrong on this and
>>>> this might still need a full RAID-6 syndrome compute, but I don't think so.
>>>>
>>>> The rebuild might not hit 200MB/sec if the drive you replaced is
>>>> "conditioned".  Be sure to secure erase any non-new drive before you replace
>>>> it.
>>>>
>>>> Your read IOPS will compete with now busy drives which may increase the IO
>>>> latency a lot, and slow you down a lot.
>>>>
>>>> One out of 22 read OPS will be to the bad drive, so this will now take 22
>>>> reads to re-construct the IO.  The reconstruction is XOR, so pretty cheap
>>>> from a CPU point of view.  Regardless, your IOPS total will double.
>>>>
>>>> You can probably mitigate the amount of degradation by lowering the rebuild
>>>> speed, but this will make the rebuild take longer, so you are messed up
>>>> either way.  If the server has "down time" at night, you might lower the
>>>> rebuild to a really small value during the day, and up it at night.
>>>
>>> OK, right now I'm looking purely at performance in a degraded state,
>>> no rebuild taking place.
>>>
>>> We have designed a simple read load test to simulate the actual
>>> production workload.  (It's not perfect of course, but a reasonable
>>> approximation.  I can share with the list if there's interest.)  But
>>> basically it just runs multiple threads of reading random files
>>> continuously.
>>>
>>> When the array is in a pristine state, we can achieve read throughput
>>> of 8000 MB/sec (at the array level, per iostat with 5 second samples).
>>>
>>> Now I failed a single drive.  Running the same test, read performance
>>> drops all the way down to 200 MB/sec.
>>>
>>> I understand that IOPS should double, which to me says we should
>>> expect a roughly 50% read performance drop (napkin math).  But this is
>>> a drop of over 95%.
>>>
>>> Again, this is with no rebuild taking place...
>>>
>>> Thoughts?
>>
>> This depends a lot on how you structured your raid array.  I didn't see
>> your earlier emails, so I'm inferring from the "one out of 22 reads will
>> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
>> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
>> as the basis for my next statement even if it's slightly wrong.
>>
>> Doug was right in that you will have to read 21 data disks and 1 parity
>> disk to reconstruct reads from the missing block of any given stripe.
>> And while he is also correct that this doubles IO ops needed to get your
>> read data, it doesn't address the XOR load to get your data.  With 19
>> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
>> 20 64k data blocks for 1 result.  If you are getting 200MB/s, you are
>> actually achieving more like 390MB/s of data read, with 190MB/s of it
>> being direct reads, and then you are using XOR on 200MB/s in order to
>> generate the other 10MB/s of results.
>>
>> The question of why that performance is so bad is probably (and I say
>> probably because without actually testing it this is just some hand-wavy
>> explanation based upon what I've tested and found in the past, but may
>> not be true today) due to a couple factors:
>>
>> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
>> routines, you can actually keep a CPU pretty busy with this.  Also, even
>> though the XOR routines try to time their assembly 'just so' so that
>> they can use the cache avoiding instructions, this fails more often than
>> not so you end up blowing CPU caches while doing this work, which of
>> course affects the overall system.  Possible fixes for this might include:
>>         a) Multi-threaded XOR becoming the default (last I knew it wasn't,
>> correct me if I'm wrong)
>>         b) Improved XOR routines that deal with cache more intelligently
>>         c) Creating a consolidated page cache/stripe cache (if we can read more
>> of the blocks needed to get our data from cache instead of disk it helps
>> reduce that IO ops issue)
>>         d) Rearchitecting your arrays into raid50 instead of big raid6 array
>>
>> 2) Even though we theoretically doubled IO ops, we haven't addressed
>> whether or not that doubling is done efficiently.  Testing would be
>> warranted here to make sure that our reads for reconstruction aren't
>> negatively impacting overall disk IO op capability.  We might be doing
>> something that we can fix, such as interfering with merges or with
>> ordering or with latency sensitive commands.  A person would need to do
>> some deep inspection of how commands are being created and sent to each
>> device in order to see if we are keeping them busy or our own latencies
>> at the kernel level are leaving the disks idle and killing our overall
>> throughput (or conversely has the random head seeks just gone so
>> radically through the roof that the problem here really is the time it
>> takes the heads to travel everywhere we are sending them).
>>
>>
>> --
>> Doug Ledford <dledford@redhat.com>
>>     GPG Key ID: 0E572FDD
>>
> 
> 
> 


-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: 0E572FDD


* Re: kernel checksumming performance vs actual raid device performance
From: Doug Dumitru @ 2016-08-23 18:27 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Matt Garman, Mdadm
In-Reply-To: <51443e5b-3eef-c35f-8ee7-ad3e85e4e76c@redhat.com>

Mr. Ledford,

I think your explanation of RAID "dirty" read performance is a bit off.

If you have 64KB chunks, this describes the layout.  I don't think
this also requires 64K reads.  I know that this is true with RAID-5,
and I am pretty sure it applies to raid-6 as well.  So if you do 4K
reads, you should see 4K reads to all the member drives.

You can verify this pretty easily with iostat.

Mr. Garman,

Your results are a lot worse than expected.  I always assume that a
raid "dirty" read will try to hit the disk hard.  This implies issuing
the 22 read requests in parallel.  This is how "SSD" folks think.  It
is possible that this code is old enough to be in an HDD "mindset" and
that the requests are issued sequentially.  If so, then this is
something to "fix" in the raid code (I use the term fix here loosely
as this is not really a bug).
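
To make the parallel-versus-sequential point concrete, here is a minimal sketch of the latency of one reconstructed read under each issue strategy. The 8 ms service time and the helper name are assumptions for illustration only. Whether md actually issues these reads serially is the open question, and this model does not answer it.

```python
def reconstruct_latency_ms(per_disk_ms, parallel):
    """Time to gather every chunk needed to rebuild one missing chunk.

    parallel=True:  all member reads are in flight at once, so the
                    slowest one sets the latency.
    parallel=False: reads are issued one after another, so the
                    latencies add up.
    """
    return max(per_disk_ms) if parallel else sum(per_disk_ms)


# Hypothetical service time: 22 member reads at 8 ms each.
reads = [8.0] * 22
print(reconstruct_latency_ms(reads, parallel=True))   # 8.0
print(reconstruct_latency_ms(reads, parallel=False))  # 176.0
```

A 22x latency gap on the reconstructed reads would be large enough to matter, which is why the iostat and perf output requested here is worth collecting.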

Can you run an iostat during your degraded test, and also a top run
over 20+ seconds with kernel threads showing up.  Even better would be
a perf capture, but you might not have all the tools installed.  You
can always try:

perf record -a sleep 20

then

perf report

should show you the top functions globally over the 20 second sample.
If you don't have perf loaded, you might (or might not) be able to
load it from the distro.

Doug


On Tue, Aug 23, 2016 at 11:00 AM, Doug Ledford <dledford@redhat.com> wrote:
> On 8/23/2016 10:54 AM, Matt Garman wrote:
>> On Tue, Aug 16, 2016 at 11:36 AM, Doug Dumitru <doug@easyco.com> wrote:
>>> The RAID rebuild for a single bad drive "should" be an XOR and should run at
>>> 200,000 kb/sec (the default speed_limit_max).  I might be wrong on this and
>>> this might still need a full RAID-6 syndrome compute, but I don't think so.
>>>
>>> The rebuild might not hit 200MB/sec if the drive you replaced is
>>> "conditioned".  Be sure to secure erase any non-new drive before you replace
>>> it.
>>>
>>> Your read IOPS will compete with now busy drives which may increase the IO
>>> latency a lot, and slow you down a lot.
>>>
>>> One out of 22 read OPS will be to the bad drive, so this will now take 22
>>> reads to re-construct the IO.  The reconstruction is XOR, so pretty cheap
>>> from a CPU point of view.  Regardless, your IOPS total will double.
>>>
>>> You can probably mitigate the amount of degradation by lowering the rebuild
>>> speed, but this will make the rebuild take longer, so you are messed up
>>> either way.  If the server has "down time" at night, you might lower the
>>> rebuild to a really small value during the day, and up it at night.
>>
>> OK, right now I'm looking purely at performance in a degraded state,
>> no rebuild taking place.
>>
>> We have designed a simple read load test to simulate the actual
>> production workload.  (It's not perfect of course, but a reasonable
>> approximation.  I can share with the list if there's interest.)  But
>> basically it just runs multiple threads of reading random files
>> continuously.
>>
>> When the array is in a pristine state, we can achieve read throughput
>> of 8000 MB/sec (at the array level, per iostat with 5 second samples).
>>
>> Now I failed a single drive.  Running the same test, read performance
>> drops all the way down to 200 MB/sec.
>>
>> I understand that IOPS should double, which to me says we should
>> expect a roughly 50% read performance drop (napkin math).  But this is
>> a drop of over 95%.
>>
>> Again, this is with no rebuild taking place...
>>
>> Thoughts?
>
> This depends a lot on how you structured your raid array.  I didn't see
> your earlier emails, so I'm inferring from the "one out of 22 reads will
> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
> as the basis for my next statement even if it's slightly wrong.
>
> Doug was right in that you will have to read 21 data disks and 1 parity
> disk to reconstruct reads from the missing block of any given stripe.
> And while he is also correct that this doubles IO ops needed to get your
> read data, it doesn't address the XOR load to get your data.  With 19
> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
> 20 64k data blocks for 1 result.  If you are getting 200MB/s, you are
> actually achieving more like 390MB/s of data read, with 190MB/s of it
> being direct reads, and then you are using XOR on 200MB/s in order to
> generate the other 10MB/s of results.
>
> The question of why that performance is so bad is probably (and I say
> probably because without actually testing it this is just some hand-wavy
> explanation based upon what I've tested and found in the past, but may
> not be true today) due to a couple factors:
>
> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
> routines, you can actually keep a CPU pretty busy with this.  Also, even
> though the XOR routines try to time their assembly 'just so' so that
> they can use the cache avoiding instructions, this fails more often than
> not so you end up blowing CPU caches while doing this work, which of
> course affects the overall system.  Possible fixes for this might include:
>         a) Multi-threaded XOR becoming the default (last I knew it wasn't,
> correct me if I'm wrong)
>         b) Improved XOR routines that deal with cache more intelligently
>         c) Creating a consolidated page cache/stripe cache (if we can read more
> of the blocks needed to get our data from cache instead of disk it helps
> reduce that IO ops issue)
>         d) Rearchitecting your arrays into raid50 instead of big raid6 array
>
> 2) Even though we theoretically doubled IO ops, we haven't addressed
> whether or not that doubling is done efficiently.  Testing would be
> warranted here to make sure that our reads for reconstruction aren't
> negatively impacting overall disk IO op capability.  We might be doing
> something that we can fix, such as interfering with merges or with
> ordering or with latency sensitive commands.  A person would need to do
> some deep inspection of how commands are being created and sent to each
> device in order to see if we are keeping them busy or our own latencies
> at the kernel level are leaving the disks idle and killing our overall
> throughput (or conversely has the random head seeks just gone so
> radically through the roof that the problem here really is the time it
> takes the heads to travel everywhere we are sending them).
>
>
> --
> Doug Ledford <dledford@redhat.com>
>     GPG Key ID: 0E572FDD
>



-- 
Doug Dumitru
EasyCo LLC

* Re: kernel checksumming performance vs actual raid device performance
From: Doug Ledford @ 2016-08-23 18:00 UTC (permalink / raw)
  To: Matt Garman, Doug Dumitru, Mdadm
In-Reply-To: <CAJvUf-BoEJte8TF7_Su90CnjAiJ6q57m+PdGhxHA4cx5AEtxSg@mail.gmail.com>

On 8/23/2016 10:54 AM, Matt Garman wrote:
> On Tue, Aug 16, 2016 at 11:36 AM, Doug Dumitru <doug@easyco.com> wrote:
>> The RAID rebuild for a single bad drive "should" be an XOR and should run at
>> 200,000 kb/sec (the default speed_limit_max).  I might be wrong on this and
>> this might still need a full RAID-6 syndrome compute, but I don't think so.
>>
>> The rebuild might not hit 200MB/sec if the drive you replaced is
>> "conditioned".  Be sure to secure erase any non-new drive before you replace
>> it.
>>
>> Your read IOPS will compete with now busy drives which may increase the IO
>> latency a lot, and slow you down a lot.
>>
>> One out of 22 read OPS will be to the bad drive, so this will now take 22
>> reads to re-construct the IO.  The reconstruction is XOR, so pretty cheap
>> from a CPU point of view.  Regardless, your IOPS total will double.
>>
>> You can probably mitigate the amount of degradation by lowering the rebuild
>> speed, but this will make the rebuild take longer, so you are messed up
>> either way.  If the server has "down time" at night, you might lower the
>> rebuild to a really small value during the day, and up it at night.
> 
> OK, right now I'm looking purely at performance in a degraded state,
> no rebuild taking place.
> 
> We have designed a simple read load test to simulate the actual
> production workload.  (It's not perfect of course, but a reasonable
> approximation.  I can share with the list if there's interest.)  But
> basically it just runs multiple threads of reading random files
> continuously.
> 
> When the array is in a pristine state, we can achieve read throughput
> of 8000 MB/sec (at the array level, per iostat with 5 second samples).
> 
> Now I failed a single drive.  Running the same test, read performance
> drops all the way down to 200 MB/sec.
> 
> I understand that IOPS should double, which to me says we should
> expect a roughly 50% read performance drop (napkin math).  But this is
> a drop of over 95%.
> 
> Again, this is with no rebuild taking place...
> 
> Thoughts?

This depends a lot on how you structured your raid array.  I didn't see
your earlier emails, so I'm inferring from the "one out of 22 reads will
be to the bad drive" that you have a 24 disk raid6 array?  If so, then
that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
as the basis for my next statement even if it's slightly wrong.

Doug was right in that you will have to read 21 data disks and 1 parity
disk to reconstruct reads from the missing block of any given stripe.
And while he is also correct that this doubles IO ops needed to get your
read data, it doesn't address the XOR load to get your data.  With 19
data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
20 64k data blocks for 1 result.  If you are getting 200MB/s, you are
actually achieving more like 390MB/s of data read, with 190MB/s of it
being direct reads, and then you are using XOR on 200MB/s in order to
generate the other 10MB/s of results.
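
As a rough cross-check of the arithmetic, the sketch below works out the average read amplification for a degraded array. It assumes the 24-disk raid6 layout inferred above (not confirmed in the thread), evenly rotated parity, and the 21 data + 1 parity read count for a missing chunk. The function name is made up for this example.

```python
from fractions import Fraction


def degraded_read_amplification(n_disks: int, n_parity: int = 2) -> Fraction:
    """Average member-disk reads per chunk requested, with one disk failed.

    Assumes parity rotates evenly, so the failed disk held 1/n_disks of
    the data chunks, and that a missing data chunk is rebuilt from the
    surviving data chunks plus P (plain XOR, no Q syndrome).
    """
    data_per_stripe = n_disks - n_parity          # 22 for a 24-disk raid6
    reads_to_rebuild = (data_per_stripe - 1) + 1  # 21 surviving data + P
    p_missing = Fraction(1, n_disks)              # chunk was on the failed disk
    return (1 - p_missing) * 1 + p_missing * reads_to_rebuild


amp = degraded_read_amplification(24)
healthy_mb_s = 8000   # healthy-array figure reported in this thread
observed_mb_s = 200   # degraded figure reported in this thread
expected_mb_s = healthy_mb_s / float(amp)

print(f"read amplification: {float(amp):.3f}x")
print(f"IO-bound estimate:  {expected_mb_s:.0f} MB/s")
print(f"observed:           {observed_mb_s} MB/s "
      f"({100 * (1 - observed_mb_s / healthy_mb_s):.1f}% drop)")
```

Under these assumptions the amplification is 15/8 = 1.875x, so a purely IO-bound array would land somewhere near 4267 MB/s. The gap between that and the observed 200 MB/s is what the two factors below try to explain.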

The question of why that performance is so bad is probably (and I say
probably because without actually testing it this is just some hand-wavy
explanation based upon what I've tested and found in the past, but may
not be true today) due to a couple factors:

1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
routines, you can actually keep a CPU pretty busy with this.  Also, even
though the XOR routines try to time their assembly 'just so' so that
they can use the cache avoiding instructions, this fails more often than
not so you end up blowing CPU caches while doing this work, which of
course affects the overall system.  Possible fixes for this might include:
	a) Multi-threaded XOR becoming the default (last I knew it wasn't,
correct me if I'm wrong)
	b) Improved XOR routines that deal with cache more intelligently
	c) Creating a consolidated page cache/stripe cache (if we can read more
of the blocks needed to get our data from cache instead of disk it helps
reduce that IO ops issue)
	d) Rearchitecting your arrays into raid50 instead of big raid6 array

2) Even though we theoretically doubled IO ops, we haven't addressed
whether or not that doubling is done efficiently.  Testing would be
warranted here to make sure that our reads for reconstruction aren't
negatively impacting overall disk IO op capability.  We might be doing
something that we can fix, such as interfering with merges or with
ordering or with latency sensitive commands.  A person would need to do
some deep inspection of how commands are being created and sent to each
device in order to see if we are keeping them busy or our own latencies
at the kernel level are leaving the disks idle and killing our overall
throughput (or conversely has the random head seeks just gone so
radically through the roof that the problem here really is the time it
takes the heads to travel everywhere we are sending them).

-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: 0E572FDD

* Re: Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Ben @ 2016-08-23 15:44 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <57BC354D.4080000@youngman.org.uk>

On 8/23/2016 6:36 AM, Wols Lists wrote:
> On 23/08/16 00:06, Adam Goryachev wrote:
>
> And while it's probably too late now, read up on mdadm --replace. If
> you've got the spare slots, it's much better/safer than physically
> pulling a dodgy disk and replacing it.
>
> NB - get the data Adam asked for - and the output of "mdadm --examine
> ..." and "mdadm --detail ..." might well be useful (or might have been
> included elsewhere).

hi there!

Thanks -- Adam mentioned and yea, it's too late but I have it for next time.

the vast bulk of the data on the array is duplicated to another NAS -- so it's not the end of the world.

Adam helped me get the array back online so I can do some things to it (like some 'nice to have' files).. it's staying reasonably in sync when it craps out...

so hopefully soon I'll have it resolved.

but will probably switch to a RAID6 soon down the road.

Thanks for the help,

  -Ben

* Re: kernel checksumming performance vs actual raid device performance
From: Chris Murphy @ 2016-08-23 15:02 UTC (permalink / raw)
  To: Mdadm
In-Reply-To: <CAJvUf-C8ov4tHRGj5SPamJE8PF0Qsr4Cwuv02-9F0SV1yp4ccQ@mail.gmail.com>

On Tue, Aug 23, 2016 at 8:34 AM, Matt Garman <matthew.garman@gmail.com> wrote:
> On Tue, Aug 16, 2016 at 5:43 PM, Doug Dumitru <doug@easyco.com> wrote:
>> One last thing I would highly recommend is:
>>
>> Secure erase the replacement disk before rebuilding onto it.
>>
>> If the replacement disk is "pre conditioned" with random writes, even if
>> very slowly, this will lower the write performance of the disk during the
>> rebuild.
>
> Does that also apply to brand-new disks from the manufacturer?
>
> I.e., should we just always do a secure erase, or sometimes depending
> on how the drive was sourced?

The main issue is to get rid of previous fs signatures so there are no
longer stale file systems. If the drive is/was ever partitioned, those
signatures could be anywhere on the drive, so the ATA secure erase is
a way to clobber all of them. An alternative is fully encrypting the
drive. If you merely change the encryption key, cipher text on the
drive becomes different "cipher text" as it's decrypted, so again
everything that was on the drive is effectively obliterated, but is
much faster. If security isn't a big concern, you can automate opening
the LUKS device with a keyfile to avoid manually typing in
passphrases. The other nice thing about it is if you ever have to
return such a drive under warranty, or otherwise decommission it, you
don't have to worry about drive contents.
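
The key-change argument can be illustrated with a toy cipher. This is only a sketch: the SHA-256 keystream below stands in for a real disk cipher (dm-crypt/LUKS uses proper block ciphers, not this construction), and the key and data values are invented for the example.

```python
import hashlib


def keystream_xor(data: bytes, key: bytes) -> bytes:
    """Toy stream cipher: XOR data with SHA-256(key || offset) blocks.

    Illustration only. Encrypting and decrypting are the same operation.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


old_key = b"old-volume-key"   # hypothetical keys
new_key = b"new-volume-key"
plaintext = b"stand-in for an old filesystem signature"

on_disk = keystream_xor(plaintext, old_key)   # what the platters hold
readable = keystream_xor(on_disk, old_key)    # old key: contents come back
garbage = keystream_xor(on_disk, new_key)     # new key: unrelated bytes

print(readable == plaintext)   # True
print(garbage == plaintext)    # False
```

Nothing on the disk has to be rewritten for the old signatures to stop being recognisable, which is why this is so much faster than a full erase.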

-- 
Chris Murphy

* Re: kernel checksumming performance vs actual raid device performance
From: Matt Garman @ 2016-08-23 14:54 UTC (permalink / raw)
  To: Doug Dumitru, Mdadm
In-Reply-To: <CAFx4rwSQQuqeCFm+60+Gm75D49tg+mVjU=BnQSZThdE7E6KqPQ@mail.gmail.com>

On Tue, Aug 16, 2016 at 11:36 AM, Doug Dumitru <doug@easyco.com> wrote:
> The RAID rebuild for a single bad drive "should" be an XOR and should run at
> 200,000 kb/sec (the default speed_limit_max).  I might be wrong on this and
> this might still need a full RAID-6 syndrome compute, but I don't think so.
>
> The rebuild might not hit 200MB/sec if the drive you replaced is
> "conditioned".  Be sure to secure erase any non-new drive before you replace
> it.
>
> Your read IOPS will compete with now busy drives which may increase the IO
> latency a lot, and slow you down a lot.
>
> One out of 22 read OPS will be to the bad drive, so this will now take 22
> reads to re-construct the IO.  The reconstruction is XOR, so pretty cheap
> from a CPU point of view.  Regardless, your IOPS total will double.
>
> You can probably mitigate the amount of degradation by lowering the rebuild
> speed, but this will make the rebuild take longer, so you are messed up
> either way.  If the server has "down time" at night, you might lower the
> rebuild to a really small value during the day, and up it at night.

OK, right now I'm looking purely at performance in a degraded state,
no rebuild taking place.

We have designed a simple read load test to simulate the actual
production workload.  (It's not perfect of course, but a reasonable
approximation.  I can share with the list if there's interest.)  But
basically it just runs multiple threads of reading random files
continuously.

When the array is in a pristine state, we can achieve read throughput
of 8000 MB/sec (at the array level, per iostat with 5 second samples).

Now I failed a single drive.  Running the same test, read performance
drops all the way down to 200 MB/sec.

I understand that IOPS should double, which to me says we should
expect a roughly 50% read performance drop (napkin math).  But this is
a drop of over 95%.

Again, this is with no rebuild taking place...

Thoughts?

Thanks again,
Matt

* Re: kernel checksumming performance vs actual raid device performance
From: Matt Garman @ 2016-08-23 14:34 UTC (permalink / raw)
  To: Doug Dumitru; +Cc: Mdadm
In-Reply-To: <CAFx4rwTawqrBOWVwtPnGhRRAM1XiGQkS-o3YykmD0AftR45YkA@mail.gmail.com>

On Tue, Aug 16, 2016 at 5:43 PM, Doug Dumitru <doug@easyco.com> wrote:
> One last thing I would highly recommend is:
>
> Secure erase the replacement disk before rebuilding onto it.
>
> If the replacement disk is "pre conditioned" with random writes, even if
> very slowly, this will lower the write performance of the disk during the
> rebuild.

Does that also apply to brand-new disks from the manufacturer?

I.e., should we just always do a secure erase, or sometimes depending
on how the drive was sourced?

* Re: Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Wols Lists @ 2016-08-23 11:36 UTC (permalink / raw)
  To: Adam Goryachev, Ben Kamen, linux-raid
In-Reply-To: <215fd175-65b6-e24b-338f-0c44ae030573@websitemanagers.com.au>

On 23/08/16 00:06, Adam Goryachev wrote:
> I hope the above is helpful, but really we will need more information
> about your drives before being able to make further suggestions. output
> of lsdrv (google it), smartctl, mdadm --misc --detail /dev/md127 would
> all be helpful.

And while it's probably too late now, read up on mdadm --replace. If
you've got the spare slots, it's much better/safer than physically
pulling a dodgy disk and replacing it.

NB - get the data Adam asked for - and the output of "mdadm --examine
..." and "mdadm --detail ..." might well be useful (or might have been
included elsewhere).

Cheers,
Wol

* Re: [PATCH 1/2] raid5: fix memory leak of bio integrity data
From: Yi Zhang @ 2016-08-23  9:10 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, xni, jes sorensen, jmoyer, Kernel-team
In-Reply-To: <4a00786c49856f37454376ddfc02c040c85ab14d.1471925425.git.shli@fb.com>

I can no longer reproduce the OOM/panic bug after 100 test runs.

Thanks
Yi

----- Original Message -----
From: "Shaohua Li" <shli@fb.com>
To: linux-raid@vger.kernel.org
Cc: yizhan@redhat.com, xni@redhat.com, "jes sorensen" <jes.sorensen@redhat.com>, jmoyer@redhat.com, Kernel-team@fb.com
Sent: Tuesday, August 23, 2016 12:14:01 PM
Subject: [PATCH 1/2] raid5: fix memory leak of bio integrity data

Yi reported a memory leak of raid5 with DIF/DIX enabled disks. raid5
doesn't alloc/free bio, instead it reuses bios. There are two issues in
current code:
1. the code calls bio_init (from
init_stripe->raid5_build_block->bio_init) then bio_reset (ops_run_io).
The bio is reused, so likely there is integrity data attached. bio_init
will clear the pointer to the integrity data, so bio_reset can't release
the data
2. bio_reset is called before dispatching bio. After bio is finished,
it's possible we don't free bio's integrity data (eg, we don't call
bio_reset again)
Both issues will cause memory leak. The patch moves bio_init to stripe
creation and bio_reset to bio end io. This will fix the two issues.

Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4f8f524..52cf205 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1005,7 +1005,6 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 			set_bit(STRIPE_IO_STARTED, &sh->state);
 
-			bio_reset(bi);
 			bi->bi_bdev = rdev->bdev;
 			bio_set_op_attrs(bi, op, op_flags);
 			bi->bi_end_io = op_is_write(op)
@@ -1057,7 +1056,6 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 			set_bit(STRIPE_IO_STARTED, &sh->state);
 
-			bio_reset(rbi);
 			rbi->bi_bdev = rrdev->bdev;
 			bio_set_op_attrs(rbi, op, op_flags);
 			BUG_ON(!op_is_write(op));
@@ -1990,9 +1988,11 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 	put_cpu();
 }
 
-static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp)
+static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
+	int disks)
 {
 	struct stripe_head *sh;
+	int i;
 
 	sh = kmem_cache_zalloc(sc, gfp);
 	if (sh) {
@@ -2001,6 +2001,12 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp)
 		INIT_LIST_HEAD(&sh->batch_list);
 		INIT_LIST_HEAD(&sh->lru);
 		atomic_set(&sh->count, 1);
+		for (i = 0; i < disks; i++) {
+			struct r5dev *dev = &sh->dev[i];
+
+			bio_init(&dev->req);
+			bio_init(&dev->rreq);
+		}
 	}
 	return sh;
 }
@@ -2008,7 +2014,7 @@ static int grow_one_stripe(struct r5conf *conf, gfp_t gfp)
 {
 	struct stripe_head *sh;
 
-	sh = alloc_stripe(conf->slab_cache, gfp);
+	sh = alloc_stripe(conf->slab_cache, gfp, conf->pool_size);
 	if (!sh)
 		return 0;
 
@@ -2179,7 +2185,7 @@ static int resize_stripes(struct r5conf *conf, int newsize)
 	mutex_lock(&conf->cache_size_mutex);
 
 	for (i = conf->max_nr_stripes; i; i--) {
-		nsh = alloc_stripe(sc, GFP_KERNEL);
+		nsh = alloc_stripe(sc, GFP_KERNEL, newsize);
 		if (!nsh)
 			break;
 
@@ -2311,6 +2317,7 @@ static void raid5_end_read_request(struct bio * bi)
 		(unsigned long long)sh->sector, i, atomic_read(&sh->count),
 		bi->bi_error);
 	if (i == disks) {
+		bio_reset(bi);
 		BUG();
 		return;
 	}
@@ -2414,6 +2421,7 @@ static void raid5_end_read_request(struct bio * bi)
 	clear_bit(R5_LOCKED, &sh->dev[i].flags);
 	set_bit(STRIPE_HANDLE, &sh->state);
 	raid5_release_stripe(sh);
+	bio_reset(bi);
 }
 
 static void raid5_end_write_request(struct bio *bi)
@@ -2448,6 +2456,7 @@ static void raid5_end_write_request(struct bio *bi)
 		(unsigned long long)sh->sector, i, atomic_read(&sh->count),
 		bi->bi_error);
 	if (i == disks) {
+		bio_reset(bi);
 		BUG();
 		return;
 	}
@@ -2491,18 +2500,17 @@ static void raid5_end_write_request(struct bio *bi)
 
 	if (sh->batch_head && sh != sh->batch_head)
 		raid5_release_stripe(sh->batch_head);
+	bio_reset(bi);
 }
 
 static void raid5_build_block(struct stripe_head *sh, int i, int previous)
 {
 	struct r5dev *dev = &sh->dev[i];
 
-	bio_init(&dev->req);
 	dev->req.bi_io_vec = &dev->vec;
 	dev->req.bi_max_vecs = 1;
 	dev->req.bi_private = sh;
 
-	bio_init(&dev->rreq);
 	dev->rreq.bi_io_vec = &dev->rvec;
 	dev->rreq.bi_max_vecs = 1;
 	dev->rreq.bi_private = sh;
-- 
2.8.0.rc2
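
The two orderings described in the commit message can be modelled outside the kernel. The following is a loose sketch, not kernel code: the Bio class, its counter, and the cycle helpers are invented stand-ins, and only the init/reset semantics stated in the commit message are assumed.

```python
class Bio:
    """Tiny stand-in for a reused struct bio that may carry integrity data."""

    live_payloads = 0  # integrity payloads allocated and not yet freed

    def __init__(self):
        self.integrity = None

    def attach_integrity(self):
        # Models integrity data being attached when the bio is submitted.
        self.integrity = object()
        Bio.live_payloads += 1

    def init(self):
        # Like bio_init: wipes the fields, including the pointer; frees nothing.
        self.integrity = None

    def reset(self):
        # Like bio_reset: releases any attached integrity data, then clears.
        if self.integrity is not None:
            Bio.live_payloads -= 1
        self.integrity = None


def old_cycle(bio):
    bio.init()              # pointer left over from the previous use is lost
    bio.reset()             # nothing left to free
    bio.attach_integrity()  # IO runs; payload stays attached afterwards


def new_cycle(bio):
    bio.attach_integrity()  # IO runs
    bio.reset()             # end_io handler releases the payload


def unfreed_after(cycle, rounds):
    Bio.live_payloads = 0
    bio = Bio()
    bio.init()              # one-time init at stripe creation
    for _ in range(rounds):
        cycle(bio)
    return Bio.live_payloads


print(unfreed_after(old_cycle, 100))  # 100 (99 unreachable, 1 still attached)
print(unfreed_after(new_cycle, 100))  # 0
```

In this model the old ordering loses one payload per reuse, which matches the steady growth that ends in an OOM.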


* [PATCH v2] raid10: record correct address of bad block
From: Tomasz Majchrzak @ 2016-08-23  8:53 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, aleksey.obitotskiy, pawel.baldysiak, artur.paszkiewicz,
	maksymilian.kunt

For failed write request record block address on a device, not block
address in an array.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 drivers/md/raid10.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index cfa96b5..cd8d197 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2465,18 +2465,19 @@ static int narrow_write_error(struct r10bio *r10_bio, int i)
 
 	while (sect_to_write) {
 		struct bio *wbio;
+		sector_t wsector;
 		if (sectors > sect_to_write)
 			sectors = sect_to_write;
 		/* Write at 'sector' for 'sectors' */
 		wbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
 		bio_trim(wbio, sector - bio->bi_iter.bi_sector, sectors);
-		wbio->bi_iter.bi_sector = (r10_bio->devs[i].addr+
-				   choose_data_offset(r10_bio, rdev) +
-				   (sector - r10_bio->sector));
+		wsector = r10_bio->devs[i].addr + (sector - r10_bio->sector);
+		wbio->bi_iter.bi_sector = wsector +
+				   choose_data_offset(r10_bio, rdev);
 		wbio->bi_bdev = rdev->bdev;
 		if (submit_bio_wait(WRITE, wbio) < 0)
 			/* Failure! */
-			ok = rdev_set_badblocks(rdev, sector,
+			ok = rdev_set_badblocks(rdev, wsector,
 						sectors, 0)
 				&& ok;
 
-- 
1.8.3.1
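
The address arithmetic in this patch is easy to check by hand. The sketch below reproduces it with made-up sector numbers; the function name and all values are assumptions chosen only to make the difference visible.

```python
def narrow_write_targets(dev_addr, r10_sector, sector, data_offset):
    """Return (sector the write is submitted at, sector logged as bad).

    Mirrors the arithmetic in the patch:
        wsector = devs[i].addr + (sector - r10_bio->sector)
    The write goes to wsector + data_offset; the bad block is recorded
    at wsector, a device address, instead of the array address 'sector'.
    """
    wsector = dev_addr + (sector - r10_sector)
    return wsector + data_offset, wsector


# Hypothetical numbers for one failed write.
array_sector = 1_000_008
bio_sector, bad_sector = narrow_write_targets(
    dev_addr=4096, r10_sector=1_000_000,
    sector=array_sector, data_offset=2048)

print(bio_sector, bad_sector)      # 6152 4104
print(bad_sector == array_sector)  # False: the old code logged 1000008
```

With these numbers the pre-patch code would have recorded sector 1000008 against the member device, far from where the failed write actually landed.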


* Re: [PATCH 2/2] r5cache: remove journal support
From: Song Liu @ 2016-08-23  6:12 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid@vger.kernel.org, Shaohua Li
In-Reply-To: <20160823053214.GB2314@kernel.org>


>>> On 8/22/16, 10:32 PM, "Shaohua Li" <shli@kernel.org> wrote:
>>  
>    > +enum r5c_cache_mode {
>    > +	R5C_MODE_NO_CACHE = 0,
>    > +	R5C_MODE_WRITE_THROUGH = 1,
>    > +	R5C_MODE_WRITE_BACK = 2,
>    > +	R5C_MODE_BROKEN_CACHE = 3,
>    The idea of setting different modes makes sense.
>    But this is a little confusing. The first three are modes, the last one is status.
>    We can't set BROKEN_CACHE mode, right?

How about we name it r5c_cache_state? Splitting it into two entries
seems like overkill to me.

>> +	spin_lock_irq(&conf->device_lock);
>    > +	if (log)
>    > +		conf->log->cache_mode = val;
>    > +	if (val == R5C_MODE_NO_CACHE) {
>    > +		clear_bit(MD_HAS_JOURNAL, &mddev->flags);
>    > +		set_bit(MD_UPDATE_SB_FLAGS, &mddev->flags);
    
>    If the journal disk is Faulty and we clear HAS_JOURNAL, what's role of journal
>    disk?

The on disk role is faulty (0xffff). The in memory data structure will show both 
Faulty and Journal. It works in my test (create -> fail journal -> remove journal -> 
write -> stop -> assemble).  
    
>    This sounds incomplete. If we assemble the array and journal disk is missing,
>    can we set the mode to NO_CACHE and allow the array writeable?

Do you mean setting to NO_CACHE automatically? I think it is not safe, as we 
may assemble and run the array accidentally without the journal. If the array 
is set to NO_CACHE, we will lose pending data on the journal.
    
>> +extern ssize_t r5c_show_cache_mode(struct mddev *mddev, char *page);
>    > +extern ssize_t
>    > +r5c_store_cache_mode(struct mddev *mddev, const char *page, size_t len);
>    
>    Instead of export two functions, you can move r5c_cache_mode sysfs entry to raid5-cache.c
>    and export it.

Will fix it in the next version. 

Thanks, 
Song
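
For readers following the mode-versus-state question, here is a sketch of which writes the r5c_store_cache_mode hunk in the patch under review would accept. It only restates the checks in that hunk; the Python names are invented, and integer parsing and locking are left out.

```python
from enum import IntEnum


class CacheMode(IntEnum):
    NO_CACHE = 0
    WRITE_THROUGH = 1
    WRITE_BACK = 2
    BROKEN_CACHE = 3


def store_allowed(val: int, has_log: bool, journal_faulty: bool) -> bool:
    """True where the patch would accept the write, False where it
    would return -EINVAL."""
    # Without a log, the only valid request is NO_CACHE.
    if not has_log and val != CacheMode.NO_CACHE:
        return False
    # Only NO_CACHE and WRITE_THROUGH can be written for now.
    if val < CacheMode.NO_CACHE or val > CacheMode.WRITE_THROUGH:
        return False
    # A journal that is still healthy cannot be removed.
    if val == CacheMode.NO_CACHE and has_log and not journal_faulty:
        return False
    return True


print(store_allowed(CacheMode.NO_CACHE, has_log=True, journal_faulty=True))      # True
print(store_allowed(CacheMode.NO_CACHE, has_log=True, journal_faulty=False))     # False
print(store_allowed(CacheMode.BROKEN_CACHE, has_log=True, journal_faulty=True))  # False
```

The last case reflects the review comment: BROKEN_CACHE is only ever reported, never written, so it behaves like a state and not a settable mode.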


* Re: [PATCH 2/2] r5cache: remove journal support
From: Shaohua Li @ 2016-08-23  5:32 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, shli
In-Reply-To: <1471646042-2366825-2-git-send-email-songliubraving@fb.com>

On Fri, Aug 19, 2016 at 03:34:02PM -0700, Song Liu wrote:
> In current r5cache, when the journal device is broken, the raid
> array is forced in readonly mode. There is no way to remove the
> "journal feature", and thus make the array read-write without
> journal.
> 
> This patch provides sysfs entry r5c_cache_mode that can be used
> to remove journal feature.
> 
> r5c_cache_mode has 4 different values:
> * no-cache;
> * write-through (write journal only);
> * write-back (w/ write cache feature, which will be added soon);
> * broken-cache (journal missing or Faulty)
> 
> By writing into r5c_cache_mode, the array can transit from
> broken-cache to no-cache, which removes journal feature for the
> array.
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.c       |  5 ++++
>  drivers/md/raid5.h       |  6 +++++
>  3 files changed, 75 insertions(+)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 5504ce2..508d470 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -40,6 +40,16 @@
>   */
>  #define R5L_POOL_SIZE	4
>  
> +enum r5c_cache_mode {
> +	R5C_MODE_NO_CACHE = 0,
> +	R5C_MODE_WRITE_THROUGH = 1,
> +	R5C_MODE_WRITE_BACK = 2,
> +	R5C_MODE_BROKEN_CACHE = 3,
The idea of setting different modes makes sense.
But this is a little confusing. The first three are modes, the last one is status.
We can't set BROKEN_CACHE mode, right?

> +};
> +
> +static char *r5c_cache_mode_str[] = {"no-cache", "write-through",
> +				     "write-back", "broken-cache"};
> +
>  struct r5l_log {
>  	struct md_rdev *rdev;
>  
> @@ -97,6 +107,8 @@ struct r5l_log {
>  
>  	bool need_cache_flush;
>  	bool in_teardown;
> +
> +	enum r5c_cache_mode cache_mode;
>  };
>  
>  /*
> @@ -1193,6 +1205,56 @@ ioerr:
>  	return ret;
>  }
>  
> +ssize_t r5c_show_cache_mode(struct mddev *mddev, char *page)
> +{
> +	struct r5conf *conf = mddev->private;
> +	int val = 0;
> +	int ret = 0;
> +
> +	if (conf->log)
> +		val = conf->log->cache_mode;
> +	else if (test_bit(MD_HAS_JOURNAL, &mddev->flags))
> +		val = R5C_MODE_BROKEN_CACHE;
> +	ret += snprintf(page, PAGE_SIZE - ret, "%d: %s\n",
> +			val, r5c_cache_mode_str[val]);
> +	return ret;
> +}
> +
> +ssize_t r5c_store_cache_mode(struct mddev *mddev, const char *page, size_t len)
> +{
> +	struct r5conf *conf = mddev->private;
> +	struct r5l_log *log = conf->log;
> +	int val;
> +
> +	if (kstrtoint(page, 10, &val))
> +		return -EINVAL;
> +	if (!log && val != R5C_MODE_NO_CACHE)
> +		return -EINVAL;
> +	/* currently only support write through (write journal) */
> +	if (val < R5C_MODE_NO_CACHE || val > R5C_MODE_WRITE_THROUGH)
> +		return -EINVAL;
> +	if (val == R5C_MODE_NO_CACHE) {
> +		if (conf->log &&
> +		    !test_bit(Faulty, &log->rdev->flags)) {
> +			pr_err("md/raid:%s: journal device is in use, cannot remove it\n",
> +			       mdname(mddev));
> +			return -EINVAL;
> +		}
> +	}
> +
> +	spin_lock_irq(&conf->device_lock);
> +	if (log)
> +		conf->log->cache_mode = val;
> +	if (val == R5C_MODE_NO_CACHE) {
> +		clear_bit(MD_HAS_JOURNAL, &mddev->flags);
> +		set_bit(MD_UPDATE_SB_FLAGS, &mddev->flags);

If the journal disk is Faulty and we clear HAS_JOURNAL, what's the role of
the journal disk?

This sounds incomplete. If we assemble the array and the journal disk is
missing, can we set the mode to NO_CACHE and allow the array to be writable?

> +	}
> +	spin_unlock_irq(&conf->device_lock);
> +	pr_info("%s: setting r5c cache mode to %d: %s\n",
> +		       mdname(mddev), val, r5c_cache_mode_str[val]);
> +	return len;
> +}
> +
>  int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
>  {
>  	struct request_queue *q = bdev_get_queue(rdev->bdev);
> @@ -1246,6 +1308,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
>  	INIT_LIST_HEAD(&log->no_space_stripes);
>  	spin_lock_init(&log->no_space_stripes_lock);
>  
> +	log->cache_mode = R5C_MODE_WRITE_THROUGH;
> +
>  	if (r5l_load_log(log))
>  		goto error;
>  
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 2119e09..665d853 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6230,6 +6230,10 @@ raid5_group_thread_cnt = __ATTR(group_thread_cnt, S_IRUGO | S_IWUSR,
>  				raid5_show_group_thread_cnt,
>  				raid5_store_group_thread_cnt);
>  
> +static struct md_sysfs_entry
> +r5c_cache_mode = __ATTR(r5c_cache_mode, S_IRUGO | S_IWUSR,
> +			r5c_show_cache_mode, r5c_store_cache_mode);
> +
>  static struct attribute *raid5_attrs[] =  {
>  	&raid5_stripecache_size.attr,
>  	&raid5_stripecache_active.attr,
> @@ -6237,6 +6241,7 @@ static struct attribute *raid5_attrs[] =  {
>  	&raid5_group_thread_cnt.attr,
>  	&raid5_skip_copy.attr,
>  	&raid5_rmw_level.attr,
> +	&r5c_cache_mode.attr,
>  	NULL,
>  };
>  static struct attribute_group raid5_attrs_group = {
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 517d4b6..ace9675 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -635,4 +635,10 @@ extern void r5l_stripe_write_finished(struct stripe_head *sh);
>  extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
>  extern void r5l_quiesce(struct r5l_log *log, int state);
>  extern bool r5l_log_disk_error(struct r5conf *conf);
> +
> +
> +extern ssize_t r5c_show_cache_mode(struct mddev *mddev, char *page);
> +extern ssize_t
> +r5c_store_cache_mode(struct mddev *mddev, const char *page, size_t len);

Instead of exporting two functions, you can move the r5c_cache_mode sysfs
entry to raid5-cache.c and export it.
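
To illustrate the mode/status split raised above, here is a minimal
userspace sketch, not kernel code. All names in it (cache_mode,
journal_status, parse_cache_mode) are hypothetical and do not come from
the patch. It assumes the settable modes live in one enum and the
observed journal state in another, so "broken-cache" can never be
requested through the store path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: values an administrator may request. */
enum cache_mode {
	MODE_NO_CACHE = 0,
	MODE_WRITE_THROUGH = 1,
	MODE_WRITE_BACK = 2,
};

/* Hypothetical: what the journal device is observed to be doing. */
enum journal_status {
	JOURNAL_OK = 0,
	JOURNAL_BROKEN = 1,	/* journal missing or Faulty */
};

const char *const cache_mode_str[] = {
	"no-cache", "write-through", "write-back",
};

/* Map a mode name to its enum value, or -1 if the name is not settable. */
int parse_cache_mode(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(cache_mode_str) / sizeof(cache_mode_str[0]); i++)
		if (strcmp(name, cache_mode_str[i]) == 0)
			return (int)i;
	return -1;
}
```

With this layout, parse_cache_mode("broken-cache") is rejected simply
because the status is not part of the mode table.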

Thanks,
Shaohua


* Re: [PATCH 1/2] r5cache: set MD_JOURNAL_CLEAN correctly
From: Shaohua Li @ 2016-08-23  5:10 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, shli
In-Reply-To: <1471646042-2366825-1-git-send-email-songliubraving@fb.com>

On Fri, Aug 19, 2016 at 03:34:01PM -0700, Song Liu wrote:
> Currently, the code sets MD_JOURNAL_CLEAN when the array has
> MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array
> will be MD_JOURNAL_CLEAN even if the journal device is missing.
> 
> With this patch, the MD_JOURNAL_CLEAN is only set when the journal
> device is present.

Applied this one, thanks!


* bootsect replicated in p1, RAID enclosure suggestions?
From: travis+ml-linux-raid @ 2016-08-23  5:09 UTC (permalink / raw)
  To: linux-raid

Hello all,

So I have an Intel NUC (for low power Linux) plugged via USB into a 4
bay enclosure doing linear (yeah I know; it's the backup server, the
primary is raid10).

And every once in a while, this happens (*see end).  The partition 1
that would normally contain a MD slice ends up being a replica of the
boot cylinder.  I can't tell if it's the mdraid linear impl, the
kernel doing something weird, the USB drivers, the enclosure firmware,
or what.

Anyway, this happens every 6 months or so.  Today it happened while I
was in the middle of restoring a Windows machine whose 1TB SSD root
drive suddenly took a nosedive, threw up on C:, and totally nuked the
data.

The last low-power option I tried was an OpenRD Ultimate based around
ARMv5TE which was basically unsupported by debian by the time I got
it, and subsequently became ultra-flaky due to what seemed to be RAM
problems - it was crashing every 3 days with kernel panics, and every
once in a while would do something worse.

Any recommendations for low-power hardware with a well-supported
distro that matches up well with a real backplane and SATA
connections instead of USB?  The only caveats are that I want to
encrypt raw disks and that it must not be very noisy - so no rackmount
gear with 65dB 1" dog whistle fans.  Obviously, whatever backplane it
uses must be well supported by the distro.

Also, does anyone have experience with cryptsetup on multiple
partitions?  I can do that but get prompted multiple times and I was
wondering if anyone knew an easy way to fix the boot time scripts to
avoid that, only prompting once per unique underlying crypttab.

And finally, I have a story about buggy drive firmware that you
might enjoy, especially if you were doing this sort of stuff in
the 90s as well.  Cheers:

http://www.subspacefield.org/security/hard_drives_of_doom/

[*]

# parted /dev/sde
GNU Parted 2.3
Using /dev/sde
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p                                                                
Model: WDC WD40 EFRX-68WT0N0 (scsi)
Disk /dev/sde: 4001GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name        Flags
 1      1049kB  4001GB  4001GB               Linux RAID  raid

(parted) q                                                                
# parted /dev/sdd1
GNU Parted 2.3
Using /dev/sdd1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p                                                                
Model: Unknown (unknown)
Disk /dev/sdd1: 4001GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name        Flags
 1      1049kB  4001GB  4001GB               Linux RAID  raid

-- 
http://www.subspacefield.org/~travis/ | if spammer then john@subspacefield.org
"Computer crime, the glamor crime of the 1970s, will become in the
1980s one of the greatest sources of preventable business loss."
John M. Carroll, "Computer Security", first edition cover flap, 1977


* [PATCH 2/2] raid5: avoid unnecessary bio data set
From: Shaohua Li @ 2016-08-23  4:14 UTC (permalink / raw)
  To: linux-raid; +Cc: yizhan, xni, jes.sorensen, jmoyer, Kernel-team
In-Reply-To: <4a00786c49856f37454376ddfc02c040c85ab14d.1471925425.git.shli@fb.com>

bio_reset doesn't change bi_io_vec and bi_max_vecs, so we don't need to
set them every time. bi_private will be set before the bio is
dispatched.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 52cf205..9b1a41f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2005,7 +2005,12 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
 			struct r5dev *dev = &sh->dev[i];
 
 			bio_init(&dev->req);
+			dev->req.bi_io_vec = &dev->vec;
+			dev->req.bi_max_vecs = 1;
+
 			bio_init(&dev->rreq);
+			dev->rreq.bi_io_vec = &dev->rvec;
+			dev->rreq.bi_max_vecs = 1;
 		}
 	}
 	return sh;
@@ -2507,14 +2512,6 @@ static void raid5_build_block(struct stripe_head *sh, int i, int previous)
 {
 	struct r5dev *dev = &sh->dev[i];
 
-	dev->req.bi_io_vec = &dev->vec;
-	dev->req.bi_max_vecs = 1;
-	dev->req.bi_private = sh;
-
-	dev->rreq.bi_io_vec = &dev->rvec;
-	dev->rreq.bi_max_vecs = 1;
-	dev->rreq.bi_private = sh;
-
 	dev->flags = 0;
 	dev->sector = raid5_compute_blocknr(sh, i, previous);
 }
-- 
2.8.0.rc2



* [PATCH 1/2] raid5: fix memory leak of bio integrity data
From: Shaohua Li @ 2016-08-23  4:14 UTC (permalink / raw)
  To: linux-raid; +Cc: yizhan, xni, jes.sorensen, jmoyer, Kernel-team

Yi reported a memory leak in raid5 with DIF/DIX enabled disks. raid5
doesn't alloc/free bios; instead it reuses them. There are two issues in
the current code:
1. The code calls bio_init (from
init_stripe->raid5_build_block->bio_init) and then bio_reset (ops_run_io).
The bio is reused, so it likely has integrity data attached. bio_init
clears the pointer to the integrity data, so bio_reset can't release
the data.
2. bio_reset is called before dispatching the bio. After the bio is
finished, it's possible we never free its integrity data (e.g., we don't
call bio_reset again).
Both issues cause a memory leak. The patch moves bio_init to stripe
creation and bio_reset to bio end io. This fixes both issues.
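
The ordering problem can be modelled in userspace. The sketch below is
only an illustration under stated assumptions: toy_bio, its helpers, and
the live_payloads counter are hypothetical stand-ins, not the block
layer API, and the integrity payload is reduced to a flag plus a counter
so nothing is really allocated.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a reused bio that may own an integrity payload. */
struct toy_bio {
	bool has_integrity;
};

int live_payloads;	/* payloads attached and not yet released */

/* Like bio_init(): forgets the payload without releasing it. */
void toy_bio_init(struct toy_bio *b)
{
	b->has_integrity = false;
}

/* Like bio_reset(): releases the payload if one is still attached. */
void toy_bio_reset(struct toy_bio *b)
{
	if (b->has_integrity) {
		live_payloads--;
		b->has_integrity = false;
	}
}

/* Dispatch: the lower layer attaches an integrity payload. */
void toy_bio_submit(struct toy_bio *b)
{
	assert(!b->has_integrity);
	b->has_integrity = true;
	live_payloads++;
}

/* Old ordering: init on every stripe reuse, reset before dispatch. */
int run_old_order(int rounds)
{
	struct toy_bio b;
	int i;

	live_payloads = 0;
	for (i = 0; i < rounds; i++) {
		toy_bio_init(&b);	/* drops the previous payload pointer */
		toy_bio_reset(&b);	/* nothing left to release */
		toy_bio_submit(&b);
	}
	return live_payloads;
}

/* New ordering: init once at creation, reset in the end-io path. */
int run_new_order(int rounds)
{
	struct toy_bio b;
	int i;

	live_payloads = 0;
	toy_bio_init(&b);
	for (i = 0; i < rounds; i++) {
		toy_bio_submit(&b);
		toy_bio_reset(&b);	/* end io releases the payload */
	}
	return live_payloads;
}
```

In this model the old ordering leaves one unreleased payload per round,
while the new ordering leaves none.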

Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4f8f524..52cf205 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1005,7 +1005,6 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 			set_bit(STRIPE_IO_STARTED, &sh->state);
 
-			bio_reset(bi);
 			bi->bi_bdev = rdev->bdev;
 			bio_set_op_attrs(bi, op, op_flags);
 			bi->bi_end_io = op_is_write(op)
@@ -1057,7 +1056,6 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 			set_bit(STRIPE_IO_STARTED, &sh->state);
 
-			bio_reset(rbi);
 			rbi->bi_bdev = rrdev->bdev;
 			bio_set_op_attrs(rbi, op, op_flags);
 			BUG_ON(!op_is_write(op));
@@ -1990,9 +1988,11 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 	put_cpu();
 }
 
-static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp)
+static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
+	int disks)
 {
 	struct stripe_head *sh;
+	int i;
 
 	sh = kmem_cache_zalloc(sc, gfp);
 	if (sh) {
@@ -2001,6 +2001,12 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp)
 		INIT_LIST_HEAD(&sh->batch_list);
 		INIT_LIST_HEAD(&sh->lru);
 		atomic_set(&sh->count, 1);
+		for (i = 0; i < disks; i++) {
+			struct r5dev *dev = &sh->dev[i];
+
+			bio_init(&dev->req);
+			bio_init(&dev->rreq);
+		}
 	}
 	return sh;
 }
@@ -2008,7 +2014,7 @@ static int grow_one_stripe(struct r5conf *conf, gfp_t gfp)
 {
 	struct stripe_head *sh;
 
-	sh = alloc_stripe(conf->slab_cache, gfp);
+	sh = alloc_stripe(conf->slab_cache, gfp, conf->pool_size);
 	if (!sh)
 		return 0;
 
@@ -2179,7 +2185,7 @@ static int resize_stripes(struct r5conf *conf, int newsize)
 	mutex_lock(&conf->cache_size_mutex);
 
 	for (i = conf->max_nr_stripes; i; i--) {
-		nsh = alloc_stripe(sc, GFP_KERNEL);
+		nsh = alloc_stripe(sc, GFP_KERNEL, newsize);
 		if (!nsh)
 			break;
 
@@ -2311,6 +2317,7 @@ static void raid5_end_read_request(struct bio * bi)
 		(unsigned long long)sh->sector, i, atomic_read(&sh->count),
 		bi->bi_error);
 	if (i == disks) {
+		bio_reset(bi);
 		BUG();
 		return;
 	}
@@ -2414,6 +2421,7 @@ static void raid5_end_read_request(struct bio * bi)
 	clear_bit(R5_LOCKED, &sh->dev[i].flags);
 	set_bit(STRIPE_HANDLE, &sh->state);
 	raid5_release_stripe(sh);
+	bio_reset(bi);
 }
 
 static void raid5_end_write_request(struct bio *bi)
@@ -2448,6 +2456,7 @@ static void raid5_end_write_request(struct bio *bi)
 		(unsigned long long)sh->sector, i, atomic_read(&sh->count),
 		bi->bi_error);
 	if (i == disks) {
+		bio_reset(bi);
 		BUG();
 		return;
 	}
@@ -2491,18 +2500,17 @@ static void raid5_end_write_request(struct bio *bi)
 
 	if (sh->batch_head && sh != sh->batch_head)
 		raid5_release_stripe(sh->batch_head);
+	bio_reset(bi);
 }
 
 static void raid5_build_block(struct stripe_head *sh, int i, int previous)
 {
 	struct r5dev *dev = &sh->dev[i];
 
-	bio_init(&dev->req);
 	dev->req.bi_io_vec = &dev->vec;
 	dev->req.bi_max_vecs = 1;
 	dev->req.bi_private = sh;
 
-	bio_init(&dev->rreq);
 	dev->rreq.bi_io_vec = &dev->rvec;
 	dev->rreq.bi_max_vecs = 1;
 	dev->rreq.bi_private = sh;
-- 
2.8.0.rc2



* Re: [PATCH] raid6: fix the input of raid6 algorithm
From: H. Peter Anvin @ 2016-08-23  3:53 UTC (permalink / raw)
  To: liuzhengyuan; +Cc: shli, linux-raid, fenghua.yu, linux-kernel, liuzhengyuang521
In-Reply-To: <1471922577-18512-1-git-send-email-liuzhengyuan@kylinos.cn>

On August 22, 2016 8:22:57 PM PDT, liuzhengyuan@kylinos.cn wrote:
>
>To test and choose the best algorithm for raid6, a disk count and
>disk data must be supplied. Currently those inputs depend on the
>page size and the gfmul table. This makes the disk count less
>than 4 when the page size is more than 64KB. This patch supports
>arbitrary page sizes by defining a macro for the disk count and
>filling the disk data with random numbers.
>
>Signed-off-by: ZhengYuan Liu <liuzhengyuan@kylinos.cn>
>---
> lib/raid6/algos.c | 36 ++++++++++++++++++++++--------------
> 1 file changed, 22 insertions(+), 14 deletions(-)
>
>diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
>index 975c6e0..f15a4d2 100644
>--- a/lib/raid6/algos.c
>+++ b/lib/raid6/algos.c
>@@ -23,6 +23,7 @@
> #else
> #include <linux/module.h>
> #include <linux/gfp.h>
>+#include <linux/random.h>
> #if !RAID6_USE_EMPTY_ZERO_PAGE
> /* In .bss so it's zeroed */
> const char raid6_empty_zero_page[PAGE_SIZE] __attribute__((aligned(256)));
>@@ -30,6 +31,8 @@ EXPORT_SYMBOL(raid6_empty_zero_page);
> #endif
> #endif
> 
>+#define RAID6_DISKS	8
>+
> struct raid6_calls raid6_call;
> EXPORT_SYMBOL_GPL(raid6_call);
> 
>@@ -129,7 +132,7 @@ static inline const struct raid6_recov_calls *raid6_choose_recov(void)
> }
> 
> static inline const struct raid6_calls *raid6_choose_gen(
>-	void *(*const dptrs)[(65536/PAGE_SIZE)+2], const int disks)
>+	void *(*const dptrs)[RAID6_DISKS], const int disks)
> {
> 	unsigned long perf, bestgenperf, bestxorperf, j0, j1;
> 	int start = (disks>>1)-1, stop = disks-3;	/* work on the second half of the disks */
>@@ -206,27 +209,32 @@ static inline const struct raid6_calls *raid6_choose_gen(
> 
> int __init raid6_select_algo(void)
> {
>-	const int disks = (65536/PAGE_SIZE)+2;
>+	const int disks = RAID6_DISKS;
> 
> 	const struct raid6_calls *gen_best;
> 	const struct raid6_recov_calls *rec_best;
>-	char *syndromes;
>-	void *dptrs[(65536/PAGE_SIZE)+2];
>-	int i;
>-
>-	for (i = 0; i < disks-2; i++)
>-		dptrs[i] = ((char *)raid6_gfmul) + PAGE_SIZE*i;
>+	char *disk_ptr;
>+	void *dptrs[RAID6_DISKS];
>+	int i, j;
> 
>-	/* Normal code - use a 2-page allocation to avoid D$ conflict */
>-	syndromes = (void *) __get_free_pages(GFP_KERNEL, 1);
>+	/* use a 8-page allocation, The first 6 pages for disks
>+	   and the last 2 pages for syndromes */
>+	disk_ptr = (void *) __get_free_pages(GFP_KERNEL, 3);
> 
>-	if (!syndromes) {
>+	if (!disk_ptr) {
> 		pr_err("raid6: Yikes!  No memory available.\n");
> 		return -ENOMEM;
> 	}
> 
>-	dptrs[disks-2] = syndromes;
>-	dptrs[disks-1] = syndromes + PAGE_SIZE;
>+	/* Fix-me: may should use get_random_bytes_arch() instead of get_random_bytes() */
>+	for (i = 0; i < disks-2; i++) {
>+		dptrs[i] = disk_ptr + PAGE_SIZE*i;
>+		for (j = 0; j < PAGE_SIZE; j++)
>+			get_random_bytes(dptrs[i]+j, 1);
>+	}
>+
>+	dptrs[disks-2] = disk_ptr + PAGE_SIZE*(disks-2);
>+	dptrs[disks-1] = disk_ptr + PAGE_SIZE*(disks-1);
> 
> 	/* select raid gen_syndrome function */
> 	gen_best = raid6_choose_gen(&dptrs, disks);
>@@ -234,7 +242,7 @@ int __init raid6_select_algo(void)
> 	/* select raid recover functions */
> 	rec_best = raid6_choose_recov();
> 
>-	free_pages((unsigned long)syndromes, 1);
>+	free_pages((unsigned long)disk_ptr, 3);
> 
> 	return gen_best && rec_best ? 0 : -EINVAL;
> }

Do you have any idea how long this takes to run?  People are already complaining about the boot time penalty.  get_random_*() is quite expensive and is overkill...
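
For illustration only, one cheaper way to get non-trivial test data
would be a tiny deterministic generator instead of the kernel RNG. The
userspace sketch below uses Marsaglia's xorshift32; this is an
assumption for the sake of example, not something proposed in this
thread, and fill_pattern is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One step of Marsaglia's xorshift32; the state must be nonzero. */
uint32_t xorshift32(uint32_t x)
{
	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	return x;
}

/*
 * Fill buf with len pseudo-random bytes derived from seed, taking the
 * low byte of each successive state. Returns the final generator state
 * so a caller can continue the sequence into the next buffer.
 */
uint32_t fill_pattern(unsigned char *buf, size_t len, uint32_t seed)
{
	uint32_t state = seed;
	size_t i;

	assert(seed != 0);
	for (i = 0; i < len; i++) {
		state = xorshift32(state);
		buf[i] = (unsigned char)(state & 0xff);
	}
	return state;
}
```

The data is not cryptographically random, which should not matter for a
throughput benchmark, and the same seed always yields the same pattern.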
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.


* [PATCH] raid6: fix the input of raid6 algorithm
From: liuzhengyuan @ 2016-08-23  3:22 UTC (permalink / raw)
  To: hpa
  Cc: shli, linux-raid, fenghua.yu, linux-kernel, liuzhengyuang521,
	ZhengYuan Liu

From: ZhengYuan Liu <liuzhengyuan@kylinos.cn>

To test and choose the best algorithm for raid6, a disk count and
disk data must be supplied. Currently those inputs depend on the
page size and the gfmul table. This makes the disk count less
than 4 when the page size is more than 64KB. This patch supports
arbitrary page sizes by defining a macro for the disk count and
filling the disk data with random numbers.
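
To make the arithmetic concrete, the disk count in the current code is
(65536/PAGE_SIZE)+2, as seen in the lines removed by the diff below. A
minimal userspace sketch of that expression follows; the function name
old_selftest_disks is hypothetical and exists only for this example.

```c
#include <assert.h>

/* Disk count the selftest derives from the page size before this patch. */
int old_selftest_disks(unsigned long page_size)
{
	assert(page_size > 0);
	return (int)(65536UL / page_size) + 2;
}
```

With 4KB pages this gives 18 disks, with 16KB pages 6, with 64KB pages
3, and with anything larger only 2, i.e. the two syndrome pages and no
data disks at all.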

Signed-off-by: ZhengYuan Liu <liuzhengyuan@kylinos.cn>
---
 lib/raid6/algos.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 975c6e0..f15a4d2 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -23,6 +23,7 @@
 #else
 #include <linux/module.h>
 #include <linux/gfp.h>
+#include <linux/random.h>
 #if !RAID6_USE_EMPTY_ZERO_PAGE
 /* In .bss so it's zeroed */
 const char raid6_empty_zero_page[PAGE_SIZE] __attribute__((aligned(256)));
@@ -30,6 +31,8 @@ EXPORT_SYMBOL(raid6_empty_zero_page);
 #endif
 #endif
 
+#define RAID6_DISKS	8
+
 struct raid6_calls raid6_call;
 EXPORT_SYMBOL_GPL(raid6_call);
 
@@ -129,7 +132,7 @@ static inline const struct raid6_recov_calls *raid6_choose_recov(void)
 }
 
 static inline const struct raid6_calls *raid6_choose_gen(
-	void *(*const dptrs)[(65536/PAGE_SIZE)+2], const int disks)
+	void *(*const dptrs)[RAID6_DISKS], const int disks)
 {
 	unsigned long perf, bestgenperf, bestxorperf, j0, j1;
 	int start = (disks>>1)-1, stop = disks-3;	/* work on the second half of the disks */
@@ -206,27 +209,32 @@ static inline const struct raid6_calls *raid6_choose_gen(
 
 int __init raid6_select_algo(void)
 {
-	const int disks = (65536/PAGE_SIZE)+2;
+	const int disks = RAID6_DISKS;
 
 	const struct raid6_calls *gen_best;
 	const struct raid6_recov_calls *rec_best;
-	char *syndromes;
-	void *dptrs[(65536/PAGE_SIZE)+2];
-	int i;
-
-	for (i = 0; i < disks-2; i++)
-		dptrs[i] = ((char *)raid6_gfmul) + PAGE_SIZE*i;
+	char *disk_ptr;
+	void *dptrs[RAID6_DISKS];
+	int i, j;
 
-	/* Normal code - use a 2-page allocation to avoid D$ conflict */
-	syndromes = (void *) __get_free_pages(GFP_KERNEL, 1);
+	/* use a 8-page allocation, The first 6 pages for disks
+	   and the last 2 pages for syndromes */
+	disk_ptr = (void *) __get_free_pages(GFP_KERNEL, 3);
 
-	if (!syndromes) {
+	if (!disk_ptr) {
 		pr_err("raid6: Yikes!  No memory available.\n");
 		return -ENOMEM;
 	}
 
-	dptrs[disks-2] = syndromes;
-	dptrs[disks-1] = syndromes + PAGE_SIZE;
+	/* Fix-me: may should use get_random_bytes_arch() instead of get_random_bytes() */
+	for (i = 0; i < disks-2; i++) {
+		dptrs[i] = disk_ptr + PAGE_SIZE*i;
+		for (j = 0; j < PAGE_SIZE; j++)
+			get_random_bytes(dptrs[i]+j, 1);
+	}
+
+	dptrs[disks-2] = disk_ptr + PAGE_SIZE*(disks-2);
+	dptrs[disks-1] = disk_ptr + PAGE_SIZE*(disks-1);
 
 	/* select raid gen_syndrome function */
 	gen_best = raid6_choose_gen(&dptrs, disks);
@@ -234,7 +242,7 @@ int __init raid6_select_algo(void)
 	/* select raid recover functions */
 	rec_best = raid6_choose_recov();
 
-	free_pages((unsigned long)syndromes, 1);
+	free_pages((unsigned long)disk_ptr, 3);
 
 	return gen_best && rec_best ? 0 : -EINVAL;
 }
-- 
1.9.1






* Re: Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Adam Goryachev @ 2016-08-22 23:06 UTC (permalink / raw)
  To: Ben Kamen, linux-raid
In-Reply-To: <CADDTLRB9z6J0F6+uO3k-u74qnjZxiagbNLWrg_tP5GSZt-Vd5A@mail.gmail.com>

On 23/08/16 07:51, Ben Kamen wrote:
> Hey all. I'm looking at the RAID Wiki and need some help.
>
> First Info:
>
> I have a RAID5 with 4 members /dev/sd[cdef]1 where last night, sdc1
> reported a SMART error recommending drive replacement (after watching
> sector errors pile up for about a week.)
>
> No problem. Shut down the drive, pulled it, replaced it with a cold
> spare. Started the rebuild (around midnight CDT).
>
> At 5:43am, I got this message:
>
> This is an automatically generated mail message from mdadm
> running on quantum
>
> A Fail event had been detected on md device /dev/md127.
>
> It could be related to component device /dev/sde1.
>
> Faithfully yours, etc.
>
> P.S. The /proc/mdstat file currently contains the following:
>
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md0 : active raid1 sda2[0] sdb2[2]
>        511988 blocks super 1.0 [2/2] [UU]
>
> md127 : active raid5 sdc1[4] sdf1[6] sde1[1](F) sdd1[5]
>        2930276352 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [U_U_]
>        [===========>.........]  recovery = 55.9% (546131076/976758784) finish=381.6min speed=18805K/sec
>        bitmap: 4/8 pages [16KB], 65536KB chunk
>
> md1 : active raid1 sda3[0] sdb3[2]
>        239489916 blocks super 1.1 [2/2] [UU]
>        bitmap: 2/2 pages [8KB], 65536KB chunk
>
> md10 : active raid1 sda1[0] sdb1[2]
>        4193272 blocks super 1.1 [2/2] [UU]
>
> unused devices: <none>
>
> /dev/md127  is the one with issues.
>
> It looks like the SATA controller had issues. I couldn't see sde - so
> I rebooted. (scold me later.)
>
> All the drives are available. SMARTCTL tells me /dev/sde is happy as
> can be (has a few bad sectors and is slated for replacement next, but
> smart says drive is healthy).
>
> I looked at the raid Wiki - and saved the mdadm --examine info. Of the
> active members, the event count is off by 25 for happy vs unhappy
> members.
>
> But forcing the assembly claims
>
> mdadm --assemble --force /dev/md127 /dev/sd[cdef]1
> mdadm: /dev/sdc1 is busy - skipping
> mdadm: /dev/sdd1 is busy - skipping
> mdadm: /dev/sdf1 is busy - skipping
> mdadm: Found some drive for an array that is already active: /dev/md/:BigRAID
> mdadm: giving up.
>
> So before I mess up ANYTHING else...
>
> What should I be doing?
>
> (Should I be stopping the RAID? Right now it seems like it's running.)
>
> Thanks,
>
First step: if the raid is running, do a backup.
Second step: read all about SCT/ERC, and almost certainly fix the issues
with your drives (either enable SCT/ERC on the drive or set the timeout
appropriately).
Third step: make sure your backup is up to date.
Fourth step: provide the current status of the raid array: is it
resyncing, is the resync pending, is it finished, etc.
If it's finished, then don't replace the next drive in the same way; use
the replace method instead. That will keep redundancy in the array
during the replacement, and hopefully avoid this sort of issue.
Later, you might consider moving to RAID6 to add some additional
redundancy instead of using a cold spare.

I hope the above is helpful, but really we will need more information
about your drives before being able to make further suggestions. The
output of lsdrv (google it), smartctl, and mdadm --misc --detail
/dev/md127 would all be helpful.

Regards,
Adam


-- 
Adam Goryachev Website Managers www.websitemanagers.com.au


* Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Ben Kamen @ 2016-08-22 21:51 UTC (permalink / raw)
  To: linux-raid

Hey all. I'm looking at the RAID Wiki and need some help.

First Info:

I have a RAID5 with 4 members /dev/sd[cdef]1 where last night, sdc1
reported a SMART error recommending drive replacement (after watching
sector errors pile up for about a week.)

No problem. Shut down the drive, pulled it, replaced it with a cold
spare. Started the rebuild (around midnight CDT).

At 5:43am, I got this message:

This is an automatically generated mail message from mdadm
running on quantum

A Fail event had been detected on md device /dev/md127.

It could be related to component device /dev/sde1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sda2[0] sdb2[2]
      511988 blocks super 1.0 [2/2] [UU]

md127 : active raid5 sdc1[4] sdf1[6] sde1[1](F) sdd1[5]
      2930276352 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [U_U_]
      [===========>.........]  recovery = 55.9% (546131076/976758784) finish=381.6min speed=18805K/sec
      bitmap: 4/8 pages [16KB], 65536KB chunk

md1 : active raid1 sda3[0] sdb3[2]
      239489916 blocks super 1.1 [2/2] [UU]
      bitmap: 2/2 pages [8KB], 65536KB chunk

md10 : active raid1 sda1[0] sdb1[2]
      4193272 blocks super 1.1 [2/2] [UU]

unused devices: <none>

/dev/md127  is the one with issues.

It looks like the SATA controller had issues. I couldn't see sde - so
I rebooted. (scold me later.)

All the drives are available. SMARTCTL tells me /dev/sde is happy as
can be (has a few bad sectors and is slated for replacement next, but
smart says drive is healthy).

I looked at the raid Wiki - and saved the mdadm --examine info. Of the
active members, the event count is off by 25 for happy vs unhappy
members.

But forcing the assembly claims

mdadm --assemble --force /dev/md127 /dev/sd[cdef]1
mdadm: /dev/sdc1 is busy - skipping
mdadm: /dev/sdd1 is busy - skipping
mdadm: /dev/sdf1 is busy - skipping
mdadm: Found some drive for an array that is already active: /dev/md/:BigRAID
mdadm: giving up.

So before I mess up ANYTHING else...

What should I be doing?

(Should I be stopping the RAID? Right now it seems like it's running.)

Thanks,

   -Ben

