From: Shaohua Li <shli@kernel.org>
To: Matt Garman <matthew.garman@gmail.com>
Cc: Mdadm <linux-raid@vger.kernel.org>
Subject: Re: kernel checksumming performance vs actual raid device performance
Date: Tue, 23 Aug 2016 18:02:41 -0700
Message-ID: <20160824010241.GC57645@kernel.org>
In-Reply-To: <CAJvUf-C-Nr8sSnSPL-5jt1NLOAiZjhZ=bjDRUbX_RjphRL+yWA@mail.gmail.com>

On Tue, Jul 12, 2016 at 04:09:25PM -0500, Matt Garman wrote:
> We have a system with a 24-disk raid6 array, using 2TB SSDs.  We use
> this system in a workload that is 99.9% read-only (a few small
> writes/day, versus countless reads).  This system is an NFS server for
> about 50 compute nodes that continually read its data.
> 
> In a non-degraded state, the system works wonderfully: the md0_raid6
> process uses less than 1% CPU, each drive is around 20% utilization
> (via iostat), no swapping is taking place.  The outbound throughput
> averages around 2.0 GB/sec, with 2.5 GB/sec peaks.
> 
> However, we had a disk fail, and the throughput dropped considerably,
> with the md0_raid6 process pegged at 100% CPU.
> 
> I understand that data from the failed disk will need to be
> reconstructed from parity, and this will cause the md0_raid6 process
> to consume considerable CPU.
> 
> What I don't understand is how I can determine what kind of actual MD
> device performance (throughput) I can expect in this state?
> 
> Dmesg seems to give some hints:
> 
> [    6.386820] xor: automatically using best checksumming function:
> [    6.396690]    avx       : 24064.000 MB/sec
> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [    6.499774] raid6: using avx2x2 recovery algorithm
> 
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
> 
> Perhaps naively, I would expect that second-to-last line:
> 
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> 
> to indicate what kind of throughput I could expect in a degraded
> state, but clearly that is not right---or I have something
> misconfigured.
> 
> So in other words, what does that gen() 8648 MB/s metric mean in terms
> of real-world throughput?  Is there a way I can "convert" that number
> to expected throughput of a degraded array?

In non-degraded mode, raid6 dispatches IO directly to the member disks, so
the software overhead is very small. In degraded mode, the missing data has
to be reconstructed, and a lot of factors impact the performance:
1. every IO enters the raid6 stripe state machine, which has a long code
path. (This is debatable: a small random read that doesn't touch the faulty
disk doesn't need to run the state machine at all. Fixing this could hugely
improve the performance.)
2. the state machine runs in a single thread, which is a bottleneck. Try
increasing group_thread_cnt, which makes the stripe handling multi-threaded.
3. the stripe cache is involved. Try increasing stripe_cache_size. (Example
commands for both knobs follow this list.)
4. the data of the faulty disk must be reconstructed, which involves reads
from the other disks. If this is a NUMA machine and each disk interrupts a
different CPU/node, there will be a big impact (cache traffic, wakeup IPIs).
5. the xor calculation overhead. Actually I don't think the impact is big;
modern CPUs can do the calculation fast.
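
A minimal sketch of the two sysfs knobs from points 2 and 3, assuming the
array is md0 (the values are only a starting point; tune them for your
workload):

  # let several worker threads handle stripes instead of one (default is 0)
  echo 4 > /sys/block/md0/md/group_thread_cnt

  # grow the stripe cache; the value is a number of stripe entries (default 256)
  echo 8192 > /sys/block/md0/md/stripe_cache_size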

Thanks,
Shaohua
