From: Shaohua Li <shli@kernel.org>
To: Matt Garman <matthew.garman@gmail.com>
Cc: Mdadm <linux-raid@vger.kernel.org>
Subject: Re: kernel checksumming performance vs actual raid device performance
Date: Tue, 23 Aug 2016 18:02:41 -0700 [thread overview]
Message-ID: <20160824010241.GC57645@kernel.org> (raw)
In-Reply-To: <CAJvUf-C-Nr8sSnSPL-5jt1NLOAiZjhZ=bjDRUbX_RjphRL+yWA@mail.gmail.com>
On Tue, Jul 12, 2016 at 04:09:25PM -0500, Matt Garman wrote:
> We have a system with a 24-disk raid6 array, using 2TB SSDs. We use
> this system in a workload that is 99.9% read-only (a few small
> writes/day, versus countless reads). This system is an NFS server for
> about 50 compute nodes that continually read its data.
>
> In a non-degraded state, the system works wonderfully: the md0_raid6
> process uses less than 1% CPU, each drive is around 20% utilization
> (via iostat), no swapping is taking place. The outbound throughput
> averages around 2.0 GB/sec, with 2.5 GB/sec peaks.
>
> However, we had a disk fail, and the throughput dropped considerably,
> with the md0_raid6 process pegged at 100% CPU.
>
> I understand that data from the failed disk will need to be
> reconstructed from parity, and this will cause the md0_raid6 process
> to consume considerable CPU.
>
> What I don't understand is how I can determine what kind of actual MD
> device performance (throughput) I can expect in this state?
>
> Dmesg seems to give some hints:
>
> [ 6.386820] xor: automatically using best checksumming function:
> [ 6.396690] avx : 24064.000 MB/sec
> [ 6.414706] raid6: sse2x1 gen() 7636 MB/s
> [ 6.431725] raid6: sse2x2 gen() 3656 MB/s
> [ 6.448742] raid6: sse2x4 gen() 3917 MB/s
> [ 6.465753] raid6: avx2x1 gen() 5425 MB/s
> [ 6.482766] raid6: avx2x2 gen() 7593 MB/s
> [ 6.499773] raid6: avx2x4 gen() 8648 MB/s
> [ 6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [ 6.499774] raid6: using avx2x2 recovery algorithm
>
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
>
> Perhaps naively, I would expect that second-to-last line:
>
> [ 6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
>
> to indicate what kind of throughput I could expect in a degraded
> state, but clearly that is not right---or I have something
> misconfigured.
>
> So in other words, what does that gen() 8648 MB/s metric mean in terms
> of real-world throughput? Is there a way I can "convert" that number
> to expected throughput of a degraded array?
In non-degraded mode, raid6 just dispatches IO directly to the raid disks;
software involvement is very small. In degraded mode, the missing data must be
calculated. There are a lot of factors impacting the performance:
1. entering the raid6 state machine, which has a long code path. (This is
debatable: if a read doesn't touch the faulty disk and is a small random read,
raid6 doesn't need to run the state machine at all. Fixing this could hugely
improve the performance.)
2. the state machine runs in a single thread, which is a bottleneck. Try
increasing group_thread_cnt, which makes the handling multi-threaded.
3. the stripe cache is involved. Try increasing stripe_cache_size.
4. the faulty disk's data must be calculated, which involves reads from the
other disks. If this is a NUMA machine, and each disk interrupts different
cpus/nodes, there will be a big impact (cache effects, wakeup IPIs).
5. the xor calculation overhead. Actually I don't think this impact is big;
modern cpus can do the calculation fast.
Thanks,
Shaohua
Thread overview: 26+ messages
2016-07-12 21:09 kernel checksumming performance vs actual raid device performance Matt Garman
2016-07-13 3:58 ` Brad Campbell
[not found] ` <CAFx4rwQj3_JTNiS0zsQjp_sPXWkrp0ggjg_UiR7oJ8u0X9PQVA@mail.gmail.com>
2016-07-13 16:52 ` Fwd: " Doug Dumitru
2016-08-16 19:44 ` Matt Garman
2016-08-16 22:51 ` Doug Dumitru
2016-08-17 0:27 ` Adam Goryachev
[not found] ` <CAFx4rwTawqrBOWVwtPnGhRRAM1XiGQkS-o3YykmD0AftR45YkA@mail.gmail.com>
2016-08-23 14:34 ` Matt Garman
2016-08-23 15:02 ` Chris Murphy
[not found] ` <CAJvUf-Dqesy2TJX7W-bPakzeDcOoNy0VoSWWM06rKMYMhyhY7g@mail.gmail.com>
[not found] ` <CAFx4rwSQQuqeCFm+60+Gm75D49tg+mVjU=BnQSZThdE7E6KqPQ@mail.gmail.com>
2016-08-23 14:54 ` Matt Garman
2016-08-23 18:00 ` Doug Ledford
2016-08-23 18:27 ` Doug Dumitru
2016-08-23 19:10 ` Doug Ledford
2016-08-23 19:19 ` Doug Dumitru
2016-08-23 19:26 ` Doug Ledford
2016-08-23 19:26 ` Matt Garman
2016-08-23 19:41 ` Doug Dumitru
2016-08-23 20:15 ` Doug Ledford
2016-08-23 21:42 ` Phil Turmel
2016-08-24 1:02 ` Shaohua Li [this message]
2016-08-25 15:07 ` Matt Garman
2016-08-25 23:39 ` Adam Goryachev
2016-08-26 13:01 ` Matt Garman
2016-08-26 20:04 ` Doug Dumitru
2016-08-26 21:57 ` Phil Turmel
2016-08-26 22:11 ` Doug Dumitru
2016-08-26 18:11 ` Wols Lists