Re: kernel checksumming performance vs actual raid device performance

From: Shaohua Li <shli@kernel.org>
To: Matt Garman <matthew.garman@gmail.com>
Cc: Mdadm <linux-raid@vger.kernel.org>
Subject: Re: kernel checksumming performance vs actual raid device performance
Date: Tue, 23 Aug 2016 18:02:41 -0700
Message-ID: <20160824010241.GC57645@kernel.org>
In-Reply-To: <CAJvUf-C-Nr8sSnSPL-5jt1NLOAiZjhZ=bjDRUbX_RjphRL+yWA@mail.gmail.com>

On Tue, Jul 12, 2016 at 04:09:25PM -0500, Matt Garman wrote:
> We have a system with a 24-disk raid6 array, using 2TB SSDs.  We use
> this system in a workload that is 99.9% read-only (a few small
> writes/day, versus countless reads).  This system is an NFS server for
> about 50 compute nodes that continually read its data.
> 
> In a non-degraded state, the system works wonderfully: the md0_raid6
> process uses less than 1% CPU, each drive is around 20% utilization
> (via iostat), no swapping is taking place.  The outbound throughput
> averages around 2.0 GB/sec, with 2.5 GB/sec peaks.
> 
> However, we had a disk fail, and the throughput dropped considerably,
> with the md0_raid6 process pegged at 100% CPU.
> 
> I understand that data from the failed disk will need to be
> reconstructed from parity, and this will cause the md0_raid6 process
> to consume considerable CPU.
> 
> What I don't understand is how I can determine what kind of actual MD
> device performance (throughput) I can expect in this state?
> 
> Dmesg seems to give some hints:
> 
> [    6.386820] xor: automatically using best checksumming function:
> [    6.396690]    avx       : 24064.000 MB/sec
> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [    6.499774] raid6: using avx2x2 recovery algorithm
> 
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
> 
> Perhaps naively, I would expect that second-to-last line:
> 
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> 
> to indicate what kind of throughput I could expect in a degraded
> state, but clearly that is not right---or I have something
> misconfigured.
> 
> So in other words, what does that gen() 8648 MB/s metric mean in terms
> of real-world throughput?  Is there a way I can "convert" that number
> to expected throughput of a degraded array?

In non-degraded mode, raid6 just dispatches IO directly to the raid disks, so
software involvement is very small. In degraded mode, the data has to be
calculated. A lot of factors impact the performance:
1. IO enters the raid6 state machine, which has a long code path. (This is
debatable: if a small random read doesn't touch the faulty disk, raid6 doesn't
need to run the state machine. Fixing this could hugely improve the
performance.)
2. The state machine runs in a single thread, which is a bottleneck. Try
increasing group_thread_cnt, which makes the handling multi-threaded.
3. The stripe cache is involved. Try increasing stripe_cache_size.
4. The faulty disk's data must be calculated, which involves reading from the
other disks. If this is a NUMA machine and each disk interrupts different
CPUs/nodes, there will be a big impact (cache, wakeup IPI).
5. The xor calculation overhead. Actually I don't think the impact is big;
a modern CPU can do the calculation fast.
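
For items 2 and 3, a minimal sketch of how the two tunables could be read and
changed, assuming the array is md0 and its attributes live under
/sys/block/md0/md/. The helper names and the values 4 and 8192 are
illustrative assumptions, not recommendations; suitable values depend on the
machine and need to be measured. The memory estimate uses the formula from the
kernel md documentation (page size * number of disks * stripe_cache_size) and
assumes a 4096-byte page.

```python
from pathlib import Path


def md_tunable_path(md_name, tunable, sysfs_root="/sys/block"):
    """Return the sysfs path of an md attribute, e.g. /sys/block/md0/md/<tunable>."""
    return Path(sysfs_root) / md_name / "md" / tunable


def read_tunable(md_name, tunable, sysfs_root="/sys/block"):
    """Read an integer md attribute such as stripe_cache_size."""
    return int(md_tunable_path(md_name, tunable, sysfs_root).read_text().strip())


def write_tunable(md_name, tunable, value, sysfs_root="/sys/block"):
    """Write an integer md attribute (needs root on a real system)."""
    md_tunable_path(md_name, tunable, sysfs_root).write_text(f"{value}\n")


def stripe_cache_bytes(stripe_cache_size, nr_disks, page_size=4096):
    """Estimate stripe cache memory: page_size * nr_disks * stripe_cache_size."""
    return page_size * nr_disks * stripe_cache_size


if __name__ == "__main__":
    md = "md0"  # assumption: the array from the original report
    for name in ("group_thread_cnt", "stripe_cache_size"):
        print(name, "=", read_tunable(md, name))

    # Illustrative values only; tune and benchmark on the real system.
    write_tunable(md, "group_thread_cnt", 4)
    write_tunable(md, "stripe_cache_size", 8192)

    mib = stripe_cache_bytes(8192, nr_disks=24) / (1024 * 1024)
    print(f"stripe cache would use about {mib:.0f} MiB")
```

With 24 disks and 4096-byte pages, a stripe_cache_size of 8192 would take
about 768 MiB of RAM, so the cache cannot be raised without regard to memory.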

Thanks,
Shaohua
