From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: dstat shows unexpected result for two disk RAID1
Date: Thu, 10 Mar 2016 04:06:45 +0000 (UTC)
Message-ID: <pan$9dde$acf9ba51$bb837cb8$8c954a04@cox.net>
In-Reply-To: 20160310023627.2e915667@natsu
Roman Mamedov posted on Thu, 10 Mar 2016 02:36:27 +0500 as excerpted:
> It's a known limitation that the disks are in effect "pinned" to running
> processes, based on their process ID. One process reads from the same
> disk, from the point it started and until it terminates. Other processes
> by luck may read from a different disk, thus achieving load balancing.
> Or they may not, and you will have contention with the other disk
> idling. This is unlike MD RAID1, which knows to distribute read load
> dynamically to the least-utilized array members.
>
> Now if you want to do some more performance evaluation, check with
> your dstat if both disks happen to *write* data in parallel, when you
> write to the array, as ideally they should. Last I checked they mostly
> didn't, and this almost halved write performance on a Btrfs RAID1
> compared to a single disk.
As stated, at present btrfs mostly handles devices one at a time per
task.  (I've made it a personal point to say devices rather than disks
unless something is /specific/ /to/ spinning rust, because SSDs etc;
device remains correct in either case.)
And for raid1 reads in particular, the read scheduler is a very simple
even/odd PID based scheduler.  It was implemented early on, when
simplicity of implementation, and easy testing of all three scenarios
(single-task single-device, multi-task multi-device, and multi-task
bottlenecked to a single device), mattered far more than speed.  Indeed,
at that point, optimization would have been a prime example of
"premature optimization": it would almost certainly have either
restricted the implementation choices for features added later, or
needed to be redone once those features and their constraints were
known, losing the work done on the first optimization.
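To make the current behavior concrete, here's a rough userspace sketch
(my own illustration, /not/ the kernel code) of what even/odd PID mirror
selection amounts to for a two-copy raid1 chunk:

  #include <stdio.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Hypothetical, simplified sketch, not btrfs source.  A two-copy
   * raid1 chunk lives on two devices; which copy a task reads is
   * purely a function of its PID. */
  struct chunk_map {
      int num_copies;         /* 2 for today's btrfs raid1 */
      int devices[2];         /* devices holding the two copies */
  };

  static int pick_mirror(const struct chunk_map *map, pid_t pid)
  {
      /* Even PIDs read copy 0, odd PIDs copy 1, for the life of the
       * task; device load and idleness are never consulted. */
      return map->devices[pid % map->num_copies];
  }

  int main(void)
  {
      struct chunk_map map = { .num_copies = 2, .devices = { 0, 1 } };
      pid_t me = getpid();

      /* A single reader always lands on the same device... */
      printf("pid %d reads from device %d\n",
             (int)me, pick_mirror(&map, me));
      /* ...and two readers balance only if their PIDs differ in
       * parity; otherwise they contend while the other device idles. */
      printf("pid %d would read from device %d\n",
             (int)me + 1, pick_mirror(&map, me + 1));
      return 0;
  }

Conceptually, that's all the balancing there is: the parity of the
reader's PID.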
And in fact, I've pointed out this very thing as an easily seen example
of why btrfs isn't yet fully stable or production ready -- as can be
seen in the choices of the developers themselves.  Any developer worth
the name will be very wary of the dangers of "premature optimization"
and the risk it brings of either severely limiting the implementation of
further features, or of having good work thrown out because it doesn't
match the new code.
When the devs consider the btrfs code stable enough, they'll optimize
this. Until then, it's prime evidence that they do _not_ consider btrfs
stable and mature enough for this sort of optimization just yet. =:^)
Meanwhile, N-way-mirroring has been on the roadmap for implementation
after raid56 for quite some time (since at least kernel 3.5, when raid56
was expected in 3.6).  That's basically raid1 the way mdraid does it, so
5 devices means 5 mirrors, instead of what we have now: precisely two
mirrors of each chunk, with new chunks distributed across the other
devices until they've all been used (tho the two-copy mode would
continue to be an option).
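For contrast, here's a rough sketch of today's two-copy allocation.  To
my understanding the chunk allocator places each new raid1 chunk on the
two devices with the most unallocated space; the code below is only my
own illustration of that idea, not the kernel's, showing how 5 devices
all end up holding copies while each individual chunk still has exactly
two:

  #include <stdio.h>

  /* Illustration only, not kernel code: allocate raid1 chunks, each
   * with exactly two copies, always on the two devices that currently
   * have the most unallocated space. */
  #define NDEV  5
  #define CHUNK 1   /* allocate in 1 GiB units, for simplicity */

  int main(void)
  {
      int free_gib[NDEV] = { 10, 10, 10, 10, 10 };

      for (int chunk = 0; chunk < 8; chunk++) {
          /* Pick the device with the most free space... */
          int a = 0;
          for (int d = 1; d < NDEV; d++)
              if (free_gib[d] > free_gib[a])
                  a = d;
          /* ...and the runner-up, for the second copy. */
          int b = (a == 0) ? 1 : 0;
          for (int d = 0; d < NDEV; d++)
              if (d != a && free_gib[d] > free_gib[b])
                  b = d;

          free_gib[a] -= CHUNK;
          free_gib[b] -= CHUNK;
          printf("chunk %d: copies on devices %d and %d\n", chunk, a, b);
      }
      return 0;
  }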
And FWIW, N-way-mirroring is a primary feature interest of mine so I've
been following it more closely than much of btrfs development.
Of course the logical raid10 extension of that would be the ability to
specify N mirrors and M stripes on raid10 as well, so that for a
6-device raid10 you could choose between the existing two-way-mirroring/
three-way-striping layout and a new three-way-mirroring/two-way-striping
mode, tho I don't know whether they'll implement N-way-mirroring raid1
and N-way-mirroring raid10 at the same time, or wait on the latter.
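As a trivial back-of-the-envelope illustration (mine, not anything btrfs
implements today), here's what the layout choices for a given device
count would look like if both the mirror and stripe widths were
configurable:

  #include <stdio.h>

  /* Illustration only: list the N-way-mirror x M-way-stripe layouts
   * that exactly fill a given number of devices, requiring at least
   * 2 mirrors and 2 stripes for it to still be "raid10". */
  static void list_layouts(int devices)
  {
      printf("%d devices:\n", devices);
      for (int mirrors = 2; mirrors * 2 <= devices; mirrors++)
          if (devices % mirrors == 0)
              printf("  %d-way mirroring x %d-way striping\n",
                     mirrors, devices / mirrors);
  }

  int main(void)
  {
      list_layouts(6);    /* prints the 2x3 and 3x2 modes above */
      return 0;
  }

For 6 devices that's just the two modes mentioned above, but with more
devices the tradeoff between resilience (more mirrors) and capacity/
throughput (more stripes) gets more interesting.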
Either way, my point in bringing up N-way-mirroring is that it has been
roadmapped for quite some time, and with it roadmapped, attempting
either a two-way-only optimization or an N-way optimization now arguably
_would_ be premature optimization: the first would have to be redone for
N-way once it became available, and there's no way to test that the
second actually works beyond two-way until N-way is actually available.
So I'd guess N-way-read-optimization, with N=2 just one of the
possibilities, will come after N-way-mirroring, which in turn has long
been roadmapped for after raid56.
Meanwhile, parity-raid (aka raid56) isn't as bad as it was when first
nominally completed in 3.19, but as of 4.4 (and I think 4.5, as I've not
seen a full trace yet, let alone a fix) there's still at least one known
bug remaining to be traced down and exterminated.  It causes at least
some raid56 reshapes to a different number of devices, or recovery from
a lost device, to take at least 10 times as long as they logically
should; we're talking weeks to months, during which time the array can
be used, but if it's a bad-device replacement and more devices go down
in that window...  So even if it's not an immediate data-loss bug, it's
still a blocker in terms of actually using parity-raid for the purposes
parity-raid is normally used for.
So raid56, while nominally complete now (after nearly four /years/ of
work; remember, it was originally intended for kernel 3.5 or 3.6), still
isn't anything close to as stable as the rest of btrfs, and still
requires developer focus.  It could be a while before we see the N-way-
mirroring that was roadmapped after it, which in turn means it'll likely
be even longer before we see good raid1 read optimization.
Tho hopefully all the really tough problems they would have hit with
N-way-mirroring were hit and resolved with raid56, and N-way-mirroring
will thus be relatively simple, so hopefully it takes less than the four
years raid56 is taking.  But I don't expect to see it for another year
or two, and don't expect to actually use it as intended (as a more
failure-resistant raid1) for some time after that as the bugs get worked
out, so realistically, 2-3 years.
If multi-device read-scheduling optimization is done, say, 6 months
after that... that means we're looking at 2.5-3.5 years, perhaps longer,
for it.  So it's a known issue, yes, and on the roadmap, yes, but don't
expect to see anything in the near (under-2-year) future; more like the
intermediate (3-5 year) future.  In all honesty I don't seriously expect
it to slip to the long-term future, beyond 5 years, but it's possible.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman