From: Adam Borowski <kilobyte@angband.pl>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Qu Wenruo <quwenruo@cn.fujitsu.com>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: RAID system with adaption to changed number of disks
Date: Thu, 13 Oct 2016 05:40:11 +0200
Message-ID: <20161013034011.GB17385@angband.pl>
In-Reply-To: <20161012211017.GJ26140@hungrycats.org>
On Wed, Oct 12, 2016 at 05:10:18PM -0400, Zygo Blaxell wrote:
> On Wed, Oct 12, 2016 at 09:55:28PM +0200, Adam Borowski wrote:
> > On Wed, Oct 12, 2016 at 01:19:37PM -0400, Zygo Blaxell wrote:
> > > I had been thinking that we could inject "plug" extents to fill up
> > > RAID5 stripes.
> > Your idea sounds good, but there's one problem: most real users don't
> > balance. Ever. Contrary to the tribal wisdom here, this actually works
> > fine, unless you had a pathological load skewed to either data or metadata
> > on the first write and then filled the disk to near-capacity with a load
> > skewed the other way.
>
> > Most usage patterns produce a mix of transient and persistent data (and at
> > write time you don't know which file is which), meaning that with time every
> > stripe will contain a smidge of cold data plus a fill of plug extents.
>
> Yes, it'll certainly reduce storage efficiency. I think all the
> RMW-avoidance strategies have this problem. The alternative is to risk
> losing data or the entire filesystem on disk failure, so any of the
> RMW-avoidance strategies are probably a worthwhile tradeoff. Big RAID5/6
> arrays tend to be used mostly for storing large sequentially-accessed
> files which are less susceptible to this kind of problem.
>
> If the pattern is lots of small random writes then performance on raid5
> will be terrible anyway (though it may even be improved by using plug
> extents, since RMW stripe updates would be replaced with pure CoW).
I've looked at some simple scenarios, and it appears that with your scheme the
total amount of I/O would increase, but performance would not suffer, as the
extra I/O happens only when the disk would otherwise be idle. There's also a
latency win and a fragmentation win -- all while fixing the write hole!
Let's assume a leaf size of 16KB and a stripe size of 64KB. The disk has four
stripes, each 75% full and 25% deleted. '*' marks cold data, '.' deleted/plug
space, 'x' new data. Entirely empty stripes are not drawn.
***.
***.
***.
***.
The user wants to write 64KB of data.
RMW needs to read 12 leaves and write 16, no matter whether the data comes in
one commit or four.
***x
***x
***x
***x
Latency: 28 for a big commit, or 7 per commit for small commits; total I/O 28.
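For the record, here's how I'm counting -- a back-of-envelope Python sketch of
the model I'm assuming above (4 data leaves per stripe, the 3 remaining cold
leaves are read back, and the whole stripe is rewritten, with parity traffic
folded into that rewrite):

LEAVES_PER_STRIPE = 4
STRIPES = 4
COLD_PER_STRIPE = 3                         # '*' leaves that must be read back

reads  = STRIPES * COLD_PER_STRIPE          # 12 leaf reads
writes = STRIPES * LEAVES_PER_STRIPE        # 16 leaf writes
print(reads, writes, reads + writes)        # 12 16 28 -> total I/O 28

# One big commit touches all four stripes at once: latency 28.
# Four small commits touch one stripe each: 3 reads + 4 writes = 7 apiece.
print(COLD_PER_STRIPE + LEAVES_PER_STRIPE)  # 7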
The plug-extent scheme requires compaction (a partial balance) first:
****
****
****
I/O so far 24.
Big commit:
****
****
****
xxxx
Latency 4, total I/O 28.
If we had to compact on demand, the latency would be 28 (assuming we can
balance at stripe granularity).
Small commits, no concurrent writes:
****
****
****
x...
x...
x...
x...
Latency 1 per commit, I/O so far 28; we then need another compaction:
****
****
****
xxxx
Total I/O 32.
Small I/O, with concurrent writes that peg the disk:
****
****
****
xyyy
xyyy
xyyy
xyyy
Total I/O 28 (not counting concurrent writes).
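The plug-extent totals come out the same way (again a rough sketch with my own
assumptions: compaction reads every cold leaf and writes it exactly once, and
leaves written in the commits just before a follow-up compaction are still in
the page cache, so that compaction costs only writes):

COLD_LEAVES = 12                  # cold data across the four stripes
compaction = COLD_LEAVES * 2      # read 12 + write 12 = 24

big_commit = compaction + 4       # one fresh CoW stripe of new data
print(big_commit)                 # 28, commit latency 4

small_commits = compaction + 4    # four 1-leaf CoW commits, latency 1 each
small_commits += 4                # re-compact the four partial stripes (reads cached)
print(small_commits)              # 32

concurrent = compaction + 4       # new data shares CoW stripes with concurrent writes
print(concurrent)                 # 28, not counting the concurrent writes themselves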
Other scenarios I've analyzed give similar results.
I'm not sure if my thinking is correct, but if it is, the outcome is quite
surprising: no performance loss even though we had to rewrite the stripes!
> > Thus, while the plug extents idea doesn't suffer from problems of big
> > sectors you just mentioned, we'd need some kind of auto-balance.
>
> Another way to approach the problem is to relocate the blocks in
> partially filled RMW stripes so they can be effectively CoW stripes;
> however, the requirement to do full extent relocations leads to some
> nasty write amplification and performance ramifications. Balance is
> hugely heavy I/O load and there are good reasons not to incur it at
> unexpected times.
We don't need balance in the btrfs sense; it's enough to compact stripes --
i.e., something akin to balance, except done at the stripe level rather than
the allocation-block level.
As for write amplification, the F2FS guys solved the issue by having two types
of cleaning (balancing):
* on demand (when there is no free space and thus it needs to be done NOW)
* in the background (done only on cold data)
The on-demand clean goes for the juiciest targets first (the least live data
per stripe); the background clean, on the other hand, uses a formula that takes
into account both the amount of space to reclaim and the age of the stripe. If
the data is hot, it shouldn't be cleaned yet -- it's likely to be deleted or
modified soon.
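To make the two policies concrete, here's a toy sketch (my own illustration,
not F2FS's actual code; the background formula just follows the classic
LFS-style cost-benefit shape):

def greedy_victim(stripes):
    # On-demand clean: take the stripe with the least live data, i.e. the
    # one that frees the most space for the least copying.
    return min(stripes, key=lambda s: s["live_leaves"])

def cost_benefit_victim(stripes, now):
    # Background clean: weigh reclaimable space against the stripe's age,
    # so hot stripes (likely to shrink on their own) are left alone for now.
    def score(s):
        u = s["live_leaves"] / s["total_leaves"]   # utilization, 0..1
        age = now - s["last_modified"]
        return (1.0 - u) * age / (1.0 + u)
    return max(stripes, key=score)

stripes = [
    {"live_leaves": 1, "total_leaves": 4, "last_modified": 90},  # hot, nearly empty
    {"live_leaves": 3, "total_leaves": 4, "last_modified": 10},  # cold, mostly full
]
print(greedy_victim(stripes))             # picks the nearly empty (hot) stripe
print(cost_benefit_victim(stripes, 100))  # picks the cold stripe despite less free space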
Meow!
--
A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol, 1kg
raspberries, 0.4kg sugar; put into a big jar for 1 month. Filter out and
throw away the fruits (can dump them into a cake, etc), let the drink age
at least 3-6 months.