From: Adam Borowski <kilobyte@angband.pl>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Qu Wenruo <quwenruo@cn.fujitsu.com>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: RAID system with adaption to changed number of disks
Date: Thu, 13 Oct 2016 05:40:11 +0200
Message-ID: <20161013034011.GB17385@angband.pl>
In-Reply-To: <20161012211017.GJ26140@hungrycats.org>
On Wed, Oct 12, 2016 at 05:10:18PM -0400, Zygo Blaxell wrote:
> On Wed, Oct 12, 2016 at 09:55:28PM +0200, Adam Borowski wrote:
> > On Wed, Oct 12, 2016 at 01:19:37PM -0400, Zygo Blaxell wrote:
> > > I had been thinking that we could inject "plug" extents to fill up
> > > RAID5 stripes.
> > Your idea sounds good, but there's one problem: most real users don't
> > balance. Ever. Contrary to the tribal wisdom here, this actually works
> > fine, unless you had a pathological load skewed to either data or metadata
> > on the first write and then filled the disk to near-capacity with a load
> > skewed the other way.
>
> > Most usage patterns produce a mix of transient and persistent data (and at
> > write time you don't know which file is which), meaning that with time every
> > stripe will contain a smidge of cold data plus a fill of plug extents.
>
> Yes, it'll certainly reduce storage efficiency. I think all the
> RMW-avoidance strategies have this problem. The alternative is to risk
> losing data or the entire filesystem on disk failure, so any of the
> RMW-avoidance strategies are probably a worthwhile tradeoff. Big RAID5/6
> arrays tend to be used mostly for storing large sequentially-accessed
> files which are less susceptible to this kind of problem.
>
> If the pattern is lots of small random writes then performance on raid5
> will be terrible anyway (though it may even be improved by using plug
> extents, since RMW stripe updates would be replaced with pure CoW).
I've looked at some simple scenarios, and it appears that with your scheme the
total amount of I/O would increase, but performance would not suffer, as the
extra I/O happens only when the disk would otherwise be idle. There's also a
latency win and a fragmentation win -- all while fixing the write hole!
Let's assume a leaf size of 16KB and a stripe size of 64KB. The disk has four
stripes, each 75% full and 25% deleted. '*' marks cold data, '.' deleted/plug
space, 'x' new data. Entirely empty stripes are not drawn.
***.
***.
***.
***.
The user wants to write 64KB of data.
RMW needs to read 12 leaves and write 16, no matter whether the data comes in
one commit or four.
***x
***x
***x
***x
Latency: 28 for a big commit, or 7 per commit for small commits; total I/O 28.
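For the record, here's how I'm counting -- a back-of-envelope Python sketch of
the model I'm assuming above (4 data leaves per stripe, the 3 remaining cold
leaves are read back, and the whole stripe is rewritten, with parity traffic
folded into that rewrite):

LEAVES_PER_STRIPE = 4
STRIPES = 4
COLD_PER_STRIPE = 3                         # '*' leaves that must be read back

reads  = STRIPES * COLD_PER_STRIPE          # 12 leaf reads
writes = STRIPES * LEAVES_PER_STRIPE        # 16 leaf writes
print(reads, writes, reads + writes)        # 12 16 28 -> total I/O 28

# One big commit touches all four stripes at once: latency 28.
# Four small commits touch one stripe each: 3 reads + 4 writes = 7 apiece.
print(COLD_PER_STRIPE + LEAVES_PER_STRIPE)  # 7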
The plug-extent scheme requires compaction (a partial balance) first:
****
****
****
I/O so far 24.
Big commit:
****
****
****
xxxx
Latency 4, total I/O 28.
If we had to compact on demand, the latency would be 28 (assuming we can
balance at stripe granularity).
Small commits, no concurrent writes:
****
****
****
x...
x...
x...
x...
Latency 1 per commit, I/O so far 28; we then need another compaction:
****
****
****
xxxx
Total I/O 32.
Small I/O, with concurrent writes that peg the disk:
****
****
****
xyyy
xyyy
xyyy
xyyy
Total I/O 28 (not counting concurrent writes).
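The plug-extent totals come out the same way (again a rough sketch with my own
assumptions: compaction reads every cold leaf and writes it exactly once, and
leaves written in the commits just before a follow-up compaction are still in
the page cache, so that compaction costs only writes):

COLD_LEAVES = 12                  # cold data across the four stripes
compaction = COLD_LEAVES * 2      # read 12 + write 12 = 24

big_commit = compaction + 4       # one fresh CoW stripe of new data
print(big_commit)                 # 28, commit latency 4

small_commits = compaction + 4    # four 1-leaf CoW commits, latency 1 each
small_commits += 4                # re-compact the four partial stripes (reads cached)
print(small_commits)              # 32

concurrent = compaction + 4       # new data shares CoW stripes with concurrent writes
print(concurrent)                 # 28, not counting the concurrent writes themselves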
Other scenarios I've analyzed give similar results.
I'm not sure if my thinking is correct, but if it is, the outcome is quite
surprising: no performance loss even though we had to rewrite the stripes!
> > Thus, while the plug extents idea doesn't suffer from problems of big
> > sectors you just mentioned, we'd need some kind of auto-balance.
>
> Another way to approach the problem is to relocate the blocks in
> partially filled RMW stripes so they can be effectively CoW stripes;
> however, the requirement to do full extent relocations leads to some
> nasty write amplification and performance ramifications. Balance is
> hugely heavy I/O load and there are good reasons not to incur it at
> unexpected times.
We don't need balance in the btrfs sense; it's enough to compact stripes --
i.e., something akin to balance, except done at the stripe level rather than
the allocation-block level.
As for write amplification, the F2FS guys solved the issue by having two types
of cleaning (balancing):
* on demand (when there is no free space and thus it needs to be done NOW)
* in the background (done only on cold data)
The on-demand clean goes for the juiciest targets first (the least live data
per stripe); the background clean, on the other hand, uses a formula that takes
into account both the amount of space to reclaim and the age of the stripe. If
the data is hot, it shouldn't be cleaned yet -- it's likely to be deleted or
modified soon.
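To make the two policies concrete, here's a toy sketch (my own illustration,
not F2FS's actual code; the background formula just follows the classic
LFS-style cost-benefit shape):

def greedy_victim(stripes):
    # On-demand clean: take the stripe with the least live data, i.e. the
    # one that frees the most space for the least copying.
    return min(stripes, key=lambda s: s["live_leaves"])

def cost_benefit_victim(stripes, now):
    # Background clean: weigh reclaimable space against the stripe's age,
    # so hot stripes (likely to shrink on their own) are left alone for now.
    def score(s):
        u = s["live_leaves"] / s["total_leaves"]   # utilization, 0..1
        age = now - s["last_modified"]
        return (1.0 - u) * age / (1.0 + u)
    return max(stripes, key=score)

stripes = [
    {"live_leaves": 1, "total_leaves": 4, "last_modified": 90},  # hot, nearly empty
    {"live_leaves": 3, "total_leaves": 4, "last_modified": 10},  # cold, mostly full
]
print(greedy_victim(stripes))             # picks the nearly empty (hot) stripe
print(cost_benefit_victim(stripes, 100))  # picks the cold stripe despite less free space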
Meow!
--
A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol, 1kg
raspberries, 0.4kg sugar; put into a big jar for 1 month. Filter out and
throw away the fruits (can dump them into a cake, etc), let the drink age
at least 3-6 months.