Re: Raid10 performance issues during copy and balance, only half the spindles used for reading data


From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Raid10 performance issues during copy and balance, only half the spindles used for reading data
Date: Sun, 15 Mar 2015 03:35:34 +0000 (UTC)
Message-ID: <pan$eb9d8$64fc2106$58d9b21f$dfb3c40f@cox.net>
In-Reply-To: 5504E0A3.3040908@gmail.com

Sven Witterstein posted on Sun, 15 Mar 2015 02:30:11 +0100 as excerpted:

> Probably an option-parameter in analogy to (single-spindle pre-ssd ideas
> for the I/O scheduler) like
> 
> elevator=cfq (for btrfs="try to balance reads [...]
> 
> elevator=noop (assign by even/odd, current behavior (testing)
> 
> elevator=jumpy (every rand x secs switch stripeset [...]
> 
> would bring room to experiment in the years till 2020 as you outlined
> and to review,

The problem is, btrfs is what I've seen referred to as a target-rich 
environment, way more stuff to do than time to do it... at least 
reasonably correctly, anyway.

This in fact might be what eventually happens, and OK, 2020 is very
possibly pessimistic.  But if there's no time to code it because other
things are taking priority, or it's simply complex enough that it'll
take several hundred man-hours to code, and predictably a good portion
of that again to review it, commit it, chase down all the immediate
bugs, and get them fixed, then it simply isn't going to get done any
time soon.

Which is exactly the problem with your proposal.  It's on the list, but
so are several hundred other things.  Well, that, and the
programmer's adage about premature optimization.  But it's true: why
spend several hundred hours optimizing this, and then have to throw the 
work away because when you go to add N-way-mirroring you discover some 
unforeseen angle makes your optimization a pessimisation now, or worse, 
discover you've cut off an avenue of better optimization that now won't 
be done because it's not worth spending that several hundred hours of 
development again.

Which is the beauty of the simplicity of the even/odd scheme.  It's so 
dead simple it's both easily demonstrated workable and hard to get wrong 
in terms of bugs, even if it's clearly not production-suitable.
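To illustrate why a parity-only choice leaves half the spindles idle for
a single copy job, here's a minimal sketch.  It is NOT the kernel code;
it assumes the mirror is picked from some per-reader integer such as a
process ID (my assumption for illustration), and the names pick_mirror,
reads_per_mirror and reader_id are hypothetical.

```python
def pick_mirror(reader_id: int, num_copies: int = 2) -> int:
    """Choose a copy purely from the parity of reader_id.

    reader_id is a hypothetical per-reader integer (e.g. a process ID);
    that keying is an assumption made for this illustration.
    """
    return reader_id % num_copies


def reads_per_mirror(reader_ids, num_copies: int = 2):
    """Count how many reads land on each copy for a sequence of readers."""
    counts = [0] * num_copies
    for rid in reader_ids:
        counts[pick_mirror(rid, num_copies)] += 1
    return counts


# One reader issuing every request: all reads hit the same copy.
single_reader = reads_per_mirror([4242] * 1000)

# Many distinct readers: the load spreads evenly across both copies.
many_readers = reads_per_mirror(range(1000))
```

With one reader doing all the I/O, single_reader comes out as [1000, 0],
while many_readers comes out as [500, 500].  Nothing in the scheme looks
at device load, which is what makes it trivially correct and also what
makes it unsuitable for production.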

Meanwhile, as I've said, other than raw breakage bugs this is one of the
clearest demonstrations that btrfs really is /not/ a mature filesystem.
The removal of all the dire warnings about it potentially eating your
baby (data, that is) may lead some to the conclusion that it's
mature/stable/ready-for-production-use, but as you said, this read
scheduling is clearly not production-suitable; it's clearly
test-suitable.

> Interesting enough, all my other btrfses are single-SSD for operating
> system with auto-snap to be able to revert...
> and one is a 2-disk raid 0 for throw away data, so I never had a setup
> that would expose this behaviour...

I do hope you're reasonably thinning down those snapshots over time.  
Btrfs has a scalability issue when it comes to too many snapshots, and 
while they're instant to create as it's simply saving a bit of extra 
metadata, they're **NOT** instant to delete or to otherwise work with, 
once you get several hundred of them going.

Fortunately, it's easy enough to cut back a bit on the creation if
necessary, so there's time to delete them too, and then to thin down to
under say 300 per subvolume (and that's with original snapshotting at
say half-hour intervals or even more frequently!).  Say keep six hours
of half-hourly snapshots, then thin to hourly.  Keep the remainder of
24 hours (18 hours) at hourly, then thin to say six-hourly... and so
on.  It really is reasonably easy to keep it well under three hundred
snapshots per subvolume, even when snapshotting half-hourly to begin
with.
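As a rough sketch of that thinning schedule (an illustration only, not
any existing tool): the first two tiers below are the ones described
above, while every tier after them, the one-year cutoff, and the names
TIERS and thin are assumptions chosen just to make the example
complete.  It only decides which timestamps to keep; actually deleting
the snapshots is left out.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# (maximum age, spacing between kept snapshots), youngest tier first.
TIERS = [
    (timedelta(hours=6), timedelta(minutes=30)),   # from the text above
    (timedelta(hours=24), timedelta(hours=1)),     # from the text above
    (timedelta(days=7), timedelta(hours=6)),       # cutoff is an assumption
    (timedelta(days=90), timedelta(days=1)),       # assumption
    (timedelta(days=365), timedelta(weeks=1)),     # assumption
]


def thin(snapshots, now, tiers=TIERS):
    """Split snapshot timestamps into (keep, delete) lists.

    Within each tier, the oldest snapshot in every spacing-sized time
    bucket is kept.  Anything older than the last tier is marked for
    deletion (an assumption; adjust to taste).
    """
    keep, delete, seen = [], [], set()
    for ts in sorted(snapshots):
        age = now - ts
        spacing = next((s for max_age, s in tiers if age <= max_age), None)
        if spacing is None:
            delete.append(ts)
            continue
        bucket = (spacing, (ts - EPOCH) // spacing)
        if bucket in seen:
            delete.append(ts)
        else:
            seen.add(bucket)
            keep.append(ts)
    return keep, delete


# A full year of half-hourly snapshots, thinned in one pass.
now = datetime(2015, 3, 15, 12, 0)
year_of_snapshots = [now - timedelta(minutes=30 * i) for i in range(365 * 48)]
keep, delete = thin(year_of_snapshots, now)
```

With these example tiers, a full year of half-hourly snapshots thins
down to fewer than 300 kept per subvolume.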

Also fortunately, should you really have to go back a full year, in 
practice, you're not normally going to care much about the individual 
hour and often not even the individual day.  Often, simply getting a 
snapshot from the correct week, or correct quarter, is enough, and it's a 
LOT easier to pick out when you've been doing proper thinning.

And if you're snapshotting multiple subvolumes per filesystem, try to
keep total snapshots to a couple thousand or so if at all possible, and
if you can get away with under a thousand total, do it.  Because once
you get into the thousands of snapshots, there's report after report of
people complaining about how poorly btrfs scales when trying to do any
filesystem maintenance at all, even on SSD.

Which is actually one of the things the devs have been spending major
time on.  Scaling isn't good yet, but it's MUCH better than it was,
which was basically unworkable at times.  Of course that's why
snapshot-aware defrag is disabled ATM as well -- it was simply
unworkable, and the thought was, better to let defrag work on the
current copy and going forward, even if it breaks references and forces
duplication of the defragged blocks, than to not have it working at
all.

And FWIW, quotas are another scaling issue.  But they've always been
buggy and not worked entirely correctly anyway, and as such, the
recommendation has always been to disable them on btrfs unless you
really need them, and if you really need them, better to use a more
mature filesystem where they work reliably, because you simply can't
count on quotas actually working on btrfs.  Again, there has been major
work invested here and it's getting better, but there are still corner
cases due to subvolume deletion where the quota math simply doesn't
work.  So while quotas are a scaling issue, it's not a major one: since
quotas have to date never worked correctly anyway, few actually use
them.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
