linux-btrfs.vger.kernel.org archive mirror
* Raid10 performance issues during copy and balance, only half the spindles used for reading data
@ 2015-03-09 23:45 Sven Witterstein
  2015-03-10  4:37 ` Duncan
  0 siblings, 1 reply; 4+ messages in thread
From: Sven Witterstein @ 2015-03-09 23:45 UTC (permalink / raw)
  To: linux-btrfs

Hello,

I have used btrfs and zfs for some years and feel pretty confident about 
their administration - and both, with their snapshots and subvolumes, 
have saved me quite often.

I had to grow my 4x250GB raid10 backup array to a 6x500GB raid10 backup 
array: the slower half of four 1TB 2.5" Spinpoint M8s was to be extended 
with the slowest quarter of two 2TB 2.5" Spinpoint M9Ts.

During a balance or copy, the second image of the stripe set (A + B | 
A' + B') is never used, throwing away roughly 40% of the potential 
performance: btrfs NEVER read from A' + B', even though 50% of the 
needed data could have been assembled from there.  So two disks were 
maxed out while the others were writing at about 40% of their I/O 
capacity.

Also, when rsyncing to an SSD raid0 zpool (just for testing; the SSD 
pool is the working pool, the zfs and btrfs disk pools are for backup), 
only 3 of the 6 disks are read from.

By contrast, a properly set up mdadm raid10 in "far" or "offset" layout 
with xfs, and zfs itself, use all spindles (devices) to read from, and 
net data is delivered twice as fast.

I would love to see btrfs try harder to deliver data.  I don't know 
whether this is a missing feature in btrfs raid10 right now or a bug 
in the 3.16 kernel line I am using (Mint Rebecca on my workstation).

If anybody knows about this, or I am missing something (-m=raid10 
-d=raid10 was OK, I hope, when rebalancing?), I'd like to be 
enlightened.  When I googled, it was always stated that btrfs would 
read from all spindles, but that is not the case for me...

Sven.


* Re: Raid10 performance issues during copy and balance, only half the spindles used for reading data
  2015-03-09 23:45 Raid10 performance issues during copy and balance, only half the spindles used for reading data Sven Witterstein
@ 2015-03-10  4:37 ` Duncan
  0 siblings, 0 replies; 4+ messages in thread
From: Duncan @ 2015-03-10  4:37 UTC (permalink / raw)
  To: linux-btrfs

Sven Witterstein posted on Tue, 10 Mar 2015 00:45:23 +0100 as excerpted:

> During a balance or copy, the second image of the stripe set (A + B |
> A' + B') is never used, throwing away roughly 40% of the potential
> performance: btrfs NEVER read from A' + B', even though 50% of the
> needed data could have been assembled from there.  So two disks were
> maxed out while the others were writing at about 40% of their I/O
> capacity.
> 
> Also, when rsyncing to an SSD raid0 zpool (just for testing; the SSD
> pool is the working pool, the zfs and btrfs disk pools are for
> backup), only 3 of the 6 disks are read from.
> 
> By contrast, a properly set up mdadm raid10 in "far" or "offset"
> layout with xfs, and zfs itself, use all spindles (devices) to read
> from, and net data is delivered twice as fast.
> 
> I would love to see btrfs try harder to deliver data.  I don't know
> whether this is a missing feature in btrfs raid10 right now or a bug
> in the 3.16 kernel line I am using (Mint Rebecca on my workstation).
> 
> If anybody knows about this, or I am missing something (-m=raid10
> -d=raid10 was OK, I hope, when rebalancing?), I'd like to be
> enlightened.  When I googled, it was always stated that btrfs would
> read from all spindles, but that is not the case for me...

Known issue, explained below...

The btrfs raid1 read-scheduling algorithm (and thus raid10's, which 
inherits it) remains a rather simplistic one, suitable for btrfs 
development and testing, but not yet optimized.

The existing algorithm is a very simple even/odd PID-based one.  Thus, 
single-threaded testing will indeed always read from the same side of 
the pair-mirror (btrfs raid1 and raid10 are pair-mirrored only; no 
N-way mirroring is available yet, tho it's the next new feature on the 
raid roadmap now that raid56 is, with 3.19, essentially code-complete, 
altho not yet well bug-flushed).  With a reasonably balanced mix of 
even/odd-PID readers, however, you should indeed get reasonably 
balanced read activity.
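
To make that concrete, here's a minimal sketch of the idea in Python 
(purely illustrative -- the real logic lives in fs/btrfs/volumes.c in 
the kernel; the names here are mine, not btrfs's):

  import os

  NUM_COPIES = 2           # btrfs raid1/raid10 always pair-mirrors

  def pick_mirror(pid):
      # The reader's PID parity alone decides which copy is read.
      return pid % NUM_COPIES

  pid = os.getpid()
  print("PID %d reads from mirror %d" % (pid, pick_mirror(pid)))

A single PID always maps to the same mirror, which is exactly why a 
single-threaded balance or copy pins one side of the pair.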

The obvious worst case, of course, is an alternating read/write 
PID-spawning script or some other arrangement such that all the readers 
tend to land on the same side of the even/odd split.

Meanwhile, as stated above, this sort of extremely simplistic algorithm 
is reasonably suited to testing, as it's very easy to force multi-PID 
read scenarios with either good balance or a worst-case stress-test 
where all activity lands on one side or the other.  However, it's 
obviously not production-grade optimization yet, and that remains one 
of the clearest indicators (other than flat-out bugs) that btrfs really 
is /not/ fully stable yet, even for raid types that have been around 
long enough to be effectively as stable as btrfs itself is (unlike the 
raid56 code newly completed in 3.19).
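
For instance, something like this (hypothetical paths, purely 
illustrative) pins a set of readers to one PID parity, forcing the 
worst case on demand:

  import subprocess

  def spawn_reader_with_parity(parity, path):
      # Respawn until the child PID has the wanted parity.
      while True:
          p = subprocess.Popen(
              ["dd", "if=" + path, "of=/dev/null", "bs=1M"])
          if p.pid % 2 == parity:
              return p       # parity matches: keep this reader
          p.kill()           # wrong parity: discard and retry
          p.wait()

  # All four readers get even PIDs, so all hit the same mirror.
  readers = [spawn_reader_with_parity(0, "/mnt/btrfs/testfile")
             for _ in range(4)]
  for r in readers:
      r.wait()

Give half the readers parity 0 and half parity 1 instead, and the read 
load spreads evenly across both copies.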


OK, but when /can/ we expect optimization?

Good question.  With the caveat that I'm only an admin and list regular 
myself, not a dev, and that I've seen no specifics on this particular 
matter: reasonable speculation would put better raid1/10 read 
optimization either as part of N-way mirroring or shortly thereafter.  
N-way mirroring is a definitely planned and long-roadmapped feature 
that was waiting on raid56, since the N-way-mirroring code is planned 
to build on the raid56 code, and arguably any optimization before that 
would be premature optimization of the pair-mirror special case.

So when can N-way-mirroring be expected?

Another good question.  A /very/ good one for me, personally, since 
that's the feature I really /really/ want to see for my own use case.

Given that various btrfs features have repeatedly taken longer to 
implement than planned, and raid56 alone took about three years (its 
introduction slipped from 3.5 or so to 3.9, where it arrived in a 
code-incomplete state: undegraded runtime worked, recovery not so much, 
and only with 3.19 is the code essentially complete, altho I'd consider 
it in bug-testing until 3.21 aka 4.1 at least), I'm really not 
expecting N-way mirroring until maybe this time next year... and even 
that's potentially wildly optimistic, given the three years raid56 
took.

So again, a best-guess for raid1 read-optimization, still keeping in mind 
that I'm simply a btrfs user and list regular myself, and I've not seen 
any specific discussion on the timing here, only the explanation of the 
current algorithm I repeated above...

Some time in 2016... if we're lucky.  I'd frankly be surprised to see it 
this year.  I do expect we'll see it before 2020, and I'd /hope/ by 2018, 
but 2016-2018, 1-3 years out... really is about my best guess, given 
btrfs history.

(FWIW, I've seen people compare zfs to btrfs in terms of feature 
development timing.  ZFS moved faster, Wikipedia says 2001-2006 so half 
a decade, but I believe they had a rather larger dedicated/paid team 
working on it, and it /still/ took them half a decade.  Btrfs has fewer 
dedicated engineers working on it but /does/ have the advantages of 
free and open source, tho AFAIK that shows up mostly in the bug 
testing/reporting and to some extent the fixing department, not so much 
in main feature development.  Person-hour-wise, from the comparison I 
read, the two are reasonably equivalent; btrfs is simply doing it with 
fewer devs, resulting in the work being spread out rather longer.  I 
think some folks are on record as predicting btrfs would take about a 
decade to reach a comparable level, and looking back and forward, 
that's quite a good prediction to have made a decade out on a software 
project, in a field where development happens at internet speed.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Raid10 performance issues during copy and balance, only half the spindles used for reading data
@ 2015-03-15  1:30 Sven Witterstein
  2015-03-15  3:35 ` Duncan
  0 siblings, 1 reply; 4+ messages in thread
From: Sven Witterstein @ 2015-03-15  1:30 UTC (permalink / raw)
  To: linux-btrfs

Hi Duncan,

thank you for the explanation.

> The existing algorithm is a very simple even/odd PID-based one.
> Thus, single-threaded testing will indeed always read from the same
> side of the pair-mirror (btrfs raid1 and raid10 are pair-mirrored
> only; no N-way mirroring is available yet, tho it's the next new
> feature on the raid roadmap now that raid56 is, with 3.19, essentially
> code-complete, altho not yet well bug-flushed).  With a reasonably
> balanced mix of even/odd-PID readers, however, you should indeed get
> reasonably balanced read activity.

OK, that is really not a good design for production: the evenness or 
oddness of PIDs is unrelated to (and thus a poor criterion for 
predicting) which PIDs will request I/O from the pool at a given time.

In raid10 (and the N-way mirroring to come), all read requests from all 
PIDs should probably be queued and spread across the available 
redundant stripesets (or simple mirrors, if a "striped mirrors" 1+0 
layout à la zfs is used or were possible).  The same would apply to a 
pure-SSD pool, though the fancy "near/far/seek-time" considerations 
would be obsolete there.  Something like that...

Perhaps an option parameter, analogous to the (single-spindle, pre-SSD) 
ideas behind the I/O scheduler:

elevator=cfq (for btrfs: "try to balance reads between devices via a 
common read queue", i.e. max out all resources and distribute them 
fairly among the requesting apps; optimal for large reads, and for 
several of them in parallel, such as balance / send/receive / copy / 
tar within the pool and, at the same time, to an external backup...)

elevator=noop (assign by even/odd PID; the current behavior, good for 
testing)

elevator=jumpy (e.g. assign a read to the stripeset that has the 
smallest number of other reads on it, and every rand(x) seconds switch 
stripesets if the number of "customers" on the other 1..N redundant 
stripesets has decreased, similar to core switching of a long-running 
process in the kernel; optimized for smaller r/w operations, such as 
many users accessing a central server; a rough sketch follows below)

etc...

Such options would bring room to experiment in the years till 2020, as 
you outlined, and to review whether mdadm raid10's near/far/offset 
layouts should be considered, given that most future storage will be 
non-rotary...  Some kind of self-optimizing should also be included, 
i.e. if the filesystem knew how much is to be read, it could tell 
whether it made sense to try different methods and pick the fastest, 
like gparted's block-size adaptation...  Again, it would be interesting 
whether the impact on non-rotary storage would be insignificant.
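
Here's that rough sketch of the "jumpy" idea, in Python (purely 
hypothetical; nothing like this exists in btrfs today, and all names 
are invented for illustration):

  NUM_COPIES = 2                   # redundant stripesets
  inflight = [0] * NUM_COPIES      # outstanding reads per copy

  def pick_least_loaded():
      # Choose the copy with the shortest read queue, not PID parity.
      return min(range(NUM_COPIES), key=lambda i: inflight[i])

  def submit_read():
      copy = pick_least_loaded()
      inflight[copy] += 1          # read now queued on that copy
      return copy

  # Even a single-threaded reader now alternates across both copies,
  # keeping all spindles busy instead of pinning one side.
  print([submit_read() for _ in range(6)])   # -> [0, 1, 0, 1, 0, 1]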

In my use case it's a simple rsync, cp -a, or nemo/nautilus copy 
between zfs and btrfs pools.  Those are single-threaded, I guess, and I 
understand that btrfs has nothing like the swarm of z_read processes 
that probably accounts for the "flying" zfs reads on a 6-disk raidz2, 
3x2, or 2x3 vdev layout, compared to btrfs reads.


I still find it strange that a balance also only uses half the 
spindles, but it is explainable if the same logic is used as for any 
other read from the array.  At least scrub reads all the data and not 
only one copy ;-)

Interestingly enough, all my other btrfs filesystems are single-SSD for 
the operating system, with auto-snapshotting to be able to revert... 
and one is a 2-disk raid0 for throw-away data, so I never had a setup 
that would expose this behaviour...

Goodbye,

Sven.



* Re: Raid10 performance issues during copy and balance, only half the spindles used for reading data
  2015-03-15  1:30 Sven Witterstein
@ 2015-03-15  3:35 ` Duncan
  0 siblings, 0 replies; 4+ messages in thread
From: Duncan @ 2015-03-15  3:35 UTC (permalink / raw)
  To: linux-btrfs

Sven Witterstein posted on Sun, 15 Mar 2015 02:30:11 +0100 as excerpted:

> Perhaps an option parameter, analogous to the (single-spindle,
> pre-SSD) ideas behind the I/O scheduler:
> 
> elevator=cfq (for btrfs: "try to balance reads [...]
> 
> elevator=noop (assign by even/odd PID; the current behavior, good for
> testing)
> 
> elevator=jumpy (e.g. assign a read to the stripeset [...]
> 
> Such options would bring room to experiment in the years till 2020,
> as you outlined [...]

The problem is, btrfs is what I've seen referred to as a target-rich 
environment: way more stuff to do than time to do it in... at least to 
do it reasonably correctly, anyway.

This may in fact be what eventually happens, and OK, 2020 is very 
possibly pessimistic.  But other things are taking priority, and it's 
complex enough that it'll take several hundred man-hours to get coded, 
plus predictably a good portion of that again to review it, commit it, 
chase down all the immediate bugs, and get them fixed.  So for now, 
there's simply no time to code it.

Which is exactly the problem with your proposal.  It's on the list, but 
so are several hundred other things...  Well, that, and the 
programmer's adage about premature optimization.  But it's true: why 
spend several hundred hours optimizing this, only to have to throw the 
work away because, when you go to add N-way mirroring, you discover 
some unforeseen angle makes your optimization a pessimization now, or 
worse, discover you've cut off an avenue of better optimization that 
now won't be taken because it's not worth spending those several 
hundred hours of development again?

Which is the beauty of the simplicity of the even/odd scheme.  It's so 
dead simple it's both easily demonstrated workable and hard to get wrong 
in terms of bugs, even if it's clearly not production-suitable.

Meanwhile, as I've said, other than raw breakage bugs this is one of 
the clearest demonstrations that btrfs really is /not/ a mature 
filesystem, despite the removal of all the dire warnings about it 
potentially eating your baby (data, that is), which may lead some to 
the conclusion that it's mature/stable/ready for production use.  As 
you said, this scheme is clearly not production-suitable; it's clearly 
test-suitable.

> Interestingly enough, all my other btrfs filesystems are single-SSD
> for the operating system, with auto-snapshotting to be able to
> revert... and one is a 2-disk raid0 for throw-away data, so I never
> had a setup that would expose this behaviour...

I do hope you're reasonably thinning down those snapshots over time.  
Btrfs has a scalability issue when it comes to too many snapshots, and 
while they're instant to create, since creating one simply saves a bit 
of extra metadata, they're **NOT** instant to delete or to otherwise 
work with once you get several hundred of them going.

Fortunately, it's easy enough to cut back a bit on creation if 
necessary, so there's time to delete them too, and then to thin down to 
say 300 or fewer per subvolume (and that's with original snapshotting 
at, say, half-hour intervals or more frequently!).  Say, keep six hours 
of half-hourly snapshots, then thin to hourly.  Keep the remainder of 
24 hours (18 hours) at hourly, and thin to, say, six-hourly... and so 
on.  It really is reasonably easy to stay well under three hundred 
snapshots per subvolume, even when snapshotting half-hourly originally.
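
A rough sketch of that schedule in Python (purely illustrative; the 
tier boundaries just mirror the example above, and the actual snapshot 
naming and deletion via btrfs subvolume delete are left out):

  TIERS = [                # (max age in minutes, keep interval)
      (6 * 60, 30),        # newest 6 hours: keep every half-hour snap
      (24 * 60, 60),       # remainder of 24 hours: thin to hourly
      (7 * 24 * 60, 360),  # up to a week: thin to six-hourly
  ]

  def keep(age, kept):
      # Keep a snapshot if its tier's interval has elapsed since the
      # last snapshot we kept.
      for max_age, interval in TIERS:
          if age <= max_age:
              return not kept or age - kept[-1] >= interval
      return False         # older than the last tier: delete

  # Example: snapshots every 30 minutes for two days (ages in minutes).
  kept = []
  for age in range(0, 2 * 24 * 60, 30):
      if keep(age, kept):
          kept.append(age)
  print("%d snapshots thinned to %d" % (2 * 24 * 2, len(kept)))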

Also fortunately, should you really have to go back a full year, in 
practice you're not normally going to care much about the individual 
hour, and often not even the individual day.  Often, simply getting a 
snapshot from the correct week, or the correct quarter, is enough, and 
that's a LOT easier to pick out when you've been doing proper thinning.

And if you're snapshotting multiple subvolumes per filesystem, try to 
keep total snapshots to a couple thousand or so if at all possible, and 
if you can get away with under a thousand total, do it.  Because once 
you get into the thousands of snapshots, there are reports upon reports 
of people complaining about how poorly btrfs scales when trying to do 
any filesystem maintenance at all, even on SSD.
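
(To check where you stand, assuming a reasonably current btrfs-progs, 
something like "btrfs subvolume list -s /mountpoint | wc -l" counts 
the snapshots on a filesystem.)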

That scaling is actually one of the things the devs have been spending 
major time on.  It isn't good yet, but it's MUCH better than it was... 
basically unworkable at times.  Of course that's also why 
snapshot-aware defrag is disabled ATM -- it was simply unworkable, and 
the thought was: better to let defrag work on the current copy and 
going forward, even if it breaks references and forces duplication of 
the defragged blocks, than to not have it working at all.

And FWIW, quotas are another scaling issue.  But they've always been 
buggy and have never worked entirely correctly anyway, so the 
recommendation has always been to disable them on btrfs unless you 
really need them, and if you really do need them, to use a more mature 
filesystem where they work reliably, because you simply can't count on 
quotas actually working on btrfs.  Again, major work has been invested 
here and it's getting better, but there are still corner cases around 
subvolume deletion where the quota math simply doesn't add up.  So 
while quotas are a scaling issue, it's not a major one, since quotas 
have to date never worked correctly anyway, so few actually use them.
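
(If you have them enabled and don't need them, turning them off is a 
one-liner: btrfs quota disable /mountpoint.)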

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


