* Raid10 performance issues during copy and balance, only half the spindles used for reading data
From: Sven Witterstein @ 2015-03-09 23:45 UTC (permalink / raw)
To: linux-btrfs
Hello,
I have used btrfs and zfs for some years and feel pretty confident about
administering them - and both, with their snaps and subvols, have saved
me quite often.
I had to grow my 4x250GB raid10 backup array to a 6x500GB raid10 backup
array - the slower half of four 1TB 2.5" Spinpoint M8s was to be
extended with the slowest quarter of two 2TB 2.5" Spinpoint M9Ts.
During balance or copies, the second image of the stripeset A + B | A' +
B' is never used, throwing away about 40% of the potential performance:
btrfs NEVER read from A' + B', even though 50% of the needed data could
have been assembled from there. So two disks were maxed out while the
others were writing at about 40% of their I/O capacity.
Also, when rsyncing to an SSD raid0 zpool (just for testing; the SSD
pool is the working pool, the zfs and btrfs disk pools are for backup),
only 3 of the 6 disks are read from.
By contrast, a properly set up mdadm raid10 in "far" or "offset" layout
with xfs, and zfs itself, use all spindles (devices) for reading, and
net data is delivered twice as fast.
I would love to see btrfs trying harder to deliver data - I don't know
whether this is a missing feature in btrfs raid10 right now or a bug in
the 3.16 kernel line I am using (Mint Rebecca on my workstation).
If anybody knows about it, or I am missing something (-m raid10
-d raid10 was OK, I hope, when rebalancing?), I'd like to be
enlightened. When I googled, it was always stated that btrfs would read
from all spindles, but that's not the case for me...
Sven.
* Re: Raid10 performance issues during copy and balance, only half the spindles used for reading data
From: Duncan @ 2015-03-10 4:37 UTC (permalink / raw)
To: linux-btrfs
Sven Witterstein posted on Tue, 10 Mar 2015 00:45:23 +0100 as excerpted:
> During balance or copies, the second image of the stripeset A + B | A' +
> B' is never used, throwing away about 40% of the potential performance:
> btrfs NEVER read from A' + B', even though 50% of the needed data could
> have been assembled from there. So two disks were maxed out while the
> others were writing at about 40% of their I/O capacity.
>
> Also, when rsyncing to an SSD raid0 zpool (just for testing; the SSD
> pool is the working pool, the zfs and btrfs disk pools are for backup),
> only 3 of the 6 disks are read from.
>
> By contrast, a properly set up mdadm raid10 in "far" or "offset" layout
> with xfs, and zfs itself, use all spindles (devices) for reading, and
> net data is delivered twice as fast.
>
> I would love to see btrfs trying harder to deliver data - I don't know
> whether this is a missing feature in btrfs raid10 right now or a bug in
> the 3.16 kernel line I am using (Mint Rebecca on my workstation).
>
> If anybody knows about it, or I am missing something (-m raid10
> -d raid10 was OK, I hope, when rebalancing?), I'd like to be
> enlightened. When I googled, it was always stated that btrfs would
> read from all spindles, but that's not the case for me...
Known issue, explained below...
The btrfs raid1 (and thus raid10, since it's inherited) read-scheduling
algorithm remains a rather simplistic one, suitable for btrfs development
and testing, but not yet optimized.
The existing algorithm is a very simple even/odd PID-based one:
single-threaded testing will always read from the same side of the
pair-mirror. (Btrfs raid1 and raid10 are pair-mirrored only; no N-way
mirroring is available yet, tho it's the next new feature on the raid
roadmap now that raid56 is essentially code-complete with 3.19, altho
not yet well bug-flushed.) With a reasonably balanced mix of
even/odd-PID readers, however, you should indeed get reasonably
balanced read activity.
The obvious worst case, of course, is an alternating read/write
PID-spawning script or other arrangement such that all the readers tend
to end up on the same side of the even/odd split.
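To make the even/odd scheme concrete, here's a tiny Python sketch of
PID-parity mirror selection. This is a simulation with names of my own
invention (pick_mirror), not the actual kernel code, but it shows why a
single-threaded reader like rsync or cp hammers one copy only:

```python
# Hypothetical model of the even/odd PID-based read scheduling described
# above; pick_mirror is an illustrative name, not a btrfs function.

def pick_mirror(pid: int, num_copies: int = 2) -> int:
    """Select which copy of the pair-mirror serves a read: PID modulo
    the number of copies. A single-threaded reader therefore always
    lands on the same device."""
    return pid % num_copies

# One rsync/cp process (one PID) issues every read to the same mirror...
single_reader = {pick_mirror(4242) for _ in range(1000)}

# ...while a balanced mix of even and odd PIDs spreads reads across both.
mixed_readers = {pick_mirror(pid) for pid in range(4240, 4250)}
```

Under this model, `single_reader` contains only one mirror index, while
`mixed_readers` covers both - matching the observed behaviour of half
the spindles sitting idle during a single-threaded copy.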
Meanwhile, as stated above, this sort of extremely simplistic algorithm
is reasonably suited to testing, as it's very easy to force multi-PID
read scenarios with either good balance, or a worst-case stress test
where all activity comes from one side or the other. However, it's
obviously not production-grade optimization yet. This is one of the
clearest remaining indicators (other than flat-out bugs) that btrfs
really is /not/ fully stable yet, even for raid types that have been
around long enough to be effectively as stable as btrfs itself is
(unlike the raid56 code, newly completed in 3.19).
OK, but when /can/ we expect optimization?
Good question. With the caveat that I'm only an admin and list regular
myself, not a dev, and that I've seen no specifics on this particular
matter: reasonable speculation would put better raid1/10 read
optimization either as part of N-way mirroring, or shortly thereafter.
N-way mirroring is a definitely planned, long-roadmapped feature that
was waiting on raid56, since its code is planned to build on the raid56
code - and arguably, optimizing before that would be premature
optimization of the pair-mirror special case.
So when can N-way-mirroring be expected?
Another good question. A /very/ good one for me, personally, since
that's the feature I really /really/ want to see for my own use case.
Given that various btrfs features have repeatedly taken longer to
implement than planned - raid56 alone took about three years (its
introduction was delayed from 3.5 or so to 3.9, where it arrived in a
code-incomplete state: undegraded runtime worked, recovery not so much,
and only with 3.19 is the code essentially complete, altho I'd consider
it in bug-testing until 3.21 aka 4.1 at least) - I'm really not
expecting N-way mirroring until maybe this time next year... and even
that's potentially wildly optimistic, given the three years raid56 took.
So again, a best-guess for raid1 read-optimization, still keeping in mind
that I'm simply a btrfs user and list regular myself, and I've not seen
any specific discussion on the timing here, only the explanation of the
current algorithm I repeated above...
Some time in 2016... if we're lucky. I'd frankly be surprised to see it
this year. I do expect we'll see it before 2020, and I'd /hope/ by 2018,
but 2016-2018, 1-3 years out... really is about my best guess, given
btrfs history.
(FWIW, I've seen people compare zfs to btrfs in terms of feature
development timing. ZFS moved faster, wikipedia says 2001-2006 so half a
decade, but I believe they had a rather larger dedicated/paid team
working on it, and it /still/ took them half a decade. Btrfs has fewer
dedicated engineers working on it but /does/ have the advantages of free
and open source, tho AFAIK that shows up mostly in the bug testing/
reporting and to some extent fixing department, not so much main feature
development. Person-hour-wise, from the comparison I read, it's
reasonably equivalent; btrfs is simply doing it with fewer devs,
resulting in it being spread out rather longer. I think some folks are
on record as predicting btrfs would take about a decade to reach a
comparable level, and looking back and forward, that's quite a good
prediction, a decade out on a software project, where software
development happens at internet speed.)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Raid10 performance issues during copy and balance, only half the spindles used for reading data
From: Sven Witterstein @ 2015-03-15 1:30 UTC (permalink / raw)
To: linux-btrfs
Hi Duncan,
thank you for that explanation.
> The existing algorithm is a very simple even/odd PID-based one:
> single-threaded testing will always read from the same side of the
> pair-mirror. (Btrfs raid1 and raid10 are pair-mirrored only; no N-way
> mirroring is available yet, tho it's the next new feature on the raid
> roadmap now that raid56 is essentially code-complete with 3.19, altho
> not yet well bug-flushed.) With a reasonably balanced mix of
> even/odd-PID readers, however, you should indeed get reasonably
> balanced read activity.
OK, that is really not a good design for production: the evenness or
oddness of PIDs is unrelated to (not a good criterion for predicting)
which PIDs will request I/O from the pool at a given time.
In raid10 (with N-way mirroring to come), probably all read requests
from all PIDs need to be queued and spread across the available
redundant stripesets (or plain mirrors, if the "striped mirrors" 1+0
layout à la zfs is used or were possible).
That would apply to a pure-SSD pool as well, though some fancy
"near/far/seek-time" considerations become obsolete. Something like
that...
Probably an option parameter, in analogy to the (single-spindle,
pre-SSD) ideas behind the I/O scheduler, like:

elevator=cfq - for btrfs: "try to balance reads between devices via a
common read queue", resp. "max out all resources and distribute them
fairly to requesting apps". (Optimal for large reads, including several
in parallel, such as balance/send/receive/copy/tar within the pool and,
at the same time, to an external backup...)

elevator=noop - assign by even/odd PID; the current behavior (testing).

elevator=jumpy - e.g. assign a read to the stripeset that has the
smallest number of other reads on it; every random x seconds, switch
stripesets if the number of "customers" on the other 1..N redundant
stripesets has decreased (similar to core switching for a long-running
process in the kernel). (Optimized for smaller r/w operations, such as
many users accessing a central server.)

etc.
That would bring room to experiment in the years till 2020, as you
outlined, and to review whether mdadm raid10's near/far/offset layouts
should be considered when most future storage will be non-rotary...
Some kind of self-optimizing should also be included: if the filesystem
knew how much is to be read, it could know whether it made sense to try
different methods and find the fastest, such as gparted's block-size
adapting...
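As a rough illustration of the "assign a read to the stripeset with the
fewest outstanding reads" idea: a toy Python model with invented names
(LeastLoadedScheduler) - nothing here is a real btrfs interface, just a
sketch of the policy:

```python
# Toy model of least-loaded mirror selection: route each read to
# whichever redundant stripeset currently has the fewest outstanding
# reads, instead of deciding by PID parity.
from collections import defaultdict

class LeastLoadedScheduler:
    def __init__(self, num_copies: int = 2):
        self.num_copies = num_copies
        self.outstanding = defaultdict(int)  # copy index -> queued reads

    def submit_read(self) -> int:
        """Pick the copy with the smallest queue and account for it."""
        copy = min(range(self.num_copies),
                   key=lambda c: self.outstanding[c])
        self.outstanding[copy] += 1
        return copy

    def complete_read(self, copy: int) -> None:
        self.outstanding[copy] -= 1

sched = LeastLoadedScheduler()
# Even a single-threaded reader now alternates between both copies
# while its earlier reads are still in flight.
picks = [sched.submit_read() for _ in range(4)]
```

With four reads in flight, `picks` alternates 0, 1, 0, 1 - both sides of
the pair-mirror stay busy even with one reading process.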
Again, it would be interesting whether the impact on non-rotary storage
would be insignificant.
In my use case it's a simple rsync or cp -a or nemo/nautilus copy
between zfs and btrfs pools. Those are single-threaded, I guess, and I
understand that in btrfs there is no such swarm of z_read processes,
which probably accounts for the "flying" zfs reads on a 6-disk raidz2 or
3x2 or 2x3 vdev layout compared to btrfs reads.
I still find it strange that a balance also only uses half the spindles,
but it is explainable when the same logic is used as for any other read
from the array. At least the scrub reads all data and not only one
copy ;-)
Interestingly enough, all my other btrfs filesystems are single-SSD for
the operating system with auto-snap to be able to revert... and one is a
2-disk raid0 for throwaway data, so I never had a setup that would
expose this behaviour...
Goodbye,
Sven.
* Re: Raid10 performance issues during copy and balance, only half the spindles used for reading data
From: Duncan @ 2015-03-15 3:35 UTC (permalink / raw)
To: linux-btrfs
Sven Witterstein posted on Sun, 15 Mar 2015 02:30:11 +0100 as excerpted:
> Probably an option parameter, in analogy to the (single-spindle,
> pre-SSD) ideas behind the I/O scheduler, like:
>
> elevator=cfq - for btrfs: "try to balance reads [...]
>
> elevator=noop - assign by even/odd PID; the current behavior (testing).
>
> elevator=jumpy - every random x seconds, switch stripesets [...]
>
> That would bring room to experiment in the years till 2020, as you
> outlined, and to review...
The problem is, btrfs is what I've seen referred to as a target-rich
environment: way more stuff to do than time to do it in... at least
reasonably correctly, anyway.
This in fact might be what eventually happens, and OK, 2020 is very
possibly pessimistic. But if other things are taking priority, or it's
simply complex enough that it'll take several hundred man-hours to get
it coded - and predictably a good portion of that again to review it,
commit it, chase down all the immediate bugs, and get them fixed - then
there's no time to code it.
Which is exactly the problem with your proposal. It's on the list, but
so are several hundred other things... Well, that, and the programmer's
adage about premature optimization. And it's true: why spend several
hundred hours optimizing this, only to have to throw the work away
because, when you go to add N-way mirroring, you discover some
unforeseen angle makes your optimization a pessimization - or worse,
discover you've cut off an avenue of better optimization that now won't
be done, because it's not worth spending those several hundred hours of
development again?
Which is the beauty of the simplicity of the even/odd scheme. It's so
dead simple that it's both easily demonstrated workable and hard to get
wrong in terms of bugs, even if it's clearly not production-suitable.
Meanwhile, as I've said, other than raw breakage bugs, this is one of
the clearest demonstrations that btrfs really is /not/ a mature
filesystem, despite the removal of all the dire warnings about it
potentially eating your baby (data, that is) - a removal possibly
leading some to the conclusion that it's mature/stable/ready for
production use. Because, as you said, this is clearly not
production-suitable; it's clearly test-suitable.
> Interestingly enough, all my other btrfs filesystems are single-SSD
> for the operating system with auto-snap to be able to revert... and
> one is a 2-disk raid0 for throwaway data, so I never had a setup that
> would expose this behaviour...
I do hope you're reasonably thinning down those snapshots over time.
Btrfs has a scalability issue when it comes to too many snapshots, and
while they're instant to create, as creating one simply saves a bit of
extra metadata, they're **NOT** instant to delete or otherwise work
with once you get several hundred of them going.
Fortunately, it's easy enough to cut back a bit on creation if
necessary, making time to delete them too, and then to thin down to say
300 or fewer per subvolume (and that's with original snapshotting at
half-hour intervals or more frequently!). Say, keep six hours of
half-hourly snapshots, then thin to hourly. Keep the remainder of 24
hours (18 hours) at hourly, and thin to, say, six-hourly... and so on.
It really is reasonably easy to stay well under three hundred snapshots
per subvolume, even with half-hourly snapshotting originally.
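The thinning schedule above can be sketched as a retention predicate
(keep_snapshot is an illustrative helper, not an existing btrfs tool;
the cutoffs are the ones from the text):

```python
# Thinning schedule: keep every half-hourly snapshot for six hours,
# hourly for the rest of the first day, six-hourly beyond that.

def keep_snapshot(age_hours: float) -> bool:
    if age_hours <= 6:
        return True                # keep every half-hourly snapshot
    if age_hours <= 24:
        return age_hours % 1 == 0  # thin to hourly
    return age_hours % 6 == 0      # thin to six-hourly

# Ages of half-hourly snapshots over two days, newest first.
ages = [n * 0.5 for n in range(96)]
kept = [a for a in ages if keep_snapshot(a)]
```

Even with half-hourly creation, two days' worth thins down to a few
dozen survivors, and extending the pattern (daily, weekly, quarterly)
keeps a year of history comfortably under the few-hundred mark.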
Also fortunately, should you really have to go back a full year, in
practice you're not normally going to care much about the individual
hour, and often not even the individual day. Often, simply getting a
snapshot from the correct week, or the correct quarter, is enough, and
that's a LOT easier to pick out when you've been doing proper thinning.
And if you're snapshotting multiple subvolumes per filesystem, try to
keep total snapshots to a couple thousand or so if at all possible, and
if you can get away with under a thousand total, do it. Because once
you get into the thousands of snapshots, there's report after report of
people complaining about how poorly btrfs scales when trying to do any
filesystem maintenance at all, even on SSD.
Which is actually one of the things the devs have been spending major
time on. Scaling isn't good yet, but it's MUCH better than it was...
basically unworkable at times. Of course, that's why snapshot-aware
defrag is disabled ATM as well - it was simply unworkable, and the
thought was: better to let defrag work on the current copy and going
forward, even if it breaks references and forces duplication of the
defragged blocks, than to not have it working at all.
And FWIW, quotas are another scaling issue. But they've always been
buggy, never working entirely correctly anyway, and as such the
recommendation has always been to disable them on btrfs unless you
really need them - and if you really need them, better to use a more
mature filesystem where they work reliably, because you simply can't
count on quotas actually working on btrfs. Again, major work has been
invested here and it's getting better, but there are still corner cases
around subvolume deletion where the quota math simply doesn't work. So
while quotas are a scaling issue, it's not a major one, since quotas
have to date never worked correctly anyway, so few actually use them.