* Re: Raid10 performance issues during copy and balance, only half the spindles used for reading data
From: Sven Witterstein @ 2015-03-15  1:30 UTC
  To: linux-btrfs

Hi Duncan,

Thank you for that explanation.

> The existing algorithm is a very simple even/odd PID-based algorithm.
> Thus, single-thread testing will indeed always read from the same side of
> the pair-mirror (since btrfs raid1 and raid10 are pair-mirrored, no N-way-
> mirroring available yet, tho it's the next new feature on the raid roadmap
> now that raid56 is essentially code-complete altho not yet well bug-
> flushed, with 3.19).  With a reasonably balanced mix of even/odd-PID
> readers, however, you should indeed get reasonably balanced read activity.

OK, that is really not a good design for production: the evenness or
oddness of a PID is unrelated to which PIDs will actually request I/O
from the pool at a given time, so it is a poor criterion for spreading
reads.
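
Just to illustrate the point, here is a minimal sketch of what such a
parity-based choice amounts to (my own simplified model, not the actual
btrfs source):

/* Toy model of even/odd PID mirror selection - a simplification,
 * not the actual btrfs code. */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Pick one of the pair-mirror copies from the caller's PID parity. */
static int pick_mirror(pid_t pid, int num_copies)
{
    return pid % num_copies;    /* num_copies == 2 for raid1/raid10 */
}

int main(void)
{
    pid_t pid = getpid();
    /* A single-threaded reader keeps the same parity for its whole
     * lifetime, so every one of its reads hits the same copy. */
    printf("PID %d always reads copy %d\n", (int)pid, pick_mirror(pid, 2));
    return 0;
}

With cp or rsync the PID never changes, so the other copy sits idle no
matter how much data is read.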

In raid10 (and the N-way mirroring to come), all read requests from all
PIDs should probably be queued and spread across the available number of
redundant stripesets (or across plain mirrors, if a "striped mirrors"
1+0 layout à la ZFS is used or becomes possible). The same would apply
to a pure-SSD pool, although the fancy "near/far/seek time"
considerations become obsolete there. Something like that...

Perhaps an option parameter, analogous to the (single-spindle, pre-SSD)
I/O scheduler choices, like:

elevator=cfq
(for btrfs: "try to balance reads between devices via a common read
queue", i.e. max out all resources and distribute them fairly to the
requesting apps; optimal for large reads, and for several of them in
parallel, such as a balance plus send/receive or copy/tar from the pool
to an external backup at the same time)

elevator=noop
(assign by even/odd PID; the current behaviour, fine for testing)

elevator=jumpy
(e.g. assign each read to the stripeset that currently has the smallest
number of other reads on it, and every random x seconds switch
stripesets if the number of "customers" on the other 1..N redundant
stripesets has decreased - similar to how the kernel migrates a
long-running process between cores; optimized for smaller r/w
operations, such as many users accessing a central server; sketched
below)

etc.
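
A rough sketch of what that "jumpy" policy could look like - all names
(inflight, pick_copy, read_done) are hypothetical, this is not a real
btrfs interface:

#include <stdatomic.h>
#include <stdio.h>

#define NUM_COPIES 2    /* pair-mirrored btrfs raid1/raid10 */

static atomic_int inflight[NUM_COPIES];

/* Route the next read to the copy with the fewest reads in flight. */
static int pick_copy(void)
{
    int best = 0;
    for (int i = 1; i < NUM_COPIES; i++)
        if (atomic_load(&inflight[i]) < atomic_load(&inflight[best]))
            best = i;
    atomic_fetch_add(&inflight[best], 1);
    return best;
}

/* Call when a read completes on the chosen copy. */
static void read_done(int copy)
{
    atomic_fetch_sub(&inflight[copy], 1);
}

int main(void)
{
    /* Toy demo: two reads spread over both copies, then a third read
     * returns to whichever copy finished first. */
    int a = pick_copy(), b = pick_copy();
    printf("reads on copies %d and %d\n", a, b);
    read_done(a);
    printf("next read goes to copy %d\n", pick_copy());
    return 0;
}

The periodic "switch every x seconds" part is left out for brevity; the
core idea is simply to route each read to the least-busy copy.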

That would leave room to experiment in the years until 2020, as you
outlined, and to review whether mdadm's raid10 near/far/offset layouts
should still be considered when most future storage will be
non-rotational... Some kind of self-optimization should also be
included: if the filesystem knew how much is to be read, it could tell
whether it made sense to try the different methods and pick the fastest,
much like gparted's block-size adaptation (see the sketch below).
Again, it would be interesting whether the impact on non-rotational
storage is insignificant.
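
The self-tuning idea could be as simple as sampling each policy for a
short window and keeping the winner. A sketch with invented names and
made-up demo numbers - purely illustrative, nothing measured:

#include <stdio.h>

enum policy { PID_PARITY, LEAST_LOADED, SWITCHING, NUM_POLICIES };

static const char *names[NUM_POLICIES] =
    { "pid-parity", "least-loaded", "switching" };

/* Stub: a real implementation would issue reads under policy p for a
 * short window and report the observed throughput in MB/s. These
 * numbers are invented for the demo. */
static double measure(enum policy p)
{
    static const double demo[NUM_POLICIES] = { 180.0, 310.0, 295.0 };
    return demo[p];
}

int main(void)
{
    enum policy best = PID_PARITY;
    for (enum policy p = PID_PARITY; p < NUM_POLICIES; p++)
        if (measure(p) > measure(best))
            best = p;
    printf("auto-tuned read policy: %s\n", names[best]);
    return 0;
}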

In my use case it's a simple rsync, cp -a, or nemo/nautilus copy
between the zfs and btrfs pools. Those are single-threaded, I guess,
and I understand that btrfs has nothing like the swarm of z_read
processes that probably accounts for the "flying" ZFS reads on a 6-disk
raidz2, 3x2, or 2x3 vdev layout, compared to btrfs reads.


I still find it strange that a balance also uses only half the
spindles, but it is explainable if the same logic is used as for any
other read from the array. At least scrub reads all the data and not
only one copy ;-)

Interestingly enough, all my other btrfs filesystems are single-SSD
operating-system disks with auto-snapshots so I can revert... and one is
a 2-disk raid0 for throwaway data, so I never had a setup that would
expose this behaviour...

Goodbye,

Sven.


* Raid10 performance issues during copy and balance, only half the spindles used for reading data
From: Sven Witterstein @ 2015-03-09 23:45 UTC
  To: linux-btrfs

Hello,

I have used btrfs and zfs for some years and feel pretty confident
administering them - both have saved me quite often with their
snapshots and subvolumes.

I had to grow my 4x250GB raid10 backup array to a 6x500GB raid10
backup array - the slower halves of four 1TB 2.5" Spinpoint M8s were to
be extended with the slowest quarters of two 2TB 2.5" Spinpoint M9Ts.

During a balance or a copy, the second image of the stripeset
A + B | A' + B' is never used, throwing away roughly 40% of the
possible performance: it NEVER read from A' + B', even though 50% of
the needed data could have been assembled from there..., so two disks
were maxed out while the others ran at about 40% of their I/O capacity.

Also, when rsyncing to an SSD raid0 zpool (just for testing; the SSD
pool is the working pool, the zfs and btrfs disk pools are for backup),
only 3 of the 6 disks are read from.
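
That matches the even/odd PID selection discussed in the reply above: a
6-device raid10 has three pair-mirrors, and a single process always
picks the same side of every pair, i.e. exactly 3 of the 6 devices. A
toy model of that mapping (my own simplification, not the real btrfs
chunk mapper):

#include <stdio.h>

int main(void)
{
    int pid_parity = 12345 % 2;     /* one rsync/cp process, parity 1 */
    /* Model: devices 2*s and 2*s+1 hold the two copies of stripe s. */
    for (int stripe = 0; stripe < 3; stripe++) {
        int dev = 2 * stripe + pid_parity;
        printf("stripe %d read from device %d\n", stripe, dev);
    }
    /* Prints devices 1, 3 and 5 only - the other three spindles idle. */
    return 0;
}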

By contrast, a properly set up mdadm raid10 in "far" or "offset" layout
with xfs, and zfs itself, read from all spindles (devices), and net
data is delivered twice as fast.

I would love to see btrfs try harder to deliver data - I cannot tell
whether this is a missing feature in btrfs raid10 right now or a bug in
the 3.16 kernel line I am using (Mint Rebecca on my workstation).

If anybody knows about this, or I am missing something (-m=raid10
-d=raid10 was OK, I hope, when rebalancing?), I'd like to be
enlightened. (When I googled, it was always stated that btrfs would
read from all spindles, but that's not the case for me...)

Sven.
