From: "Konstantin V. Gavrilenko" <k.gavrilenko@arhont.com>
To: Linux fs Btrfs <linux-btrfs@vger.kernel.org>
Cc: Peter Grandi <pg@btrfs.list.sabi.co.UK>
Subject: Re: Btrfs + compression = slow performance and high cpu usage
Date: Thu, 31 Aug 2017 11:56:47 +0100 (BST)
Message-ID: <7655278.170.1504177003160.JavaMail.gkos@dynomob>
In-Reply-To: <22912.57311.417226.447973@tree.ty.sabi.co.uk>
Hello again, list. I thought I would clear things up and describe what
is happening with my troubled RAID setup.
Having received the help from the list, I initially ran a full
defragmentation of all the data and recompressed everything with zlib.
That didn't help. Then I ran a full rebalance of the data, and that
didn't help either.
So I had to take a disk out of the RAID, copy all the data onto it,
recreate the RAID volume with a 32 KB strip size and a 96 KB stripe,
and copy the data back. Then I added the disk back and resynced the RAID.
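For reference, the recompression and rebalance steps were roughly the
following (a sketch; the exact options may have differed slightly):

# recursively defragment and recompress every file in place with zlib
btrfs filesystem defragment -r -czlib /mnt/arh-backup1
# then rewrite all chunks with a full rebalance
btrfs balance start --full-balance /mnt/arh-backup1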
So currently the RAID device is:
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3
Size : 21.830 TB
Sector Size : 512
Is VD emulated : Yes
Parity Size : 7.276 TB
State : Optimal
Strip Size : 32 KB
Number Of Drives : 4
Span Depth : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None
Bad Blocks Exist: No
Is VD Cached: No
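(For reference, the virtual drive information above comes from the
MegaRAID CLI; the invocation was something like the following, though
the exact binary name differs between installs: MegaCli, MegaCli64 or
storcli.)

# dump the virtual drive configuration for all adapters
MegaCli -LDInfo -Lall -aALL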
The filesystem is about 40% full with compressed data:
# btrfs fi usage /mnt/arh-backup1/
Overall:
    Device size:                  21.83TiB
    Device allocated:              8.98TiB
    Device unallocated:           12.85TiB
    Device missing:                  0.00B
    Used:                          8.98TiB
    Free (estimated):             12.85TiB    (min: 6.43TiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB    (used: 0.00B)
I decided to run a set of tests where a 5 GB file was written using
different block sizes and different dd flags.
One source file was generated from urandom data and another one was
filled with zeroes. The data was written both with and without
compression, and it seems that without compression it is possible to
gain 30-40% in speed, while the CPU was still about 50% idle even
during the highest loads.
dd write speeds (MB/s)

flags: conv=fsync
           compress-force=zlib    compress-force=none
           RAND    ZERO           RAND    ZERO
bs=1024k    387     407            584     577
bs=512k     389     414            532     547
bs=256k     412     409            558     585
bs=128k     412     403            572     583
bs=64k      409     419            563     574
bs=32k      407     404            569     572

flags: oflag=sync
           compress-force=zlib    compress-force=none
           RAND    ZERO           RAND    ZERO
bs=1024k   86.1    97.0            203     210
bs=512k    50.6    64.4           85.0     170
bs=256k    25.0    29.8           67.6    67.5
bs=128k    13.2    16.4           48.4    49.8
bs=64k      7.4     8.3           24.5    27.9
bs=32k      3.8     4.1           14.0    13.7

flags: no flags
           compress-force=zlib    compress-force=none
           RAND    ZERO           RAND    ZERO
bs=1024k    480     419            681     595
bs=512k     422     412            633     585
bs=256k     413     384            707     712
bs=128k     414     387            695     704
bs=64k      482     467            622     587
bs=32k      416     412            610     598
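Each cell above was produced roughly like this (a sketch with
hypothetical file names; the real runs iterated over the block sizes
and flag combinations listed above):

# generate the two 5 GB source files once
dd if=/dev/urandom of=/root/rand.src bs=1M count=5120
dd if=/dev/zero    of=/root/zero.src bs=1M count=5120

# switch compression on or off between runs
mount -o remount,compress-force=zlib /mnt/arh-backup1
mount -o remount,compress=no         /mnt/arh-backup1

# example runs: 1 MB blocks with one fsync at the end,
# or 128 KB blocks with a sync after every block
dd if=/root/rand.src of=/mnt/arh-backup1/test.out bs=1024k conv=fsync
dd if=/root/zero.src of=/mnt/arh-backup1/test.out bs=128k  oflag=sync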
I have also run a test where I filled the array to about 97% capacity,
and the write speed went down by about 50% compared with the empty
array.
Thanks for the help.
----- Original Message -----
From: "Peter Grandi" <pg@btrfs.list.sabi.co.UK>
To: "Linux fs Btrfs" <linux-btrfs@vger.kernel.org>
Sent: Tuesday, 1 August, 2017 10:09:03 PM
Subject: Re: Btrfs + compression = slow performance and high cpu usage
>> [ ... ] a "RAID5 with 128KiB writes and a 768KiB stripe
>> size". [ ... ] several back-to-back 128KiB writes [ ... ] get
>> merged by the 3ware firmware only if it has a persistent
>> cache, and maybe your 3ware does not have one,
> KOS: No, I don't have a persistent cache, only the 512 MB cache
> on board the controller, which is battery-backed (BBU).
A battery-backed cache is a persistent cache (as I wrote, but it
seems that you don't have much time to read replies), so the size of
the write, 128KiB or not, should not matter much: the write will be
reported complete when it hits the persistent cache (whichever
technology it uses), and then the HA firmware will spill the
write-cached data to the disks using the optimal operation width.
Unless the 3ware firmware is really terrible (and depending on model
and vintage it can be amazingly terrible), or the battery is no
longer recharging, in which case the host adapter switches to
write-through.
That you see very different rates between uncompressed and
compressed writes, where the main difference is the limit on the
segment size, seems to indicate that compressed writes involve a lot
of RMW, that is, sub-stripe updates. As I mentioned already, it would
be interesting to retry 'dd' with different 'bs' values without
compression and with 'sync' (or 'direct', which only makes sense
without compression).
> If I had additional SSD caching on the controller I would have
> mentioned it.
So far you had not mentioned the presence of a BBU-backed cache
either, which is equivalent, even though one of your previous
messages (which I try to read carefully) contained these lines:
>>>> Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
>>>> Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
So perhaps someone else would have long ago checked the status of the
BBU and whether the "No Write Cache if Bad BBU" case has been
triggered. If the BBU is still working and the policy is still
"WriteBack", then things are stranger still.
> I was also under the impression that, in a situation where
> mostly extra-large files will be stored on the array, a bigger
> strip size would indeed increase the speed, thus I went with
> the 256 KB strip size.
That runs counter to this simple story: suppose a program is
doing 64KiB IO:
* For *reads*, if there are 4 data drives and the strip size is
16KiB, the 64KiB will be read in parallel from 4 drives. If the
strip size is 256KiB, then the 64KiB will be read sequentially
from just one disk, and 4 successive reads will also be read
sequentially from that same drive.
* For *writes* on a parity RAID like RAID5 things are much, much
more extreme: with 16KiB strips on a 5-wide RAID5 set, the 64KiB
covers a whole data stripe, so it will be written in parallel to
all 5 drives with no RMW. But with 256KiB strips it will
partially update the stripe, because the stripe is 1024+256KiB,
so it needs to do RMW, and four successive 64KiB writes will
each need to do that too, even though only one data strip is
being updated. Usually for RAID5 there is an optimization that
means only the specific target drive and the parity drive(s)
need RMW, but it is still very expensive.
This is the "storage for beginners" version, what happens in
practice however depends a lot on specific workload profile
(typical read/write size and latencies and rates), caching and
queueing algorithms in both Linux and the HA firmware.
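To put rough numbers on the example above (simple shell arithmetic,
assuming writes aligned to the start of a stripe):

# data stripe width = data drives * strip size, for a 5-wide (4+1) RAID5 set
echo $((4 * 16))    # 64 KiB stripe: a 64KiB write fills a whole stripe, no RMW
echo $((4 * 256))   # 1024 KiB stripe: a 64KiB write covers part of one strip, so RMW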
> Would I be correct in assuming that a RAID strip size of 128
> KB would be a better choice if one plans to use Btrfs with
> compression?
That would need to be tested, because of "depends a lot on the
specific workload profile, caching and queueing algorithms", but
my expectation is that the lower the better. Given that you have
4 drives giving a 3+1 RAID5 set, perhaps a 32KiB or 64KiB strip
size, giving a data stripe size of 96KiB or 192KiB, would be
better.
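In rough numbers for the 3+1 set (shell arithmetic; 128KiB is, as far
as I know, the upper bound on a Btrfs compressed extent, so compressed
writes never emit anything larger than that):

# data stripe width = data drives * strip size, for the 3+1 set
echo $((3 * 32))    # 96 KiB stripe: a 128KiB compressed write spans just over one stripe
echo $((3 * 64))    # 192 KiB stripe: a 128KiB compressed write still fills most of a stripe
echo $((3 * 256))   # 768 KiB stripe: the 256KiB-strip geometry above; every compressed write is a small sub-stripe RMW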