From: "Konstantin V. Gavrilenko" <k.gavrilenko@arhont.com>
To: Peter Grandi <pg@btrfs.list.sabi.co.UK>
Cc: Linux fs Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Btrfs + compression = slow performance and high cpu usage
Date: Tue, 1 Aug 2017 10:58:15 +0100 (BST)
Message-ID: <6022267.255.1501581534233.JavaMail.gkos@dynomob>
In-Reply-To: <22911.5971.350346.969146@tree.ty.sabi.co.uk>
Peter, I don't think filefrag is showing the correct fragmentation status of a file when compression is used - at least not the version installed by default on Ubuntu 16.04 (e2fsprogs 1.42.13-1ubuntu1).
So, for example, the reported fragmentation of the compressed file is about 320 times that of the uncompressed one.
root@homenas:/mnt/storage/NEW# filefrag test5g-zeroes
test5g-zeroes: 40903 extents found
root@homenas:/mnt/storage/NEW# filefrag test5g-data
test5g-data: 129 extents found
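(A possible explanation, if I understand the on-disk format correctly: Btrfs caps compressed data extents at 128KiB, so a fully compressed 5GiB file would show up as roughly 5GiB / 128KiB = 40960 extents - close to the 40903 reported above - even if those extents sit next to each other on disk. If it is useful, the per-extent layout can be checked with something like:

root@homenas:/mnt/storage/NEW# filefrag -v test5g-zeroes | head -20

which prints the logical/physical offset and length of each extent, so contiguous-but-split extents can be told apart from genuinely scattered ones.)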
I am currently defragmenting that mountpoint, ensuring that everything is compressed with zlib.
# btrfs fi defragment -rv -czlib /mnt/arh-backup
My guess is that it will take another 24-36 hours to complete, and then I will redo the test to see whether that has helped.
I will keep the list posted.
P.S. Any other suggestions that might help with the fragmentation and data allocation? Should I try rebalancing the data on the drive?
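(If a rebalance is worth trying, what I have in mind is a filtered data balance along these lines - the usage threshold here is only a placeholder of mine, not something suggested in the thread:

# btrfs balance start -dusage=50 /mnt/arh-backup

i.e. only rewrite data chunks that are at most 50% full, which repacks partially used chunks without rewriting the whole filesystem.)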
kos
----- Original Message -----
From: "Peter Grandi" <pg@btrfs.list.sabi.co.UK>
To: "Linux fs Btrfs" <linux-btrfs@vger.kernel.org>
Sent: Monday, 31 July, 2017 1:41:07 PM
Subject: Re: Btrfs + compression = slow performance and high cpu usage
[ ... ]
> grep 'model name' /proc/cpuinfo | sort -u
> model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
Good, contemporary CPU with all accelerations.
> The sda device is a hardware RAID5 consisting of 4x8TB drives.
[ ... ]
> Strip Size : 256 KB
So, with one of the four strips holding parity, the full RMW data stripe length is (4 - 1) x 256KiB = 768KiB.
> [ ... ] don't see the previously reported behaviour of one of
> the kworker consuming 100% of the cputime, but the write speed
> difference between the compression ON vs OFF is pretty large.
That's weird; of course 'lzo' is a lot cheaper than 'zlib', but
in my test the much higher CPU time of the latter was spread
across many CPUs, while in your case it wasn't, even though the
E5645 has 6 cores and can run 12 threads. That seemed to point to
some high cost of finding free blocks - that is, a very fragmented
free list - or something else.
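(One rough way to check whether space is scattered across many partially filled chunks - just a suggestion, and only an indirect hint at free-space fragmentation - is to compare allocated versus used space with:

# btrfs filesystem usage /mnt/storage

where a large gap between the "Device allocated" figure and the per-profile "Used" figures suggests free space spread across many chunks.)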
> dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress oflag=direct
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 26.0685 s, 206 MB/s
The results with 'oflag=direct' are not relevant, because Btrfs
behaves "differently" with that.
> mountflags: (rw,relatime,compress-force=zlib,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 77.4845 s, 69.3 MB/s
> mountflags: (rw,relatime,compress-force=lzo,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 122.321 s, 43.9 MB/s
That's pretty good for a RAID5 with 128KiB writes and a 768KiB
stripe size on a 3ware, and it looks like the HW host adapter
does not have a persistent (usually battery-backed) cache. My
guess is that transfer rates and latencies were not watched with
'iostat -dk -zyx 1' while the tests ran.
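(What I mean is simply something like this in a second terminal while the dd runs - the device name below is a guess, based on the hardware RAID5 being /dev/sda in your description:

# iostat -dk -zyx 1 /dev/sda

and watching w/s, wkB/s, avgrq-sz and await; small average request sizes together with high await would point at RMW on the RAID5 set.)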
> mountflags: (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 10.1033 s, 531 MB/s
I had mentioned the output of 'filefrag' in my previous reply.
That seems relevant to me here, because of RAID5 RMW and the
maximum extent size with Btrfs compression relative to the
strip/stripe size.
Perhaps redoing the tests with a 128KiB 'bs' *without*
compression would be interesting, perhaps even with 'oflag=sync'
instead of 'conv=fsync'.
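Something along these lines, as a sketch - the count is chosen so that the total stays at 5GiB:

# dd if=/dev/sdb of=./testing bs=128K count=40960 oflag=sync status=progress

With 'oflag=sync' every 128KiB write is pushed out synchronously, so the cost of sub-stripe writes on the RAID5 should show up much more clearly than with a single 'conv=fsync' at the end.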
It is hard for me to see a speed issue here with Btrfs: for
comparison I have done a simple test with both a 3+1 MD RAID5
set with a 256KiB chunk size and a single block device, on
"contemporary" 1TB/2TB drives capable of sequential transfer
rates of 150-190MB/s:
soft# grep -A2 sdb3 /proc/mdstat
md127 : active raid5 sde3[4] sdd3[2] sdc3[1] sdb3[0]
729808128 blocks super 1.0 level 5, 256k chunk, algorithm 2 [4/4] [UUUU]
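For reference only, a set like that can be built with something along these lines (device names and array name here are assumptions matching the mdstat output above):

soft# mdadm --create /dev/md/test5 --level=5 --raid-devices=4 --chunk=256 /dev/sd[bcde]3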
with compression:
soft# mount -t btrfs -o commit=10,compress-force=zlib /dev/md/test5 /mnt/test5
soft# mount -t btrfs -o commit=10,compress-force=zlib /dev/sdg3 /mnt/sdg3
soft# rm -f /mnt/test5/testfile /mnt/sdg3/testfile
soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=10000 conv=fsync
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 94.3605 s, 111 MB/s
0.01user 12.59system 1:34.36elapsed 13%CPU (0avgtext+0avgdata 2932maxresident)k
13042144inputs+20482144outputs (3major+345minor)pagefaults 0swaps
soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=10000 conv=fsync
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 93.5885 s, 112 MB/s
0.03user 12.35system 1:33.59elapsed 13%CPU (0avgtext+0avgdata 2940maxresident)k
13042144inputs+20482400outputs (3major+346minor)pagefaults 0swaps
soft# filefrag /mnt/test5/testfile /mnt/sdg3/testfile
/mnt/test5/testfile: 48945 extents found
/mnt/sdg3/testfile: 49029 extents found
soft# btrfs fi df /mnt/test5/ | grep Data
Data, single: total=7.00GiB, used=6.55GiB
soft# btrfs fi df /mnt/sdg3 | grep Data
Data, single: total=7.00GiB, used=6.55GiB
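That is, roughly 10GB (~9.77GiB) of input stored in 6.55GiB on both filesystems, about a 1.5:1 compression ratio on this particular data.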
soft# sysctl vm/drop_caches=3
vm.drop_caches = 3
soft# /usr/bin/time dd iflag=fullblock if=/mnt/test5/testfile bs=1M count=10000 of=/dev/zero
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 23.2975 s, 450 MB/s
0.01user 7.59system 0:23.32elapsed 32%CPU (0avgtext+0avgdata 2932maxresident)k
13759624inputs+0outputs (3major+344minor)pagefaults 0swaps
soft# sysctl vm/drop_caches=3
vm.drop_caches = 3
soft# /usr/bin/time dd iflag=fullblock if=/mnt/sdg3/testfile bs=1M count=10000 of=/dev/zero
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 35.0032 s, 300 MB/s
0.01user 8.46system 0:35.03elapsed 24%CPU (0avgtext+0avgdata 2924maxresident)k
13750568inputs+0outputs (3major+345minor)pagefaults 0swaps
and without compression:
soft# mount -t btrfs -o commit=10 /dev/sdg3 /mnt/sdg3
soft# rm -f /mnt/test5/testfile /mnt/sdg3/testfile
soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=10000 conv=fsync
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 74.7256 s, 140 MB/s
0.02user 13.31system 1:14.72elapsed 17%CPU (0avgtext+0avgdata 2936maxresident)k
13047640inputs+20483808outputs (3major+345minor)pagefaults 0swaps
soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=10000 conv=fsync
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 102.002 s, 103 MB/s
0.02user 14.49system 1:42.00elapsed 14%CPU (0avgtext+0avgdata 2972maxresident)k
13030592inputs+20484032outputs (3major+345minor)pagefaults 0swaps
soft# filefrag /mnt/test5/testfile /mnt/sdg3/testfile
/mnt/test5/testfile: 23 extents found
/mnt/sdg3/testfile: 13 extents found
> The CPU usage is pretty low as well. For example when the
> force-compress=zlib is in effect, the cpu usage is pretty low
> now.
That's 24 threads at around 4-5% CPU each, i.e. roughly 100% of
one CPU's worth of system time spread around, for 70MB/s.
That's quite low. My earlier report, which is mirrored by using
'pigz' at the user level (very similar algorithms), was that
90MB/s took 300% of an FX-6100 CPU at 3.3GHz, and that CPU is
not that much less efficient than a Xeon E5645 at 2.4GHz.
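(The user-level comparison I mean is roughly this - a sketch, with pigz left at its default compression level, which should be in the same ballpark as Btrfs 'zlib' though not necessarily identical:

# dd if=/dev/sda6 bs=1M count=10000 iflag=fullblock | pigz -p 6 | dd of=/dev/null bs=1M

and then setting the MB/s of input consumed against the CPU time shown by 'top' or '/usr/bin/time'.)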
I have redone the test on a faster CPU:
base# grep 'model name' /proc/cpuinfo | sort -u
model name : AMD FX-8370E Eight-Core Processor
base# cpufreq-info | grep 'current CPU frequency'
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
And the result is (from a fast flash Samsung SSD to a fast 2TB
Toshiba drive):
base# mount -t btrfs -o commit=10,compress-force=zlib /dev/sdb6 /mnt/sdb6
base# /usr/bin/time dd iflag=fullblock if=/dev/sde6 of=/mnt/sdb6/testfile bs=1M count=10000 conv=fsync
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 41.7702 s, 251 MB/s
0.00user 11.41system 0:43.41elapsed 26%CPU (0avgtext+0avgdata 3132maxresident)k
20482288inputs+20503368outputs (1major+339minor)pagefaults 0swaps
With CPU usage as:
top - 09:20:38 up 20:48, 4 users, load average: 5.04, 2.03, 2.06
Tasks: 576 total, 10 running, 566 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.0 us, 94.0 sy, 0.7 ni, 0.0 id, 2.0 wa, 0.0 hi, 3.3 si, 0.0 st
%Cpu1 : 0.3 us, 97.0 sy, 0.0 ni, 2.0 id, 0.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 3.0 us, 94.4 sy, 0.0 ni, 1.7 id, 1.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.7 us, 95.4 sy, 0.0 ni, 2.3 id, 1.3 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu4 : 1.0 us, 95.7 sy, 0.3 ni, 2.6 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 0.3 us, 94.7 sy, 0.0 ni, 4.3 id, 0.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 1.7 us, 94.3 sy, 0.0 ni, 3.0 id, 1.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 0.0 us, 97.3 sy, 0.0 ni, 2.3 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16395076 total, 15987476 used, 407600 free, 5206304 buffers
KiB Swap: 0 total, 0 used, 0 free. 8392648 cached Mem
so that is around 7 CPUs busy for 250MB/s, or around 35MB/s per
CPU (more or less what I also get at user level with 'pigz'),
and it is hard for me to imagine the Xeon E5645 being twice as
fast per-CPU for "integer" work, but that's another discussion.