linux-btrfs.vger.kernel.org archive mirror
* Btrfs + compression = slow performance and high cpu usage
       [not found] <33040946.535.1501254718807.JavaMail.gkos@dynomob>
@ 2017-07-28 16:40 ` Konstantin V. Gavrilenko
  2017-07-28 17:48   ` Roman Mamedov
                     ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Konstantin V. Gavrilenko @ 2017-07-28 16:40 UTC (permalink / raw)
  To: linux-btrfs

Hello list, 

I am stuck with a problem of btrfs slow performance when using compression.

when the compress-force=lzo mount flag is enabled, the performance drops to 30-40 MB/s and one of the btrfs processes utilises 100% CPU time.
mount options: btrfs relatime,discard,autodefrag,compress=lzo,compress-force,space_cache=v2,commit=10

The command I am using to test the write throughput is:

# pv -tpreb /dev/sdb | dd of=./testfile bs=1M oflag=direct

# top -d 1 
top - 15:49:13 up  1:52,  2 users,  load average: 5.28, 2.32, 1.39
Tasks: 320 total,   2 running, 318 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  2.0 sy,  0.0 ni, 77.0 id, 21.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  1.0 sy,  0.0 ni, 90.0 id,  9.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  1.0 sy,  0.0 ni, 72.0 id, 27.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,100.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,  1.0 sy,  0.0 ni, 57.0 id, 42.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,  0.0 sy,  0.0 ni, 96.0 id,  4.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.0 us,  0.0 sy,  0.0 ni, 94.0 id,  6.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  0.0 us,  1.0 sy,  0.0 ni, 95.1 id,  3.9 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  1.0 us,  2.0 sy,  0.0 ni, 24.0 id, 73.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  0.0 us,  0.0 sy,  0.0 ni, 81.8 id, 18.2 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  1.0 us,  0.0 sy,  0.0 ni, 98.0 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  2.0 sy,  0.0 ni, 83.3 id, 14.7 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32934136 total, 10137496 free,   602244 used, 22194396 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 30525664 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                               
37017 root      20   0       0      0      0 R 100.0  0.0   0:32.42 kworker/u49:8                                                                                                                         
36732 root      20   0       0      0      0 D   4.0  0.0   0:02.40 btrfs-transacti                                                                                                                       
40105 root      20   0    8388   3040   2000 D   4.0  0.0   0:02.88 dd       


The kworker process that causes the high CPU usage is most likely searching for free space.
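
(An alternative to the sysrq dump below is to sample the busy worker's kernel stack directly, using the PID from the top output above, e.g.:)

# cat /proc/37017/stack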

# echo l > /proc/sysrq-trigger

# dmesg -T
[Fri Jul 28 15:57:51 2017] CPU: 1 PID: 36430 Comm: kworker/u49:2 Not tainted 4.10.0-28-generic #32~16.04.2-Ubuntu
[Fri Jul 28 15:57:51 2017] Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1b       11/16/2012
[Fri Jul 28 15:57:51 2017] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
[Fri Jul 28 15:57:51 2017] task: ffff9ddce6206a40 task.stack: ffffaa9121f6c000
[Fri Jul 28 15:57:51 2017] RIP: 0010:rb_next+0x1e/0x40
[Fri Jul 28 15:57:51 2017] RSP: 0018:ffffaa9121f6fb40 EFLAGS: 00000282
[Fri Jul 28 15:57:51 2017] RAX: ffff9dddc34df1b0 RBX: 0000000000010000 RCX: 0000000000001000
[Fri Jul 28 15:57:51 2017] RDX: ffff9dddc34df708 RSI: ffff9ddccaf470a4 RDI: ffff9dddc34df2d0
[Fri Jul 28 15:57:51 2017] RBP: ffffaa9121f6fb40 R08: 0000000000000001 R09: 0000000000003000
[Fri Jul 28 15:57:51 2017] R10: 0000000000000000 R11: 0000000000020000 R12: ffff9ddccaf47080
[Fri Jul 28 15:57:51 2017] R13: 0000000000001000 R14: ffffaa9121f6fc50 R15: ffff9dddc34df2d0
[Fri Jul 28 15:57:51 2017] FS:  0000000000000000(0000) GS:ffff9ddcefa40000(0000) knlGS:0000000000000000
[Fri Jul 28 15:57:51 2017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Jul 28 15:57:51 2017] Call Trace:
[Fri Jul 28 15:57:51 2017]  btrfs_find_space_for_alloc+0xde/0x270 [btrfs]
[Fri Jul 28 15:57:51 2017]  find_free_extent.isra.68+0x3c6/0x1040 [btrfs]
[Fri Jul 28 15:57:51 2017]  btrfs_reserve_extent+0xab/0x210 [btrfs]
[Fri Jul 28 15:57:51 2017]  submit_compressed_extents+0x154/0x580 [btrfs]
[Fri Jul 28 15:57:51 2017]  ? submit_compressed_extents+0x580/0x580 [btrfs]
[Fri Jul 28 15:57:51 2017]  async_cow_submit+0x82/0x90 [btrfs]
[Fri Jul 28 15:57:51 2017]  btrfs_scrubparity_helper+0x1fe/0x300 [btrfs]
[Fri Jul 28 15:57:51 2017]  btrfs_delalloc_helper+0xe/0x10 [btrfs]
[Fri Jul 28 15:57:51 2017]  process_one_work+0x16b/0x4a0
[Fri Jul 28 15:57:51 2017]  worker_thread+0x4b/0x500
[Fri Jul 28 15:57:51 2017]  kthread+0x109/0x140




When the compression is turned off, I am able to get the maximum 500-600 MB/s write speed on this disk (RAID array) with minimal CPU usage.

mount options: relatime,discard,autodefrag,space_cache=v2,commit=10

# iostat -m 1 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.08    0.00    7.74   10.77    0.00   81.40

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            2376.00         0.00       594.01          0        594


I have tried clearing the space cache (mounting with nospace_cache, then clear_cache) and rebuilding it with space_cache=v2,
but it doesn't make any difference. The same sluggish performance is experienced when I write over NFS.

Any ideas why the compression makes such a big difference and causes a bottleneck?





# uname -a
Linux backup1 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

# btrfs --version
btrfs-progs v4.8.1

# cat /etc/issue
Ubuntu 16.04.2 LTS \n \l

# btrfs fi show
Label: none  uuid: f56bdc4a-239d-4268-81d8-01cdd7a3c1c9
        Total devices 1 FS bytes used 9.32TiB
        devid    2 size 21.83TiB used 9.33TiB path /dev/sda

# btrfs fi df /mnt/arh-backup1/
Data, single: total=9.28TiB, used=9.28TiB
System, single: total=32.00MiB, used=1.00MiB
Metadata, single: total=46.00GiB, used=44.20GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


# btrfs device usage /mnt/arh-backup1/
/dev/sda, ID: 2
   Device size:            21.83TiB
   Device slack:              0.00B
   Data,single:             9.29TiB
   Metadata,single:        46.00GiB
   System,single:          32.00MiB
   Unallocated:            12.49TiB


thanks in advance.
kos





^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs + compression = slow performance and high cpu usage
  2017-07-28 16:40 ` Btrfs + compression = slow performance and high cpu usage Konstantin V. Gavrilenko
@ 2017-07-28 17:48   ` Roman Mamedov
  2017-07-28 18:20     ` William Muriithi
  2017-07-28 18:08   ` Peter Grandi
  2017-07-28 18:44   ` Peter Grandi
  2 siblings, 1 reply; 17+ messages in thread
From: Roman Mamedov @ 2017-07-28 17:48 UTC (permalink / raw)
  To: Konstantin V. Gavrilenko; +Cc: linux-btrfs

On Fri, 28 Jul 2017 17:40:50 +0100 (BST)
"Konstantin V. Gavrilenko" <k.gavrilenko@arhont.com> wrote:

> Hello list, 
> 
> I am stuck with a problem of btrfs slow performance when using compression.
> 
> when the compress-force=lzo mount flag is enabled, the performance drops to 30-40 mb/s and one of the btrfs processes utilises 100% cpu time.
> mount options: btrfs relatime,discard,autodefrag,compress=lzo,compress-force,space_cache=v2,commit=10

It does not work like that, you need to set compress-force=lzo (and remove
compress=).

With your setup I believe you currently use compress-force[=zlib](default),
overriding compress=lzo, since it's later in the options order.
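
For example, something along these lines (assuming a remount is acceptable; the dmesg line just confirms which compressor the kernel actually selected):

# mount -o remount,compress-force=lzo /mnt/arh-backup1
# dmesg | tail -n 3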

Secondly,

> autodefrag

This sure sounded like a good thing to enable? on paper? right?...

The moment you see anything remotely weird about btrfs, this is the first
thing you have to disable and retest without. Oh wait, the first would be
qgroups, this one is second.

Finally, what is the reasoning behind "commit=10", and did you check with the
default value of 30?

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs + compression = slow performance and high cpu usage
  2017-07-28 16:40 ` Btrfs + compression = slow performance and high cpu usage Konstantin V. Gavrilenko
  2017-07-28 17:48   ` Roman Mamedov
@ 2017-07-28 18:08   ` Peter Grandi
  2017-07-30 13:42     ` Konstantin V. Gavrilenko
  2017-07-28 18:44   ` Peter Grandi
  2 siblings, 1 reply; 17+ messages in thread
From: Peter Grandi @ 2017-07-28 18:08 UTC (permalink / raw)
  To: Linux fs Btrfs

> I am stuck with a problem of btrfs slow performance when using
> compression. [ ... ]

That to me looks like an issue with speed, not performance, and
in particular with PEBCAK issues.

As to high CPU usage, when you find a way to do both compression
and checksumming without using much CPU time, please send patches
urgently :-).

In your case the increase in CPU time is bizarre. I have the
Ubuntu 4.4 "lts-xenial" kernel and what you report does not
happen here (with a few little changes):

  soft#  grep 'model name' /proc/cpuinfo | sort -u
  model name      : AMD FX(tm)-6100 Six-Core Processor
  soft#  cpufreq-info | grep 'current CPU frequency'
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).

  soft#  lsscsi | grep 'sd[ae]'
  [0:0:0:0]    disk    ATA      HFS256G32MNB-220 3L00  /dev/sda
  [5:0:0:0]    disk    ATA      ST2000DM001-1CH1 CC44  /dev/sde

  soft#  mkfs.btrfs -f /dev/sde3
  [ ... ]
  soft#  mount -t btrfs -o discard,autodefrag,compress=lzo,compress-force,commit=10 /dev/sde3 /mnt/sde3

  soft#  df /dev/sda6 /mnt/sde3
  Filesystem     1M-blocks  Used Available Use% Mounted on
  /dev/sda6          90048 76046     14003  85% /
  /dev/sde3         237568    19    235501   1% /mnt/sde3

The above is useful context information that was "amazingly"
omitted from your report.

In dmesg I see (note the "force zlib compression"):

  [327730.917285] BTRFS info (device sde3): turning on discard
  [327730.917294] BTRFS info (device sde3): enabling auto defrag
  [327730.917300] BTRFS info (device sde3): setting 8 feature flag
  [327730.917304] BTRFS info (device sde3): force zlib compression
  [327730.917313] BTRFS info (device sde3): disk space caching is enabled
  [327730.917315] BTRFS: has skinny extents
  [327730.917317] BTRFS: flagging fs with big metadata feature
  [327730.920740] BTRFS: creating UUID tree

and the result is:

  soft#  pv -tpreb /dev/sda6 | time dd iflag=fullblock of=/mnt/sde3/testfile bs=1M count=10000 oflag=direct
  10000+0 records in17MB/s] [==>                                ] 11% ETA 0:15:06
  10000+0 records out
  10485760000 bytes (10 GB) copied, 112.845 s, 92.9 MB/s
  0.05user 9.93system 1:53.20elapsed 8%CPU (0avgtext+0avgdata 3016maxresident)k
  120inputs+20496000outputs (1major+346minor)pagefaults 0swaps
  9.77GB 0:01:53 [88.3MB/s] [==>                                ]
  11%

  soft#  btrfs fi df /mnt/sde3/
  Data, single: total=10.01GiB, used=9.77GiB
  System, DUP: total=8.00MiB, used=16.00KiB
  Metadata, DUP: total=1.00GiB, used=11.66MiB
  GlobalReserve, single: total=16.00MiB, used=0.00B

As it was running, system CPU time was under 20% of one CPU:

  top - 18:57:29 up 3 days, 19:27,  4 users,  load average: 5.44, 2.82, 1.45
  Tasks: 325 total,   1 running, 324 sleeping,   0 stopped,   0 zombie
  %Cpu0  :  0.0 us,  2.3 sy,  0.0 ni, 91.3 id,  6.3 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu1  :  0.0 us,  1.3 sy,  0.0 ni, 78.5 id, 20.2 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu2  :  0.3 us,  5.8 sy,  0.0 ni, 81.0 id, 12.5 wa,  0.0 hi,  0.3 si,  0.0 st
  %Cpu3  :  0.3 us,  3.4 sy,  0.0 ni, 91.9 id,  4.4 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu4  :  0.3 us, 10.6 sy,  0.0 ni, 55.4 id, 33.7 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu5  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  KiB Mem:   8120660 total,  5162236 used,  2958424 free,  4440100 buffers
  KiB Swap:        0 total,        0 used,        0 free.   351848 cached Mem

    PID  PPID USER      PR  NI    VIRT    RES    DATA  %CPU %MEM     TIME+ TTY      COMMAND
  21047 21046 root      20   0    8872   2616    1364  12.9  0.0   0:02.31 pts/3    dd iflag=fullblo+
  21045  3535 root      20   0    7928   1948     460  12.3  0.0   0:00.72 pts/3    pv -tpreb /dev/s+
  21019     2 root      20   0       0      0       0   1.3  0.0   0:42.88 ?        [kworker/u16:1]

Of course "oflag=direct" is a rather "optimistic" option in this
context, so I tried again with something more sensible:

  soft#  pv -tpreb /dev/sda6 | time dd iflag=fullblock of=/mnt/sde3/testfile bs=1M count=10000 conv=fsync
  10000+0 records in.4MB/s] [==>                                ] 11% ETA 0:14:41
  10000+0 records out
  10485760000 bytes (10 GB) copied, 110.523 s, 94.9 MB/s
  0.03user 8.94system 1:50.71elapsed 8%CPU (0avgtext+0avgdata 3024maxresident)k
  136inputs+20499648outputs (1major+348minor)pagefaults 0swaps
  9.77GB 0:01:50 [90.3MB/s] [==>                                ] 11%

  soft#  btrfs fi df /mnt/sde3/
  Data, single: total=7.01GiB, used=6.35GiB
  System, DUP: total=8.00MiB, used=16.00KiB
  Metadata, DUP: total=1.00GiB, used=15.81MiB
  GlobalReserve, single: total=16.00MiB, used=0.00B

As it was running, system CPU time was, as expected, around 55-60%
on each of 6 CPUs:

  top - 18:56:03 up 3 days, 19:26,  4 users,  load average: 2.31, 1.39, 0.90
  Tasks: 325 total,   5 running, 320 sleeping,   0 stopped,   0 zombie
  %Cpu0  :  0.0 us, 57.9 sy,  0.0 ni, 28.3 id, 13.8 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu1  :  0.0 us, 46.8 sy,  0.0 ni, 36.9 id, 16.3 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu2  :  0.0 us, 72.8 sy,  0.0 ni, 13.4 id, 12.8 wa,  0.0 hi,  1.0 si,  0.0 st
  %Cpu3  :  0.3 us, 63.8 sy,  0.0 ni, 17.4 id, 17.4 wa,  0.0 hi,  1.0 si,  0.0 st
  %Cpu4  :  0.0 us, 53.3 sy,  0.0 ni, 29.7 id, 17.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu5  :  0.0 us, 54.0 sy,  0.0 ni, 32.7 id, 13.3 wa,  0.0 hi,  0.0 si,  0.0 st
  KiB Mem:   8120660 total,  7988368 used,   132292 free,  3646496 buffers
  KiB Swap:        0 total,        0 used,        0 free.  3967692 cached Mem

    PID  PPID USER      PR  NI    VIRT    RES    DATA  %CPU %MEM     TIME+ TTY      COMMAND
  21022     2 root      20   0       0      0       0  45.2  0.0   0:19.69 ?        [kworker/u16:5]
  21028     2 root      20   0       0      0       0  39.9  0.0   0:27.84 ?        [kworker/u16:11]
  21043     2 root      20   0       0      0       0  39.9  0.0   0:04.26 ?        [kworker/u16:19]
  21009     2 root      20   0       0      0       0  38.2  0.0   0:24.50 ?        [kworker/u16:0]
  21037     2 root      20   0       0      0       0  34.2  0.0   0:17.94 ?        [kworker/u16:17]
  21021     2 root      20   0       0      0       0  19.9  0.0   0:14.83 ?        [kworker/u16:3]
  21019     2 root      20   0       0      0       0  19.3  0.0   0:29.98 ?        [kworker/u16:1]
  21034     2 root      20   0       0      0       0  19.3  0.0   0:28.18 ?        [kworker/u16:14]
  21030     2 root      20   0       0      0       0  17.9  0.0   0:24.85 ?        [kworker/u16:13]
  21035     2 root      20   0       0      0       0  17.6  0.0   0:20.75 ?        [kworker/u16:15]
  21023     2 root      20   0       0      0       0  15.0  0.0   0:28.20 ?        [kworker/u16:6]
  21020     2 root      20   0       0      0       0   9.6  0.0   0:27.02 ?        [kworker/u16:2]
  21040  3535 root      20   0    7928   1872     460   8.6  0.0   0:04.11 pts/3    pv -tpreb /dev/s+
  21042 21041 root      20   0    8872   2628    1364   8.3  0.0   0:05.19 pts/3    dd iflag=fullblo+

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Btrfs + compression = slow performance and high cpu usage
  2017-07-28 17:48   ` Roman Mamedov
@ 2017-07-28 18:20     ` William Muriithi
  2017-07-28 18:37       ` Hugo Mills
  0 siblings, 1 reply; 17+ messages in thread
From: William Muriithi @ 2017-07-28 18:20 UTC (permalink / raw)
  To: Roman Mamedov, Konstantin V. Gavrilenko; +Cc: linux-btrfs@vger.kernel.org

Hi Roman,

> autodefrag

> This sure sounded like a good thing to enable? on paper? right?...
> 
> The moment you see anything remotely weird about btrfs, this is the first thing you have to disable and retest without. Oh wait, the first would be qgroups, this one is second.

What's the problem with autodefrag?  I am also using it, so you caught my attention when you implied that it shouldn't be used.  According to the docs, it seems like one of the very mature features of the filesystem.  See below for the doc I am referring to:

https://btrfs.wiki.kernel.org/index.php/Status

I am using it as I assumed it could prevent the filesystem from becoming too fragmented in the long term, but I never thought there was a price to pay for using it.

Regards,
William


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs + compression = slow performance and high cpu usage
  2017-07-28 18:20     ` William Muriithi
@ 2017-07-28 18:37       ` Hugo Mills
  0 siblings, 0 replies; 17+ messages in thread
From: Hugo Mills @ 2017-07-28 18:37 UTC (permalink / raw)
  To: William Muriithi
  Cc: Roman Mamedov, Konstantin V. Gavrilenko,
	linux-btrfs@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1485 bytes --]

On Fri, Jul 28, 2017 at 06:20:14PM +0000, William Muriithi wrote:
> Hi Roman,
> 
> > autodefrag
> 
> This sure sounded like a good thing to enable? on paper? right?...
> 
> The moment you see anything remotely weird about btrfs, this is the first thing you have to disable and retest without. Oh wait, the first would be qgroups, this one is second.
> 
> What's the problem with autodefrag?  I am also using it, so you caught my attention when you implied that it shouldn't be used.  According to docs, it seem like one of the very mature feature of the filesystem.  See below for the doc I am referring to 
> 
> https://btrfs.wiki.kernel.org/index.php/Status
> 
> I am using it as I assumed it could prevent the filesystem being too fragmented long term, but never thought there was price to pay for using it

   It introduces additional I/O on writes, as it modifies a small area
surrounding any write or cluster of writes.

   I'm not aware of it causing massive slowdowns, in the way that
qgroups do in some situations.

   If your system is already marginal in terms of being able to
support the I/O required, then turning on autodefrag will make things
worse (but you may be heading for _much_ worse performance in the
future as the FS becomes more fragmented -- depending on your write
patterns and use case).
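
   One quick way to check whether autodefrag is a factor on a live
filesystem (a sketch; /mountpoint stands in for the filesystem under
test, and assumes a remount is acceptable):

   # mount -o remount,noautodefrag /mountpoint
   ... rerun the write test ...
   # mount -o remount,autodefrag /mountpoint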

   Hugo.

-- 
Hugo Mills             | Great oxymorons of the world, no. 6:
hugo@... carfax.org.uk | Mature Student
http://carfax.org.uk/  |
PGP: E2AB1DE4          |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs + compression = slow performance and high cpu usage
  2017-07-28 16:40 ` Btrfs + compression = slow performance and high cpu usage Konstantin V. Gavrilenko
  2017-07-28 17:48   ` Roman Mamedov
  2017-07-28 18:08   ` Peter Grandi
@ 2017-07-28 18:44   ` Peter Grandi
  2 siblings, 0 replies; 17+ messages in thread
From: Peter Grandi @ 2017-07-28 18:44 UTC (permalink / raw)
  To: Linux fs Btrfs

In addition to my previous "it does not happen here" comment, if
someone is reading this thread, there are some other interesting
details:

> When the compression is turned off, I am able to get the
> maximum 500-600 mb/s write speed on this disk (raid array)
> with minimal cpu usage.

No details on whether it is a parity RAID or not.

> btrfs device usage /mnt/arh-backup1/
> /dev/sda, ID: 2
>    Device size:            21.83TiB
>    Device slack:              0.00B
>    Data,single:             9.29TiB
>    Metadata,single:        46.00GiB
>    System,single:          32.00MiB
>    Unallocated:            12.49TiB

That's exactly 24TB of "Device size", of which around 45% is
used, and the string "backup" may suggest that the content is
backups, which may indicate very fragmented free space.
Of course compression does not help with that; on my freshly
created Btrfs volume I get, as expected:

  soft#  umount /mnt/sde3
  soft#  mount -t btrfs -o commit=10 /dev/sde3 /mnt/sde3                         

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sde3/testfile bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 103.747 s, 101 MB/s
  0.00user 11.56system 1:44.86elapsed 11%CPU (0avgtext+0avgdata 3072maxresident)k
  20480672inputs+20498272outputs (1major+349minor)pagefaults 0swaps

  soft#  filefrag /mnt/sde3/testfile 
  /mnt/sde3/testfile: 11 extents found

versus:

  soft#  umount /mnt/sde3                                                        
  soft#  mount -t btrfs -o commit=10,compress=lzo,compress-force /dev/sde3 /mnt/sde3

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sde3/testfile bs=1M count=10000 conv=fsync
  10000+0 records in                                                      
  10000+0 records out
  10485760000 bytes (10 GB) copied, 109.051 s, 96.2 MB/s
  0.02user 13.03system 1:49.49elapsed 11%CPU (0avgtext+0avgdata 3068maxresident)k
  20494784inputs+20492320outputs (1major+347minor)pagefaults 0swaps

  soft#  filefrag /mnt/sde3/testfile 
  /mnt/sde3/testfile: 49287 extents found

Most of the latter extents are mercifully rather contiguous;
their size is just limited by the compression code. Here is an
extract from 'filefrag -v' from around the middle:

  24757:  1321888.. 1321919:   11339579..  11339610:     32:   11339594:
  24758:  1321920.. 1321951:   11339597..  11339628:     32:   11339611:
  24759:  1321952.. 1321983:   11339615..  11339646:     32:   11339629:
  24760:  1321984.. 1322015:   11339632..  11339663:     32:   11339647:
  24761:  1322016.. 1322047:   11339649..  11339680:     32:   11339664:
  24762:  1322048.. 1322079:   11339667..  11339698:     32:   11339681:
  24763:  1322080.. 1322111:   11339686..  11339717:     32:   11339699:
  24764:  1322112.. 1322143:   11339703..  11339734:     32:   11339718:
  24765:  1322144.. 1322175:   11339720..  11339751:     32:   11339735:
  24766:  1322176.. 1322207:   11339737..  11339768:     32:   11339752:
  24767:  1322208.. 1322239:   11339754..  11339785:     32:   11339769:
  24768:  1322240.. 1322271:   11339771..  11339802:     32:   11339786:
  24769:  1322272.. 1322303:   11339789..  11339820:     32:   11339803:
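
Each of those extents spans 32 blocks of 4KiB, i.e. 128KiB, which
is the maximum size of a compressed extent in Btrfs; a trivial
sanity check:

  soft#  echo "$((32 * 4096 / 1024))KiB per compressed extent"
  128KiB per compressed extent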

But again this is on a fresh empty Btrfs volume.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs + compression = slow performance and high cpu usage
  2017-07-28 18:08   ` Peter Grandi
@ 2017-07-30 13:42     ` Konstantin V. Gavrilenko
  2017-07-31 11:41       ` Peter Grandi
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin V. Gavrilenko @ 2017-07-30 13:42 UTC (permalink / raw)
  To: Linux fs Btrfs; +Cc: Peter Grandi

Thanks for the comments. Initially the system performed well; I don't have the benchmark details written down, but the compressed vs non-compressed speeds were more or less similar. However, after several weeks of usage the system started experiencing the described slowdowns, so I started investigating the problem. This is indeed a backup drive, but it predominantly contains large files.

# ls -lahR | awk '/^-/ {print $5}' | sort | uniq -c  | sort -n | tail -n 15
      5 322
      5 396
      5 400
      6 1000G
      6 11
      6 200G
      8 24G
      8 48G
     13 500G
     20 8.0G
     25 165G
     32 20G
     57 100G
    103 50G
    201 10G


# grep 'model name' /proc/cpuinfo | sort -u 
model name      : Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz

# lsscsi | grep 'sd[ae]'
[4:2:0:0]    disk    LSI      MR9260-8i        2.13  /dev/sda 


The sda device is a hardware RAID5 consisting of 4x8TB drives.

Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Size                : 21.830 TB
Sector Size         : 512
Is VD emulated      : Yes
Parity Size         : 7.276 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 4
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: No
Number of Spans: 1
Span: 0 - Number of PDs: 4


I have changed the mount flags as suggested, and I no longer see the previously reported behaviour of one of the kworkers consuming 100% of the CPU time, but the write speed difference between compression ON and OFF is pretty large.
I have run several tests with zlib, lzo and no compression, and the results are rather strange.

mountflags: (rw,relatime,compress-force=zlib,space_cache=v2,subvolid=5,subvol=/)
dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress 
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 93.3418 s, 57.5 MB/s

dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress oflag=direct
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 26.0685 s, 206 MB/s

dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 77.4845 s, 69.3 MB/s



mountflags: (rw,relatime,compress-force=lzo,space_cache=v2,subvolid=5,subvol=/)
dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress 
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 116.246 s, 46.2 MB/s


dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress oflag=direct
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 14.704 s, 365 MB/s


dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 122.321 s, 43.9 MB/s



mountflags: (rw,relatime,space_cache=v2,subvolid=5,subvol=/)

dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress 
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 32.2551 s, 166 MB/s

dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress oflag=direct
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 19.9464 s, 269 MB/s

dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 10.1033 s, 531 MB/s


The CPU usage is pretty low as well. For example, this is what it looks like when compress-force=zlib is in effect:

Linux 4.10.0-28-generic (ais-backup1)   30/07/17        _x86_64_        (12 CPU)

14:31:27        CPU     %user     %nice   %system   %iowait    %steal     %idle
14:31:28        all      0.00      0.00      1.50      0.00      0.00     98.50
14:31:29        all      0.00      0.00      4.78      3.52      0.00     91.69
14:31:30        all      0.08      0.00      4.92      3.75      0.00     91.25
14:31:31        all      0.00      0.00      4.76      3.76      0.00     91.49
14:31:32        all      0.00      0.00      4.76      3.76      0.00     91.48
14:31:33        all      0.08      0.00      4.67      3.76      0.00     91.49
14:31:34        all      0.00      0.00      4.76      3.68      0.00     91.56
14:31:35        all      0.08      0.00      4.76      3.76      0.00     91.40
14:31:36        all      0.00      0.00      4.60      3.77      0.00     91.63
14:31:37        all      0.00      0.00      4.68      3.68      0.00     91.64
14:31:38        all      0.08      0.00      4.52      3.76      0.00     91.64
14:31:39        all      0.08      0.00      4.68      3.76      0.00     91.48
14:31:40        all      0.08      0.00      4.52      3.76      0.00     91.64
14:31:41        all      0.00      0.00      4.61      3.77      0.00     91.62
14:31:42        all      0.08      0.00      5.07      3.74      0.00     91.10
14:31:43        all      0.00      0.00      4.68      3.68      0.00     91.64
14:31:44        all      0.00      0.00      4.84      5.09      0.00     90.08
14:31:45        all      0.17      0.00      4.67      4.75      0.00     90.42
14:31:46        all      0.00      0.00      4.60      3.76      0.00     91.64
14:31:47        all      0.08      0.00      5.07      3.66      0.00     91.18
14:31:48        all      0.00      0.00      5.01      3.68      0.00     91.31
14:31:49        all      0.00      0.00      4.76      3.68      0.00     91.56
14:31:50        all      0.08      0.00      4.59      3.59      0.00     91.73
14:31:51        all      0.00      0.00      2.67      1.92      0.00     95.41







^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs + compression = slow performance and high cpu usage
  2017-07-30 13:42     ` Konstantin V. Gavrilenko
@ 2017-07-31 11:41       ` Peter Grandi
  2017-07-31 12:33         ` Peter Grandi
  2017-08-01  9:58         ` Konstantin V. Gavrilenko
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Grandi @ 2017-07-31 11:41 UTC (permalink / raw)
  To: Linux fs Btrfs

[ ... ]

> grep 'model name' /proc/cpuinfo | sort -u 
> model name      : Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz

Good, contemporary CPU with all accelerations.

> The sda device is a hardware RAID5 consisting of 4x8TB drives.
[ ... ]
> Strip Size          : 256 KB

So the full RMW data stripe length is 3 x 256KiB = 768KiB (three
data strips plus one parity strip per stripe).

> [ ... ] don't see the previously reported behaviour of one of
> the kworker consuming 100% of the cputime, but the write speed
> difference between the compression ON vs OFF is pretty large.

That's weird; of course 'lzo' is a lot cheaper than 'zlib', but
in my test the much higher CPU time of the latter was spread
across many CPUs, while in your case it wasn't, even though the
E5645 has 6 cores and can do 12 threads. That seemed to point to
some high cost of finding free blocks, i.e. a very fragmented
free list, or something else.
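
If that single-CPU kworker shows up again, profiling it for a few
seconds would show where the time actually goes (a sketch; assumes
'perf' from linux-tools is installed, and <kworker-pid> is the PID
of the busy worker):

  # perf record -g -p <kworker-pid> -- sleep 10
  # perf report --stdio | head -n 40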

> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress oflag=direct
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 26.0685 s, 206 MB/s

The results with 'oflag=direct' are not relevant, because Btrfs
behaves "differently" with that.

> mountflags: (rw,relatime,compress-force=zlib,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 77.4845 s, 69.3 MB/s
> mountflags: (rw,relatime,compress-force=lzo,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 122.321 s, 43.9 MB/s

That's pretty good for a RAID5 with 128KiB writes and a 768KiB
stripe size, on a 3ware, and it looks like the hw host adapter
does not have a persistent cache (usually battery backed). My
guess is that watching transfer rates and latencies with 'iostat
-dk -zyx 1' did not happen.

> mountflags: (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 10.1033 s, 531 MB/s

I had mentioned in my previous reply the output of 'filefrag'.
That to me seems relevant here, because of RAID5 RMW and the
maximum extent size with Btrfs compression and strip/stripe size.

Perhaps redoing the tests with a 128KiB 'bs' *without*
compression would be interesting, perhaps even with 'oflag=sync'
instead of 'conv=fsync'.
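
Something along these lines, keeping the same 5GiB total as in the
earlier runs, just in 128KiB blocks (illustrative; same source and
destination as your tests):

  dd if=/dev/sdb of=./testing count=40960 bs=128k status=progress oflag=sync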

It is hard for me to see a speed issue here with Btrfs: for
comparison I have done a simple test with both a 3+1 MD RAID5
set with a 256KiB chunk size and a single block device on
"contemporary" 1T/2TB drives, capable of sequential transfer
rates of 150-190MB/s:

  soft#  grep -A2 sdb3 /proc/mdstat 
  md127 : active raid5 sde3[4] sdd3[2] sdc3[1] sdb3[0]
	729808128 blocks super 1.0 level 5, 256k chunk, algorithm 2 [4/4] [UUUU]

with compression:

  soft#  mount -t btrfs -o commit=10,compress-force=zlib /dev/md/test5 /mnt/test5                                                       
  soft#  mount -t btrfs -o commit=10,compress-force=zlib /dev/sdg3 /mnt/sdg3
  soft#  rm -f /mnt/test5/testfile /mnt/sdg3/testfile

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 94.3605 s, 111 MB/s
  0.01user 12.59system 1:34.36elapsed 13%CPU (0avgtext+0avgdata 2932maxresident)k
  13042144inputs+20482144outputs (3major+345minor)pagefaults 0swaps

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 93.5885 s, 112 MB/s
  0.03user 12.35system 1:33.59elapsed 13%CPU (0avgtext+0avgdata 2940maxresident)k
  13042144inputs+20482400outputs (3major+346minor)pagefaults 0swaps

  soft#  filefrag /mnt/test5/testfile /mnt/sdg3/testfile
  /mnt/test5/testfile: 48945 extents found
  /mnt/sdg3/testfile: 49029 extents found

  soft#  btrfs fi df /mnt/test5/ | grep Data
  Data, single: total=7.00GiB, used=6.55GiB

  soft#  btrfs fi df /mnt/sdg3 | grep Data
  Data, single: total=7.00GiB, used=6.55GiB

  soft#  sysctl vm/drop_caches=3
  vm.drop_caches = 3
  soft#  /usr/bin/time dd iflag=fullblock if=/mnt/test5/testfile bs=1M count=10000 of=/dev/zero
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 23.2975 s, 450 MB/s
  0.01user 7.59system 0:23.32elapsed 32%CPU (0avgtext+0avgdata 2932maxresident)k
  13759624inputs+0outputs (3major+344minor)pagefaults 0swaps

  soft#  sysctl vm/drop_caches=3
  vm.drop_caches = 3
  soft#  /usr/bin/time dd iflag=fullblock if=/mnt/sdg3/testfile bs=1M count=10000 of=/dev/zero                                          
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 35.0032 s, 300 MB/s
  0.01user 8.46system 0:35.03elapsed 24%CPU (0avgtext+0avgdata 2924maxresident)k
  13750568inputs+0outputs (3major+345minor)pagefaults 0swaps

and without compression:

  soft#  mount -t btrfs -o commit=10 /dev/sdg3 /mnt/sdg3                                                                                
  soft#  rm -f /mnt/test5/testfile /mnt/sdg3/testfile

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 74.7256 s, 140 MB/s
  0.02user 13.31system 1:14.72elapsed 17%CPU (0avgtext+0avgdata 2936maxresident)k
  13047640inputs+20483808outputs (3major+345minor)pagefaults 0swaps

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 102.002 s, 103 MB/s
  0.02user 14.49system 1:42.00elapsed 14%CPU (0avgtext+0avgdata 2972maxresident)k
  13030592inputs+20484032outputs (3major+345minor)pagefaults 0swaps

  soft#  filefrag /mnt/test5/testfile /mnt/sdg3/testfile
  /mnt/test5/testfile: 23 extents found
  /mnt/sdg3/testfile: 13 extents found

> The CPU usage is pretty low as well. For example when the
> force-compress=zlib is in effect, the cpu usage is pretty low
> now.

That's 24 threads at around 4-5% CPU each, which is around 100% of
one CPU of system time spread around, for 70MB/s.

That's quite low. My report, which is mirrored by using 'pigz'
at the user level (very similar algorithms), was that 90MB/s
took 300% of an FX-6100 CPU at 3.3GHz, which is not that much
less efficient than a Xeon E5645 at 2.4GHz.

I have redone the test on a faster CPU:

  base#  grep 'model name' /proc/cpuinfo | sort -u
  model name      : AMD FX-8370E Eight-Core Processor
  base#  cpufreq-info | grep 'current CPU frequency'
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).

And the result is (from a fast flash Samsung SSD to a fast 2TB
Toshiba drive):

  base#  mount -t btrfs -o commit=10,compress-force=zlib /dev/sdb6 /mnt/sdb6
  base#  /usr/bin/time dd iflag=fullblock if=/dev/sde6 of=/mnt/sdb6/testfile bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 41.7702 s, 251 MB/s
  0.00user 11.41system 0:43.41elapsed 26%CPU (0avgtext+0avgdata 3132maxresident)k
  20482288inputs+20503368outputs (1major+339minor)pagefaults 0swaps

With CPU usage as:

  top - 09:20:38 up 20:48,  4 users,  load average: 5.04, 2.03, 2.06
  Tasks: 576 total,  10 running, 566 sleeping,   0 stopped,   0 zombie
  %Cpu0  :  0.0 us, 94.0 sy,  0.7 ni,  0.0 id,  2.0 wa,  0.0 hi,  3.3 si,  0.0 st
  %Cpu1  :  0.3 us, 97.0 sy,  0.0 ni,  2.0 id,  0.7 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu2  :  3.0 us, 94.4 sy,  0.0 ni,  1.7 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu3  :  0.7 us, 95.4 sy,  0.0 ni,  2.3 id,  1.3 wa,  0.0 hi,  0.3 si,  0.0 st
  %Cpu4  :  1.0 us, 95.7 sy,  0.3 ni,  2.6 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu5  :  0.3 us, 94.7 sy,  0.0 ni,  4.3 id,  0.7 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu6  :  1.7 us, 94.3 sy,  0.0 ni,  3.0 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu7  :  0.0 us, 97.3 sy,  0.0 ni,  2.3 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
  KiB Mem:  16395076 total, 15987476 used,   407600 free,  5206304 buffers
  KiB Swap:        0 total,        0 used,        0 free.  8392648 cached Mem

so around 7 CPUs for 250MB/s, or around 35MB/s per CPU (more or
less what I also get user-level with 'pigz'), and it is hard for
me to imagine the Xeon E5645 being twice as fast per-CPU for
"integer" work, but that's another discussion.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs + compression = slow performance and high cpu usage
  2017-07-31 11:41       ` Peter Grandi
@ 2017-07-31 12:33         ` Peter Grandi
  2017-07-31 12:49           ` Peter Grandi
  2017-08-01  9:58         ` Konstantin V. Gavrilenko
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Grandi @ 2017-07-31 12:33 UTC (permalink / raw)
  To: Linux fs Btrfs

> [ ... ] It is hard for me to see a speed issue here with
> Btrfs: for comparison I have done a simple test with a both a
> 3+1 MD RAID5 set with a 256KiB chunk size and a single block
> device on "contemporary" 1T/2TB drives, capable of sequential
> transfer rates of 150-190MB/s: [ ... ]

The figures after this are a bit on the low side because I
realized, looking at 'vmstat', that the source block device 'sda6'
was being a bottleneck, as the host has only 8GiB instead of the
16GiB I misremembered, and also 'sda' is a relatively slow flash
SSD that reads at most at around 220MB/s. So I have redone the
simple tests with a transfer size of 3GB, which ensures that
all reads come from the memory cache:

with compression:

  soft#  mount -t btrfs -o commit=10,compress-force=zlib /dev/md/test5 /mnt/test5
  soft#  mount -t btrfs -o commit=10,compress-force=zlib /dev/sdg3 /mnt/sdg3
  soft#  rm -f /mnt/test5/testfile /mnt/sdg3/testfile                                                                                   

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=3000 conv=fsync
  3000+0 records in
  3000+0 records out
  3145728000 bytes (3.1 GB) copied, 15.8869 s, 198 MB/s
  0.00user 2.80system 0:15.88elapsed 17%CPU (0avgtext+0avgdata 3056maxresident)k
  0inputs+6148256outputs (0major+346minor)pagefaults 0swaps

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=3000 conv=fsync
  3000+0 records in
  3000+0 records out
  3145728000 bytes (3.1 GB) copied, 16.9663 s, 185 MB/s
  0.00user 2.61system 0:16.96elapsed 15%CPU (0avgtext+0avgdata 3056maxresident)k
  0inputs+6144672outputs (0major+346minor)pagefaults 0swaps

  soft#  btrfs fi df /mnt/test5/ | grep Data                                                                
  Data, single: total=3.00GiB, used=2.28GiB
  soft#  btrfs fi df /mnt/sdg3 | grep Data
  Data, single: total=3.00GiB, used=2.28GiB

  soft#  filefrag /mnt/test5/testfile /mnt/sdg3/testfile
  /mnt/test5/testfile: 8811 extents found
  /mnt/sdg3/testfile: 8759 extents found

Slightly weird that with a 3GB size the number of extents is
almost double that for the 10GB, but I guess that depends on
speed.

Then without compression:

  soft#  mount -t btrfs -o commit=10 /dev/md/test5 /mnt/test5
  soft#  mount -t btrfs -o commit=10 /dev/sdg3 /mnt/sdg3
  soft#  rm -f /mnt/test5/testfile /mnt/sdg3/testfile

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=3000 conv=fsync
  3000+0 records in
  3000+0 records out
  3145728000 bytes (3.1 GB) copied, 8.06841 s, 390 MB/s
  0.00user 3.90system 0:08.80elapsed 44%CPU (0avgtext+0avgdata 2880maxresident)k
  0inputs+6153856outputs (0major+345minor)pagefaults 0swaps

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=3000 conv=fsync
  3000+0 records in
  3000+0 records out
  3145728000 bytes (3.1 GB) copied, 30.215 s, 104 MB/s
  0.00user 4.82system 0:30.93elapsed 15%CPU (0avgtext+0avgdata 2888maxresident)k
  0inputs+6152128outputs (0major+347minor)pagefaults 0swaps

  soft#  filefrag /mnt/test5/testfile /mnt/sdg3/testfile                                                                                
  /mnt/test5/testfile: 5 extents found
  /mnt/sdg3/testfile: 3 extents found

Also added:

  soft#  rm -f /mnt/test5/testfile /mnt/sdg3/testfile                                                                                                          

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=128k count=3000 | dd iflag=fullblock of=/mnt/test5/testfile bs=128k oflag=sync
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 160.315 s, 2.5 MB/s
  0.02user 0.46system 2:40.31elapsed 0%CPU (0avgtext+0avgdata 1992maxresident)k
  0inputs+0outputs (0major+124minor)pagefaults 0swaps
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 160.365 s, 2.5 MB/s

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=128k count=3000 | dd iflag=fullblock of=/mnt/sdg3/testfile bs=128k oflag=sync                        
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 113.51 s, 3.5 MB/s
  0.02user 0.56system 1:53.51elapsed 0%CPU (0avgtext+0avgdata 2156maxresident)k
  0inputs+0outputs (0major+120minor)pagefaults 0swaps
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 113.544 s, 3.5 MB/s

  soft#  filefrag /mnt/test5/testfile /mnt/sdg3/testfile                                                                                                       
  /mnt/test5/testfile: 1 extent found
  /mnt/sdg3/testfile: 22 extents found

  soft#  rm -f /mnt/test5/testfile /mnt/sdg3/testfile                                                                                                          

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=1M count=1000 | dd iflag=fullblock of=/mnt/test5/testfile bs=1M oflag=sync                           
  1000+0 records in
  1000+0 records out
  1048576000 bytes (1.0 GB) copied, 68.5037 s, 15.3 MB/s
  0.00user 1.16system 1:08.50elapsed 1%CPU (0avgtext+0avgdata 2888maxresident)k
  0inputs+0outputs (0major+347minor)pagefaults 0swaps
  1000+0 records in
  1000+0 records out
  1048576000 bytes (1.0 GB) copied, 68.5859 s, 15.3 MB/s

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=1M count=1000 | dd iflag=fullblock of=/mnt/sdg3/testfile bs=1M oflag=sync                            
  1000+0 records in
  1000+0 records out
  1048576000 bytes (1.0 GB) copied, 56.6714 s, 18.5 MB/s
  0.00user 1.21system 0:56.67elapsed 2%CPU (0avgtext+0avgdata 3056maxresident)k
  0inputs+0outputs (0major+345minor)pagefaults 0swaps
  1000+0 records in
  1000+0 records out
  1048576000 bytes (1.0 GB) copied, 56.7116 s, 18.5 MB/s

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs + compression = slow performance and high cpu usage
  2017-07-31 12:33         ` Peter Grandi
@ 2017-07-31 12:49           ` Peter Grandi
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Grandi @ 2017-07-31 12:49 UTC (permalink / raw)
  To: Linux fs Btrfs

[ ... ]

> Also added:

Feeling very generous :-) today, adding these too:

  soft#  mkfs.btrfs -mraid10 -draid10 -L test5 /dev/sd{b,c,d,e}3
  [ ... ]
  soft#  mount -t btrfs -o commit=10,compress-force=zlib /dev/sdb3 /mnt/test5

  soft#  rm -f /mnt/test5/testfile
  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=3000 conv=fsync
  3000+0 records in
  3000+0 records out
  3145728000 bytes (3.1 GB) copied, 14.2166 s, 221 MB/s
  0.00user 2.54system 0:14.21elapsed 17%CPU (0avgtext+0avgdata 3056maxresident)k
  0inputs+6144768outputs (0major+346minor)pagefaults 0swaps

  soft#  rm -f /mnt/test5/testfile
  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=128k count=3000 conv=fsync                                                    
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 2.05933 s, 191 MB/s
  0.00user 0.32system 0:02.06elapsed 15%CPU (0avgtext+0avgdata 1996maxresident)k
  0inputs+772512outputs (0major+124minor)pagefaults 0swaps

  soft#  rm -f /mnt/test5/testfile
  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=1M count=1000 | dd iflag=fullblock of=/mnt/test5/testfile bs=1M oflag=sync                           
  1000+0 records in
  1000+0 records out
  1048576000 bytes (1.0 GB) copied, 60.6019 s, 17.3 MB/s
  0.01user 1.04system 1:00.60elapsed 1%CPU (0avgtext+0avgdata 2888maxresident)k
  0inputs+0outputs (0major+348minor)pagefaults 0swaps
  1000+0 records in
  1000+0 records out
  1048576000 bytes (1.0 GB) copied, 60.4116 s, 17.4 MB/s

  soft#  rm -f /mnt/test5/testfile
  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=128k count=3000 | dd iflag=fullblock of=/mnt/test5/testfile bs=128k oflag=sync
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 148.04 s, 2.7 MB/s
  0.00user 0.62system 2:28.04elapsed 0%CPU (0avgtext+0avgdata 1996maxresident)k
  0inputs+0outputs (0major+125minor)pagefaults 0swaps
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 148.083 s, 2.7 MB/s

  soft#  sysctl vm/drop_caches=3
  vm.drop_caches = 3
  soft#  /usr/bin/time dd iflag=fullblock if=/mnt/test5/testfile bs=128k count=3000 of=/dev/zero                                                               
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 1.09729 s, 358 MB/s
  0.00user 0.24system 0:01.10elapsed 23%CPU (0avgtext+0avgdata 2164maxresident)k
  459768inputs+0outputs (3major+121minor)pagefaults 0swaps

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs + compression = slow performance and high cpu usage
  2017-07-31 11:41       ` Peter Grandi
  2017-07-31 12:33         ` Peter Grandi
@ 2017-08-01  9:58         ` Konstantin V. Gavrilenko
  2017-08-01 10:53           ` Paul Jones
  2017-08-01 13:14           ` Peter Grandi
  1 sibling, 2 replies; 17+ messages in thread
From: Konstantin V. Gavrilenko @ 2017-08-01  9:58 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs Btrfs

Peter, I don't think filefrag is showing the correct fragmentation status of the file when compression is used.
At least not the one that is installed by default in Ubuntu 16.04 - e2fsprogs | 1.42.13-1ubuntu1

So, for example, the reported fragmentation of a compressed file is 320 times that of an uncompressed one.

root@homenas:/mnt/storage/NEW# filefrag test5g-zeroes
test5g-zeroes: 40903 extents found

root@homenas:/mnt/storage/NEW# filefrag test5g-data 
test5g-data: 129 extents found
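
(To see whether those 40903 extents are genuinely scattered or just
reflect the 128KiB compressed-extent granularity, the verbose output
shows the physical offsets, e.g.:)

root@homenas:/mnt/storage/NEW# filefrag -v test5g-zeroes | head -n 20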


I am currently defragmenting that mountpoint, ensuring that everything is compressed with zlib:
# btrfs fi defragment -rv -czlib /mnt/arh-backup 

My guess is that it will take another 24-36 hours to complete, and then I will redo the test to see if that has helped.
Will keep the list posted.

p.s. any other suggestion that might help with the fragmentation and data allocation. Should I try and rebalance the data on the drive?

kos



----- Original Message -----
From: "Peter Grandi" <pg@btrfs.list.sabi.co.UK>
To: "Linux fs Btrfs" <linux-btrfs@vger.kernel.org>
Sent: Monday, 31 July, 2017 1:41:07 PM
Subject: Re: Btrfs + compression = slow performance and high cpu usage

[ ... ]

> grep 'model name' /proc/cpuinfo | sort -u 
> model name      : Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz

Good, contemporary CPU with all accelerations.

> The sda device is a hardware RAID5 consisting of 4x8TB drives.
[ ... ]
> Strip Size          : 256 KB

So the full RMW data stripe length is 768KiB.

> [ ... ] don't see the previously reported behaviour of one of
> the kworker consuming 100% of the cputime, but the write speed
> difference between the compression ON vs OFF is pretty large.

That's weird; of course 'lzo' is a lot cheaper than 'zlib', but
in my test the much higher CPU time of the latter was spread
across many CPUs, while in your case it wasn't, even though the
E5645 has 6 cores and can run 12 threads. That seemed to point to
some high cost of finding free blocks, that is, a very fragmented
free list, or something else.

> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress oflag=direct
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 26.0685 s, 206 MB/s

The results with 'oflag=direct' are not relevant, because Btrfs
behaves "differently" with that.

> mountflags: (rw,relatime,compress-force=zlib,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 77.4845 s, 69.3 MB/s
> mountflags: (rw,relatime,compress-force=lzo,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 122.321 s, 43.9 MB/s

That's pretty good for a RAID5 with 128KiB writes and a 768KiB
stripe size, on a 3ware, and it looks like the hw host adapter
does not have a persistent cache (usually battery-backed). My
guess is that watching transfer rates and latencies with 'iostat
-dk -zyx 1' did not happen.
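
For example, something as simple as this, left running in a second
terminal while one of the 'dd' runs is in flight, shows per-device
request sizes, queue depth and latencies once per second ('-z'
hides idle devices):

  #  iostat -dk -zyx 1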

> mountflags: (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 10.1033 s, 531 MB/s

I had mentioned the output of 'filefrag' in my previous reply.
That seems relevant to me here, because of RAID5 RMW versus the
maximum extent size with Btrfs compression and the strip/stripe
size.

Perhaps redoing the tests with a 128KiB 'bs' *without*
compression would be interesting, perhaps even with 'oflag=sync'
instead of 'conv=fsync'.
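
Something along these lines, on your array with the volume mounted
*without* any compress option, would show whether the small-write
RMW penalty appears even with plain 128KiB writes (the device names,
mount point and counts here are only placeholders, not your actual
setup):

  #  mount -t btrfs -o commit=10 /dev/sdX /mnt/backup
  #  /usr/bin/time dd iflag=fullblock if=/dev/sdb of=/mnt/backup/testfile bs=128k count=30000 oflag=sync
  #  rm -f /mnt/backup/testfile
  #  /usr/bin/time dd iflag=fullblock if=/dev/sdb of=/mnt/backup/testfile bs=128k count=30000 conv=fsync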

It is hard for me to see a speed issue here with Btrfs: for
comparison I have done a simple test with both a 3+1 MD RAID5
set with a 256KiB chunk size and a single block device on
"contemporary" 1TB/2TB drives, capable of sequential transfer
rates of 150-190MB/s:

  soft#  grep -A2 sdb3 /proc/mdstat 
  md127 : active raid5 sde3[4] sdd3[2] sdc3[1] sdb3[0]
	729808128 blocks super 1.0 level 5, 256k chunk, algorithm 2 [4/4] [UUUU]

with compression:

  soft#  mount -t btrfs -o commit=10,compress-force=zlib /dev/md/test5 /mnt/test5                                                       
  soft#  mount -t btrfs -o commit=10,compress-force=zlib /dev/sdg3 /mnt/sdg3
  soft#  rm -f /mnt/test5/testfile /mnt/sdg3/testfile

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 94.3605 s, 111 MB/s
  0.01user 12.59system 1:34.36elapsed 13%CPU (0avgtext+0avgdata 2932maxresident)k
  13042144inputs+20482144outputs (3major+345minor)pagefaults 0swaps

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 93.5885 s, 112 MB/s
  0.03user 12.35system 1:33.59elapsed 13%CPU (0avgtext+0avgdata 2940maxresident)k
  13042144inputs+20482400outputs (3major+346minor)pagefaults 0swaps

  soft#  filefrag /mnt/test5/testfile /mnt/sdg3/testfile
  /mnt/test5/testfile: 48945 extents found
  /mnt/sdg3/testfile: 49029 extents found

  soft#  btrfs fi df /mnt/test5/ | grep Data
  Data, single: total=7.00GiB, used=6.55GiB

  soft#  btrfs fi df /mnt/sdg3 | grep Data
  Data, single: total=7.00GiB, used=6.55GiB

  soft#  sysctl vm/drop_caches=3
  vm.drop_caches = 3
  soft#  /usr/bin/time dd iflag=fullblock if=/mnt/test5/testfile bs=1M count=10000 of=/dev/zero
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 23.2975 s, 450 MB/s
  0.01user 7.59system 0:23.32elapsed 32%CPU (0avgtext+0avgdata 2932maxresident)k
  13759624inputs+0outputs (3major+344minor)pagefaults 0swaps

  soft#  sysctl vm/drop_caches=3
  vm.drop_caches = 3
  soft#  /usr/bin/time dd iflag=fullblock if=/mnt/sdg3/testfile bs=1M count=10000 of=/dev/zero                                          
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 35.0032 s, 300 MB/s
  0.01user 8.46system 0:35.03elapsed 24%CPU (0avgtext+0avgdata 2924maxresident)k
  13750568inputs+0outputs (3major+345minor)pagefaults 0swaps

and without compression:

  soft#  mount -t btrfs -o commit=10 /dev/sdg3 /mnt/sdg3                                                                                
  soft#  rm -f /mnt/test5/testfile /mnt/sdg3/testfile

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 74.7256 s, 140 MB/s
  0.02user 13.31system 1:14.72elapsed 17%CPU (0avgtext+0avgdata 2936maxresident)k
  13047640inputs+20483808outputs (3major+345minor)pagefaults 0swaps

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 102.002 s, 103 MB/s
  0.02user 14.49system 1:42.00elapsed 14%CPU (0avgtext+0avgdata 2972maxresident)k
  13030592inputs+20484032outputs (3major+345minor)pagefaults 0swaps

  soft#  filefrag /mnt/test5/testfile /mnt/sdg3/testfile
  /mnt/test5/testfile: 23 extents found
  /mnt/sdg3/testfile: 13 extents found

> The CPU usage is pretty low as well. For example when the
> force-compress=zlib is in effect, the cpu usage is pretty low
> now.

That's 24 threads at around 4-5% CPU each, i.e. around 100% of
one CPU of system time spread around, for 70MB/s.

That's quite low. My report, which is matched by using 'pigz'
at the user level (very similar algorithms), was that 90MB/s
took 300% of an FX-6100 CPU at 3.3GHz, and it is not that much
less efficient than a Xeon-E5645 at 2.4GHz.

I have redone the test on a faster CPU:

  base#  grep 'model name' /proc/cpuinfo | sort -u
  model name      : AMD FX-8370E Eight-Core Processor
  base#  cpufreq-info | grep 'current CPU frequency'
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).
    current CPU frequency is 3.30 GHz (asserted by call to hardware).

And the result is (from a fast flash Samsung SSD to a fast 2TB
Toshiba drive):

  base#  mount -t btrfs -o commit=10,compress-force=zlib /dev/sdb6 /mnt/sdb6
  base#  /usr/bin/time dd iflag=fullblock if=/dev/sde6 of=/mnt/sdb6/testfile bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 41.7702 s, 251 MB/s
  0.00user 11.41system 0:43.41elapsed 26%CPU (0avgtext+0avgdata 3132maxresident)k
  20482288inputs+20503368outputs (1major+339minor)pagefaults 0swaps

With CPU usage as:

  top - 09:20:38 up 20:48,  4 users,  load average: 5.04, 2.03, 2.06
  Tasks: 576 total,  10 running, 566 sleeping,   0 stopped,   0 zombie
  %Cpu0  :  0.0 us, 94.0 sy,  0.7 ni,  0.0 id,  2.0 wa,  0.0 hi,  3.3 si,  0.0 st
  %Cpu1  :  0.3 us, 97.0 sy,  0.0 ni,  2.0 id,  0.7 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu2  :  3.0 us, 94.4 sy,  0.0 ni,  1.7 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu3  :  0.7 us, 95.4 sy,  0.0 ni,  2.3 id,  1.3 wa,  0.0 hi,  0.3 si,  0.0 st
  %Cpu4  :  1.0 us, 95.7 sy,  0.3 ni,  2.6 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu5  :  0.3 us, 94.7 sy,  0.0 ni,  4.3 id,  0.7 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu6  :  1.7 us, 94.3 sy,  0.0 ni,  3.0 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu7  :  0.0 us, 97.3 sy,  0.0 ni,  2.3 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
  KiB Mem:  16395076 total, 15987476 used,   407600 free,  5206304 buffers
  KiB Swap:        0 total,        0 used,        0 free.  8392648 cached Mem

so around 7 CPUs for 250MB/s, or around 35MB/s per CPU (more or
less what I also get user-level with 'pigz'), and it is hard for
me to imagine the Xeon-E5645 being twice as fast per-CPU for
"integer" work, but that's another discussion.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Btrfs + compression = slow performance and high cpu usage
  2017-08-01  9:58         ` Konstantin V. Gavrilenko
@ 2017-08-01 10:53           ` Paul Jones
  2017-08-01 13:14           ` Peter Grandi
  1 sibling, 0 replies; 17+ messages in thread
From: Paul Jones @ 2017-08-01 10:53 UTC (permalink / raw)
  To: Konstantin V. Gavrilenko, Peter Grandi; +Cc: Linux fs Btrfs

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-
> owner@vger.kernel.org] On Behalf Of Konstantin V. Gavrilenko
> Sent: Tuesday, 1 August 2017 7:58 PM
> To: Peter Grandi <pg@btrfs.list.sabi.co.UK>
> Cc: Linux fs Btrfs <linux-btrfs@vger.kernel.org>
> Subject: Re: Btrfs + compression = slow performance and high cpu usage
> 
> Peter, I don't think the filefrag is showing the correct fragmentation status of
> the file when the compression is used.
> At least the one that is installed by default in Ubuntu 16.04 -  e2fsprogs |
> 1.42.13-1ubuntu1
> 
> So for example, fragmentation of compressed file is 320 times more than
> uncompressed one.
> 
> root@homenas:/mnt/storage/NEW# filefrag test5g-zeroes
> test5g-zeroes: 40903 extents found
> 
> root@homenas:/mnt/storage/NEW# filefrag test5g-data
> test5g-data: 129 extents found

Compressed extents are about 128 KiB, uncompressed extents are about 128 MiB (can't remember the exact numbers).
I've had trouble with slow filesystems when using compression. The problem seems to go away when compression is removed.

Paul.







^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs + compression = slow performance and high cpu usage
  2017-08-01  9:58         ` Konstantin V. Gavrilenko
  2017-08-01 10:53           ` Paul Jones
@ 2017-08-01 13:14           ` Peter Grandi
  2017-08-01 18:09             ` Konstantin V. Gavrilenko
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Grandi @ 2017-08-01 13:14 UTC (permalink / raw)
  To: Linux fs Btrfs

> Peter, I don't think the filefrag is showing the correct
> fragmentation status of the file when the compression is used.

As reported in a previous message, the output of 'filefrag -v'
can be used to see what is going on:

>>>> filefrag /mnt/sde3/testfile 
>>>>   /mnt/sde3/testfile: 49287 extents found

>>>> Most the latter extents are mercifully rather contiguous, their
>>>> size is just limited by the compression code, here is an extract
>>>> from 'filefrag -v' from around the middle:

>>>>   24757:  1321888.. 1321919:   11339579..  11339610:     32:   11339594:
>>>>   24758:  1321920.. 1321951:   11339597..  11339628:     32:   11339611:
>>>>   24759:  1321952.. 1321983:   11339615..  11339646:     32:   11339629:
>>>>   24760:  1321984.. 1322015:   11339632..  11339663:     32:   11339647:
>>>>   24761:  1322016.. 1322047:   11339649..  11339680:     32:   11339664:
>>>>   24762:  1322048.. 1322079:   11339667..  11339698:     32:   11339681:
>>>>   24763:  1322080.. 1322111:   11339686..  11339717:     32:   11339699:
>>>>   24764:  1322112.. 1322143:   11339703..  11339734:     32:   11339718:
>>>>   24765:  1322144.. 1322175:   11339720..  11339751:     32:   11339735:
>>>>   24766:  1322176.. 1322207:   11339737..  11339768:     32:   11339752:
>>>>   24767:  1322208.. 1322239:   11339754..  11339785:     32:   11339769:
>>>>   24768:  1322240.. 1322271:   11339771..  11339802:     32:   11339786:
>>>>   24769:  1322272.. 1322303:   11339789..  11339820:     32:   11339803:

>>>> But again this is on a fresh empty Btrfs volume.

As I wrote, "their size is just limited by the compression code"
which results in "128KiB writes". On a "fresh empty Btrfs volume"
the compressed extents limited to 128KiB also happen to be pretty
physically contiguous, but on a more fragmented free space list
they can be more scattered.

As I already wrote the main issue here seems to be that we are
talking about a "RAID5 with 128KiB writes and a 768KiB stripe
size". On MD RAID5 the slowdown because of RMW seems only to be
around 30-40%, but it looks like several back-to-back 128KiB
writes get merged by the Linux IO subsystem (not sure whether
that's thoroughly legal), and perhaps they get merged by the 3ware
firmware only if it has a persistent cache, and maybe your 3ware
does not have one, but you have kept your counsel as to that.

My impression is that you read the Btrfs documentation and my
replies with a lot less attention than I write them. Some of the
things you have done and said make me think that you did not read
https://btrfs.wiki.kernel.org/index.php/Compression and 'man 5
btrfs', for example:

   "How does compression interact with direct IO or COW?

     Compression does not work with DIO, does work with COW and
     does not work for NOCOW files. If a file is opened in DIO
     mode, it will fall back to buffered IO.

   Are there speed penalties when doing random access to a
   compressed file?

     Yes. The compression processes ranges of a file of maximum
     size 128 KiB and compresses each 4 KiB (or page-sized) block
     separately."

> I am currently defragmenting that mountpoint, ensuring that
> everything is compressed with zlib.

Defragmenting the used space might help find more contiguous
allocations.

> P.S. Any other suggestions that might help with the fragmentation
> and data allocation? Should I try to rebalance the data on the
> drive?

Yes, regularly, as that defragments the unused space.
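
For example (a sketch only; the mount point is the one from your
defragment command, and the usage threshold is just an illustration
to adjust to taste), a filtered balance repacks only partially-filled
chunks and is much cheaper than a full rebalance:

  #  btrfs balance start -dusage=50 /mnt/arh-backup
  #  btrfs balance status /mnt/arh-backup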

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs + compression = slow performance and high cpu usage
  2017-08-01 13:14           ` Peter Grandi
@ 2017-08-01 18:09             ` Konstantin V. Gavrilenko
  2017-08-01 20:09               ` Peter Grandi
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin V. Gavrilenko @ 2017-08-01 18:09 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs Btrfs

----- Original Message -----
From: "Peter Grandi" <pg@btrfs.list.sabi.co.UK>
To: "Linux fs Btrfs" <linux-btrfs@vger.kernel.org>
Sent: Tuesday, 1 August, 2017 3:14:07 PM
Subject: Re: Btrfs + compression = slow performance and high cpu usage

> Peter, I don't think the filefrag is showing the correct
> fragmentation status of the file when the compression is used.

<SNIP>

As I wrote, "their size is just limited by the compression code"
which results in "128KiB writes". On a "fresh empty Btrfs volume"
the compressed extents limited to 128KiB also happen to be pretty
physically contiguous, but on a more fragmented free space list
they can be more scattered.

KOS: OK, thanks for pointing it out. I have compared the 'filefrag -v' output on another btrfs filesystem that is not fragmented
and can see the difference from what is happening on the sluggish one.

5824:   186368..  186399: 2430093383..2430093414:     32: 2430093414: encoded
5825:   186400..  186431: 2430093384..2430093415:     32: 2430093415: encoded
5826:   186432..  186463: 2430093385..2430093416:     32: 2430093416: encoded
5827:   186464..  186495: 2430093386..2430093417:     32: 2430093417: encoded
5828:   186496..  186527: 2430093387..2430093418:     32: 2430093418: encoded
5829:   186528..  186559: 2430093388..2430093419:     32: 2430093419: encoded
5830:   186560..  186591: 2430093389..2430093420:     32: 2430093420: encoded



As I already wrote the main issue here seems to be that we are
talking about a "RAID5 with 128KiB writes and a 768KiB stripe
size". On MD RAID5 the slowdown because of RMW seems only to be
around 30-40%, but it looks like several back-to-back 128KiB
writes get merged by the Linux IO subsystem (not sure whether
that's thoroughly legal), and perhaps they get merged by the 3ware
firmware only if it has a persistent cache, and maybe your 3ware
does not have one, but you have kept your counsel as to that.


KOS: No, I don't have a persistent cache, only the 512 MB cache on board the controller, which is
battery-backed (BBU). If I had additional SSD caching on the controller I would have mentioned it.

I was also under the impression that in a situation where mostly extra-large files will be stored on the array, a bigger strip size would increase the speed, thus I went with the 256 KB strip size. Would I be correct in assuming that a RAID strip size of 128 KB would be a better choice if one plans to use Btrfs with compression?

thanks,
kos



<SNIP>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs + compression = slow performance and high cpu usage
  2017-08-01 18:09             ` Konstantin V. Gavrilenko
@ 2017-08-01 20:09               ` Peter Grandi
  2017-08-01 23:54                 ` Peter Grandi
  2017-08-31 10:56                 ` Konstantin V. Gavrilenko
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Grandi @ 2017-08-01 20:09 UTC (permalink / raw)
  To: Linux fs Btrfs

>> [ ... ] a "RAID5 with 128KiB writes and a 768KiB stripe
>> size". [ ... ] several back-to-back 128KiB writes [ ... ] get
>> merged by the 3ware firmware only if it has a persistent
>> cache, and maybe your 3ware does not have one,

> KOS: No, I don't have a persistent cache, only the 512 MB cache
> on board the controller, which is battery-backed (BBU).

If it is a persistent cache, which can be battery-backed (as I
wrote, but it seems that you don't have too much time to read
replies), then the size of the write, 128KiB or not, should not
matter much; the write will be reported complete when it hits
the persistent cache (whichever technology it uses), and then
the HA firmware will spill write-cached data to the disks using
the optimal operation width.

Unless the 3ware firmware is really terrible (and depending on
model and vintage it can be amazingly terrible) or the battery
is no longer recharging and then the host adapter switches to
write-through.

That you see very different rates between uncompressed and
compressed writes, where the main difference is the limitation
on the segment size, seems to indicate that compressed writes
involve a lot of RMW, that is sub-stripe updates. As I mentioned
already, it would be interesting to retry 'dd' with different
'bs' values without compression and with 'sync' (or 'direct'
which only makes sense without compression).

> If I had additional SSD caching on the controller I would have
> mentioned it.

So far you had not mentioned the presence of a BBU cache either,
which is equivalent, even though in one of your previous messages
(which I try to read carefully) there were these lines:

>>>> Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
>>>> Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

So perhaps someone else would have checked long ago the status
of the BBU and whether the "No Write Cache if Bad BBU" case has
happened. If the BBU is still working and the policy is still
"WriteBack" then things are stranger still.

> I was also under the impression that in a situation where mostly
> extra-large files will be stored on the array, a bigger
> strip size would increase the speed, thus I went
> with the 256 KB strip size.

That runs counter to this simple story: suppose a program is
doing 64KiB IO:

* For *reads*, there are 4 data drives and the strip size is
  16KiB: the 64KiB will be read in parallel on 4 drives. If the
  strip size is 256KiB then the 64KiB will be read sequentially
  from just one disk, and 4 successive reads will be read
  sequentially from the same drive.

* For *writes* on a parity RAID like RAID5 things are much, much
  more extreme: the 64KiB will be written with 16KiB strips on a
  5-wide RAID5 set in parallel to 5 drives, with 4 stripes being
  updated with RMW. But with 256KiB strips it will partially
  update 5 drives, because the stripe is 1024+256KiB, and it
  needs to do RMW, and four successive 64KiB writes will need to
  do that too, even if only one drive is updated. Usually for
  RAID5 there is an optimization that means that only the
  specific target drive and the parity drive(s) need RMW, but
  it is still very expensive.

This is the "storage for beginners" version, what happens in
practice however depends a lot on specific workload profile
(typical read/write size and latencies and rates), caching and
queueing algorithms in both Linux and the HA firmware.

> Would I be correct in assuming that a RAID strip size of 128
> KB would be a better choice if one plans to use Btrfs with
> compression?

That would need to be tested, because of "depends a lot on
specific workload profile, caching and queueing algorithms", but
my expectation is that the lower the better. Given that you have
4 drives giving a 3+1 RAID set, perhaps a 32KiB or 64KiB strip
size, giving a data stripe size of 96KiB or 192KiB, would be
better.
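
Just to make the arithmetic explicit (a trivial sketch, assuming
the 4-drive RAID5 above, i.e. 3 data strips per stripe):

  #  for strip in 32 64 128 256; do echo "strip=${strip}KiB -> data stripe=$((3 * strip))KiB"; done
  strip=32KiB -> data stripe=96KiB
  strip=64KiB -> data stripe=192KiB
  strip=128KiB -> data stripe=384KiB
  strip=256KiB -> data stripe=768KiB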

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs + compression = slow performance and high cpu usage
  2017-08-01 20:09               ` Peter Grandi
@ 2017-08-01 23:54                 ` Peter Grandi
  2017-08-31 10:56                 ` Konstantin V. Gavrilenko
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Grandi @ 2017-08-01 23:54 UTC (permalink / raw)
  To: Linux fs Btrfs

[ ... ]

> This is the "storage for beginners" version, what happens in
> practice however depends a lot on specific workload profile
> (typical read/write size and latencies and rates), caching and
> queueing algorithms in both Linux and the HA firmware.

To add a bit of slightly more advanced discussion, the main
reason for larger strips ("chunk size") is to avoid the huge
latencies of disk rotation with unsynchronized disk drives, as
detailed here:

  http://www.sabi.co.uk/blog/12-thr.html?120310#120310

That relates weakly to Btrfs.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs + compression = slow performance and high cpu usage
  2017-08-01 20:09               ` Peter Grandi
  2017-08-01 23:54                 ` Peter Grandi
@ 2017-08-31 10:56                 ` Konstantin V. Gavrilenko
  1 sibling, 0 replies; 17+ messages in thread
From: Konstantin V. Gavrilenko @ 2017-08-31 10:56 UTC (permalink / raw)
  To: Linux fs Btrfs; +Cc: Peter Grandi

Hello again list. I thought I would clear things up and describe what is happening with my troubled RAID setup.

Having received the help from the list, I initially ran a full defragmentation of all the data and recompressed everything with zlib.
That didn't help. Then I ran a full rebalance of the data and that didn't help either.

So I had to take a disk out of the RAID, copy all the data onto it, recreate the RAID array with a 32 KB strip size and a 96 KB data stripe, and copy the data back. Then I added the disk back and resynced the RAID.


So currently the RAID device is:

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Size                : 21.830 TB
Sector Size         : 512
Is VD emulated      : Yes
Parity Size         : 7.276 TB
State               : Optimal
Strip Size          : 32 KB
Number Of Drives    : 4
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: No


It is about 40% full with compressed data:
# btrfs fi usage /mnt/arh-backup1/
Overall:
    Device size:                  21.83TiB
    Device allocated:              8.98TiB
    Device unallocated:           12.85TiB
    Device missing:                  0.00B
    Used:                          8.98TiB
    Free (estimated):             12.85TiB      (min: 6.43TiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)


I've decided to run a set of tests where a 5 GB file was created using different block sizes and different flags.
One file with urandom data was generated and another one was filled with zeroes. The data was written with and without compression, and it seems that without compression it is possible to gain 30-40% in speed, while the CPU was about 50% idle during the highest loads.
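
The runs were driven by a loop along these lines (illustrative only,
not my exact script; the two 5 GB source files are assumed to have
been pre-generated from /dev/urandom and /dev/zero, the paths are
placeholders, and conv=fsync was swapped for oflag=sync or dropped
for the other two tables):

  for bs in 1024k 512k 256k 128k 64k 32k; do
      for src in /root/test5g-rand /root/test5g-zero; do
          rm -f /mnt/arh-backup1/testfile
          dd if=$src of=/mnt/arh-backup1/testfile bs=$bs conv=fsync 2>&1 | tail -1
      done
  done
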
dd write speeds (MB/s)

flags: conv=fsync
compress-force=zlib  compress-force=none
         RAND ZERO    RAND ZERO
bs1024k  387  407     584  577
bs512k   389  414     532  547
bs256k   412  409     558  585
bs128k   412  403     572  583
bs64k    409  419     563  574
bs32k    407  404     569  572


flags: oflag=sync
compress-force=zlib  compress-force=none
         RAND  ZERO    RAND  ZERO
bs1024k  86.1  97.0    203   210
bs512k   50.6  64.4    85.0  170
bs256k   25.0  29.8    67.6  67.5
bs128k   13.2  16.4    48.4  49.8
bs64k    7.4   8.3     24.5  27.9
bs32k    3.8   4.1     14.0  13.7




flags: no flags
compress-force=zlib  compress-force=none
         RAND  ZERO    RAND  ZERO
bs1024k  480   419     681   595
bs512k   422   412     633   585
bs256k   413   384     707   712
bs128k   414   387     695   704
bs64k    482   467     622   587
bs32k    416   412     610   598


I have also run a test where I filled the array to about 97% capacity and the write speed went down by about 50% compared with the empty RAID.


Thanks for the help.

----- Original Message -----
From: "Peter Grandi" <pg@btrfs.list.sabi.co.UK>
To: "Linux fs Btrfs" <linux-btrfs@vger.kernel.org>
Sent: Tuesday, 1 August, 2017 10:09:03 PM
Subject: Re: Btrfs + compression = slow performance and high cpu usage

<SNIP>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-08-31 10:56 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <33040946.535.1501254718807.JavaMail.gkos@dynomob>
2017-07-28 16:40 ` Btrfs + compression = slow performance and high cpu usage Konstantin V. Gavrilenko
2017-07-28 17:48   ` Roman Mamedov
2017-07-28 18:20     ` William Muriithi
2017-07-28 18:37       ` Hugo Mills
2017-07-28 18:08   ` Peter Grandi
2017-07-30 13:42     ` Konstantin V. Gavrilenko
2017-07-31 11:41       ` Peter Grandi
2017-07-31 12:33         ` Peter Grandi
2017-07-31 12:49           ` Peter Grandi
2017-08-01  9:58         ` Konstantin V. Gavrilenko
2017-08-01 10:53           ` Paul Jones
2017-08-01 13:14           ` Peter Grandi
2017-08-01 18:09             ` Konstantin V. Gavrilenko
2017-08-01 20:09               ` Peter Grandi
2017-08-01 23:54                 ` Peter Grandi
2017-08-31 10:56                 ` Konstantin V. Gavrilenko
2017-07-28 18:44   ` Peter Grandi
