From: Robert White <rwhite@pobox.com>
To: Martin Steigerwald <Martin@lichtvoll.de>,
Hugo Mills <hugo@carfax.org.uk>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: BTRFS free space handling still needs more work: Hangs again
Date: Sat, 27 Dec 2014 06:54:33 -0800 [thread overview]
Message-ID: <549EC829.8080808@pobox.com> (raw)
In-Reply-To: <2235526.D1sZ4Blnlh@merkaba>
On 12/27/2014 05:55 AM, Martin Steigerwald wrote:
> Summarized at
>
> Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>
> see below. This is reproducible with fio; no need for Windows XP in
> Virtualbox to reproduce the issue. Next I will try to reproduce it with
> a freshly created filesystem.
>
>
> On Saturday, 27 December 2014, 09:30:43, Hugo Mills wrote:
>> On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
>>> On Friday, 26 December 2014, 14:48:38, Robert White wrote:
>>>> On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
>>>>> Hello!
>>>>>
>>>>> First: Have a merry christmas and enjoy a quiet time in these days.
>>>>>
>>>>> Second: At a time you feel like it, here is a little rant, but also a
>>>>> bug
>>>>> report:
>>>>>
>>>>> I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
>>>>> space_cache, skinny metadata extents – are these a problem? – and
>>>>> compress=lzo:
>>>> (There is no known problem with skinny metadata; it's actually more
>>>> efficient than the older format. There have been some anecdotes about
>>>> mixing skinny and fat metadata, but nothing has ever been
>>>> demonstrated to be problematic.)
>>>>
>>>>> merkaba:~> btrfs fi sh /home
>>>>> Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
>>>>>
>>>>> Total devices 2 FS bytes used 144.41GiB
>>>>> devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
>>>>> devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
>>>>>
>>>>> Btrfs v3.17
>>>>> merkaba:~> btrfs fi df /home
>>>>> Data, RAID1: total=154.97GiB, used=141.12GiB
>>>>> System, RAID1: total=32.00MiB, used=48.00KiB
>>>>> Metadata, RAID1: total=5.00GiB, used=3.29GiB
>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>
>>>> This filesystem, at the allocation level, is "very full" (see below).
>>>>
>>>>> And I had hangs with BTRFS again. This time as I wanted to install tax
>>>>> return software in a Virtualbox'd Windows XP VM (which I use once a year,
>>>>> because I know of no tax return software for Linux that would be suitable
>>>>> for Germany, and I frankly don't care about the end of security, because
>>>>> all surfing and other network access I will do from the Linux box and I
>>>>> only run the VM behind a firewall).
>>>>
>>>>> And thus I try the balance dance again:
>>>> ITEM: Balance... it doesn't do what you think it does... 8-)
>>>>
>>>> "Balancing" is something you should almost never need to do. It is only
>>>> for cases of changing geometry (adding disks, switching RAID levels,
>>>> etc.) of for cases when you've radically changed allocation behaviors
>>>> (like you decided to remove all your VM's or you've decided to remove a
>>>> mail spool directory full of thousands of tiny files).
>>>>
>>>> People run balance all the time because they think they should. They are
>>>> _usually_ incorrect in that belief.
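(For reference, typical legitimate balance invocations look something
like this -- illustrative only, with a hypothetical mount point /mnt:)

   btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt  # after adding a device
   btrfs balance start -dusage=5 /mnt  # reclaim nearly-empty data chunks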
>>>
>>> I only see the lockups of BTRFS if the trees *occupy* all space on the
>>> device.
>> No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
>> space. What's more, balance does *not* balance the metadata trees. The
>> remaining space -- 154.97 GiB -- is unstructured storage for file
>> data, and you have some 13 GiB of that available for use.
>>
>> Now, since you're seeing lockups when the space on your disks is
>> all allocated I'd say that's a bug. However, you're the *only* person
>> who's reported this as a regular occurrence. Does this happen with all
>> filesystems you have, or just this one?
>>
>>> I *never* so far saw it lockup if there is still space BTRFS can allocate
>>> from to *extend* a tree.
>>
>> It's not a tree. It's simply space allocation. It's not even space
>> *usage* you're talking about here -- it's just allocation (i.e. the FS
>> saying "I'm going to use this piece of disk for this purpose").
>>
>>> This may be a bug, but this is what I see.
>>>
>>> And no amount of "you should not balance a BTRFS" will make that
>>> perception go away.
>>>
>>> See, I see the sun coming out in the morning and you tell me "no, it
>>> doesn't". Simply put, that is not going to match my perception.
>>
>> Duncan's assertion is correct in its detail. Looking at your space
>
> Robert's :)
>
>> usage, I would not suggest that running a balance is something you
>> need to do. Now, since you have these lockups that seem quite
>> repeatable, there's probably a lurking bug in there, but hacking
>> around with balance every time you hit it isn't going to get the
>> problem solved properly.
>>
>> I think I would suggest the following:
>>
>> - make sure you have some way of logging your dmesg permanently (use
>> a different filesystem for /var/log, or a serial console, or a
>> netconsole)
>>
>> - when the lockup happens, hit Alt-SysRq-t a few times
>>
>> - send the dmesg output here, or post to bugzilla.kernel.org
>>
>> That's probably going to give enough information to the developers
>> to work out where the lockup is happening, and is clearly the way
>> forward here.
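(For reference, a by-hand equivalent of Alt-SysRq-t, assuming the sysrq
interface is enabled on the machine:)

   echo 1 > /proc/sys/kernel/sysrq   # enable all SysRq functions
   echo t > /proc/sysrq-trigger      # dump task states into dmesg
   dmesg > /var/log/sysrq-t.txt      # with /var/log on another filesystem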
>
> And I got it reproduced. *Perfectly* reproduced, I'd say.
>
> But let me run the whole story:
>
> 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
>
> Which gave me:
>
> merkaba:~> btrfs fi sh /home
> Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> Total devices 2 FS bytes used 144.19GiB
> devid 1 size 160.00GiB used 150.01GiB path /dev/mapper/msata-home
> devid 2 size 160.00GiB used 150.01GiB path /dev/mapper/sata-home
>
> Btrfs v3.17
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=144.98GiB, used=140.95GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.24GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
>
> 2) I ran the Virtualbox machine again and defragmented the NTFS filesystem
> in the VDI image file. And: it worked *just* fine. Fine as in *fine*. No
> issues whatsoever.
>
>
> I got this during the run:
>
> ATOP - merkaba 2014/12/27 12:58:42 ----------- 10s elapsed
> PRC | sys 10.41s | user 1.08s | #proc 357 | #trun 4 | #tslpi 694 | #tslpu 0 | #zombie 0 | no procacct |
> CPU | sys 107% | user 11% | irq 0% | idle 259% | wait 23% | guest 0% | curf 3.01GHz | curscal 93% |
> cpu | sys 29% | user 3% | irq 0% | idle 63% | cpu002 w 5% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 27% | user 3% | irq 0% | idle 65% | cpu000 w 5% | guest 0% | curf 3.03GHz | curscal 94% |
> cpu | sys 26% | user 3% | irq 0% | idle 63% | cpu003 w 8% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 24% | user 2% | irq 0% | idle 68% | cpu001 w 6% | guest 0% | curf 3.00GHz | curscal 93% |
> CPL | avg1 1.92 | avg5 1.01 | avg15 0.56 | | csw 501619 | intr 129279 | | numcpu 4 |
> MEM | tot 15.5G | free 610.1M | cache 9.1G | buff 0.1M | slab 1.0G | shmem 183.5M | vmbal 0.0M | hptot 0.0M |
> SWP | tot 12.0G | free 11.6G | | | | | vmcom 7.1G | vmlim 19.7G |
> PAG | scan 219141 | steal 215577 | stall 936 | | | | swin 0 | swout 940 |
> LVM | sata-home | busy 53% | read 181413 | write 0 | KiB/w 0 | MBr/s 70.86 | MBw/s 0.00 | avio 0.03 ms |
> LVM | sata-swap | busy 2% | read 0 | write 940 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.37 | avio 0.17 ms |
> LVM | sata-debian | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 1.00 ms |
> LVM | msata-debian | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms |
> DSK | sda | busy 53% | read 181413 | write 477 | KiB/w 7 | MBr/s 70.86 | MBw/s 0.37 | avio 0.03 ms |
> DSK | sdb | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms |
> NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
> NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> 9650 - martin martin 22 7.89s 0.65s 0K 128K 705.5M 382.1M -- - S 2 87% VirtualBox
> 9911 - root root 1 0.69s 0.01s 0K 0K 0K 0K -- - S 3 7% watch
> 9598 - root root 1 0.38s 0.00s 0K 0K 0K 20K -- - S 0 4% kworker/u8:9
> 9892 - root root 1 0.36s 0.00s 0K 0K 0K 0K -- - S 1 4% kworker/u8:17
> 9428 - root root 1 0.30s 0.00s 0K 0K 0K 0K -- - R 0 3% kworker/u8:3
> 9589 - root root 1 0.23s 0.00s 0K 0K 0K 0K -- - S 1 2% kworker/u8:6
> 4746 - martin martin 2 0.04s 0.13s 0K -16K 0K 0K -- - R 2 2% konsole
>
>
>
> Every 1,0s: cat /proc/meminfo Sat Dec 27 12:59:23 2014
>
> MemTotal: 16210512 kB
> MemFree: 786632 kB
> MemAvailable: 10271500 kB
> Buffers: 52 kB
> Cached: 9564340 kB
> SwapCached: 70268 kB
> Active: 6847560 kB
> Inactive: 5257956 kB
> Active(anon): 2016412 kB
> Inactive(anon): 703076 kB
> Active(file): 4831148 kB
> Inactive(file): 4554880 kB
> Unevictable: 9068 kB
> Mlocked: 9068 kB
> SwapTotal: 12582908 kB
> SwapFree: 12186680 kB
> Dirty: 972324 kB
> Writeback: 0 kB
> AnonPages: 2526340 kB
> Mapped: 2457096 kB
> Shmem: 173564 kB
> Slab: 918128 kB
> SReclaimable: 848816 kB
> SUnreclaim: 69312 kB
> KernelStack: 11200 kB
> PageTables: 64556 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 20688164 kB
> Committed_AS: 7438348 kB
>
>
>
> I am not seeing more than one GiB of dirty memory here during regular usage,
> and it is no problem.
>
> And the kworker thread CPU usage is just fine. So no, dirty_background_ratio
> isn't an issue with this 16 GiB ThinkPad T520. Please note: I have been doing
> Linux performance analysis and tuning courses for about 7 years now.
>
> I *know* these knobs. I may have used wrong terms regarding BTRFS, and my
> understanding of BTRFS space allocation could probably be more accurate, but
> I do think that I am onto something here. This is no rotating disk; it can
> handle the write burst just fine, and I generally do not tune where there is
> no need to tune. Here there isn't. And it wouldn't be much more than
> fine-tuning anyway.
>
> With slow devices, or with rsync over NFS, by all means reduce it. But here
> it simply isn't an issue, as you can see from the low kworker thread CPU
> usage and the general SSD usage above.
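(The writeback knobs under discussion can be inspected read-only like
this, for reference:)

   sysctl vm.dirty_background_ratio vm.dirty_ratio
   sysctl vm.dirty_expire_centisecs vm.dirty_writeback_centisecs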
>
>
> So defragmentation completed just fine, no issues so far.
>
> But I am close to full device space reservation already:
>
> merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> Sa 27. Dez 13:02:40 CET 2014
> Label: 'home' uuid: [some UUID]
> Total devices 2 FS bytes used 151.58GiB
> devid 1 size 160.00GiB used 158.01GiB path /dev/mapper/msata-home
> devid 2 size 160.00GiB used 158.01GiB path /dev/mapper/sata-home
>
>
>
> I thought I could trigger it again by defragmenting in Windows XP again, but
> mind you, it's defragmented already, so it doesn't do much. I did the sdelete
> dance just to trigger something, and well, I saw kworker a bit higher, but
> not much.
>
> But finally I got to:
>
>
>
>
> So I thought: if Virtualbox can write randomly into a file, I can too.
>
> So I did:
>
>
> martin@merkaba:~> cat ssd-test.fio
> [global]
> bs=4k
> #ioengine=libaio
> #iodepth=4
> size=4g
> #direct=1
> runtime=120
> filename=ssd.test.file
>
> [seq-write]
> rw=write
> stonewall
>
> [rand-write]
> rw=randwrite
> stonewall
>
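(Note: with the libaio, iodepth and direct lines commented out, fio falls
back to buffered I/O with ioengine=sync and iodepth=1, which is exactly
what the job banners in the output below report. To take the page cache
out of the picture, one could uncomment them, i.e.:)

   ioengine=libaio
   iodepth=4
   direct=1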
>
>
> And got:
>
> ATOP - merkaba 2014/12/27 13:41:02 ----------- 10s elapsed
> PRC | sys 10.14s | user 0.38s | #proc 332 | #trun 2 | #tslpi 548 | #tslpu 0 | #zombie 0 | no procacct |
> CPU | sys 102% | user 4% | irq 0% | idle 295% | wait 0% | guest 0% | curf 3.10GHz | curscal 96% |
> cpu | sys 76% | user 0% | irq 0% | idle 24% | cpu001 w 0% | guest 0% | curf 3.20GHz | curscal 99% |
> cpu | sys 24% | user 1% | irq 0% | idle 75% | cpu000 w 0% | guest 0% | curf 3.19GHz | curscal 99% |
> cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> CPL | avg1 0.82 | avg5 0.78 | avg15 0.99 | | csw 6233 | intr 12023 | | numcpu 4 |
> MEM | tot 15.5G | free 4.0G | cache 9.7G | buff 0.0M | slab 333.1M | shmem 206.6M | vmbal 0.0M | hptot 0.0M |
> SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> LVM | sata-home | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> DSK | sda | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
> NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
> 18079 - martin martin 2 9.99s 0.00s 0K 0K 0K 16K -- - R 1 100% fio
> 4746 - martin martin 2 0.01s 0.14s 0K 0K 0K 0K -- - S 2 2% konsole
> 3291 - martin martin 4 0.01s 0.11s 0K 0K 0K 0K -- - S 0 1% plasma-desktop
> 1488 - root root 1 0.03s 0.04s 0K 0K 0K 0K -- - S 0 1% Xorg
> 10036 - root root 1 0.04s 0.02s 0K 0K 0K 0K -- - R 2 1% atop
>
> while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU
> for 10 seconds while allocating a 4 GiB file on a filesystem like:
>
> martin@merkaba:~> LANG=C df -hT /home
> Filesystem Type Size Used Avail Use% Mounted on
> /dev/mapper/msata-home btrfs 170G 156G 17G 91% /home
>
> where a 4 GiB file should easily fit, no? (And this output is with the 4
> GiB file, so there was even 4 GiB more free before.)
No. /usr/bin/df is an _approximation_ on BTRFS because of the limits of
the statfs() system call it relies on. That interface was defined around
1990 and "can't understand" the dynamic allocation model used in BTRFS,
as it assumes fixed geometry for filesystems. You do _not_ have 17G
actually available. You need to rely on btrfs fi df and btrfs fi show to
figure out how much space you _really_ have.
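(For reference, the two views side by side; btrfs fi usage, available in
newer btrfs-progs, combines the last two:)

   df -hT /home          # statfs-based approximation
   btrfs fi show /home   # per-device chunk allocation
   btrfs fi df /home     # per-type (data/metadata/system) usage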
According to this block you have a RAID1 of ~160GiB expanse (two 160GiB devices):
> merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> Sa 27. Dez 13:26:39 CET 2014
> Label: 'home' uuid: [some UUID]
> Total devices 2 FS bytes used 152.83GiB
> devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
And according to this block you have about 5.39GiB of free data space:
> Btrfs v3.17
> Data, RAID1: total=154.97GiB, used=149.58GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.26GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
  154.97
    5.00
    0.032
+   0.512
---------
~160.51

Pretty much as close to 160GiB as you are going to get (those numbers
being rounded in places for "human readability"). BTRFS has allocated
100% of the raw storage into typed chunks.
A large datafile can only fit in the 154.97 - 149.58 = 5.39GiB of unused
data space. Trying to allocate that 4GiB file into that 5.39GiB becomes
an effectively NP-complete (i.e. "very hard") problem if the free space
is badly fragmented.
I also don't know what kind of tool you are using, but it might be
repeatedly trying and failing to fallocate the file as a single extent
or something equally dumb.
If the tool that takes those .fio files "isn't smart" about transient
allocation failures, it might be trying the same allocation again, and
again, and again, and again... forever... which is not a problem with
BTRFS, but which _would_ lead to runaway CPU usage with no actual disk
activity.
So try again with more normal tools and see if you can allocate 4GiB:

   dd if=/dev/urandom of=file bs=1M count=4096

Does that create a four-gig file? It probably works fine.
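(An even more direct probe, sketched on the assumption that allocation
itself is the suspect -- ask for the space without writing any data:)

   fallocate -l 4G testfile && echo "4GiB reservation OK"
   rm testfile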
You need to isolate not "overall CPU usage" but _what_ program is doing
what, and why. So strace your fio program, or whatever it is, to see
what function call(s) it is making and what is being returned.
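(A hypothetical invocation -- trace the space-related syscalls with
timings, then look for a retry storm:)

   strace -f -T -e trace=fallocate,write,fsync -o /tmp/fio.trace fio ssd-test.fio
   grep -c ENOSPC /tmp/fio.trace   # failed allocations, if any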
But seriously dude, if the dd works and the fio doesn't, then that's a
problem with fio.
(I've got _zero_ idea what fio is, but if it does "testing" and
repeatedly writes random bits of the file, then since you've only got
5.39G of space it's likely going to have a lot of problems doing
_anything_ "intensive" to a COW file of 4G.)

So yeah, that simultaneous write/rewrite test is going to fail. You
don't have enough room to permute that file.
None of the results below "surprise me" given that you _don't_ have
enough room to do the tests you (seem to have) initiated on a COW file.
Minimum likely needed space is just under 8GiB. Maximum could be much,
much larger.
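(Rough arithmetic behind that figure, assuming default COW behavior --
rewritten blocks get new extents while the old ones stay referenced
until the transaction commits:)

     4 GiB  original extents
   + 4 GiB  freshly written CoW copies
   ------
   ~ 8 GiB  needed, against ~5.39GiB of unused data space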
>
>
> But it gets even more visible:
>
> martin@merkaba:~> fio ssd-test.fio
> seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.1.11
> Starting 2 processes
> Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]
>
>
> yes, that's 0 IOPS.
>
> 0 IOPS as in zero IOPS. For minutes.
>
>
>
> And here is why:
>
> ATOP - merkaba 2014/12/27 13:46:52 ----------- 10s elapsed
> PRC | sys 10.77s | user 0.31s | #proc 334 | #trun 2 | #tslpi 548 | #tslpu 3 | #zombie 0 | no procacct |
> CPU | sys 108% | user 3% | irq 0% | idle 286% | wait 2% | guest 0% | curf 3.08GHz | curscal 96% |
> cpu | sys 72% | user 1% | irq 0% | idle 28% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 19% | user 0% | irq 0% | idle 81% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 11% | user 1% | irq 0% | idle 87% | cpu003 w 1% | guest 0% | curf 3.19GHz | curscal 99% |
> cpu | sys 6% | user 1% | irq 0% | idle 91% | cpu002 w 1% | guest 0% | curf 3.11GHz | curscal 97% |
> CPL | avg1 2.78 | avg5 1.34 | avg15 1.12 | | csw 50192 | intr 32379 | | numcpu 4 |
> MEM | tot 15.5G | free 5.0G | cache 8.7G | buff 0.0M | slab 332.6M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> LVM | sata-home | busy 5% | read 160 | write 11177 | KiB/w 3 | MBr/s 0.06 | MBw/s 4.36 | avio 0.05 ms |
> LVM | msata-home | busy 4% | read 28 | write 11177 | KiB/w 3 | MBr/s 0.01 | MBw/s 4.36 | avio 0.04 ms |
> LVM | sata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> LVM | msata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> DSK | sda | busy 5% | read 160 | write 10200 | KiB/w 4 | MBr/s 0.06 | MBw/s 4.69 | avio 0.05 ms |
> DSK | sdb | busy 4% | read 28 | write 10558 | KiB/w 4 | MBr/s 0.01 | MBw/s 4.69 | avio 0.04 ms |
> NET | transport | tcpi 35 | tcpo 33 | udpi 3 | udpo 3 | tcpao 2 | tcppo 1 | tcprs 0 |
> NET | network | ipi 38 | ipo 36 | ipfrw 0 | deliv 38 | | icmpi 0 | icmpo 0 |
> NET | eth0 0% | pcki 22 | pcko 20 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> 14973 - root root 1 8.92s 0.00s 0K 0K 0K 144K -- - S 0 89% kworker/u8:14
> 17450 - root root 1 0.86s 0.00s 0K 0K 0K 32K -- - R 3 9% kworker/u8:5
> 788 - root root 1 0.25s 0.00s 0K 0K 128K 18880K -- - S 3 3% btrfs-transact
> 12254 - root root 1 0.14s 0.00s 0K 0K 64K 576K -- - S 2 1% kworker/u8:3
> 17332 - root root 1 0.11s 0.00s 0K 0K 112K 1348K -- - S 2 1% kworker/u8:4
> 3291 - martin martin 4 0.01s 0.09s 0K 0K 0K 0K -- - S 1 1% plasma-deskto
>
>
>
>
> ATOP - merkaba 2014/12/27 13:47:12 ----------- 10s elapsed
> PRC | sys 10.78s | user 0.44s | #proc 334 | #trun 3 | #tslpi 547 | #tslpu 3 | #zombie 0 | no procacct |
> CPU | sys 106% | user 4% | irq 0% | idle 288% | wait 1% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 93% | user 0% | irq 0% | idle 7% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 7% | user 0% | irq 0% | idle 93% | cpu003 w 0% | guest 0% | curf 3.01GHz | curscal 94% |
> cpu | sys 3% | user 2% | irq 0% | idle 94% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> CPL | avg1 3.33 | avg5 1.56 | avg15 1.20 | | csw 38253 | intr 23104 | | numcpu 4 |
> MEM | tot 15.5G | free 4.9G | cache 8.7G | buff 0.0M | slab 336.5M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> LVM | msata-home | busy 2% | read 0 | write 2337 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.91 | avio 0.07 ms |
> LVM | sata-home | busy 2% | read 36 | write 2337 | KiB/w 3 | MBr/s 0.01 | MBw/s 0.91 | avio 0.07 ms |
> LVM | msata-debian | busy 1% | read 1 | write 1630 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.65 | avio 0.03 ms |
> LVM | sata-debian | busy 0% | read 0 | write 1019 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.41 | avio 0.02 ms |
> DSK | sdb | busy 2% | read 1 | write 2545 | KiB/w 5 | MBr/s 0.00 | MBw/s 1.45 | avio 0.07 ms |
> DSK | sda | busy 1% | read 36 | write 2461 | KiB/w 5 | MBr/s 0.01 | MBw/s 1.28 | avio 0.06 ms |
> NET | transport | tcpi 20 | tcpo 20 | udpi 1 | udpo 1 | tcpao 1 | tcppo 1 | tcprs 0 |
> NET | network | ipi 21 | ipo 21 | ipfrw 0 | deliv 21 | | icmpi 0 | icmpo 0 |
> NET | eth0 0% | pcki 5 | pcko 5 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> 17450 - root root 1 9.96s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:5
> 4746 - martin martin 2 0.06s 0.15s 0K 0K 0K 0K -- - S 1 2% konsole
> 10508 - root root 1 0.13s 0.00s 0K 0K 96K 4048K -- - S 1 1% kworker/u8:18
> 1488 - root root 1 0.06s 0.06s 0K 0K 0K 0K -- - S 0 1% Xorg
> 17332 - root root 1 0.12s 0.00s 0K 0K 96K 580K -- - R 3 1% kworker/u8:4
> 17454 - root root 1 0.11s 0.00s 0K 0K 32K 4416K -- - D 1 1% kworker/u8:6
> 17516 - root root 1 0.09s 0.00s 0K 0K 16K 136K -- - S 3 1% kworker/u8:7
> 3268 - martin martin 3 0.02s 0.05s 0K 0K 0K 0K -- - S 1 1% kwin
> 10036 - root root 1 0.05s 0.02s 0K 0K 0K 0K -- - R 0 1% atop
>
>
>
> So BTRFS is basically busy with itself and nothing else. Look at the SSD
> usage: they are *idling* around. Heck, 2400 write accesses in 10 seconds.
> That's a joke for SSDs that can do 40000 IOPS (depending on how and what
> you measure, of course: request size, read, write, iodepth and so on).
>
> It's kworker/u8:5 utilizing 100% of one core for minutes.
>
>
>
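(One way to see where such a kworker burns its CPU -- hypothetical,
assuming perf and kernel symbols are available:)

   perf top -g                                  # live kernel call chains
   perf record -g -p <kworker-pid> -- sleep 10  # or sample just that thread
   perf report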
> It's the random write case, it seems. Here are values from the fio job:
>
> martin@merkaba:~> fio ssd-test.fio
> seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.1.11
> Starting 2 processes
> Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
> write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> clat percentiles (usec):
> | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
> | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5],
> | 70.00th=[ 7], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ 9],
> | 99.00th=[ 14], 99.50th=[ 20], 99.90th=[ 211], 99.95th=[ 2128],
> | 99.99th=[10304]
> bw (KB /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
> lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> cpu : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Seems fine.
>
>
> But:
>
> rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> clat percentiles (usec):
> | 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5],
> | 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6],
> | 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10],
> | 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116],
> | 99.99th=[16711680]
> bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
> lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
> WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
>
> Run status group 1 (all jobs):
> WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
>
>
> What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
>
> What?
>
> Hey, *what*?
>
>
>
> Repeating with the random write case.
>
> It's a different kworker now, but a similar result:
>
> ATOP - merkaba 2014/12/27 13:51:48 ----------- 10s elapsed
> PRC | sys 10.66s | user 0.25s | #proc 330 | #trun 2 | #tslpi 545 | #tslpu 2 | #zombie 0 | no procacct |
> CPU | sys 105% | user 3% | irq 0% | idle 292% | wait 0% | guest 0% | curf 3.07GHz | curscal 95% |
> cpu | sys 92% | user 0% | irq 0% | idle 8% | cpu002 w 0% | guest 0% | curf 3.19GHz | curscal 99% |
> cpu | sys 8% | user 0% | irq 0% | idle 92% | cpu003 w 0% | guest 0% | curf 3.09GHz | curscal 96% |
> cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 2% | user 1% | irq 0% | idle 97% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> CPL | avg1 1.00 | avg5 1.32 | avg15 1.23 | | csw 34484 | intr 23182 | | numcpu 4 |
> MEM | tot 15.5G | free 5.4G | cache 8.3G | buff 0.0M | slab 334.8M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> LVM | sata-home | busy 1% | read 36 | write 2502 | KiB/w 4 | MBr/s 0.01 | MBw/s 0.98 | avio 0.06 ms |
> LVM | msata-home | busy 1% | read 48 | write 2502 | KiB/w 4 | MBr/s 0.02 | MBw/s 0.98 | avio 0.04 ms |
> LVM | msata-debian | busy 0% | read 0 | write 6 | KiB/w 7 | MBr/s 0.00 | MBw/s 0.00 | avio 1.33 ms |
> LVM | sata-debian | busy 0% | read 0 | write 6 | KiB/w 7 | MBr/s 0.00 | MBw/s 0.00 | avio 0.17 ms |
> DSK | sda | busy 1% | read 36 | write 2494 | KiB/w 4 | MBr/s 0.01 | MBw/s 0.98 | avio 0.06 ms |
> DSK | sdb | busy 1% | read 48 | write 2494 | KiB/w 4 | MBr/s 0.02 | MBw/s 0.98 | avio 0.04 ms |
> NET | transport | tcpi 32 | tcpo 30 | udpi 2 | udpo 2 | tcpao 2 | tcppo 1 | tcprs 0 |
> NET | network | ipi 35 | ipo 32 | ipfrw 0 | deliv 35 | | icmpi 0 | icmpo 0 |
> NET | eth0 0% | pcki 19 | pcko 16 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
> 11746 - root root 1 10.00s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:0
> 12254 - root root 1 0.16s 0.00s 0K 0K 112K 1712K -- - S 3 2% kworker/u8:3
> 17517 - root root 1 0.16s 0.00s 0K 0K 144K 1764K -- - S 1 2% kworker/u8:8
>
>
>
> And now the graphical environment is locked. Continuing on TTY1.
>
> Doing another fio job with tee so I can get the output easily.
>
> Wow! I wonder whether this is reproducible with a fresh BTRFS with fio
> stressing it.
>
> Like a 10 GiB BTRFS with a 5 GiB fio test file, just letting it run.
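(A minimal sketch of such a repro, with hypothetical paths, using a loop
device so no real partition is needed:)

   truncate -s 10G /tmp/btrfs.img
   mkfs.btrfs -f /tmp/btrfs.img
   mount -o loop,compress=lzo /tmp/btrfs.img /mnt
   cd /mnt && fio --name=rand-write --rw=randwrite --bs=4k --size=5g \
       --filename=ssd.test.file --runtime=120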
>
>
> Okay, I let the final fio job complete and include the output here.
>
>
> Okay, and there we are, and I do have the sysrq-t output.
>
> Okay, this is 1.2 MiB xz-packed, so I had better start a bug report about
> this and attach it there. I dislike cloud URLs that may disappear at some
> point.
>
>
>
> Now please finally acknowledge that there is an issue. Maybe I was not
> using the correct terms at the beginning, but there is a real issue. I
> have been doing performance work for at least half a decade; I know an
> issue when I see it.
>
>
>
>
> There we go:
>
> Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>
> Thanks,
>