From: Martin Steigerwald
To: Hugo Mills
Cc: Robert White, linux-btrfs@vger.kernel.org
Subject: Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)
Date: Sun, 28 Dec 2014 16:00:22 +0100
Message-ID: <1614719.u339PY3u4L@merkaba>
In-Reply-To: <2574221.IMnk8vTl9R@merkaba>
References: <3738341.y7uRQFcLJH@merkaba> <4646600.pkiZfU6OfG@merkaba> <2574221.IMnk8vTl9R@merkaba>

On Sunday, 28 December 2014, 14:56:21, Martin Steigerwald wrote:
> On Sunday, 28 December 2014, 14:40:32, Martin Steigerwald wrote:
> > On Sunday, 28 December 2014, 14:00:19, Martin Steigerwald wrote:
> > > On Saturday, 27 December 2014, 14:55:58, Martin Steigerwald wrote:
> > > > Summarized at
> > > >
> > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core
> > > > for minutes on random write into big file
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > > >
> > > > see below. This is reproducible with fio; there is no need for
> > > > Windows XP in VirtualBox to reproduce the issue. Next I will try to
> > > > reproduce it with a freshly created filesystem.
> > > >
> > > > On Saturday, 27 December 2014, 09:30:43, Hugo Mills wrote:
> > > > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > > > On Friday, 26 December 2014, 14:48:38, Robert White wrote:
> > > > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > > > > Hello!
> > > > > > > >
> > > > > > > > First: Have a merry Christmas and enjoy a quiet time in
> > > > > > > > these days.
> > > > > > > >
> > > > > > > > Second: At a time you feel like it, here is a little rant,
> > > > > > > > but also a bug report:
> > > > > > > >
> > > > > > > > I have this on a 3.18 kernel on Debian Sid with a BTRFS
> > > > > > > > dual-SSD RAID with space_cache, skinny metadata extents
> > > > > > > > – are these a problem? – and compress=lzo:
> > > > > > >
> > > > > > > (There is no known problem with skinny metadata; it is actually
> > > > > > > more efficient than the older format. There have been some
> > > > > > > anecdotes about mixing skinny and fat metadata, but nothing has
> > > > > > > ever been demonstrated to be problematic.)
> > > > > > >
> > > > > > > > merkaba:~> btrfs fi sh /home
> > > > > > > > Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > > > > >         Total devices 2 FS bytes used 144.41GiB
> > > > > > > >         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > > > > > > >         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > > > > > > >
> > > > > > > > Btrfs v3.17
> > > > > > > > merkaba:~> btrfs fi df /home
> > > > > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > > > >
> > > > > > > This filesystem, at the allocation level, is "very full"
> > > > > > > (see below).
> > > > > > >
> > > > > > > > And I had hangs with BTRFS again. This time as I wanted to
> > > > > > > > install tax return software in a VirtualBox'd Windows XP VM
> > > > > > > > (which I use once a year because I know of no tax return
> > > > > > > > software for Linux that would be suitable for Germany, and I
> > > > > > > > frankly don't care about the end of security support, because
> > > > > > > > I do all surfing and other network access from the Linux box
> > > > > > > > and only run the VM behind a firewall).
> > > > > > > >
> > > > > > > > And thus I try the balance dance again:
> > > > > > >
> > > > > > > ITEM: Balance... it doesn't do what you think it does...
> > > > > > >
> > > > > > > "Balancing" is something you should almost never need to do.
> > > > > > > It is only for cases of changing geometry (adding disks,
> > > > > > > switching RAID levels, etc.) or for cases when you've radically
> > > > > > > changed allocation behaviors (like you decided to remove all
> > > > > > > your VMs, or you've decided to remove a mail spool directory
> > > > > > > full of thousands of tiny files).
> > > > > > >
> > > > > > > People run balance all the time because they think they should.
> > > > > > > They are _usually_ incorrect in that belief.
> > > > > >
> > > > > > I only see the lockups of BTRFS if the trees *occupy* all space
> > > > > > on the device.
> > > > >
> > > > >    No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored
> > > > > metadata space. What's more, balance does *not* balance the
> > > > > metadata trees. The remaining space -- 154.97 GiB -- is
> > > > > unstructured storage for file data, and you have some 13 GiB of
> > > > > that available for use.
> > > > >
> > > > >    Now, since you're seeing lockups when the space on your disks
> > > > > is all allocated, I'd say that's a bug. However, you're the *only*
> > > > > person who's reported this as a regular occurrence. Does this
> > > > > happen with all filesystems you have, or just this one?
> > > > >
> > > > > > I *never* so far saw it lock up if there is still space BTRFS
> > > > > > can allocate from to *extend* a tree.
> > > > >
> > > > >    It's not a tree. It's simply space allocation. It's not even
> > > > > space *usage* you're talking about here -- it's just allocation
> > > > > (i.e. the FS saying "I'm going to use this piece of disk for this
> > > > > purpose").
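To make that allocation-versus-usage distinction concrete with the
figures quoted above -- just arithmetic on the btrfs fi sh / fi df
numbers, nothing filesystem-specific:

  awk 'BEGIN {
      # free space *within* already-allocated chunks (from btrfs fi df):
      printf "data free in chunks:     %6.2f GiB\n", 154.97 - 141.12
      printf "metadata free in chunks: %6.2f GiB\n",   5.00 -   3.29
      # unallocated device space (from btrfs fi sh): size minus used:
      printf "unallocated per device:  %6.2f GiB\n", 160.00 - 160.00
  }'

So there are roughly 13.85 GiB of data space left inside the chunks,
but zero device space left to allocate new chunks from.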
> > > > > > This may be a bug, but this is what I see.
> > > > > >
> > > > > > And no amount of "you should not balance a BTRFS" will make
> > > > > > that perception go away.
> > > > > >
> > > > > > See, I see the sun coming out in the morning and you tell me
> > > > > > "no, it doesn't". That is simply not going to match my
> > > > > > perception.
> > > > >
> > > > >    Duncan's assertion is correct in its detail. Looking at your
> > > >
> > > > Robert's
> > > >
> > > > > space usage, I would not suggest that running a balance is
> > > > > something you need to do. Now, since you have these lockups that
> > > > > seem quite repeatable, there's probably a lurking bug in there,
> > > > > but hacking around with balance every time you hit it isn't going
> > > > > to get the problem solved properly.
> > > > >
> > > > >    I think I would suggest the following:
> > > > >
> > > > >    - make sure you have some way of logging your dmesg permanently
> > > > >      (use a different filesystem for /var/log, or a serial
> > > > >      console, or a netconsole)
> > > > >
> > > > >    - when the lockup happens, hit Alt-SysRq-t a few times
> > > > >
> > > > >    - send the dmesg output here, or post it to bugzilla.kernel.org
> > > > >
> > > > >    That's probably going to give enough information to the
> > > > > developers to work out where the lockup is happening, and is
> > > > > clearly the way forward here.
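For reference, a minimal sketch of such a capture, assuming a shell is
still responsive during the lockup and that /var/log/other-fs sits on a
different filesystem (both paths are just examples):

  # Allow the 't' SysRq, then trigger the task dump from a shell --
  # equivalent to hitting Alt-SysRq-t on the console:
  echo 1 > /proc/sys/kernel/sysrq
  echo t > /proc/sysrq-trigger

  # The dump lands in the kernel ring buffer; several dumps can overflow
  # it, so save it away immediately, on a filesystem that is not affected:
  dmesg > /var/log/other-fs/btrfs-lockup-task-dump.txt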
> > > > And I got it reproduced. *Perfectly* reproduced, I'd say.
> > > >
> > > > But let me run through the whole story:
> > > >
> > > > 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB
> > > >    again.
> > >
> > > [… story of trying to reproduce with Windows XP defragmenting, which
> > > was unsuccessful as BTRFS still had free device space to allocate new
> > > chunks from …]
> > >
> > > > But finally I got to:
> > > >
> > > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > > > Sa 27. Dez 13:26:39 CET 2014
> > > > Label: 'home'  uuid: [some UUID]
> > > >         Total devices 2 FS bytes used 152.83GiB
> > > >         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > > >         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > > >
> > > > Btrfs v3.17
> > > > Data, RAID1: total=154.97GiB, used=149.58GiB
> > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > >
> > > > So I thought: if VirtualBox can write randomly into a file, I can
> > > > too.
> > > >
> > > > So I did:
> > > >
> > > > martin@merkaba:~> cat ssd-test.fio
> > > > [global]
> > > > bs=4k
> > > > #ioengine=libaio
> > > > #iodepth=4
> > > > size=4g
> > > > #direct=1
> > > > runtime=120
> > > > filename=ssd.test.file
> > > >
> > > > [seq-write]
> > > > rw=write
> > > > stonewall
> > > >
> > > > [rand-write]
> > > > rw=randwrite
> > > > stonewall
> > > >
> > > > And got:
> > > >
> > > > ATOP - merkaba  2014/12/27 13:41:02  ----------- 10s elapsed
> > > > PRC | sys 10.14s | user 0.38s | #proc 332 | #trun 2 | #tslpi 548 | #tslpu 0 | #zombie 0 | no procacct |
> > > > CPU | sys 102% | user 4% | irq 0% | idle 295% | wait 0% | guest 0% | curf 3.10GHz | curscal 96% |
> > > > cpu | sys 76% | user 0% | irq 0% | idle 24% | cpu001 w 0% | guest 0% | curf 3.20GHz | curscal 99% |
> > > > cpu | sys 24% | user 1% | irq 0% | idle 75% | cpu000 w 0% | guest 0% | curf 3.19GHz | curscal 99% |
> > > > cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > CPL | avg1 0.82 | avg5 0.78 | avg15 0.99 | | csw 6233 | intr 12023 | | numcpu 4 |
> > > > MEM | tot 15.5G | free 4.0G | cache 9.7G | buff 0.0M | slab 333.1M | shmem 206.6M | vmbal 0.0M | hptot 0.0M |
> > > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > > LVM | sata-home | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> > > > DSK | sda | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> > > > NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
> > > > NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 |
> > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > >
> > > >   PID TID RUID   EUID   THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR  CPU CMD
> > > > 18079   - martin martin   2  9.99s  0.00s    0K    0K    0K   16K --   - R     1 100% fio
> > > >  4746   - martin martin   2  0.01s  0.14s    0K    0K    0K    0K --   - S     2   2% konsole
> > > >  3291   - martin martin   4  0.01s  0.11s    0K    0K    0K    0K --   - S     0   1% plasma-desktop
> > > >  1488   - root   root     1  0.03s  0.04s    0K    0K    0K    0K --   - S     0   1% Xorg
> > > > 10036   - root   root     1  0.04s  0.02s    0K    0K    0K    0K --   - R     2   1% atop
> > > >
> > > > while fio was just *laying out* the 4 GiB file. Yes, that's 100%
> > > > system CPU for 10 seconds while allocating a 4 GiB file on a
> > > > filesystem like this:
> > > >
> > > > martin@merkaba:~> LANG=C df -hT /home
> > > > Filesystem             Type   Size  Used Avail Use% Mounted on
> > > > /dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
> > > >
> > > > where a 4 GiB file should easily fit, no? (And this output is with
> > > > the 4 GiB file in place, so there was even 4 GiB more free before.)
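(The commented-out lines in that quoted job file are the knobs for a
direct I/O variant of the same test, bypassing the page cache -- more
on the buffered-versus-direct distinction further below. Untested as
such here, but the variant would look like this:)

  [global]
  bs=4k
  ioengine=libaio    ; enabled for the direct I/O variant
  iodepth=4
  direct=1           ; bypass the page cache entirely
  size=4g
  runtime=120
  filename=ssd.test.file

  [rand-write]
  rw=randwrite
  stonewall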
> > > > But it gets even more visible:
> > > >
> > > > martin@merkaba:~> fio ssd-test.fio
> > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > fio-2.1.11
> > > > Starting 2 processes
> > > > Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]
> > > >
> > > > Yes, that's 0 IOPS.
> > > >
> > > > 0 IOPS, as in zero IOPS. For minutes.
> > > >
> > > > And here is why:
> > > >
> > > > ATOP - merkaba  2014/12/27 13:46:52  ----------- 10s elapsed
> > > > PRC | sys 10.77s | user 0.31s | #proc 334 | #trun 2 | #tslpi 548 | #tslpu 3 | #zombie 0 | no procacct |
> > > > CPU | sys 108% | user 3% | irq 0% | idle 286% | wait 2% | guest 0% | curf 3.08GHz | curscal 96% |
> > > > cpu | sys 72% | user 1% | irq 0% | idle 28% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 19% | user 0% | irq 0% | idle 81% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 11% | user 1% | irq 0% | idle 87% | cpu003 w 1% | guest 0% | curf 3.19GHz | curscal 99% |
> > > > cpu | sys 6% | user 1% | irq 0% | idle 91% | cpu002 w 1% | guest 0% | curf 3.11GHz | curscal 97% |
> > > > CPL | avg1 2.78 | avg5 1.34 | avg15 1.12 | | csw 50192 | intr 32379 | | numcpu 4 |
> > > > MEM | tot 15.5G | free 5.0G | cache 8.7G | buff 0.0M | slab 332.6M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> > > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > > LVM | sata-home | busy 5% | read 160 | write 11177 | KiB/w 3 | MBr/s 0.06 | MBw/s 4.36 | avio 0.05 ms |
> > > > LVM | msata-home | busy 4% | read 28 | write 11177 | KiB/w 3 | MBr/s 0.01 | MBw/s 4.36 | avio 0.04 ms |
> > > > LVM | sata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> > > > LVM | msata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> > > > DSK | sda | busy 5% | read 160 | write 10200 | KiB/w 4 | MBr/s 0.06 | MBw/s 4.69 | avio 0.05 ms |
> > > > DSK | sdb | busy 4% | read 28 | write 10558 | KiB/w 4 | MBr/s 0.01 | MBw/s 4.69 | avio 0.04 ms |
> > > > NET | transport | tcpi 35 | tcpo 33 | udpi 3 | udpo 3 | tcpao 2 | tcppo 1 | tcprs 0 |
> > > > NET | network | ipi 38 | ipo 36 | ipfrw 0 | deliv 38 | | icmpi 0 | icmpo 0 |
> > > > NET | eth0 0% | pcki 22 | pcko 20 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > >
> > > >   PID TID RUID   EUID   THR SYSCPU USRCPU VGROW RGROW RDDSK  WRDSK ST EXC S CPUNR CPU CMD
> > > > 14973   - root   root     1  8.92s  0.00s    0K    0K    0K   144K --   - S     0 89% kworker/u8:14
> > > > 17450   - root   root     1  0.86s  0.00s    0K    0K    0K    32K --   - R     3  9% kworker/u8:5
> > > >   788   - root   root     1  0.25s  0.00s    0K    0K  128K 18880K --   - S     3  3% btrfs-transact
> > > > 12254   - root   root     1  0.14s  0.00s    0K    0K   64K   576K --   - S     2  1% kworker/u8:3
> > > > 17332   - root   root     1  0.11s  0.00s    0K    0K  112K  1348K --   - S     2  1% kworker/u8:4
> > > >  3291   - martin martin   4  0.01s  0.09s    0K    0K    0K     0K --   - S     1  1% plasma-desktop
> > > > ATOP - merkaba  2014/12/27 13:47:12  ----------- 10s elapsed
> > > > PRC | sys 10.78s | user 0.44s | #proc 334 | #trun 3 | #tslpi 547 | #tslpu 3 | #zombie 0 | no procacct |
> > > > CPU | sys 106% | user 4% | irq 0% | idle 288% | wait 1% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 93% | user 0% | irq 0% | idle 7% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 7% | user 0% | irq 0% | idle 93% | cpu003 w 0% | guest 0% | curf 3.01GHz | curscal 94% |
> > > > cpu | sys 3% | user 2% | irq 0% | idle 94% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > CPL | avg1 3.33 | avg5 1.56 | avg15 1.20 | | csw 38253 | intr 23104 | | numcpu 4 |
> > > > MEM | tot 15.5G | free 4.9G | cache 8.7G | buff 0.0M | slab 336.5M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> > > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > > LVM | msata-home | busy 2% | read 0 | write 2337 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.91 | avio 0.07 ms |
> > > > LVM | sata-home | busy 2% | read 36 | write 2337 | KiB/w 3 | MBr/s 0.01 | MBw/s 0.91 | avio 0.07 ms |
> > > > LVM | msata-debian | busy 1% | read 1 | write 1630 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.65 | avio 0.03 ms |
> > > > LVM | sata-debian | busy 0% | read 0 | write 1019 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.41 | avio 0.02 ms |
> > > > DSK | sdb | busy 2% | read 1 | write 2545 | KiB/w 5 | MBr/s 0.00 | MBw/s 1.45 | avio 0.07 ms |
> > > > DSK | sda | busy 1% | read 36 | write 2461 | KiB/w 5 | MBr/s 0.01 | MBw/s 1.28 | avio 0.06 ms |
> > > > NET | transport | tcpi 20 | tcpo 20 | udpi 1 | udpo 1 | tcpao 1 | tcppo 1 | tcprs 0 |
> > > > NET | network | ipi 21 | ipo 21 | ipfrw 0 | deliv 21 | | icmpi 0 | icmpo 0 |
> > > > NET | eth0 0% | pcki 5 | pcko 5 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 |
> > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > >
> > > >   PID TID RUID   EUID   THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR  CPU CMD
> > > > 17450   - root   root     1  9.96s  0.00s    0K    0K    0K    0K --   - R     2 100% kworker/u8:5
> > > >  4746   - martin martin   2  0.06s  0.15s    0K    0K    0K    0K --   - S     1   2% konsole
> > > > 10508   - root   root     1  0.13s  0.00s    0K    0K   96K 4048K --   - S     1   1% kworker/u8:18
> > > >  1488   - root   root     1  0.06s  0.06s    0K    0K    0K    0K --   - S     0   1% Xorg
> > > > 17332   - root   root     1  0.12s  0.00s    0K    0K   96K  580K --   - R     3   1% kworker/u8:4
> > > > 17454   - root   root     1  0.11s  0.00s    0K    0K   32K 4416K --   - D     1   1% kworker/u8:6
> > > > 17516   - root   root     1  0.09s  0.00s    0K    0K   16K  136K --   - S     3   1% kworker/u8:7
> > > >  3268   - martin martin   3  0.02s  0.05s    0K    0K    0K    0K --   - S     1   1% kwin
> > > > 10036   - root   root     1  0.05s  0.02s    0K    0K    0K    0K --   - R     0   1% atop
> > > >
> > > > So BTRFS is basically busy with itself and nothing else. Look at the
> > > > SSD usage: they are *idling* around. Heck, 2400 write accesses in 10
> > > > seconds. That's a joke for SSDs that can do 40000 IOPS (depending on
> > > > how and what you measure, of course -- request size, read versus
> > > > write, iodepth and so on).
> > > >
> > > > It's kworker/u8:5 utilizing 100% of one core for minutes.
> > > >
> > > > It seems to be the random write case.
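As an aside: a sketch of how one could see where such a kworker burns
its time -- the PID (17450) is the one from the atop sample above;
/proc/<pid>/stack needs root and CONFIG_STACKTRACE, and perf needs the
matching linux-tools package:

  # Snapshot of the kworker's current kernel stack:
  cat /proc/17450/stack

  # Or sample it for ten seconds and look at the hottest call chains:
  perf record -g -p 17450 -- sleep 10
  perf report --stdio | head -40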
> > > > Here are values from the fio job:
> > > >
> > > > martin@merkaba:~> fio ssd-test.fio
> > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > fio-2.1.11
> > > > Starting 2 processes
> > > > Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> > > > seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
> > > >   write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> > > >     clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> > > >      lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> > > >     clat percentiles (usec):
> > > >      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
> > > >      | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
> > > >      | 70.00th=[    7], 80.00th=[    8], 90.00th=[    8], 95.00th=[    9],
> > > >      | 99.00th=[   14], 99.50th=[   20], 99.90th=[  211], 99.95th=[ 2128],
> > > >      | 99.99th=[10304]
> > > >     bw (KB  /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
> > > >     lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> > > >     lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> > > >     lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> > > >   cpu          : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> > > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > >      issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> > > >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > > >
> > > > Seems fine.
> > > >
> > > > But:
> > > >
> > > > rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> > > >   write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> > > >     clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> > > >      lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> > > >     clat percentiles (usec):
> > > >      |  1.00th=[       4],  5.00th=[       5], 10.00th=[       5], 20.00th=[       5],
> > > >      | 30.00th=[       6], 40.00th=[       6], 50.00th=[       6], 60.00th=[       6],
> > > >      | 70.00th=[       7], 80.00th=[       7], 90.00th=[       9], 95.00th=[      10],
> > > >      | 99.00th=[      18], 99.50th=[      19], 99.90th=[      28], 99.95th=[     116],
> > > >      | 99.99th=[16711680]
> > > >     bw (KB  /s): min=    0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
> > > >     lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> > > >     lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> > > >   cpu          : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> > > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > >      issued    : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> > > >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > > >
> > > > Run status group 0 (all jobs):
> > > >   WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
> > > >
> > > > Run status group 1 (all jobs):
> > > >   WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
> > > >
> > > > What? 254 IOPS? With a dual-SSD BTRFS RAID 1?
> > > >
> > > > What?
> > > >
> > > > Ey, *what*?
> > […]
> > > > There we go:
> > > >
> > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core
> > > > for minutes on random write into big file
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > >
> > > I have done more tests.
> > >
> > > This is on the same /home after extending it to 170 GiB and balancing
> > > it with
> > >
> > > btrfs balance start -dusage=80
> > >
> > > It now has plenty of free space. I updated the bug report and hope it
> > > gives an easy enough to comprehend summary. The new tests are in:
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6
> > >
> > > Pasting below for discussion on the list. Summary: I easily get 38000
> > > (!) IOPS. It may be an idea to reduce it to 160 GiB again, but right
> > > now that does not work, as it says "no free space on device" when
> > > trying to downsize. I may try with 165 or 162 GiB.
> > >
> > > So now we have three IOPS figures:
> > >
> > > - 256 IOPS in the worst case scenario
> > > - 4700 IOPS when trying to reproduce the worst case scenario with a
> > >   fresh and small BTRFS
> > > - 38000 IOPS when /home has unused device space to allocate chunks
> > >   from
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8
> > >
> > > This is another test.
> >
> > Okay, and this is the last series of tests for today.
> >
> > Conclusion:
> >
> > I cannot manage to bring it down to its knees as before, but I come
> > close to it.
> >
> > It is still 8000 IOPS, instead of 250 IOPS, in a -- according to btrfs
> > fi sh -- even *worse* situation than before.
> >
> > That hints at the need to look at free space fragmentation, since in
> > the beginning the problem started appearing with:
> >
> > merkaba:~> btrfs fi sh /home
> > Label: 'home'  uuid: […]
> >         Total devices 2 FS bytes used 144.41GiB
> >         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> >         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> >
> > Btrfs v3.17
> > merkaba:~> btrfs fi df /home
> > Data, RAID1: total=154.97GiB, used=141.12GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> > Yes, that's 13 GiB of free space *within* the chunks.
> >
> > So while I can bring the IOPS down by getting the filesystem into a
> > situation where it cannot reserve additional data chunks again, I
> > cannot recreate the abysmal 250 IOPS figure this way. Not even with my
> > /home filesystem.
> >
> > So there is more to it. I think it is important to look into free
> > space fragmentation. It seems it needs an *aged* filesystem to
> > recreate, and it seems the balances really helped, as I am not able to
> > recreate the issue to that extent right now.
> >
> > So this shows that my original idea about free device space to
> > allocate from doesn't fully explain it either. It seems to be
> > something going on *within* the chunks that explains the worst case
> > scenario: <300 IOPS, a kworker using one core for minutes, and the
> > desktop locked up.
> >
> > Is there a way to view free space fragmentation in BTRFS?
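I am not aware of a ready-made view for that. The closest I can come up
with is a rough sketch: dump the extent tree with btrfs-debug-tree and
measure the gaps between allocated extents. Read-only but heavy, best
done on an unmounted filesystem, and only an approximation (gaps that
span chunk boundaries are not really free space, and skinny metadata
items are skipped here; -t 2 selects the extent tree on recent
btrfs-progs, older versions spell it -e):

  btrfs-debug-tree -t 2 /dev/mapper/msata-home 2>/dev/null |
    awk '/ EXTENT_ITEM / {
           gsub(/[()]/, "")            # "key (start EXTENT_ITEM length)"
           start = $4; len = $6
           if (prev_end != "" && start > prev_end)
               print start - prev_end  # size of the free gap in bytes
           prev_end = start + len
         }' |
    sort -n | uniq -c | tail           # rough histogram: many small gaps
                                       # means badly fragmented free space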
> So to rephrase that: from what I perceive, the worst case issue
> happens when
>
> 1) BTRFS cannot reserve any new chunks from unused device space
>    anymore, and
>
> 2) the free space in the existing chunks is highly fragmented.
>
> One of those conditions alone is not sufficient to trigger it.
>
> That is at least my current idea about it.

One more note about the IOPS. I currently let fio run with:

martin@merkaba:~> cat ssd-test.fio
[global]
bs=4k
#ioengine=libaio
#iodepth=4
size=4g
#direct=1
runtime=120
filename=ssd.test.file

#[seq-write]
#rw=write
#stonewall

[rand-write]
rw=randwrite
stonewall

This is using buffered I/O via plain read()/write() system calls, so
these IOPS figures do not reflect the raw capabilities of the devices.
I specifically wanted to test through the page cache to simulate what I
see with VirtualBox writing to the VDI file (i.e. dirty pages piling up
and dirty_background_ratio taking effect) -- just like a real
application.

But that also means the IOPS may come out higher, because fio ends
before all of the writes have been completed to disk. So when I get
below 300 IOPS with buffered writes, it means that even through the
page cache BTRFS was not able to sustain a higher rate.

It also means that I measure write requests the way an application
would issue them (unless it uses fsync() or direct I/O, which it seems
to me VirtualBox doesn't, at least not with every request).

I just wanted to make that explicit. It is basically visible in the job
file from what I commented out in there, but still, I thought I'd
mention it.

I just tested the effect by reducing the test file to 500 MiB and the
runtime to 10 seconds, and I got 98000 IOPS for that. So the larger
test file size, but especially the longer runtime, forces the kernel to
do actual writes, due to:

merkaba:~> grep . /proc/sys/vm/dirty_*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500

(Standard values. I still see no need to optimize anything here with
those SSDs, not even with the 16 GiB of RAM the laptop has, as the SSDs
usually easily keep up, and I'd rather wait for a change in the default
values unless I am convinced of a benefit in manually adapting them in
*this* case.)
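For the record, what those ratios mean in bytes on this machine --
quick arithmetic against MemTotal (strictly speaking the ratios apply
to dirtyable memory, not all of MemTotal, so this slightly
overestimates):

  awk '/^MemTotal/ {
         gib = $2 / 1024 / 1024   # MemTotal is reported in KiB
         printf "background writeback starts at ~%.1f GiB dirty\n", gib * 0.10
         printf "writers are throttled at      ~%.1f GiB dirty\n", gib * 0.20
       }' /proc/meminfo

With ~15.5 GiB of RAM that is roughly 1.5 GiB and 3 GiB, which fits the
observation: a 500 MiB test file stays below the background threshold
and runs at page-cache speed, while the 4 GiB random writer gets pushed
into actual writeback and throttled down to what BTRFS can sustain.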
Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7