From: Martin Steigerwald <Martin@lichtvoll.de>
To: Hugo Mills <hugo@carfax.org.uk>
Cc: Robert White <rwhite@pobox.com>, linux-btrfs@vger.kernel.org
Subject: Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)
Date: Mon, 29 Dec 2014 10:25:39 +0100 [thread overview]
Message-ID: <6556726.TrNmnHvhLa@merkaba> (raw)
In-Reply-To: <2574221.IMnk8vTl9R@merkaba>
[-- Attachment #1: Type: text/plain, Size: 29945 bytes --]
Am Sonntag, 28. Dezember 2014, 14:56:21 schrieb Martin Steigerwald:
> Am Sonntag, 28. Dezember 2014, 14:40:32 schrieb Martin Steigerwald:
> > Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald:
> > > Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
> > > > Summarized at
> > > >
> > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > > >
> > > > see below. This is reproducable with fio, no need for Windows XP in
> > > > Virtualbox for reproducing the issue. Next I will try to reproduce with
> > > > a freshly creating filesystem.
> > > >
> > > >
> > > > Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> > > > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > > > > Hello!
> > > > > > > >
> > > > > > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > > > > >
> > > > > > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > > > > > bug
> > > > > > > > report:
> > > > > > > >
> > > > > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > > > > > space_cache, skinny meta data extents – are these a problem? – and
> > > > > > >
> > > > > > > > compress=lzo:
> > > > > > > (there is no known problem with skinny metadata, it's actually more
> > > > > > > efficient than the older format. There has been some anecdotes about
> > > > > > > mixing the skinny and fat metadata but nothing has ever been
> > > > > > > demonstrated problematic.)
> > > > > > >
> > > > > > > > merkaba:~> btrfs fi sh /home
> > > > > > > > Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > > > > >
> > > > > > > > Total devices 2 FS bytes used 144.41GiB
> > > > > > > > devid 1 size 160.00GiB used 160.00GiB path
> > > > > > > > /dev/mapper/msata-home
> > > > > > > > devid 2 size 160.00GiB used 160.00GiB path
> > > > > > > > /dev/mapper/sata-home
> > > > > > > >
> > > > > > > > Btrfs v3.17
> > > > > > > > merkaba:~> btrfs fi df /home
> > > > > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > > > >
> > > > > > > This filesystem, at the allocation level, is "very full" (see below).
> > > > > > >
> > > > > > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > > > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > > > > > cause I know no tax return software for Linux which would be suitable
> > > > > > > > for
> > > > > > > > Germany and I frankly don´t care about the end of security cause all
> > > > > > > > surfing and other network access I will do from the Linux box and I
> > > > > > > > only
> > > > > > > > run the VM behind a firewall).
> > > > > > >
> > > > > > > > And thus I try the balance dance again:
> > > > > > > ITEM: Balance... it doesn't do what you think it does...
> > > > > > >
> > > > > > > "Balancing" is something you should almost never need to do. It is only
> > > > > > > for cases of changing geometry (adding disks, switching RAID levels,
> > > > > > > etc.) of for cases when you've radically changed allocation behaviors
> > > > > > > (like you decided to remove all your VM's or you've decided to remove a
> > > > > > > mail spool directory full of thousands of tiny files).
> > > > > > >
> > > > > > > People run balance all the time because they think they should. They are
> > > > > > > _usually_ incorrect in that belief.
> > > > > >
> > > > > > I only see the lockups of BTRFS is the trees *occupy* all space on the
> > > > > > device.
> > > > > No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> > > > > space. What's more, balance does *not* balance the metadata trees. The
> > > > > remaining space -- 154.97 GiB -- is unstructured storage for file
> > > > > data, and you have some 13 GiB of that available for use.
> > > > >
> > > > > Now, since you're seeing lockups when the space on your disks is
> > > > > all allocated I'd say that's a bug. However, you're the *only* person
> > > > > who's reported this as a regular occurrence. Does this happen with all
> > > > > filesystems you have, or just this one?
> > > > >
> > > > > > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > > > > > from to *extend* a tree.
> > > > >
> > > > > It's not a tree. It's simply space allocation. It's not even space
> > > > > *usage* you're talking about here -- it's just allocation (i.e. the FS
> > > > > saying "I'm going to use this piece of disk for this purpose").
> > > > >
> > > > > > This may be a bug, but this is what I see.
> > > > > >
> > > > > > And no amount of "you should not balance a BTRFS" will make that
> > > > > > perception go away.
> > > > > >
> > > > > > See, I see the sun coming out on a morning and you tell me "no, it
> > > > > > doesn´t". Simply that is not going to match my perception.
> > > > >
> > > > > Duncan's assertion is correct in its detail. Looking at your space
> > > >
> > > > Robert's
> > > >
> > > > > usage, I would not suggest that running a balance is something you
> > > > > need to do. Now, since you have these lockups that seem quite
> > > > > repeatable, there's probably a lurking bug in there, but hacking
> > > > > around with balance every time you hit it isn't going to get the
> > > > > problem solved properly.
> > > > >
> > > > > I think I would suggest the following:
> > > > >
> > > > > - make sure you have some way of logging your dmesg permanently (use
> > > > > a different filesystem for /var/log, or a serial console, or a
> > > > > netconsole)
> > > > >
> > > > > - when the lockup happens, hit Alt-SysRq-t a few times
> > > > >
> > > > > - send the dmesg output here, or post to bugzilla.kernel.org
> > > > >
> > > > > That's probably going to give enough information to the developers
> > > > > to work out where the lockup is happening, and is clearly the way
> > > > > forward here.
> > > >
> > > > And I got it reproduced. *Perfectly* reproduced, I´d say.
> > > >
> > > > But let me run the whole story:
> > > >
> > > > 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
> > >
> > > [… story of trying to reproduce with Windows XP defragmenting which was
> > > unsuccessful as BTRFS still had free device space to allocate new chunks
> > > from …]
> > >
> > > > But finally I got to:
> > > >
> > > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > > > Sa 27. Dez 13:26:39 CET 2014
> > > > Label: 'home' uuid: [some UUID]
> > > > Total devices 2 FS bytes used 152.83GiB
> > > > devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > > > devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > > >
> > > > Btrfs v3.17
> > > > Data, RAID1: total=154.97GiB, used=149.58GiB
> > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > >
> > > >
> > > >
> > > > So I did, if Virtualbox can write randomly in a file, I can too.
> > > >
> > > > So I did:
> > > >
> > > >
> > > > martin@merkaba:~> cat ssd-test.fio
> > > > [global]
> > > > bs=4k
> > > > #ioengine=libaio
> > > > #iodepth=4
> > > > size=4g
> > > > #direct=1
> > > > runtime=120
> > > > filename=ssd.test.file
> > > >
> > > > [seq-write]
> > > > rw=write
> > > > stonewall
> > > >
> > > > [rand-write]
> > > > rw=randwrite
> > > > stonewall
> > > >
> > > >
> > > >
> > > > And got:
> > > >
> > > > ATOP - merkaba 2014/12/27 13:41:02 ----------- 10s elapsed
> > > > PRC | sys 10.14s | user 0.38s | #proc 332 | #trun 2 | #tslpi 548 | #tslpu 0 | #zombie 0 | no procacct |
> > > > CPU | sys 102% | user 4% | irq 0% | idle 295% | wait 0% | guest 0% | curf 3.10GHz | curscal 96% |
> > > > cpu | sys 76% | user 0% | irq 0% | idle 24% | cpu001 w 0% | guest 0% | curf 3.20GHz | curscal 99% |
> > > > cpu | sys 24% | user 1% | irq 0% | idle 75% | cpu000 w 0% | guest 0% | curf 3.19GHz | curscal 99% |
> > > > cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > CPL | avg1 0.82 | avg5 0.78 | avg15 0.99 | | csw 6233 | intr 12023 | | numcpu 4 |
> > > > MEM | tot 15.5G | free 4.0G | cache 9.7G | buff 0.0M | slab 333.1M | shmem 206.6M | vmbal 0.0M | hptot 0.0M |
> > > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > > LVM | sata-home | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> > > > DSK | sda | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> > > > NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
> > > > NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 |
> > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > >
> > > > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
> > > > 18079 - martin martin 2 9.99s 0.00s 0K 0K 0K 16K -- - R 1 100% fio
> > > > 4746 - martin martin 2 0.01s 0.14s 0K 0K 0K 0K -- - S 2 2% konsole
> > > > 3291 - martin martin 4 0.01s 0.11s 0K 0K 0K 0K -- - S 0 1% plasma-desktop
> > > > 1488 - root root 1 0.03s 0.04s 0K 0K 0K 0K -- - S 0 1% Xorg
> > > > 10036 - root root 1 0.04s 0.02s 0K 0K 0K 0K -- - R 2 1% atop
> > > >
> > > > while fio was just *laying* out the 4 GiB file. Yes, thats 100% system CPU
> > > > for 10 seconds while allocatiing a 4 GiB file on a filesystem like:
> > > >
> > > > martin@merkaba:~> LANG=C df -hT /home
> > > > Filesystem Type Size Used Avail Use% Mounted on
> > > > /dev/mapper/msata-home btrfs 170G 156G 17G 91% /home
> > > >
> > > > where a 4 GiB file should easily fit, no? (And this output is with the 4
> > > > GiB file. So it was even 4 GiB more free before.)
> > > >
> > > >
> > > > But it gets even more visible:
> > > >
> > > > martin@merkaba:~> fio ssd-test.fio
> > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > fio-2.1.11
> > > > Starting 2 processes
> > > > Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]
> > > > 0$ zsh 1$ zsh 2$ zsh 3-$ zsh 4$ zsh 5$* zsh
> > > >
> > > >
> > > > yes, thats 0 IOPS.
> > > >
> > > > 0 IOPS and in zero IOPS. For minutes.
> > > >
> > > >
> > > >
> > > > And here is why:
> > > >
> > > > ATOP - merkaba 2014/12/27 13:46:52 ----------- 10s elapsed
> > > > PRC | sys 10.77s | user 0.31s | #proc 334 | #trun 2 | #tslpi 548 | #tslpu 3 | #zombie 0 | no procacct |
> > > > CPU | sys 108% | user 3% | irq 0% | idle 286% | wait 2% | guest 0% | curf 3.08GHz | curscal 96% |
> > > > cpu | sys 72% | user 1% | irq 0% | idle 28% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 19% | user 0% | irq 0% | idle 81% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 11% | user 1% | irq 0% | idle 87% | cpu003 w 1% | guest 0% | curf 3.19GHz | curscal 99% |
> > > > cpu | sys 6% | user 1% | irq 0% | idle 91% | cpu002 w 1% | guest 0% | curf 3.11GHz | curscal 97% |
> > > > CPL | avg1 2.78 | avg5 1.34 | avg15 1.12 | | csw 50192 | intr 32379 | | numcpu 4 |
> > > > MEM | tot 15.5G | free 5.0G | cache 8.7G | buff 0.0M | slab 332.6M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> > > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > > LVM | sata-home | busy 5% | read 160 | write 11177 | KiB/w 3 | MBr/s 0.06 | MBw/s 4.36 | avio 0.05 ms |
> > > > LVM | msata-home | busy 4% | read 28 | write 11177 | KiB/w 3 | MBr/s 0.01 | MBw/s 4.36 | avio 0.04 ms |
> > > > LVM | sata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> > > > LVM | msata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> > > > DSK | sda | busy 5% | read 160 | write 10200 | KiB/w 4 | MBr/s 0.06 | MBw/s 4.69 | avio 0.05 ms |
> > > > DSK | sdb | busy 4% | read 28 | write 10558 | KiB/w 4 | MBr/s 0.01 | MBw/s 4.69 | avio 0.04 ms |
> > > > NET | transport | tcpi 35 | tcpo 33 | udpi 3 | udpo 3 | tcpao 2 | tcppo 1 | tcprs 0 |
> > > > NET | network | ipi 38 | ipo 36 | ipfrw 0 | deliv 38 | | icmpi 0 | icmpo 0 |
> > > > NET | eth0 0% | pcki 22 | pcko 20 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > >
> > > > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> > > > 14973 - root root 1 8.92s 0.00s 0K 0K 0K 144K -- - S 0 89% kworker/u8:14
> > > > 17450 - root root 1 0.86s 0.00s 0K 0K 0K 32K -- - R 3 9% kworker/u8:5
> > > > 788 - root root 1 0.25s 0.00s 0K 0K 128K 18880K -- - S 3 3% btrfs-transact
> > > > 12254 - root root 1 0.14s 0.00s 0K 0K 64K 576K -- - S 2 1% kworker/u8:3
> > > > 17332 - root root 1 0.11s 0.00s 0K 0K 112K 1348K -- - S 2 1% kworker/u8:4
> > > > 3291 - martin martin 4 0.01s 0.09s 0K 0K 0K 0K -- - S 1 1% plasma-deskto
> > > >
> > > >
> > > >
> > > >
> > > > ATOP - merkaba 2014/12/27 13:47:12 ----------- 10s elapsed
> > > > PRC | sys 10.78s | user 0.44s | #proc 334 | #trun 3 | #tslpi 547 | #tslpu 3 | #zombie 0 | no procacct |
> > > > CPU | sys 106% | user 4% | irq 0% | idle 288% | wait 1% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 93% | user 0% | irq 0% | idle 7% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 7% | user 0% | irq 0% | idle 93% | cpu003 w 0% | guest 0% | curf 3.01GHz | curscal 94% |
> > > > cpu | sys 3% | user 2% | irq 0% | idle 94% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > CPL | avg1 3.33 | avg5 1.56 | avg15 1.20 | | csw 38253 | intr 23104 | | numcpu 4 |
> > > > MEM | tot 15.5G | free 4.9G | cache 8.7G | buff 0.0M | slab 336.5M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> > > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > > LVM | msata-home | busy 2% | read 0 | write 2337 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.91 | avio 0.07 ms |
> > > > LVM | sata-home | busy 2% | read 36 | write 2337 | KiB/w 3 | MBr/s 0.01 | MBw/s 0.91 | avio 0.07 ms |
> > > > LVM | msata-debian | busy 1% | read 1 | write 1630 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.65 | avio 0.03 ms |
> > > > LVM | sata-debian | busy 0% | read 0 | write 1019 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.41 | avio 0.02 ms |
> > > > DSK | sdb | busy 2% | read 1 | write 2545 | KiB/w 5 | MBr/s 0.00 | MBw/s 1.45 | avio 0.07 ms |
> > > > DSK | sda | busy 1% | read 36 | write 2461 | KiB/w 5 | MBr/s 0.01 | MBw/s 1.28 | avio 0.06 ms |
> > > > NET | transport | tcpi 20 | tcpo 20 | udpi 1 | udpo 1 | tcpao 1 | tcppo 1 | tcprs 0 |
> > > > NET | network | ipi 21 | ipo 21 | ipfrw 0 | deliv 21 | | icmpi 0 | icmpo 0 |
> > > > NET | eth0 0% | pcki 5 | pcko 5 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 |
> > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > >
> > > > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> > > > 17450 - root root 1 9.96s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:5
> > > > 4746 - martin martin 2 0.06s 0.15s 0K 0K 0K 0K -- - S 1 2% konsole
> > > > 10508 - root root 1 0.13s 0.00s 0K 0K 96K 4048K -- - S 1 1% kworker/u8:18
> > > > 1488 - root root 1 0.06s 0.06s 0K 0K 0K 0K -- - S 0 1% Xorg
> > > > 17332 - root root 1 0.12s 0.00s 0K 0K 96K 580K -- - R 3 1% kworker/u8:4
> > > > 17454 - root root 1 0.11s 0.00s 0K 0K 32K 4416K -- - D 1 1% kworker/u8:6
> > > > 17516 - root root 1 0.09s 0.00s 0K 0K 16K 136K -- - S 3 1% kworker/u8:7
> > > > 3268 - martin martin 3 0.02s 0.05s 0K 0K 0K 0K -- - S 1 1% kwin
> > > > 10036 - root root 1 0.05s 0.02s 0K 0K 0K 0K -- - R 0 1% atop
> > > >
> > > >
> > > >
> > > > So BTRFS is basically busy with itself and nothing else. Look at the SSD
> > > > usage. They are *idling* around. Heck 2400 write accesses in 10 seconds.
> > > > Thats a joke with SSDs that can do 40000 IOPS (depending on how and what
> > > > you measure of course, like request size, read, write, iodepth and so).
> > > >
> > > > Its kworker/u8:5 utilizing 100% of one core for minutes.
> > > >
> > > >
> > > >
> > > > Its the random write case it seems. Here are values from fio job:
> > > >
> > > > martin@merkaba:~> fio ssd-test.fio
> > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > fio-2.1.11
> > > > Starting 2 processes
> > > > Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> > > > seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
> > > > write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> > > > clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> > > > lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> > > > clat percentiles (usec):
> > > > | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
> > > > | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5],
> > > > | 70.00th=[ 7], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ 9],
> > > > | 99.00th=[ 14], 99.50th=[ 20], 99.90th=[ 211], 99.95th=[ 2128],
> > > > | 99.99th=[10304]
> > > > bw (KB /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
> > > > lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> > > > lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> > > > lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> > > > cpu : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> > > > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> > > > latency : target=0, window=0, percentile=100.00%, depth=1
> > > >
> > > > Seems fine.
> > > >
> > > >
> > > > But:
> > > >
> > > > rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> > > > write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> > > > clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> > > > lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> > > > clat percentiles (usec):
> > > > | 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5],
> > > > | 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6],
> > > > | 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10],
> > > > | 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116],
> > > > | 99.99th=[16711680]
> > > > bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
> > > > lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> > > > lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> > > > cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> > > > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> > > > latency : target=0, window=0, percentile=100.00%, depth=1
> > > >
> > > > Run status group 0 (all jobs):
> > > > WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
> > > >
> > > > Run status group 1 (all jobs):
> > > > WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
> > > >
> > > >
> > > > What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
> > > >
> > > > What?
> > > >
> > > > Ey, *what*?
> > […]
> > > > There we go:
> > > >
> > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > >
> > > I have done more tests.
> > >
> > > This is on the same /home after extending it to 170 GiB and balancing it to
> > > btrfs balance start -dusage=80
> > >
> > > It has plenty of free space free. I updated the bug report and hope it can
> > > give an easy enough to comprehend summary. The new tests are in:
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6
> > >
> > >
> > >
> > > Pasting below for discussion on list. Summary: I easily get 38000 (!)
> > > IOPS. It may be an idea to reduce to 160 GiB, but right now this does
> > > not work as it says no free space on device when trying to downsize it.
> > > I may try with 165 or 162GiB.
> > >
> > > So now we have three IOPS figures:
> > >
> > > - 256 IOPS in worst case scenario
> > > - 4700 IOPS when trying to reproduce worst case scenario with a fresh and small
> > > BTRFS
> > > - 38000 IOPS when /home has unused device space to allocate chunks from
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8
> > >
> > >
> > > This is another test.
> >
> >
> > Okay, and this is the last series of tests for today.
> >
> > Conclusion:
> >
> > I cannot manage to get it down to the knees as before, but I come near to it.
> >
> > Still its 8000 IOPS, instead of 250 IOPS, in an according to btrfs fi sh
> > even *worse* situation than before.
> >
> > That hints me at the need to look at the free space fragmentation, as in the
> > beginning the problem started appearing with:
> >
> > merkaba:~> btrfs fi sh /home
> > Label: 'home' uuid: […]
> > Total devices 2 FS bytes used 144.41GiB
> > devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> >
> > Btrfs v3.17
> > merkaba:~> btrfs fi df /home
> > Data, RAID1: total=154.97GiB, used=141.12GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> >
> >
> > Yes, thats 13 GiB of free space *within* the chunks.
> >
> > So while I can get it down in IOPS by bringing it to a situation where it
> > can not reserve additional data chunks again, I cannot recreate the
> > abysmal 250 IOPS figure by this. Not even with my /home filesystem.
> >
> > So there is more to it. I think its important to look into free space
> > fragmentation. It seems it needs an *aged* filesystem to recreate. At
> > it seems the balances really helped. As I am not able to recreate the
> > issue to that extent right now.
> >
> > So this shows my original idea about free device space to allocate from
> > also doesn´t explain it fully. It seems to be something thats going on
> > within the chunks that explains the worst case <300 IOPS, kworker using
> > one core for minutes and desktop locked scenario.
> >
> > Is there a way to view free space fragmentation in BTRFS?
>
> So to rephrase that:
>
> From what I perceive the worst case issue happens when
>
> 1) BTRFS cannot reserve any new chunks from unused device space anymore.
>
> 2) The free space in the existing chunks is highly fragmented.
>
> Only one of those conditions is not sufficient to trigger it.
>
> Thats at least my current idea about it.
With
merkaba:~> btrfs fi df /home
Data, RAID1: total=163.87GiB, used=146.92GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.94GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
merkaba:~> btrfs fi sh /home
Label: 'home' uuid: […]
Total devices 2 FS bytes used 150.18GiB
devid 1 size 170.00GiB used 169.84GiB path /dev/mapper/msata-home
devid 2 size 170.00GiB used 169.84GiB path /dev/mapper/sata-home
Btrfs v3.17
I had a noticable hang during sdelete.exe -z in Windows XP VM with 20 GiB VDI file – Patrik on mailing list told me they have changed the argument from -c to -z as I wondered by VBoxManage modifyhd Winlala.vdi --compact did not reduce the size of the file).
It was not as bad, but desktop was locked for more than 5 seconds easily.
So this also happens with larger free space *within* the chunks. Before I to the VBoxManage --compact I will now rebalance partly.
So this definately shows, it can happen when BTRFS cannot reserve any new
chunks anymore, yet still has *plenty* of free space within the existing data
chunks.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
next prev parent reply other threads:[~2014-12-29 9:25 UTC|newest]
Thread overview: 59+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-12-26 13:37 BTRFS free space handling still needs more work: Hangs again Martin Steigerwald
2014-12-26 14:20 ` Martin Steigerwald
2014-12-26 14:41 ` Martin Steigerwald
2014-12-27 3:33 ` Duncan
2014-12-26 15:59 ` Martin Steigerwald
2014-12-27 4:26 ` Duncan
2014-12-26 22:48 ` Robert White
2014-12-27 5:54 ` Duncan
2014-12-27 9:01 ` Martin Steigerwald
2014-12-27 9:30 ` Hugo Mills
2014-12-27 10:54 ` Martin Steigerwald
2014-12-27 11:52 ` Robert White
2014-12-27 13:16 ` Martin Steigerwald
2014-12-27 13:49 ` Robert White
2014-12-27 14:06 ` Martin Steigerwald
2014-12-27 14:00 ` Robert White
2014-12-27 14:14 ` Martin Steigerwald
2014-12-27 14:21 ` Martin Steigerwald
2014-12-27 15:14 ` Robert White
2014-12-27 16:01 ` Martin Steigerwald
2014-12-28 0:25 ` Robert White
2014-12-28 1:01 ` Bardur Arantsson
2014-12-28 4:03 ` Robert White
2014-12-28 12:03 ` Martin Steigerwald
2014-12-28 17:04 ` Patrik Lundquist
2014-12-29 10:14 ` Martin Steigerwald
2014-12-28 12:07 ` Martin Steigerwald
2014-12-28 14:52 ` Robert White
2014-12-28 15:42 ` Martin Steigerwald
2014-12-28 15:47 ` Martin Steigerwald
2014-12-29 0:27 ` Robert White
2014-12-29 9:14 ` Martin Steigerwald
2014-12-27 16:10 ` Martin Steigerwald
2014-12-27 14:19 ` Robert White
2014-12-27 11:11 ` Martin Steigerwald
2014-12-27 12:08 ` Robert White
2014-12-27 13:55 ` Martin Steigerwald
2014-12-27 14:54 ` Robert White
2014-12-27 16:26 ` Hugo Mills
2014-12-27 17:11 ` Martin Steigerwald
2014-12-27 17:59 ` Martin Steigerwald
2014-12-28 0:06 ` Robert White
2014-12-28 11:05 ` Martin Steigerwald
2014-12-28 13:00 ` BTRFS free space handling still needs more work: Hangs again (further tests) Martin Steigerwald
2014-12-28 13:40 ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare) Martin Steigerwald
2014-12-28 13:56 ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea) Martin Steigerwald
2014-12-28 15:00 ` Martin Steigerwald
2014-12-29 9:25 ` Martin Steigerwald [this message]
2014-12-27 18:28 ` BTRFS free space handling still needs more work: Hangs again Zygo Blaxell
2014-12-27 18:40 ` Hugo Mills
2014-12-27 19:23 ` BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time) Martin Steigerwald
2014-12-29 2:07 ` Zygo Blaxell
2014-12-29 9:32 ` Martin Steigerwald
2015-01-06 20:03 ` Zygo Blaxell
2015-01-07 19:08 ` Martin Steigerwald
2015-01-07 21:41 ` Zygo Blaxell
2015-01-08 5:45 ` Duncan
2015-01-08 10:18 ` Martin Steigerwald
2015-01-09 8:25 ` Duncan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=6556726.TrNmnHvhLa@merkaba \
--to=martin@lichtvoll.de \
--cc=hugo@carfax.org.uk \
--cc=linux-btrfs@vger.kernel.org \
--cc=rwhite@pobox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).