* btrfs-transaction blocked for more than 120 seconds @ 2013-12-31 11:46 Sulla 2014-01-01 12:37 ` Duncan ` (2 more replies) 0 siblings, 3 replies; 31+ messages in thread From: Sulla @ 2013-12-31 11:46 UTC (permalink / raw) To: linux-btrfs Dear all! On my Ubuntu Server 13.10 I use a RAID5 blockdevice consisting of 3 WD20EARS drives. On this I built a LVM and in this LVM I use quite normal partitions /, /home, SWAP (/boot resides on a RAID1.) and also a custom /data partition. Everything (except boot and swap) is on btrfs. sometimes my system hangs for quite some time (top is showing a high wait percentage), then runs on normally. I get kernel messages into /var/log/sylsog, see below. I am unable to make any sense of the kernel messages, there is no reference to the filesystem or drive affected (at least I can not find one). Question: What is happening here? * Is a HDD failing (smart looks good, however) * Is something wrong with my btrfs-filesystem? with which one? * How can I find the cause? thanks, Wolfgang Dec 31 12:27:49 freedom kernel: [ 4681.264112] INFO: task btrfs-transacti:529 blocked for more than 120 seconds. Dec 31 12:27:49 freedom kernel: [ 4681.264239] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 31 12:27:49 freedom kernel: [ 4681.264367] btrfs-transacti D ffff88013fc14580 0 529 2 0x00000000 Dec 31 12:27:49 freedom kernel: [ 4681.264377] ffff880138345e10 0000000000000046 ffff880138345fd8 0000000000014580 Dec 31 12:27:49 freedom kernel: [ 4681.264386] ffff880138345fd8 0000000000014580 ffff880135615dc0 ffff880132fb6a00 Dec 31 12:27:49 freedom kernel: [ 4681.264393] ffff880133f45800 ffff880138345e30 ffff880137ee2000 ffff880137ee2070 Dec 31 12:27:49 freedom kernel: [ 4681.264402] Call Trace: Dec 31 12:27:49 freedom kernel: [ 4681.264418] [<ffffffff816eaa79>] schedule+0x29/0x70 Dec 31 12:27:49 freedom kernel: [ 4681.264477] [<ffffffffa032a57d>] btrfs_commit_transaction+0x34d/0x980 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.264487] [<ffffffff81085580>] ? wake_up_atomic_t+0x30/0x30 Dec 31 12:27:49 freedom kernel: [ 4681.264517] [<ffffffffa0321be5>] transaction_kthread+0x1a5/0x240 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.264548] [<ffffffffa0321a40>] ? verify_parent_transid+0x150/0x150 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.264557] [<ffffffff810847b0>] kthread+0xc0/0xd0 Dec 31 12:27:49 freedom kernel: [ 4681.264565] [<ffffffff810846f0>] ? kthread_create_on_node+0x120/0x120 Dec 31 12:27:49 freedom kernel: [ 4681.264573] [<ffffffff816f566c>] ret_from_fork+0x7c/0xb0 Dec 31 12:27:49 freedom kernel: [ 4681.264580] [<ffffffff810846f0>] ? kthread_create_on_node+0x120/0x120 Dec 31 12:27:49 freedom kernel: [ 4681.264610] INFO: task kworker/u4:0:9975 blocked for more than 120 seconds. Dec 31 12:27:49 freedom kernel: [ 4681.264722] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Dec 31 12:27:49 freedom kernel: [ 4681.264847] kworker/u4:0 D ffff88013fd14580 0 9975 2 0x00000000 Dec 31 12:27:49 freedom kernel: [ 4681.264861] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-4) Dec 31 12:27:49 freedom kernel: [ 4681.264865] ffff8800a8739538 0000000000000046 ffff8800a8739fd8 0000000000014580 Dec 31 12:27:49 freedom kernel: [ 4681.264873] ffff8800a8739fd8 0000000000014580 ffff8801351e5dc0 ffff8801351e5dc0 Dec 31 12:27:49 freedom kernel: [ 4681.264880] ffff880134c5e6a8 ffff880134c5e6b0 ffffffff00000000 ffff880134c5e6b8 Dec 31 12:27:49 freedom kernel: [ 4681.264887] Call Trace: Dec 31 12:27:49 freedom kernel: [ 4681.264895] [<ffffffff816eaa79>] schedule+0x29/0x70 Dec 31 12:27:49 freedom kernel: [ 4681.264902] [<ffffffff816ec465>] rwsem_down_write_failed+0x105/0x1e0 Dec 31 12:27:49 freedom kernel: [ 4681.264911] [<ffffffff8136257d>] ? __rwsem_do_wake+0xdd/0x160 Dec 31 12:27:49 freedom kernel: [ 4681.264918] [<ffffffff81369763>] call_rwsem_down_write_failed+0x13/0x20 Dec 31 12:27:49 freedom kernel: [ 4681.264927] [<ffffffff816e9e7d>] ? down_write+0x2d/0x30 Dec 31 12:27:49 freedom kernel: [ 4681.264956] [<ffffffffa030fbe0>] cache_block_group+0x290/0x3b0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.264963] [<ffffffff81085580>] ? wake_up_atomic_t+0x30/0x30 Dec 31 12:27:49 freedom kernel: [ 4681.264991] [<ffffffffa0317d48>] find_free_extent+0xa38/0xac0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265022] [<ffffffffa0317ef2>] btrfs_reserve_extent+0xa2/0x1c0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265056] [<ffffffffa033103d>] __cow_file_range+0x15d/0x4a0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265090] [<ffffffffa0331efa>] cow_file_range+0x8a/0xd0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265122] [<ffffffffa0332290>] run_delalloc_range+0x350/0x390 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265158] [<ffffffffa0346bf1>] ? find_lock_delalloc_range.constprop.42+0x1d1/0x1f0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265194] [<ffffffffa0348764>] __extent_writepage+0x304/0x750 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265202] [<ffffffff8109a1d5>] ? set_next_entity+0x95/0xb0 Dec 31 12:27:49 freedom kernel: [ 4681.265212] [<ffffffff810115c6>] ? __switch_to+0x126/0x4b0 Dec 31 12:27:49 freedom kernel: [ 4681.265221] [<ffffffff8104dee9>] ? default_spin_lock_flags+0x9/0x10 Dec 31 12:27:49 freedom kernel: [ 4681.265229] [<ffffffff8113f6c1>] ? find_get_pages_tag+0xd1/0x180 Dec 31 12:27:49 freedom kernel: [ 4681.265266] [<ffffffffa0348e32>] extent_write_cache_pages.isra.31.constprop.46+0x282/0x3e0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265303] [<ffffffffa034928d>] extent_writepages+0x4d/0x70 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265336] [<ffffffffa032ea90>] ? 
btrfs_real_readdir+0x5c0/0x5c0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265369] [<ffffffffa032caa8>] btrfs_writepages+0x28/0x30 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265378] [<ffffffff8114a4ae>] do_writepages+0x1e/0x40 Dec 31 12:27:49 freedom kernel: [ 4681.265387] [<ffffffff811ce7d0>] __writeback_single_inode+0x40/0x220 Dec 31 12:27:49 freedom kernel: [ 4681.265395] [<ffffffff811ceb4b>] writeback_sb_inodes+0x19b/0x3b0 Dec 31 12:27:49 freedom kernel: [ 4681.265403] [<ffffffff811cedff>] __writeback_inodes_wb+0x9f/0xd0 Dec 31 12:27:49 freedom kernel: [ 4681.265411] [<ffffffff811cf623>] wb_writeback+0x243/0x2c0 Dec 31 12:27:49 freedom kernel: [ 4681.265418] [<ffffffff811d1489>] bdi_writeback_workfn+0x1b9/0x3d0 Dec 31 12:27:49 freedom kernel: [ 4681.265426] [<ffffffff8107d05c>] process_one_work+0x17c/0x430 Dec 31 12:27:49 freedom kernel: [ 4681.265432] [<ffffffff8107dcac>] worker_thread+0x11c/0x3c0 Dec 31 12:27:49 freedom kernel: [ 4681.265439] [<ffffffff8107db90>] ? manage_workers.isra.24+0x2a0/0x2a0 Dec 31 12:27:49 freedom kernel: [ 4681.265447] [<ffffffff810847b0>] kthread+0xc0/0xd0 Dec 31 12:27:49 freedom kernel: [ 4681.265454] [<ffffffff810846f0>] ? kthread_create_on_node+0x120/0x120 Dec 31 12:27:49 freedom kernel: [ 4681.265461] [<ffffffff816f566c>] ret_from_fork+0x7c/0xb0 Dec 31 12:27:49 freedom kernel: [ 4681.265469] [<ffffffff810846f0>] ? kthread_create_on_node+0x120/0x120 Dec 31 12:27:49 freedom kernel: [ 4681.265476] INFO: task smbd:10275 blocked for more than 120 seconds. Dec 31 12:27:49 freedom kernel: [ 4681.265579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 31 12:27:49 freedom kernel: [ 4681.265704] smbd D ffff88013fc14580 0 10275 723 0x00000004 Dec 31 12:27:49 freedom kernel: [ 4681.265711] ffff8800a5abbbc0 0000000000000046 ffff8800a5abbfd8 0000000000014580 Dec 31 12:27:49 freedom kernel: [ 4681.265718] ffff8800a5abbfd8 0000000000014580 ffff880133d5aee0 ffff880137ee2000 Dec 31 12:27:49 freedom kernel: [ 4681.265726] ffff880133db79e8 ffff880133db79e8 0000000000000001 ffff880132d2dc80 Dec 31 12:27:49 freedom kernel: [ 4681.265733] Call Trace: Dec 31 12:27:49 freedom kernel: [ 4681.265739] [<ffffffff816eaa79>] schedule+0x29/0x70 Dec 31 12:27:49 freedom kernel: [ 4681.265772] [<ffffffffa03296df>] wait_current_trans.isra.18+0xbf/0x120 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265778] [<ffffffff81085580>] ? wake_up_atomic_t+0x30/0x30 Dec 31 12:27:49 freedom kernel: [ 4681.265810] [<ffffffffa032af06>] start_transaction+0x356/0x520 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265843] [<ffffffffa032b0eb>] btrfs_start_transaction+0x1b/0x20 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265876] [<ffffffffa0334887>] btrfs_cont_expand+0x1c7/0x460 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265911] [<ffffffffa033cc26>] btrfs_file_aio_write+0x346/0x520 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265919] [<ffffffff811b9810>] ? poll_select_copy_remaining+0x130/0x130 Dec 31 12:27:49 freedom kernel: [ 4681.265928] [<ffffffff811a6640>] do_sync_write+0x80/0xb0 Dec 31 12:27:49 freedom kernel: [ 4681.265936] [<ffffffff811a6d7d>] vfs_write+0xbd/0x1e0 Dec 31 12:27:49 freedom kernel: [ 4681.265942] [<ffffffff811a7932>] SyS_pwrite64+0x72/0xb0 Dec 31 12:27:49 freedom kernel: [ 4681.265949] [<ffffffff816f571d>] system_call_fastpath+0x1a/0x1f -- For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled. Richard P. Feynman ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2013-12-31 11:46 btrfs-transaction blocked for more than 120 seconds Sulla @ 2014-01-01 12:37 ` Duncan 2014-01-01 20:08 ` Sulla 2014-01-03 17:25 ` Marc MERLIN 2014-01-02 8:49 ` Jojo 2014-01-05 20:32 ` Chris Murphy 2 siblings, 2 replies; 31+ messages in thread From: Duncan @ 2014-01-01 12:37 UTC (permalink / raw) To: linux-btrfs Sulla posted on Tue, 31 Dec 2013 12:46:04 +0100 as excerpted: > On my Ubuntu Server 13.10 I use a RAID5 blockdevice consisting of 3 > WD20EARS drives. On this I built a LVM and in this LVM I use quite > normal partitions /, /home, SWAP (/boot resides on a RAID1.) and also a > custom /data partition. Everything (except boot and swap) is on btrfs. > > sometimes my system hangs for quite some time (top is showing a high > wait percentage), then runs on normally. I get kernel messages into > /var/log/sylsog, see below. I am unable to make any sense of the kernel > messages, there is no reference to the filesystem or drive affected (at > least I can not find one). > > Question: What is happening here? > * Is a HDD failing (smart looks good, however) > * Is something wrong with my btrfs-filesystem? with which one? > * How can I find the cause? > > Dec 31 12:27:49 freedom kernel: [ 4681.264112] INFO: task > btrfs-transacti:529 blocked for more than 120 seconds. > > Dec 31 12:27:49 freedom kernel: [ 4681.264239] "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. First to put your mind at rest, no, it's unlikely that your hardware is failing; and it's not an indication of a filesystem bug either. Rather, it's a characteristic of btrfs behavior in certain corner-cases, and yes, you /can/ do something about it with some relatively minor btrfs configuration adjustments... altho on spinning rust at multi-terabyte sizes, those otherwise minor adjustments might take some time (hours)! There seem to be two primary btrfs triggers for these "blocked for more than N seconds" messages. One is COW-related (COW=copy-on-write, the basis of BTRFS) fragmentation, the other is many-hardlink related. The only scenario-trigger I've seen for the many-hardlink case, however, has been when people are using a hardlink-based backup scheme, which you don't mention, so I'd guess it's the COW-related trigger for you. A bit of background on COW: (Assuming I get this correct, I don't claim to be an expert on it.) In general, copy-on-write is a data handling technique where any modification to the original data is made out-of-line from the original, then the extent map (be it memory extent map for in- memory COW applications, or on-device data extent map for filesystems, or...) is modified, replacing the original inline extent index with that of the new modification. The advantage of COW for filesystems, over in-place-modification, is that should the system crash at just the right (wrong?) moment, before the full record has been written, an in-place-modification may corrupt the entire file (or worse yet, the metadata for a whole bunch of files, effectively killing them all!), while with COW the update is atomic -- at least in theory, it has either been fully written and you get the new version, or the remapping hasn't yet occurred and you get the old version -- no corrupted case which is if you're lucky, part new and part old, and if you're unlucky, has something entirely unrelated and very possibly binary in the middle of what might have previously been for example a plain-text config file. 
However, COW-based filesystems work best when most updates either replace the entire file, or append to the end of the file, luckily the most common case. COW's primary down side in filesystem implementations is that for use-cases where only a small piece of the file somewhere in the middle is modified and saved, then another small piece somewhere else, and another and another... repeated tens of thousands of times, each small modification and save gets mapped to a new location and the file fragments into possibly tens of thousands of extents, each with just the content of the individual modification made to the file at that point. On a spinning rust hard drive, the time necessary to seek to each of those possibly tens of thousands of extents in ordered to read the file, as compared to the cost of simply reading the same data were it stored sequentially in a straight line, is... non-trivial to say the least! It's exactly that fragmentation and the delays caused by all the seeks to read an affected file, that result in the stalls and system hangs you are seeing. OK, so now that we know what causes it, what files are affected, and what can you do to help the situation? Fortunately, COW-fragmentation isn't a situation that dramatically impacts operations on most files, as obviously if it was, it'd be unsuited for filesystem use at all. But it does have a dramatic effect in some cases -- the ones I've seen people report on this list are listed below: 1) Installation. Apparently the way some distribution installation scripts work results in even a brand new installation being highly fragmented. =:^( If in addition they don't add autodefrag to the mount options used when mounting the filesystem for the original installation, the problem is made even worse, since the autodefrag mount option is designed to help catch some of this sort of issue, and schedule the affected files for auto-defrag by a separate thread. The fix here is to run a manual btrfs filesystem defrag -r on the filesystem immediately after installation completes, and to add autodefrag to the mount options used for the filesystem from then on, to keep updates and routine operation from triggering new fragmentation. (It's possible to do the same with just the autodefrag option over time, but depending on how fragmented the filesystem was to begin with, some people report that this makes the problem worse for awhile, and the system unusable, until the autodefrag mechanism has caught up to the existing problem. Autodefrag works best to /keep/ an already in good shape filesystem in good shape; it's not so good at getting one that's highly fragmented back into good shape. That's what btrfs filesystem defrag -r is for. =:^) 2) Pre-allocated files. Systemd's journal file is probably the most common single case here, but it's not the only case, and AFAIK ubuntu doesn't use systemd anyway, so that's highly unlikely to be your problem. A less widespread case that's never-the-less common enough is bittorrent clients that preallocate files at their final size before the download, then write into them as the torrent chunks are downloaded. BAD situation for COW filesystems including btrfs, since now the entire file is one relocated chunk after another. If the file's a multi-gig DVD image or the like, as mentioned above, that can be tens of thousands of extents! This situation is *KNOWN* to cause N-second block reports and system stalls of the nature you're reporting, but of course only triggers for those running such bittorrent clients. 
One potential fix if your bittorrent client has the option, is to turn preallocation off. However, it's there for a couple reasons -- on normal non-COW filesystems it has exactly the opposite effect, ensuring a file stays sequentially mapped, AND, by preallocating the file, it's easier to ensure that there's space available for the entire thing. (Altho if you're using btrfs' compression option and it compresses the allocation, more space will still be used as the actual data downloads and the file is filled in, as that won't compress as well.) Additionally, there's other cases of pre-allocated files. For these and for bittorrent if you don't want to or can't turn pre-allocation off, there's the NOCOW file attribute. See below for that. 3) Virtual machine images. Virtual machine images tend to be rather large, often several gig, and to trigger internal-image writes every time the configuration changes or something is saved to the virtual disk in the image. Again, a big worst- case for COW-based filesystems such as btrfs, as those internal image- writes are precisely the sort of behavior that triggers image file fragmentation. For these, the NOCOW option is the best. Again, see below. 4) Database files. Same COW-based-filesystem-worst-case behavior pattern here. The autodefrag mount option was actually designed to help deal with this case, however, for small databases (typically the small sqlite databases used in firefox and thunderbird, for instance). It'll detect the fragmentation and rewrite the entire file as a single extent. Of course that works well for reasonably small databases, but won't work so well for multi-gig databases, or multi-gig VMs or torrent images for that matter, since the write magnification would be very large (rewriting a whole multi-gig image for every change of a few bytes). Which is where the NOCOW file attribute comes in... Solutions beyond btrfs filesystem defrag -r, and the autodefrag mount option: The nodatacow mount option. At the filesystem level, btrfs has the nodatacow mount option. For use- cases where there's several files of the same problematic type, say a bunch of VM images, or a bunch of torrent files downloading to the same target subdir or subdirectory tree, or a bunch of database files all in the same directory subtree, creating a dedicated filesystem which can be mounted with the nodatacow option can make sense. At some point in the future, btrfs is supposed to support different mount options per subvolume, and at that point, a simple subvolume mounted with nodatacow but still located on a main system volume mounted without it, might make sense, but at this point, differing subvolume mount options aren't available, so to use this solution, you have to create a fully separate btrfs filesystem to use the nodatacow option on. But nodatacow also disables some of the other features of btrfs, such as checksumming and compression. While those don't work so well with COW- averse use-cases anyway (for some of the same reasons COW doesn't work on them), once you get rid of them on a global filesystem level, you're almost back to the level of a normal filesystem, and might as well use one. So in that case, rather than a dedicated btrfs mounted with nodatacow, I'd suggest a dedicated ext4 or reiserfs or xfs or whatever filesystem instead, particularly since btrfs is still under development, while these other filesystems have been mature and stable for years. The NOCOW file attribute. Simple command form: chattr +C /path/to/file/or/directory *CAVEAT! 
This attribute should be set on new/empty files before they have any content. The easiest way to do that is to set the attribute on the parent directory, after which all new files created in it will inherit the attribute. (Alternatively, touch the file to create it empty, do the chattr, then append data into it using cat source >> target or the like.) Meanwhile, if there's a point at which the file exists in its more or less permanent form and won't be written into any longer (a torrented file is fully downloaded, or a VM image is backed up), sequentially copying it elsewhere (possibly using cp --reflink=never if on the same filesystem, to avoid a reflink copy pointing at the same fragmented extents!), then deleting the original fragmented version, should effectively defragment the file too. And since it's not being written into any more at that point, it should stay defragmented. Or just btrfs filesystem defrag the individual file... Finally, there's some more work going into autodefrag now, to hopefully increase its performance, and make it work more efficiently on a bit larger files as well. The goal is to eliminate the problems with systemd's journal, among other things, now that it's known to be a common problem, given systemd's widespread use and the fact that both systemd and btrfs aim to be the accepted general Linux default within a few years. Summary: Figure out what applications on your system have the "internal write" pattern that causes so much trouble to COW-based filesystems, and turn off that behavior either in that app (as possible with torrent clients), or in the filesystem, using either a dedicated filesystem mount, or more likely, by setting the NOCOW attribute (chattr +C) on the individual target files or directories. Figuring out which files and applications are affected is left to the reader, but the information above should provide a good starting point. Then btrfs filesystem defrag -r the filesystem and add autodefrag to its mount options to help keep it free of at least smaller-file fragmentation. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 31+ messages in thread
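A minimal sketch of the NOCOW workflow described above; the directory and file names are made-up examples, not taken from the thread:

  # mark the directory so new files created in it inherit NOCOW
  chattr +C /mnt/data/vm-images
  # verify the 'C' attribute is now set on the directory itself
  lsattr -d /mnt/data/vm-images
  # an existing image only picks up NOCOW if it is recreated as a new file
  cat /mnt/data/vm-images/disk.img > /mnt/data/vm-images/disk.img.nocow
  mv /mnt/data/vm-images/disk.img.nocow /mnt/data/vm-images/disk.img

After that, keeping autodefrag in the filesystem's mount options helps with the ordinary small-file rewrites that remain COW.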
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-01 12:37 ` Duncan @ 2014-01-01 20:08 ` Sulla 2014-01-02 8:38 ` Duncan 2014-01-05 0:12 ` Sulla 2014-01-03 17:25 ` Marc MERLIN 1 sibling, 2 replies; 31+ messages in thread
From: Sulla @ 2014-01-01 20:08 UTC (permalink / raw) To: linux-btrfs

Dear Duncan!

Thanks very much for your exhaustive answer.

Hm, I also thought of fragmentation, although I don't think this is really very likely, as my server doesn't serve things that are likely to cause fragmentation. It is a mailserver (but only maildir-format), a fileserver for windows clients (huge files that hardly ever get rewritten), a server for TV-records (but it only copies recordings from a sat receiver after they have been recorded, so no heavy rewriting here), a tiny webserver and all kinds of such things, but not storage for huge databases, virtual machines or a target for filesharing clients. It does, however, serve as a target for a hardlink-based backup program run on windows PCs, but only once per month or so, so that shouldn't be too much.

The problem must lie somewhere on the root partition itself, because the system is already slow before mounting the fat data partitions.

I'll give the defragmentation a try. But # sudo btrfs filesystem defrag -r doesn't work, because "-r" is an unknown option (I'm running Btrfs v0.20-rc1 on an Ubuntu 3.11.0-14-generic kernel). I'm doing a # sudo btrfs filesystem defrag / & on the root directory at the moment. Question: will this defragment everything or just the root fs, and will I need to run a defragment on /home as well, as /home is a separate btrfs filesystem?

I've also added the autodefrag mount option and will do a "mount -a" after the defragmentation.

I've considered a # sudo btrfs balance start as well; would this do any good? How close should I let the data fill the partition? The large data partitions are 85% used, root is 70% used. Is this safe or should I add space?

Thanx, Wolfgang

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-01 20:08 ` Sulla @ 2014-01-02 8:38 ` Duncan 2014-01-03 1:24 ` Kai Krakow 2014-01-05 0:12 ` Sulla 1 sibling, 1 reply; 31+ messages in thread
From: Duncan @ 2014-01-02 8:38 UTC (permalink / raw) To: linux-btrfs

Sulla posted on Wed, 01 Jan 2014 20:08:21 +0000 as excerpted:

> Dear Duncan!
>
> Thanks very much for your exhaustive answer.
>
> Hm, I also thought of fragmentation, although I don't think this is really very likely, as my server doesn't serve things that are likely to cause fragmentation. It is a mailserver (but only maildir-format), a fileserver for windows clients (huge files that hardly ever get rewritten), a server for TV-records (but it only copies recordings from a sat receiver after they have been recorded, so no heavy rewriting here), a tiny webserver and all kinds of such things, but not storage for huge databases, virtual machines or a target for filesharing clients. It does, however, serve as a target for a hardlink-based backup program run on windows PCs, but only once per month or so, so that shouldn't be too much.

One thing I didn't mention originally was how to check for fragmentation. filefrag is part of e2fsprogs, and does the trick -- with one caveat. filefrag currently doesn't know about btrfs compression, and interprets each 128 KiB block as a separate extent. So if you have btrfs compression turned on and check a (larger than 128 KiB) file that btrfs has compressed, filefrag will falsely report fragmentation. If in doubt, you can always try defragging that individual file and see if filefrag reports fewer extents or not. If it has fewer extents you know it was fragmented, if not...

With that you should actually be able to check some of those big files that you don't think are fragmented, to see.

> The problem must lie somewhere on the root partition itself, because the system is already slow before mounting the fat data partitions.
>
> I'll give the defragmentation a try. But # sudo btrfs filesystem defrag -r doesn't work, because "-r" is an unknown option (I'm running Btrfs v0.20-rc1 on an Ubuntu 3.11.0-14-generic kernel).

The -r option was added quite recently. As the wiki (at https://btrfs.wiki.kernel.org ) urges, btrfs is a development filesystem and people choosing to test it should really try to keep current, both because you're unnecessarily putting the data you're testing on btrfs at risk when running old versions with bugs patched in newer versions (that part's mostly for the kernel, tho), and because as a tester, when things /do/ go wrong and you report it, the reports are far more useful if you're running a current version.

Kernel 3.11.0 is old. 3.12 has been out for well over a month now. And the btrfs-progs userspace recently switched to kernel-synced versioning as well, with version 3.12 the latest version, which also happens to be the first kernel-version-synced version. That's assuming you don't choose to run the latest git version of the userspace, and the Linus kernel RCs, which many btrfs testers do. (Tho last I updated btrfs-progs, about a week ago, the last git commit was still the version bump to 3.12, but I'm running a git kernel at version 3.13.0-rc5 plus 69 commits.)

So you are encouraged to update. =:^) However, if you don't choose to upgrade ... (see next)

> I'm doing a # sudo btrfs filesystem defrag / & on the root directory at the moment.

... Before the -r option was added, btrfs filesystem defrag would only defrag the specific file it was pointed at.
If pointed at a directory, it would defrag the directory metadata, but not files or subdirs below it. The way to defrag the entire system then, involved a rather more complicated command using find to output a list of everything on the system, and run defrag individually on each item listed. It's on the wiki. Let's see if I can find it... (yes, but note the wrapped link): https://btrfs.wiki.kernel.org/index.php/ UseCases#How_do_I_defragment_many_files.3F sudo find [subvol [subvol]…] -xdev -type f -exec btrfs filesystem defragment -- {} + As the wiki warns, that doesn't recurse into subvolumes (the -xdev keeps it from going onto non-btrfs filesystems but also keeps it from going into subvolumes), but you can list them as paths where noted. > Question: will this defragment everything or just the root-fs and will I > need to run a defragment on /home as well, as /home is a separate btrfs > filesystem? Well, as noted your command doesn't really defragment that much. But the find command should defragment everything on the named subvolumes. But of course this is where that bit I mentioned in the original post about possibly taking hours with multiple terabytes on spinning rust comes in too. It could take awhile, and when it gets to really fragmented files, it'll probably trigger the same sort of stalls that has us discussing the whole thing in the first place, so the system may not be exactly usable. =:^( > I've also added autodefrag mountoptions and will do a "mount -a" after > the defragmentation. > > I've considered a # sudo btrfs balance start as well, would this do any > good? How close should I let the data fill the partition? The large data > partitions are 85% used, root is 70% used. Is this safe or should I add > space? !! Be careful!! You mentioned running 3.11. Both early versions of 3.11 and 3.12 had a bug where if you tried to run a balance and a defrag at the same time, bad things could happen (lockups or even corrupted data)! Running just one at a time and letting it finish, then the other, should be fine. And later stable kernels of both 3.11 and 3.12 have that bug fixed (as does 3.13). But 3.11.0 is almost certainly still bugged in that regard, unless ubuntu backported the fix and didn't bump the kernel version. But because a full balance rewrites everything anyway, it'll effectively defrag too. So if you're going to do a balance, you can skip the defrag. =:^) And since it's likely to take hours at the terabyte scale on spinning rust, that's just as well. As for the space question, that's a whole different subject with its own convolutions. =:^\ Very briefly, the rule of thumb I use is that for partitions of sufficient size (several GiB low end), you always want btrfs filesystem show to have at LEAST enough unallocated space left to allocate one each data and metadata chunk. Data chunks default to 1 GiB, while metadata chunks default to 256 MiB, but because single-device metadata defaults to DUP mode, metadata chunks are normally allocated in pairs and that doubles to half a GiB. So you need at LEAST 1.5 GiB unallocated, in ordered to be sure balance can work, since it allocates a new chunk and writes into it from the old chunks, until it can free up the old chunks. Assuming you have large enough filesystems, I'd try to keep twice that, 3 GiB unallocated according to btrfs filesystem show, and would definitely recommend doing a rebalance any time it starts getting close to that. 
If you tend to have many multi-gig files, you'll probably want to keep enough unallocated space (rounded up to a whole gig, plus the 3 gig minimum I suggested above) around to handle at least one of those as well, just so you know you always have space available to move at least one of those if necessary, without using up your 3 gig safety margin. Beyond that, take a look at your btrfs filesystem df output. I already mentioned that data chunk size is 1 GiB, metadata 256 MiB (doubled to 512 MiB for default dup mode for a single device btrfs). So if data says something like total=248.00GiB, used=123.24GiB (example picked out of thin air), you know you're running a whole bunch of half empty chunks, and a balance should trim that down dramatically, to probably total=124.00GiB altho it's possible it might be 125.00GiB or something, but in any case it should be FAR closer to used than the twice-used figure in my example above. Any time total is more than a GiB above used, a balance is likely to be able to reduce it and return the extra to the unallocated pool. Of course the same applies to metadata, keeping in mind its default-dup, so you're effectively allocating in 512 MiB chunks for it. But any time total is more than 512 MiB above used, a balance will probably reduce it, returning the extra space to the unallocated pool. Of course single vs. dup on single devices, and multiple devices with all the different btrfs raid modes, throw various curves into the numbers given above. While it's reasonably straightforward to figure an individual case, explaining all the permutations gets quite complex. And while it's not supported yet, eventually btrfs is supposed to support different raid levels, etc, for different subvolumes, which will throw even MORE complexity into the thing! And obviously for small single- digit GiB partitions the rules must be adjusted, even more so for mixed- blockgroup, which is the default below 1 GiB but makes some sense in the single-digit GiB size range as well. But the reasonably large single- device default isn't /too/ bad, even if it takes a bit to explain, as I did here. Meanwhile, especially on spinning rust at terabyte sizes, those balances are going to take awhile, so you probably don't want to run them daily. And on SSDs, balances (and defrags and anything else for that matter) should go MUCH faster, but SSDs are limited-write-cycle, and any time you balance you're rewriting all that data and metadata, thus using up limited write cycles on all those gigs worth of blocks in one fell swoop! So either way, doing balances without any clear return probably isn't a good idea. But when the allocated space gets within a few gigs of total as shown by btrfs filesystem show, or when total gets multiple gigs above used as shown by btrfs filesystem df, it's time to consider a balance. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 31+ messages in thread
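For reference, the numbers referred to above come from these two commands (the mount point is only an example):

  btrfs filesystem show
  btrfs filesystem df /mnt/data

If "show" reports the device size nearly equal to its used (allocated) figure, or "df" reports total far above used for data or metadata, that is the point at which a balance is worth considering.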
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-02 8:38 ` Duncan @ 2014-01-03 1:24 ` Kai Krakow 2014-01-03 9:18 ` Duncan 0 siblings, 1 reply; 31+ messages in thread
From: Kai Krakow @ 2014-01-03 1:24 UTC (permalink / raw) To: linux-btrfs

Duncan <1i5t5.duncan@cox.net> wrote:

> But because a full balance rewrites everything anyway, it'll effectively defrag too.

Is that really true? I thought it just rewrites each distinct extent and shuffles chunks around... This would mean it does not merge extents together.

Regards, Kai

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-03 1:24 ` Kai Krakow @ 2014-01-03 9:18 ` Duncan 0 siblings, 0 replies; 31+ messages in thread From: Duncan @ 2014-01-03 9:18 UTC (permalink / raw) To: linux-btrfs Kai Krakow posted on Fri, 03 Jan 2014 02:24:01 +0100 as excerpted: > Duncan <1i5t5.duncan@cox.net> schrieb: > >> But because a full balance rewrites everything anyway, it'll >> effectively defrag too. > > Is that really true? I thought it just rewrites each distinct extent and > shuffels chunks around... This would mean it does not merge extents > together. While I'm not a coder and they're free to correct me if I'm wrong... With a full balance (there are now options allowing one to do only data, or only metadata, or for that matter only system, and do other filtering, say to rebalance only chunks less than 10% used or only those not yet converted to a new raid level, if desired, but we're talking a full balance here), all chunks are rewritten, merging data (or metadata) into fewer chunks if possible, eliminating the then unused chunks and returning the space they took to the unallocated pool. Given that everything is being rewritten anyway, a process that can take hours or even days on multi-terabyte spinning rust filesystems, /not/ doing a file defrag as part of the process would be stupid. So doing a separate defrag and balance isn't necessary. And while we're at it, doing a separate scrub and balance isn't necessary, for the same reason. (If one copy of the data is invalid and there's another, it'll be used for the rewrite and redup if necessary during the balance and the invalid copy will simply be erased. If there's no valid copy, then there will be balance errors and I believe the chunks containing the bad data are simply not rewritten at all, tho the valid data from them might be rewritten, leaving only the bad data (I'm not sure which, on that), thus allowing the admin to try other tools to clean up or recover from the damage as necessary.) That's one reason why the balance operation can take so much longer than a straight sequential read/write of the data might indicate, because it's doing all that extra work behind the scenes as well. Tho I'm not sure that it defrags across chunks, particularly if a file's fragments reach across enough chunks that they'd not have been processed by the time a written chunk is full and the balance progresses to the next one. However, given that data chunks are 1 GiB in size, that should still cut down a multi-thousand-extent file to perhaps a few dozen extents, one each per rewritten chunk. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 31+ messages in thread
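As an illustration of the filtered balance mentioned above, on a kernel and btrfs-progs new enough to support balance filters it would look roughly like this (the 10% threshold and mount point are only examples):

  btrfs balance start -dusage=10 /mnt/data

That rewrites only data chunks at most 10% full (-musage does the same for metadata chunks), so it reclaims slack allocation much faster than a full balance, but it also does far less of the incidental defragmentation discussed above.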
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-01 20:08 ` Sulla 2014-01-02 8:38 ` Duncan @ 2014-01-05 0:12 ` Sulla 1 sibling, 0 replies; 31+ messages in thread
From: Sulla @ 2014-01-05 0:12 UTC (permalink / raw) To: linux-btrfs

Oh gosh, I don't know what went wrong with my btrfs root filesystem, and I probably never will:

The "sudo btrfs balance start /" was running fine for about 4 or 5 hours, at a system load of ~3, while "btrfs balance status /" told me the balancing was on its way and had completed 19 out of 23 extents. At that moment the system load started to increase and increase and increase, and when it reached 147 (!!) (while top was showing me NOTHING was going on) I reset the computer. TTY1 showed some kernel panics and btrfs bug messages, but those messages were lost because they never made it to disk.

Fortunately my RAID5 stayed in sync and everything was fine. The system also booted, but with the same 120+ secs hangs as before. The system was unusable, as e.g. all IMAP logins timed out. So

* I booted into a live CD
* mounted a backup disk
* cp-ed all files of the root fs to the backup disk (it could read them flawlessly)
* reformatted the root partition as ext4 (yes, I feel sad about it)
* cp-ed all root files from the backup disk to the ext4 root filesystem
* removed the subvol=@ boot argument from /boot/grub/grub.cfg
* and rebooted my server.

How I love linux! Wouldn't be possible with M$!!

Now it's running fine again, the system is responsive as it should be. No clue 'bout what went wrong, though.

I still have /home and the huge data partitions on btrfs and plan to leave it so. While it would not be difficult to put /home on ext4, it would be a major effort to cp the ~3TB of data off and back onto the disks...

Thanx for your support, Sulla

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-01 12:37 ` Duncan 2014-01-01 20:08 ` Sulla @ 2014-01-03 17:25 ` Marc MERLIN 2014-01-03 21:34 ` Duncan 2014-01-04 20:48 ` Roger Binns 1 sibling, 2 replies; 31+ messages in thread From: Marc MERLIN @ 2014-01-03 17:25 UTC (permalink / raw) To: Duncan; +Cc: linux-btrfs First, a big thank you for taking the time to post this very informative message. On Wed, Jan 01, 2014 at 12:37:42PM +0000, Duncan wrote: > Apparently the way some distribution installation scripts work results in > even a brand new installation being highly fragmented. =:^( If in > addition they don't add autodefrag to the mount options used when > mounting the filesystem for the original installation, the problem is > made even worse, since the autodefrag mount option is designed to help > catch some of this sort of issue, and schedule the affected files for > auto-defrag by a separate thread. Assuming you can stomach a bit of occasional performance loss due to autodefrag, is there a reason not to always have this on btrfs filesystems in newer kernels? (let's say 3.12+)? Is there even a reason for this not to become a default mount option in newer kernels? > The NOCOW file attribute. > > Simple command form: > > chattr +C /path/to/file/or/directory Thank you for that tip, I had been unaware of it 'till now. This will make my virtualbox image directory much happier :) > Meanwhile, if there's a point at which the file exists in its more or > less permanent form and won't be written into any longer (a torrented > file is fully downloaded, or a VM image is backed up), sequentially > copying it elsewhere (possibly using cp --reflink=never if on the same > filesystem, to avoid a reflink copy pointing at the same fragmented > extents!), then deleting the original fragmented version, should > effectively defragment the file too. And since it's not being written > into any more at that point, it should stay defragmented. > > Or just btrfs filesystem defrag the individual file.. I know I can do the cp --reflink=never, but that will generate 100GB of new files and force me to drop all my hourly/daily/weekly snapshots, so file defrag is definitely a better option. > Finally, there's some more work going into autodefrag now, to hopefully > increase its performance, and make it work more efficiently on a bit > larger files as well. The goal is to eliminate the problems with > systemd's journal, among other things, now that it's known to be a common > problem, given systemd's widespread use and the fact that both systemd > and btrfs aim to be the accepted general Linux default within a few years. Is there a good guideline on which kinds of btrfs filesystems autodefrag is likely not a good idea, even if the current code does not have optimal performance? I suppose fragmented files that are deleted soon after being written are a loss, but otherwise it's mostly a win. Am I missing something? Unfortunately, on a 83GB vdi (virtualbox) file, with 3.12.5, it did a lot of writing and chewed up my 4 CPUs. Then, it started to be hard to move my mouse cursor and my procmeter graph was barely updating seconds. Next, nothing updated on my X server anymore, not even seconds in time widgets. But, I could still sometimes move my mouse cursor, and I could sometimes see the HD light fliker a bit before going dead again. 
In other words, the system wasn't fully deadlocked, but btrfs sure got into a state where it was unable to finish the job, and took the kernel down with it (64bit, 8GB of RAM).

I waited 2H and it never came out of it, I had to power down the system in the end. Note that this was on a top of the line 500MB/s write Samsung Evo 840 SSD, not a slow HD.

I think I had enough free space:
Label: 'btrfs_pool1' uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
Total devices 1 FS bytes used 732.14GB
devid 1 size 865.01GB used 865.01GB path /dev/dm-0

Is it expected behaviour for defrag to lock up on big files? Should I have had more spare free space for it to work? Other?

On the plus side, the file I was trying to defragment, and which hung my system, was not corrupted by the process.

Any idea what I should try from here?

Thanks, Marc
-- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-03 17:25 ` Marc MERLIN @ 2014-01-03 21:34 ` Duncan 2014-01-05 6:39 ` Marc MERLIN 2014-01-08 3:22 ` Marc MERLIN 2014-01-04 20:48 ` Roger Binns 1 sibling, 2 replies; 31+ messages in thread From: Duncan @ 2014-01-03 21:34 UTC (permalink / raw) To: linux-btrfs Marc MERLIN posted on Fri, 03 Jan 2014 09:25:06 -0800 as excerpted: > First, a big thank you for taking the time to post this very informative > message. > > On Wed, Jan 01, 2014 at 12:37:42PM +0000, Duncan wrote: >> Apparently the way some distribution installation scripts work results >> in even a brand new installation being highly fragmented. =:^( If in >> addition they don't add autodefrag to the mount options used when >> mounting the filesystem for the original installation, the problem is >> made even worse, since the autodefrag mount option is designed to help >> catch some of this sort of issue, and schedule the affected files for >> auto-defrag by a separate thread. > > Assuming you can stomach a bit of occasional performance loss due to > autodefrag, is there a reason not to always have this on btrfs > filesystems in newer kernels? (let's say 3.12+)? > > Is there even a reason for this not to become a default mount option in > newer kernels? For big "internal write" files, autodefrag isn't yet well tuned, because it effectively write-magnifies too much, forcing rewrite of the entire file for just a small change. If whatever app is more or less constantly writing those small changes, faster than the file can be rewritten... I don't know where the break-over might be, but certainly, multi-gig sized IO-active VMs images or databases aren't something I'd want to use it with. That's where the NOCOW thing will likely work better. IIRC someone also mentioned problems with autodefrag and an about 3/4 gig systemd journal. My gut feeling (IOW, *NOT* benchmarked!) is that double- digit MiB files should /normally/ be fine, but somewhere in the lower triple digits, write-magnification could well become an issue, depending of course on exactly how much active writing the app is doing into the file. As I said there's more work going into tuning autodefrag ATM, but as it is, I couldn't really recommend making it a global default... tho maybe a distro could enable it by default on a no-VM desktop system (as opposed to a server). Certainly I'd recommend most desktop types enable it. >> The NOCOW file attribute. >> >> Simple command form: >> >> chattr +C /path/to/file/or/directory > > Thank you for that tip, I had been unaware of it 'till now. > This will make my virtualbox image directory much happier :) I think I said it, but it bears repeating. Once you set that attribute on the dir, you may want to move the files out of the dir (to another partition would make sure the data is actually moved) and back in, so they're effectively new files in the dir. Or use something like cat oldfile > newfile, so you know it's actually creating the new file, not reflinking. That'll ensure the NOCOW takes effect. > Unfortunately, on a 83GB vdi (virtualbox) file, with 3.12.5, it did a > lot of writing and chewed up my 4 CPUs. Then, it started to be hard to > move my mouse cursor and my procmeter graph was barely updating seconds. > Next, nothing updated on my X server anymore, not even seconds in time > widgets. > > But, I could still sometimes move my mouse cursor, and I could sometimes > see the HD light fliker a bit before going dead again. 
> In other words, the system wasn't fully deadlocked, but btrfs sure got into a state where it was unable to finish the job, and took the kernel down with it (64bit, 8GB of RAM).
>
> I waited 2H and it never came out of it, I had to power down the system in the end. Note that this was on a top of the line 500MB/s write Samsung Evo 840 SSD, not a slow HD.

That was defrag (the command) or autodefrag (the mount option)? I'd guess defrag (the command).

That's fragmentation for you! What did/does filefrag have to say about that file? Were you the one that posted the 6-digit extents?

For something that bad, it might be faster to copy/move it off-device (expect it to take awhile), then move it back. That way you're only trying to read OR write on the device, not both, and the move elsewhere should defrag it quite a bit, effectively sequential write, then read and write on the move back.

But even that might be prohibitive. At some point, you may need to either simply give up on it (if you're lazy), or get down and dirty with the tracing/profiling, working with a dev to figure out where it's spending its time and hopefully get btrfs recoded to work a bit faster for that sort of thing.

> I think I had enough free space:
> Label: 'btrfs_pool1' uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
> Total devices 1 FS bytes used 732.14GB
> devid 1 size 865.01GB used 865.01GB path /dev/dm-0
>
> Is it expected behaviour for defrag to lock up on big files? Should I have had more spare free space for it to work? Other?

From my understanding it's not the file size, but the number of fragments. I'm guessing you simply overwhelmed the system. Ideally you never let it get that bad in the first place. =:^(

As I suggested above, you might try the old school method of defrag: move the file to a different device, then move it back. And if possible do it when nothing else is using the system. But it may simply be practically inaccessible with a current kernel, in which case you'd either have to work with the devs to optimize, or give it up as a lost cause. =:(

> On the plus side, the file I was trying to defragment, and which hung my system, was not corrupted by the process.
>
> Any idea what I should try from here?

Beyond the above, it's "let the devs hack on it" time. =:^\

One other /narrow/ possibility if you're desperate. You could try splitting the file into chunks (generic term, not btrfs chunks) of some arbitrary shorter size, and copying them out. If you split into say 10 parts, then each piece should take roughly a tenth of the time, altho more fragmented areas will likely take longer. But by splitting into say 100 parts (which would be ~830 MiB apiece), you could at least see the progress and if there was one particular area where it suddenly got a lot worse. I know there's tools for that sort of thing, but I'm not enough into forensics to know much about them...

Then if the process completed successfully, you could cat the parts back together again... and the written parts would be basically sequential, so that should go MUCH faster! =:^)

-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman

^ permalink raw reply [flat|nested] 31+ messages in thread
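One way to do the splitting described above with standard tools; the .vdi name follows Marc's example and the destination path is hypothetical:

  split -b 1G Win7.vdi /mnt/other-disk/Win7.part.
  cat /mnt/other-disk/Win7.part.* > Win7.vdi.new

The reassembled Win7.vdi.new is written sequentially, so it should end up with far fewer extents; it can then replace the original, ideally inside a directory already marked with chattr +C.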
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-03 21:34 ` Duncan @ 2014-01-05 6:39 ` Marc MERLIN 2014-01-05 17:09 ` Chris Murphy 2014-01-08 3:22 ` Marc MERLIN 1 sibling, 1 reply; 31+ messages in thread
From: Marc MERLIN @ 2014-01-05 6:39 UTC (permalink / raw) To: Duncan; +Cc: linux-btrfs

On Fri, Jan 03, 2014 at 09:34:10PM +0000, Duncan wrote:
> > Thank you for that tip, I had been unaware of it 'till now.
> > This will make my virtualbox image directory much happier :)
>
> I think I said it, but it bears repeating. Once you set that attribute on the dir, you may want to move the files out of the dir (to another partition would make sure the data is actually moved) and back in, so they're effectively new files in the dir. Or use something like cat oldfile > newfile, so you know it's actually creating the new file, not reflinking. That'll ensure the NOCOW takes effect.

Yes, I got that. That's why I ran btrfs defrag on the files after that (I explained why: a copy would waste lots of snapshot space by replacing all the blocks needlessly).

> > Unfortunately, on a 83GB vdi (virtualbox) file, with 3.12.5, it did a lot of writing and chewed up my 4 CPUs. Then, it started to be hard to move my mouse cursor and my procmeter graph was barely updating seconds. Next, nothing updated on my X server anymore, not even seconds in time widgets.
> >
> > But, I could still sometimes move my mouse cursor, and I could sometimes see the HD light fliker a bit before going dead again. In other words, the system wasn't fully deadlocked, but btrfs sure got into a state where it was unable to finish the job, and took the kernel down with it (64bit, 8GB of RAM).
> >
> > I waited 2H and it never came out of it, I had to power down the system in the end. Note that this was on a top of the line 500MB/s write Samsung Evo 840 SSD, not a slow HD.
>
> That was defrag (the command) or autodefrag (the mount option)? I'd guess defrag (the command).

defrag, the btrfs subcommand.

> That's fragmentation for you! What did/does filefrag have to say about that file? Were you the one that posted the 6-digit extents?

Nope, I never posted anything until now. Hopefully you agree that it's not ok for btrfs/the kernel to just kill my system for over 2H, until I power it off, because of defragging one file. I can live with a severe performance hit, as long as it's not a never-ending loop.

gandalfthegreat:/var/local/nobck/VirtualBox VMs/Win7# filefrag Win7.vdi
Win7.vdi: 156222 extents found

Considering how virtualbox works, that's hardly surprising.

> For something that bad, it might be faster to copy/move it off-device (expect it to take awhile), then move it back. That way you're only trying to read OR write on the device, not both, and the move elsewhere should defrag it quite a bit, effectively sequential write, then read and write on the move back.

Yes, I know how I can work around the problem (although I'll likely have to delete all my historical snapshots to delete the old blocks, which I don't love to do). But doesn't it make sense to see why the kernel is near deadlocking on a single file defrag first?

> But even that might be prohibitive. At some point, you may need to either simply give up on it (if you're lazy), or get down and dirty with the tracing/profiling, working with a dev to figure out where it's spending its time and hopefully get btrfs recoded to work a bit faster for that sort of thing.
I'm on my way to a linux conf where I'm speaking, so I have limited time and can't crash my laptop, but I'm happy to type some commands and give output. > As I suggested above, you might try the old school method of defrag, move > the file to a different device, then move it back. And if possible do it > when nothing else is using the system. But it may simply be practically > inaccessible with a current kernel, in which case you'd either have to > work with the devs to optimize, or give it up as a lost cause. =:( I can fix my problem, actually virtualbox works fine with the fragmented file, without even feeling slow, so really I don't need to fix it urgently, I was just trying it out after your post. > Then if the process completed successfully, you could cat the parts back > together again... and the written parts would be basically sequential, so > that should go MUCH faster! =:^) All that noted, but I'm not desperate, just trying commands I hadn't tried yet :) Thanks for your replies, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 6:39 ` Marc MERLIN @ 2014-01-05 17:09 ` Chris Murphy 2014-01-05 17:54 ` Jim Salter 0 siblings, 1 reply; 31+ messages in thread From: Chris Murphy @ 2014-01-05 17:09 UTC (permalink / raw) To: Btrfs BTRFS On Jan 4, 2014, at 11:39 PM, Marc MERLIN <marc@merlins.org> wrote: > > Nope, I never posted anything until now. Hopefully you agree that it's > not ok for btrfs/kernel to just kill my system for over 2H until I power > it off before of defragging one file. I did hit a severe performance but > if it's not a never ending loop. > > gandalfthegreat:/var/local/nobck/VirtualBox VMs/Win7# filefrag Win7.vdi > Win7.vdi: 156222 extents found > > Considering how virtualbox works, that's hardly surprising. I haven't read anything so far indicating defrag applies to the VM container use case, rather nodatacow via xattr +C is the way to go. At least for now. > > But doesn't it make sense to see why the kernel is near deadlocking on a > single file defrag first? It's better than a panic or corrupt data. So far the best combination I've found, open to other suggestions though, is +C xattr on /var/lib/libvirt/images, creating non-preallocated qcow2 files, and snapshotting the qcow2 file with qemu-img. Granted when sysroot is snapshot, I'm making btrfs snapshots of these qcow2 files. Another option is to make /var/lib/libvirt/images a subvolume, and then when sysroot is snapshot, then /var/lib/libvirt/images is immune to being snapshot automatically with the parent subvolume. I'd have to explicitly snapshot it. This may be a better way to go to avoid accumulation of btrfs snapshots of qcow2 snapshot files. This may already be a known problem but it's worth sysrq+w, and then dmesg and posting those results if you haven't already. Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
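For anyone unfamiliar with it, the sysrq+w suggestion above can be triggered without a console keyboard, roughly like this (run as root; sysrq may first need to be enabled):

  echo 1 > /proc/sys/kernel/sysrq
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 200

That dumps the stack of every task stuck in uninterruptible sleep into the kernel log, which is the information developers need for hangs like the one described in this thread.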
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 17:09 ` Chris Murphy @ 2014-01-05 17:54 ` Jim Salter 2014-01-05 19:57 ` Duncan 0 siblings, 1 reply; 31+ messages in thread From: Jim Salter @ 2014-01-05 17:54 UTC (permalink / raw) To: Chris Murphy, Btrfs BTRFS On 01/05/2014 12:09 PM, Chris Murphy wrote: > I haven't read anything so far indicating defrag applies to the VM > container use case, rather nodatacow via xattr +C is the way to go. At > least for now. Can you elaborate on the rationale behind database or VM binaries being set nodatacow? I experimented with this*, and found no significant (to me, anyway) performance enhancement with nodatacow on - maybe 10% at best, and if I understand correctly, that implies losing the live per-block checksumming of the data that's set nodatacow, meaning you won't get automatic correction if you're on a redundant array. All I've heard so far is "better performance" without any more detailed explanation, and if the only benefit is an added MAYBE 10%ish performance... I'd rather take the hit, personally. * "experimented with this" == set up a Win2008R2 test VM and ran HDTunePro for several runs on binaries stored with and without nodatacow set, 5G of random and sequential read and write access per run. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 17:54 ` Jim Salter @ 2014-01-05 19:57 ` Duncan 2014-01-05 20:44 ` Chris Murphy 0 siblings, 1 reply; 31+ messages in thread From: Duncan @ 2014-01-05 19:57 UTC (permalink / raw) To: linux-btrfs Jim Salter posted on Sun, 05 Jan 2014 12:54:44 -0500 as excerpted: > On 01/05/2014 12:09 PM, Chris Murphy wrote: >> I haven't read anything so far indicating defrag applies to the VM >> container use case, rather nodatacow via xattr +C is the way to go. At >> least for now. Well, NOCOW from the get-go would certainly be better, but given that the file is already there and heavily fragmented, my idea was to get it defragmented and then set the +C, to prevent it recurring. But I do very little snapshotting here, and as a result hadn't considered the knock-on effect of 100K-plus extents in perhaps 1000 snapshots. I guess that's what's killing the defrag, however it's initiated. The only way to get rid of the problem, then, would be to move the file away and then back, but doing so does still leave all those snapshots with the crazy fragmentation, and to kill that would require either killing all those snapshots, or setting them writable and doing the same move out, move back, on each one! OUCH, but I guess that's why it just seems impossible to deal with the fragmentation on these things, whether it's autodefrag, or named file defrag, or doing the whole move out and back thing, and then having to worry about all those snapshots. Still, I'd guess ultimately it'll need to be done, whether it's a wipe-the-filesystem-and-restore-from-backup or whatever. > Can you elaborate on the rationale behind database or VM binaries being > set nodatacow? I experimented with this*, and found no significant (to > me, > anyway) performance enhancement with nodatacow on - maybe 10% at best, > and if I understand correctly, that implies losing the live per-block > checksumming of the data that's set nodatacow, meaning you won't get > automatic correction if you're on a redundant array. > > All I've heard so far is "better performance" without any more detailed > explanation, and if the only benefit is an added MAYBE 10%ish > performance... I'd rather take the hit, personally. > > * "experimented with this" == set up a Win2008R2 test VM and ran > HDTunePro for several runs on binaries stored with and without nodatacow > set, 5G of random and sequential read and write access per run. Well, the problem isn't just performance, it's that in most such cases the apps actually have their own data integrity checking and management, and sometimes the app's integrity management and that of btrfs end up fighting each other, destroying the data as a result. In normal operation, everything's fine. But should the system crash at the wrong moment, btrfs' atomic commit and data integrity mechanisms can roll back to a slightly earlier version of the file. Which is normally fine. But because hardware is known to often lie about having committed writes that may actually still only be in buffer, if the power outage/crash occurred at the wrong moment, ordinary write-barrier ordering guarantees may be invalid (particularly on large files with finite-seek-speed devices), and the app's own integrity checksum may have been updated before the data it was supposed to be a checksum on actually got to disk. If btrfs ends up rolling back to that condition, btrfs will likely consider the file fine, but the app's own integrity management will consider it corrupted, which it actually is.
But if btrfs only stays out of the way, the application often can fix whatever minor corruption it detects, doing its own roll-backs to an earlier checkpoint, because it's /designed/ to be able to handle such problems on filesystems that don't have integrity management. So having btrfs trying to manage integrity too on such data where the app already handles it is self-defeating, because neither knows about nor considers what the other one is doing, and the two end up undoing each other's careful work. Again, this isn't something you'll see in normal operation, but several people have reported exactly that sort of problem with the general large-internally-written-file, application-self-managed-file-integrity scenario. In those cases, the best thing btrfs can do is simply get out of the way and let the application handle its own integrity management, and the way to tell btrfs to do that, as well as to do in-place rewrites instead of COW-based rewrites, is with the NOCOW xattrib, chattr +C, and that must be done before the file gets so fragmented (and multi-snapshotted in its fragmented state) in the first place. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 31+ messages in thread
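To illustrate the point that the flag has to be in place before the data lands, a sketch (file names are hypothetical); setting +C on an already-written, fragmented file does not help:

touch fresh-copy.img                 # create the file empty
chattr +C fresh-copy.img             # +C only takes reliable effect on new/empty files
cat old-copy.img > fresh-copy.img    # data is now written into a NOCOW file
lsattr fresh-copy.img                # verify the C flag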
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 19:57 ` Duncan @ 2014-01-05 20:44 ` Chris Murphy 0 siblings, 0 replies; 31+ messages in thread From: Chris Murphy @ 2014-01-05 20:44 UTC (permalink / raw) To: Btrfs BTRFS, Duncan On Jan 5, 2014, at 12:57 PM, Duncan <1i5t5.duncan@cox.net> wrote: > > But I do very little snapshotting here, and as a result hadn't considered > the knockon effect of 100K-plus extents in perhaps 1000 snapshots. I wonder if this is an issue with snapshot aware defrag? Some problems were fixed recently but I'm not sure of the status. The OP's case involves Btrfs on LVM on (I think) md raid5. The mdadm default stripe size is 512KB, which would be a 1MB full stripe. There are some optimizations for non-full stripe reads and writes for raid5 (not for raid6 so it takes a much bigger performance hit) but nevertheless it might be a factor. > I > guess that's what's killing the defrag, however it's initiated. The only > way to get rid of the problem, then, would be to move the file away and > then back, but doing so does still leave all those snapshots with the > crazy fragmentation, and to kill that would require either killing all > those snapshots, or setting them writable and doing the same move out, > move back, on each one! OUCH, but I guess that's why it just seems > impossible to deal with the fragmentation on these things, whether it's > autodefrag, or named file defrag, or doing the whole move out and back > thing, and then having to worry about all those snapshots. It's why in the short term I'm using +C from the get go. And if I had more VM images and qcow2 snapshots, I would put them in a subvolume of their own so that they aren't snapshotted along with rootfs. Using Btrfs within the VM I still get the features I expect and the performance is quite good. Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
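For reference, the chunk size in question can be read straight off a running array; the device name is just an example:

mdadm -D /dev/md1 | grep -i 'chunk size'
cat /proc/mdstat                     # also shows the chunk size per array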
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-03 21:34 ` Duncan 2014-01-05 6:39 ` Marc MERLIN @ 2014-01-08 3:22 ` Marc MERLIN 2014-01-08 9:45 ` Duncan 1 sibling, 1 reply; 31+ messages in thread From: Marc MERLIN @ 2014-01-08 3:22 UTC (permalink / raw) To: Duncan, Chris Murphy; +Cc: linux-btrfs, Jim Salter On Fri, Jan 03, 2014 at 09:34:10PM +0000, Duncan wrote: > IIRC someone also mentioned problems with autodefrag and an about 3/4 gig > systemd journal. My gut feeling (IOW, *NOT* benchmarked!) is that double- > digit MiB files should /normally/ be fine, but somewhere in the lower > triple digits, write-magnification could well become an issue, depending > of course on exactly how much active writing the app is doing into the > file. When I defrag'ed my 83GB vm file with 156222 extents, it was not in use or being written to. > As I said there's more work going into tuning autodefrag ATM, but as it > is, I couldn't really recommend making it a global default... tho maybe a > distro could enable it by default on a no-VM desktop system (as opposed > to a server). Certainly I'd recommend most desktop types enable it. I use VMs on my desktop :) but point taken. On Sun, Jan 05, 2014 at 10:09:38AM -0700, Chris Murphy wrote: > > gandalfthegreat:/var/local/nobck/VirtualBox VMs/Win7# filefrag Win7.vdi > > Win7.vdi: 156222 extents found > > > > Considering how virtualbox works, that's hardly surprising. > > I haven't read anything so far indicating defrag applies to the VM container use case, rather nodatacow via xattr +C is the way to go. At least for now. Yep, I'll convert the file, but since I found a pretty severe performance problem, does anyone care to get details off my system before I make the problem go away for me? > It's better than a panic or corrupt data. So far the best combination To be honest, I'd have taken a panic, it would have saved me 2H of waiting for a laptop to recover when it was never going to recover :( Data corruption, sure, obviously :) > I've found, open to other suggestions though, is +C xattr on So you're saying that defragmentation has known performance problems that can't get fixed for now, and that the solution is not to get fragmented or recreate the relevant files. If so, I'll go ahead, I just wanted to make sure I didn't have useful debug state before clearing my problem. > This may already be a known problem but it's worth sysrq+w, and then dmesg and posting those results if you haven't already. No, I had not yet, but I'll do this. On Sun, Jan 05, 2014 at 01:44:25PM -0700, Duncan wrote: > [I normally try to reply directly to list but don't believe I've seen > this there yet, but got it direct-mailed so will reply-all in response.] I like direct Cc on replies, makes my filter and mutt coloring happier :) Dupes with the same message-id are what procmail and others were written for :) > I now believe the lockup must be due to processing the hundreds of > thousands of extents on all those snapshots, too, in addition to doing That's a good call. I do have this: gandalfthegreat:/mnt/btrfs_pool1# ls var var/ var_hourly_20140105_16:00:01/ var_daily_20140102_00:01:01/ var_hourly_20140105_17:00:26/ var_daily_20140103_00:59:28/ var_weekly_20131208_00:02:02/ var_daily_20140104_00:01:01/ var_weekly_20131215_00:02:01/ var_daily_20140105_00:33:14/ var_weekly_20131229_00:02:02/ var_hourly_20140105_05:00:01/ var_weekly_20140105_00:33:14/ > it on the main volume. 
I don't actually make very extensive use of > snapshots here anyway, so I didn't think about that aspect originally, > but that's gotta be what's throwing the real spanner in the works, > turning a possibly long but workable normal defrag (O(1)) into a lockup > scenario (O(n)) where virtually no progress is made as currently > coded. That is indeed what I'm seeing, so it's very possible you're right. Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901 ^ permalink raw reply [flat|nested] 31+ messages in thread
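Those snapshots can also be enumerated and removed through btrfs itself rather than ls; a sketch, assuming the snapshots sit directly in the pool top level as the listing above suggests:

btrfs subvolume list -s /mnt/btrfs_pool1                                # -s limits the listing to snapshots
btrfs subvolume delete /mnt/btrfs_pool1/var_weekly_20131208_00:02:02   # drops one old snapshot (irreversible)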
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-08 3:22 ` Marc MERLIN @ 2014-01-08 9:45 ` Duncan 0 siblings, 0 replies; 31+ messages in thread From: Duncan @ 2014-01-08 9:45 UTC (permalink / raw) To: linux-btrfs Marc MERLIN posted on Tue, 07 Jan 2014 19:22:58 -0800 as excerpted: > On Fri, Jan 03, 2014 at 09:34:10PM +0000, Duncan wrote: >> IIRC someone also mentioned problems with autodefrag and an about 3/4 >> gig systemd journal. My gut feeling (IOW, *NOT* benchmarked!) is that >> double-digit MiB files should /normally/ be fine, but somewhere in the >> lower triple digits, write-magnification could well become an issue, >> depending of course on exactly how much active writing the app is doing >> into the file. > > When I defrag'ed my 83GB vm file with 156222 extents, it was not in use > or being written to. Note the scale... I said double-digit _MiB_ should be fine, but somewhere in the triple-digits write magnification likely becomes a problem (this based on my memory of someone mentioning an issue with a 3/4 gig systemd journal file). You then say 83 _GB_, which may or may not be GiB, but either way, it's three orders of magnitude above the scale I said should be fine, and two orders of magnitude above the scale at which I said problems likely start appearing. So problems at that size are a given. > On Sun, Jan 05, 2014 at 10:09:38AM -0700, Chris Murphy wrote: >> I've found, open to other suggestions though, is +C xattr on > > So you're saying that defragmentation has known performance problems > that can't get fixed for now, and that the solution is not to get > fragmented or recreate the relevant files. > If so, I'll go ahead, I just wanted to make sure I didn't have useful > debug state before clearing my problem. Basically, yes. One of the devs said he's just starting to focus on it again now. So it's a known issue that'll take some work to make better. However, since he's focusing on it again now, now's the time to report stuff like the sysrq+w trace mentioned. > On Sun, Jan 05, 2014 at 01:44:25PM -0700, Duncan wrote: >> [I normally try to reply directly to list but don't believe I've seen >> this there yet, but got it direct-mailed so will reply-all in >> response.] > > I like direct Cc on replies, makes my filter and mutt coloring happier > :) > Dupes with the same message-id are what procmail and others were written > for :) Some of us think this sort of list works best as a public newsgroup... such distributed discussion is what they were designed for, after all... and that keeps it separate from actual email. That's where gmane.org comes in with its list2news (as well as list2web) archiving service. We subscribe to our lists as newsgroups there, use a news/nntp client for it, and save our email client for actually handling (more private) email. If you watch, you'll see links to particular messages on the gmane web interface posted from time to time. For those using gmane's list2news service (and obviously for those using its web interface as well) that's real easy, since gmane adds a header with the web link to messages it serves on the news interface as well. I've been using gmane for perhaps a decade now, but apparently it's more popular for people on this list than I might have expected from other lists, since I see more of those gmane web links posted. But I've also noticed that a lot more people on this list want CCed/ direct-mailed too, not just to read it on the list. 
I generally do that when I see the explicit request, but /only/ when I see the explicit request. >> I now believe the lockup must be due to processing the hundreds of >> thousands of extents on all those snapshots, too > > That's a good call. I do have this: > gandalfthegreat:/mnt/btrfs_pool1# ls var var/ > var_hourly_20140105_16:00:01/ var_daily_20140102_00:01:01/ > var_hourly_20140105_17:00:26/ var_daily_20140103_00:59:28/ > var_weekly_20131208_00:02:02/ var_daily_20140104_00:01:01/ > var_weekly_20131215_00:02:01/ var_daily_20140105_00:33:14/ > var_weekly_20131229_00:02:02/ var_hourly_20140105_05:00:01/ > var_weekly_20140105_00:33:14/ > >> I don't actually make very extensive use of >> snapshots here anyway, so I didn't think about that aspect originally, >> but that's gotta be what's throwing the real spanner in the works, >> turning a possibly long but workable normal defrag (O(1)) into a lockup >> scenario (O(n)) where virtually no progress is made as currently coded. > > That is indeed what I'm seeing, so it's very possible you're right. That's where the evidence is pointing, ATM. Hopefully the defrag work they're doing now will turn snapshotted defrag back into O(1), too. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-03 17:25 ` Marc MERLIN 2014-01-03 21:34 ` Duncan @ 2014-01-04 20:48 ` Roger Binns 1 sibling, 0 replies; 31+ messages in thread From: Roger Binns @ 2014-01-04 20:48 UTC (permalink / raw) To: linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 03/01/14 09:25, Marc MERLIN wrote: > Is there even a reason for this not to become a default mount option > in newer kernels? autodefrag can go insane because it is unbounded. For example I have a 4GB RAM system (3.12, no gui) that kept hanging. I eventually managed to work out the cause being a MySQL database (about 750MB of data only being used by tt-rss refreshing RSS feeds every 4 hours). autodefrag would eventually consume all the RAM and 20GB of swap kicking off the OOM killer and with so little RAM left for anything else that the only recourse was sysrq keys. What I'd love to see is some sort of background worker that does sensible things. For example it could defragment files, but pick the ones that need it the most, and I'd love to see extra copies of (meta)data in currently unused space that is freed as needed. deduping is another worthwhile option. So is recompressing data that hasn't changed recently but using larger block sizes to get more effective ratios. Some of these happen at the moment but they are independent and you have to be aware of the caveats. Roger -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.14 (GNU/Linux) iEYEARECAAYFAlLIc6wACgkQmOOfHg372QQgjgCeJp1sZQ0+Y7WRGE+U+IFljiDY MgQAnjEBspyJZvTC2caEn1Qkn942vPQ2 =rhNY -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2013-12-31 11:46 btrfs-transaction blocked for more than 120 seconds Sulla 2014-01-01 12:37 ` Duncan @ 2014-01-02 8:49 ` Jojo 2014-01-05 20:32 ` Chris Murphy 2 siblings, 0 replies; 31+ messages in thread From: Jojo @ 2014-01-02 8:49 UTC (permalink / raw) To: Sulla, linux-btrfs On 31.12.2013 12:46, Sulla wrote: > Dear all! > > On my Ubuntu Server 13.10 I use a RAID5 blockdevice consisting of 3 WD20EARS > drives. On this I built a LVM and in this LVM I use quite normal partitions > /, /home, SWAP (/boot resides on a RAID1.) and also a custom /data > partition. Everything (except boot and swap) is on btrfs. > > sometimes my system hangs for quite some time (top is showing a high wait > percentage), then runs on normally. I get kernel messages into > /var/log/sylsog, see below. I am unable to make any sense of the kernel > messages, there is no reference to the filesystem or drive affected (at > least I can not find one). > > Question: What is happening here? > * Is a HDD failing (smart looks good, however) > * Is something wrong with my btrfs-filesystem? with which one? > * How can I find the cause? > Hi Wolfgang, first, off topic: Happy New Year! Over the holidays one of our servers (ubuntu 13.04) with custom kernel 3.11.04 did quite similar things, also raid5/raid6. Our problem was that writing to the backup showed much the same kernel log; btrfs-transaction was hanging there too, and filesystem usage at 83% looked fine. But that was not true. After some time-consuming investigation I found that BTRFS in 3.11.x and other kernels(?) may have a problem with free block lists and fragmentation. Our server was able to recover on its own after a defragmentation and compression run. We had run out of free blocks. After rebuilding the free block list and running defrag the server got enough free blocks to operate well. To be able to do that, we were forced to use the btrfs-git kernel and also the btrfs-progs from git. (3.13-rcX) On 26.12.13 I did: # umount /ar # btrfsck --repair --init-extent-tree /dev/sda1 # mount -o clear_cache,skip_balance,autodefrag /dev/sda1 /ar # btrfs fi defragment -rc /ar/backup But beware: I thought 83% used space should leave enough "free blocks", but this was wrong. It seems that the BTRFS free block lists are somewhat erroneous. Especially "balance" may crash if a file has too many extents/fragments, and allocating space may also hang if free blocks are running low. During the defragmentation run the server's responses got slow, but read access never stopped. Our state today: root@bk:~# df -m /ar Filesystem 1M-blocks Used Available Use% Mounted on /dev/sda1 13232966 7213717 3181874 70% /ar root@bk:~# btrfs fi show /ar Label: Archiv+Backup uuid: 72b710aa-49a0-4ff5-a470-231560bfee81 Total devices 5 FS bytes used 6.88TiB devid 1 size 2.73TiB used 2.70TiB path /dev/sda1 devid 2 size 2.73TiB used 2.70TiB path /dev/sdb1 devid 3 size 2.73TiB used 2.70TiB path /dev/sdc1 devid 4 size 2.73TiB used 2.70TiB path /dev/sdd1 devid 5 size 1.70TiB used 4.25GiB path /dev/sde4 Btrfs v3.12 root@bk:~# btrfs fi df /ar Data, single: total=8.00MiB, used=0.00 Data, RAID5: total=8.10TiB, used=6.87TiB System, single: total=4.00MiB, used=0.00 System, RAID5: total=12.00MiB, used=600.00KiB Metadata, single: total=8.00MiB, used=0.00 Metadata, RAID5: total=12.25GiB, used=10.41GiB Today the server has completely recovered to full operation. Is there a plan to handle such out-of-free-blocks/space situations more gracefully?
TIA J. Sauer -- Jürgen Sauer - automatiX GmbH, +49-4209-4699, juergen.sauer@automatix.de Geschäftsführer: Jürgen Sauer, Gerichtstand: Amtsgericht Walsrode • HRB 120986 Ust-Id: DE191468481 • St.Nr.: 36/211/08000 GPG Public Key zur Signaturprüfung: http://www.automatix.de/juergen_sauer_publickey.gpg ^ permalink raw reply [flat|nested] 31+ messages in thread
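A sketch of the checks implied here, before trusting a percent-used figure: compare allocated chunks against device size, and (on kernels with balance filters) reclaim nearly empty data chunks. The mount point matches the mail above; the usage threshold is only an example:

btrfs fi show /ar                    # per-device size vs. allocated ("used") space
btrfs fi df /ar                      # allocation vs. real usage per profile
btrfs balance start -dusage=5 /ar    # rewrite data chunks that are <=5% full to free them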
* Re: btrfs-transaction blocked for more than 120 seconds 2013-12-31 11:46 btrfs-transaction blocked for more than 120 seconds Sulla 2014-01-01 12:37 ` Duncan 2014-01-02 8:49 ` Jojo @ 2014-01-05 20:32 ` Chris Murphy 2014-01-05 21:17 ` Sulla 2 siblings, 1 reply; 31+ messages in thread From: Chris Murphy @ 2014-01-05 20:32 UTC (permalink / raw) To: Sulla; +Cc: linux-btrfs On Dec 31, 2013, at 4:46 AM, Sulla <Sulla@gmx.at> wrote: > Dear all! > > On my Ubuntu Server 13.10 I use a RAID5 blockdevice consisting of 3 WD20EARS Sulla is this md raid5? If so can you report the result from mdadm -D <mddevice>, I'm curious what the chunk size is. Thanks. Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 20:32 ` Chris Murphy @ 2014-01-05 21:17 ` Sulla 2014-01-05 22:36 ` Brendan Hide 2014-01-05 23:48 ` Chris Murphy 0 siblings, 2 replies; 31+ messages in thread From: Sulla @ 2014-01-05 21:17 UTC (permalink / raw) To: Chris Murphy; +Cc: linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Dear Chris! Certainly: I have 3 HDDs, all of which WD20EARS. Originally I wanted to let btrfs handle all 3 devices directly without making partitions, but this was impossible, as at least /boot needed to be ext4, at least back then when I set up the server. And back then btrfs also hadn't raid5-like functionality, so I decided to put good old partitions and md-Raids and LVM on them and use btrfs just as plain file-systems on the partitions provided by LVM. On the WD disks I thus created 2 partitions each, the first sdX1 being ~500MiB, the rest, 1.9995 TiB is one partition of, sdX2. I built a Raid1 on the 3 small partitions sdX1 with ext4 for boot, each disk is bootable with grub installed into the MBR. I combined the 3 large partitions to a Raid5 of size 3,64TB: /proc/mdstat reads: md0 : active raid1 sda1[5] sdb1[4] sdc1[3] 498676 blocks super 1.2 [3/3] [UUU] md1 : active raid5 sda2[5] sdb2[4] sdc2[3] 3904907520 blocks super 1.2 level 5, 8k chunk, algorithm 2 [3/3] [UUU] the information you requested: # sudo mdadm -D /dev/md1 /dev/md1: Version : 1.2 Creation Time : Thu Jul 14 18:49:25 2011 Raid Level : raid5 Array Size : 3904907520 (3724.01 GiB 3998.63 GB) Used Dev Size : 1952453760 (1862.01 GiB 1999.31 GB) Raid Devices : 3 Total Devices : 3 Persistence : Superblock is persistent Update Time : Sun Jan 5 22:07:22 2014 State : clean Active Devices : 3 Working Devices : 3 Failed Devices : 0 Spare Devices : 0 Layout : left-symmetric Chunk Size : 8K Name : freedom:1 (local to host freedom) UUID : 44b72520:a78af6f7:dba13fb3:2203127d Events : 576884 Number Major Minor RaidDevice State 4 8 18 0 active sync /dev/sdb2 5 8 2 1 active sync /dev/sda2 3 8 34 2 active sync /dev/sdc2 I use the Raid5 md1 as physical volume for LVM: pvdisplay gives: --- Physical volume --- PV Name /dev/md1 VG Name MAIN PV Size 3.64 TiB / not usable 2.06 MiB Allocatable yes PE Size 4.00 MiB Total PE 953346 Free PE 6274 Allocated PE 947072 PV UUID WcuEx8-ehJL-xHdf-ElwF-b9s3-dlmM-KZlDNG I keep a reserve of 6274 4MiB blocks (=24GiB) in case one of the logical volumes runs out of space... I created the following logical volumes, named after their intended mountpoints: --- Logical volume --- LV Path /dev/MAIN/ROOT LV Name ROOT VG Name MAIN LV UUID kURJks-xHox-73B5-n02x-eZfS-agDD-n1dtAm LV Write Access read/write LV Creation host, time , LV Status available # open 1 LV Size 19.31 GiB Current LE 4944 Segments 2 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 252:0 and similar: --- Logical volume --- LV Path /dev/MAIN/SWAP: 1.8GB LV Path /dev/MAIN/HOME: 18.6GB LV Path /dev/MAIN/TMP: 9.3 GB LV Path /dev/MAIN/DATA1 2.6 TB LV Path /dev/MAIN/DATA2: 0.9 TB as filesystem I used btrfs during install form an ubuntu server, I don't recall which, might have been 11.10 or 12.04 (?) for all logical partitions except swap, of course, any other information I can supply? regards, Sulla - -- Cogito cogito ergo cogito sum. 
Ambrose Bierce -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.21 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlLJy+8ACgkQR6b2EdogPFupxgCfeDRdeO+PYoQNIjtySAYEmSEr PNoAoLPNcSqDHsDzM8pAuHlbva7j18MS =XBOA -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 21:17 ` Sulla @ 2014-01-05 22:36 ` Brendan Hide 2014-01-05 22:57 ` Roman Mamedov 2014-01-06 0:15 ` Chris Murphy 2014-01-05 23:48 ` Chris Murphy 1 sibling, 2 replies; 31+ messages in thread From: Brendan Hide @ 2014-01-05 22:36 UTC (permalink / raw) To: Sulla, Chris Murphy; +Cc: linux-btrfs On 2014/01/05 11:17 PM, Sulla wrote: > Certainly: I have 3 HDDs, all of which WD20EARS. Maybe/maybe-not off-topic: Poor hardware performance, though not necessarily the root cause, can be a major factor with these errors. WD Greens (Reds too, for that matter) have poor non-sequential performance. An educated guess I'd say there's a 15% chance this is a major factor to the problem and, perhaps, a 60% chance it is merely a "small contributor" to the problem. Greens are aimed at consumers wanting high capacity and a low pricepoint. The result is poor performance. See footnote * re my experience. My general recommendation (use cases vary of course) is to install a tiny SSD (60GB, for example) just for the OS. It is typically cheaper than the larger drives and will be *much* faster. WD Greens and Reds have good *sequential* throughput but comparatively abysmal random throughput even in comparison to regular non-SSD consumer drives. * I had 8x 1.5TB WD1500EARS drives in an mdRAID5 array. With it I had a single 250GB IDE disk for the OS. When the very old IDE disk inevitably died, I decided to use a spare 1.5TB drive for the OS. Performance was bad enough that I simply bought my first SSD the same week. -- __________ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 22:36 ` Brendan Hide @ 2014-01-05 22:57 ` Roman Mamedov 2014-01-07 10:22 ` Brendan Hide 2014-01-06 0:15 ` Chris Murphy 1 sibling, 1 reply; 31+ messages in thread From: Roman Mamedov @ 2014-01-05 22:57 UTC (permalink / raw) To: Brendan Hide; +Cc: Sulla, Chris Murphy, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 483 bytes --] On Mon, 06 Jan 2014 00:36:22 +0200 Brendan Hide <brendan@swiftspirit.co.za> wrote: > I had 8x 1.5TB WD1500EARS drives in an mdRAID5 array. With it I had a > single 250GB IDE disk for the OS. When the very old IDE disk inevitably > died, I decided to use a spare 1.5TB drive for the OS. Performance was > bad enough that I simply bought my first SSD the same week. Did you align your partitions to accommodate for the 4K sector of the EARS? -- With respect, Roman [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
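Alignment can be checked without repartitioning; a sketch, with example device and partition names (a start sector divisible by 8 means 4KiB alignment on 512-byte logical sectors):

cat /sys/block/sdb/sdb1/start          # start sector of the partition
parted /dev/sdb align-check optimal 1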
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 22:57 ` Roman Mamedov @ 2014-01-07 10:22 ` Brendan Hide 0 siblings, 0 replies; 31+ messages in thread From: Brendan Hide @ 2014-01-07 10:22 UTC (permalink / raw) To: Roman Mamedov; +Cc: Sulla, Chris Murphy, linux-btrfs On 2014/01/06 12:57 AM, Roman Mamedov wrote: > Did you align your partitions to accommodate for the 4K sector of the EARS? I had, yes. I had to do a lot of research to get the array working "optimally". I didn't need to repartition the spare so this carried over to its being used as an OS disk. I actually lost the "Green" array twice - and learned some valuable lessons: 1. I had an 8-port SCSI card which was dropping the disks due to the timeout issue mentioned by Chris. That caused the first array failure. Technically all the data was on the disks - but temporarily irrecoverable as disks were constantly being dropped. I made a mistake during ddrescue which simultaneously destroyed two disks' data, meaning that the recovery operation was finally for nought. The only consolation was that I had very little data at the time and none of it was irreplaceable. 2. After replacing the SCSI card with two 4-port SATA cards, a few months later I still had a double-failure (the second failure being during the RAID5 rebuild). This time it was only due to bad disks and a lack of scrubbing/early warning - clearly my own fault. Having learnt these lessons, I'm now a big fan of scrubbing and backups. ;) I'm also pushing for RAID15 wherever data is mission-critical. I simply don't "trust" the reliability of disks any more and I also better understand how, by having more and/or larger disks in a RAID5/6 array, the overall reliability of that array plummets. -- __________ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 22:36 ` Brendan Hide 2014-01-05 22:57 ` Roman Mamedov @ 2014-01-06 0:15 ` Chris Murphy 2014-01-06 0:19 ` Chris Murphy 1 sibling, 1 reply; 31+ messages in thread From: Chris Murphy @ 2014-01-06 0:15 UTC (permalink / raw) To: Btrfs BTRFS; +Cc: Sulla, Brendan Hide On Jan 5, 2014, at 3:36 PM, Brendan Hide <brendan@swiftspirit.co.za> wrote: > WD Greens (Reds too, for that matter) have poor non-sequential performance. An educated guess I'd say there's a 15% chance this is a major factor to the problem and, perhaps, a 60% chance it is merely a "small contributor" to the problem. Greens are aimed at consumers wanting high capacity and a low pricepoint. The result is poor performance. See footnote * re my experience. > > My general recommendation (use cases vary of course) is to install a tiny SSD (60GB, for example) just for the OS. It is typically cheaper than the larger drives and will be *much* faster. WD Greens and Reds have good *sequential* throughput but comparatively abysmal random throughput even in comparison to regular non-SSD consumer drives. Another thing with md raid and parallel file systems that's been an issue is cqf. On the XFS list cqf is approximately in the realm of persona non grata. It might be worth Sulla also setting elevator=deadline and seeing if simply different scheduling is a workaround, not that it's OK to get blocked with cqf. But it might be worth a shot as a more conservative approach than upgrading the kernel from 3.11.0. > I had 8x 1.5TB WD1500EARS drives in an mdRAID5 array. With it I had a single 250GB IDE disk for the OS. When the very old IDE disk inevitably died, I decided to use a spare 1.5TB drive for the OS. Performance was bad enough that I simply bought my first SSD the same week. Yeah for what it's worth, the current WD Green PDF says these drives are not to be used in RAID at all. Not 0, 1, 5 or 6. Even Caviar Black is proscribed from use in RAID environments using multibay chassis, as in, no warranty. It's desktop raid0 and raid1 only, and arguably the lack of configurable SCT ERC makes it not ideal even for raid1. Anyway, Sulla, how about putting up a smartctl -x for each drive? Curious if there are any bad sectors that have developed, and it may be worth filtering all /var/log/messages for the word "reset" and see if you find any of these drives ever being reset by the kernel and if so, post the full output of that. Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-06 0:15 ` Chris Murphy @ 2014-01-06 0:19 ` Chris Murphy 0 siblings, 0 replies; 31+ messages in thread From: Chris Murphy @ 2014-01-06 0:19 UTC (permalink / raw) To: Btrfs BTRFS On Jan 5, 2014, at 5:15 PM, Chris Murphy <lists@colorremedies.com> wrote: > > On Jan 5, 2014, at 3:36 PM, Brendan Hide <brendan@swiftspirit.co.za> wrote: > >> WD Greens (Reds too, for that matter) have poor non-sequential performance. An educated guess I'd say there's a 15% chance this is a major factor to the problem and, perhaps, a 60% chance it is merely a "small contributor" to the problem. Greens are aimed at consumers wanting high capacity and a low pricepoint. The result is poor performance. See footnote * re my experience. >> >> My general recommendation (use cases vary of course) is to install a tiny SSD (60GB, for example) just for the OS. It is typically cheaper than the larger drives and will be *much* faster. WD Greens and Reds have good *sequential* throughput but comparatively abysmal random throughput even in comparison to regular non-SSD consumer drives. > > > Another thing with md raid and parallel flie systems that's been an issue is cqf. Oops, CFQ! Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
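Spelled out, the checks suggested above might look like this; device names and log paths are examples, and the scheduler change is per-device and not persistent across reboots:

cat /sys/block/sda/queue/scheduler               # current elevator, e.g. noop deadline [cfq]
echo deadline > /sys/block/sda/queue/scheduler   # switch this disk to the deadline scheduler
smartctl -x /dev/sda                             # full SMART attribute/error dump
grep -i reset /var/log/messages                  # any kernel-initiated device/link resets?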
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 21:17 ` Sulla 2014-01-05 22:36 ` Brendan Hide @ 2014-01-05 23:48 ` Chris Murphy 2014-01-05 23:57 ` Chris Murphy 1 sibling, 1 reply; 31+ messages in thread From: Chris Murphy @ 2014-01-05 23:48 UTC (permalink / raw) To: Sulla; +Cc: linux-btrfs On Jan 5, 2014, at 2:17 PM, Sulla <Sulla@gmx.at> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Dear Chris! > > Certainly: I have 3 HDDs, all of which WD20EARS. These drives don't have a configurable SCT ERC, so you need to modify the SCSI block layer timeout: echo 120 >/sys/block/sdX/device/timeout You also need to schedule regular scrubs at the md level as well. echo check > /sys/block/mdX/md/sync_action cat /sys/block/mdX/mismatch_cnt More info about this is in man 4 md, and on the linux-raid list. > > 3904907520 blocks super 1.2 level 5, 8k chunk, algorithm 2 [3/3] [UUU] OK so 8KB chunk, 16KB full stripe, so that doesn't apply to what I was thinking might be the case. The workload is presumably small file sizes, like a mail server? > any other information I can supply? I'm not a developer, I don't know if this problem is known or maybe fixed in a newer kernel than 3.11.0 - which has been around for 5-6 months. I think the main suggestion is to try a newer kernel, granted with the configuration of md, lvm, and btrfs you have three layers that will likely have kernel changes. I'd make sure you have backups. While this layout is valid and should work, it's also probably less common and therefore less tested. Usually in case of blocking devs want to see sysrq+w issued. The setup is dmesg -n7, and enable sysrq functions. Then reproduce the block, and during the block issue w to the sysrq trigger, then capture dmesg contents and post the block and any other nearby btrfs messages. https://www.kernel.org/doc/Documentation/sysrq.txt Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
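A sketch of that capture procedure, run as root while the hang is in progress (echoing 1 enables all sysrq functions; a narrower bitmask also works):

dmesg -n 7                         # raise the console log level
echo 1 > /proc/sys/kernel/sysrq    # enable the sysrq trigger
echo w > /proc/sysrq-trigger       # dump blocked (uninterruptible) tasks
dmesg | tail -n 300                # collect the traces for the report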
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 23:48 ` Chris Murphy @ 2014-01-05 23:57 ` Chris Murphy 2014-01-06 0:25 ` Sulla 0 siblings, 1 reply; 31+ messages in thread From: Chris Murphy @ 2014-01-05 23:57 UTC (permalink / raw) To: Sulla; +Cc: linux-btrfs On Jan 5, 2014, at 4:48 PM, Chris Murphy <lists@colorremedies.com> wrote: > > On Jan 5, 2014, at 2:17 PM, Sulla <Sulla@gmx.at> wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> Dear Chris! >> >> Certainly: I have 3 HDDs, all of which WD20EARS. > > These drives don't have a configurable SCT ERC, so you need to modify the SCSI block layer timeout: > > echo 120 >/sys/block/sdX/device/timeout > > You also need to schedule regular scrubs at the md level as well. > > echo check > /sys/block/mdX/md/sync_action > cat /sys/block/mdX/mismatch_cnt > > More info about this is in man 4 md, and on the linux-raid list. > >> >> 3904907520 blocks super 1.2 level 5, 8k chunk, algorithm 2 [3/3] [UUU] > > OK so 8KB chunk, 16KB full stripe, so that doesn't apply to what I was thinking might be the case. The workload is presumably small file sizes, like a mail server? > > >> any other information I can supply? > > I'm not a developer, I don't know if this problem is known or maybe fixed in a newer kernel than 3.11.0 - which has been around for 5-6 months. I think the main suggestion is to try a newer kernel, granted with the configuration of md, lvm, and btrfs you have three layers that will likely have kernel changes. I'd make sure you have backups. While this layout is valid and should work, it's also probably less common and therefore less tested. > > Usually in case of blocking devs want to see sysrq+w issued. The setup is dmesg -n7, and enable sysrq functions. Then reproduce the block, and during the block issue w to the sysrq trigger, then capture dmesg contents and post the block and any other nearby btrfs messages. > > https://www.kernel.org/doc/Documentation/sysrq.txt Also, this thread is pretty cluttered with other conversations by now so I think you're best off starting a new thread with this information, maybe a title of "PROBLEM: btrfs on LVM on md raid, blocking > 120 seconds" Since it's almost inevitable you'd be asked to test with a newer kernel anyway, you might as well go to 3.13rc7 and see if you can reproduce, if reproducible, be specific with the problem report by following this template: https://www.kernel.org/pub/linux/docs/lkml/reporting-bugs.html Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 23:57 ` Chris Murphy @ 2014-01-06 0:25 ` Sulla 2014-01-06 0:49 ` Chris Murphy 0 siblings, 1 reply; 31+ messages in thread From: Sulla @ 2014-01-06 0:25 UTC (permalink / raw) To: Chris Murphy; +Cc: linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Thanks Chris! Thanks for your support. >> echo 120 >/sys/block/sdX/device/timeout The timeout is 30 for my HDDs. I'm well aware that the WD green HDDs are not the perfect ones for servers, but they were cheaper - and quieter - than the black ones for servers. I'll get the red ones next, though. ;-) >> You also need to schedule regular scrubs at the md level as well. Ubuntu does that once a month. >> cat /sys/block/mdX/mismatch_cnt This resides at /sys/devices/virtual/block/md1/md/mismatch_cnt on my machine. The count is zero. >> The workload is presumably small file sizes, like a mail server? Yes. It serves as a mailserver (maildir-format), but also as a samba file server with quite big files... btrfs ran fine for more than a year, so I'm not sure how reproducible the problem is... I don't really wish to install or compile custom kernels, to be honest. Not sure how problematic they might be during the next do-release-upgrade... Sulla - -- Russian Roulette is not the same without a gun and baby when it's love, if it's not rough, it isn't fun, fun. Lady GaGa, "Pokerface" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.21 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlLJ+A8ACgkQR6b2EdogPFuFwwCffSjZpDJvIj70Ag+CPbClCVuc viEAnjqnxcEdhKR2Gq84eGYEXfjfb23F =pmTS -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-06 0:25 ` Sulla @ 2014-01-06 0:49 ` Chris Murphy [not found] ` <52CA06FE.2030802@gmx.at> 0 siblings, 1 reply; 31+ messages in thread From: Chris Murphy @ 2014-01-06 0:49 UTC (permalink / raw) To: Sulla; +Cc: linux-btrfs On Jan 5, 2014, at 5:25 PM, Sulla <Sulla@gmx.at> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Thanks Chris! > > Thanks for your support. > >>> echo 120 >/sys/block/sdX/device/timeout > timeout is 30 for my HDDs. I don't think those drives support a configurable time out; the Green hasn't support it in years. Where are you getting this information? What do you get for 'smartctl -l scterc /dev/sdX'? > I don't really wish to install or compile cumstom kernels, to be honest. If the problem is reproducible, then that's the fastest way to find out if it's been fixed or not. In this case 3.11 is EOL already, no more updates. Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <52CA06FE.2030802@gmx.at>]
* Re: btrfs-transaction blocked for more than 120 seconds [not found] ` <52CA06FE.2030802@gmx.at> @ 2014-01-06 1:55 ` Chris Murphy 0 siblings, 0 replies; 31+ messages in thread From: Chris Murphy @ 2014-01-06 1:55 UTC (permalink / raw) To: Sulla; +Cc: Btrfs BTRFS On Jan 5, 2014, at 6:29 PM, Sulla <Sulla@gmx.at> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi Chris! > > # sudo smartctl -l scterc /dev/sda > tells me > SCT Error Recovery Control command not supported > > you're right. the /sys/block/sdX/device/timeout file probably is useless then. OK there's some confusion. /sys/block/sdX/device/timeout is the SCSI block layer timeout - linux itself has a timeout for each command issued to a block device, and will reset the link upon timeout being reached. So writing 120 to this will cause linux to wait for up to 120 seconds for the drive to respond. This is necessary because if there's a bad sector, the drive must report a read error in order for the md driver to reconstruct that data from parity. This is needed both for effective scrubs and for recovery on read error in normal operation. It is not a persistent setting so you'll want to create a startup script for it. Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
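One possible way to reapply it at every boot - a sketch only, using the three data disks from this thread and rc.local as an example mechanism (a udev rule would work as well):

# appended to /etc/rc.local, before any final "exit 0"
for d in sda sdb sdc; do
    echo 120 > /sys/block/$d/device/timeout
done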
[parent not found: <ADin1n00P0VAdqd01DioM9>]
* Re: btrfs-transaction blocked for more than 120 seconds [not found] <ADin1n00P0VAdqd01DioM9> @ 2014-01-05 20:44 ` Duncan 0 siblings, 0 replies; 31+ messages in thread From: Duncan @ 2014-01-05 20:44 UTC (permalink / raw) To: Jim Salter; +Cc: Marc MERLIN, linux-btrfs On Sun, 05 Jan 2014 08:42:46 -0500 Jim Salter <jim@jrs-s.net> wrote: > On Jan 5, 2014 1:39 AM, Marc MERLIN <marc@merlins.org> wrote: > > > > On Fri, Jan 03, 2014 at 09:34:10PM +0000, Duncan wrote: > > Yes, I got that. That why I ran btrfs defrag on the files after that > > Why are you trying to defrag an SSD? There's no seek penalty for > moving between fragmented blocks, so defrag isn't really desirable in > the first place. [I normally try to reply directly to list but don't believe I've seen this there yet, but got it direct-mailed so will reply-all in response.] There's no seek penalty so the overall problem is dramatically lessened as that's the significant part of it on spinning rust, correct, but... SSDs do remain IOPS-bound, and tens or hundreds of thousands of extents do exact an IOPS (as well as general extent bookkeeping) toll, too. That's why I ended up enabling autodefrag here when I was first setting up, even tho I'm on SSD. (Only after asking the list basically the same question, what good it is autodefrag on SSD, tho.) Luckily I don't happen to deal with any of the internal-write-in-huge-files scenarios, however, and I enabled autodefrag to cover the internal-write-in-small-file scenarios BEFORE I started putting any data on the filesystems at all, so I'm basically covered, here, without actually having to do chattr +C on anything. > That doesn't change the fact that the described lockup sounds like a > bug not a feature of course, but I think the answer to your personal > issue on that particular machine is "don't defrag a solid state > drive". I now believe the lockup must be due to processing the hundreds of thousands of extents on all those snapshots, too, in addition to doing it on the main volume. I don't actually make very extensive use of snapshots here anyway, so I didn't think about that aspect originally, but that's gotta be what's throwing the real spanner in the works, turning a possibly long but workable normal defrag (O(1)) into a lockup scenario (O(n)) where virtually no progress is made as currently coded. -- Duncan - No HTML messages please, as they are filtered as spam. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 31+ messages in thread
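For completeness, autodefrag as discussed here is just a mount option; a sketch with a placeholder fstab entry:

mount -o remount,autodefrag /home
# or persistently in /etc/fstab (the UUID is a placeholder):
# UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /home  btrfs  noatime,autodefrag  0  2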
end of thread, other threads:[~2014-01-08 9:45 UTC | newest] Thread overview: 31+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-12-31 11:46 btrfs-transaction blocked for more than 120 seconds Sulla 2014-01-01 12:37 ` Duncan 2014-01-01 20:08 ` Sulla 2014-01-02 8:38 ` Duncan 2014-01-03 1:24 ` Kai Krakow 2014-01-03 9:18 ` Duncan 2014-01-05 0:12 ` Sulla 2014-01-03 17:25 ` Marc MERLIN 2014-01-03 21:34 ` Duncan 2014-01-05 6:39 ` Marc MERLIN 2014-01-05 17:09 ` Chris Murphy 2014-01-05 17:54 ` Jim Salter 2014-01-05 19:57 ` Duncan 2014-01-05 20:44 ` Chris Murphy 2014-01-08 3:22 ` Marc MERLIN 2014-01-08 9:45 ` Duncan 2014-01-04 20:48 ` Roger Binns 2014-01-02 8:49 ` Jojo 2014-01-05 20:32 ` Chris Murphy 2014-01-05 21:17 ` Sulla 2014-01-05 22:36 ` Brendan Hide 2014-01-05 22:57 ` Roman Mamedov 2014-01-07 10:22 ` Brendan Hide 2014-01-06 0:15 ` Chris Murphy 2014-01-06 0:19 ` Chris Murphy 2014-01-05 23:48 ` Chris Murphy 2014-01-05 23:57 ` Chris Murphy 2014-01-06 0:25 ` Sulla 2014-01-06 0:49 ` Chris Murphy [not found] ` <52CA06FE.2030802@gmx.at> 2014-01-06 1:55 ` Chris Murphy [not found] <ADin1n00P0VAdqd01DioM9> 2014-01-05 20:44 ` Duncan