* btrfs-transaction blocked for more than 120 seconds @ 2013-12-31 11:46 Sulla 2014-01-01 12:37 ` Duncan ` (2 more replies) 0 siblings, 3 replies; 31+ messages in thread From: Sulla @ 2013-12-31 11:46 UTC (permalink / raw) To: linux-btrfs Dear all! On my Ubuntu Server 13.10 I use a RAID5 blockdevice consisting of 3 WD20EARS drives. On this I built a LVM and in this LVM I use quite normal partitions /, /home, SWAP (/boot resides on a RAID1.) and also a custom /data partition. Everything (except boot and swap) is on btrfs. sometimes my system hangs for quite some time (top is showing a high wait percentage), then runs on normally. I get kernel messages into /var/log/sylsog, see below. I am unable to make any sense of the kernel messages, there is no reference to the filesystem or drive affected (at least I can not find one). Question: What is happening here? * Is a HDD failing (smart looks good, however) * Is something wrong with my btrfs-filesystem? with which one? * How can I find the cause? thanks, Wolfgang Dec 31 12:27:49 freedom kernel: [ 4681.264112] INFO: task btrfs-transacti:529 blocked for more than 120 seconds. Dec 31 12:27:49 freedom kernel: [ 4681.264239] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 31 12:27:49 freedom kernel: [ 4681.264367] btrfs-transacti D ffff88013fc14580 0 529 2 0x00000000 Dec 31 12:27:49 freedom kernel: [ 4681.264377] ffff880138345e10 0000000000000046 ffff880138345fd8 0000000000014580 Dec 31 12:27:49 freedom kernel: [ 4681.264386] ffff880138345fd8 0000000000014580 ffff880135615dc0 ffff880132fb6a00 Dec 31 12:27:49 freedom kernel: [ 4681.264393] ffff880133f45800 ffff880138345e30 ffff880137ee2000 ffff880137ee2070 Dec 31 12:27:49 freedom kernel: [ 4681.264402] Call Trace: Dec 31 12:27:49 freedom kernel: [ 4681.264418] [<ffffffff816eaa79>] schedule+0x29/0x70 Dec 31 12:27:49 freedom kernel: [ 4681.264477] [<ffffffffa032a57d>] btrfs_commit_transaction+0x34d/0x980 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.264487] [<ffffffff81085580>] ? wake_up_atomic_t+0x30/0x30 Dec 31 12:27:49 freedom kernel: [ 4681.264517] [<ffffffffa0321be5>] transaction_kthread+0x1a5/0x240 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.264548] [<ffffffffa0321a40>] ? verify_parent_transid+0x150/0x150 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.264557] [<ffffffff810847b0>] kthread+0xc0/0xd0 Dec 31 12:27:49 freedom kernel: [ 4681.264565] [<ffffffff810846f0>] ? kthread_create_on_node+0x120/0x120 Dec 31 12:27:49 freedom kernel: [ 4681.264573] [<ffffffff816f566c>] ret_from_fork+0x7c/0xb0 Dec 31 12:27:49 freedom kernel: [ 4681.264580] [<ffffffff810846f0>] ? kthread_create_on_node+0x120/0x120 Dec 31 12:27:49 freedom kernel: [ 4681.264610] INFO: task kworker/u4:0:9975 blocked for more than 120 seconds. Dec 31 12:27:49 freedom kernel: [ 4681.264722] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Dec 31 12:27:49 freedom kernel: [ 4681.264847] kworker/u4:0 D ffff88013fd14580 0 9975 2 0x00000000 Dec 31 12:27:49 freedom kernel: [ 4681.264861] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-4) Dec 31 12:27:49 freedom kernel: [ 4681.264865] ffff8800a8739538 0000000000000046 ffff8800a8739fd8 0000000000014580 Dec 31 12:27:49 freedom kernel: [ 4681.264873] ffff8800a8739fd8 0000000000014580 ffff8801351e5dc0 ffff8801351e5dc0 Dec 31 12:27:49 freedom kernel: [ 4681.264880] ffff880134c5e6a8 ffff880134c5e6b0 ffffffff00000000 ffff880134c5e6b8 Dec 31 12:27:49 freedom kernel: [ 4681.264887] Call Trace: Dec 31 12:27:49 freedom kernel: [ 4681.264895] [<ffffffff816eaa79>] schedule+0x29/0x70 Dec 31 12:27:49 freedom kernel: [ 4681.264902] [<ffffffff816ec465>] rwsem_down_write_failed+0x105/0x1e0 Dec 31 12:27:49 freedom kernel: [ 4681.264911] [<ffffffff8136257d>] ? __rwsem_do_wake+0xdd/0x160 Dec 31 12:27:49 freedom kernel: [ 4681.264918] [<ffffffff81369763>] call_rwsem_down_write_failed+0x13/0x20 Dec 31 12:27:49 freedom kernel: [ 4681.264927] [<ffffffff816e9e7d>] ? down_write+0x2d/0x30 Dec 31 12:27:49 freedom kernel: [ 4681.264956] [<ffffffffa030fbe0>] cache_block_group+0x290/0x3b0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.264963] [<ffffffff81085580>] ? wake_up_atomic_t+0x30/0x30 Dec 31 12:27:49 freedom kernel: [ 4681.264991] [<ffffffffa0317d48>] find_free_extent+0xa38/0xac0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265022] [<ffffffffa0317ef2>] btrfs_reserve_extent+0xa2/0x1c0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265056] [<ffffffffa033103d>] __cow_file_range+0x15d/0x4a0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265090] [<ffffffffa0331efa>] cow_file_range+0x8a/0xd0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265122] [<ffffffffa0332290>] run_delalloc_range+0x350/0x390 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265158] [<ffffffffa0346bf1>] ? find_lock_delalloc_range.constprop.42+0x1d1/0x1f0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265194] [<ffffffffa0348764>] __extent_writepage+0x304/0x750 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265202] [<ffffffff8109a1d5>] ? set_next_entity+0x95/0xb0 Dec 31 12:27:49 freedom kernel: [ 4681.265212] [<ffffffff810115c6>] ? __switch_to+0x126/0x4b0 Dec 31 12:27:49 freedom kernel: [ 4681.265221] [<ffffffff8104dee9>] ? default_spin_lock_flags+0x9/0x10 Dec 31 12:27:49 freedom kernel: [ 4681.265229] [<ffffffff8113f6c1>] ? find_get_pages_tag+0xd1/0x180 Dec 31 12:27:49 freedom kernel: [ 4681.265266] [<ffffffffa0348e32>] extent_write_cache_pages.isra.31.constprop.46+0x282/0x3e0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265303] [<ffffffffa034928d>] extent_writepages+0x4d/0x70 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265336] [<ffffffffa032ea90>] ? 
btrfs_real_readdir+0x5c0/0x5c0 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265369] [<ffffffffa032caa8>] btrfs_writepages+0x28/0x30 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265378] [<ffffffff8114a4ae>] do_writepages+0x1e/0x40 Dec 31 12:27:49 freedom kernel: [ 4681.265387] [<ffffffff811ce7d0>] __writeback_single_inode+0x40/0x220 Dec 31 12:27:49 freedom kernel: [ 4681.265395] [<ffffffff811ceb4b>] writeback_sb_inodes+0x19b/0x3b0 Dec 31 12:27:49 freedom kernel: [ 4681.265403] [<ffffffff811cedff>] __writeback_inodes_wb+0x9f/0xd0 Dec 31 12:27:49 freedom kernel: [ 4681.265411] [<ffffffff811cf623>] wb_writeback+0x243/0x2c0 Dec 31 12:27:49 freedom kernel: [ 4681.265418] [<ffffffff811d1489>] bdi_writeback_workfn+0x1b9/0x3d0 Dec 31 12:27:49 freedom kernel: [ 4681.265426] [<ffffffff8107d05c>] process_one_work+0x17c/0x430 Dec 31 12:27:49 freedom kernel: [ 4681.265432] [<ffffffff8107dcac>] worker_thread+0x11c/0x3c0 Dec 31 12:27:49 freedom kernel: [ 4681.265439] [<ffffffff8107db90>] ? manage_workers.isra.24+0x2a0/0x2a0 Dec 31 12:27:49 freedom kernel: [ 4681.265447] [<ffffffff810847b0>] kthread+0xc0/0xd0 Dec 31 12:27:49 freedom kernel: [ 4681.265454] [<ffffffff810846f0>] ? kthread_create_on_node+0x120/0x120 Dec 31 12:27:49 freedom kernel: [ 4681.265461] [<ffffffff816f566c>] ret_from_fork+0x7c/0xb0 Dec 31 12:27:49 freedom kernel: [ 4681.265469] [<ffffffff810846f0>] ? kthread_create_on_node+0x120/0x120 Dec 31 12:27:49 freedom kernel: [ 4681.265476] INFO: task smbd:10275 blocked for more than 120 seconds. Dec 31 12:27:49 freedom kernel: [ 4681.265579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 31 12:27:49 freedom kernel: [ 4681.265704] smbd D ffff88013fc14580 0 10275 723 0x00000004 Dec 31 12:27:49 freedom kernel: [ 4681.265711] ffff8800a5abbbc0 0000000000000046 ffff8800a5abbfd8 0000000000014580 Dec 31 12:27:49 freedom kernel: [ 4681.265718] ffff8800a5abbfd8 0000000000014580 ffff880133d5aee0 ffff880137ee2000 Dec 31 12:27:49 freedom kernel: [ 4681.265726] ffff880133db79e8 ffff880133db79e8 0000000000000001 ffff880132d2dc80 Dec 31 12:27:49 freedom kernel: [ 4681.265733] Call Trace: Dec 31 12:27:49 freedom kernel: [ 4681.265739] [<ffffffff816eaa79>] schedule+0x29/0x70 Dec 31 12:27:49 freedom kernel: [ 4681.265772] [<ffffffffa03296df>] wait_current_trans.isra.18+0xbf/0x120 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265778] [<ffffffff81085580>] ? wake_up_atomic_t+0x30/0x30 Dec 31 12:27:49 freedom kernel: [ 4681.265810] [<ffffffffa032af06>] start_transaction+0x356/0x520 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265843] [<ffffffffa032b0eb>] btrfs_start_transaction+0x1b/0x20 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265876] [<ffffffffa0334887>] btrfs_cont_expand+0x1c7/0x460 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265911] [<ffffffffa033cc26>] btrfs_file_aio_write+0x346/0x520 [btrfs] Dec 31 12:27:49 freedom kernel: [ 4681.265919] [<ffffffff811b9810>] ? poll_select_copy_remaining+0x130/0x130 Dec 31 12:27:49 freedom kernel: [ 4681.265928] [<ffffffff811a6640>] do_sync_write+0x80/0xb0 Dec 31 12:27:49 freedom kernel: [ 4681.265936] [<ffffffff811a6d7d>] vfs_write+0xbd/0x1e0 Dec 31 12:27:49 freedom kernel: [ 4681.265942] [<ffffffff811a7932>] SyS_pwrite64+0x72/0xb0 Dec 31 12:27:49 freedom kernel: [ 4681.265949] [<ffffffff816f571d>] system_call_fastpath+0x1a/0x1f -- For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled. Richard P. Feynman ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2013-12-31 11:46 btrfs-transaction blocked for more than 120 seconds Sulla @ 2014-01-01 12:37 ` Duncan 2014-01-01 20:08 ` Sulla 2014-01-03 17:25 ` Marc MERLIN 2014-01-02 8:49 ` Jojo 2014-01-05 20:32 ` Chris Murphy 2 siblings, 2 replies; 31+ messages in thread From: Duncan @ 2014-01-01 12:37 UTC (permalink / raw) To: linux-btrfs Sulla posted on Tue, 31 Dec 2013 12:46:04 +0100 as excerpted: > On my Ubuntu Server 13.10 I use a RAID5 blockdevice consisting of 3 > WD20EARS drives. On this I built a LVM and in this LVM I use quite > normal partitions /, /home, SWAP (/boot resides on a RAID1.) and also a > custom /data partition. Everything (except boot and swap) is on btrfs. > > sometimes my system hangs for quite some time (top is showing a high > wait percentage), then runs on normally. I get kernel messages into > /var/log/sylsog, see below. I am unable to make any sense of the kernel > messages, there is no reference to the filesystem or drive affected (at > least I can not find one). > > Question: What is happening here? > * Is a HDD failing (smart looks good, however) > * Is something wrong with my btrfs-filesystem? with which one? > * How can I find the cause? > > Dec 31 12:27:49 freedom kernel: [ 4681.264112] INFO: task > btrfs-transacti:529 blocked for more than 120 seconds. > > Dec 31 12:27:49 freedom kernel: [ 4681.264239] "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. First to put your mind at rest, no, it's unlikely that your hardware is failing; and it's not an indication of a filesystem bug either. Rather, it's a characteristic of btrfs behavior in certain corner-cases, and yes, you /can/ do something about it with some relatively minor btrfs configuration adjustments... altho on spinning rust at multi-terabyte sizes, those otherwise minor adjustments might take some time (hours)! There seem to be two primary btrfs triggers for these "blocked for more than N seconds" messages. One is COW-related (COW=copy-on-write, the basis of BTRFS) fragmentation, the other is many-hardlink related. The only scenario-trigger I've seen for the many-hardlink case, however, has been when people are using a hardlink-based backup scheme, which you don't mention, so I'd guess it's the COW-related trigger for you. A bit of background on COW: (Assuming I get this correct, I don't claim to be an expert on it.) In general, copy-on-write is a data handling technique where any modification to the original data is made out-of-line from the original, then the extent map (be it memory extent map for in- memory COW applications, or on-device data extent map for filesystems, or...) is modified, replacing the original inline extent index with that of the new modification. The advantage of COW for filesystems, over in-place-modification, is that should the system crash at just the right (wrong?) moment, before the full record has been written, an in-place-modification may corrupt the entire file (or worse yet, the metadata for a whole bunch of files, effectively killing them all!), while with COW the update is atomic -- at least in theory, it has either been fully written and you get the new version, or the remapping hasn't yet occurred and you get the old version -- no corrupted case which is if you're lucky, part new and part old, and if you're unlucky, has something entirely unrelated and very possibly binary in the middle of what might have previously been for example a plain-text config file. 
However, COW-based filesystems work best when most updates either replace the entire file, or append to the end of the file, luckily the most common case. COW's primary down side in filesystem implementations is that for use-cases where only a small piece of the file somewhere in the middle is modified and saved, then another small piece somewhere else, and another and another... repeated tens of thousands of times, each small modification and save gets mapped to a new location and the file fragments into possibly tens of thousands of extents, each with just the content of the individual modification made to the file at that point. On a spinning rust hard drive, the time necessary to seek to each of those possibly tens of thousands of extents in ordered to read the file, as compared to the cost of simply reading the same data were it stored sequentially in a straight line, is... non-trivial to say the least! It's exactly that fragmentation and the delays caused by all the seeks to read an affected file, that result in the stalls and system hangs you are seeing. OK, so now that we know what causes it, what files are affected, and what can you do to help the situation? Fortunately, COW-fragmentation isn't a situation that dramatically impacts operations on most files, as obviously if it was, it'd be unsuited for filesystem use at all. But it does have a dramatic effect in some cases -- the ones I've seen people report on this list are listed below: 1) Installation. Apparently the way some distribution installation scripts work results in even a brand new installation being highly fragmented. =:^( If in addition they don't add autodefrag to the mount options used when mounting the filesystem for the original installation, the problem is made even worse, since the autodefrag mount option is designed to help catch some of this sort of issue, and schedule the affected files for auto-defrag by a separate thread. The fix here is to run a manual btrfs filesystem defrag -r on the filesystem immediately after installation completes, and to add autodefrag to the mount options used for the filesystem from then on, to keep updates and routine operation from triggering new fragmentation. (It's possible to do the same with just the autodefrag option over time, but depending on how fragmented the filesystem was to begin with, some people report that this makes the problem worse for awhile, and the system unusable, until the autodefrag mechanism has caught up to the existing problem. Autodefrag works best to /keep/ an already in good shape filesystem in good shape; it's not so good at getting one that's highly fragmented back into good shape. That's what btrfs filesystem defrag -r is for. =:^) 2) Pre-allocated files. Systemd's journal file is probably the most common single case here, but it's not the only case, and AFAIK ubuntu doesn't use systemd anyway, so that's highly unlikely to be your problem. A less widespread case that's never-the-less common enough is bittorrent clients that preallocate files at their final size before the download, then write into them as the torrent chunks are downloaded. BAD situation for COW filesystems including btrfs, since now the entire file is one relocated chunk after another. If the file's a multi-gig DVD image or the like, as mentioned above, that can be tens of thousands of extents! This situation is *KNOWN* to cause N-second block reports and system stalls of the nature you're reporting, but of course only triggers for those running such bittorrent clients. 
One potential fix if your bittorrent client has the option, is to turn preallocation off. However, it's there for a couple reasons -- on normal non-COW filesystems it has exactly the opposite effect, ensuring a file stays sequentially mapped, AND, by preallocating the file, it's easier to ensure that there's space available for the entire thing. (Altho if you're using btrfs' compression option and it compresses the allocation, more space will still be used as the actual data downloads and the file is filled in, as that won't compress as well.) Additionally, there's other cases of pre-allocated files. For these and for bittorrent if you don't want to or can't turn pre-allocation off, there's the NOCOW file attribute. See below for that. 3) Virtual machine images. Virtual machine images tend to be rather large, often several gig, and to trigger internal-image writes every time the configuration changes or something is saved to the virtual disk in the image. Again, a big worst- case for COW-based filesystems such as btrfs, as those internal image- writes are precisely the sort of behavior that triggers image file fragmentation. For these, the NOCOW option is the best. Again, see below. 4) Database files. Same COW-based-filesystem-worst-case behavior pattern here. The autodefrag mount option was actually designed to help deal with this case, however, for small databases (typically the small sqlite databases used in firefox and thunderbird, for instance). It'll detect the fragmentation and rewrite the entire file as a single extent. Of course that works well for reasonably small databases, but won't work so well for multi-gig databases, or multi-gig VMs or torrent images for that matter, since the write magnification would be very large (rewriting a whole multi-gig image for every change of a few bytes). Which is where the NOCOW file attribute comes in... Solutions beyond btrfs filesystem defrag -r, and the autodefrag mount option: The nodatacow mount option. At the filesystem level, btrfs has the nodatacow mount option. For use- cases where there's several files of the same problematic type, say a bunch of VM images, or a bunch of torrent files downloading to the same target subdir or subdirectory tree, or a bunch of database files all in the same directory subtree, creating a dedicated filesystem which can be mounted with the nodatacow option can make sense. At some point in the future, btrfs is supposed to support different mount options per subvolume, and at that point, a simple subvolume mounted with nodatacow but still located on a main system volume mounted without it, might make sense, but at this point, differing subvolume mount options aren't available, so to use this solution, you have to create a fully separate btrfs filesystem to use the nodatacow option on. But nodatacow also disables some of the other features of btrfs, such as checksumming and compression. While those don't work so well with COW- averse use-cases anyway (for some of the same reasons COW doesn't work on them), once you get rid of them on a global filesystem level, you're almost back to the level of a normal filesystem, and might as well use one. So in that case, rather than a dedicated btrfs mounted with nodatacow, I'd suggest a dedicated ext4 or reiserfs or xfs or whatever filesystem instead, particularly since btrfs is still under development, while these other filesystems have been mature and stable for years. The NOCOW file attribute. Simple command form: chattr +C /path/to/file/or/directory *CAVEAT! 
This attribute should be set on new/empty files before they have any content. The easiest way to do that is to set the attribute on the parent directory, after which all new files created in it will inherit the attribute. (Alternatively, touch the file to create it empty, do the chattr, then append data into it using cat source >> target or the like.) Meanwhile, if there's a point at which the file exists in its more or less permanent form and won't be written into any longer (a torrented file is fully downloaded, or a VM image is backed up), sequentially copying it elsewhere (possibly using cp --reflink=never if on the same filesystem, to avoid a reflink copy pointing at the same fragmented extents!), then deleting the original fragmented version, should effectively defragment the file too. And since it's not being written into any more at that point, it should stay defragmented. Or just btrfs filesystem defrag the individual file... Finally, there's some more work going into autodefrag now, to hopefully increase its performance, and make it work more efficiently on a bit larger files as well. The goal is to eliminate the problems with systemd's journal, among other things, now that it's known to be a common problem, given systemd's widespread use and the fact that both systemd and btrfs aim to be the accepted general Linux default within a few years. Summary: Figure out what applications on your system have the "internal write" pattern that causes so much trouble to COW-based filesystems, and turn off that behavior either in that app (as possible with torrent clients), or in the filesystem, using either a dedicated filesystem mount, or more likely, by setting the NOCOW attribute (chattr +C) on the individual target files or directories. Figuring out which files and applications are affected is left to the reader, but the information above should provide a good starting point. Then btrfs filesystem defrag -r the filesystem and add autodefrag to its mount options to help keep it free of at least smaller-file fragmentation. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 31+ messages in thread
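A minimal sketch of the NOCOW workflow described above; the directory and file names are made-up examples, not taken from the thread:

  # mark the directory so new files created in it inherit NOCOW
  chattr +C /mnt/data/vm-images
  # verify the 'C' attribute is now set on the directory itself
  lsattr -d /mnt/data/vm-images
  # an existing image only picks up NOCOW if it is recreated as a new file
  cat /mnt/data/vm-images/disk.img > /mnt/data/vm-images/disk.img.nocow
  mv /mnt/data/vm-images/disk.img.nocow /mnt/data/vm-images/disk.img

After that, keeping autodefrag in the filesystem's mount options helps with the ordinary small-file rewrites that remain COW.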
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-01 12:37 ` Duncan @ 2014-01-01 20:08 ` Sulla 2014-01-02 8:38 ` Duncan 2014-01-05 0:12 ` Sulla 2014-01-03 17:25 ` Marc MERLIN 1 sibling, 2 replies; 31+ messages in thread
From: Sulla @ 2014-01-01 20:08 UTC (permalink / raw) To: linux-btrfs

Dear Duncan!

Thanks very much for your exhaustive answer.

Hm, I also thought of fragmentation, although I don't think this is really very likely, as my server doesn't serve things that are likely to cause fragmentation. It is a mailserver (but only maildir-format), a fileserver for windows clients (huge files that hardly ever get rewritten), a server for TV-records (but it only copies recordings from a sat receiver after they have been recorded, so no heavy rewriting here), a tiny webserver and all kinds of such things, but not storage for huge databases, virtual machines or a target for filesharing clients. It does, however, serve as a target for a hardlink-based backup program run on windows PCs, but only once per month or so, so that shouldn't be too much.

The problem must lie somewhere on the root partition itself, because the system is already slow before mounting the fat data partitions.

I'll give the defragmentation a try. But # sudo btrfs filesystem defrag -r doesn't work, because "-r" is an unknown option (I'm running Btrfs v0.20-rc1 on an Ubuntu 3.11.0-14-generic kernel). I'm doing a # sudo btrfs filesystem defrag / & on the root directory at the moment. Question: will this defragment everything or just the root fs, and will I need to run a defragment on /home as well, as /home is a separate btrfs filesystem?

I've also added the autodefrag mount option and will do a "mount -a" after the defragmentation.

I've considered a # sudo btrfs balance start as well; would this do any good? How close should I let the data fill the partition? The large data partitions are 85% used, root is 70% used. Is this safe or should I add space?

Thanx, Wolfgang

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-01 20:08 ` Sulla @ 2014-01-02 8:38 ` Duncan 2014-01-03 1:24 ` Kai Krakow 2014-01-05 0:12 ` Sulla 1 sibling, 1 reply; 31+ messages in thread
From: Duncan @ 2014-01-02 8:38 UTC (permalink / raw) To: linux-btrfs

Sulla posted on Wed, 01 Jan 2014 20:08:21 +0000 as excerpted:

> Dear Duncan!
>
> Thanks very much for your exhaustive answer.
>
> Hm, I also thought of fragmentation, although I don't think this is really very likely, as my server doesn't serve things that are likely to cause fragmentation. It is a mailserver (but only maildir-format), a fileserver for windows clients (huge files that hardly ever get rewritten), a server for TV-records (but it only copies recordings from a sat receiver after they have been recorded, so no heavy rewriting here), a tiny webserver and all kinds of such things, but not storage for huge databases, virtual machines or a target for filesharing clients. It does, however, serve as a target for a hardlink-based backup program run on windows PCs, but only once per month or so, so that shouldn't be too much.

One thing I didn't mention originally was how to check for fragmentation. filefrag is part of e2fsprogs, and does the trick -- with one caveat. filefrag currently doesn't know about btrfs compression, and interprets each 128 KiB block as a separate extent. So if you have btrfs compression turned on and check a (larger than 128 KiB) file that btrfs has compressed, filefrag will falsely report fragmentation. If in doubt, you can always try defragging that individual file and see if filefrag reports fewer extents or not. If it has fewer extents you know it was fragmented, if not...

With that you should actually be able to check some of those big files that you don't think are fragmented, to see.

> The problem must lie somewhere on the root partition itself, because the system is already slow before mounting the fat data partitions.
>
> I'll give the defragmentation a try. But # sudo btrfs filesystem defrag -r doesn't work, because "-r" is an unknown option (I'm running Btrfs v0.20-rc1 on an Ubuntu 3.11.0-14-generic kernel).

The -r option was added quite recently. As the wiki (at https://btrfs.wiki.kernel.org ) urges, btrfs is a development filesystem and people choosing to test it should really try to keep current, both because you're unnecessarily putting the data you're testing on btrfs at risk when running old versions with bugs patched in newer versions (that part's mostly for the kernel, tho), and because as a tester, when things /do/ go wrong and you report it, the reports are far more useful if you're running a current version.

Kernel 3.11.0 is old. 3.12 has been out for well over a month now. And the btrfs-progs userspace recently switched to kernel-synced versioning as well, with version 3.12 the latest version, which also happens to be the first kernel-version-synced version. That's assuming you don't choose to run the latest git version of the userspace, and the Linus kernel RCs, which many btrfs testers do. (Tho last I updated btrfs-progs, about a week ago, the last git commit was still the version bump to 3.12, but I'm running a git kernel at version 3.13.0-rc5 plus 69 commits.)

So you are encouraged to update. =:^) However, if you don't choose to upgrade ... (see next)

> I'm doing a # sudo btrfs filesystem defrag / & on the root directory at the moment.

... Before the -r option was added, btrfs filesystem defrag would only defrag the specific file it was pointed at.
If pointed at a directory, it would defrag the directory metadata, but not files or subdirs below it. The way to defrag the entire system then, involved a rather more complicated command using find to output a list of everything on the system, and run defrag individually on each item listed. It's on the wiki. Let's see if I can find it... (yes, but note the wrapped link): https://btrfs.wiki.kernel.org/index.php/ UseCases#How_do_I_defragment_many_files.3F sudo find [subvol [subvol]…] -xdev -type f -exec btrfs filesystem defragment -- {} + As the wiki warns, that doesn't recurse into subvolumes (the -xdev keeps it from going onto non-btrfs filesystems but also keeps it from going into subvolumes), but you can list them as paths where noted. > Question: will this defragment everything or just the root-fs and will I > need to run a defragment on /home as well, as /home is a separate btrfs > filesystem? Well, as noted your command doesn't really defragment that much. But the find command should defragment everything on the named subvolumes. But of course this is where that bit I mentioned in the original post about possibly taking hours with multiple terabytes on spinning rust comes in too. It could take awhile, and when it gets to really fragmented files, it'll probably trigger the same sort of stalls that has us discussing the whole thing in the first place, so the system may not be exactly usable. =:^( > I've also added autodefrag mountoptions and will do a "mount -a" after > the defragmentation. > > I've considered a # sudo btrfs balance start as well, would this do any > good? How close should I let the data fill the partition? The large data > partitions are 85% used, root is 70% used. Is this safe or should I add > space? !! Be careful!! You mentioned running 3.11. Both early versions of 3.11 and 3.12 had a bug where if you tried to run a balance and a defrag at the same time, bad things could happen (lockups or even corrupted data)! Running just one at a time and letting it finish, then the other, should be fine. And later stable kernels of both 3.11 and 3.12 have that bug fixed (as does 3.13). But 3.11.0 is almost certainly still bugged in that regard, unless ubuntu backported the fix and didn't bump the kernel version. But because a full balance rewrites everything anyway, it'll effectively defrag too. So if you're going to do a balance, you can skip the defrag. =:^) And since it's likely to take hours at the terabyte scale on spinning rust, that's just as well. As for the space question, that's a whole different subject with its own convolutions. =:^\ Very briefly, the rule of thumb I use is that for partitions of sufficient size (several GiB low end), you always want btrfs filesystem show to have at LEAST enough unallocated space left to allocate one each data and metadata chunk. Data chunks default to 1 GiB, while metadata chunks default to 256 MiB, but because single-device metadata defaults to DUP mode, metadata chunks are normally allocated in pairs and that doubles to half a GiB. So you need at LEAST 1.5 GiB unallocated, in ordered to be sure balance can work, since it allocates a new chunk and writes into it from the old chunks, until it can free up the old chunks. Assuming you have large enough filesystems, I'd try to keep twice that, 3 GiB unallocated according to btrfs filesystem show, and would definitely recommend doing a rebalance any time it starts getting close to that. 
If you tend to have many multi-gig files, you'll probably want to keep enough unallocated space (rounded up to a whole gig, plus the 3 gig minimum I suggested above) around to handle at least one of those as well, just so you know you always have space available to move at least one of those if necessary, without using up your 3 gig safety margin. Beyond that, take a look at your btrfs filesystem df output. I already mentioned that data chunk size is 1 GiB, metadata 256 MiB (doubled to 512 MiB for default dup mode for a single device btrfs). So if data says something like total=248.00GiB, used=123.24GiB (example picked out of thin air), you know you're running a whole bunch of half empty chunks, and a balance should trim that down dramatically, to probably total=124.00GiB altho it's possible it might be 125.00GiB or something, but in any case it should be FAR closer to used than the twice-used figure in my example above. Any time total is more than a GiB above used, a balance is likely to be able to reduce it and return the extra to the unallocated pool. Of course the same applies to metadata, keeping in mind its default-dup, so you're effectively allocating in 512 MiB chunks for it. But any time total is more than 512 MiB above used, a balance will probably reduce it, returning the extra space to the unallocated pool. Of course single vs. dup on single devices, and multiple devices with all the different btrfs raid modes, throw various curves into the numbers given above. While it's reasonably straightforward to figure an individual case, explaining all the permutations gets quite complex. And while it's not supported yet, eventually btrfs is supposed to support different raid levels, etc, for different subvolumes, which will throw even MORE complexity into the thing! And obviously for small single- digit GiB partitions the rules must be adjusted, even more so for mixed- blockgroup, which is the default below 1 GiB but makes some sense in the single-digit GiB size range as well. But the reasonably large single- device default isn't /too/ bad, even if it takes a bit to explain, as I did here. Meanwhile, especially on spinning rust at terabyte sizes, those balances are going to take awhile, so you probably don't want to run them daily. And on SSDs, balances (and defrags and anything else for that matter) should go MUCH faster, but SSDs are limited-write-cycle, and any time you balance you're rewriting all that data and metadata, thus using up limited write cycles on all those gigs worth of blocks in one fell swoop! So either way, doing balances without any clear return probably isn't a good idea. But when the allocated space gets within a few gigs of total as shown by btrfs filesystem show, or when total gets multiple gigs above used as shown by btrfs filesystem df, it's time to consider a balance. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 31+ messages in thread
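For reference, the numbers referred to above come from these two commands (the mount point is only an example):

  btrfs filesystem show
  btrfs filesystem df /mnt/data

If "show" reports the device size nearly equal to its used (allocated) figure, or "df" reports total far above used for data or metadata, that is the point at which a balance is worth considering.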
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-02 8:38 ` Duncan @ 2014-01-03 1:24 ` Kai Krakow 2014-01-03 9:18 ` Duncan 0 siblings, 1 reply; 31+ messages in thread
From: Kai Krakow @ 2014-01-03 1:24 UTC (permalink / raw) To: linux-btrfs

Duncan <1i5t5.duncan@cox.net> wrote:

> But because a full balance rewrites everything anyway, it'll effectively defrag too.

Is that really true? I thought it just rewrites each distinct extent and shuffles chunks around... This would mean it does not merge extents together.

Regards, Kai

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-03 1:24 ` Kai Krakow @ 2014-01-03 9:18 ` Duncan 0 siblings, 0 replies; 31+ messages in thread From: Duncan @ 2014-01-03 9:18 UTC (permalink / raw) To: linux-btrfs Kai Krakow posted on Fri, 03 Jan 2014 02:24:01 +0100 as excerpted: > Duncan <1i5t5.duncan@cox.net> schrieb: > >> But because a full balance rewrites everything anyway, it'll >> effectively defrag too. > > Is that really true? I thought it just rewrites each distinct extent and > shuffels chunks around... This would mean it does not merge extents > together. While I'm not a coder and they're free to correct me if I'm wrong... With a full balance (there are now options allowing one to do only data, or only metadata, or for that matter only system, and do other filtering, say to rebalance only chunks less than 10% used or only those not yet converted to a new raid level, if desired, but we're talking a full balance here), all chunks are rewritten, merging data (or metadata) into fewer chunks if possible, eliminating the then unused chunks and returning the space they took to the unallocated pool. Given that everything is being rewritten anyway, a process that can take hours or even days on multi-terabyte spinning rust filesystems, /not/ doing a file defrag as part of the process would be stupid. So doing a separate defrag and balance isn't necessary. And while we're at it, doing a separate scrub and balance isn't necessary, for the same reason. (If one copy of the data is invalid and there's another, it'll be used for the rewrite and redup if necessary during the balance and the invalid copy will simply be erased. If there's no valid copy, then there will be balance errors and I believe the chunks containing the bad data are simply not rewritten at all, tho the valid data from them might be rewritten, leaving only the bad data (I'm not sure which, on that), thus allowing the admin to try other tools to clean up or recover from the damage as necessary.) That's one reason why the balance operation can take so much longer than a straight sequential read/write of the data might indicate, because it's doing all that extra work behind the scenes as well. Tho I'm not sure that it defrags across chunks, particularly if a file's fragments reach across enough chunks that they'd not have been processed by the time a written chunk is full and the balance progresses to the next one. However, given that data chunks are 1 GiB in size, that should still cut down a multi-thousand-extent file to perhaps a few dozen extents, one each per rewritten chunk. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 31+ messages in thread
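As an illustration of the filtered balance mentioned above, on a kernel and btrfs-progs new enough to support balance filters it would look roughly like this (the 10% threshold and mount point are only examples):

  btrfs balance start -dusage=10 /mnt/data

That rewrites only data chunks at most 10% full (-musage does the same for metadata chunks), so it reclaims slack allocation much faster than a full balance, but it also does far less of the incidental defragmentation discussed above.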
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-01 20:08 ` Sulla 2014-01-02 8:38 ` Duncan @ 2014-01-05 0:12 ` Sulla 1 sibling, 0 replies; 31+ messages in thread
From: Sulla @ 2014-01-05 0:12 UTC (permalink / raw) To: linux-btrfs

Oh gosh, I don't know what went wrong with my btrfs root filesystem, and I probably never will:

The "sudo btrfs balance start /" was running fine for about 4 or 5 hours, at a system load of ~3, while "btrfs balance status /" told me the balancing was on its way and had completed 19 out of 23 extents. At that moment the system load started to increase and increase and increase, and when it reached 147 (!!) (while top was showing me NOTHING was going on) I reset the computer. TTY1 showed some kernel panics and btrfs bug messages, but those messages were lost because they never made it to disk.

Fortunately my RAID5 stayed in sync and everything was fine. The system also booted, but with the same 120+ secs hangs as before. The system was unusable, as e.g. all IMAP logins timed out. So

* I booted into a live CD
* mounted a backup disk
* cp-ed all files of the root fs to the backup disk (it could read them flawlessly)
* reformatted the root partition as ext4 (yes, I feel sad about it)
* cp-ed all root files from the backup disk to the ext4 root filesystem
* removed the subvol=@ boot argument from /boot/grub/grub.cfg
* and rebooted my server.

How I love linux! Wouldn't be possible with M$!!

Now it's running fine again, the system is responsive as it should be. No clue 'bout what went wrong, though.

I still have /home and the huge data partitions on btrfs and plan to leave it so. While it would not be difficult to put /home on ext4, it would be a major effort to cp the ~3TB of data off and back onto the disks...

Thanx for your support, Sulla

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-01 12:37 ` Duncan 2014-01-01 20:08 ` Sulla @ 2014-01-03 17:25 ` Marc MERLIN 2014-01-03 21:34 ` Duncan 2014-01-04 20:48 ` Roger Binns 1 sibling, 2 replies; 31+ messages in thread From: Marc MERLIN @ 2014-01-03 17:25 UTC (permalink / raw) To: Duncan; +Cc: linux-btrfs First, a big thank you for taking the time to post this very informative message. On Wed, Jan 01, 2014 at 12:37:42PM +0000, Duncan wrote: > Apparently the way some distribution installation scripts work results in > even a brand new installation being highly fragmented. =:^( If in > addition they don't add autodefrag to the mount options used when > mounting the filesystem for the original installation, the problem is > made even worse, since the autodefrag mount option is designed to help > catch some of this sort of issue, and schedule the affected files for > auto-defrag by a separate thread. Assuming you can stomach a bit of occasional performance loss due to autodefrag, is there a reason not to always have this on btrfs filesystems in newer kernels? (let's say 3.12+)? Is there even a reason for this not to become a default mount option in newer kernels? > The NOCOW file attribute. > > Simple command form: > > chattr +C /path/to/file/or/directory Thank you for that tip, I had been unaware of it 'till now. This will make my virtualbox image directory much happier :) > Meanwhile, if there's a point at which the file exists in its more or > less permanent form and won't be written into any longer (a torrented > file is fully downloaded, or a VM image is backed up), sequentially > copying it elsewhere (possibly using cp --reflink=never if on the same > filesystem, to avoid a reflink copy pointing at the same fragmented > extents!), then deleting the original fragmented version, should > effectively defragment the file too. And since it's not being written > into any more at that point, it should stay defragmented. > > Or just btrfs filesystem defrag the individual file.. I know I can do the cp --reflink=never, but that will generate 100GB of new files and force me to drop all my hourly/daily/weekly snapshots, so file defrag is definitely a better option. > Finally, there's some more work going into autodefrag now, to hopefully > increase its performance, and make it work more efficiently on a bit > larger files as well. The goal is to eliminate the problems with > systemd's journal, among other things, now that it's known to be a common > problem, given systemd's widespread use and the fact that both systemd > and btrfs aim to be the accepted general Linux default within a few years. Is there a good guideline on which kinds of btrfs filesystems autodefrag is likely not a good idea, even if the current code does not have optimal performance? I suppose fragmented files that are deleted soon after being written are a loss, but otherwise it's mostly a win. Am I missing something? Unfortunately, on a 83GB vdi (virtualbox) file, with 3.12.5, it did a lot of writing and chewed up my 4 CPUs. Then, it started to be hard to move my mouse cursor and my procmeter graph was barely updating seconds. Next, nothing updated on my X server anymore, not even seconds in time widgets. But, I could still sometimes move my mouse cursor, and I could sometimes see the HD light fliker a bit before going dead again. 
In other words, the system wasn't fully deadlocked, but btrfs sure got into a state where it was unable to finish the job, and took the kernel down with it (64bit, 8GB of RAM).

I waited 2H and it never came out of it, I had to power down the system in the end. Note that this was on a top of the line 500MB/s write Samsung Evo 840 SSD, not a slow HD.

I think I had enough free space:
Label: 'btrfs_pool1' uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
Total devices 1 FS bytes used 732.14GB
devid 1 size 865.01GB used 865.01GB path /dev/dm-0

Is it expected behaviour for defrag to lock up on big files? Should I have had more spare free space for it to work? Other?

On the plus side, the file I was trying to defragment, and which hung my system, was not corrupted by the process.

Any idea what I should try from here?

Thanks, Marc
-- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-03 17:25 ` Marc MERLIN @ 2014-01-03 21:34 ` Duncan 2014-01-05 6:39 ` Marc MERLIN 2014-01-08 3:22 ` Marc MERLIN 2014-01-04 20:48 ` Roger Binns 1 sibling, 2 replies; 31+ messages in thread From: Duncan @ 2014-01-03 21:34 UTC (permalink / raw) To: linux-btrfs Marc MERLIN posted on Fri, 03 Jan 2014 09:25:06 -0800 as excerpted: > First, a big thank you for taking the time to post this very informative > message. > > On Wed, Jan 01, 2014 at 12:37:42PM +0000, Duncan wrote: >> Apparently the way some distribution installation scripts work results >> in even a brand new installation being highly fragmented. =:^( If in >> addition they don't add autodefrag to the mount options used when >> mounting the filesystem for the original installation, the problem is >> made even worse, since the autodefrag mount option is designed to help >> catch some of this sort of issue, and schedule the affected files for >> auto-defrag by a separate thread. > > Assuming you can stomach a bit of occasional performance loss due to > autodefrag, is there a reason not to always have this on btrfs > filesystems in newer kernels? (let's say 3.12+)? > > Is there even a reason for this not to become a default mount option in > newer kernels? For big "internal write" files, autodefrag isn't yet well tuned, because it effectively write-magnifies too much, forcing rewrite of the entire file for just a small change. If whatever app is more or less constantly writing those small changes, faster than the file can be rewritten... I don't know where the break-over might be, but certainly, multi-gig sized IO-active VMs images or databases aren't something I'd want to use it with. That's where the NOCOW thing will likely work better. IIRC someone also mentioned problems with autodefrag and an about 3/4 gig systemd journal. My gut feeling (IOW, *NOT* benchmarked!) is that double- digit MiB files should /normally/ be fine, but somewhere in the lower triple digits, write-magnification could well become an issue, depending of course on exactly how much active writing the app is doing into the file. As I said there's more work going into tuning autodefrag ATM, but as it is, I couldn't really recommend making it a global default... tho maybe a distro could enable it by default on a no-VM desktop system (as opposed to a server). Certainly I'd recommend most desktop types enable it. >> The NOCOW file attribute. >> >> Simple command form: >> >> chattr +C /path/to/file/or/directory > > Thank you for that tip, I had been unaware of it 'till now. > This will make my virtualbox image directory much happier :) I think I said it, but it bears repeating. Once you set that attribute on the dir, you may want to move the files out of the dir (to another partition would make sure the data is actually moved) and back in, so they're effectively new files in the dir. Or use something like cat oldfile > newfile, so you know it's actually creating the new file, not reflinking. That'll ensure the NOCOW takes effect. > Unfortunately, on a 83GB vdi (virtualbox) file, with 3.12.5, it did a > lot of writing and chewed up my 4 CPUs. Then, it started to be hard to > move my mouse cursor and my procmeter graph was barely updating seconds. > Next, nothing updated on my X server anymore, not even seconds in time > widgets. > > But, I could still sometimes move my mouse cursor, and I could sometimes > see the HD light fliker a bit before going dead again. 
> In other words, the system wasn't fully deadlocked, but btrfs sure got into a state where it was unable to finish the job, and took the kernel down with it (64bit, 8GB of RAM).
>
> I waited 2H and it never came out of it, I had to power down the system in the end. Note that this was on a top of the line 500MB/s write Samsung Evo 840 SSD, not a slow HD.

That was defrag (the command) or autodefrag (the mount option)? I'd guess defrag (the command).

That's fragmentation for you! What did/does filefrag have to say about that file? Were you the one that posted the 6-digit extents?

For something that bad, it might be faster to copy/move it off-device (expect it to take awhile), then move it back. That way you're only trying to read OR write on the device, not both, and the move elsewhere should defrag it quite a bit, effectively sequential write, then read and write on the move back.

But even that might be prohibitive. At some point, you may need to either simply give up on it (if you're lazy), or get down and dirty with the tracing/profiling, working with a dev to figure out where it's spending its time and hopefully get btrfs recoded to work a bit faster for that sort of thing.

> I think I had enough free space:
> Label: 'btrfs_pool1' uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
> Total devices 1 FS bytes used 732.14GB
> devid 1 size 865.01GB used 865.01GB path /dev/dm-0
>
> Is it expected behaviour for defrag to lock up on big files? Should I have had more spare free space for it to work? Other?

From my understanding it's not the file size, but the number of fragments. I'm guessing you simply overwhelmed the system. Ideally you never let it get that bad in the first place. =:^(

As I suggested above, you might try the old school method of defrag: move the file to a different device, then move it back. And if possible do it when nothing else is using the system. But it may simply be practically inaccessible with a current kernel, in which case you'd either have to work with the devs to optimize, or give it up as a lost cause. =:(

> On the plus side, the file I was trying to defragment, and which hung my system, was not corrupted by the process.
>
> Any idea what I should try from here?

Beyond the above, it's "let the devs hack on it" time. =:^\

One other /narrow/ possibility if you're desperate. You could try splitting the file into chunks (generic term, not btrfs chunks) of some arbitrary shorter size, and copying them out. If you split into say 10 parts, then each piece should take roughly a tenth of the time, altho more fragmented areas will likely take longer. But by splitting into say 100 parts (which would be ~830 MiB apiece), you could at least see the progress and if there was one particular area where it suddenly got a lot worse. I know there's tools for that sort of thing, but I'm not enough into forensics to know much about them...

Then if the process completed successfully, you could cat the parts back together again... and the written parts would be basically sequential, so that should go MUCH faster! =:^)

-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman

^ permalink raw reply [flat|nested] 31+ messages in thread
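One way to do the splitting described above with standard tools; the .vdi name follows Marc's example and the destination path is hypothetical:

  split -b 1G Win7.vdi /mnt/other-disk/Win7.part.
  cat /mnt/other-disk/Win7.part.* > Win7.vdi.new

The reassembled Win7.vdi.new is written sequentially, so it should end up with far fewer extents; it can then replace the original, ideally inside a directory already marked with chattr +C.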
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-03 21:34 ` Duncan @ 2014-01-05 6:39 ` Marc MERLIN 2014-01-05 17:09 ` Chris Murphy 2014-01-08 3:22 ` Marc MERLIN 1 sibling, 1 reply; 31+ messages in thread
From: Marc MERLIN @ 2014-01-05 6:39 UTC (permalink / raw) To: Duncan; +Cc: linux-btrfs

On Fri, Jan 03, 2014 at 09:34:10PM +0000, Duncan wrote:
> > Thank you for that tip, I had been unaware of it 'till now.
> > This will make my virtualbox image directory much happier :)
>
> I think I said it, but it bears repeating. Once you set that attribute on the dir, you may want to move the files out of the dir (to another partition would make sure the data is actually moved) and back in, so they're effectively new files in the dir. Or use something like cat oldfile > newfile, so you know it's actually creating the new file, not reflinking. That'll ensure the NOCOW takes effect.

Yes, I got that. That's why I ran btrfs defrag on the files after that (I explained why: a copy would waste lots of snapshot space by replacing all the blocks needlessly).

> > Unfortunately, on a 83GB vdi (virtualbox) file, with 3.12.5, it did a lot of writing and chewed up my 4 CPUs. Then, it started to be hard to move my mouse cursor and my procmeter graph was barely updating seconds. Next, nothing updated on my X server anymore, not even seconds in time widgets.
> >
> > But, I could still sometimes move my mouse cursor, and I could sometimes see the HD light fliker a bit before going dead again. In other words, the system wasn't fully deadlocked, but btrfs sure got into a state where it was unable to finish the job, and took the kernel down with it (64bit, 8GB of RAM).
> >
> > I waited 2H and it never came out of it, I had to power down the system in the end. Note that this was on a top of the line 500MB/s write Samsung Evo 840 SSD, not a slow HD.
>
> That was defrag (the command) or autodefrag (the mount option)? I'd guess defrag (the command).

defrag, the btrfs subcommand.

> That's fragmentation for you! What did/does filefrag have to say about that file? Were you the one that posted the 6-digit extents?

Nope, I never posted anything until now. Hopefully you agree that it's not ok for btrfs/the kernel to just kill my system for over 2H, until I power it off, because of defragging one file. I can live with a severe performance hit, as long as it's not a never-ending loop.

gandalfthegreat:/var/local/nobck/VirtualBox VMs/Win7# filefrag Win7.vdi
Win7.vdi: 156222 extents found

Considering how virtualbox works, that's hardly surprising.

> For something that bad, it might be faster to copy/move it off-device (expect it to take awhile), then move it back. That way you're only trying to read OR write on the device, not both, and the move elsewhere should defrag it quite a bit, effectively sequential write, then read and write on the move back.

Yes, I know how I can work around the problem (although I'll likely have to delete all my historical snapshots to delete the old blocks, which I don't love to do). But doesn't it make sense to see why the kernel is near deadlocking on a single file defrag first?

> But even that might be prohibitive. At some point, you may need to either simply give up on it (if you're lazy), or get down and dirty with the tracing/profiling, working with a dev to figure out where it's spending its time and hopefully get btrfs recoded to work a bit faster for that sort of thing.
I'm on my way to a linux conf where I'm speaking, so I have limited time and can't crash my laptop, but I'm happy to type some commands and give output. > As I suggested above, you might try the old school method of defrag, move > the file to a different device, then move it back. And if possible do it > when nothing else is using the system. But it may simply be practically > inaccessible with a current kernel, in which case you'd either have to > work with the devs to optimize, or give it up as a lost cause. =:( I can fix my problem, actually virtualbox works fine with the fragmented file, without even feeling slow, so really I don't need to fix it urgently, I was just trying it out after your post. > Then if the process completed successfully, you could cat the parts back > together again... and the written parts would be basically sequential, so > that should go MUCH faster! =:^) All that noted, but I'm not desperate, just trying commands I hadn't tried yet :) Thanks for your replies, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 6:39 ` Marc MERLIN @ 2014-01-05 17:09 ` Chris Murphy 2014-01-05 17:54 ` Jim Salter 0 siblings, 1 reply; 31+ messages in thread From: Chris Murphy @ 2014-01-05 17:09 UTC (permalink / raw) To: Btrfs BTRFS On Jan 4, 2014, at 11:39 PM, Marc MERLIN <marc@merlins.org> wrote: > > Nope, I never posted anything until now. Hopefully you agree that it's > not ok for btrfs/kernel to just kill my system for over 2H until I power > it off before of defragging one file. I did hit a severe performance but > if it's not a never ending loop. > > gandalfthegreat:/var/local/nobck/VirtualBox VMs/Win7# filefrag Win7.vdi > Win7.vdi: 156222 extents found > > Considering how virtualbox works, that's hardly surprising. I haven't read anything so far indicating defrag applies to the VM container use case, rather nodatacow via xattr +C is the way to go. At least for now. > > But doesn't it make sense to see why the kernel is near deadlocking on a > single file defrag first? It's better than a panic or corrupt data. So far the best combination I've found, open to other suggestions though, is +C xattr on /var/lib/libvirt/images, creating non-preallocated qcow2 files, and snapshotting the qcow2 file with qemu-img. Granted when sysroot is snapshot, I'm making btrfs snapshots of these qcow2 files. Another option is to make /var/lib/libvirt/images a subvolume, and then when sysroot is snapshot, then /var/lib/libvirt/images is immune to being snapshot automatically with the parent subvolume. I'd have to explicitly snapshot it. This may be a better way to go to avoid accumulation of btrfs snapshots of qcow2 snapshot files. This may already be a known problem but it's worth sysrq+w, and then dmesg and posting those results if you haven't already. Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
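For anyone unfamiliar with it, the sysrq+w suggestion above can be triggered without a console keyboard, roughly like this (run as root; sysrq may first need to be enabled):

  echo 1 > /proc/sys/kernel/sysrq
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 200

That dumps the stack of every task stuck in uninterruptible sleep into the kernel log, which is the information developers need for hangs like the one described in this thread.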
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 17:09 ` Chris Murphy @ 2014-01-05 17:54 ` Jim Salter 2014-01-05 19:57 ` Duncan 0 siblings, 1 reply; 31+ messages in thread From: Jim Salter @ 2014-01-05 17:54 UTC (permalink / raw) To: Chris Murphy, Btrfs BTRFS On 01/05/2014 12:09 PM, Chris Murphy wrote: > I haven't read anything so far indicating defrag applies to the VM > container use case, rather nodatacow via xattr +C is the way to go. At > least for now. Can you elaborate on the rationale behind database or VM binaries being set nodatacow? I experimented with this*, and found no significant (to me, anyway) performance enhancement with nodatacow on - maybe 10% at best, and if I understand correctly, that implies losing the live per-block checksumming of the data that's set nodatacow, meaning you won't get automatic correction if you're on a redundant array. All I've heard so far is "better performance" without any more detailed explanation, and if the only benefit is an added MAYBE 10%ish performance... I'd rather take the hit, personally. * "experimented with this" == set up a Win2008R2 test VM and ran HDTunePro for several runs on binaries stored with and without nodatacow set, 5G of random and sequential read and write access per run. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 17:54 ` Jim Salter @ 2014-01-05 19:57 ` Duncan 2014-01-05 20:44 ` Chris Murphy 0 siblings, 1 reply; 31+ messages in thread From: Duncan @ 2014-01-05 19:57 UTC (permalink / raw) To: linux-btrfs Jim Salter posted on Sun, 05 Jan 2014 12:54:44 -0500 as excerpted: > On 01/05/2014 12:09 PM, Chris Murphy wrote: >> I haven't read anything so far indicating defrag applies to the VM >> container use case, rather nodatacow via xattr +C is the way to go. At >> least for now. Well, NOCOW from the get-go would certainly be better, but given that the file is already there and heavily fragmented, my idea was to get it defragmented and then set the +C, to prevent it recurring. But I do very little snapshotting here, and as a result hadn't considered the knock-on effect of 100K-plus extents in perhaps 1000 snapshots. I guess that's what's killing the defrag, however it's initiated. The only way to get rid of the problem, then, would be to move the file away and then back, but doing so does still leave all those snapshots with the crazy fragmentation, and to kill that would require either killing all those snapshots, or setting them writable and doing the same move out, move back, on each one! OUCH, but I guess that's why it just seems impossible to deal with the fragmentation on these things, whether it's autodefrag, or named file defrag, or doing the whole move out and back thing, and then having to worry about all those snapshots. Still, I'd guess ultimately it'll need to be done, whether it's a wipe-the-filesystem-and-restore-from-backup or whatever. > Can you elaborate on the rationale behind database or VM binaries being > set nodatacow? I experimented with this*, and found no significant (to > me, > anyway) performance enhancement with nodatacow on - maybe 10% at best, > and if I understand correctly, that implies losing the live per-block > checksumming of the data that's set nodatacow, meaning you won't get > automatic correction if you're on a redundant array. > > All I've heard so far is "better performance" without any more detailed > explanation, and if the only benefit is an added MAYBE 10%ish > performance... I'd rather take the hit, personally. > > * "experimented with this" == set up a Win2008R2 test VM and ran > HDTunePro for several runs on binaries stored with and without nodatacow > set, 5G of random and sequential read and write access per run. Well, the problem isn't just performance, it's that in most such cases the apps actually have their own data integrity checking and management, and sometimes the app's integrity management and that of btrfs end up fighting each other, destroying the data as a result. In normal operation, everything's fine. But should the system crash at the wrong moment, btrfs' atomic commit and data integrity mechanisms can roll back to a slightly earlier version of the file. Which is normally fine. But because hardware is known to often lie about having committed writes that may actually still only be in buffer, if the power outage/crash occurred at the wrong moment, ordinary write-barrier ordering guarantees may be invalid (particularly on large files with finite-seek-speed devices), and the app's own integrity checksum may have been updated before the data it was supposed to be a checksum on actually got to disk. If btrfs ends up rolling back to that condition, btrfs will likely consider the file fine, but the app's own integrity management will consider it corrupted, which it actually is.
But if btrfs only stays out of the way, the application often can fix whatever minor corruption it detects, doing its own roll-backs to an earlier checkpoint, because it's /designed/ to be able to handle such problems on filesystems that don't have integrity management. So having btrfs trying to manage integrity too on such data where the app already handles it is self-defeating, because neither knows about nor considers what the other one is doing, and the two end up undoing each other's careful work. Again, this isn't something you'll see in normal operation, but several people have reported exactly that sort of problem with the general large-internally-written-file, application-self-managed-file-integrity scenario. In those cases, the best thing btrfs can do is simply get out of the way and let the application handle its own integrity management, and the way to tell btrfs to do that, as well as to do in-place rewrites instead of COW-based rewrites, is with the NOCOW xattrib, chattr +C, and that must be done before the file gets so fragmented (and multi-snapshotted in its fragmented state) in the first place. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 31+ messages in thread
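To illustrate the point that the flag has to be in place before the data lands, a sketch (file names are hypothetical); setting +C on an already-written, fragmented file does not help:

touch fresh-copy.img                 # create the file empty
chattr +C fresh-copy.img             # +C only takes reliable effect on new/empty files
cat old-copy.img > fresh-copy.img    # data is now written into a NOCOW file
lsattr fresh-copy.img                # verify the C flag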
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 19:57 ` Duncan @ 2014-01-05 20:44 ` Chris Murphy 0 siblings, 0 replies; 31+ messages in thread From: Chris Murphy @ 2014-01-05 20:44 UTC (permalink / raw) To: Btrfs BTRFS, Duncan On Jan 5, 2014, at 12:57 PM, Duncan <1i5t5.duncan@cox.net> wrote: > > But I do very little snapshotting here, and as a result hadn't considered > the knockon effect of 100K-plus extents in perhaps 1000 snapshots. I wonder if this is an issue with snapshot aware defrag? Some problems were fixed recently but I'm not sure of the status. The OP's case involves Btrfs on LVM on (I think) md raid5. The mdadm default stripe size is 512KB, which would be a 1MB full stripe. There are some optimizations for non-full stripe reads and writes for raid5 (not for raid6 so it takes a much bigger performance hit) but nevertheless it might be a factor. > I > guess that's what's killing the defrag, however it's initiated. The only > way to get rid of the problem, then, would be to move the file away and > then back, but doing so does still leave all those snapshots with the > crazy fragmentation, and to kill that would require either killing all > those snapshots, or setting them writable and doing the same move out, > move back, on each one! OUCH, but I guess that's why it just seems > impossible to deal with the fragmentation on these things, whether it's > autodefrag, or named file defrag, or doing the whole move out and back > thing, and then having to worry about all those snapshots. It's why in the short term I'm using +C from the get go. And if I had more VM images and qcow2 snapshots, I would put them in a subvolume of their own so that they aren't snapshotted along with rootfs. Using Btrfs within the VM I still get the features I expect and the performance is quite good. Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
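For reference, the chunk size in question can be read straight off a running array; the device name is just an example:

mdadm -D /dev/md1 | grep -i 'chunk size'
cat /proc/mdstat                     # also shows the chunk size per array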
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-03 21:34 ` Duncan 2014-01-05 6:39 ` Marc MERLIN @ 2014-01-08 3:22 ` Marc MERLIN 2014-01-08 9:45 ` Duncan 1 sibling, 1 reply; 31+ messages in thread From: Marc MERLIN @ 2014-01-08 3:22 UTC (permalink / raw) To: Duncan, Chris Murphy; +Cc: linux-btrfs, Jim Salter On Fri, Jan 03, 2014 at 09:34:10PM +0000, Duncan wrote: > IIRC someone also mentioned problems with autodefrag and an about 3/4 gig > systemd journal. My gut feeling (IOW, *NOT* benchmarked!) is that double- > digit MiB files should /normally/ be fine, but somewhere in the lower > triple digits, write-magnification could well become an issue, depending > of course on exactly how much active writing the app is doing into the > file. When I defrag'ed my 83GB vm file with 156222 extents, it was not in use or being written to. > As I said there's more work going into tuning autodefrag ATM, but as it > is, I couldn't really recommend making it a global default... tho maybe a > distro could enable it by default on a no-VM desktop system (as opposed > to a server). Certainly I'd recommend most desktop types enable it. I use VMs on my desktop :) but point taken. On Sun, Jan 05, 2014 at 10:09:38AM -0700, Chris Murphy wrote: > > gandalfthegreat:/var/local/nobck/VirtualBox VMs/Win7# filefrag Win7.vdi > > Win7.vdi: 156222 extents found > > > > Considering how virtualbox works, that's hardly surprising. > > I haven't read anything so far indicating defrag applies to the VM container use case, rather nodatacow via xattr +C is the way to go. At least for now. Yep, I'll convert the file, but since I found a pretty severe performance problem, does anyone care to get details off my system before I make the problem go away for me? > It's better than a panic or corrupt data. So far the best combination To be honest, I'd have taken a panic, it would have saved me 2H of waiting for a laptop to recover when it was never going to recover :( Data corruption, sure, obviously :) > I've found, open to other suggestions though, is +C xattr on So you're saying that defragmentation has known performance problems that can't get fixed for now, and that the solution is not to get fragmented or recreate the relevant files. If so, I'll go ahead, I just wanted to make sure I didn't have useful debug state before clearing my problem. > This may already be a known problem but it's worth sysrq+w, and then dmesg and posting those results if you haven't already. No, I had not yet, but I'll do this. On Sun, Jan 05, 2014 at 01:44:25PM -0700, Duncan wrote: > [I normally try to reply directly to list but don't believe I've seen > this there yet, but got it direct-mailed so will reply-all in response.] I like direct Cc on replies, makes my filter and mutt coloring happier :) Dupes with the same message-id are what procmail and others were written for :) > I now believe the lockup must be due to processing the hundreds of > thousands of extents on all those snapshots, too, in addition to doing That's a good call. I do have this: gandalfthegreat:/mnt/btrfs_pool1# ls var var/ var_hourly_20140105_16:00:01/ var_daily_20140102_00:01:01/ var_hourly_20140105_17:00:26/ var_daily_20140103_00:59:28/ var_weekly_20131208_00:02:02/ var_daily_20140104_00:01:01/ var_weekly_20131215_00:02:01/ var_daily_20140105_00:33:14/ var_weekly_20131229_00:02:02/ var_hourly_20140105_05:00:01/ var_weekly_20140105_00:33:14/ > it on the main volume. 
I don't actually make very extensive use of > snapshots here anyway, so I didn't think about that aspect originally, > but that's gotta be what's throwing the real spanner in the works, > turning a possibly long but workable normal defrag (O(1)) into a lockup > scenario (O(n)) where virtually no progress is made as currently > coded. That is indeed what I'm seeing, so it's very possible you're right. Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901 ^ permalink raw reply [flat|nested] 31+ messages in thread
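Those snapshots can also be enumerated and removed through btrfs itself rather than ls; a sketch, assuming the snapshots sit directly in the pool top level as the listing above suggests:

btrfs subvolume list -s /mnt/btrfs_pool1                                # -s limits the listing to snapshots
btrfs subvolume delete /mnt/btrfs_pool1/var_weekly_20131208_00:02:02   # drops one old snapshot (irreversible)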
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-08 3:22 ` Marc MERLIN @ 2014-01-08 9:45 ` Duncan 0 siblings, 0 replies; 31+ messages in thread From: Duncan @ 2014-01-08 9:45 UTC (permalink / raw) To: linux-btrfs Marc MERLIN posted on Tue, 07 Jan 2014 19:22:58 -0800 as excerpted: > On Fri, Jan 03, 2014 at 09:34:10PM +0000, Duncan wrote: >> IIRC someone also mentioned problems with autodefrag and an about 3/4 >> gig systemd journal. My gut feeling (IOW, *NOT* benchmarked!) is that >> double-digit MiB files should /normally/ be fine, but somewhere in the >> lower triple digits, write-magnification could well become an issue, >> depending of course on exactly how much active writing the app is doing >> into the file. > > When I defrag'ed my 83GB vm file with 156222 extents, it was not in use > or being written to. Note the scale... I said double-digit _MiB_ should be fine, but somewhere in the triple-digits write magnification likely becomes a problem (this based on my memory of someone mentioning an issue with a 3/4 gig systemd journal file). You then say 83 _GB_, which may or may not be GiB, but either way, it's three orders of magnitude above the scale I said should be fine, and two orders of magnitude above the scale at which I said problems likely start appearing. So problems at that size are a given. > On Sun, Jan 05, 2014 at 10:09:38AM -0700, Chris Murphy wrote: >> I've found, open to other suggestions though, is +C xattr on > > So you're saying that defragmentation has known performance problems > that can't get fixed for now, and that the solution is not to get > fragmented or recreate the relevant files. > If so, I'll go ahead, I just wanted to make sure I didn't have useful > debug state before clearing my problem. Basically, yes. One of the devs said he's just starting to focus on it again now. So it's a known issue that'll take some work to make better. However, since he's focusing on it again now, now's the time to report stuff like the sysrq+w trace mentioned. > On Sun, Jan 05, 2014 at 01:44:25PM -0700, Duncan wrote: >> [I normally try to reply directly to list but don't believe I've seen >> this there yet, but got it direct-mailed so will reply-all in >> response.] > > I like direct Cc on replies, makes my filter and mutt coloring happier > :) > Dupes with the same message-id are what procmail and others were written > for :) Some of us think this sort of list works best as a public newsgroup... such distributed discussion is what they were designed for, after all... and that keeps it separate from actual email. That's where gmane.org comes in with its list2news (as well as list2web) archiving service. We subscribe to our lists as newsgroups there, use a news/nntp client for it, and save our email client for actually handling (more private) email. If you watch, you'll see links to particular messages on the gmane web interface posted from time to time. For those using gmane's list2news service (and obviously for those using its web interface as well) that's real easy, since gmane adds a header with the web link to messages it serves on the news interface as well. I've been using gmane for perhaps a decade now, but apparently it's more popular for people on this list than I might have expected from other lists, since I see more of those gmane web links posted. But I've also noticed that a lot more people on this list want CCed/ direct-mailed too, not just to read it on the list. 
I generally do that when I see the explicit request, but /only/ when I see the explicit request. >> I now believe the lockup must be due to processing the hundreds of >> thousands of extents on all those snapshots, too > > That's a good call. I do have this: > gandalfthegreat:/mnt/btrfs_pool1# ls var var/ > var_hourly_20140105_16:00:01/ var_daily_20140102_00:01:01/ > var_hourly_20140105_17:00:26/ var_daily_20140103_00:59:28/ > var_weekly_20131208_00:02:02/ var_daily_20140104_00:01:01/ > var_weekly_20131215_00:02:01/ var_daily_20140105_00:33:14/ > var_weekly_20131229_00:02:02/ var_hourly_20140105_05:00:01/ > var_weekly_20140105_00:33:14/ > >> I don't actually make very extensive use of >> snapshots here anyway, so I didn't think about that aspect originally, >> but that's gotta be what's throwing the real spanner in the works, >> turning a possibly long but workable normal defrag (O(1)) into a lockup >> scenario (O(n)) where virtually no progress is made as currently coded. > > That is indeed what I'm seeing, so it's very possible you're right. That's where the evidence is pointing, ATM. Hopefully the defrag work they're doing now will turn snapshotted defrag back into O(1), too. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-03 17:25 ` Marc MERLIN 2014-01-03 21:34 ` Duncan @ 2014-01-04 20:48 ` Roger Binns 1 sibling, 0 replies; 31+ messages in thread From: Roger Binns @ 2014-01-04 20:48 UTC (permalink / raw) To: linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 03/01/14 09:25, Marc MERLIN wrote: > Is there even a reason for this not to become a default mount option > in newer kernels? autodefrag can go insane because it is unbounded. For example I have a 4GB RAM system (3.12, no gui) that kept hanging. I eventually managed to work out the cause being a MySQL database (about 750MB of data only being used by tt-rss refreshing RSS feeds every 4 hours). autodefrag would eventually consume all the RAM and 20GB of swap kicking off the OOM killer and with so little RAM left for anything else that the only recourse was sysrq keys. What I'd love to see is some sort of background worker that does sensible things. For example it could defragment files, but pick the ones that need it the most, and I'd love to see extra copies of (meta)data in currently unused space that is freed as needed. deduping is another worthwhile option. So is recompressing data that hasn't changed recently but using larger block sizes to get more effective ratios. Some of these happen at the moment but they are independent and you have to be aware of the caveats. Roger -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.14 (GNU/Linux) iEYEARECAAYFAlLIc6wACgkQmOOfHg372QQgjgCeJp1sZQ0+Y7WRGE+U+IFljiDY MgQAnjEBspyJZvTC2caEn1Qkn942vPQ2 =rhNY -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2013-12-31 11:46 btrfs-transaction blocked for more than 120 seconds Sulla 2014-01-01 12:37 ` Duncan @ 2014-01-02 8:49 ` Jojo 2014-01-05 20:32 ` Chris Murphy 2 siblings, 0 replies; 31+ messages in thread From: Jojo @ 2014-01-02 8:49 UTC (permalink / raw) To: Sulla, linux-btrfs On 31.12.2013 12:46, Sulla wrote: > Dear all! > > On my Ubuntu Server 13.10 I use a RAID5 blockdevice consisting of 3 WD20EARS > drives. On this I built a LVM and in this LVM I use quite normal partitions > /, /home, SWAP (/boot resides on a RAID1.) and also a custom /data > partition. Everything (except boot and swap) is on btrfs. > > sometimes my system hangs for quite some time (top is showing a high wait > percentage), then runs on normally. I get kernel messages into > /var/log/sylsog, see below. I am unable to make any sense of the kernel > messages, there is no reference to the filesystem or drive affected (at > least I can not find one). > > Question: What is happening here? > * Is a HDD failing (smart looks good, however) > * Is something wrong with my btrfs-filesystem? with which one? > * How can I find the cause? > Hi Wolfgang, first, off topic: Happy New Year! Over the holidays one of our servers (ubuntu 13.04) with custom kernel 3.11.04 did quite similar things, also raid5/raid6. Our problem was that writing to the backup showed much the same kernel log; btrfs-transaction was hanging there too, and filesystem usage at 83% looked fine. But that was not true. After some time-consuming investigation I found that BTRFS in 3.11.x and other kernels(?) may have a problem with free block lists and fragmentation. Our server was able to recover on its own after a defragmentation and compression run. We had run out of free blocks. After rebuilding the free block list and running defrag the server got enough free blocks to operate well. To be able to do that, we were forced to use the btrfs-git kernel and also the btrfs-progs from git. (3.13-rcX) On 26.12.13 I did: # umount /ar # btrfsck --repair --init-extent-tree /dev/sda1 # mount -o clear_cache,skip_balance,autodefrag /dev/sda1 /ar # btrfs fi defragment -rc /ar/backup But beware: I thought 83% used space should leave enough "free blocks", but this was wrong. It seems that the BTRFS free block lists are somewhat erroneous. Especially "balance" may crash if a file has too many extents/fragments, and allocating space may also hang if free blocks are running low. During the defragmentation run the server's responses got slow, but read access never stopped. Our state today: root@bk:~# df -m /ar Filesystem 1M-blocks Used Available Use% Mounted on /dev/sda1 13232966 7213717 3181874 70% /ar root@bk:~# btrfs fi show /ar Label: Archiv+Backup uuid: 72b710aa-49a0-4ff5-a470-231560bfee81 Total devices 5 FS bytes used 6.88TiB devid 1 size 2.73TiB used 2.70TiB path /dev/sda1 devid 2 size 2.73TiB used 2.70TiB path /dev/sdb1 devid 3 size 2.73TiB used 2.70TiB path /dev/sdc1 devid 4 size 2.73TiB used 2.70TiB path /dev/sdd1 devid 5 size 1.70TiB used 4.25GiB path /dev/sde4 Btrfs v3.12 root@bk:~# btrfs fi df /ar Data, single: total=8.00MiB, used=0.00 Data, RAID5: total=8.10TiB, used=6.87TiB System, single: total=4.00MiB, used=0.00 System, RAID5: total=12.00MiB, used=600.00KiB Metadata, single: total=8.00MiB, used=0.00 Metadata, RAID5: total=12.25GiB, used=10.41GiB Today the server has completely recovered to full operation. Is there a plan to handle such out-of-free-blocks/space situations more gracefully?
TIA J. Sauer -- Jürgen Sauer - automatiX GmbH, +49-4209-4699, juergen.sauer@automatix.de Geschäftsführer: Jürgen Sauer, Gerichtstand: Amtsgericht Walsrode • HRB 120986 Ust-Id: DE191468481 • St.Nr.: 36/211/08000 GPG Public Key zur Signaturprüfung: http://www.automatix.de/juergen_sauer_publickey.gpg ^ permalink raw reply [flat|nested] 31+ messages in thread
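A sketch of the checks implied here, before trusting a percent-used figure: compare allocated chunks against device size, and (on kernels with balance filters) reclaim nearly empty data chunks. The mount point matches the mail above; the usage threshold is only an example:

btrfs fi show /ar                    # per-device size vs. allocated ("used") space
btrfs fi df /ar                      # allocation vs. real usage per profile
btrfs balance start -dusage=5 /ar    # rewrite data chunks that are <=5% full to free them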
* Re: btrfs-transaction blocked for more than 120 seconds 2013-12-31 11:46 btrfs-transaction blocked for more than 120 seconds Sulla 2014-01-01 12:37 ` Duncan 2014-01-02 8:49 ` Jojo @ 2014-01-05 20:32 ` Chris Murphy 2014-01-05 21:17 ` Sulla 2 siblings, 1 reply; 31+ messages in thread From: Chris Murphy @ 2014-01-05 20:32 UTC (permalink / raw) To: Sulla; +Cc: linux-btrfs On Dec 31, 2013, at 4:46 AM, Sulla <Sulla@gmx.at> wrote: > Dear all! > > On my Ubuntu Server 13.10 I use a RAID5 blockdevice consisting of 3 WD20EARS Sulla is this md raid5? If so can you report the result from mdadm -D <mddevice>, I'm curious what the chunk size is. Thanks. Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 20:32 ` Chris Murphy @ 2014-01-05 21:17 ` Sulla 2014-01-05 22:36 ` Brendan Hide 2014-01-05 23:48 ` Chris Murphy 0 siblings, 2 replies; 31+ messages in thread From: Sulla @ 2014-01-05 21:17 UTC (permalink / raw) To: Chris Murphy; +Cc: linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Dear Chris! Certainly: I have 3 HDDs, all of which WD20EARS. Originally I wanted to let btrfs handle all 3 devices directly without making partitions, but this was impossible, as at least /boot needed to be ext4, at least back then when I set up the server. And back then btrfs also hadn't raid5-like functionality, so I decided to put good old partitions and md-Raids and LVM on them and use btrfs just as plain file-systems on the partitions provided by LVM. On the WD disks I thus created 2 partitions each, the first sdX1 being ~500MiB, the rest, 1.9995 TiB is one partition of, sdX2. I built a Raid1 on the 3 small partitions sdX1 with ext4 for boot, each disk is bootable with grub installed into the MBR. I combined the 3 large partitions to a Raid5 of size 3,64TB: /proc/mdstat reads: md0 : active raid1 sda1[5] sdb1[4] sdc1[3] 498676 blocks super 1.2 [3/3] [UUU] md1 : active raid5 sda2[5] sdb2[4] sdc2[3] 3904907520 blocks super 1.2 level 5, 8k chunk, algorithm 2 [3/3] [UUU] the information you requested: # sudo mdadm -D /dev/md1 /dev/md1: Version : 1.2 Creation Time : Thu Jul 14 18:49:25 2011 Raid Level : raid5 Array Size : 3904907520 (3724.01 GiB 3998.63 GB) Used Dev Size : 1952453760 (1862.01 GiB 1999.31 GB) Raid Devices : 3 Total Devices : 3 Persistence : Superblock is persistent Update Time : Sun Jan 5 22:07:22 2014 State : clean Active Devices : 3 Working Devices : 3 Failed Devices : 0 Spare Devices : 0 Layout : left-symmetric Chunk Size : 8K Name : freedom:1 (local to host freedom) UUID : 44b72520:a78af6f7:dba13fb3:2203127d Events : 576884 Number Major Minor RaidDevice State 4 8 18 0 active sync /dev/sdb2 5 8 2 1 active sync /dev/sda2 3 8 34 2 active sync /dev/sdc2 I use the Raid5 md1 as physical volume for LVM: pvdisplay gives: --- Physical volume --- PV Name /dev/md1 VG Name MAIN PV Size 3.64 TiB / not usable 2.06 MiB Allocatable yes PE Size 4.00 MiB Total PE 953346 Free PE 6274 Allocated PE 947072 PV UUID WcuEx8-ehJL-xHdf-ElwF-b9s3-dlmM-KZlDNG I keep a reserve of 6274 4MiB blocks (=24GiB) in case one of the logical volumes runs out of space... I created the following logical volumes, named after their intended mountpoints: --- Logical volume --- LV Path /dev/MAIN/ROOT LV Name ROOT VG Name MAIN LV UUID kURJks-xHox-73B5-n02x-eZfS-agDD-n1dtAm LV Write Access read/write LV Creation host, time , LV Status available # open 1 LV Size 19.31 GiB Current LE 4944 Segments 2 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 252:0 and similar: --- Logical volume --- LV Path /dev/MAIN/SWAP: 1.8GB LV Path /dev/MAIN/HOME: 18.6GB LV Path /dev/MAIN/TMP: 9.3 GB LV Path /dev/MAIN/DATA1 2.6 TB LV Path /dev/MAIN/DATA2: 0.9 TB as filesystem I used btrfs during install form an ubuntu server, I don't recall which, might have been 11.10 or 12.04 (?) for all logical partitions except swap, of course, any other information I can supply? regards, Sulla - -- Cogito cogito ergo cogito sum. 
Ambrose Bierce -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.21 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlLJy+8ACgkQR6b2EdogPFupxgCfeDRdeO+PYoQNIjtySAYEmSEr PNoAoLPNcSqDHsDzM8pAuHlbva7j18MS =XBOA -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 21:17 ` Sulla @ 2014-01-05 22:36 ` Brendan Hide 2014-01-05 22:57 ` Roman Mamedov 2014-01-06 0:15 ` Chris Murphy 2014-01-05 23:48 ` Chris Murphy 1 sibling, 2 replies; 31+ messages in thread From: Brendan Hide @ 2014-01-05 22:36 UTC (permalink / raw) To: Sulla, Chris Murphy; +Cc: linux-btrfs On 2014/01/05 11:17 PM, Sulla wrote: > Certainly: I have 3 HDDs, all of which WD20EARS. Maybe/maybe-not off-topic: Poor hardware performance, though not necessarily the root cause, can be a major factor with these errors. WD Greens (Reds too, for that matter) have poor non-sequential performance. An educated guess I'd say there's a 15% chance this is a major factor to the problem and, perhaps, a 60% chance it is merely a "small contributor" to the problem. Greens are aimed at consumers wanting high capacity and a low pricepoint. The result is poor performance. See footnote * re my experience. My general recommendation (use cases vary of course) is to install a tiny SSD (60GB, for example) just for the OS. It is typically cheaper than the larger drives and will be *much* faster. WD Greens and Reds have good *sequential* throughput but comparatively abysmal random throughput even in comparison to regular non-SSD consumer drives. * I had 8x 1.5TB WD1500EARS drives in an mdRAID5 array. With it I had a single 250GB IDE disk for the OS. When the very old IDE disk inevitably died, I decided to use a spare 1.5TB drive for the OS. Performance was bad enough that I simply bought my first SSD the same week. -- __________ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 22:36 ` Brendan Hide @ 2014-01-05 22:57 ` Roman Mamedov 2014-01-07 10:22 ` Brendan Hide 2014-01-06 0:15 ` Chris Murphy 1 sibling, 1 reply; 31+ messages in thread From: Roman Mamedov @ 2014-01-05 22:57 UTC (permalink / raw) To: Brendan Hide; +Cc: Sulla, Chris Murphy, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 483 bytes --] On Mon, 06 Jan 2014 00:36:22 +0200 Brendan Hide <brendan@swiftspirit.co.za> wrote: > I had 8x 1.5TB WD1500EARS drives in an mdRAID5 array. With it I had a > single 250GB IDE disk for the OS. When the very old IDE disk inevitably > died, I decided to use a spare 1.5TB drive for the OS. Performance was > bad enough that I simply bought my first SSD the same week. Did you align your partitions to accommodate for the 4K sector of the EARS? -- With respect, Roman [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
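Alignment can be checked without repartitioning; a sketch, with example device and partition names (a start sector divisible by 8 means 4KiB alignment on 512-byte logical sectors):

cat /sys/block/sdb/sdb1/start          # start sector of the partition
parted /dev/sdb align-check optimal 1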
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 22:57 ` Roman Mamedov @ 2014-01-07 10:22 ` Brendan Hide 0 siblings, 0 replies; 31+ messages in thread From: Brendan Hide @ 2014-01-07 10:22 UTC (permalink / raw) To: Roman Mamedov; +Cc: Sulla, Chris Murphy, linux-btrfs On 2014/01/06 12:57 AM, Roman Mamedov wrote: > Did you align your partitions to accommodate for the 4K sector of the EARS? I had, yes. I had to do a lot of research to get the array working "optimally". I didn't need to repartition the spare so this carried over to its being used as an OS disk. I actually lost the "Green" array twice - and learned some valuable lessons: 1. I had an 8-port SCSI card which was dropping the disks due to the timeout issue mentioned by Chris. That caused the first array failure. Technically all the data was on the disks - but temporarily irrecoverable as disks were constantly being dropped. I made a mistake during ddrescue which simultaneously destroyed two disks' data, meaning that the recovery operation was finally for nought. The only consolation was that I had very little data at the time and none of it was irreplaceable. 2. After replacing the SCSI card with two 4-port SATA cards, a few months later I still had a double-failure (the second failure being during the RAID5 rebuild). This time it was only due to bad disks and a lack of scrubbing/early warning - clearly my own fault. Having learnt these lessons, I'm now a big fan of scrubbing and backups. ;) I'm also pushing for RAID15 wherever data is mission-critical. I simply don't "trust" the reliability of disks any more and I also better understand how, by having more and/or larger disks in a RAID5/6 array, the overall reliability of that array plummets. -- __________ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 22:36 ` Brendan Hide 2014-01-05 22:57 ` Roman Mamedov @ 2014-01-06 0:15 ` Chris Murphy 2014-01-06 0:19 ` Chris Murphy 1 sibling, 1 reply; 31+ messages in thread From: Chris Murphy @ 2014-01-06 0:15 UTC (permalink / raw) To: Btrfs BTRFS; +Cc: Sulla, Brendan Hide On Jan 5, 2014, at 3:36 PM, Brendan Hide <brendan@swiftspirit.co.za> wrote: > WD Greens (Reds too, for that matter) have poor non-sequential performance. An educated guess I'd say there's a 15% chance this is a major factor to the problem and, perhaps, a 60% chance it is merely a "small contributor" to the problem. Greens are aimed at consumers wanting high capacity and a low pricepoint. The result is poor performance. See footnote * re my experience. > > My general recommendation (use cases vary of course) is to install a tiny SSD (60GB, for example) just for the OS. It is typically cheaper than the larger drives and will be *much* faster. WD Greens and Reds have good *sequential* throughput but comparatively abysmal random throughput even in comparison to regular non-SSD consumer drives. Another thing with md raid and parallel file systems that's been an issue is cqf. On the XFS list cqf is approximately in the realm of persona non grata. It might be worth Sulla also setting elevator=deadline and seeing if simply different scheduling is a workaround, not that it's OK to get blocked with cqf. But it might be worth a shot as a more conservative approach than upgrading the kernel from 3.11.0. > I had 8x 1.5TB WD1500EARS drives in an mdRAID5 array. With it I had a single 250GB IDE disk for the OS. When the very old IDE disk inevitably died, I decided to use a spare 1.5TB drive for the OS. Performance was bad enough that I simply bought my first SSD the same week. Yeah for what it's worth, the current WD Green PDF says these drives are not to be used in RAID at all. Not 0, 1, 5 or 6. Even Caviar Black is proscribed from use in RAID environments using multibay chassis, as in, no warranty. It's desktop raid0 and raid1 only, and arguably the lack of configurable SCT ERC makes it not ideal even for raid1. Anyway, Sulla, how about putting up a smartctl -x for each drive? Curious if there are any bad sectors that have developed, and it may be worth filtering all /var/log/messages for the word "reset" and see if you find any of these drives ever being reset by the kernel and if so, post the full output of that. Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-06 0:15 ` Chris Murphy @ 2014-01-06 0:19 ` Chris Murphy 0 siblings, 0 replies; 31+ messages in thread From: Chris Murphy @ 2014-01-06 0:19 UTC (permalink / raw) To: Btrfs BTRFS On Jan 5, 2014, at 5:15 PM, Chris Murphy <lists@colorremedies.com> wrote: > > On Jan 5, 2014, at 3:36 PM, Brendan Hide <brendan@swiftspirit.co.za> wrote: > >> WD Greens (Reds too, for that matter) have poor non-sequential performance. An educated guess I'd say there's a 15% chance this is a major factor to the problem and, perhaps, a 60% chance it is merely a "small contributor" to the problem. Greens are aimed at consumers wanting high capacity and a low pricepoint. The result is poor performance. See footnote * re my experience. >> >> My general recommendation (use cases vary of course) is to install a tiny SSD (60GB, for example) just for the OS. It is typically cheaper than the larger drives and will be *much* faster. WD Greens and Reds have good *sequential* throughput but comparatively abysmal random throughput even in comparison to regular non-SSD consumer drives. > > > Another thing with md raid and parallel flie systems that's been an issue is cqf. Oops, CFQ! Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
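Spelled out, the checks suggested above might look like this; device names and log paths are examples, and the scheduler change is per-device and not persistent across reboots:

cat /sys/block/sda/queue/scheduler               # current elevator, e.g. noop deadline [cfq]
echo deadline > /sys/block/sda/queue/scheduler   # switch this disk to the deadline scheduler
smartctl -x /dev/sda                             # full SMART attribute/error dump
grep -i reset /var/log/messages                  # any kernel-initiated device/link resets?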
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 21:17 ` Sulla 2014-01-05 22:36 ` Brendan Hide @ 2014-01-05 23:48 ` Chris Murphy 2014-01-05 23:57 ` Chris Murphy 1 sibling, 1 reply; 31+ messages in thread From: Chris Murphy @ 2014-01-05 23:48 UTC (permalink / raw) To: Sulla; +Cc: linux-btrfs On Jan 5, 2014, at 2:17 PM, Sulla <Sulla@gmx.at> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Dear Chris! > > Certainly: I have 3 HDDs, all of which WD20EARS. These drives don't have a configurable SCT ERC, so you need to modify the SCSI block layer timeout: echo 120 >/sys/block/sdX/device/timeout You also need to schedule regular scrubs at the md level as well. echo check > /sys/block/mdX/md/sync_action cat /sys/block/mdX/mismatch_cnt More info about this is in man 4 md, and on the linux-raid list. > > 3904907520 blocks super 1.2 level 5, 8k chunk, algorithm 2 [3/3] [UUU] OK so 8KB chunk, 16KB full stripe, so that doesn't apply to what I was thinking might be the case. The workload is presumably small file sizes, like a mail server? > any other information I can supply? I'm not a developer, I don't know if this problem is known or maybe fixed in a newer kernel than 3.11.0 - which has been around for 5-6 months. I think the main suggestion is to try a newer kernel, granted with the configuration of md, lvm, and btrfs you have three layers that will likely have kernel changes. I'd make sure you have backups. While this layout is valid and should work, it's also probably less common and therefore less tested. Usually in case of blocking devs want to see sysrq+w issued. The setup is dmesg -n7, and enable sysrq functions. Then reproduce the block, and during the block issue w to the sysrq trigger, then capture dmesg contents and post the block and any other nearby btrfs messages. https://www.kernel.org/doc/Documentation/sysrq.txt Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
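A sketch of that capture procedure, run as root while the hang is in progress (echoing 1 enables all sysrq functions; a narrower bitmask also works):

dmesg -n 7                         # raise the console log level
echo 1 > /proc/sys/kernel/sysrq    # enable the sysrq trigger
echo w > /proc/sysrq-trigger       # dump blocked (uninterruptible) tasks
dmesg | tail -n 300                # collect the traces for the report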
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 23:48 ` Chris Murphy @ 2014-01-05 23:57 ` Chris Murphy 2014-01-06 0:25 ` Sulla 0 siblings, 1 reply; 31+ messages in thread From: Chris Murphy @ 2014-01-05 23:57 UTC (permalink / raw) To: Sulla; +Cc: linux-btrfs On Jan 5, 2014, at 4:48 PM, Chris Murphy <lists@colorremedies.com> wrote: > > On Jan 5, 2014, at 2:17 PM, Sulla <Sulla@gmx.at> wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> Dear Chris! >> >> Certainly: I have 3 HDDs, all of which WD20EARS. > > These drives don't have a configurable SCT ERC, so you need to modify the SCSI block layer timeout: > > echo 120 >/sys/block/sdX/device/timeout > > You also need to schedule regular scrubs at the md level as well. > > echo check > /sys/block/mdX/md/sync_action > cat /sys/block/mdX/mismatch_cnt > > More info about this is in man 4 md, and on the linux-raid list. > >> >> 3904907520 blocks super 1.2 level 5, 8k chunk, algorithm 2 [3/3] [UUU] > > OK so 8KB chunk, 16KB full stripe, so that doesn't apply to what I was thinking might be the case. The workload is presumably small file sizes, like a mail server? > > >> any other information I can supply? > > I'm not a developer, I don't know if this problem is known or maybe fixed in a newer kernel than 3.11.0 - which has been around for 5-6 months. I think the main suggestion is to try a newer kernel, granted with the configuration of md, lvm, and btrfs you have three layers that will likely have kernel changes. I'd make sure you have backups. While this layout is valid and should work, it's also probably less common and therefore less tested. > > Usually in case of blocking devs want to see sysrq+w issued. The setup is dmesg -n7, and enable sysrq functions. Then reproduce the block, and during the block issue w to the sysrq trigger, then capture dmesg contents and post the block and any other nearby btrfs messages. > > https://www.kernel.org/doc/Documentation/sysrq.txt Also, this thread is pretty cluttered with other conversations by now so I think you're best off starting a new thread with this information, maybe a title of "PROBLEM: btrfs on LVM on md raid, blocking > 120 seconds" Since it's almost inevitable you'd be asked to test with a newer kernel anyway, you might as well go to 3.13rc7 and see if you can reproduce, if reproducible, be specific with the problem report by following this template: https://www.kernel.org/pub/linux/docs/lkml/reporting-bugs.html Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-05 23:57 ` Chris Murphy @ 2014-01-06 0:25 ` Sulla 2014-01-06 0:49 ` Chris Murphy 0 siblings, 1 reply; 31+ messages in thread From: Sulla @ 2014-01-06 0:25 UTC (permalink / raw) To: Chris Murphy; +Cc: linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Thanks Chris! Thanks for your support. >> echo 120 >/sys/block/sdX/device/timeout The timeout is 30 for my HDDs. I'm well aware that the WD green HDDs are not the perfect ones for servers, but they were cheaper - and quieter - than the black ones for servers. I'll get the red ones next, though. ;-) >> You also need to schedule regular scrubs at the md level as well. Ubuntu does that once a month. >> cat /sys/block/mdX/mismatch_cnt This resides at /sys/devices/virtual/block/md1/md/mismatch_cnt on my machine. The count is zero. >> The workload is presumably small file sizes, like a mail server? Yes. It serves as a mailserver (maildir-format), but also as a samba file server with quite big files... btrfs ran fine for more than a year, so I'm not sure how reproducible the problem is... I don't really wish to install or compile custom kernels, to be honest. Not sure how problematic they might be during the next do-release-upgrade... Sulla - -- Russian Roulette is not the same without a gun and baby when it's love, if it's not rough, it isn't fun, fun. Lady GaGa, "Pokerface" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.21 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlLJ+A8ACgkQR6b2EdogPFuFwwCffSjZpDJvIj70Ag+CPbClCVuc viEAnjqnxcEdhKR2Gq84eGYEXfjfb23F =pmTS -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds 2014-01-06 0:25 ` Sulla @ 2014-01-06 0:49 ` Chris Murphy [not found] ` <52CA06FE.2030802@gmx.at> 0 siblings, 1 reply; 31+ messages in thread From: Chris Murphy @ 2014-01-06 0:49 UTC (permalink / raw) To: Sulla; +Cc: linux-btrfs On Jan 5, 2014, at 5:25 PM, Sulla <Sulla@gmx.at> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Thanks Chris! > > Thanks for your support. > >>> echo 120 >/sys/block/sdX/device/timeout > timeout is 30 for my HDDs. I don't think those drives support a configurable time out; the Green hasn't support it in years. Where are you getting this information? What do you get for 'smartctl -l scterc /dev/sdX'? > I don't really wish to install or compile cumstom kernels, to be honest. If the problem is reproducible, then that's the fastest way to find out if it's been fixed or not. In this case 3.11 is EOL already, no more updates. Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <52CA06FE.2030802@gmx.at>]
* Re: btrfs-transaction blocked for more than 120 seconds [not found] ` <52CA06FE.2030802@gmx.at> @ 2014-01-06 1:55 ` Chris Murphy 0 siblings, 0 replies; 31+ messages in thread From: Chris Murphy @ 2014-01-06 1:55 UTC (permalink / raw) To: Sulla; +Cc: Btrfs BTRFS On Jan 5, 2014, at 6:29 PM, Sulla <Sulla@gmx.at> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi Chris! > > # sudo smartctl -l scterc /dev/sda > tells me > SCT Error Recovery Control command not supported > > you're right. the /sys/block/sdX/device/timeout file probably is useless then. OK there's some confusion. /sys/block/sdX/device/timeout is the SCSI block layer timeout - linux itself has a timeout for each command issued to a block device, and will reset the link upon timeout being reached. So writing 120 to this will cause linux to wait for up to 120 seconds for the drive to respond. This is necessary because if there's a bad sector, the drive must report a read error in order for the md driver to reconstruct that data from parity. This is needed both for effective scrubs and for recovery on read error in normal operation. It is not a persistent setting so you'll want to create a startup script for it. Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
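One possible way to reapply it at every boot - a sketch only, using the three data disks from this thread and rc.local as an example mechanism (a udev rule would work as well):

# appended to /etc/rc.local, before any final "exit 0"
for d in sda sdb sdc; do
    echo 120 > /sys/block/$d/device/timeout
done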
[parent not found: <ADin1n00P0VAdqd01DioM9>]
* Re: btrfs-transaction blocked for more than 120 seconds [not found] <ADin1n00P0VAdqd01DioM9> @ 2014-01-05 20:44 ` Duncan 0 siblings, 0 replies; 31+ messages in thread From: Duncan @ 2014-01-05 20:44 UTC (permalink / raw) To: Jim Salter; +Cc: Marc MERLIN, linux-btrfs On Sun, 05 Jan 2014 08:42:46 -0500 Jim Salter <jim@jrs-s.net> wrote: > On Jan 5, 2014 1:39 AM, Marc MERLIN <marc@merlins.org> wrote: > > > > On Fri, Jan 03, 2014 at 09:34:10PM +0000, Duncan wrote: > > Yes, I got that. That why I ran btrfs defrag on the files after that > > Why are you trying to defrag an SSD? There's no seek penalty for > moving between fragmented blocks, so defrag isn't really desirable in > the first place. [I normally try to reply directly to list but don't believe I've seen this there yet, but got it direct-mailed so will reply-all in response.] There's no seek penalty so the overall problem is dramatically lessened as that's the significant part of it on spinning rust, correct, but... SSDs do remain IOPS-bound, and tens or hundreds of thousands of extents do exact an IOPS (as well as general extent bookkeeping) toll, too. That's why I ended up enabling autodefrag here when I was first setting up, even tho I'm on SSD. (Only after asking the list basically the same question, what good it is autodefrag on SSD, tho.) Luckily I don't happen to deal with any of the internal-write-in-huge-files scenarios, however, and I enabled autodefrag to cover the internal-write-in-small-file scenarios BEFORE I started putting any data on the filesystems at all, so I'm basically covered, here, without actually having to do chattr +C on anything. > That doesn't change the fact that the described lockup sounds like a > bug not a feature of course, but I think the answer to your personal > issue on that particular machine is "don't defrag a solid state > drive". I now believe the lockup must be due to processing the hundreds of thousands of extents on all those snapshots, too, in addition to doing it on the main volume. I don't actually make very extensive use of snapshots here anyway, so I didn't think about that aspect originally, but that's gotta be what's throwing the real spanner in the works, turning a possibly long but workable normal defrag (O(1)) into a lockup scenario (O(n)) where virtually no progress is made as currently coded. -- Duncan - No HTML messages please, as they are filtered as spam. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 31+ messages in thread
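For completeness, autodefrag as discussed here is just a mount option; a sketch with a placeholder fstab entry:

mount -o remount,autodefrag /home
# or persistently in /etc/fstab (the UUID is a placeholder):
# UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /home  btrfs  noatime,autodefrag  0  2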
end of thread, other threads:[~2014-01-08 9:45 UTC | newest] Thread overview: 31+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-12-31 11:46 btrfs-transaction blocked for more than 120 seconds Sulla 2014-01-01 12:37 ` Duncan 2014-01-01 20:08 ` Sulla 2014-01-02 8:38 ` Duncan 2014-01-03 1:24 ` Kai Krakow 2014-01-03 9:18 ` Duncan 2014-01-05 0:12 ` Sulla 2014-01-03 17:25 ` Marc MERLIN 2014-01-03 21:34 ` Duncan 2014-01-05 6:39 ` Marc MERLIN 2014-01-05 17:09 ` Chris Murphy 2014-01-05 17:54 ` Jim Salter 2014-01-05 19:57 ` Duncan 2014-01-05 20:44 ` Chris Murphy 2014-01-08 3:22 ` Marc MERLIN 2014-01-08 9:45 ` Duncan 2014-01-04 20:48 ` Roger Binns 2014-01-02 8:49 ` Jojo 2014-01-05 20:32 ` Chris Murphy 2014-01-05 21:17 ` Sulla 2014-01-05 22:36 ` Brendan Hide 2014-01-05 22:57 ` Roman Mamedov 2014-01-07 10:22 ` Brendan Hide 2014-01-06 0:15 ` Chris Murphy 2014-01-06 0:19 ` Chris Murphy 2014-01-05 23:48 ` Chris Murphy 2014-01-05 23:57 ` Chris Murphy 2014-01-06 0:25 ` Sulla 2014-01-06 0:49 ` Chris Murphy [not found] ` <52CA06FE.2030802@gmx.at> 2014-01-06 1:55 ` Chris Murphy [not found] <ADin1n00P0VAdqd01DioM9> 2014-01-05 20:44 ` Duncan