From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: btrfs-transaction blocked for more than 120 seconds
Date: Thu, 2 Jan 2014 08:38:22 +0000 (UTC)
Message-ID: <pan$d8c7$6644fca2$fe85bee2$cc84b5d0@cox.net>
In-Reply-To: <loom.20140101T204832-543@post.gmane.org>
Sulla posted on Wed, 01 Jan 2014 20:08:21 +0000 as excerpted:
> Dear Duncan!
>
> Thanks very much for your exhaustive answer.
>
> Hm, I also thought of fragmentation, although I don't think it is
> really very likely, as my server doesn't serve things that typically
> cause fragmentation.
> It is a mailserver (but only maildir format), a fileserver for Windows
> clients (huge files that hardly ever get rewritten), a server for TV
> recordings (but it only copies recordings from a sat receiver after
> they have been recorded, so no heavy rewriting here), a tiny webserver
> and all kinds of such things, but not storage for huge databases,
> virtual machines or a target for filesharing clients.
> It does, however, serve as a target for a hardlink-based backup program
> run on Windows PCs, but only once per month or so, so that shouldn't be
> too much.
One thing I didn't mention originally, was how to check for fragmentation.
filefrag is part of e2fsprogs, and does the trick -- with one caveat.
filefrag currently doesn't know about btrfs compression, and interprets
each 128 KiB block as a separate extent. So if you have btrfs
compression turned on and check a (larger than 128 KiB) file that btrfs
has compressed, filefrag will falsely report fragmentation.
If in doubt, you can always try defragging that individual file and see
if filefrag reports fewer extents or not. If it has fewer extents you
know it was fragmented, if not...
With that you should actually be able to check some of those big files
that you don't think are fragmented, to see whether they really are.
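FWIW, the sort of spot-check I mean would look something like this (the
path is just an example; filefrag -v would list the individual extents):

  filefrag /srv/tv/recording.ts
  # e.g. "recording.ts: 347 extents found" -- a high count on an
  # UNcompressed file suggests real fragmentation
  sudo btrfs filesystem defragment /srv/tv/recording.ts
  filefrag /srv/tv/recording.ts
  # if the extent count dropped noticeably it really was fragmented; if
  # it stayed put, it was likely the compression false-positive above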
> The problem must lie somewhere on the root partition itself, because the
> system is already slow before mounting the fat data partitions.
>
> I'll give the defragmentation a try. But
> # sudo btrfs filesystem defrag -r
> doesn't work, because "-r" is an unknown option (I'm running Btrfs
> v0.20-rc1 on an Ubuntu 3.11.0-14-generic kernel).
The -r option was added quite recently.
As the wiki (at https://btrfs.wiki.kernel.org ) urges, btrfs is a
development filesystem and people choosing to test it should really try
to keep current, both because you're unnecessarily putting the data
you're testing on btrfs at risk when running old versions with bugs
patched in newer versions (that part's mostly for the kernel, tho), and
because as a tester, when things /do/ go wrong and you report it, the
reports are far more useful if you're running a current version.
Kernel 3.11.0 is old. 3.12 has been out for well over a month now. And
the btrfs-progs userspace recently switched to kernel-synced versioning
as well, with version 3.12 the latest version, which also happens to be
the first kernel-version-synced version.
That's assuming you don't choose to run the latest git version of the
userspace, and the Linus kernel RCs, which many btrfs testers do. (Tho
last I updated btrfs-progs, about a week ago, the last git commit was
still the version bump to 3.12, but I'm running a git kernel at version
3.13.0-rc5 plus 69 commits.)
So you are encouraged to update. =:^)
However, if you don't choose to upgrade ... (see next)
> I'm doing a # sudo btrfs filesystem defrag / &
> on the root directory at the moment.
... Before the -r option was added, btrfs filesystem defrag would only
defrag the specific file it was pointed at. If pointed at a directory,
it would defrag the directory metadata, but not files or subdirs below it.
The way to defrag the entire filesystem then involved a rather more
complicated command, using find to list everything on the filesystem and
run defrag individually on each item listed. It's on the wiki. Let's
see if I can find it... (yes):
https://btrfs.wiki.kernel.org/index.php/UseCases#How_do_I_defragment_many_files.3F
sudo find [subvol [subvol]…] -xdev -type f \
  -exec btrfs filesystem defragment -- {} +
As the wiki warns, that doesn't recurse into subvolumes (the -xdev keeps
it from going onto non-btrfs filesystems but also keeps it from going
into subvolumes), but you can list them as paths where noted.
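On your setup that might look something like the below (the paths are
just a guess at your layout; list whatever subvolumes and btrfs
mountpoints you actually have as the starting paths, since -xdev won't
cross into them on its own):

  sudo find / /home -xdev -type f \
    -exec btrfs filesystem defragment -- {} +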
> Question: will this defragment everything or just the root-fs and will I
> need to run a defragment on /home as well, as /home is a separate btrfs
> filesystem?
Well, as noted your command doesn't really defragment that much. But the
find command should defragment everything on the named subvolumes.
But of course this is where that bit I mentioned in the original post
about possibly taking hours with multiple terabytes on spinning rust
comes in too. It could take a while, and when it gets to really
fragmented files, it'll probably trigger the same sort of stalls that
have us discussing the whole thing in the first place, so the system may
not be exactly usable. =:^(
> I've also added the autodefrag mount option and will do a "mount -a"
> after the defragmentation.
>
> I've considered a # sudo btrfs balance start as well, would this do any
> good? How close should I let the data fill the partition? The large data
> partitions are 85% used, root is 70% used. Is this safe or should I add
> space?
!! Be careful!! You mentioned running 3.11. Early versions of both
3.11 and 3.12 had a bug where running a balance and a defrag at the
same time could cause bad things (lockups or even corrupted data)!
Running just one at a time and letting it finish before starting the
other should be fine. And later stable kernels in both the 3.11 and
3.12 series have that bug fixed (as does 3.13). But 3.11.0 is almost
certainly still affected, unless Ubuntu backported the fix without
bumping the kernel version.
But because a full balance rewrites everything anyway, it'll effectively
defrag too. So if you're going to do a balance, you can skip the
defrag. =:^) And since it's likely to take hours at the terabyte scale
on spinning rust, that's just as well.
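If you do go the balance route instead, it'd be something like this, one
filesystem at a time and NOT concurrently with a defrag, given the bug
above (mountpoints again just examples):

  sudo btrfs balance start /
  sudo btrfs balance status /   # progress, from another terminal
  # ...and only once that has completely finished:
  sudo btrfs balance start /home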
As for the space question, that's a whole different subject with its own
convolutions. =:^\
Very briefly, the rule of thumb I use is that for partitions of
sufficient size (several GiB at the low end), you always want btrfs
filesystem show to report at LEAST enough unallocated space left to
allocate one more data chunk and one more metadata chunk. Data chunks
default to 1 GiB, while metadata
chunks default to 256 MiB, but because single-device metadata defaults to
DUP mode, metadata chunks are normally allocated in pairs and that
doubles to half a GiB.
So you need at LEAST 1.5 GiB unallocated, in order to be sure balance
can work, since it allocates a new chunk and writes into it from the old
chunks, until it can free up the old chunks.
Assuming you have large enough filesystems, I'd try to keep twice that, 3
GiB unallocated according to btrfs filesystem show, and would definitely
recommend doing a rebalance any time it starts getting close to that.
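To put numbers on that, here's the sort of check I mean -- the output
below is invented, just to show what to look at:

  sudo btrfs filesystem show
  # Label: none  uuid: ....
  #   Total devices 1 FS bytes used 45.12GiB
  #   devid 1 size 60.00GiB used 52.00GiB path /dev/sda2
  #
  # unallocated = size - used = 60 - 52 = 8 GiB here, comfortably above
  # the 3 GiB cushion, so no balance needed on that score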
If you tend to have many multi-gig files, you'll probably want to keep
enough unallocated space (rounded up to a whole gig, plus the 3 gig
minimum I suggested above) around to handle at least one of those as
well, just so you know you always have space available to move at least
one of those if necessary, without using up your 3 gig safety margin.
Beyond that, take a look at your btrfs filesystem df output. I already
mentioned that data chunk size is 1 GiB, metadata 256 MiB (doubled to 512
MiB by the default dup mode on a single-device btrfs). So if data says
something like total=248.00GiB, used=123.24GiB (example picked out of
thin air), you know you're running a whole bunch of half empty chunks,
and a balance should trim that down dramatically, to probably
total=124.00GiB altho it's possible it might be 125.00GiB or something,
but in any case it should be FAR closer to used than the twice-used
figure in my example above. Any time total is more than a GiB above
used, a balance is likely to be able to reduce it and return the extra to
the unallocated pool.
Of course the same applies to metadata, keeping in mind its default-dup,
so you're effectively allocating in 512 MiB chunks for it. But any time
total is more than 512 MiB above used, a balance will probably reduce it,
returning the extra space to the unallocated pool.
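Again with invented numbers (reusing the data example above), reading
btrfs filesystem df goes something like this:

  sudo btrfs filesystem df /
  # Data: total=248.00GiB, used=123.24GiB
  # Metadata, DUP: total=3.00GiB, used=1.80GiB
  # System, DUP: total=32.00MiB, used=36.00KiB
  #
  # Data total is ~125 GiB above used and metadata total is ~1.2 GiB
  # above used, so a balance should hand most of that difference back
  # to the unallocated pool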
Of course single vs. dup on single devices, and multiple devices with all
the different btrfs raid modes, throw various curves into the numbers
given above. While it's reasonably straightforward to figure an
individual case, explaining all the permutations gets quite complex. And
while it's not supported yet, eventually btrfs is supposed to support
different raid levels, etc, for different subvolumes, which will throw
even MORE complexity into the thing! And obviously for small single-
digit GiB partitions the rules must be adjusted, even more so for mixed-
blockgroup, which is the default below 1 GiB but makes some sense in the
single-digit GiB size range as well. But the reasonably large single-
device default isn't /too/ bad, even if it takes a bit to explain, as I
did here.
Meanwhile, especially on spinning rust at terabyte sizes, those balances
are going to take a while, so you probably don't want to run them daily.
And on SSDs, balances (and defrags and anything else for that matter)
should go MUCH faster, but SSDs are limited-write-cycle, and any time you
balance you're rewriting all that data and metadata, thus using up
limited write cycles on all those gigs worth of blocks in one fell swoop!
So either way, doing balances without any clear return probably isn't a
good idea. But when the allocated space gets within a few gigs of total
as shown by btrfs filesystem show, or when total gets multiple gigs above
used as shown by btrfs filesystem df, it's time to consider a balance.
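One more note: if your kernel and progs accept balance filters (they've
been in the kernel since 3.3, so 3.11 should qualify, tho I'd verify
your v0.20-rc1 progs takes the syntax before relying on it), a filtered
balance only rewrites the mostly-empty chunks, which gets back most of
the benefit at a fraction of the time and write-cycle cost:

  sudo btrfs balance start -dusage=50 -musage=50 /
  # only rewrites data/metadata chunks that are at most 50% used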
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman