From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: [4.4.1] btrfs-transacti frequent high CPU usage despite little fragmentation
Date: Fri, 18 Mar 2016 23:06:04 +0000 (UTC)
Message-ID: <pan$c1998$42b65a7e$8b396259$757615d8@cox.net>
In-Reply-To: 56EBCB7A.1010508@gmail.com

Ole Langbehn posted on Fri, 18 Mar 2016 10:33:46 +0100 as excerpted:

> Duncan,
> 
> thanks for your extensive answer.
> 
> On 17.03.2016 11:51, Duncan wrote:
>> Ole Langbehn posted on Wed, 16 Mar 2016 10:45:28 +0100 as excerpted:
>> 
>> Have you tried the autodefrag mount option, then defragging?  That
>> should help keep rewritten files from fragmenting so heavily, at least.
>>  On spinning rust it doesn't play so well with large (half-gig plus)
>> databases or VM images, but on ssds it should scale rather larger; on
>> fast SSDs I'd not expect problems until 1-2 GiB, possibly higher.
> 
> Since I do have some big VM images, I never tried autodefrag.

OK.  Tho as you're on ssd you might consider /trying/ it.  The big 
problem with autodefrag and big VMs and DBs is that as the filesize gets 
larger, it becomes more difficult for autodefrag to keep up with the 
incoming stream of modifications, but ssds tend to be fast enough that 
they can keep up for far longer, and it may be that you won't see a 
noticeable issue.  If you do, you can always turn the mount option back 
off.
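
If you want to experiment, autodefrag should be toggleable with a 
remount, no reboot needed.  Assuming / is the btrfs in question, 
something like:

# mount -o remount,autodefrag /

and to make it stick across reboots, add autodefrag to that filesystem's 
options in /etc/fstab.  Remounting with noautodefrag (on kernels that 
support the no variant) turns it back off.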

Also, nocow should mean autodefrag doesn't affect the file anyway, as it 
won't be fragmenting due to the nocow.  So if you have your really large 
VMs and DBs set nocow, it's quite likely, particularly on ssd, that you 
can set autodefrag and not see the performance problems with those large 
files that are the reason it's normally not recommended for the large 
db/vm use-case.

And like I said you can always turn it back off if necessary.
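
For reference, the usual nocow drill is to set the attribute on the 
containing directory so new files inherit it, since it only takes 
reliable effect on empty files, then copy the data in as a fresh, 
non-reflinked copy.  A sketch, path hypothetical of course:

# chattr +C /var/lib/images
# cp --reflink=never vm.img /var/lib/images/
# lsattr /var/lib/images/vm.img    (should show the C attribute)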

>> For large dbs or VM images, too large for autodefrag to handle well,
>> the nocow attribute is the usual suggestion, but I'll skip the details
>> on that for now, as you may not need it with autodefrag on an ssd,
>> unless your database and VM files are several gig apiece.
> 
> Since posting the original post, I experimented with setting the firefox
> places.sqlite to nodatacow (on a new file). 1 extent since, seems to
> work.

Seems you are reasonably familiar with the nocow attribute drill, so I'll 
just cover one remaining base, in case you missed it.

Nocow interacts with snapshots.  Basically, snapshots turn nocow into 
cow1 (cow the first write), because the snapshot locks the existing 
version in place.  The first change to a block after a snapshot, then, 
must be cow'd, tho further changes to it after that remain nocow at the 
new in-place location.

So nocow isn't fully nocow with snapshots, and fragmentation will slow 
down, but not be eliminated.  People doing regularly scheduled 
snapshotting therefore often need to do less frequent but also regularly 
scheduled (perhaps weekly or monthly, for multiple snapshots per day) 
defrag of their nocow files.

Tho be aware that for performance reasons, defrag isn't snapshot aware 
and will break reflinks to existing snapshots, thereby increasing 
filesystem usage.  The total effect on usage of course depends on how 
much updating the nocow files get as well as snapshotting and defrag 
frequency.
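
Such a scheduled defrag needn't be filesystem-wide, either; pointing it 
at just the nocow files or their directory keeps the reflink breakage 
contained.  A sketch, path hypothetical again:

# filefrag /var/lib/images/vm.img        (check the extent count first)
# btrfs filesystem defragment /var/lib/images/vm.img

or with -r on the directory to recurse.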

>>> BTW: I did a VACUUM on the sqlite db and afterwards it had 1 extent.
>>> Expected, just saying that vacuuming seems to be a good measure for
>>> defragmenting sqlite databases.
>> 
>> I know the concept, but out of curiosity, what tool do you use for
>> that?  I imagine my firefox sqlite dbs could use some vacuuming as
>> well, but don't have the foggiest idea how to go about it.
> 
> simple call of the command line interface, like with any other SQL DB:
> 
> # sqlite3 /path/to/db.sqlite "VACUUM;"

Cool.  As far as I knew, sqlite was library only, no executable to invoke 
in that manner.  Shows how little I knew about sqlite. =:^)  Thanks.

>> Of *most* importance, you really *really* need to do something about
>> that data chunk imbalance, and to a lesser extent that metadata chunk
>> imbalance, because your unallocated space is well under a gig (306
>> MiB), with all that extra space, hundreds of gigs of it, locked up in
>> unused or only partially used chunks.
> 
> I'm curious - why is that a bad thing?

Btrfs allocates space in two stages: first to chunks of data or metadata 
type (there's also the system type, but that's pretty much fixed size, so 
once the filesystem is created no further system chunks are normally 
needed, unless it's created as a single-device filesystem and a whole 
slew of additional devices are added later, or the filesystem is 
massively resized on the same device, of course), then from within those 
chunks, to files from data chunks and to metadata nodes from metadata 
chunks, as necessary.
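
Both stages are visible in the usual tools:

# btrfs filesystem df /      (per-type: chunk space allocated vs. used)
# btrfs filesystem usage /   (adds the device-level unallocated figure)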

What can happen then, and used to happen frequently before 3.17, tho much 
less frequently now, is that over time and with use, the filesystem will 
allocate all available space to one chunk type, typically data, and then 
run out of space in the other type, typically metadata, with no 
unallocated space left from which to allocate more.  So you'll have lots 
of space left, but it'll be all tied up in only partially used chunks of 
the one type, and you'll be out of space in the other type.

And by the time you actually start getting ENOSPC errors as a result, 
there's often too little space left to create even the one additional 
chunk necessary for a balance to write the data from other chunks into, 
in order to combine some of the less-used chunks into fewer chunks at 
100% usage (but for the last one, of course).

And you were already in a tight spot in that regard and may well have had 
errors if you had simply tried an unfiltered balance, because data chunks 
are typically 1 GiB in size (and can be up to 10 GiB in some 
circumstances on large enough filesystems, tho I think the really large 
sizes require multi-device), and you were down to 300-ish MiB of 
unallocated space, not enough to create a new 1 GiB data chunk.

And considering the filesystem's near terabyte scale, to be down to under 
a GiB of unallocated space is even more startling, particularly on newer 
kernels where empty chunks are normally reclaimed automatically (tho as 
the usage=0 balances reclaimed some space for you, obviously not all of 
them had been reclaimed in your case).

That was what was alarming to me, and it /may/ have had something to do 
with the high cpu and low speeds, tho indications were that you still had 
enough space in both data and metadata that it shouldn't have been too 
bad just yet.  But it was potentially heading that way, if you didn't do 
something, which is why I stressed it as I did.  Getting out of such 
situations once you're tightly jammed can be quite difficult and 
inconvenient, tho you were lucky enough not to be that tightly jammed 
just yet, only headed that way.

>> The subject says 4.4.1, but it's unclear whether that's your kernel
>> version or your btrfs-progs userspace version.

> # uname -r
> 4.4.1-gentoo
> 
> # btrfs --version
> btrfs-progs v4.4.1
> 
> So, both 4.4.1 ;)

=:^)

>> Try this:
>> 
>> btrfs balance start -dusage=0 -musage=0
> 
> Did this although I'm reasonably up to date kernel-wise. I am very sure
> that the filesystem has never seen <3.18. Took some minutes, ended up
> with
> 
> # btrfs filesystem usage /
> Overall:
>     Device size:                 915.32GiB
>     Device allocated:            681.32GiB
>     Device unallocated:          234.00GiB
>     Device missing:                  0.00B
>     Used:                        153.80GiB
>     Free (estimated):            751.08GiB      (min: 751.08GiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   1.00
>     Global reserve:              512.00MiB      (used: 0.00B)
> 
> Data,single: Size:667.31GiB, Used:150.22GiB
>    /dev/sda2     667.31GiB
> 
> Metadata,single: Size:14.01GiB, Used:3.58GiB
>    /dev/sda2      14.01GiB
> 
> System,single: Size:4.00MiB, Used:112.00KiB
>    /dev/sda2       4.00MiB
> 
> Unallocated:
>    /dev/sda2     234.00GiB
> 
> 
> -> Helped with data, not with metadata.

Yes, and most importantly, you're already out of the tight jam you were 
headed into, now with a comfortable several hundred gigs of unallocated 
space. =:^)

With that, not all of the specific hoops I suggested were necessary for 
the further steps.  In particular, I was afraid that wouldn't clear any 
chunks at all and you'd still have under a GiB free, still too little to 
properly balance data chunks, thus the suggestion to start with metadata, 
hoping it worked.

>> Then start with metadata, and up the usage numbers, which are
>> percentages, like this:
>> 
>> btrfs balance start -musage=5
>> 
>> Then, if it works, up the number to 10, 20, etc.
> 
> upped it to 70, relocated a total of 13 out of 685 chunks:
> 
> Metadata,single: Size:5.00GiB, Used:3.58GiB
>    /dev/sda2       5.00GiB

So you cleared a few more gigs to unallocated, as metadata total was 14 
GiB, now it's 5 GiB, much more in line with used (especially given the 
fact that your half a GiB of global reserve comes from metadata but 
doesn't count as used in the above figure, so you're effectively a bit 
over 4 GiB used, meaning you may not be able to free more even if 
balancing all metadata chunks with just -m, no usage filter).
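
(For the record, that no-filter metadata balance would be just:

# btrfs balance start -m /

but as above, it's unlikely to free much more here.)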


>> Once you have several gigs in unallocated, then try the same thing with
>> data:
>> 
>> btrfs balance start -dusage=5
>> 
>> And again, increase it in increments of 5 or 10% at a time, to 50 or
>> 70%.
> 
> did
> 
> # btrfs balance start -dusage=70
> 
> straight away, took ages, regularly froze processes for minutes, after
> about 8h status is:
> 
> # btrfs balance status /
> Balance on '/' is paused
> 192 out of about 595 chunks balanced (194 considered),  68% left
> # btrfs filesystem usage /
> Overall:
>     Device size:                 915.32GiB
>     Device allocated:            482.04GiB
>     Device unallocated:          433.28GiB
>     Device missing:                  0.00B
>     Used:                        154.36GiB
>     Free (estimated):            759.48GiB      (min: 759.48GiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   1.00
>     Global reserve:              512.00MiB      (used: 0.00B)
> 
> Data,single: Size:477.01GiB, Used:150.80GiB
>    /dev/sda2     477.01GiB
> 
> Metadata,single: Size:5.00GiB, Used:3.56GiB
>    /dev/sda2       5.00GiB
> 
> System,single: Size:32.00MiB, Used:96.00KiB
>    /dev/sda2      32.00MiB
> 
> Unallocated:
>    /dev/sda2     433.28GiB
> 
> -> Looking good. Will proceed when I don't need the box to actually be
> responsive.

The thing with the usage= filter is this.  Balancing empty (usage=0) 
chunks simply deletes them, so it's nearly instantaneous, and obviously 
reclaims 100% of the space since they were empty: huge bang for the buck.  
Balancing nearly empty chunks is still quite fast, since there's very 
little data to rewrite and compact into the new chunks, and at usage=10, 
for instance, you can compact ten or more chunks that are each at most 
10% used into a single new chunk, so as long as you have a lot of them, 
you still get really good bang for the buck.

But as usage increases, you're writing more and more data for less and 
less bang for the buck.  At half full, 50% usage, balance is only 
combining two chunks into one, writing one chunk's worth of data to 
recover just one chunk of the two, where at 10% usage the same amount of 
writing recovered 9 chunks out of 10.

So when the filesystem still has a lot of room, a lot of people stop at 
say usage=25, where they're still recovering 3/4 of the chunks, or 
usage=33, where they're recovering 2/3.  As the filesystem fills up, they 
may need to do usage=50, recovering only 1/2 of the chunks rewritten, and 
eventually, usage=67 or 70, writing three chunks into two, and thus 
recovering only one chunk's worth of space for every three written, 1/3.
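
(As a rule of thumb, at usage=X a balance reclaims roughly (100-X)% of 
the space occupied by the chunks it rewrites, which is why the payoff 
falls off so steeply as X climbs.)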

It's rarely useful to go above that, unless you're /really/ pressed for 
space, and then it's simpler to just do a balance without that filter and 
balance all chunks, tho you can still use -d or -m to only do data or 
metadata chunks, if desired.

That's why I suggested you bump the usage up in increments, with my 
intention, tho I guess I didn't clearly state it, being that you'd stop 
once total dropped reasonably close to used, for data, say 300 GiB total, 
150 used, or if you were lucky, 200 GiB total, 150 used.

With luck that would have happened at say -dusage=40 or -dusage=50, while 
your bang for the buck was still reclaiming at least half of the chunks 
in the rewrite, and -dusage=70 would have never been needed.

That's why you found it taking so long.
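
In command terms, the intended pattern was roughly the below, assuming / 
is the mountpoint, checking after each step and stopping as soon as data 
Size is reasonably close to Used:

# btrfs balance start -dusage=5 /
# btrfs filesystem usage /       (check data Size vs. Used)
# btrfs balance start -dusage=10 /
# btrfs filesystem usage /

...and so on, in steps of 5 or 10.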

Meanwhile, discussion in another thread reminded me of another factor, 
quotas.

For a long time quotas were simply broken in btrfs as the code was buggy 
and various corner-cases resulted in negative (!!) reported usage and the 
like.  With kernel 4.4, known corner-case bugs are in general fixed and 
the numbers should finally be correct, but there's still a LOT of quota 
overhead for balance, etc.  They're discussing right now whether part or 
all of that can be eliminated, but for the time being, anyway, active 
btrfs quotas incur a /massive/ balance overhead.  So if you use quotas 
and are going to be doing more than trivial balances, it's worth turning 
them off temporarily for the balance if you can, then rescanning when you 
turn them back on afterward, assuming you actually need them and didn't 
simply have quotas on because you could.  Of course, depending on how you 
are using quotas, turning them off for the balance might not be an 
option, but if you can, it avoids effectively repeated rescans during the 
balance, and while the rescan when turning them back on will take some 
time, it should be far less than the time lost to repeated rescans with 
quotas enabled during the balance.

I believe btrfs check is similarly afflicted with massive quota overhead, 
tho I'm not sure if it's /as/ bad for check.

I've never had quotas enabled here at all, however, as I don't really 
need them and the negatives are still too high, even if they're actually 
working now, and I entirely forgot about them when I was recommending the 
above to help get your chunk usage vs. total back under control.

So if you're using quotas, consider turning them off at least temporarily 
when you do reschedule those balances.  In fact, you may wish to leave 
them off if you don't really need them, at least until they figure out 
how to reduce the overhead they currently trigger in balance and check.
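
In command terms, again assuming / is the filesystem, the sequence would 
be something like:

# btrfs quota disable /
# btrfs balance start -dusage=50 /
# btrfs quota enable /
# btrfs quota rescan -w /       (-w waits for the rescan to complete)

with the explicit rescan only needed if enable doesn't kick one off 
itself on your progs version.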

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

