From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Performance Issues
Date: Sat, 20 Sep 2014 05:58:52 +0000 (UTC)
Message-ID: <pan$9d45b$35a57b34$29d0d2ef$8358244@cox.net>
In-Reply-To: 1411145469.1601.2.camel@zarniwoop.blob
Rob Spanton posted on Fri, 19 Sep 2014 17:51:09 +0100 as excerpted:
> The evolution problem has been improved: the sqlite db that it was using
> had over 18000 fragments, so I got evolution to recreate that file with
> nocow set. It now takes "only" 30s to load my mail rather than 80s,
> which is better...
>
> On Fri, 2014-09-19 at 11:05 -0400, Josef Bacik wrote:
>> Weird, I get the exact opposite performance. Anyway it's probably
>> because of your file layouts, try defragging your git dir and see if
>> that helps. Thanks,
>
> Defragging has improved matters a bit: it now takes 26s (was 46s) to run
> git status. Still not amazing, but at the moment I have no evidence to
> suggest that it's not something to do with the machine's hardware. If I
> get time over the weekend I'll dig out an external hard disk and try a
> couple of benchmarks with that.
[Replying via mail and list both, as requested.]
If you're snapshotting those nocow files, be aware (if you aren't
already) that nocow, snapshots and defrag (all on the same files) don't
work all that well together...
First let's deal with snapshots of nocow files.
What does a snapshot do? It locks the existing version of a file in
place, both logically, so you can still get at that version via the
snapshot even after changes have been made, and physically, in that the
existing extents are pinned where they are. With normal cow files this
is fine, since any change causes the changed block to be written
elsewhere anyway, freeing the now-replaced block if nothing else is
holding it in place. A snapshot simply keeps a reference to the existing
extent when the data is cowed elsewhere, instead of letting it be
released, so there's a way to get the old version, as referenced by that
snapshot, back too.
But nocow files are normally overwritten in place; that's what nocow
/is/. Obviously that conflicts with what a snapshot does, namely locking
the existing version in place.
What btrfs does to handle that, then, is to force a cow write on the
first write to a (4 KiB) block of a (normally) nocow file after a
snapshot. The file remains nocow, and further writes to the /same/ block
continue to go to the same new location... until another snapshot locks
/that/ in place.
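(For reference, nocow is normally set per-file with chattr +C, and it only
sticks if the file is still empty when you set it; set it on the directory
instead and new files inherit it. If you'd rather do it from code, something
like the following Python sketch should work -- the ioctl numbers are the
64-bit Linux values from linux/fs.h and the path is just a placeholder, so
treat it as illustrative rather than gospel.)

  #!/usr/bin/env python3
  # Illustrative only: set the nocow flag (the chattr +C bit) on a newly
  # created, still-empty file before the application starts writing to it.
  import fcntl, os, struct

  FS_IOC_GETFLAGS = 0x80086601   # _IOR('f', 1, long), 64-bit Linux
  FS_IOC_SETFLAGS = 0x40086602   # _IOW('f', 2, long), 64-bit Linux
  FS_NOCOW_FL     = 0x00800000   # the "C" attribute, from linux/fs.h

  def set_nocow(path):
      fd = os.open(path, os.O_RDONLY)
      try:
          buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("l", 0))
          flags = struct.unpack("l", buf)[0]
          fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("l", flags | FS_NOCOW_FL))
      finally:
          os.close(fd)

  # Placeholder path: create the file empty, flag it, then let the app fill it.
  open("/path/to/new-file.sqlite", "w").close()
  set_nocow("/path/to/new-file.sqlite")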
That's all fine if you're just taking occasional snapshots and/or if the
nocow file isn't being very actively rewritten; it's not that big a deal
in that case. *BUT*, if you're doing time-based snapshots, say every
hour or so, and the file is being actively and semi-randomly rewritten,
the constant snapshotting keeps locking the current version in place,
forcing many of those writes to cow anyway, and that ends up fragmenting
the file nearly as fast as it would fragment without nocow at all. IOW,
the nocow ends up being nearly worthless on that file!
There is a (partial) workaround, however. Snapshots stop at subvolume
boundaries, so you can put the nocow files on their own dedicated
subvolume. You can then continue snapshotting the parent subvolume as
you were before; the snapshot stops at the dedicated subvolume, so the
nocow files on it don't get snapshotted and thus don't pick up the
snapshot-triggered fragmentation.
Of course without that snapshotting you'll need to do conventional backup
on the files in that dedicated nocow subvolume.
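If it helps, here's a rough sketch of that setup (paths are placeholders,
and it assumes btrfs-progs and chattr are installed and that it runs with
sufficient privileges; adjust to taste):

  #!/usr/bin/env python3
  # Rough sketch of the dedicated-subvolume workaround; paths are placeholders.
  import subprocess

  NOCOW_SUBVOL = "/mnt/data/nocow"   # hypothetical home for the nocow files

  # Create the subvolume; snapshots of the parent stop at this boundary.
  subprocess.run(["btrfs", "subvolume", "create", NOCOW_SUBVOL], check=True)

  # Mark its top directory nocow so files created inside inherit the flag.
  subprocess.run(["chattr", "+C", NOCOW_SUBVOL], check=True)

  # Existing files have to be copied in (not moved or reflinked) to pick up
  # nocow, e.g.:  cp --reflink=never /mnt/data/mail.sqlite /mnt/data/nocow/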
Another alternative is to continue snapshotting the dedicated subvolume
and its nocow files, but at a lower frequency, perhaps every day or twice
a day instead of every hour, or maybe twice a week instead of daily, or
whatever. That will slow down but not eliminate the snapshot-triggered
fragmentation of the nocow files.
If you then combine that with scheduled (presumably cron job or systemd-
timer) defrag of that dedicated subvolume, perhaps weekly or monthly,
depending on how fast it still fragments, that can help keep performance
from dragging down too badly.
Of course you can use the scheduled-defrag technique without the
dedicated subvolume too, and simply increase the frequency of the
defrags instead of decreasing the frequency of the snapshots, if that
works better for you.
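Something like this, run from the cron job or systemd timer, would do for
the scheduled defrag (again, the path is a placeholder and this is only a
sketch):

  #!/usr/bin/env python3
  # Sketch of a periodic defrag job for the dedicated (or any) subvolume,
  # intended to be invoked from cron or a systemd timer.  Path is a placeholder.
  import subprocess, sys

  TARGET = "/mnt/data/nocow"

  result = subprocess.run(["btrfs", "filesystem", "defragment", "-r", TARGET],
                          capture_output=True, text=True)
  if result.returncode != 0:
      sys.stderr.write(result.stderr)
      sys.exit(result.returncode)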
Meanwhile, how big are those files? If you're not dealing with any nocow-
candidate files approaching a GiB or larger, you may find that the
autodefrag mount option helps. However, it works by queuing up a rewrite
of the entire file for a worker thread that comes along a bit later, and
if the file is too big and being written to too heavily, the changes can
end up coming in faster than the file can be rewritten. Obviously that's
not a good thing. Generally, for files under 100 MiB autodefrag works
very well. For actively rewritten files over a GiB it doesn't work well
at all, and for files in between it depends on the speed of your hardware
and how fast the rewrites are coming in. In practice, most folks seem to
be OK up to a quarter GiB or so, and most start having problems around
3/4 GiB; 256-768 MiB is the YMMV zone.
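If you want a quick way to see whether you have candidates in that zone, a
trivial scan like this will list them (the thresholds are just the rules of
thumb above, nothing btrfs itself enforces):

  #!/usr/bin/env python3
  # List files at or above the ~256 MiB point where autodefrag may start to
  # struggle if the file is actively rewritten.  Thresholds are rules of
  # thumb from the discussion above, not anything btrfs enforces.
  import os, sys

  YMMV_LOW  = 256 * 1024 * 1024   # ~ a quarter GiB
  YMMV_HIGH = 768 * 1024 * 1024   # ~ three quarters of a GiB

  root = sys.argv[1] if len(sys.argv) > 1 else "."
  for dirpath, _dirs, files in os.walk(root):
      for name in files:
          path = os.path.join(dirpath, name)
          try:
              size = os.path.getsize(path)
          except OSError:
              continue
          if size >= YMMV_LOW:
              zone = "likely trouble" if size >= YMMV_HIGH else "YMMV zone"
              print("%8.1f MiB  %-14s %s" % (size / 2**20, zone, path))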
Meanwhile, from what I've read, sqlite apparently works best with under
half a GiB of data to manage anyway; beyond that it's time to consider
scaling up to something like mysql/mariadb. So for most people, if all
they're dealing with is sqlite files, those are usually under half a GiB
and the autodefrag mount option works at least reasonably well.
But I mentioned defrag as not working so well with snapshots too. The
problem there is somewhat different.
Before kernel 3.9 btrfs defrag wasn't snapshot aware -- it would defrag
just the current snapshot, leaving others in place. This of course
duplicated the data that defrag moved since the old locations couldn't be
freed as other snapshots were still referencing them, thus eating up
space rather faster than might have been expected.
With 3.9 defrag became snapshot-aware, and would track and adjust all
references to a block when defrag moved it. Unfortunately, that first
attempt had **HUGE** scaling issues -- defrags that should have taken
hours were taking days, even weeks, and eating multiple gigabytes of
memory, such that it was running into out-of-memory errors even on
machines with 16 and 32 GiB of RAM! (IIRC we had one report of it
happening on a 64 GiB machine too!) Let alone the poor 32-bit folks! It
turned out that quotas and snapshots were the big culprits; people with
thousands of snapshots (as can happen with snapper and the like if it's
not set to thin them out regularly) AND quotas enabled simply found
defrag didn't work at all for them.
So along about 3.12 (I'm not sure exactly), that first attempt at
snapshot-aware-defrag was disabled again, so people could at least /run/
defrag.
While they've rewritten various bits to scale MUCH better now, snapshot-
aware-defrag remains disabled for the time being.
Which means defrag is again only working on the current snapshot it's
pointed at, leaving other snapshots in place as they are. And that means
that if you're snapshotting and defragging, and not deleting those
snapshots within a reasonable time, data usage is going to go up, because
defrag duplicates the data it moves: it moves it only for the current
snapshot, while the references other snapshots hold to the fragmented
version stay in place, continuing to take up space until every snapshot
referencing the old blocks is deleted.
So if you're doing regular snapshots, try to keep them within a
reasonably limited time frame, taking conventional backups before the
snapshots expire if you need longer retention. If you can keep snapshots
to a month, great; failing that, do try to keep it to 60 or 90 days if
possible, and rely on conventional backups beyond that. And since btrfs
is still under development and not fully stable, such backups are
strongly recommended anyway.
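For what it's worth, a retention policy along those lines is easy to
script. The sketch below assumes snapshots sit as subvolumes directly
under one directory and that the directory mtime roughly tracks when each
was taken; the path, the 60-day window and the mtime assumption are all
placeholders, and it's destructive, so dry-run it first:

  #!/usr/bin/env python3
  # Sketch of a simple snapshot retention policy: delete snapshots older
  # than MAX_AGE_DAYS.  Assumes snapshots live as subvolumes directly under
  # SNAP_DIR and that directory mtime approximates when each was taken.
  # Destructive -- leave DRY_RUN on until you've checked the output.
  import os, subprocess, time

  SNAP_DIR = "/mnt/data/.snapshots"
  MAX_AGE_DAYS = 60
  DRY_RUN = True

  cutoff = time.time() - MAX_AGE_DAYS * 86400
  for entry in sorted(os.scandir(SNAP_DIR), key=lambda e: e.stat().st_mtime):
      if entry.is_dir() and entry.stat().st_mtime < cutoff:
          cmd = ["btrfs", "subvolume", "delete", entry.path]
          print("would run:" if DRY_RUN else "running:", " ".join(cmd))
          if not DRY_RUN:
              subprocess.run(cmd, check=True)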
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman