From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] Btrfs: fix deadlock with nested trans handles
Date: Fri, 21 Mar 2014 05:44:41 +0000 (UTC) [thread overview]
Message-ID: <pan$51664$6e2e2da0$3ae011f3$4e915a26@cox.net> (raw)
In-Reply-To: CAGfcS_mZ9=gmdxyn0jj_xKFK7XiejCACe84knzoJfC9gkg7CNw@mail.gmail.com
Rich Freeman posted on Thu, 20 Mar 2014 22:13:51 -0400 as excerpted:
> However, I am deleting my snapshots one at a time at a rate of one
> every 5-30 minutes, and while that is creating surprisingly high disk
> loads on my ssd and hard drives, I don't get any panics. I figured
> that having only one deletion pending per checkpoint would eliminate
> locking risk.
>
> I did get some blocked task messages in dmesg, like:
> [105538.121239] INFO: task mysqld:3006 blocked for more than 120 seconds.
These... are a continuing issue. The devs are working on it, but...

The people who seem to have it worst are those combining scripted
snapshotting with large (gig+) constantly internally-rewritten files
such as VM images (the most commonly reported case) or databases.
Properly setting NOCOW on the files[1] helps, but...
* The key thing to realize about snapshotting continually rewritten
NOCOW files is that the first change to a block after a snapshot must,
by definition, be COWed anyway, since the file content has changed from
that of the snapshot. Further writes to the same block (until the next
snapshot) are rewritten in place (the NOCOW attribute is maintained
through that one mandatory COW), but come the next snapshot and the
next write... BAM! Gotta COW again! A quick way to watch this happen
is sketched below.
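For the curious, here's a minimal sketch of observing that forced COW.
The paths are hypothetical; it assumes /mnt/test is a btrfs subvolume
and uses only the stock chattr, dd, filefrag and btrfs tools:

  touch /mnt/test/vm.img            # create empty...
  chattr +C /mnt/test/vm.img        # ...and set NOCOW while zero-size
  dd if=/dev/zero of=/mnt/test/vm.img bs=1M count=100
  filefrag -v /mnt/test/vm.img      # note the extent count

  btrfs subvolume snapshot /mnt/test /mnt/test-snap
  dd if=/dev/urandom of=/mnt/test/vm.img bs=4K count=1 seek=1000 \
     conv=notrunc
  filefrag -v /mnt/test/vm.img      # extra extent: that block was COWed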
So while NOCOW helps, in scenarios such as hourly snapshotting of
active VM images its ability to control actual fragmentation is
unfortunately rather limited. And it's precisely this fragmentation
that appears to be the problem! =:^(
It's almost certainly that fragmentation that's triggering your
blocked-for-N-seconds issues. But the interesting thing here is the
reports even from people with fast SSDs, where seek time and even IOPS
shouldn't be a huge issue. In at least some cases, the problem has
been CPU time, not physical media access.
Which is one reason snapshot-aware defrag was disabled again recently:
it simply wasn't scaling. (To answer the question, yes, defrag still
works; it's only the snapshot-awareness that was disabled. Defrag is
back to dumbly ignoring other snapshots and simply defragging the
working file-extent mapping it is run on, with the other snapshots
staying untouched.) They're reworking the whole feature now in order
to scale better.
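For reference, a typical manual defrag run looks like this (the path
and the 32MiB target extent size are just example values):

  # recursive defrag; with snapshot-awareness disabled, this touches
  # only the extents of the files it is pointed at
  btrfs filesystem defragment -r -t 32M /mnt/pool/vms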
But while disabling it considerably reduces the pain point (people
with enough snapshots were seeing little or no defrag/balance/restripe
progress in /hours/, and that problem has been bypassed for the
moment), we're still left with these nasty N-second stalls at times,
especially when doing anything else involving those snapshots and the
fragmentation they cover, including deleting them. Hopefully tweaking
and eventually optimizing the algorithms will do away with much of
this problem, but I've a feeling it'll be around to some degree for
some years.
Meanwhile, for data that fits that known problematic profile, the
current recommendation is, preferably, to isolate it to a subvolume
that gets only very limited snapshotting, or none at all; a sketch of
such a layout follows.
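Something like this, with made-up paths (the key point being that
btrfs snapshots stop at nested subvolume boundaries):

  # VM images live in their own nested subvolume
  btrfs subvolume create /mnt/pool/vms

  # a snapshot of the parent replaces the nested subvolume with an
  # empty directory, so /mnt/pool/vms escapes the snapshot rotation
  btrfs subvolume snapshot /mnt/pool /mnt/pool/snaps/pool-20140321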
The other alternative, of course, given the subvolume isolation
already, is to just stick that data on an entirely different
filesystem: either btrfs with the nodatacow mount option, or arguably
something a bit more traditional and mature such as ext4 or xfs.
NOCOW already turns off many of the features a lot of people are using
btrfs for in the first place (checksumming and compression are
disabled with NOCOW as well, though it turns out they're not so well
suited to VM images anyway), so the loss is smaller than it first
appears. And xfs is actually targeted at large-to-huge-file
use-cases, so multi-gig VM images should be an ideal fit. Of course
you lose the benefits of btrfs doing that, but given its COW nature,
btrfs arguably isn't the ideal solution for such huge
internally-rewritten files in the first place; even when fully mature
it will likely only manage performance /acceptable/ for a
general-purpose filesystem, with xfs or similar still likely being the
better dedicated filesystem for such use-cases. An example of the
dedicated-filesystem variant follows.
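As a sketch of that variant (the UUID and mount point are invented for
the example), an fstab line for a dedicated btrfs with COW disabled
wholesale:

  # /etc/fstab: VM-image filesystem; nodatacow disables COW (and with
  # it checksumming) for the whole filesystem
  UUID=0123abcd-ef01-2345-6789-0123456789ab  /var/lib/vms  btrfs  nodatacow  0 0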
Meanwhile, I think everyone agrees that getting that locking nailed
down to avoid the deadlocks, etc., really must be priority one, at
least now that the huge scaling blocker of snapshot-aware defrag is
(hopefully temporarily) disabled. Blocking for a couple minutes at a
time certainly isn't ideal, but since the triggering jobs such as
snapshot deletion can be rescheduled to otherwise idle time (one way
to space them out is sketched below), that's certainly less critical
than crashes when people accidentally, or in ignorance, queue up too
many snapshot deletions at once!
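For instance, with hypothetical snapshot paths, one deletion per
interval much as Rich is already doing:

  # delete old snapshots one at a time, giving the background cleanup
  # a chance to settle between deletions
  for snap in /mnt/pool/snaps/2014-02-*; do
      btrfs subvolume delete "$snap"
      sleep 600    # ten minutes of idle time between deletions
  done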
---
[1] NOCOW: chattr +C. With btrfs, this must be set while the file is
still zero-size, before it has content. The easiest way to do that is
to create a dedicated directory for these files and set the attribute
on the directory, such that files inherit it at creation.
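A minimal illustration, with a made-up path:

  mkdir /mnt/pool/vms/images
  chattr +C /mnt/pool/vms/images        # NOCOW on the directory...
  lsattr -d /mnt/pool/vms/images        # ...shows the 'C' attribute
  touch /mnt/pool/vms/images/disk0.img  # new files inherit it
  lsattr /mnt/pool/vms/images/disk0.img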
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman