linux-btrfs.vger.kernel.org archive mirror
From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] Btrfs: fix deadlock with nested trans handles
Date: Fri, 21 Mar 2014 05:44:41 +0000 (UTC)	[thread overview]
Message-ID: <pan$51664$6e2e2da0$3ae011f3$4e915a26@cox.net> (raw)
In-Reply-To: CAGfcS_mZ9=gmdxyn0jj_xKFK7XiejCACe84knzoJfC9gkg7CNw@mail.gmail.com

Rich Freeman posted on Thu, 20 Mar 2014 22:13:51 -0400 as excerpted:

> However, I am deleting my snapshots one at a time, at a rate of one
> every 5-30 minutes, and while that is creating surprisingly high disk
> loads on my ssd and hard drives, I don't get any panics.  I figured
> that having only one deletion pending per checkpoint would eliminate
> locking risk.
> 
> I did get some blocked task messages in dmesg, like:
> [105538.121239] INFO: task mysqld:3006 blocked for more than 120
> seconds.

These... are a continuing issue.  The devs are working on it, but...

The people who seem to have it the worst are doing both scripted 
snapshotting and keeping large (gig+) constantly internally-rewritten 
files such as VM images (the most commonly reported case) or databases.  
Properly setting NOCOW on the files[1] helps, but...

* The key thing to realize about snapshotting continually rewritten 
NOCOW files is that the first change to a block after a snapshot by 
definition MUST be COWed anyway, since the file content has diverged 
from that of the snapshot.  Further writes to the same block (until the 
next snapshot) will be rewritten in-place (the NOCOW attribute is 
maintained through that one mandatory COW), but after the next snapshot 
and the following write, BAM!  Gotta COW again!
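A minimal sketch of that effect, for anyone who wants to see it: the 
paths below are hypothetical, and it needs root plus an actual btrfs 
mount, so treat it as illustration only, not something to paste blindly:

```shell
# Assumes a btrfs filesystem mounted at /mnt/pool (hypothetical path).
btrfs subvolume create /mnt/pool/vm
truncate -s 0 /mnt/pool/vm/disk.img   # create the file empty...
chattr +C /mnt/pool/vm/disk.img       # ...so NOCOW can still be set
dd if=/dev/zero of=/mnt/pool/vm/disk.img bs=1M count=256 conv=notrunc
filefrag /mnt/pool/vm/disk.img        # few extents: rewrites stay in place

btrfs subvolume snapshot /mnt/pool/vm /mnt/pool/vm.snap
dd if=/dev/zero of=/mnt/pool/vm/disk.img bs=4K count=1 seek=100 conv=notrunc
filefrag /mnt/pool/vm/disk.img        # the rewritten block was COWed to a
                                      # new location despite the C flag
```

Repeat the snapshot-then-rewrite pair a few hundred times and filefrag's 
extent count shows exactly the fragmentation discussed above.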

So while NOCOW helps, in scenarios such as hourly snapshotting of active 
VM-image data loads its ability to control actual fragmentation is 
unfortunately rather limited.  And it's precisely this fragmentation that 
appears to be the problem! =:^(

It's almost certainly that fragmentation that's triggering your blocked-
for-X-seconds issues.  But the interesting thing here is that the 
reports come even from people with fast SSDs, where seek time and even 
IOPS shouldn't be a huge issue.  In at least some cases, the problem has 
been CPU time, not physical media access.

Which is one reason snapshot-aware defrag was disabled again recently: 
it simply wasn't scaling.  (To answer the obvious question: yes, defrag 
still works; it's only the snapshot-awareness that was disabled.  Defrag 
is back to dumbly ignoring other snapshots and simply defragging the 
working file-extent mapping it is run on, with other snapshots staying 
untouched.)  The whole feature is now being reworked in order to scale 
better.
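For reference, plain (now non-snapshot-aware) defrag is invoked as below; 
the paths are hypothetical, and the -r recursion switch needs a 
reasonably recent btrfs-progs:

```shell
# Defragment a single file, or recurse over a whole directory tree;
# only the subvolume it is run on is touched, snapshots are left alone:
btrfs filesystem defragment /mnt/pool/vm/disk.img
btrfs filesystem defragment -r /mnt/pool/vm
```

Note that with snapshot-awareness gone, defragging a heavily-snapshotted 
file duplicates its shared extents, costing space, so don't run it 
casually over snapshot-covered data.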

But while that change considerably reduces the worst pain point (people 
were seeing little or no defrag/balance/restripe progress in /hours/ if 
they had enough snapshots, and that problem has been bypassed for the 
moment), we're still left with these nasty N-second stalls at times, 
especially when doing anything else involving those snapshots and the 
fragmentation they cover, including deleting them.  Hopefully tweaking 
and eventually optimizing the algorithms can do away with much of this 
problem, but I've a feeling it'll be around to some degree for some 
years.

Meanwhile, for data that fits that known problematic profile, the current 
recommendation is, preferably, to isolate it to a subvolume that has only 
very limited or no snapshotting done.
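One way to do that isolation (paths hypothetical): btrfs snapshots are 
not recursive, so a nested subvolume is simply skipped when its parent 
is snapshotted:

```shell
# Give the VM images their own subvolume under the snapshotted area:
btrfs subvolume create /mnt/pool/home/vm
# Snapshotting home does NOT descend into home/vm; the snapshot just
# contains an empty directory where the nested subvolume sits:
btrfs subvolume snapshot -r /mnt/pool/home /mnt/pool/home.snap
```

That way the scripted snapshotting of everything else can continue 
unchanged, while the problem files never accumulate snapshot-pinned 
extents at all.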

The other alternative, given the subvolume isolation already, is to just 
stick such files on an entirely different filesystem: either btrfs 
mounted with the nodatacow option, or arguably something a bit more 
traditional and mature such as ext4 or xfs.  NOCOW already turns off 
many of the features a lot of people are using btrfs for in the first 
place (checksumming and compression are disabled with NOCOW as well, 
though it turns out they're not so well suited to VM images anyway), so 
there's less to lose than it might seem.  And xfs in particular is 
actually targeted at large-to-huge-file use-cases, so multi-gig VM 
images should be an ideal fit.  Of course you lose the benefits of btrfs 
doing that, but given its COW nature, btrfs arguably isn't the ideal 
solution for such huge internal-rewrite files in the first place.  Even 
when fully mature it will likely only have /acceptable/ performance with 
them, as suitable for a general-purpose filesystem, with xfs or similar 
still likely being the better dedicated filesystem for such use-cases.

Meanwhile, I think everyone agrees that getting that locking down to 
avoid the deadlocks, etc, really must be priority one, at least now that 
the huge scaling blocker of snapshot-aware-defrag is (hopefully 
temporarily) disabled.  Blocking for a couple minutes at a time certainly 
isn't ideal, but since the triggering jobs such as snapshot deletion, 
etc, can be rescheduled to otherwise idle time, that's certainly less 
critical than crashes if people accidentally or in ignorance queue up too 
many snapshot deletions at a time!

---
[1] NOCOW: chattr +C .  With btrfs, this must be done while the file is 
still zero-size, before it has any content.  The easiest way to do that 
is to create a dedicated directory for such files and set the attribute 
on the directory, so that files created there inherit it at creation 
time.
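Concretely, with a hypothetical path, that directory-inheritance trick 
looks like this (again btrfs-only; the C flag is a no-op elsewhere):

```shell
mkdir /srv/vm-images
chattr +C /srv/vm-images            # NOCOW on the (empty) directory
# Files created inside afterwards inherit NOCOW at creation time,
# while the file is still zero-size, which is the only time it works:
truncate -s 20G /srv/vm-images/guest.img
lsattr /srv/vm-images/guest.img     # should show the 'C' flag set
```

For files that already have content, copy them into such a directory 
with a tool that creates a genuinely new file (not a reflink) so the 
attribute takes effect.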

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Thread overview: 11+ messages
2014-03-07  0:01 [PATCH] Btrfs: fix deadlock with nested trans handles Josef Bacik
2014-03-07  0:25 ` Zach Brown
2014-03-12 12:56   ` Rich Freeman
2014-03-12 15:24     ` Josef Bacik
2014-03-12 16:34       ` Rich Freeman
2014-03-14 22:40         ` Rich Freeman
2014-03-15 11:51           ` Duncan
2014-03-21  2:13             ` Rich Freeman
2014-03-21  5:44               ` Duncan [this message]
2014-03-17 14:34           ` Josef Bacik
2014-05-03 20:04 ` Alex Lyakas
