From: Christoph Hellwig <hch@infradead.org>
To: Chris Mason <chris.mason@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Shane Shrybman <shrybman@teksavvy.com>,
linux-btrfs <linux-btrfs@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: Oops while rebalancing, now unmountable.
Date: Mon, 15 Nov 2010 14:03:35 -0500
Message-ID: <20101115190335.GA11374@infradead.org>
In-Reply-To: <1289845457-sup-9432@think>

On Mon, Nov 15, 2010 at 01:46:02PM -0500, Chris Mason wrote:
> For the metadata blocks, btrfs gets into a problematic lock inversion
> where it needs to record that a block has been written so that it will
> be properly recowed when someone tries to change it again.
>
> Basically the rule for btree_writepage:
>
> 1) lock the extent buffer (different from the page)
> 2) mark the metadata block as written
> 3) lock the page
> 4) call writepage
>
> Btrfs does this correctly everywhere it uses writepage, and everyone
> else either uses writepages or is PF_MEMALLOC, except for the page
> migration code, which just jumps to step 4.
>
> So, my current fix adds a migrate page hook and adds a warning into the
> code to make sure we protest loudly when the block isn't marked as
> written. Since this shakedown worked well, I'm changing the warning to
> a BUG().
>
This sounds to me like you shouldn't bother to use ->writepage
for the case that adheres to your locking protocol, but just call into
extent_write_full_page directly.  ->writepage is supposed to be directly
callable from the VM, and not require filesystem-specific calling
conventions.  Calling extent_write_full_page directly and
making btree_writepage take the PF_MEMALLOC path unconditionally should
also fix the page migration corruption, and at the same time
make btree_writepage future-proof.

Btw, magic like what is there currently needs at least a long
descriptive comment.

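To make the split above concrete, here is a rough userspace model, not
kernel code.  Every model_* name is hypothetical, and it assumes (my
reading of the current code) that the PF_MEMALLOC branch of
btree_writepage just redirties the page and returns:

```c
#include <stdbool.h>

/* Userspace model only; every model_* name is hypothetical. */
enum model_result { MODEL_WROTE, MODEL_REDIRTIED, MODEL_NOT_MARKED };

struct model_page {
	bool dirty;           /* page still needs writeback */
	bool marked_written;  /* step 2 of the protocol has been done */
};

/*
 * Stands in for calling extent_write_full_page() directly.  Only callers
 * that followed the btrfs locking protocol are meant to get here, so an
 * unmarked block is reported as a caller bug instead of being written.
 */
static enum model_result model_write_full_page(struct model_page *page)
{
	if (!page->marked_written)
		return MODEL_NOT_MARKED;
	page->dirty = false;
	return MODEL_WROTE;
}

/*
 * Stands in for ->writepage.  Any VM caller (reclaim, page migration)
 * may land here without the protocol, so it never writes; it only
 * redirties the page and leaves the real I/O to the protocol-aware path.
 */
static enum model_result model_btree_writepage(struct model_page *page)
{
	page->dirty = true;
	return MODEL_REDIRTIED;
}
```

The point of the model is only that the VM-visible entry point has no
preconditions, while the path with preconditions is not reachable
through ->writepage at all.
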
> The check for kupdate in btree_writepages is different. Once we write
> something, we have to do a good amount of work in order to modify it
> again. The btrfs log commits make sure that we write metadata from time
> to time, so we don't really need help from the flusher threads unless.
>
> We also don't want to waste time writing metadata from
> balance_dirty_pages. It'll just make more allocations later as we
> wander around and recow things, and it is much more likely to be seeky
> than the file IO. So we setup a threshold where we don't bother doing
> metadata IO unless there is a good amount pending.
>
> I'm fine with removing the metadata writepage entirely, it didn't use to
> have this many rules and it seems like a better idea to have it not
> there at all.

for_kupdate only covers a tiny subset of the flusher threads, as it's
only set for the older_than_this style writeback.  It doesn't cover
regular percentage-based background reclaim, nor other asynchronous
activity from the flusher threads, like wakeup_flusher_threads or the
laptop-mode I/O completion.

At the very least it should check for_kupdate || for_background to cover
all background writeback, which is what the few other uses of
for_kupdate already do, but I suspect you simply want to not mark
the btree inode as hashed in the inode hash and skip background
writeback completely.
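
As a sketch of the minimal variant (the for_kupdate || for_background
check), again a userspace model and not a patch: struct model_wbc only
mirrors the two flag names from struct writeback_control, and the
sync_all field and the threshold parameter are assumptions made for
illustration.

```c
#include <stdbool.h>

/* Userspace stand-in for the relevant bits of struct writeback_control. */
struct model_wbc {
	bool sync_all;        /* assumed: data-integrity writeback, always honoured */
	bool for_kupdate;     /* periodic older_than_this style writeback */
	bool for_background;  /* dirty-threshold driven background writeback */
};

/*
 * Decide whether a btree_writepages-like function should issue metadata
 * I/O.  pending and threshold are in arbitrary units chosen by the caller.
 */
static bool model_should_write_metadata(const struct model_wbc *wbc,
					unsigned long pending,
					unsigned long threshold)
{
	if (wbc->sync_all)
		return true;
	/* Skip all background writeback, not only the kupdate flavour. */
	if (wbc->for_kupdate || wbc->for_background)
		return false;
	/* Other async callers: only bother once enough is pending. */
	return pending >= threshold;
}
```

With only the for_kupdate test, the for_background case would fall
through to the threshold check, which is the gap described above.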