From: Josef Bacik <josef@redhat.com>
To: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: PLEASE TEST: Everybody who is seeing weird and long hangs
Date: Mon, 01 Aug 2011 14:01:35 -0400
Message-ID: <4E36E9FF.9050805@redhat.com>
In-Reply-To: <1312220929-sup-8405@shiny>

On 08/01/2011 01:54 PM, Chris Mason wrote:
> Excerpts from Josef Bacik's message of 2011-08-01 12:03:34 -0400:
>> On 08/01/2011 11:45 AM, Chris Mason wrote:
>>> Excerpts from Josef Bacik's message of 2011-08-01 11:21:34 -0400:
>>>> Hello,
>>>>
>>>> We've seen a lot of reports of people having these constant long pauses
>>>> when doing things like sync or such. The stack traces usually all look
>>>> the same, one is btrfs-transaction stuck in btrfs_wait_marked_extents
>>>> and one is btrfs-submit-# stuck in get_request_wait. I had originally
>>>> thought this was due to the new plugging stuff, but I think it just
>>>> makes the problem happen more quickly as we've seen that 2.6.38 which we
>>>> thought was ok will still have the problem happen if given enough time.
>>>>
>>>> I _think_ this is because of the way we write out metadata in the
>>>> transaction commit phase. We're doing write_on_page for every dirty
>>>> page in the btree during the commit. This sucks because basically we
>>>> end up with one bio per page, which makes us blow out our nr_requests
>>>> constantly, which is why btrfs-submit-# is always stuck in
>>>> get_request_wait. What we need to do instead is use filemap_fdatawrite
>>>> which will do a WB_SYNC_ALL but will do it via writepages, so hopefully
>>>> we will get less bios and this problem will go away. Please try this
>>>> very hastily put together patch if you are experiencing this problem and
>>>> let me know if it fixes it for you. Thanks,
>>>
>>> I'm definitely curious to hear if this helps, but I think it might cause
>>> a different set of problems. It writes everything that is dirty on the
>>> btree, which includes a lot of things we've cow'd in the current
>>> transaction and marked dirty. They will have to go through COW again
>>> if someone wants to modify them again.
>>>
>>
>> But this is happening in the commit after we've done all of our work, we
>> shouldn't be dirtying anything else at this point right?
>
> The commit code is set up to unblock people before we start the IO:
>
> trans->transaction->blocked = 0;
> spin_lock(&root->fs_info->trans_lock);
> root->fs_info->running_transaction = NULL;
> root->fs_info->trans_no_join = 0;
> spin_unlock(&root->fs_info->trans_lock);
> mutex_unlock(&root->fs_info->reloc_mutex);
>
> wake_up(&root->fs_info->transaction_wait);
>
> ret = btrfs_write_and_wait_transaction(trans, root);
>
> So, we should have concurrent FS mods for a new transaction while we are
> writing out this old transaction.
>
Ah right, but that raises another question: we shouldn't cow them again,
since we would have set the new transid. And isn't this kind of bad?
Somebody could come in and dirty a piece of metadata before we have a
chance to write it out for this transaction, so we end up writing out
the new data instead of what we are trying to commit.

The writepages() approach would also get around this problem, since with
WB_SYNC_ALL it now tags all dirty pages as TOWRITE and then writes only
those pages instead of writing all dirty pages. So anything dirtied
after writepages() has started would be fine.

This really could explain why this is sucking for people. We are just
walking through and writing everything that's dirty, and then doing the
same thing again in wait_marked_extents(), so we could be writing out
things that aren't in the transaction that we committed, which would
mean we're writing way more than we need to.
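
To make the tagging argument concrete, here is a rough user-space toy
model, not kernel code. Every name in it (toy_page, toy_tag_for_writeback
and so on) is made up for illustration, and it assumes the sync pass
snapshots the dirty set before it writes anything. It only shows why a
page dirtied after the snapshot is left alone by the tagged pass but
picked up by a "write whatever is dirty" pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8	/* hypothetical size of the toy page cache */

struct toy_page {
	bool dirty;	/* page has been modified */
	bool towrite;	/* page was dirty when the sync pass started */
};

/* Mark a page dirty, as a concurrent modifier would. */
void toy_dirty_page(struct toy_page *pages, size_t idx)
{
	assert(idx < NPAGES);
	pages[idx].dirty = true;
}

/* Step 1 of the tagged pass: snapshot the current dirty set. */
void toy_tag_for_writeback(struct toy_page *pages)
{
	for (size_t i = 0; i < NPAGES; i++)
		pages[i].towrite = pages[i].dirty;
}

/* Step 2: write only the tagged pages; returns how many were written. */
size_t toy_write_tagged(struct toy_page *pages)
{
	size_t written = 0;

	for (size_t i = 0; i < NPAGES; i++) {
		if (!pages[i].towrite)
			continue;
		pages[i].towrite = false;
		pages[i].dirty = false;
		written++;
	}
	return written;
}

/* Untagged pass for comparison: write every page that is dirty now. */
size_t toy_write_all_dirty(struct toy_page *pages)
{
	size_t written = 0;

	for (size_t i = 0; i < NPAGES; i++) {
		if (!pages[i].dirty)
			continue;
		pages[i].dirty = false;
		written++;
	}
	return written;
}
```

If pages 0 and 1 are dirtied, the set is tagged, and page 5 is dirtied
afterwards, toy_write_tagged() writes two pages and leaves page 5 dirty
for the next pass, whereas toy_write_all_dirty() called at that point
would have written all three.
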
>>
>>> The btrfs writepage code does this:
>>>
>>> ret = __extent_writepage(page, wbc, &epd);
>>>
>>> extent_write_cache_pages(tree, mapping, &wbc_writepages,
>>> 			 __extent_writepage, &epd, flush_write_bio);
>>> flush_epd_write_bio(&epd);
>>>
>>
>> Yeah but nr_to_write is 1, so after the __extent_writepage it will be 0
>> and extent_write_cache_pages will just return since there's nothing to
>> write, so we'll still end up with 1 page at a time being written out.
>> Thanks,
>
> We bump nr_to_write to 64:
>
> struct writeback_control wbc_writepages = {
> 	.sync_mode	= wbc->sync_mode,
> 	.older_than_this = NULL,
> 	.nr_to_write	= 64,
> 	.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
> 	.range_end	= (loff_t)-1,
> };
>
> ret = __extent_writepage(page, wbc, &epd);
>
> extent_write_cache_pages(tree, mapping, &wbc_writepages,
> 			 __extent_writepage, &epd, flush_write_bio);
> flush_epd_write_bio(&epd);
>
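
As a back-of-the-envelope check on the nr_to_write point, here is a toy
user-space sketch, again not kernel code and with made-up names
(toy_writepage, toy_count_calls). It assumes each writepage call writes
the requested page and then up to `batch` dirty pages after it, and it
counts how many calls it takes to clean a run of dirty pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Write page idx, then up to `batch` dirty pages that follow it.
 * Returns the number of pages written by this one call.
 */
size_t toy_writepage(bool *dirty, size_t npages, size_t idx, size_t batch)
{
	size_t written;

	assert(dirty != NULL);
	if (idx >= npages || !dirty[idx])
		return 0;

	dirty[idx] = false;
	written = 1;

	for (size_t i = idx + 1; i < npages && batch > 0; i++) {
		if (!dirty[i])
			continue;
		dirty[i] = false;
		written++;
		batch--;
	}
	return written;
}

/* Count the writepage calls needed to clean every dirty page. */
size_t toy_count_calls(bool *dirty, size_t npages, size_t batch)
{
	size_t calls = 0;

	for (size_t i = 0; i < npages; i++)
		if (toy_writepage(dirty, npages, i, batch) > 0)
			calls++;
	return calls;
}
```

With 130 contiguous dirty pages, a batch of 0 (strictly one page per
call) needs 130 calls, while a batch of 64 needs 2, since each call then
covers 65 pages.
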
Oops I missed that, thanks,
Josef

Thread overview: 8+ messages
2011-08-01 15:21 PLEASE TEST: Everybody who is seeing weird and long hangs Josef Bacik
2011-08-01 15:45 ` Chris Mason
2011-08-01 16:03 ` Josef Bacik
2011-08-01 17:54 ` Chris Mason
2011-08-01 18:01 ` Josef Bacik [this message]
2011-08-01 18:21 ` Chris Mason
2011-08-01 23:28 ` cwillu
2011-08-02 0:09 ` Chris Mason