Re: [PATCH] Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Holger Hoffstätte" <holger.hoffstaette@googlemail.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON
Date: Thu, 23 Apr 2015 13:22:23 +0000 (UTC)	[thread overview]
Message-ID: <pan.2015.04.23.13.22.23@googlemail.com> (raw)
In-Reply-To: 5538E6C7.9050201@suse.com

On Thu, 23 Apr 2015 13:34:15 +0100, Filipe Manana wrote:

> On 04/23/2015 01:16 PM, Holger Hoffstätte wrote:
>> On Thu, 23 Apr 2015 11:28:48 +0100, Filipe Manana wrote:
>> 
>>> There's a race between releasing extent buffers that are flagged as stale
>>> and recycling them that makes us it the following BUG_ON at
>>> btrfs_release_extent_buffer_page:
>>>
>>>     BUG_ON(extent_buffer_under_io(eb))
>>>
>>> The BUG_ON is triggered because the extent buffer has the flag
>>> EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
>>> dirty by another concurrent task.
>> 
>> Awesome analysis!
>> 
>>> @@ -4768,6 +4768,25 @@ struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
>>>  			       start >> PAGE_CACHE_SHIFT);
>>>  	if (eb && atomic_inc_not_zero(&eb->refs)) {
>>>  		rcu_read_unlock();
>>> +		/*
>>> +		 * Lock our eb's refs_lock to avoid races with
>>> +		 * free_extent_buffer. When we get our eb it might be flagged
>>> +		 * with EXTENT_BUFFER_STALE and another task running
>>> +		 * free_extent_buffer might have seen that flag set,
>>> +		 * eb->refs == 2, that the buffer isn't under IO (dirty and
>>> +		 * writeback flags not set) and it's still in the tree (flag
>>> +		 * EXTENT_BUFFER_TREE_REF set), therefore being in the process
>>> +		 * of decrementing the extent buffer's reference count twice.
>>> +		 * So here we could race and increment the eb's reference count,
>>> +		 * clear its stale flag, mark it as dirty and drop our reference
>>> +		 * before the other task finishes executing free_extent_buffer,
>>> +		 * which would later result in an attempt to free an extent
>>> +		 * buffer that is dirty.
>>> +		 */
>>> +		if (test_bit(EXTENT_BUFFER_STALE, &eb->bflags)) {
>>> +			spin_lock(&eb->refs_lock);
>>> +			spin_unlock(&eb->refs_lock);
>>> +		}
>>>  		mark_extent_buffer_accessed(eb, NULL);
>>>  		return eb;
>>>  	}
>> 
>> After staring at this (and the Lovecraftian horrors of free_extent_buffer())
>> for over an hour and trying to understand how and why this could even remotely
>> work, I cannot help but think that this fix would shift the race to the much
>> smaller window between the test_bit and the first spin_lock.
>> Essentially you subtly phase-shifted all participants and make them avoid the
>> race most of the time, yet I cannot help but think it's still there (just much
>> smaller), and could strike again with different scheduling intervals.
>> 
>> Would this be accurate?
> 
> Hi Holger,
> 
> Can you explain how the race can still happen?
> 
> The goal here is just to make sure a reader does not advance too fast if
> the eb is stale and there's a concurrent call to free_extent_buffer() in
> progress.

Yes, that much I understood. I look at this change from the perspective of
an optimizing compiler:

- without the new if-block we would fall through and mark_extent_buffer
while it's being deleted, which complains with a BUG.

- the new block therefore checks for the problematic state, but then does
something that - from a work perspective - could be eliminated (lock+unlock),
since nothing seemimgly useful happens inbetween.

-that leaves the if without observable side effect and so could be
eliminated as well.

So theoretically we have not really "coordinated" anything. That made me
suspicious: at the very least I would have expected some kind of loop
or something that protects/reliably delays mark_extent_buffer so that it
really has a completion/atomicity guarantee, not just a small "bump" in
its timeline. You said that a "real fix" would be proper refcounting -
that's sort of what I was expecting, at least in terms of side effects.

Now, I understand that the block is not eliminated, but the if followed
by the lock acquiry is not atomic. That's where - very theoretically -
the same race as before could happen e.g. on a single core and the thread
running free_extent_buffer is scheduled after the if but before the lock.
We'd then get the lock, immediately unlock it and proceed to mark.

I don't know if any of this can actually happen! It's just that I've never
seen a construct like this (and I like lock-free/wait-free coordination),
so this got me curious.

Thanks!
Holger

next prev parent reply	other threads:[~2015-04-23 13:22 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-23 10:28 [PATCH] Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON Filipe Manana
2015-04-23 12:16 ` Holger Hoffstätte
2015-04-23 12:34   ` Filipe Manana
2015-04-23 13:22     ` Holger Hoffstätte [this message]
2015-04-23 13:43       ` Filipe Manana
2015-04-23 13:53         ` Holger Hoffstätte
2015-04-24 16:08           ` Filipe Manana
2015-05-12 13:04   ` David Sterba
2015-05-07 16:19 ` David Sterba
2015-05-07 19:30   ` Chris Mason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pan.2015.04.23.13.22.23@googlemail.com \
    --to=holger.hoffstaette@googlemail.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).