Re: [PATCH] Btrfs: fix leaf corruption after __btrfs_drop_extents

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Liu Bo <bo.li.liu@oracle.com>
To: Filipe David Manana <fdmanana@gmail.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>, levin.front@gmail.com
Subject: Re: [PATCH] Btrfs: fix leaf corruption after __btrfs_drop_extents
Date: Sun, 8 Jun 2014 18:54:00 +0800	[thread overview]
Message-ID: <20140608105359.GB9291@localhost.localdomain> (raw)
In-Reply-To: <CAL3q7H7pYVLzaoLM6Je6dnDoHiACBAA0r_n1DGAWDuQJzJ7n3Q@mail.gmail.com>

On Fri, Jun 06, 2014 at 01:11:17PM +0100, Filipe David Manana wrote:
> On Fri, Jun 6, 2014 at 7:25 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> > Several reports about leaf corruption has been floating on the list, one of them
> > points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted
> > after __btrfs_drop_extents(), it's really a rare case but it does exist.
> >
> > The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents().
> >
> > So in btrfs_next_leaf(), we release the current path to re-search the last key of
> > the leaf for locating next leaf, and we've taken it into account that there might
> > be balance operations between leafs during this 'unlock and re-lock' dance, so
> > we check the path again and advance it if there are now more items available.
> > But things are a bit different if that last key happens to be removed and balance
> > gets a bigger key as the last one, and btrfs_search_slot will return it with
> > ret > 0,
> 
> Hi Liu,
> 
> That makes a lot of sense outside the context of btrfs_drop_extents().
> 
> If I understand you correctly, you're saying that we have a file
> extent item that gets deleted after we release the path in
> btrfs_next_leaf and before btrfs_search_slot finishes for doing a
> lookup for that item.

Hi Filipe,

Not just "finishes", even before btrfs_search_slot "starts".

> 
> But who calls btrfs_drop_extents(), is supposed to have locked the
> file range in the inode's io tree, right? Isn't the goal of locking
> ranges in the io tree to serialize manipulation of file extent items
> within a certain range? Unless there's some caller of _drop_extents()
> that isn't locking the range in the io tree, I'm not seeing how we can
> get the file extent item deleted or updated by another task.

Oh, good question!  I must admit I forget about that locking, just check the log,
so here is the answer -- the locking range and the extent range is not
consistent.

In __btrfs_drop_extents(), there is such a case,

                /*
                 *       | ---- range to drop ----- |
                 *  | -------- extent -------- |
                 */
                if (start > key.offset && end >= extent_end) {

we do actually lock the range, but in this case, the extent is not included in
our range, so this give a chance for others to hack in to do 'confusing-us' things.

I can add these to the commit log if you want, others may also feel confused.

thanks,
-liubo

> 
> Thanks
> 
> > IOW, nothing change in this leaf except the new last key, then we think
> > we're okay because there is no more item balanced in, fine, we thinks we can
> > go to the next leaf.
> >
> > However, we should return that bigger key, otherwise we deserve leaf corruption,
> > for example, in endio, skipping that key means that __btrfs_drop_extents() thinks
> > it has dropped all extent matched the required range and finish_ordered_io can
> > safely insert a new extent, but it actually doesn't and ends up a leaf
> > corruption.
> >
> > This takes the special case into account.
> >
> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > ---
> >  fs/btrfs/ctree.c | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> >
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 1bcfcdb..bbb256c 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -5736,6 +5736,24 @@ again:
> >                 ret = 0;
> >                 goto done;
> >         }
> > +       /*
> > +        * So the above check misses one case:
> > +        * - after releasing the path above, someone has removed the item that
> > +        *   used to be at the very end of the block, and balance between leafs
> > +        *   gets another one with bigger key.offset to replace it.
> > +        *
> > +        * This one should be returned as well, or we can get leaf corruption
> > +        * later(esp. in __btrfs_drop_extents()).
> > +        *
> > +        * And a bit more explanation about this check,
> > +        * with ret > 0, the key isn't found, the path points to the slot
> > +        * where it should be inserted, so the path->slots[0] item must be the
> > +        * bigger one.
> > +        */
> > +       if (nritems > 0 && ret > 0 && path->slots[0] == nritems - 1) {
> > +               ret = 0;
> > +               goto done;
> > +       }
> >
> >         while (level < BTRFS_MAX_LEVEL) {
> >                 if (!path->nodes[level]) {
> > --
> > 1.8.1.4
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."

next prev parent reply	other threads:[~2014-06-08 10:54 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-06-06  6:25 [PATCH] Btrfs: fix leaf corruption after __btrfs_drop_extents Liu Bo
2014-06-06 12:11 ` Filipe David Manana
2014-06-08 10:54   ` Liu Bo [this message]
2014-06-08 11:11     ` Filipe David Manana
2014-06-08 11:16       ` Filipe David Manana
2014-06-06 14:02 ` Holger Hoffstätte
2014-06-08 11:07   ` Liu Bo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140608105359.GB9291@localhost.localdomain \
    --to=bo.li.liu@oracle.com \
    --cc=fdmanana@gmail.com \
    --cc=levin.front@gmail.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).