[PATCH] Btrfs: fix leaf corruption after __btrfs_drop

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] Btrfs: fix leaf corruption after __btrfs_drop_extents
@ 2014-06-06  6:25 Liu Bo
  2014-06-06 12:11 ` Filipe David Manana
  2014-06-06 14:02 ` Holger Hoffstätte
  0 siblings, 2 replies; 7+ messages in thread
From: Liu Bo @ 2014-06-06  6:25 UTC (permalink / raw)
  To: linux-btrfs; +Cc: levin.front

Several reports about leaf corruption has been floating on the list, one of them
points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted
after __btrfs_drop_extents(), it's really a rare case but it does exist.

The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents().

So in btrfs_next_leaf(), we release the current path to re-search the last key of
the leaf for locating next leaf, and we've taken it into account that there might
be balance operations between leafs during this 'unlock and re-lock' dance, so
we check the path again and advance it if there are now more items available.
But things are a bit different if that last key happens to be removed and balance
gets a bigger key as the last one, and btrfs_search_slot will return it with
ret > 0, IOW, nothing change in this leaf except the new last key, then we think
we're okay because there is no more item balanced in, fine, we thinks we can
go to the next leaf.

However, we should return that bigger key, otherwise we deserve leaf corruption,
for example, in endio, skipping that key means that __btrfs_drop_extents() thinks
it has dropped all extent matched the required range and finish_ordered_io can
safely insert a new extent, but it actually doesn't and ends up a leaf
corruption.

This takes the special case into account.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/ctree.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 1bcfcdb..bbb256c 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -5736,6 +5736,24 @@ again:
 		ret = 0;
 		goto done;
 	}
+	/*
+	 * So the above check misses one case:
+	 * - after releasing the path above, someone has removed the item that
+	 *   used to be at the very end of the block, and balance between leafs
+	 *   gets another one with bigger key.offset to replace it.
+	 *
+	 * This one should be returned as well, or we can get leaf corruption
+	 * later(esp. in __btrfs_drop_extents()).
+	 *
+	 * And a bit more explanation about this check,
+	 * with ret > 0, the key isn't found, the path points to the slot
+	 * where it should be inserted, so the path->slots[0] item must be the
+	 * bigger one.
+	 */
+	if (nritems > 0 && ret > 0 && path->slots[0] == nritems - 1) {
+		ret = 0;
+		goto done;
+	}

 	while (level < BTRFS_MAX_LEVEL) {
 		if (!path->nodes[level]) {
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix leaf corruption after __btrfs_drop_extents
  2014-06-06  6:25 [PATCH] Btrfs: fix leaf corruption after __btrfs_drop_extents Liu Bo
@ 2014-06-06 12:11 ` Filipe David Manana
  2014-06-08 10:54   ` Liu Bo
  2014-06-06 14:02 ` Holger Hoffstätte
  1 sibling, 1 reply; 7+ messages in thread
From: Filipe David Manana @ 2014-06-06 12:11 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs, levin.front

On Fri, Jun 6, 2014 at 7:25 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> Several reports about leaf corruption has been floating on the list, one of them
> points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted
> after __btrfs_drop_extents(), it's really a rare case but it does exist.
>
> The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents().
>
> So in btrfs_next_leaf(), we release the current path to re-search the last key of
> the leaf for locating next leaf, and we've taken it into account that there might
> be balance operations between leafs during this 'unlock and re-lock' dance, so
> we check the path again and advance it if there are now more items available.
> But things are a bit different if that last key happens to be removed and balance
> gets a bigger key as the last one, and btrfs_search_slot will return it with
> ret > 0,

Hi Liu,

That makes a lot of sense outside the context of btrfs_drop_extents().

If I understand you correctly, you're saying that we have a file
extent item that gets deleted after we release the path in
btrfs_next_leaf and before btrfs_search_slot finishes for doing a
lookup for that item.

But who calls btrfs_drop_extents(), is supposed to have locked the
file range in the inode's io tree, right? Isn't the goal of locking
ranges in the io tree to serialize manipulation of file extent items
within a certain range? Unless there's some caller of _drop_extents()
that isn't locking the range in the io tree, I'm not seeing how we can
get the file extent item deleted or updated by another task.

Thanks

> IOW, nothing change in this leaf except the new last key, then we think
> we're okay because there is no more item balanced in, fine, we thinks we can
> go to the next leaf.
>
> However, we should return that bigger key, otherwise we deserve leaf corruption,
> for example, in endio, skipping that key means that __btrfs_drop_extents() thinks
> it has dropped all extent matched the required range and finish_ordered_io can
> safely insert a new extent, but it actually doesn't and ends up a leaf
> corruption.
>
> This takes the special case into account.
>
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> ---
>  fs/btrfs/ctree.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 1bcfcdb..bbb256c 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -5736,6 +5736,24 @@ again:
>                 ret = 0;
>                 goto done;
>         }
> +       /*
> +        * So the above check misses one case:
> +        * - after releasing the path above, someone has removed the item that
> +        *   used to be at the very end of the block, and balance between leafs
> +        *   gets another one with bigger key.offset to replace it.
> +        *
> +        * This one should be returned as well, or we can get leaf corruption
> +        * later(esp. in __btrfs_drop_extents()).
> +        *
> +        * And a bit more explanation about this check,
> +        * with ret > 0, the key isn't found, the path points to the slot
> +        * where it should be inserted, so the path->slots[0] item must be the
> +        * bigger one.
> +        */
> +       if (nritems > 0 && ret > 0 && path->slots[0] == nritems - 1) {
> +               ret = 0;
> +               goto done;
> +       }
>
>         while (level < BTRFS_MAX_LEVEL) {
>                 if (!path->nodes[level]) {
> --
> 1.8.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix leaf corruption after __btrfs_drop_extents
  2014-06-06  6:25 [PATCH] Btrfs: fix leaf corruption after __btrfs_drop_extents Liu Bo
  2014-06-06 12:11 ` Filipe David Manana
@ 2014-06-06 14:02 ` Holger Hoffstätte
  2014-06-08 11:07   ` Liu Bo
  1 sibling, 1 reply; 7+ messages in thread
From: Holger Hoffstätte @ 2014-06-06 14:02 UTC (permalink / raw)
  To: linux-btrfs

On Fri, 06 Jun 2014 14:25:58 +0800, Liu Bo wrote:

> Several reports about leaf corruption has been floating on the list, one
> of them points to __btrfs_drop_extents(), and we find that the leaf
> becomes corrupted after __btrfs_drop_extents(), it's really a rare case
> but it does exist.

Out of curiosity ("..what could go wrong?" :) I applied this to 3.14.6, 
rebooted and worked a bit on my btrfs drive - rsyncs, create & delete 
snapshots etc. For the first time in ages I got two kernel messages:

Jun  6 09:20:56 tux kernel: BTRFS error (device sdb1): block group 
351872745472 has wrong amount of free space
Jun  6 09:20:56 tux kernel: BTRFS error (device sdb1): failed to load 
free space cache for block group 351872745472

I remounted without inode_cache & clear_cache, worked some more and so 
far no more messages. Could the messages be related to this patch? Or am 
I just looking at a coincidental/unrelated occurrence? Dropping the 
snapshots correctly freed up several GBs, so I assume that extents were 
dropped..

thanks
Holger

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix leaf corruption after __btrfs_drop_extents
  2014-06-06 12:11 ` Filipe David Manana
@ 2014-06-08 10:54   ` Liu Bo
  2014-06-08 11:11     ` Filipe David Manana
  0 siblings, 1 reply; 7+ messages in thread
From: Liu Bo @ 2014-06-08 10:54 UTC (permalink / raw)
  To: Filipe David Manana; +Cc: linux-btrfs, levin.front

On Fri, Jun 06, 2014 at 01:11:17PM +0100, Filipe David Manana wrote:
> On Fri, Jun 6, 2014 at 7:25 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> > Several reports about leaf corruption has been floating on the list, one of them
> > points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted
> > after __btrfs_drop_extents(), it's really a rare case but it does exist.
> >
> > The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents().
> >
> > So in btrfs_next_leaf(), we release the current path to re-search the last key of
> > the leaf for locating next leaf, and we've taken it into account that there might
> > be balance operations between leafs during this 'unlock and re-lock' dance, so
> > we check the path again and advance it if there are now more items available.
> > But things are a bit different if that last key happens to be removed and balance
> > gets a bigger key as the last one, and btrfs_search_slot will return it with
> > ret > 0,
> 
> Hi Liu,
> 
> That makes a lot of sense outside the context of btrfs_drop_extents().
> 
> If I understand you correctly, you're saying that we have a file
> extent item that gets deleted after we release the path in
> btrfs_next_leaf and before btrfs_search_slot finishes for doing a
> lookup for that item.

Hi Filipe,

Not just "finishes", even before btrfs_search_slot "starts".

> 
> But who calls btrfs_drop_extents(), is supposed to have locked the
> file range in the inode's io tree, right? Isn't the goal of locking
> ranges in the io tree to serialize manipulation of file extent items
> within a certain range? Unless there's some caller of _drop_extents()
> that isn't locking the range in the io tree, I'm not seeing how we can
> get the file extent item deleted or updated by another task.

Oh, good question!  I must admit I forget about that locking, just check the log,
so here is the answer -- the locking range and the extent range is not
consistent.

In __btrfs_drop_extents(), there is such a case,

                /*
                 *       | ---- range to drop ----- |
                 *  | -------- extent -------- |
                 */
                if (start > key.offset && end >= extent_end) {

we do actually lock the range, but in this case, the extent is not included in
our range, so this give a chance for others to hack in to do 'confusing-us' things.

I can add these to the commit log if you want, others may also feel confused.

thanks,
-liubo

> 
> Thanks
> 
> > IOW, nothing change in this leaf except the new last key, then we think
> > we're okay because there is no more item balanced in, fine, we thinks we can
> > go to the next leaf.
> >
> > However, we should return that bigger key, otherwise we deserve leaf corruption,
> > for example, in endio, skipping that key means that __btrfs_drop_extents() thinks
> > it has dropped all extent matched the required range and finish_ordered_io can
> > safely insert a new extent, but it actually doesn't and ends up a leaf
> > corruption.
> >
> > This takes the special case into account.
> >
> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > ---
> >  fs/btrfs/ctree.c | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> >
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 1bcfcdb..bbb256c 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -5736,6 +5736,24 @@ again:
> >                 ret = 0;
> >                 goto done;
> >         }
> > +       /*
> > +        * So the above check misses one case:
> > +        * - after releasing the path above, someone has removed the item that
> > +        *   used to be at the very end of the block, and balance between leafs
> > +        *   gets another one with bigger key.offset to replace it.
> > +        *
> > +        * This one should be returned as well, or we can get leaf corruption
> > +        * later(esp. in __btrfs_drop_extents()).
> > +        *
> > +        * And a bit more explanation about this check,
> > +        * with ret > 0, the key isn't found, the path points to the slot
> > +        * where it should be inserted, so the path->slots[0] item must be the
> > +        * bigger one.
> > +        */
> > +       if (nritems > 0 && ret > 0 && path->slots[0] == nritems - 1) {
> > +               ret = 0;
> > +               goto done;
> > +       }
> >
> >         while (level < BTRFS_MAX_LEVEL) {
> >                 if (!path->nodes[level]) {
> > --
> > 1.8.1.4
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix leaf corruption after __btrfs_drop_extents
  2014-06-06 14:02 ` Holger Hoffstätte
@ 2014-06-08 11:07   ` Liu Bo
  0 siblings, 0 replies; 7+ messages in thread
From: Liu Bo @ 2014-06-08 11:07 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: linux-btrfs

On Fri, Jun 06, 2014 at 02:02:07PM +0000, Holger Hoffstätte wrote:
> On Fri, 06 Jun 2014 14:25:58 +0800, Liu Bo wrote:
> 
> > Several reports about leaf corruption has been floating on the list, one
> > of them points to __btrfs_drop_extents(), and we find that the leaf
> > becomes corrupted after __btrfs_drop_extents(), it's really a rare case
> > but it does exist.
> 
> Out of curiosity ("..what could go wrong?" :) I applied this to 3.14.6, 
> rebooted and worked a bit on my btrfs drive - rsyncs, create & delete 
> snapshots etc. For the first time in ages I got two kernel messages:
> 
> Jun  6 09:20:56 tux kernel: BTRFS error (device sdb1): block group 
> 351872745472 has wrong amount of free space
> Jun  6 09:20:56 tux kernel: BTRFS error (device sdb1): failed to load 
> free space cache for block group 351872745472
> 
> I remounted without inode_cache & clear_cache, worked some more and so 
> far no more messages. Could the messages be related to this patch? Or am 
> I just looking at a coincidental/unrelated occurrence? Dropping the 
> snapshots correctly freed up several GBs, so I assume that extents were 
> dropped..

Well, I think they're unrelated ;)

We saw similar space cache errors from other reports, but it mainly came from a
unclean shutdown or poweroff.

-liubo

> 
> thanks
> Holger
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix leaf corruption after __btrfs_drop_extents
  2014-06-08 10:54   ` Liu Bo
@ 2014-06-08 11:11     ` Filipe David Manana
  2014-06-08 11:16       ` Filipe David Manana
  0 siblings, 1 reply; 7+ messages in thread
From: Filipe David Manana @ 2014-06-08 11:11 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs, levin.front

On Sun, Jun 8, 2014 at 11:54 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Fri, Jun 06, 2014 at 01:11:17PM +0100, Filipe David Manana wrote:
>> On Fri, Jun 6, 2014 at 7:25 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
>> > Several reports about leaf corruption has been floating on the list, one of them
>> > points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted
>> > after __btrfs_drop_extents(), it's really a rare case but it does exist.
>> >
>> > The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents().
>> >
>> > So in btrfs_next_leaf(), we release the current path to re-search the last key of
>> > the leaf for locating next leaf, and we've taken it into account that there might
>> > be balance operations between leafs during this 'unlock and re-lock' dance, so
>> > we check the path again and advance it if there are now more items available.
>> > But things are a bit different if that last key happens to be removed and balance
>> > gets a bigger key as the last one, and btrfs_search_slot will return it with
>> > ret > 0,
>>
>> Hi Liu,
>>
>> That makes a lot of sense outside the context of btrfs_drop_extents().
>>
>> If I understand you correctly, you're saying that we have a file
>> extent item that gets deleted after we release the path in
>> btrfs_next_leaf and before btrfs_search_slot finishes for doing a
>> lookup for that item.
>
> Hi Filipe,
>
> Not just "finishes", even before btrfs_search_slot "starts".

Yes :)

>
>>
>> But who calls btrfs_drop_extents(), is supposed to have locked the
>> file range in the inode's io tree, right? Isn't the goal of locking
>> ranges in the io tree to serialize manipulation of file extent items
>> within a certain range? Unless there's some caller of _drop_extents()
>> that isn't locking the range in the io tree, I'm not seeing how we can
>> get the file extent item deleted or updated by another task.
>
> Oh, good question!  I must admit I forget about that locking, just check the log,
> so here is the answer -- the locking range and the extent range is not
> consistent.
>
> In __btrfs_drop_extents(), there is such a case,
>
>                 /*
>                  *       | ---- range to drop ----- |
>                  *  | -------- extent -------- |
>                  */
>                 if (start > key.offset && end >= extent_end) {
>
> we do actually lock the range, but in this case, the extent is not included in
> our range, so this give a chance for others to hack in to do 'confusing-us' things.

Hum, yes. I thought about the cases where others split our extent
item, but that shouldn't remove the key.

There is one case (I thought about after)  where it's obvious the
issue you found in btrfs_next_leaf can affect btrfs_drop_extents: we
have NO_HOLES set, our first tree seach returns us an extent that ends
before our target range's start (therefore outside the range we locked
in the io tree) and that extent is the last item in a leaf. And then
the problem you described happens between the unlock and re-lock.

>
> I can add these to the commit log if you want, others may also feel confused.

It is fine for me as it is.

Thanks Liu

>
> thanks,
> -liubo
>
>>
>> Thanks
>>
>> > IOW, nothing change in this leaf except the new last key, then we think
>> > we're okay because there is no more item balanced in, fine, we thinks we can
>> > go to the next leaf.
>> >
>> > However, we should return that bigger key, otherwise we deserve leaf corruption,
>> > for example, in endio, skipping that key means that __btrfs_drop_extents() thinks
>> > it has dropped all extent matched the required range and finish_ordered_io can
>> > safely insert a new extent, but it actually doesn't and ends up a leaf
>> > corruption.
>> >
>> > This takes the special case into account.
>> >
>> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
>> > ---
>> >  fs/btrfs/ctree.c | 18 ++++++++++++++++++
>> >  1 file changed, 18 insertions(+)
>> >
>> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>> > index 1bcfcdb..bbb256c 100644
>> > --- a/fs/btrfs/ctree.c
>> > +++ b/fs/btrfs/ctree.c
>> > @@ -5736,6 +5736,24 @@ again:
>> >                 ret = 0;
>> >                 goto done;
>> >         }
>> > +       /*
>> > +        * So the above check misses one case:
>> > +        * - after releasing the path above, someone has removed the item that
>> > +        *   used to be at the very end of the block, and balance between leafs
>> > +        *   gets another one with bigger key.offset to replace it.
>> > +        *
>> > +        * This one should be returned as well, or we can get leaf corruption
>> > +        * later(esp. in __btrfs_drop_extents()).
>> > +        *
>> > +        * And a bit more explanation about this check,
>> > +        * with ret > 0, the key isn't found, the path points to the slot
>> > +        * where it should be inserted, so the path->slots[0] item must be the
>> > +        * bigger one.
>> > +        */
>> > +       if (nritems > 0 && ret > 0 && path->slots[0] == nritems - 1) {
>> > +               ret = 0;
>> > +               goto done;
>> > +       }
>> >
>> >         while (level < BTRFS_MAX_LEVEL) {
>> >                 if (!path->nodes[level]) {
>> > --
>> > 1.8.1.4
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Filipe David Manana,
>>
>> "Reasonable men adapt themselves to the world.
>>  Unreasonable men adapt the world to themselves.
>>  That's why all progress depends on unreasonable men."



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix leaf corruption after __btrfs_drop_extents
  2014-06-08 11:11     ` Filipe David Manana
@ 2014-06-08 11:16       ` Filipe David Manana
  0 siblings, 0 replies; 7+ messages in thread
From: Filipe David Manana @ 2014-06-08 11:16 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs, levin.front

On Sun, Jun 8, 2014 at 12:11 PM, Filipe David Manana <fdmanana@gmail.com> wrote:
> On Sun, Jun 8, 2014 at 11:54 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
>> On Fri, Jun 06, 2014 at 01:11:17PM +0100, Filipe David Manana wrote:
>>> On Fri, Jun 6, 2014 at 7:25 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
>>> > Several reports about leaf corruption has been floating on the list, one of them
>>> > points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted
>>> > after __btrfs_drop_extents(), it's really a rare case but it does exist.
>>> >
>>> > The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents().
>>> >
>>> > So in btrfs_next_leaf(), we release the current path to re-search the last key of
>>> > the leaf for locating next leaf, and we've taken it into account that there might
>>> > be balance operations between leafs during this 'unlock and re-lock' dance, so
>>> > we check the path again and advance it if there are now more items available.
>>> > But things are a bit different if that last key happens to be removed and balance
>>> > gets a bigger key as the last one, and btrfs_search_slot will return it with
>>> > ret > 0,
>>>
>>> Hi Liu,
>>>
>>> That makes a lot of sense outside the context of btrfs_drop_extents().
>>>
>>> If I understand you correctly, you're saying that we have a file
>>> extent item that gets deleted after we release the path in
>>> btrfs_next_leaf and before btrfs_search_slot finishes for doing a
>>> lookup for that item.
>>
>> Hi Filipe,
>>
>> Not just "finishes", even before btrfs_search_slot "starts".
>
> Yes :)
>
>>
>>>
>>> But who calls btrfs_drop_extents(), is supposed to have locked the
>>> file range in the inode's io tree, right? Isn't the goal of locking
>>> ranges in the io tree to serialize manipulation of file extent items
>>> within a certain range? Unless there's some caller of _drop_extents()
>>> that isn't locking the range in the io tree, I'm not seeing how we can
>>> get the file extent item deleted or updated by another task.
>>
>> Oh, good question!  I must admit I forget about that locking, just check the log,
>> so here is the answer -- the locking range and the extent range is not
>> consistent.
>>
>> In __btrfs_drop_extents(), there is such a case,
>>
>>                 /*
>>                  *       | ---- range to drop ----- |
>>                  *  | -------- extent -------- |
>>                  */
>>                 if (start > key.offset && end >= extent_end) {
>>
>> we do actually lock the range, but in this case, the extent is not included in
>> our range, so this give a chance for others to hack in to do 'confusing-us' things.
>
> Hum, yes. I thought about the cases where others split our extent
> item, but that shouldn't remove the key.
>
> There is one case (I thought about after)  where it's obvious the
> issue you found in btrfs_next_leaf can affect btrfs_drop_extents: we
> have NO_HOLES set, our first tree seach returns us an extent that ends
> before our target range's start (therefore outside the range we locked
> in the io tree) and that extent is the last item in a leaf. And then
> the problem you described happens between the unlock and re-lock.
>
>>
>> I can add these to the commit log if you want, others may also feel confused.
>
> It is fine for me as it is.
>
> Thanks Liu
>
>>
>> thanks,
>> -liubo
>>
>>>
>>> Thanks
>>>
>>> > IOW, nothing change in this leaf except the new last key, then we think
>>> > we're okay because there is no more item balanced in, fine, we thinks we can
>>> > go to the next leaf.
>>> >
>>> > However, we should return that bigger key, otherwise we deserve leaf corruption,
>>> > for example, in endio, skipping that key means that __btrfs_drop_extents() thinks
>>> > it has dropped all extent matched the required range and finish_ordered_io can
>>> > safely insert a new extent, but it actually doesn't and ends up a leaf
>>> > corruption.
>>> >
>>> > This takes the special case into account.
>>> >
>>> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>

Reviewed-by: Filipe Manana <fdmanana@gmail.com>

Can affect other users of btrfs_next_leaf too that aren't necessarily
modifying anything in tree (like _drop_extents), makes readers miss
items.

>>> > ---
>>> >  fs/btrfs/ctree.c | 18 ++++++++++++++++++
>>> >  1 file changed, 18 insertions(+)
>>> >
>>> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>>> > index 1bcfcdb..bbb256c 100644
>>> > --- a/fs/btrfs/ctree.c
>>> > +++ b/fs/btrfs/ctree.c
>>> > @@ -5736,6 +5736,24 @@ again:
>>> >                 ret = 0;
>>> >                 goto done;
>>> >         }
>>> > +       /*
>>> > +        * So the above check misses one case:
>>> > +        * - after releasing the path above, someone has removed the item that
>>> > +        *   used to be at the very end of the block, and balance between leafs
>>> > +        *   gets another one with bigger key.offset to replace it.
>>> > +        *
>>> > +        * This one should be returned as well, or we can get leaf corruption
>>> > +        * later(esp. in __btrfs_drop_extents()).
>>> > +        *
>>> > +        * And a bit more explanation about this check,
>>> > +        * with ret > 0, the key isn't found, the path points to the slot
>>> > +        * where it should be inserted, so the path->slots[0] item must be the
>>> > +        * bigger one.
>>> > +        */
>>> > +       if (nritems > 0 && ret > 0 && path->slots[0] == nritems - 1) {
>>> > +               ret = 0;
>>> > +               goto done;
>>> > +       }
>>> >
>>> >         while (level < BTRFS_MAX_LEVEL) {
>>> >                 if (!path->nodes[level]) {
>>> > --
>>> > 1.8.1.4
>>> >
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> > the body of a message to majordomo@vger.kernel.org
>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>> Filipe David Manana,
>>>
>>> "Reasonable men adapt themselves to the world.
>>>  Unreasonable men adapt the world to themselves.
>>>  That's why all progress depends on unreasonable men."
>
>
>
> --
> Filipe David Manana,
>
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-06-08 11:16 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-06-06  6:25 [PATCH] Btrfs: fix leaf corruption after __btrfs_drop_extents Liu Bo
2014-06-06 12:11 ` Filipe David Manana
2014-06-08 10:54   ` Liu Bo
2014-06-08 11:11     ` Filipe David Manana
2014-06-08 11:16       ` Filipe David Manana
2014-06-06 14:02 ` Holger Hoffstätte
2014-06-08 11:07   ` Liu Bo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).