[PATCH] Btrfs: don't panic if orphan item already exists

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] Btrfs: don't panic if orphan item already exists
@ 2011-12-13 17:55 Josef Bacik
  2011-12-13 19:03 ` Phillip Susi
  0 siblings, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2011-12-13 17:55 UTC (permalink / raw)
  To: linux-btrfs

I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in
a loop.  This is because we will add an orphan item, do the truncate, the
truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're
left with an orphan item still in the fs.  Then we come back later to do another
truncate and it blows up because we already have an orphan item.  This is ok so
just fix the BUG_ON() to only BUG() if ret is not EEXIST.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
---
 fs/btrfs/inode.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ae5b354a..76e6c24 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2047,7 +2047,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 	/* insert an orphan item to track this unlinked/truncated file */
 	if (insert >= 1) {
 		ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
-		BUG_ON(ret);
+		BUG_ON(ret && ret != -EEXIST);
 	}

 	/* insert an orphan item to track subvolume contains orphan files */
-- 
1.7.5.2

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-13 17:55 [PATCH] Btrfs: don't panic if orphan item already exists Josef Bacik
@ 2011-12-13 19:03 ` Phillip Susi
  2011-12-13 19:09   ` Josef Bacik
  0 siblings, 1 reply; 17+ messages in thread
From: Phillip Susi @ 2011-12-13 19:03 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On 12/13/2011 12:55 PM, Josef Bacik wrote:
> I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in
> a loop.  This is because we will add an orphan item, do the truncate, the
> truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're
> left with an orphan item still in the fs.  Then we come back later to do another
> truncate and it blows up because we already have an orphan item.  This is ok so
> just fix the BUG_ON() to only BUG() if ret is not EEXIST.  Thanks,

Wouldn't it be better to fix the underlying bug, and remove the orphan 
item when the truncate fails?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-13 19:03 ` Phillip Susi
@ 2011-12-13 19:09   ` Josef Bacik
  2011-12-14  2:07     ` WuBo
  0 siblings, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2011-12-13 19:09 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Josef Bacik, linux-btrfs

On Tue, Dec 13, 2011 at 02:03:14PM -0500, Phillip Susi wrote:
> On 12/13/2011 12:55 PM, Josef Bacik wrote:
> >I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in
> >a loop.  This is because we will add an orphan item, do the truncate, the
> >truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're
> >left with an orphan item still in the fs.  Then we come back later to do another
> >truncate and it blows up because we already have an orphan item.  This is ok so
> >just fix the BUG_ON() to only BUG() if ret is not EEXIST.  Thanks,
> 
> Wouldn't it be better to fix the underlying bug, and remove the
> orphan item when the truncate fails?
> 

No because we still need the thing to be cleaned up.  If the truncate fails we
need to leave the orphan item there so the next time the fs is mounted the inode
is cleaned up, that's not a bug.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-13 19:09   ` Josef Bacik
@ 2011-12-14  2:07     ` WuBo
  2011-12-14  9:46       ` Miao Xie
  0 siblings, 1 reply; 17+ messages in thread
From: WuBo @ 2011-12-14  2:07 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Phillip Susi, linux-btrfs

On 12/14/2011 03:09 AM, Josef Bacik wrote:
> On Tue, Dec 13, 2011 at 02:03:14PM -0500, Phillip Susi wrote:
>> On 12/13/2011 12:55 PM, Josef Bacik wrote:
>>> I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in
>>> a loop.  This is because we will add an orphan item, do the truncate, the
>>> truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're
>>> left with an orphan item still in the fs.  Then we come back later to do another
>>> truncate and it blows up because we already have an orphan item.  This is ok so
>>> just fix the BUG_ON() to only BUG() if ret is not EEXIST.  Thanks,
>>
>> Wouldn't it be better to fix the underlying bug, and remove the
>> orphan item when the truncate fails?
>>
> 
> No because we still need the thing to be cleaned up.  If the truncate fails we
> need to leave the orphan item there so the next time the fs is mounted the inode
> is cleaned up, that's not a bug.  Thanks,
> 
> Josef

Hi, Josef

I'm digging this issue too, actually xfstests 083 also can trigger this BUG_ON
while run loops. and I agreed with Phillip's opinion that we'd better "fix the 
underlying bug". If the btrfs_truncate faild with ENOSPC, we should not even call 
btrfs_orphan_del to clean the memory orphan list so that the next orphan item
insert will be skipped.

But, there is still a trouble. The user will get the fail result while the orphan 
inode still left in the fs. It's strange. So in the end of the btrfs_truncate,
if the btrfs_update_inode is successed, I will delete the orphan inode anyway.

what do you think of this idea? I'll make a patch if you do not have any comment.

BTW, 083 will always make the btrfs_truncate fail with btrfs_truncate_inode_items
for ENOSPC when the disk is almost full.

thanks
wubo

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-14  2:07     ` WuBo
@ 2011-12-14  9:46       ` Miao Xie
  2011-12-14 14:58         ` Josef Bacik
  0 siblings, 1 reply; 17+ messages in thread
From: Miao Xie @ 2011-12-14  9:46 UTC (permalink / raw)
  To: WuBo; +Cc: Josef Bacik, Phillip Susi, linux-btrfs

On 	wed, 14 Dec 2011 10:07:39 +0800, WuBo wrote:
> On 12/14/2011 03:09 AM, Josef Bacik wrote:
>> On Tue, Dec 13, 2011 at 02:03:14PM -0500, Phillip Susi wrote:
>>> On 12/13/2011 12:55 PM, Josef Bacik wrote:
>>>> I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in
>>>> a loop.  This is because we will add an orphan item, do the truncate, the
>>>> truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're
>>>> left with an orphan item still in the fs.  Then we come back later to do another
>>>> truncate and it blows up because we already have an orphan item.  This is ok so
>>>> just fix the BUG_ON() to only BUG() if ret is not EEXIST.  Thanks,
>>>
>>> Wouldn't it be better to fix the underlying bug, and remove the
>>> orphan item when the truncate fails?
>>>
>>
>> No because we still need the thing to be cleaned up.  If the truncate fails we
>> need to leave the orphan item there so the next time the fs is mounted the inode
>> is cleaned up, that's not a bug.  Thanks,
>>
>> Josef
> 
> Hi, Josef
> 
> I'm digging this issue too, actually xfstests 083 also can trigger this BUG_ON
> while run loops. and I agreed with Phillip's opinion that we'd better "fix the 
> underlying bug". If the btrfs_truncate faild with ENOSPC, we should not even call 
> btrfs_orphan_del to clean the memory orphan list so that the next orphan item
> insert will be skipped.
> 
> But, there is still a trouble. The user will get the fail result while the orphan 
> inode still left in the fs. It's strange. So in the end of the btrfs_truncate,
> if the btrfs_update_inode is successed, I will delete the orphan inode anyway.

Another reason for that we should fix the underlying bug:
File0							| i_size
							v
	+-----------------------------------------------+
	|						|
	+-----------------------------------------------+

The user truncated File0, but failed when doing truncation:
File0				| i_size	| real size
				v		v
	+---------------------------------------+
	|					|
	+---------------------------------------+

The user did pre-allocation for File0 (keep size):
File0				| i_size	| pre-allocated extent	|
				v		v			v
	+---------------------------------------+-----------------------+
	|					|			|
	+---------------------------------------+-----------------------+

And then, the user umounted and mount the file system again. Because we left the orphan item
in the file system, btrfs will drop the pre-allocated extent when mounting it. It is not
the expected result for users.

Thanks
Miao

> 
> what do you think of this idea? I'll make a patch if you do not have any comment.
> 
> BTW, 083 will always make the btrfs_truncate fail with btrfs_truncate_inode_items
> for ENOSPC when the disk is almost full.
> 
> thanks
> wubo
> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-14  9:46       ` Miao Xie
@ 2011-12-14 14:58         ` Josef Bacik
  2011-12-14 15:14           ` Phillip Susi
  2011-12-15  1:56           ` WuBo
  0 siblings, 2 replies; 17+ messages in thread
From: Josef Bacik @ 2011-12-14 14:58 UTC (permalink / raw)
  To: Miao Xie; +Cc: WuBo, Josef Bacik, Phillip Susi, linux-btrfs

On Wed, Dec 14, 2011 at 05:46:37PM +0800, Miao Xie wrote:
> On 	wed, 14 Dec 2011 10:07:39 +0800, WuBo wrote:
> > On 12/14/2011 03:09 AM, Josef Bacik wrote:
> >> On Tue, Dec 13, 2011 at 02:03:14PM -0500, Phillip Susi wrote:
> >>> On 12/13/2011 12:55 PM, Josef Bacik wrote:
> >>>> I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in
> >>>> a loop.  This is because we will add an orphan item, do the truncate, the
> >>>> truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're
> >>>> left with an orphan item still in the fs.  Then we come back later to do another
> >>>> truncate and it blows up because we already have an orphan item.  This is ok so
> >>>> just fix the BUG_ON() to only BUG() if ret is not EEXIST.  Thanks,
> >>>
> >>> Wouldn't it be better to fix the underlying bug, and remove the
> >>> orphan item when the truncate fails?
> >>>
> >>
> >> No because we still need the thing to be cleaned up.  If the truncate fails we
> >> need to leave the orphan item there so the next time the fs is mounted the inode
> >> is cleaned up, that's not a bug.  Thanks,
> >>
> >> Josef
> > 
> > Hi, Josef
> > 
> > I'm digging this issue too, actually xfstests 083 also can trigger this BUG_ON
> > while run loops. and I agreed with Phillip's opinion that we'd better "fix the 
> > underlying bug". If the btrfs_truncate faild with ENOSPC, we should not even call 
> > btrfs_orphan_del to clean the memory orphan list so that the next orphan item
> > insert will be skipped.
> > 

There is no "underlying bug", there is a shitty situation, the shitty situation
where we dont have enough room to free up space.  There is no getting around
this, there is no fix for this, we are fucked and there is nothing we can do
about it.  We have to clean it up from the in memory list because otherwise we
will complain when we go to destroy the inode, this is to catch the case that we
_didn't_ do the cleanup properly, so that has to stay.

There is nothing wrong with leaving the orphan item there.  The next time we
mount we will truncate to the i_size, so if later on down the road we were able
to free up space and we could write to the inode again then i_size will grow and
everything will be a-ok.  There is little dange in leaving the orphan item
there.  The only danger is as Miao describes below...

> > But, there is still a trouble. The user will get the fail result while the orphan 
> > inode still left in the fs. It's strange. So in the end of the btrfs_truncate,
> > if the btrfs_update_inode is successed, I will delete the orphan inode anyway.
> 
> Another reason for that we should fix the underlying bug:
> File0							| i_size
> 							v
> 	+-----------------------------------------------+
> 	|						|
> 	+-----------------------------------------------+
> 
> The user truncated File0, but failed when doing truncation:
> File0				| i_size	| real size
> 				v		v
> 	+---------------------------------------+
> 	|					|
> 	+---------------------------------------+
> 
> The user did pre-allocation for File0 (keep size):
> File0				| i_size	| pre-allocated extent	|
> 				v		v			v
> 	+---------------------------------------+-----------------------+
> 	|					|			|
> 	+---------------------------------------+-----------------------+
> 
> And then, the user umounted and mount the file system again. Because we left the orphan item
> in the file system, btrfs will drop the pre-allocated extent when mounting it. It is not
> the expected result for users.
> 

This is the only case where we will truncate space that we think should be
there.  I'm ok with it, because there is no actual data here, we're just
trimming pre-allocated space.

The only other option is to keep track of inodes that have an orphan item and if
we try to do anything to the inode (fallocate, write etc) we try and delete the
orphan item first.  That is a lot of call paths to have

if (!list_empty(BTRFS_I(inode)->i_orphan))
	delete_orphan_item()

not to mention the case that the inode has left cache and we've re-read it back
in, so everytime we read back in an inode we'll have to search for its
corresponding orphan item to see if we didn't clean it up at some point in the
past, which is going to mean an extra search on every inode lookup which is
going to suck, just to deal with the case that somebody used fallocate with
FALLOC_FL_KEEP_SIZE.

So my patch is the lesser of two evils, either we leave the damn thing in there
and deal with the remote chance that somebody preallocated space on the end of
the file after failing to do a truncate, or we put a bunch of performance
sucking checks and an extra btree search on every inode lookup to deal with this
remote chance.  My choice is my patch.

Josef "really needs a vacation" Bacik

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-14 14:58         ` Josef Bacik
@ 2011-12-14 15:14           ` Phillip Susi
  2011-12-14 15:27             ` Josef Bacik
  2011-12-14 15:34             ` Josef Bacik
  2011-12-15  1:56           ` WuBo
  1 sibling, 2 replies; 17+ messages in thread
From: Phillip Susi @ 2011-12-14 15:14 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Miao Xie, WuBo, linux-btrfs

On 12/14/2011 9:58 AM, Josef Bacik wrote:
> There is no "underlying bug", there is a shitty situation, the shitty situation

Maybe my assumptions are wrong somewhere then.  You add the orphan item 
to make sure that the truncate will be finalized even if the system 
crashes before the transaction commits right?  So if truncate() fails 
with -ENOSPC, then you shouldn't be trying to finalize the truncate on 
the next mount, should you ( because the call did not succeed )?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-14 15:14           ` Phillip Susi
@ 2011-12-14 15:27             ` Josef Bacik
  2011-12-14 15:41               ` Phillip Susi
  2011-12-14 15:34             ` Josef Bacik
  1 sibling, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2011-12-14 15:27 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Josef Bacik, Miao Xie, WuBo, linux-btrfs

On Wed, Dec 14, 2011 at 10:14:13AM -0500, Phillip Susi wrote:
> On 12/14/2011 9:58 AM, Josef Bacik wrote:
> >There is no "underlying bug", there is a shitty situation, the shitty situation
> 
> Maybe my assumptions are wrong somewhere then.  You add the orphan
> item to make sure that the truncate will be finalized even if the
> system crashes before the transaction commits right?  So if
> truncate() fails with -ENOSPC, then you shouldn't be trying to
> finalize the truncate on the next mount, should you ( because the
> call did not succeed )?
> 

Except consider the case that the program was written intelligently and checks
for errors on truncate.  So he writes 100G, truncates to 50M, and the truncate
fails and he closes the file and exits.  Then somewhere down the road the inode
is evicted from cache and we reboot the box.  Next time the box comes up it only
looks like a 50M file, except we're still taking up 100G of disk space, and we
have no idea there's space there and it's still taken up in the allocator so it
will just look like we've lost ~100G of space.  This is why it's left there, so
everything can be cleaned up.

Course now that I've had the chance to think and calm down a little bit there
may be another option.  We could probably keep track of how much we've deleted
in btrfs_truncate_inode_items, and then if we fail to complete the truncate we
just set the i_size to whatever it is when we stopped truncating, update the
inode and remove the orphan item.  I'll look into this and see how tricky it is
to do.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-14 15:14           ` Phillip Susi
  2011-12-14 15:27             ` Josef Bacik
@ 2011-12-14 15:34             ` Josef Bacik
  2011-12-14 15:35               ` Josef Bacik
                                 ` (2 more replies)
  1 sibling, 3 replies; 17+ messages in thread
From: Josef Bacik @ 2011-12-14 15:34 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Josef Bacik, Miao Xie, WuBo, linux-btrfs

On Wed, Dec 14, 2011 at 10:14:13AM -0500, Phillip Susi wrote:
> On 12/14/2011 9:58 AM, Josef Bacik wrote:
> >There is no "underlying bug", there is a shitty situation, the shitty situation
> 
> Maybe my assumptions are wrong somewhere then.  You add the orphan
> item to make sure that the truncate will be finalized even if the
> system crashes before the transaction commits right?  So if
> truncate() fails with -ENOSPC, then you shouldn't be trying to
> finalize the truncate on the next mount, should you ( because the
> call did not succeed )?
> 

Yes because otherwise we'll leak space since the i_size has been updated
already.  The other option is to make btrfs_truncate_inode_items update i_size
as we truncate so if it fails we can delete the orphan item and then update the
inode with the new i_size, that way we don't leave the orphan item on disk and
we don't leak space.  I'll see how doable this is.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-14 15:34             ` Josef Bacik
@ 2011-12-14 15:35               ` Josef Bacik
  2011-12-14 16:45               ` Chris Mason
  2011-12-15  1:42               ` Miao Xie
  2 siblings, 0 replies; 17+ messages in thread
From: Josef Bacik @ 2011-12-14 15:35 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Phillip Susi, Miao Xie, WuBo, linux-btrfs

On Wed, Dec 14, 2011 at 10:34:45AM -0500, Josef Bacik wrote:
> On Wed, Dec 14, 2011 at 10:14:13AM -0500, Phillip Susi wrote:
> > On 12/14/2011 9:58 AM, Josef Bacik wrote:
> > >There is no "underlying bug", there is a shitty situation, the shitty situation
> > 
> > Maybe my assumptions are wrong somewhere then.  You add the orphan
> > item to make sure that the truncate will be finalized even if the
> > system crashes before the transaction commits right?  So if
> > truncate() fails with -ENOSPC, then you shouldn't be trying to
> > finalize the truncate on the next mount, should you ( because the
> > call did not succeed )?
> > 
> 
> Yes because otherwise we'll leak space since the i_size has been updated
> already.  The other option is to make btrfs_truncate_inode_items update i_size
> as we truncate so if it fails we can delete the orphan item and then update the
> inode with the new i_size, that way we don't leave the orphan item on disk and
> we don't leak space.  I'll see how doable this is.  Thanks,
> 

Sorry for the double reply, my email is being wonky and it didn't look like my
original response went out.

Josef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-14 15:27             ` Josef Bacik
@ 2011-12-14 15:41               ` Phillip Susi
  2011-12-14 15:46                 ` Josef Bacik
  0 siblings, 1 reply; 17+ messages in thread
From: Phillip Susi @ 2011-12-14 15:41 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Miao Xie, WuBo, linux-btrfs

On 12/14/2011 10:27 AM, Josef Bacik wrote:
> Except consider the case that the program was written intelligently and checks
> for errors on truncate.  So he writes 100G, truncates to 50M, and the truncate
> fails and he closes the file and exits.  Then somewhere down the road the inode
> is evicted from cache and we reboot the box.  Next time the box comes up it only
> looks like a 50M file, except we're still taking up 100G of disk space, and we
> have no idea there's space there and it's still taken up in the allocator so it
> will just look like we've lost ~100G of space.  This is why it's left there, so
> everything can be cleaned up.

I'm a little confused here.  Is there a commit somewhere in there?  How 
can the 100g allocation be committed, but not the i_size of the inode? 
Shouldn't either both or neither be committed?  If both are committed, 
and then the truncate fails, then I would expect the system to come back 
up after a crash with the file still at 100g.  That is, as long as the 
orphan item is not left in place after the failed truncate.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-14 15:41               ` Phillip Susi
@ 2011-12-14 15:46                 ` Josef Bacik
  2011-12-14 19:59                   ` Phillip Susi
  0 siblings, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2011-12-14 15:46 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Josef Bacik, Miao Xie, WuBo, linux-btrfs

On Wed, Dec 14, 2011 at 10:41:05AM -0500, Phillip Susi wrote:
> On 12/14/2011 10:27 AM, Josef Bacik wrote:
> >Except consider the case that the program was written intelligently and checks
> >for errors on truncate.  So he writes 100G, truncates to 50M, and the truncate
> >fails and he closes the file and exits.  Then somewhere down the road the inode
> >is evicted from cache and we reboot the box.  Next time the box comes up it only
> >looks like a 50M file, except we're still taking up 100G of disk space, and we
> >have no idea there's space there and it's still taken up in the allocator so it
> >will just look like we've lost ~100G of space.  This is why it's left there, so
> >everything can be cleaned up.
> 
> I'm a little confused here.  Is there a commit somewhere in there?
> How can the 100g allocation be committed, but not the i_size of the
> inode? Shouldn't either both or neither be committed?  If both are
> committed, and then the truncate fails, then I would expect the
> system to come back up after a crash with the file still at 100g.
> That is, as long as the orphan item is not left in place after the
> failed truncate.
> 

100g allocation succeeds

unmount

mount
truncate to 50m
i_size is set to 50m
truncate fails
orphan item left

unmount

mount

file looks like its only 50m but still has 100g of extents taking up space
orphan cleanup happens and the inode is truncated and the extra space is cleaned
up

Josef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-14 15:34             ` Josef Bacik
  2011-12-14 15:35               ` Josef Bacik
@ 2011-12-14 16:45               ` Chris Mason
  2011-12-14 16:47                 ` Josef Bacik
  2011-12-15  1:42               ` Miao Xie
  2 siblings, 1 reply; 17+ messages in thread
From: Chris Mason @ 2011-12-14 16:45 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Phillip Susi, Miao Xie, WuBo, linux-btrfs

On Wed, Dec 14, 2011 at 10:34:45AM -0500, Josef Bacik wrote:
> On Wed, Dec 14, 2011 at 10:14:13AM -0500, Phillip Susi wrote:
> > On 12/14/2011 9:58 AM, Josef Bacik wrote:
> > >There is no "underlying bug", there is a shitty situation, the shitty situation
> > 
> > Maybe my assumptions are wrong somewhere then.  You add the orphan
> > item to make sure that the truncate will be finalized even if the
> > system crashes before the transaction commits right?  So if
> > truncate() fails with -ENOSPC, then you shouldn't be trying to
> > finalize the truncate on the next mount, should you ( because the
> > call did not succeed )?
> > 
> 
> Yes because otherwise we'll leak space since the i_size has been updated
> already.  The other option is to make btrfs_truncate_inode_items update i_size
> as we truncate so if it fails we can delete the orphan item and then update the
> inode with the new i_size, that way we don't leave the orphan item on disk and
> we don't leak space.  I'll see how doable this is.  Thanks,

If we fail with enospc though we're very likely to not be able to update
the inode item.  It may work, but the failure case will still be there
where we can't make i_size match the file contents.

-chris

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-14 16:45               ` Chris Mason
@ 2011-12-14 16:47                 ` Josef Bacik
  0 siblings, 0 replies; 17+ messages in thread
From: Josef Bacik @ 2011-12-14 16:47 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, Phillip Susi, Miao Xie, WuBo,
	linux-btrfs

On Wed, Dec 14, 2011 at 11:45:24AM -0500, Chris Mason wrote:
> On Wed, Dec 14, 2011 at 10:34:45AM -0500, Josef Bacik wrote:
> > On Wed, Dec 14, 2011 at 10:14:13AM -0500, Phillip Susi wrote:
> > > On 12/14/2011 9:58 AM, Josef Bacik wrote:
> > > >There is no "underlying bug", there is a shitty situation, the shitty situation
> > > 
> > > Maybe my assumptions are wrong somewhere then.  You add the orphan
> > > item to make sure that the truncate will be finalized even if the
> > > system crashes before the transaction commits right?  So if
> > > truncate() fails with -ENOSPC, then you shouldn't be trying to
> > > finalize the truncate on the next mount, should you ( because the
> > > call did not succeed )?
> > > 
> > 
> > Yes because otherwise we'll leak space since the i_size has been updated
> > already.  The other option is to make btrfs_truncate_inode_items update i_size
> > as we truncate so if it fails we can delete the orphan item and then update the
> > inode with the new i_size, that way we don't leave the orphan item on disk and
> > we don't leak space.  I'll see how doable this is.  Thanks,
> 
> If we fail with enospc though we're very likely to not be able to update
> the inode item.  It may work, but the failure case will still be there
> where we can't make i_size match the file contents.
> 

Yeah I was thinking we just grab a reservation to update the inode just in case
early on, so if that fails we just update the in memory i_size and we're good to
go, and then if the transaction fails during the actual truncate we can still do
a btrfs_join_transaction() and use our saved reservation.  Course we're still
screwed if our failure was ENOMEM...

Josef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-14 15:46                 ` Josef Bacik
@ 2011-12-14 19:59                   ` Phillip Susi
  0 siblings, 0 replies; 17+ messages in thread
From: Phillip Susi @ 2011-12-14 19:59 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Miao Xie, WuBo, linux-btrfs

On 12/14/2011 10:46 AM, Josef Bacik wrote:
> file looks like its only 50m but still has 100g of extents taking up space
> orphan cleanup happens and the inode is truncated and the extra space is cleaned
> up

Yes, but isn't the only reason that the i_size change actually hit the 
disk is because of the orphan item?  So with no orphan item, the i_size 
remains at 100g.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-14 15:34             ` Josef Bacik
  2011-12-14 15:35               ` Josef Bacik
  2011-12-14 16:45               ` Chris Mason
@ 2011-12-15  1:42               ` Miao Xie
  2 siblings, 0 replies; 17+ messages in thread
From: Miao Xie @ 2011-12-15  1:42 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Phillip Susi, WuBo, linux-btrfs

On 	wed, 14 Dec 2011 10:34:45 -0500, Josef Bacik wrote:
> On Wed, Dec 14, 2011 at 10:14:13AM -0500, Phillip Susi wrote:
>> On 12/14/2011 9:58 AM, Josef Bacik wrote:
>>> There is no "underlying bug", there is a shitty situation, the shitty situation
>>
>> Maybe my assumptions are wrong somewhere then.  You add the orphan
>> item to make sure that the truncate will be finalized even if the
>> system crashes before the transaction commits right?  So if
>> truncate() fails with -ENOSPC, then you shouldn't be trying to
>> finalize the truncate on the next mount, should you ( because the
>> call did not succeed )?
>>
> 
> Yes because otherwise we'll leak space since the i_size has been updated
> already.  The other option is to make btrfs_truncate_inode_items update i_size
> as we truncate so if it fails we can delete the orphan item and then update the
> inode with the new i_size, that way we don't leave the orphan item on disk and
> we don't leak space.  I'll see how doable this is.  Thanks,

Agree.
I have made a patch based on this idea, which is under test.

Thanks
Miao

> 
> Josef
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: don't panic if orphan item already exists
  2011-12-14 14:58         ` Josef Bacik
  2011-12-14 15:14           ` Phillip Susi
@ 2011-12-15  1:56           ` WuBo
  1 sibling, 0 replies; 17+ messages in thread
From: WuBo @ 2011-12-15  1:56 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Miao Xie, Phillip Susi, linux-btrfs



On 12/14/2011 10:58 PM, Josef Bacik wrote:
> On Wed, Dec 14, 2011 at 05:46:37PM +0800, Miao Xie wrote:
>> On 	wed, 14 Dec 2011 10:07:39 +0800, WuBo wrote:
>>> On 12/14/2011 03:09 AM, Josef Bacik wrote:
>>>> On Tue, Dec 13, 2011 at 02:03:14PM -0500, Phillip Susi wrote:
>>>>> On 12/13/2011 12:55 PM, Josef Bacik wrote:
>>>>>> I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in
>>>>>> a loop.  This is because we will add an orphan item, do the truncate, the
>>>>>> truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're
>>>>>> left with an orphan item still in the fs.  Then we come back later to do another
>>>>>> truncate and it blows up because we already have an orphan item.  This is ok so
>>>>>> just fix the BUG_ON() to only BUG() if ret is not EEXIST.  Thanks,
>>>>>
>>>>> Wouldn't it be better to fix the underlying bug, and remove the
>>>>> orphan item when the truncate fails?
>>>>>
>>>>
>>>> No because we still need the thing to be cleaned up.  If the truncate fails we
>>>> need to leave the orphan item there so the next time the fs is mounted the inode
>>>> is cleaned up, that's not a bug.  Thanks,
>>>>
>>>> Josef
>>>
>>> Hi, Josef
>>>
>>> I'm digging this issue too, actually xfstests 083 also can trigger this BUG_ON
>>> while run loops. and I agreed with Phillip's opinion that we'd better "fix the 
>>> underlying bug". If the btrfs_truncate faild with ENOSPC, we should not even call 
>>> btrfs_orphan_del to clean the memory orphan list so that the next orphan item
>>> insert will be skipped.
>>>
> 
> There is no "underlying bug", there is a shitty situation, the shitty situation
> where we dont have enough room to free up space.  There is no getting around
> this, there is no fix for this, we are fucked and there is nothing we can do
> about it.  We have to clean it up from the in memory list because otherwise we
> will complain when we go to destroy the inode, this is to catch the case that we
> _didn't_ do the cleanup properly, so that has to stay.
Oh I made a mistake, yes the orphan inode in memory list should be deleted.
thanks for your explain.

> 
> There is nothing wrong with leaving the orphan item there.  The next time we
> mount we will truncate to the i_size, so if later on down the road we were able
> to free up space and we could write to the inode again then i_size will grow and
> everything will be a-ok.  There is little dange in leaving the orphan item
> there.  The only danger is as Miao describes below...
If we left the orphan inode, I also worried another situation:
The disk is almost full, and there are a lot of inode truncate failing, 
and then left a lot of orphan inodes in fs. In the next mount, btrfs should clean
all of these orphan inodes even should do truncate again, which also need reserve space.
but the disk is full and mount will be failed.
So I think we'd better remove the orphan inode at the end of truncate. "update the
new i_size and remove the orphan item" is a good idea.

thanks
wubo

> 
>>> But, there is still a trouble. The user will get the fail result while the orphan 
>>> inode still left in the fs. It's strange. So in the end of the btrfs_truncate,
>>> if the btrfs_update_inode is successed, I will delete the orphan inode anyway.
>>
>> Another reason for that we should fix the underlying bug:
>> File0							| i_size
>> 							v
>> 	+-----------------------------------------------+
>> 	|						|
>> 	+-----------------------------------------------+
>>
>> The user truncated File0, but failed when doing truncation:
>> File0				| i_size	| real size
>> 				v		v
>> 	+---------------------------------------+
>> 	|					|
>> 	+---------------------------------------+
>>
>> The user did pre-allocation for File0 (keep size):
>> File0				| i_size	| pre-allocated extent	|
>> 				v		v			v
>> 	+---------------------------------------+-----------------------+
>> 	|					|			|
>> 	+---------------------------------------+-----------------------+
>>
>> And then, the user umounted and mount the file system again. Because we left the orphan item
>> in the file system, btrfs will drop the pre-allocated extent when mounting it. It is not
>> the expected result for users.
>>
> 
> This is the only case where we will truncate space that we think should be
> there.  I'm ok with it, because there is no actual data here, we're just
> trimming pre-allocated space.
> 
> The only other option is to keep track of inodes that have an orphan item and if
> we try to do anything to the inode (fallocate, write etc) we try and delete the
> orphan item first.  That is a lot of call paths to have
> 
> if (!list_empty(BTRFS_I(inode)->i_orphan))
> 	delete_orphan_item()
> 
> not to mention the case that the inode has left cache and we've re-read it back
> in, so everytime we read back in an inode we'll have to search for its
> corresponding orphan item to see if we didn't clean it up at some point in the
> past, which is going to mean an extra search on every inode lookup which is
> going to suck, just to deal with the case that somebody used fallocate with
> FALLOC_FL_KEEP_SIZE.
> 
> So my patch is the lesser of two evils, either we leave the damn thing in there
> and deal with the remote chance that somebody preallocated space on the end of
> the file after failing to do a truncate, or we put a bunch of performance
> sucking checks and an extra btree search on every inode lookup to deal with this
> remote chance.  My choice is my patch.
> 
> Josef "really needs a vacation" Bacik
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2011-12-15  1:56 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-13 17:55 [PATCH] Btrfs: don't panic if orphan item already exists Josef Bacik
2011-12-13 19:03 ` Phillip Susi
2011-12-13 19:09   ` Josef Bacik
2011-12-14  2:07     ` WuBo
2011-12-14  9:46       ` Miao Xie
2011-12-14 14:58         ` Josef Bacik
2011-12-14 15:14           ` Phillip Susi
2011-12-14 15:27             ` Josef Bacik
2011-12-14 15:41               ` Phillip Susi
2011-12-14 15:46                 ` Josef Bacik
2011-12-14 19:59                   ` Phillip Susi
2011-12-14 15:34             ` Josef Bacik
2011-12-14 15:35               ` Josef Bacik
2011-12-14 16:45               ` Chris Mason
2011-12-14 16:47                 ` Josef Bacik
2011-12-15  1:42               ` Miao Xie
2011-12-15  1:56           ` WuBo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).