From: Josef Bacik <josef@redhat.com>
To: Miao Xie <miaox@cn.fujitsu.com>
Cc: WuBo <wu.bo@cn.fujitsu.com>, Josef Bacik <josef@redhat.com>,
Phillip Susi <psusi@cfl.rr.com>,
linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] Btrfs: don't panic if orphan item already exists
Date: Wed, 14 Dec 2011 09:58:44 -0500 [thread overview]
Message-ID: <20111214145843.GA1925@localhost.localdomain> (raw)
In-Reply-To: <4EE8707D.7080504@cn.fujitsu.com>
On Wed, Dec 14, 2011 at 05:46:37PM +0800, Miao Xie wrote:
> On wed, 14 Dec 2011 10:07:39 +0800, WuBo wrote:
> > On 12/14/2011 03:09 AM, Josef Bacik wrote:
> >> On Tue, Dec 13, 2011 at 02:03:14PM -0500, Phillip Susi wrote:
> >>> On 12/13/2011 12:55 PM, Josef Bacik wrote:
> >>>> I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in
> >>>> a loop. This is because we will add an orphan item, do the truncate, the
> >>>> truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're
> >>>> left with an orphan item still in the fs. Then we come back later to do another
> >>>> truncate and it blows up because we already have an orphan item. This is ok so
> >>>> just fix the BUG_ON() to only BUG() if ret is not EEXIST. Thanks,
> >>>
> >>> Wouldn't it be better to fix the underlying bug, and remove the
> >>> orphan item when the truncate fails?
> >>>
> >>
> >> No because we still need the thing to be cleaned up. If the truncate fails we
> >> need to leave the orphan item there so the next time the fs is mounted the inode
> >> is cleaned up, that's not a bug. Thanks,
> >>
> >> Josef
> >
> > Hi, Josef
> >
> > I'm digging this issue too, actually xfstests 083 also can trigger this BUG_ON
> > while run loops. and I agreed with Phillip's opinion that we'd better "fix the
> > underlying bug". If the btrfs_truncate faild with ENOSPC, we should not even call
> > btrfs_orphan_del to clean the memory orphan list so that the next orphan item
> > insert will be skipped.
> >
There is no "underlying bug", there is a shitty situation, the shitty situation
where we dont have enough room to free up space. There is no getting around
this, there is no fix for this, we are fucked and there is nothing we can do
about it. We have to clean it up from the in memory list because otherwise we
will complain when we go to destroy the inode, this is to catch the case that we
_didn't_ do the cleanup properly, so that has to stay.
There is nothing wrong with leaving the orphan item there. The next time we
mount we will truncate to the i_size, so if later on down the road we were able
to free up space and we could write to the inode again then i_size will grow and
everything will be a-ok. There is little dange in leaving the orphan item
there. The only danger is as Miao describes below...
> > But, there is still a trouble. The user will get the fail result while the orphan
> > inode still left in the fs. It's strange. So in the end of the btrfs_truncate,
> > if the btrfs_update_inode is successed, I will delete the orphan inode anyway.
>
> Another reason for that we should fix the underlying bug:
> File0 | i_size
> v
> +-----------------------------------------------+
> | |
> +-----------------------------------------------+
>
> The user truncated File0, but failed when doing truncation:
> File0 | i_size | real size
> v v
> +---------------------------------------+
> | |
> +---------------------------------------+
>
> The user did pre-allocation for File0 (keep size):
> File0 | i_size | pre-allocated extent |
> v v v
> +---------------------------------------+-----------------------+
> | | |
> +---------------------------------------+-----------------------+
>
> And then, the user umounted and mount the file system again. Because we left the orphan item
> in the file system, btrfs will drop the pre-allocated extent when mounting it. It is not
> the expected result for users.
>
This is the only case where we will truncate space that we think should be
there. I'm ok with it, because there is no actual data here, we're just
trimming pre-allocated space.
The only other option is to keep track of inodes that have an orphan item and if
we try to do anything to the inode (fallocate, write etc) we try and delete the
orphan item first. That is a lot of call paths to have
if (!list_empty(BTRFS_I(inode)->i_orphan))
delete_orphan_item()
not to mention the case that the inode has left cache and we've re-read it back
in, so everytime we read back in an inode we'll have to search for its
corresponding orphan item to see if we didn't clean it up at some point in the
past, which is going to mean an extra search on every inode lookup which is
going to suck, just to deal with the case that somebody used fallocate with
FALLOC_FL_KEEP_SIZE.
So my patch is the lesser of two evils, either we leave the damn thing in there
and deal with the remote chance that somebody preallocated space on the end of
the file after failing to do a truncate, or we put a bunch of performance
sucking checks and an extra btree search on every inode lookup to deal with this
remote chance. My choice is my patch.
Josef "really needs a vacation" Bacik
next prev parent reply other threads:[~2011-12-14 14:58 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-12-13 17:55 [PATCH] Btrfs: don't panic if orphan item already exists Josef Bacik
2011-12-13 19:03 ` Phillip Susi
2011-12-13 19:09 ` Josef Bacik
2011-12-14 2:07 ` WuBo
2011-12-14 9:46 ` Miao Xie
2011-12-14 14:58 ` Josef Bacik [this message]
2011-12-14 15:14 ` Phillip Susi
2011-12-14 15:27 ` Josef Bacik
2011-12-14 15:41 ` Phillip Susi
2011-12-14 15:46 ` Josef Bacik
2011-12-14 19:59 ` Phillip Susi
2011-12-14 15:34 ` Josef Bacik
2011-12-14 15:35 ` Josef Bacik
2011-12-14 16:45 ` Chris Mason
2011-12-14 16:47 ` Josef Bacik
2011-12-15 1:42 ` Miao Xie
2011-12-15 1:56 ` WuBo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20111214145843.GA1925@localhost.localdomain \
--to=josef@redhat.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=miaox@cn.fujitsu.com \
--cc=psusi@cfl.rr.com \
--cc=wu.bo@cn.fujitsu.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).