Re: mkfs.xfs pagefault when removed storage during operation

From: Eric Sandeen <sandeen@sandeen.net>
To: Ajeet Yadav <ajeet.yadav.77@gmail.com>
Cc: xfs@oss.sgi.com
Subject: Re: mkfs.xfs pagefault when removed storage during operation
Date: Thu, 03 Feb 2011 00:07:22 -0600	[thread overview]
Message-ID: <4D4A461A.2080503@sandeen.net> (raw)
In-Reply-To: <AANLkTi=aL89d3vUN2jq7w-_j7gcSrkiJk=fGyi4SJv5R@mail.gmail.com>

On 2/3/11 12:03 AM, Ajeet Yadav wrote:
> Sorry I do not agree, we have a bug so we cannot ignore it.
> Solving at first place can save a lot of time if same problem create a
> side effect that may sometime be very hard to catch.
> 
> Now lets consider the current problem
> 1. Its related to libxfs in xfsprogs, so its not mkfs issue anymore
> 2. If we come across any critical problem in libxfs we can cross
> verify kernel xfs implementation to find if there also a logical
> issue.
>     One learning and be used in other part.
> 3. Yes I agree that if mkfs.xfs fails we have to re-run it anyways,
> but then what is the difference between a novice code and professional
> product.
>      If you cscope libxfs_trans_read_buf() in xfsprogs, its caller
> always checks the return value, and its used extensively in xfsprogs.
> But this function always return 0. Infact there is no error handding
> at all, lets not consider EIO error only.
> 4. We are here in open community out of need, at the same time to make
> it better.
> 
> I was wondering why I am not getting any reply, I think mail subject
> was wrong......mkfs ;)

I may have jumped at that too quickly, yes ;)

> I will release the patch, please take out time to review it.

Well, that's fair enough, that's how it works - if it's important to you,
and you want to fix it, then you can!  And if properly done it gets
merged.

-Eric

> On Thu, Feb 3, 2011 at 1:10 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
>> On 2/1/11 5:06 AM, Ajeet Yadav wrote:
>>> We are testing mkfs.xfs and xfs_repair stability to look for crashes
>>> and other issues specially with removable devices.
>>> And unfortunately crashes does occur.
>>> Code inspection shows in most cases the caller does not handle
>>> libxfs_readbuf() for error cases i.e when return value = NULL.
>>>
>>> Now I need your suggestion.
>>> We should fix all such cases or the simplest way is to exit... if
>>> read() or write() fails with EIO errorno in libxfs_readbufr() and
>>> libxfs_writebufr().
>>
>> I see very little reason to gracefully handle all error cases
>> during mkfs. It would be prettier, yes, but if mkfs fails, with
>> or without an error, with or without a segfault, you have to
>> just start it over anyway, right?
>>
>> I think there are better places to focus effort.
>>
>> -Eric
>>
>>> Fortunately these function already support exit, if we use flag
>>> LIBXFS_EXIT_ON_FAILURE, LIBXFS_B_EXIT but they are used selectively.
>>>
>>> The current problem is related to function libxfs_trans_read_buf()
>>>
>>>        bp = libxfs_readbuf(dev, blkno, len, flags);
>>> #ifdef XACT_DEBUG
>>>         fprintf(stderr, "trans_read_buf buffer %p, transaction %p\n", bp, tp);
>>> #endif
>>>         xfs_buf_item_init(bp, tp->t_mountp);
>>>         bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *);
>>>         bip->bli_recur = 0;
>>>         xfs_trans_add_item(tp, (xfs_log_item_t *)bip);
>>>
>>>         /* initialise b_fsprivate2 so we can find it incore */
>>>         XFS_BUF_SET_FSPRIVATE2(bp, tp);
>>>         *bpp = bp;
>>>         return 0;
>>>
>>> if  libxfs_readbuf() fails due to device removal or other error, bp = NULL.
>>> In function xfs_buf_item_init(bp, tp->t_mountp) as soon as bp is
>>> dereferenced occurs
>>>
>>> mkfs.xfs: unhandled page fault (11) at 0x00000070, code 0x017
>>>
>>> _______________________________________________
>>> xfs mailing list
>>> xfs@oss.sgi.com
>>> http://oss.sgi.com/mailman/listinfo/xfs
>>>
>>
>>
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs