From: "zhangyi (F)" <yi.zhang@huawei.com>
To: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>, <tytso@mit.edu>,
	<adilger.kernel@dilger.ca>, <zhangxiaoxu5@huawei.com>
Subject: Re: [PATCH 00/10] ext4: fix inconsistency since reading old metadata from disk
Date: Wed, 10 Jun 2020 16:55:15 +0800
Message-ID: <45796804-07f7-2f62-b8c5-db077950d882@huawei.com>
In-Reply-To: <20200609121920.GB12551@quack2.suse.cz>

Hi, Jan.

On 2020/6/9 20:19, Jan Kara wrote:
> On Mon 08-06-20 22:39:31, zhangyi (F) wrote:
>>> On Tue 26-05-20 15:17:44, zhangyi (F) wrote:
>>>> Background
>>>> ==========
>>>>
>>>> This patch set aims to fix the inconsistency problem which has been
>>>> discussed and partially fixed in [1].
>>>>
>>>> The problem occurs on unstable storage with a flaky transport (e.g. an
>>>> iSCSI transport may disconnect for a few seconds and reconnect because
>>>> of a bad network environment). If an async write of metadata fails in
>>>> the background, the end-write routine in the block layer clears the
>>>> buffer's uptodate flag, even though the data in that buffer is actually
>>>> up to date. Later, when we get the buffer again, we may read "old &&
>>>> inconsistent" metadata from disk, because not only was the uptodate
>>>> flag cleared, but we also do not check the write IO error flag; or,
>>>> even worse, the buffer may have been freed due to memory pressure.
>>>>
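
(For reference, the failure path of the block layer's end_buffer_async_write()
looks roughly like the following; this is paraphrased from memory, so the
details may differ slightly from the actual fs/buffer.c code:)

void end_buffer_async_write(struct buffer_head *bh, int uptodate)
{
	struct page *page = bh->b_page;
	...
	if (!uptodate) {
		/* report the error and remember it on the buffer/mapping */
		buffer_io_error(bh, ", lost async page write");
		mark_buffer_write_io_error(bh);
		/* the data in the buffer is still valid, but the flag is gone */
		clear_buffer_uptodate(bh);
		SetPageError(page);
	}
	...
}
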
>>>> Fortunately, if jbd2 does a checkpoint after the async IO error
>>>> happens, the checkpoint routine will check the write_io_error flag and
>>>> abort the journal if it detects an IO error. In the journal recovery
>>>> case, the recovery code invokes sync_blockdev() after recovery
>>>> completes, which will also detect the IO error and refuse to mount the
>>>> filesystem.
>>>>
>>>> Current ext4 already deals with this problem in __ext4_get_inode_loc()
>>>> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
>>>> containing valid data"), but that is not enough.
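
(The existing workaround in __ext4_get_inode_loc() is roughly the following;
again paraphrased rather than an exact copy of that commit:)

	bh = sb_getblk(sb, block);
	...
	if (!buffer_uptodate(bh)) {
		lock_buffer(bh);
		/*
		 * If the buffer has the write IO error flag set, a previous
		 * writeback of this block failed, but the data in the buffer
		 * is still valid, so treat it as uptodate instead of
		 * re-reading stale data from disk.
		 */
		if (buffer_write_io_error(bh))
			set_buffer_uptodate(bh);
		...
	}
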
>>>
>>> Before we go and complicate ext4 code like this, I'd like to understand
>>> what the desired outcome is, which doesn't seem to be mentioned here, in
>>> commit 7963e5ac90125, or in the discussion you reference. If you have a
>>> flaky transport that gives you IO errors, IMO it is not the business of
>>> the filesystem to try to fix that. It just has to make sure it properly
>>> reports
>>
>> If we hit IO errors due to the flaky transport, IMO the desired outcome
>> is 1) report the IO error; 2) the ext4 filesystem behaves as the
>> "errors=xxx" configuration specifies: if we set "errors=read-only || panic",
>> we expect ext4 to remount read-only or panic immediately to avoid
>> inconsistency. In brief, the kernel should try its best to guarantee that
>> the filesystem on disk stays consistent; this reduces fsck's work (AFAIK,
>> fsck cannot automatically fix most of the inconsistencies caused by the
>> async error problem I mentioned), so we could recover the fs automatically
>> on the next boot.
> 
> Good, so I fully agree with your goals. Let's now talk about how to achieve
> them :)
> 
>> But now, in the case of async metadata writeback, (1) is done in
>> end_buffer_async_write(), but (2) is not guaranteed, because ext4 cannot
>> detect the metadata write error, and therefore it cannot remount the
>> filesystem or panic immediately. Finally, if we read the stale metadata
>> from disk and re-write it again, it may lead to on-disk filesystem
>> inconsistency.
> 
> Ah, I see. This was the important bit I was missing. And I think the
> real problem here is that ext4 cannot detect metadata write errors from
> async writeback. So my plan would be to detect metadata write errors early
> and abort the journal and do the appropriate errors=xxx handling. And a
> relatively simple way to do that these days would be to use the errseq in
> the block device's mapping - sb->s_bdev->bd_inode->i_mapping->wb_err - which
> gets incremented whenever there's a writeback error in the block device
> mapping. So (probably in ext4_journal_check_start()) we could check whether
> wb_err is different from the original value we sampled at mount time, and if
> yes, we know a metadata writeback error has happened and we trigger the
> error handling. What do you think?
> 
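
(If I understand the proposal correctly, it would look roughly like the sketch
below; this is untested, and the s_bdev_wb_err field name is just made up for
illustration:)

	/* at mount time, e.g. in ext4_fill_super(), remember the current
	 * writeback error cursor of the block device mapping */
	sbi->s_bdev_wb_err =
		filemap_sample_wb_err(sb->s_bdev->bd_inode->i_mapping);

	/* in ext4_journal_check_start() */
	if (unlikely(filemap_check_wb_err(sb->s_bdev->bd_inode->i_mapping,
					  sbi->s_bdev_wb_err))) {
		/*
		 * An async metadata writeback error happened since mount;
		 * report it and let the errors=xxx handling take over
		 * (e.g. via ext4_error() / jbd2_journal_abort()).
		 */
		return -EIO;
	}
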

Thanks a lot for your suggestion, this solution looks good to me. But I think
adding the 'wb_err' check to ext4_journal_check_start() may be too early, see
the race condition below (it's just theoretical analysis, I have not tested it):

ext4_journal_start()
 ext4_journal_check_start()  <-- pass checking
                                 |   end_buffer_async_write()
                                 |    mark_buffer_write_io_error() <-- set wb_err on b_page's mapping
sb_getblk(bh)   <-- read old data from disk
ext4_journal_get_write_access(bh)
modify this bh  <-- modify data and lead to inconsistency
ext4_handle_dirty_metadata(bh)

So I guess it may still lead to inconsistency. How about adding this check to
ext4_journal_get_write_access() instead, roughly as in the sketch below?
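
Something along these lines (the same untested sketch as before, just moved to
the point where we are about to modify the buffer; 'sb' and the s_bdev_wb_err
field are still only illustrative):

	/* in ext4_journal_get_write_access(), before handing the buffer
	 * out for modification */
	if (filemap_check_wb_err(bh->b_bdev->bd_inode->i_mapping,
				 EXT4_SB(sb)->s_bdev_wb_err)) {
		/* stale data may have been read back in, don't modify it */
		return -EIO;
	}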

>>> errors to userspace and (depending on the errors= configuration) shuts itself
>>> down to limit further damage. This patch seems to try to mask those errors
>>> and that's, in my opinion, rather futile (as in you can hardly ever deal
>>> with all the cases). BTW are you running these systems on flaky iSCSI with
>>> errors=continue so that the errors don't shut the filesystem down
>>> immediately?
>>>
>> Yes, I run an ext4 filesystem on a flaky iSCSI transport (it is stable most
>> of the time) with errors=read-only; in the cases mentioned above, the fs
>> will not be remounted read-only immediately, or is only remounted after it
>> has already become inconsistent.
>>
>> Thinking about how to fix this, one method is to invoke ext4_error() or
>> jbd2_journal_abort() when we detect a write error, to prevent further use of
>> the filesystem. But after looking at __ext4_get_inode_loc() and 7963e5ac90125,
>> I think that although the metadata buffer failed to be written back to disk
>> due to the occasionally unstable network environment, the data in the buffer
>> is actually uptodate, so the filesystem could self-heal after the network
>> recovers. In the worst case, if the network is broken for a long time, the
>> jbd2 checkpoint routine will detect the error, or jbd2 will fail to write
>> the journal to disk; both will abort the filesystem. So I think we could
>> re-set the uptodate flag when we read the buffer again, as 7963e5ac90125 does.
> 
> Yeah, but I'm actually against such self-healing logic. IMHO it is too
> fragile and also fairly intrusive, as your patches show. If we wanted
> something like this, we'd need to think hard about whether functionality
> like this belongs in ext4 or in some layer below it (e.g. think how
> multipath handles temporary path failures). And even if we decided it's
> worth the trouble in the filesystem, I'd rather go and change how
> fs/buffer.c deals with buffer writeback errors than reset uptodate bits
> on buffers, which just seems dangerous to me...
> 

Yeah, I see. Invoking the error handlers as soon as we detect the error flag
could minimize the risk.

Thanks,
Yi.



Thread overview: 25+ messages
2020-05-26  7:17 [PATCH 00/10] ext4: fix inconsistency since reading old metadata from disk zhangyi (F)
2020-05-26  7:17 ` [PATCH 01/10] ext4: move inode eio simulation behind io completion zhangyi (F)
2020-05-26  7:17 ` [PATCH 02/10] fs: pick out ll_rw_one_block() helper function zhangyi (F)
2020-05-28  5:07   ` Christoph Hellwig
2020-05-28 13:23     ` zhangyi (F)
2020-05-26  7:17 ` [PATCH 03/10] ext4: add ext4_sb_getblk*() wrapper functions zhangyi (F)
2020-05-26  7:17 ` [PATCH 04/10] ext4: replace sb_getblk() with ext4_sb_getblk_locked() zhangyi (F)
2020-05-26  7:17 ` [PATCH 05/10] ext4: replace sb_bread*() with ext4_sb_bread*() zhangyi (F)
2020-05-26  7:17 ` [PATCH 06/10] ext4: replace sb_getblk() with ext4_sb_getblk() zhangyi (F)
2020-05-26  7:17 ` [PATCH 07/10] ext4: switch to use ext4_sb_getblk_locked() in ext4_getblk() zhangyi (F)
2020-05-26  7:17 ` [PATCH 08/10] ext4: replace sb_breadahead() with ext4_sb_breadahead() zhangyi (F)
2020-05-26  7:17 ` [PATCH 09/10] ext4: abort the filesystem while freeing the write error io buffer zhangyi (F)
2020-05-26  7:17 ` [PATCH 10/10] ext4: remove unused parameter in jbd2_journal_try_to_free_buffers() zhangyi (F)
2020-06-08  3:32 ` [PATCH 00/10] ext4: fix inconsistency since reading old metadata from disk zhangyi (F)
2020-06-08  8:20 ` Jan Kara
2020-06-08 14:39   ` zhangyi (F)
2020-06-09 12:19     ` Jan Kara
2020-06-10  8:55       ` zhangyi (F) [this message]
2020-06-10  9:57         ` Jan Kara
2020-06-10 15:45           ` Theodore Y. Ts'o
2020-06-10 16:27             ` Jan Kara
2020-06-11  2:12               ` zhangyi (F)
2020-06-11  8:21                 ` Jan Kara
2020-06-11 16:55                   ` Theodore Y. Ts'o
2020-06-12 11:13                     ` zhangyi (F)
