linux-f2fs-devel.lists.sourceforge.net archive mirror
 help / color / mirror / Atom feed
From: Chao Yu <chao@kernel.org>
To: Sahitya Tummala <stummala@codeaurora.org>, Chao Yu <yuchao0@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>, linux-f2fs-devel@lists.sourceforge.net
Subject: Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
Date: Sat, 6 Jul 2019 09:12:16 +0800	[thread overview]
Message-ID: <15cbab53-5110-b54f-47a5-a7046bebc751@kernel.org> (raw)
In-Reply-To: <20190705110356.GC8289@codeaurora.org>

Hi Sahitya,

On 2019-7-5 19:03, Sahitya Tummala wrote:
> Hi Chao,
> 
> On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
>> Hi Sahitya,
>>
>> On 2019/7/5 13:01, Sahitya Tummala wrote:
>>> With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
>>> there is a serious data loss issue observed during sudden power off tests.
>>
>> In that patch, since journal's data is untrusty, I just truncated n_nats to last
>> valid one to fix original reported issue.
> 
> The original issue will be fixed even without updating/truncating the n_nats
> as we still have sanity checks on nid and blkaddr, right?

No, actually that's v2 patch did in below link, but as Aravind reported: fsck
will still crash in latter build_nat_area_bitmap() where we access entries[nid]
out-of-range.

https://lore.kernel.org/patchwork/patch/1031172/

Quoted:

Program received signal SIGSEGV, Segmentation fault.
0x00005555555673de in build_nat_area_bitmap (sbi=<optimized out>) at
mount.c:2469
2469            if (fsck->entries[nid].block_addr != 0x0) {
(gdb) backtrace
#0  0x00005555555673de in build_nat_area_bitmap (sbi=<optimized out>)
    at mount.c:2469
#1  0x000055555555d85c in fsck_init (sbi=0x55555557db40 <gfsck>) at fsck.c:1904
#2  0x000055555555682a in do_fsck (sbi=0x55555557db40 <gfsck>) at main.c:578
#3  main (argc=<optimized out>, argv=<optimized out>) at main.c:781

Maybe we need just remove invalid entry directly to avoid this rather than just
skipping it.

> 
>>
>> Actually, the problem here is we don't know which data is correct and which data
>> is corrupted in nat or sit journal, so even we skip broken entry without
>> modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
>> it may pass validation check), and those entries may cause potential data
>> loss/corruption...
>>
> 
> This issue was reported by a customer in the following test scenario -
> 
> Update the password in Android and then force crash. It is verified that
> all the files related to password data has fsync() already.
> 
> After some iterations (3-4) of this test, it is found password related files are
> lost, failing the phone to bootup.
> 
> Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
> such issue. It means that the data present in journal seems to be valid atleast
> in this case and truncating is causing to loose the valid data/files.
> 
>> It looks no matter how we change the codes, we still may face data
>> loos/corruption issues.
>>
> 
> Yes, maybe in some cases, the data loss is still expected, but there are
> chances that data may be valid as well, which is confirmed by the test case above.

Yes, I'm not against this, but still that was not the root cause...

> 
>> IMO, we need to find out the root cause of data corruption, So my question is
>> that why sudden power off tests can make nat journal being corrupted? I doubt
>> that in-place update method on checkpoint area of fsck may cause this... any
>> other thoughts?
>>
> 
> In case of sudden power off, I have printed the all the journal entries in the
> NAT and I several entries with valid blkaddr and nid as below -
> 
> [   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
> [   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
> [   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a

addr(0) can pass the below condtion, why we will print above error message and
do the truncation...

if (!IS_VALID_BLK_ADDR(sbi, addr)) {

Thanks,

> [   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
> [   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
> [   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
> [   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
> [   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
> [   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
> [   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
> [   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
> [   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
> [   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
> [   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
> [   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
> [   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
> [   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
> [   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
> [   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
> [   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
> [   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
> [   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
> [   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
> [   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
> [   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
> [   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
> [   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
> [   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
> [   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
> [   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
> [   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
> [   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
> [   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
> [   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
> [   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
> [   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
> [   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a
> 
> Thanks,
> 
>> Thanks,
>>
>>>
>>> The reason is due to incorrect update of journal->n_nats in that patch.
>>> When fsck encounters a nat journal entry with invalid blkaddr/nid, it
>>> updates journal->n_nats with the index of such first entry and then continues
>>> the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
>>> But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
>>> as it fails the for loop condition i < nats_in_cursum(journal).
>>>
>>> There could be several other valid entries in the journal after this
>>> invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
>>> could not be build properly for valid nat journal entries. It further results
>>> into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
>>>
>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
>>> ...
>>> NID[0xba5] is unreachable\x0a
>>> NID[0xfa9] is unreachable\x0a
>>> ...
>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
>>> ...
>>>
>>> Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
>>> Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
>>> ---
>>>  fsck/mount.c | 2 --
>>>  1 file changed, 2 deletions(-)
>>>
>>> diff --git a/fsck/mount.c b/fsck/mount.c
>>> index 60e0e4a..1b93f49 100644
>>> --- a/fsck/mount.c
>>> +++ b/fsck/mount.c
>>> @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
>>>  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
>>> -			journal->n_nats = cpu_to_le16(i);
>>>  			c.fix_on = 1;
>>>  			continue;
>>>  		}
>>> @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>  		nid = le32_to_cpu(nid_in_journal(journal, i));
>>>  		if (!IS_VALID_NID(sbi, nid)) {
>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
>>> -			journal->n_nats = cpu_to_le16(i);
>>>  			c.fix_on = 1;
>>>  			continue;
>>>  		}
>>>
> 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

      parent reply	other threads:[~2019-07-06  1:12 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-05  5:01 [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO Sahitya Tummala
2019-07-05 10:11 ` Chao Yu
2019-07-05 11:03   ` Sahitya Tummala
2019-07-05 13:48     ` Sahitya Tummala
2019-07-06  1:14       ` Chao Yu
2019-07-06  1:31         ` Sahitya Tummala
2019-07-06  1:37           ` Sahitya Tummala
2019-07-06  2:10             ` Chao Yu
2019-07-06  2:10             ` Chao Yu
2019-07-06  2:17               ` Chao Yu
2019-07-06  4:50                 ` Jaegeuk Kim
2019-07-07  5:25                   ` Chao Yu
2019-07-08 23:43                     ` Jaegeuk Kim
2019-07-06  1:12     ` Chao Yu [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=15cbab53-5110-b54f-47a5-a7046bebc751@kernel.org \
    --to=chao@kernel.org \
    --cc=jaegeuk@kernel.org \
    --cc=linux-f2fs-devel@lists.sourceforge.net \
    --cc=stummala@codeaurora.org \
    --cc=yuchao0@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).