All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chao Yu <chao@kernel.org>
To: Sahitya Tummala <stummala@codeaurora.org>, Chao Yu <yuchao0@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>, linux-f2fs-devel@lists.sourceforge.net
Subject: Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
Date: Sat, 6 Jul 2019 09:14:59 +0800	[thread overview]
Message-ID: <0143261e-5592-1ee2-e09d-437d2a7eac22@kernel.org> (raw)
In-Reply-To: <20190705134859.GD8289@codeaurora.org>

Hi Sahitya,

On 2019-7-5 21:48, Sahitya Tummala wrote:
> Hi Chao,
> 
> The journal doesn't look to be corrupted, but only block_addr
> in f2fs_nat_entry is 0 whereas nid and version seems to be valid.
> These could be just truncated nodes, right?

Yes, but that should not fail the condition of journal check and result in
journal size truncation, right?

Thanks,

> 
> Here is the full dump of nat journal entries after a SPO (this is not a 
> issue case, but just a normal case with SPO) -
> 
> [   21.121325] fsck.f2fs: 0: addr 0 nid 5c5a ver 1\x0a
> [   21.127509] fsck.f2fs: 1: addr 0 nid 4c1d ver 1\x0a
> [   21.133760] fsck.f2fs: 2: addr 12075 nid 444f ver 0\x0a
> [   21.140325] fsck.f2fs: 3: addr 0 nid 4307 ver 0\x0a
> [   21.146510] fsck.f2fs: 4: addr 0 nid 4059 ver 1\x0a
> [   21.152745] fsck.f2fs: 5: addr 0 nid 3e69 ver 1\x0a
> [   21.158944] fsck.f2fs: 6: addr 0 nid 3cf9 ver 1\x0a
> [   21.165128] fsck.f2fs: 7: addr 0 nid 3b7b ver 1\x0a
> [   21.171362] fsck.f2fs: 8: addr 18f65 nid 34e2 ver 0\x0a
> [   21.177903] fsck.f2fs: 9: addr 12126 nid 2c87 ver 0\x0a
> [   21.184466] fsck.f2fs: 10: addr 0 nid 2a0a ver 1\x0a
> [   21.190795] fsck.f2fs: 11: addr 0 nid 2565 ver 1\x0a
> [   21.197070] fsck.f2fs: 12: addr 0 nid 1f48 ver 0\x0a
> [   21.203369] fsck.f2fs: 13: addr 155bf nid 1d76 ver 0\x0a
> [   21.210042] fsck.f2fs: 14: addr 19538 nid 18da ver 0\x0a
> [   21.216675] fsck.f2fs: 15: addr 0 nid 157f ver 0\x0a
> [   21.223037] fsck.f2fs: 16: addr d117 nid ac3 ver 0\x0a
> [   21.229547] fsck.f2fs: 17: addr 11e48 nid 805 ver 0\x0a
> [   21.236088] fsck.f2fs: 18: addr d112 nid 61c ver 0\x0a
> [   21.242585] fsck.f2fs: 19: addr d00b nid 45b ver 0\x0a
> [   21.249086] fsck.f2fs: 20: addr d111 nid 30d ver 0\x0a
> [   21.255544] fsck.f2fs: 21: addr d110 nid 1b5 ver 0\x0a
> [   21.262025] fsck.f2fs: 22: addr 0 nid 5e41 ver 1\x0a
> [   21.268342] fsck.f2fs: 23: addr 0 nid 5e46 ver 0\x0a
> [   21.274618] fsck.f2fs: 24: addr 0 nid 5a34 ver 1\x0a
> [   21.280916] fsck.f2fs: 25: addr 0 nid 59fa ver 1\x0a
> [   21.287192] fsck.f2fs: 26: addr 0 nid 57c3 ver 1\x0a
> [   21.293526] fsck.f2fs: 27: addr 0 nid 5776 ver 1\x0a
> [   21.299816] fsck.f2fs: 28: addr 0 nid 4f20 ver 1\x0a
> [   21.306091] fsck.f2fs: 29: addr 0 nid 4ff7 ver 0\x0a
> [   21.312416] fsck.f2fs: 30: addr 0 nid 47e1 ver 1\x0a
> [   21.318715] fsck.f2fs: 31: addr 0 nid 4880 ver 1\x0a
> [   21.324990] fsck.f2fs: 32: addr d118 nid ff4 ver 0\x0a
> [   21.331498] fsck.f2fs: 33: addr d10c nid ff9 ver 0\x0a
> [   21.337957] fsck.f2fs: 34: addr 0 nid 49ed ver 1\x0a
> [   21.344290] fsck.f2fs: 35: addr 0 nid 497e ver 0\x0a
> [   21.350612] fsck.f2fs: 36: addr fc5a nid 49d2 ver 0\x0a
> 
> Thanks,
> 
> On Fri, Jul 05, 2019 at 04:33:57PM +0530, Sahitya Tummala wrote:
>> Hi Chao,
>>
>> On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
>>> Hi Sahitya,
>>>
>>> On 2019/7/5 13:01, Sahitya Tummala wrote:
>>>> With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
>>>> there is a serious data loss issue observed during sudden power off tests.
>>>
>>> In that patch, since journal's data is untrusty, I just truncated n_nats to last
>>> valid one to fix original reported issue.
>>
>> The original issue will be fixed even without updating/truncating the n_nats
>> as we still have sanity checks on nid and blkaddr, right?
>>
>>>
>>> Actually, the problem here is we don't know which data is correct and which data
>>> is corrupted in nat or sit journal, so even we skip broken entry without
>>> modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
>>> it may pass validation check), and those entries may cause potential data
>>> loss/corruption...
>>>
>>
>> This issue was reported by a customer in the following test scenario -
>>
>> Update the password in Android and then force crash. It is verified that
>> all the files related to password data has fsync() already.
>>
>> After some iterations (3-4) of this test, it is found password related files are
>> lost, failing the phone to bootup.
>>
>> Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
>> such issue. It means that the data present in journal seems to be valid atleast
>> in this case and truncating is causing to loose the valid data/files.
>>
>>> It looks no matter how we change the codes, we still may face data
>>> loos/corruption issues.
>>>
>>
>> Yes, maybe in some cases, the data loss is still expected, but there are
>> chances that data may be valid as well, which is confirmed by the test case above.
>>
>>> IMO, we need to find out the root cause of data corruption, So my question is
>>> that why sudden power off tests can make nat journal being corrupted? I doubt
>>> that in-place update method on checkpoint area of fsck may cause this... any
>>> other thoughts?
>>>
>>
>> In case of sudden power off, I have printed the all the journal entries in the
>> NAT and I several entries with valid blkaddr and nid as below -
>>
>> [   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
>> [   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
>> [   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a
>> [   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
>> [   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
>> [   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
>> [   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
>> [   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
>> [   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
>> [   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
>> [   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
>> [   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
>> [   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
>> [   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
>> [   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
>> [   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
>> [   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
>> [   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
>> [   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
>> [   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
>> [   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
>> [   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
>> [   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
>> [   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
>> [   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
>> [   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
>> [   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
>> [   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
>> [   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
>> [   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
>> [   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
>> [   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
>> [   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
>> [   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
>> [   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
>> [   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
>> [   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a
>>
>> Thanks,
>>
>>> Thanks,
>>>
>>>>
>>>> The reason is due to incorrect update of journal->n_nats in that patch.
>>>> When fsck encounters a nat journal entry with invalid blkaddr/nid, it
>>>> updates journal->n_nats with the index of such first entry and then continues
>>>> the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
>>>> But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
>>>> as it fails the for loop condition i < nats_in_cursum(journal).
>>>>
>>>> There could be several other valid entries in the journal after this
>>>> invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
>>>> could not be build properly for valid nat journal entries. It further results
>>>> into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
>>>>
>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
>>>> ...
>>>> NID[0xba5] is unreachable\x0a
>>>> NID[0xfa9] is unreachable\x0a
>>>> ...
>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
>>>> ...
>>>>
>>>> Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
>>>> Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
>>>> ---
>>>>  fsck/mount.c | 2 --
>>>>  1 file changed, 2 deletions(-)
>>>>
>>>> diff --git a/fsck/mount.c b/fsck/mount.c
>>>> index 60e0e4a..1b93f49 100644
>>>> --- a/fsck/mount.c
>>>> +++ b/fsck/mount.c
>>>> @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>>  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
>>>>  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
>>>> -			journal->n_nats = cpu_to_le16(i);
>>>>  			c.fix_on = 1;
>>>>  			continue;
>>>>  		}
>>>> @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>>  		nid = le32_to_cpu(nid_in_journal(journal, i));
>>>>  		if (!IS_VALID_NID(sbi, nid)) {
>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
>>>> -			journal->n_nats = cpu_to_le16(i);
>>>>  			c.fix_on = 1;
>>>>  			continue;
>>>>  		}
>>>>
>>
>> -- 
>> --
>> Sent by a consultant of the Qualcomm Innovation Center, Inc.
>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>>
>>
>> _______________________________________________
>> Linux-f2fs-devel mailing list
>> Linux-f2fs-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

  reply	other threads:[~2019-07-06  1:15 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-05  5:01 [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO Sahitya Tummala
2019-07-05 10:11 ` Chao Yu
2019-07-05 11:03   ` Sahitya Tummala
2019-07-05 13:48     ` Sahitya Tummala
2019-07-06  1:14       ` Chao Yu [this message]
2019-07-06  1:31         ` Sahitya Tummala
2019-07-06  1:37           ` Sahitya Tummala
2019-07-06  2:10             ` Chao Yu
2019-07-06  2:10             ` Chao Yu
2019-07-06  2:17               ` Chao Yu
2019-07-06  4:50                 ` Jaegeuk Kim
2019-07-07  5:25                   ` Chao Yu
2019-07-08 23:43                     ` Jaegeuk Kim
2019-07-06  1:12     ` Chao Yu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0143261e-5592-1ee2-e09d-437d2a7eac22@kernel.org \
    --to=chao@kernel.org \
    --cc=jaegeuk@kernel.org \
    --cc=linux-f2fs-devel@lists.sourceforge.net \
    --cc=stummala@codeaurora.org \
    --cc=yuchao0@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.