Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO

linux-f2fs-devel.lists.sourceforge.net archive mirror
 help / color / mirror / Atom feed

From: Sahitya Tummala <stummala@codeaurora.org>
To: Chao Yu <yuchao0@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>, linux-f2fs-devel@lists.sourceforge.net
Subject: Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
Date: Fri, 5 Jul 2019 16:33:57 +0530	[thread overview]
Message-ID: <20190705110356.GC8289@codeaurora.org> (raw)
In-Reply-To: <15bb7741-7c2f-8dc2-065f-ba1fcaf22050@huawei.com>

Hi Chao,

On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
> Hi Sahitya,
> 
> On 2019/7/5 13:01, Sahitya Tummala wrote:
> > With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
> > there is a serious data loss issue observed during sudden power off tests.
> 
> In that patch, since journal's data is untrusty, I just truncated n_nats to last
> valid one to fix original reported issue.

The original issue will be fixed even without updating/truncating the n_nats
as we still have sanity checks on nid and blkaddr, right?

> 
> Actually, the problem here is we don't know which data is correct and which data
> is corrupted in nat or sit journal, so even we skip broken entry without
> modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
> it may pass validation check), and those entries may cause potential data
> loss/corruption...
> 

This issue was reported by a customer in the following test scenario -

Update the password in Android and then force crash. It is verified that
all the files related to password data has fsync() already.

After some iterations (3-4) of this test, it is found password related files are
lost, failing the phone to bootup.

Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
such issue. It means that the data present in journal seems to be valid atleast
in this case and truncating is causing to loose the valid data/files.

> It looks no matter how we change the codes, we still may face data
> loos/corruption issues.
> 

Yes, maybe in some cases, the data loss is still expected, but there are
chances that data may be valid as well, which is confirmed by the test case above.

> IMO, we need to find out the root cause of data corruption, So my question is
> that why sudden power off tests can make nat journal being corrupted? I doubt
> that in-place update method on checkpoint area of fsck may cause this... any
> other thoughts?
> 

In case of sudden power off, I have printed the all the journal entries in the
NAT and I several entries with valid blkaddr and nid as below -

[   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
[   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
[   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a
[   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
[   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
[   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
[   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
[   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
[   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
[   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
[   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
[   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
[   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
[   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
[   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
[   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
[   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
[   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
[   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
[   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
[   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
[   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
[   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
[   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
[   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
[   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
[   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
[   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
[   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
[   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
[   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
[   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
[   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
[   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
[   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
[   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
[   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a

Thanks,

> Thanks,
> 
> > 
> > The reason is due to incorrect update of journal->n_nats in that patch.
> > When fsck encounters a nat journal entry with invalid blkaddr/nid, it
> > updates journal->n_nats with the index of such first entry and then continues
> > the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
> > But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
> > as it fails the for loop condition i < nats_in_cursum(journal).
> > 
> > There could be several other valid entries in the journal after this
> > invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
> > could not be build properly for valid nat journal entries. It further results
> > into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
> > 
> > [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
> > [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
> > [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
> > [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
> > ...
> > NID[0xba5] is unreachable\x0a
> > NID[0xfa9] is unreachable\x0a
> > ...
> > [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
> > [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
> > ...
> > 
> > Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
> > Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
> > ---
> >  fsck/mount.c | 2 --
> >  1 file changed, 2 deletions(-)
> > 
> > diff --git a/fsck/mount.c b/fsck/mount.c
> > index 60e0e4a..1b93f49 100644
> > --- a/fsck/mount.c
> > +++ b/fsck/mount.c
> > @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
> >  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
> >  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
> >  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
> > -			journal->n_nats = cpu_to_le16(i);
> >  			c.fix_on = 1;
> >  			continue;
> >  		}
> > @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
> >  		nid = le32_to_cpu(nid_in_journal(journal, i));
> >  		if (!IS_VALID_NID(sbi, nid)) {
> >  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
> > -			journal->n_nats = cpu_to_le16(i);
> >  			c.fix_on = 1;
> >  			continue;
> >  		}
> > 

-- 
--
Sent by a consultant of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

next prev parent reply	other threads:[~2019-07-05 11:04 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-05  5:01 [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO Sahitya Tummala
2019-07-05 10:11 ` Chao Yu
2019-07-05 11:03   ` Sahitya Tummala [this message]
2019-07-05 13:48     ` Sahitya Tummala
2019-07-06  1:14       ` Chao Yu
2019-07-06  1:31         ` Sahitya Tummala
2019-07-06  1:37           ` Sahitya Tummala
2019-07-06  2:10             ` Chao Yu
2019-07-06  2:10             ` Chao Yu
2019-07-06  2:17               ` Chao Yu
2019-07-06  4:50                 ` Jaegeuk Kim
2019-07-07  5:25                   ` Chao Yu
2019-07-08 23:43                     ` Jaegeuk Kim
2019-07-06  1:12     ` Chao Yu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190705110356.GC8289@codeaurora.org \
    --to=stummala@codeaurora.org \
    --cc=jaegeuk@kernel.org \
    --cc=linux-f2fs-devel@lists.sourceforge.net \
    --cc=yuchao0@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).