* ubifs_scan() error handling @ 2010-06-23 17:04 twebb 2010-07-13 4:28 ` Artem Bityutskiy 0 siblings, 1 reply; 7+ messages in thread From: twebb @ 2010-06-23 17:04 UTC (permalink / raw) To: linux-mtd, Lei Wen Without having a complete understanding of how the UBI and UBIFS internals work, I have a question regarding error handling in ubifs_scan(). When ubifs_scan() encounters SCANNED_EMPTY_SPACE during execution, it checks whether the LEB is all 0xFFs and if not - returns -EUCLEAN. In some cases (for example, in orphan.c/kill_orphans()), any error returned by ubifs_scan() results in a call to ubifs_recover_leb(). Would it make sense and be acceptable to make this call any time ubifs_scan() returns an error? And if so, would it make more sense to include the ubifs_recover_leb() call in ubifs_scan() at goto corruption? Thanks, twebb ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: ubifs_scan() error handling 2010-06-23 17:04 ubifs_scan() error handling twebb @ 2010-07-13 4:28 ` Artem Bityutskiy 2010-07-13 4:32 ` Artem Bityutskiy 0 siblings, 1 reply; 7+ messages in thread From: Artem Bityutskiy @ 2010-07-13 4:28 UTC (permalink / raw) To: twebb; +Cc: Lei Wen, linux-mtd On Wed, 2010-06-23 at 13:04 -0400, twebb wrote: > Without having a complete understanding of how the UBI and UBIFS > internals work, I have a question regarding error handling in > ubifs_scan(). > > When ubifs_scan() encounters SCANNED_EMPTY_SPACE during execution, it > checks whether the LEB is all 0xFFs and if not - returns -EUCLEAN. Yes, because it UBIFS always writes to LEBs from the beginning to the end, node by node. Then, when power cut happens, it should have something like LEB: | good nodes | a broken node | 0xFFs | this is what it checks. If it does not see 0xFFs, this is some unknow situation for UBIFS and it prefers to refuse the flash. In MLC case, you may have bit-flips. So you need to teach UBIFS to accept 0xFFs + bitflips. > In > some cases (for example, in orphan.c/kill_orphans()), any error > returned by ubifs_scan() results in a call to ubifs_recover_leb(). I need to look closer, but looks like we just forgot to check the need_recovery flag. > Would it make sense and be acceptable to make this call any time > ubifs_scan() returns an error? May be. I'll think and return to you. > And if so, would it make more sense to > include the ubifs_recover_leb() call in ubifs_scan() at goto > corruption? Also need to think. -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: ubifs_scan() error handling 2010-07-13 4:28 ` Artem Bityutskiy @ 2010-07-13 4:32 ` Artem Bityutskiy 2011-01-28 17:13 ` twebb 0 siblings, 1 reply; 7+ messages in thread From: Artem Bityutskiy @ 2010-07-13 4:32 UTC (permalink / raw) To: twebb; +Cc: Lei Wen, linux-mtd On Tue, 2010-07-13 at 07:28 +0300, Artem Bityutskiy wrote: > > Would it make sense and be acceptable to make this call any time > > ubifs_scan() returns an error? > > May be. I'll think and return to you. We can probably go for it, but the current recovery is not enough for MLC anyway. So I'd prefer to do the change you propose as a part of larger UBIFS on MLC patch-set. Please, analyse ubifs_recover_leb(), find out what should be changed there for MLC, send a patch-set. -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: ubifs_scan() error handling 2010-07-13 4:32 ` Artem Bityutskiy @ 2011-01-28 17:13 ` twebb 2011-01-29 17:38 ` Artem Bityutskiy 0 siblings, 1 reply; 7+ messages in thread From: twebb @ 2011-01-28 17:13 UTC (permalink / raw) To: dedekind1; +Cc: Lei Wen, linux-mtd > On Tue, 2010-07-13 at 07:28 +0300, Artem Bityutskiy wrote: >> > Would it make sense and be acceptable to make this call any time >> > ubifs_scan() returns an error? >> >> May be. I'll think and return to you. > > We can probably go for it, but the current recovery is not enough for > MLC anyway. So I'd prefer to do the change you propose as a part of > larger UBIFS on MLC patch-set. > > Please, analyse ubifs_recover_leb(), find out what should be changed > there for MLC, send a patch-set. > It's been awhile but I wanted to reopen the discussion on this topic. Could you take a look at this proposed patch? Essentially this change results in the LEB being cleaned/recovered regardless of whether is_last_write() is true or not. There may be a better way to do this earlier in the same function, but I'm not familiar enough with it to know the significance of the is_last_write() call. diff -Naru a/fs/ubifs/recovery.c b/fs/ubifs/recovery.c --- a/fs/ubifs/recovery.c 2011-01-26 16:08:04.000000000 -0500 +++ b/fs/ubifs/recovery.c 2011-01-26 16:08:25.000000000 -0500 @@ -665,19 +665,8 @@ } if (!empty_chkd && !is_empty(buf, len)) { - if (is_last_write(c, buf, offs)) { - clean_buf(c, &buf, lnum, &offs, &len); - need_clean = 1; - } else { - int corruption = first_non_ff(buf, len); - - ubifs_err("corrupt empty space LEB %d:%d, corruption " - "starts at %d", lnum, offs, corruption); - /* Make sure we dump interesting non-0xFF data */ - offs = corruption; - buf += corruption; - goto corrupted; - } + clean_buf(c, &buf, lnum, &offs, &len); + need_clean = 1; } /* Drop nodes from incomplete group */ ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: ubifs_scan() error handling 2011-01-28 17:13 ` twebb @ 2011-01-29 17:38 ` Artem Bityutskiy 2011-02-01 1:26 ` twebb 0 siblings, 1 reply; 7+ messages in thread From: Artem Bityutskiy @ 2011-01-29 17:38 UTC (permalink / raw) To: twebb; +Cc: Lei Wen, linux-mtd On Fri, 2011-01-28 at 12:13 -0500, twebb wrote: > It's been awhile but I wanted to reopen the discussion on this topic. > Could you take a look at this proposed patch? Essentially this change > results in the LEB being cleaned/recovered regardless of whether > is_last_write() is true or not. There may be a better way to do this > earlier in the same function, but I'm not familiar enough with it to > know the significance of the is_last_write() call. But I think I explained why this check is there. Why exactly it does not work for your flash? I think you need to get better understanding what is happening in your case. I am reluctant to take this patch because it is more of a band-aid but not a proper solution. -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: ubifs_scan() error handling 2011-01-29 17:38 ` Artem Bityutskiy @ 2011-02-01 1:26 ` twebb 2011-02-06 14:58 ` Artem Bityutskiy 0 siblings, 1 reply; 7+ messages in thread From: twebb @ 2011-02-01 1:26 UTC (permalink / raw) To: dedekind1; +Cc: Lei Wen, linux-mtd >> It's been awhile but I wanted to reopen the discussion on this topic. >> Could you take a look at this proposed patch? Essentially this change >> results in the LEB being cleaned/recovered regardless of whether >> is_last_write() is true or not. There may be a better way to do this >> earlier in the same function, but I'm not familiar enough with it to >> know the significance of the is_last_write() call. > > But I think I explained why this check is there. Why exactly it does not > work for your flash? I think you need to get better understanding what > is happening in your case. I am reluctant to take this patch because it > is more of a band-aid but not a proper solution. > I don't recall an explanation about why that check is there, but regardless, I'll try to explain why things are failing on my flash: Let's take the case of a read or write disturb error causing multiple bit flips in the empty space of a LEB (not in the common header node magic number location). I believe this type of error is very common with MLC (since ECC will generally handle bit flips in non-FF areas.) LEB: | good nodes | 0xFFs | bit flip | 0xFFs | bit flip | 0xFFs| When ubifs_recover_leb() calls ubifs_scan_a_node(), it correctly returns SCANNED_EMPTY_SPACE. Then ubifs_recover_leb() finds that the LEB buf is not empty. It also finds that !is_last_write() is TRUE and breaks without setting empty_chkd. Later, as a result of !empty_chkd and !is_empty and !is_last_write all being TRUE, the LEB is marked as corrupted. This ultimately may result in a failure to mount or in a RO mount. However, because of the nature of the "corruption", if ubifs_recover_leb() ignores is_last_write() result and instead calls clean_buf() and sets need_clean = 1, then fix_unclean_leb() ultimately fixes the bit flip via the ubi_leb_change() call and without data loss. Does this make sense or is my logic wrong? I think it's OK assuming that no true corruption (associated with a power loss) happens in conjunction with the rd/wr disturb. I do understand that perhaps this is more a band-aid than a proper solution. However, I'm trying to understand whether it is reasonable and whether you think it does more good than harm. twebb ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: ubifs_scan() error handling 2011-02-01 1:26 ` twebb @ 2011-02-06 14:58 ` Artem Bityutskiy 0 siblings, 0 replies; 7+ messages in thread From: Artem Bityutskiy @ 2011-02-06 14:58 UTC (permalink / raw) To: twebb; +Cc: Lei Wen, linux-mtd On Mon, 2011-01-31 at 20:26 -0500, twebb wrote: > >> It's been awhile but I wanted to reopen the discussion on this topic. > >> Could you take a look at this proposed patch? Essentially this change > >> results in the LEB being cleaned/recovered regardless of whether > >> is_last_write() is true or not. There may be a better way to do this > >> earlier in the same function, but I'm not familiar enough with it to > >> know the significance of the is_last_write() call. > > > > But I think I explained why this check is there. Why exactly it does not > > work for your flash? I think you need to get better understanding what > > is happening in your case. I am reluctant to take this patch because it > > is more of a band-aid but not a proper solution. > > > > I don't recall an explanation about why that check is there, but > regardless, I'll try to explain why things are failing on my flash: Well, I explained it in various threads, probably not in this one. But I've just pushed a patch to the ubifs-2.6.git tree which should shed some light: http://git.infradead.org/ubifs-2.6.git/blobdiff/c8aa892b861cda6fc29526feae5d41db69523c36..5b1791a1529ccd4e49c59fe9f8fb6575f74c569e:/fs/ubifs/recovery.c > Let's take the case of a read or write disturb error causing multiple > bit flips in the empty space of a LEB (not in the common header node > magic number location). I believe this type of error is very common > with MLC (since ECC will generally handle bit flips in non-FF areas.) > > LEB: | good nodes | 0xFFs | bit flip | 0xFFs | bit flip | 0xFFs| What will happen if I write to the space which goes after "good nodes" and contain bit-flips? If I fill that space with UBIFS nodes, will they be all-right? Is it safe? Because this is what UBIFS will do. ... snip ... > However, because of the nature of the "corruption", if > ubifs_recover_leb() ignores is_last_write() result and instead calls > clean_buf() and sets need_clean = 1, then fix_unclean_leb() ultimately > fixes the bit flip via the ubi_leb_change() call and without data > loss. We can do this, but if your flash is so weak that you can never rely on empty space being 0xFFed, then you are screwed anyway. Because UBIFS will write nodes to other LEBs with empty space damaged by bit-flips. With such weak flash we'd need to do much more than the patch you've sent. This is why I think your change is a hack which hides the problem, not fixes. And BTW, can you confirm that with your patch your system can survive stress tests and good power-cut testing? Did you do this, can you describe how you stress your system? If the answer is "yes", then I doubt the reason for those bit-flips is read-disturb. If "yes", then I really suspect the reason for bit-flips is somewhat similar to the unstable bits problem reported by Matthieu CASTET and discussed here. His problem is SLC-specific. The idea of the issue is that if you cut the power just after you started writing to the NAND page X, you then may find that it becomes unstable. You can sometimes read all 0xFFs from there, some-times have bit-flips, sometimes even hard ECC errors. UBIFS does not currently deal with this issue and I'm going to work on it as soon as I have time. UBI also needs fixes to deal with this issue. I did start doing some work some time ago and documented the issue in my UBI development tree: http://git.infradead.org/ubi-2.6.git/blobdiff/0d3d5b3db2e2a0c6eed40714a3e9505031769508..a3353e9ebd42c710396ed6e8db0bda0a600bb1a5:/drivers/mtd/ubi/scan.c So, probably in case of MLC the unstable bit problem becomes something more severe. Thus, if you'd invest time and more properly explore the nature of these bit-flips you have, if you experimented properly and if you could give exact description about how they appeared, not just "I guess these are read-disturbs", then we can think how to really fix the issue. Thanks! -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2011-02-06 14:59 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-06-23 17:04 ubifs_scan() error handling twebb 2010-07-13 4:28 ` Artem Bityutskiy 2010-07-13 4:32 ` Artem Bityutskiy 2011-01-28 17:13 ` twebb 2011-01-29 17:38 ` Artem Bityutskiy 2011-02-01 1:26 ` twebb 2011-02-06 14:58 ` Artem Bityutskiy
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).