From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from a.ns.miles-group.at ([95.130.255.143] helo=radon.swed.at)
 by bombadil.infradead.org with esmtps (Exim 4.80.1 #2 (Red Hat Linux))
 id 1Xm5DG-0006Zh-8m
 for linux-mtd@lists.infradead.org; Wed, 05 Nov 2014 18:21:47 +0000
Message-ID: <545A6AA0.8050901@nod.at>
Date: Wed, 05 Nov 2014 19:21:20 +0100
From: Richard Weinberger
MIME-Version: 1.0
To: Scott Branden, Richard Weinberger
Subject: Re: suspect UBIFS async operations causing issues during reboot
References: <5459E090.1010300@broadcom.com> <545A64CF.20101@broadcom.com>
In-Reply-To: <545A64CF.20101@broadcom.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Cc: "linux-mtd@lists.infradead.org"
List-Id: Linux MTD discussion mailing list

Hi!

On 05.11.2014 at 18:56, Scott Branden wrote:
> Hi Richard,
>
> Thanks for the feedback. Comments inline.
>
> On 14-11-05 01:22 AM, Richard Weinberger wrote:
>> On Wed, Nov 5, 2014 at 9:32 AM, Scott Branden wrote:
>>> We are doing reboot testing with UBIFS on the 3.10 kernel with a new chipset
>>> we are working on.
>>>
>>> Over thousands of reboots we eventually find that the NAND has uncorrectable
>>> ECC errors reported on a random page when it is mounted.
>>>
>>> We have found the problem is that a NAND erase operation is in progress when
>>> the reboot occurs. Since the NAND is in the middle of the erase operation
>>> the page is mostly FF with some random bits not erased when the reboot
>>> occurs.
>>>
>>> We suspect the problem is the asynchronous nature of the UBIFS operations.
>>> Perhaps the small write buffer that can take 3-5 seconds to be written or
>>> some other operation occurring in UBI/UBIFS? I don't think the shutdown of
>>> the filesystem is dealing with all the threads properly.
>>
>> And what about powercuts?
> Powercuts would exhibit the exact same behaviour as we are observing: the
> erase is interrupted by loss of power, so the NAND block being erased would
> be in a partially erased state. Powercuts have little to do with the reboot
> sequence I am describing.
>
>> UBI/UBIFS was designed to survive powercuts.
> Yes, this does not cause UBIFS to fail to survive the powercut. It does
> cause blocks to not be erased properly.

Makes sense.

> The block that didn't finish erasing is uncorrectable on next boot-up:
>
> [ 1.330000] UBI: attaching mtd7 to ubi0
> [ 2.000000] iproc_nand 18046000.nand: uncorrectable error at 0x18700000
>
> The issue is that these blocks shouldn't be corrupted in the first place if
> UBI/UBIFS shuts down properly.
>
>> If your NAND shows strange issues even after a clean reboot something nasty is
>> going on. Does your driver pass all UBI/MTD tests?
>>
> We are in the process of running the MTD tests. But this appears to have
> nothing to do with whether the driver is buggy or not. The NAND driver will
> do what it is told to do. If it is told to erase a block it will erase a
> block. It can't control whether the system reboots in the middle of this
> operation.
>
> This appears to be a UBI/UBIFS issue. UBI/UBIFS operations are still going
> on after the filesystem is unmounted. The shutdown process completes and a
> reboot happens. My guess is these operations are due to the asynchronous
> threads of UBI/UBIFS not being handled properly during the shutdown process?
>
> I have found other people have reported unexplained flash corruption. We
> backported this to the 3.10 kernel, which solved most of the flash
> corruption issues:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/super.c?id=807612db2f9940b9fa6deaef054eb16d51bd3e00
>
> The only remaining flash corruption issue is due to the described issue of
> a reboot happening in the middle of an erase cycle.

You can verify your hypothesis easily.
Add a printk() to ubi_detach_mtd_dev().
This function shuts down UBI and also the background thread which does all
erase work.

Thanks,
//richard
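
For readers following the thread, the ordering question can be illustrated with a small userspace model. This is only a sketch, not kernel code: every `toy_*` name, the 16-byte block and the byte-by-byte erase are hypothetical, and the model assumes that the detach path lets an in-flight erase finish before returning. The `fprintf()` in `toy_detach()` plays the role of the suggested printk(): if that message never shows up before the "reboot", the detach path was not reached and a partially erased block is the expected result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16   /* hypothetical toy erase-block size in bytes */

/* Toy model: one erase block plus one pending background erase. */
struct toy_ubi {
    unsigned char block[TOY_BLOCK_SIZE];
    int erased_bytes;       /* progress of the in-flight erase */
    bool erase_pending;
    bool detached;
};

void toy_init(struct toy_ubi *u)
{
    memset(u->block, 0xA5, sizeof(u->block));   /* stale data to erase */
    u->erased_bytes = 0;
    u->erase_pending = true;
    u->detached = false;
}

/* One slice of background erase work (real NAND erases are not byte-wise). */
void toy_bg_step(struct toy_ubi *u)
{
    if (!u->erase_pending)
        return;
    assert(u->erased_bytes < TOY_BLOCK_SIZE);
    u->block[u->erased_bytes++] = 0xFF;
    if (u->erased_bytes == TOY_BLOCK_SIZE)
        u->erase_pending = false;
}

/* Stand-in for the detach path: log first, then let pending work finish. */
void toy_detach(struct toy_ubi *u)
{
    fprintf(stderr, "toy_detach: called, erase_pending=%d\n",
            (int)u->erase_pending);
    while (u->erase_pending)
        toy_bg_step(u);
    u->detached = true;
}

/* True if every byte of the block reads back as erased (0xFF). */
bool toy_block_clean(const struct toy_ubi *u)
{
    for (int i = 0; i < TOY_BLOCK_SIZE; i++)
        if (u->block[i] != 0xFF)
            return false;
    return true;
}
```

Calling `toy_bg_step()` a few times and then checking `toy_block_clean()` without calling `toy_detach()` models a reboot in the middle of an erase: the block is mostly 0xFF with stale bytes left over. Calling `toy_detach()` first models a shutdown that reaches the detach path.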