From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mail-gw1-out.broadcom.com ([216.31.210.62])
 by bombadil.infradead.org with esmtp (Exim 4.80.1 #2 (Red Hat Linux))
 id 1Xm9Rb-00079G-Sq
 for linux-mtd@lists.infradead.org; Wed, 05 Nov 2014 22:52:52 +0000
Message-ID: <545AAA2B.8090007@broadcom.com>
Date: Wed, 5 Nov 2014 14:52:27 -0800
From: Scott Branden
MIME-Version: 1.0
To: Richard Weinberger, Richard Weinberger
Subject: Re: suspect UBIFS async operations causing issues during reboot
References: <5459E090.1010300@broadcom.com> <545A64CF.20101@broadcom.com> <545A6AA0.8050901@nod.at>
In-Reply-To: <545A6AA0.8050901@nod.at>
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "linux-mtd@lists.infradead.org"
List-Id: Linux MTD discussion mailing list

On 14-11-05 10:21 AM, Richard Weinberger wrote:
> Hi!
>
> On 05.11.2014 at 18:56, Scott Branden wrote:
>> Hi Richard,
>>
>> Thanks for the feedback. Comments inline.
>>
>> On 14-11-05 01:22 AM, Richard Weinberger wrote:
>>> On Wed, Nov 5, 2014 at 9:32 AM, Scott Branden wrote:
>>>> We are doing reboot testing with UBIFS on the 3.10 kernel with a
>>>> new chipset we are working on.
>>>>
>>>> Over thousands of reboots we eventually find that the NAND has
>>>> uncorrectable ECC errors reported on a random page when it is
>>>> mounted.
>>>>
>>>> We have found the problem is that a NAND erase operation is in
>>>> progress when the reboot occurs. Since the NAND is in the middle of
>>>> the erase operation, the page is mostly FF with some random bits
>>>> not erased when the reboot occurs.
>>>>
>>>> We suspect the problem is the asynchronous nature of the UBIFS
>>>> operations. Perhaps the small write buffer that can take 3-5
>>>> seconds to be written, or some other operation occurring in
>>>> UBI/UBIFS? I don't think the shutdown of the filesystem is dealing
>>>> with all the threads properly.
>>>
>>> And what about powercuts?
>> Powercuts would exhibit the exact same behaviour as we are observing:
>> the erase is interrupted by loss of power, so the NAND block being
>> erased would be left in a partially erased state. Powercuts have
>> little to do with the reboot sequence I am describing.
>>
>>> UBI/UBIFS was designed to survive powercuts.
>> Yes, this does not cause UBIFS to fail to survive the powercut. It
>> does cause blocks to not be erased properly.
>
> Makes sense.
>
>> The block that didn't finish erasing is uncorrectable on the next
>> boot-up:
>>
>> [ 1.330000] UBI: attaching mtd7 to ubi0
>> [ 2.000000] iproc_nand 18046000.nand: uncorrectable error at 0x18700000
>>
>> The issue is that these blocks shouldn't be corrupted in the first
>> place if UBI/UBIFS shuts down properly.
>>
>>> If your NAND shows strange issues even after a clean reboot
>>> something nasty is going on. Does your driver pass all UBI/MTD
>>> tests?
>>>
>> We are in the process of running the MTD tests. But this appears to
>> have nothing to do with whether the driver is buggy or not. The NAND
>> driver will do what it is told to do. If it is told to erase a block,
>> it will erase a block. It can't control whether the system reboots in
>> the middle of this operation.
>>
>> This appears to be a UBI/UBIFS issue. UBI/UBIFS operations are still
>> going on after the filesystem is unmounted. The shutdown process
>> completes and a reboot happens. My guess is that these operations
>> come from the asynchronous threads of UBI/UBIFS not being handled
>> properly during the shutdown process.
>>
>> I have found other people have reported unexplained flash corruption.
>> We backported this to the 3.10 kernel, which solved most of the
>> flash corruption issues:
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/super.c?id=807612db2f9940b9fa6deaef054eb16d51bd3e00
>>
>> The only remaining flash corruption issue is the one described above:
>> a reboot happening in the middle of an erase cycle.
>
> You can verify your hypothesis easily.
> Add a printk() to ubi_detach_mtd_dev(). This function shuts down UBI
> and also the background thread which does all erase work.

Hi Richard,

The printk never happens. As far as I can find, ubi_detach_mtd_dev() is
only called from ubi_exit(). But ubi_exit() is only called if UBI is
built as a module...

static void __exit ubi_exit(void)
{
        int i;

        for (i = 0; i < UBI_MAX_DEVICES; i++)
                if (ubi_devices[i]) {
                        mutex_lock(&ubi_devices_mutex);
                        ubi_detach_mtd_dev(ubi_devices[i]->ubi_num, 1);
                        mutex_unlock(&ubi_devices_mutex);
                }
        ubi_debugfs_exit();
        kmem_cache_destroy(ubi_wl_entry_slab);
        misc_deregister(&ubi_ctrl_cdev);
        class_remove_file(ubi_class, &ubi_version);
        class_destroy(ubi_class);
}
module_exit(ubi_exit);

>
> Thanks,
> //richard
>
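To make the observation above concrete, here is a minimal userspace
sketch in C. It is a toy model, not kernel code and not a proposed
patch. Every name in it (model_dev, model_detach_all, model_reboot and
so on) is hypothetical and only mirrors the loop shape of ubi_exit().
The assumption it illustrates: if nothing on the reboot path calls the
detach routine, any erase still queued for the background thread can be
cut off by the reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_DEVICES 4     /* hypothetical stand-in for UBI_MAX_DEVICES */

/* Hypothetical model of one attached UBI device. */
struct model_dev {
    bool attached;
    bool erase_pending;         /* background thread still has an erase queued */
};

static struct model_dev model_devices[MODEL_MAX_DEVICES];

/* Attach device n with an erase already queued for the background thread. */
static void model_attach_with_erase(int n)
{
    assert(n >= 0 && n < MODEL_MAX_DEVICES);
    model_devices[n].attached = true;
    model_devices[n].erase_pending = true;
}

/* Stand-in for ubi_detach_mtd_dev(): finish background work, then detach. */
static void model_detach(int n)
{
    assert(n >= 0 && n < MODEL_MAX_DEVICES);
    model_devices[n].erase_pending = false;
    model_devices[n].attached = false;
}

/* Same loop shape as ubi_exit(): detach every attached device. */
static void model_detach_all(void)
{
    for (int i = 0; i < MODEL_MAX_DEVICES; i++)
        if (model_devices[i].attached)
            model_detach(i);
}

/* A hook that runs on the reboot path; NULL means nothing is registered. */
typedef void (*reboot_hook_t)(void);

/*
 * Model a reboot. Returns the number of devices that still had an erase
 * pending at the moment of reset, i.e. blocks that may be left half-erased.
 */
static int model_reboot(reboot_hook_t hook)
{
    int interrupted = 0;

    if (hook)
        hook();
    for (int i = 0; i < MODEL_MAX_DEVICES; i++)
        if (model_devices[i].erase_pending)
            interrupted++;
    return interrupted;
}
```

Calling model_reboot(NULL) corresponds to the built-in case described
above, where the exit routine never runs. Passing model_detach_all
models some shutdown hook invoking the detach loop before the reset.
Whether a real kernel hook would be the right fix is a separate
question that this sketch does not answer.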