* suspect UBIFS async operations causing issues during reboot
From: Scott Branden @ 2014-11-05 8:32 UTC
To: linux-mtd

We are doing reboot testing with UBIFS on the 3.10 kernel with a new chipset we are working on.

Over thousands of reboots we eventually find that the NAND has uncorrectable ECC errors reported on a random page when it is mounted.

We have found the problem is that a NAND erase operation is in progress when the reboot occurs. Since the NAND is in the middle of the erase operation, the page is mostly FF with some random bits not erased when the reboot occurs.

We suspect the problem is the asynchronous nature of the UBIFS operations. Perhaps the small write buffer that can take 3-5 seconds to be written, or some other operation occurring in UBI/UBIFS? I don't think the shutdown of the filesystem is dealing with all the threads properly.

Log below, with printks added to the iproc_nand driver, showing erase operations in progress when "Restarting system." happens.

Stopped Setup Virtual Console.
Stopping Apply Kernel Variables...
Stopped Apply Kernel Variables.
Starting Notify Audit System and Update UTMP about System Shutdown...
Stopping Runtime Directory...
Stopping Remount API VFS...
Stopped Remount API VFS.
Stopping Remount Root FS...
Stopped Remount Root FS.
Stopping Collect Read-Ahead Data...
Stopped Collect Read-Ahead Data.
Stopping Media Directory...
[ 18.370000] systemd[1]: Unit systemd-readahead-collect.service entered failed state.
Started Console System Reboot Logging.
Stopped Runtime Directory.
Stopped Media Directory.
[ 18.490000] systemd[1]: Shutting down.
Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Unmounting file systems.
[ 18.530000] iproc_nand_cmdfunc: cmd 0x60 addr 0x14a40000
[ 18.540000] iproc_nand_waitfunc: native cmd 8 intfc status 0xc00000e0
[ 18.550000] UBIFS: background thread "ubifs_bgt0_0" stops
Disabling swaps.
Detaching loop devices.
Detaching DM devices.
[ 18.560000] iproc_nand_cmdfunc: cmd 0x60 addr 0x18680000
[ 18.570000] iproc_nand_waitfunc: native cmd 8 intfc status 0xc00000e0
[ 18.580000] Restarting system.
[ 18.580000] iproc_nand_cmdfunc: cmd 0x60 addr 0x18700000
<REBOOT happens here with NAND ERASE COMMAND in progress corrupting 0x18700000 NAND Addresses!>

NAND corruption only happens when an erase operation is in progress at the moment the system restarts.
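A minimal userspace sketch, not taken from the thread, of what "mostly FF with some random bits not erased" looks like to software. An erased NAND page reads back as all 0xFF, so counting the bits that read as 0 gives a rough measure of how far an erase got. The function names, the enum, and the `max_stray_zero_bits` threshold are all hypothetical, and a small count of zero bits can also be ordinary bitflips in a properly erased page, so this is only a heuristic illustration and not what the iproc_nand driver or UBI actually does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of a page read back from NAND. */
enum page_state {
    PAGE_ERASED,            /* every bit reads as 1 (all bytes 0xFF) */
    PAGE_PARTIALLY_ERASED,  /* mostly 0xFF with a few stray 0 bits */
    PAGE_PROGRAMMED         /* too many 0 bits to be an erased page */
};

/* Count the bits that read as 0 in the buffer. */
unsigned int count_zero_bits(const uint8_t *buf, size_t len)
{
    unsigned int zeros = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t inverted = (uint8_t)~buf[i]; /* 1 bits now mark unerased bits */

        while (inverted) {
            zeros += inverted & 1u;
            inverted >>= 1;
        }
    }
    return zeros;
}

/*
 * Classify a page by its number of 0 bits. max_stray_zero_bits is an
 * assumed, illustrative threshold, not a value from any datasheet.
 */
enum page_state classify_page(const uint8_t *buf, size_t len,
                              unsigned int max_stray_zero_bits)
{
    unsigned int zeros = count_zero_bits(buf, len);

    if (zeros == 0)
        return PAGE_ERASED;
    if (zeros <= max_stray_zero_bits)
        return PAGE_PARTIALLY_ERASED;
    return PAGE_PROGRAMMED;
}
```

A page in the middle state is the troublesome one: it is neither valid data nor cleanly erased, which is consistent with the uncorrectable ECC error reported on the next attach.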
* Re: suspect UBIFS async operations causing issues during reboot
From: Richard Weinberger @ 2014-11-05 9:22 UTC
To: Scott Branden
Cc: linux-mtd@lists.infradead.org

On Wed, Nov 5, 2014 at 9:32 AM, Scott Branden <sbranden@broadcom.com> wrote:
> We are doing reboot testing with UBIFS on the 3.10 kernel with a new chipset
> we are working on.
>
> Over 1000's of reboots we eventually find that the NAND has uncorrectable
> ECC errors reported on a random page when it is mounted.
>
> We have found the problem is that a NAND erase operation is in progress when
> the reboot occurs. Since the NAND is in the middle of the erase operation
> the page is mostly FF with some random bits not erased when the reboot
> occurs.
>
> We suspect the problem is the asynchronous nature of the UBIFS operations.
> Perhaps the small write buffer that can take 3-5 seconds to be written or
> some other operation occurring in UBI/UBIFS? I don't think the shutdown of
> the filesystem is dealing with all the threads properly.

And what about powercuts?
UBI/UBIFS was designed to survive powercuts.

If your NAND shows strange issues even after a clean reboot something nasty is going on. Does your driver pass all UBI/MTD tests?

--
Thanks,
//richard
* Re: suspect UBIFS async operations causing issues during reboot
From: Scott Branden @ 2014-11-05 17:56 UTC
To: Richard Weinberger
Cc: linux-mtd@lists.infradead.org

Hi Richard,

Thanks for the feedback. Comments inline.

On 14-11-05 01:22 AM, Richard Weinberger wrote:
> And what about powercuts?

Powercuts would exhibit the exact same behaviour as we are observing: the erase is interrupted by loss of power, so the NAND block being erased would be left in a partially erased state. Powercuts have little to do with the reboot sequence I am describing.

> UBI/UBIFS was designed to survive powercuts.

Yes, this does not cause UBIFS to fail to survive the powercut. It does cause blocks to not be erased properly. The block that didn't finish erasing is uncorrectable on the next boot-up:

[ 1.330000] UBI: attaching mtd7 to ubi0
[ 2.000000] iproc_nand 18046000.nand: uncorrectable error at 0x18700000

The issue is that these blocks shouldn't be corrupted in the first place if UBI/UBIFS shuts down properly.

> If your NAND shows strange issues even after a clean reboot something nasty is
> going on. Does your driver pass all UBI/MTD test?

We are in the process of running the MTD tests. But this appears to have nothing to do with whether the driver is buggy or not. The NAND driver will do what it is told to do. If it is told to erase a block, it will erase a block. It can't control whether the system reboots in the middle of this operation.

This appears to be a UBI/UBIFS issue. UBI/UBIFS operations are still going on after the filesystem is unmounted. The shutdown process completes and a reboot happens. My guess is that these operations are due to the asynchronous threads of UBI/UBIFS not being handled properly during the shutdown process?

I have found that other people have reported unexplained flash corruption. We backported this commit to the 3.10 kernel, which solved most of the flash corruption issues:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/super.c?id=807612db2f9940b9fa6deaef054eb16d51bd3e00

The only remaining flash corruption issue is the one described: a reboot happening in the middle of an erase cycle.
* Re: suspect UBIFS async operations causing issues during reboot
From: Richard Weinberger @ 2014-11-05 18:21 UTC
To: Scott Branden, Richard Weinberger
Cc: linux-mtd@lists.infradead.org

Hi!

On 05.11.2014 at 18:56, Scott Branden wrote:
>> UBI/UBIFS was designed to survive powercuts.
> Yes, this does not cause UBIFS to fail to survive the powercut. It does cause blocks to not be erased properly.

Makes sense.

> This appears to be a UBI/UBIFS issue. UBI/UBIFS operations are still going on after the filesystem is unmounted. The shutdown process completes and a reboot happens. My guess is
> these operations are due to the asynchronous threads of UBI/UBIFS not being handled properly during the shutdown process?
>
> The only remaining flash corruption issue is due to the described issue of reboot happening in the middle of an erase cycle.

You can verify your hypothesis easily. Add a printk() to ubi_detach_mtd_dev(). This function shuts down UBI and also the background thread which does all erase work.

Thanks,
//richard
* Re: suspect UBIFS async operations causing issues during reboot
From: Scott Branden @ 2014-11-05 22:52 UTC
To: Richard Weinberger, Richard Weinberger
Cc: linux-mtd@lists.infradead.org

On 14-11-05 10:21 AM, Richard Weinberger wrote:
> You can verify your hypothesis easily. Add a printk() to ubi_detach_mtd_dev(). This function shuts down UBI and also the background thread which does
> all erase work.

Hi Richard,

The printk never happens.

I only find ubi_detach_mtd_dev can be called by ubi_exit. But ubi_exit is only called if it is a module...

static void __exit ubi_exit(void)
{
	int i;

	for (i = 0; i < UBI_MAX_DEVICES; i++)
		if (ubi_devices[i]) {
			mutex_lock(&ubi_devices_mutex);
			ubi_detach_mtd_dev(ubi_devices[i]->ubi_num, 1);
			mutex_unlock(&ubi_devices_mutex);
		}
	ubi_debugfs_exit();
	kmem_cache_destroy(ubi_wl_entry_slab);
	misc_deregister(&ubi_ctrl_cdev);
	class_remove_file(ubi_class, &ubi_version);
	class_destroy(ubi_class);
}
module_exit(ubi_exit);
* Re: suspect UBIFS async operations causing issues during reboot
From: Scott Branden @ 2014-11-06 21:56 UTC
To: Richard Weinberger, Richard Weinberger
Cc: linux-mtd@lists.infradead.org

It looks like the erase happening in the middle of reboot was uncovered in 2009 and never addressed properly?

https://lkml.org/lkml/2009/6/9/16
https://lkml.org/lkml/2010/2/12/144

Was there a proper resolution to this issue?

On 14-11-05 02:52 PM, Scott Branden wrote:
> The printk never happens.
>
> I only find ubi_detach_mtd_dev can be called by ubi_exit. But ubi_exit
> is only called if it is a module...
* Re: suspect UBIFS async operations causing issues during reboot
From: Richard Weinberger @ 2014-11-07 8:45 UTC
To: Scott Branden
Cc: linux-mtd@lists.infradead.org

On 06.11.2014 at 22:56, Scott Branden wrote:
> It looks like the erase happening in the middle of reboot was uncovered in 2009 and never addressed properly?
>
> https://lkml.org/lkml/2009/6/9/16
> https://lkml.org/lkml/2010/2/12/144
>
> Was there a proper resolution to this issue?

Did you read the threads you've posted?

There are two answers:
https://lkml.org/lkml/2010/2/12/143
https://lkml.org/lkml/2010/2/12/144

Thanks,
//richard
* Re: suspect UBIFS async operations causing issues during reboot
From: Scott Branden @ 2014-11-07 17:31 UTC
To: Richard Weinberger
Cc: linux-mtd@lists.infradead.org

On 14-11-07 12:45 AM, Richard Weinberger wrote:
> Did you read the threads you've posted?
>
> There are two answers:
> https://lkml.org/lkml/2010/2/12/143

Yes, there is no hardware solution to a reset happening in the middle of an erase operation to NAND.

> https://lkml.org/lkml/2010/2/12/144

Yes, it appears nobody has implemented a reboot notifier in the past 5 years to solve this problem for NAND.

We are experimenting with a notifier in both UBI and MTD, which gives a much cleaner shutdown than a notifier at the MTD level alone. The UBI reboot notifier needs a higher priority than the flash device's notifier, so that UBI is shut down first. Overnight results are looking promising.
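The ordering requirement above (UBI stops before the flash device) can be illustrated with a small userspace model of a priority-ordered notifier chain. In Linux notifier chains, entries with a higher priority value are called first; everything else here is an assumption made for illustration: `register_reboot_notifier_model`, `run_reboot_chain`, `record_shutdown`, and the priority values 1 and 0 are hypothetical and are not the patch Scott was testing.

```c
#include <stddef.h>

#define MAX_NOTIFIERS 8
#define MAX_LOG       8

/* Simplified stand-in for the kernel's struct notifier_block. */
struct notifier {
    const char *name;
    int priority;                     /* higher value is called first */
    void (*call)(const char *name);
};

static struct notifier chain[MAX_NOTIFIERS];
static size_t chain_len;

/* Records the order in which subsystems were told to shut down. */
const char *shutdown_log[MAX_LOG];
size_t shutdown_log_len;

/* Insert a notifier, keeping the chain sorted by descending priority. */
int register_reboot_notifier_model(struct notifier n)
{
    size_t pos;

    if (chain_len == MAX_NOTIFIERS)
        return -1;

    pos = chain_len;
    while (pos > 0 && chain[pos - 1].priority < n.priority) {
        chain[pos] = chain[pos - 1];  /* shift lower priorities back */
        pos--;
    }
    chain[pos] = n;
    chain_len++;
    return 0;
}

/* Example callback: note that this subsystem has been quiesced. */
void record_shutdown(const char *name)
{
    if (shutdown_log_len < MAX_LOG)
        shutdown_log[shutdown_log_len++] = name;
}

/* Walk the chain in priority order, as a reboot would. */
void run_reboot_chain(void)
{
    for (size_t i = 0; i < chain_len; i++)
        chain[i].call(chain[i].name);
}
```

Registering an "mtd" notifier with priority 0 and a "ubi" notifier with priority 1, in either order, makes `run_reboot_chain()` call "ubi" first. UBI's background thread therefore stops issuing erases before the MTD layer is asked to quiesce the device.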
* Re: suspect UBIFS async operations causing issues during reboot
From: Richard Weinberger @ 2014-11-09 10:20 UTC
To: Scott Branden
Cc: linux-mtd@lists.infradead.org

On 07.11.2014 at 18:31, Scott Branden wrote:
>> There are two answers:
>> https://lkml.org/lkml/2010/2/12/143
> Yes, there is no hardware solution to a reset happening in the middle of an erase operation to NAND.

Well, I agree with David that anything we do in software will only hide the real problem or trim down the window.

Thanks,
//richard
* Re: suspect UBIFS async operations causing issues during reboot
From: Scott Branden @ 2014-11-10 5:10 UTC
To: Richard Weinberger
Cc: linux-mtd@lists.infradead.org

On 14-11-09 02:20 AM, Richard Weinberger wrote:
> Well, I agree with David that anything we do in software will only hide the real problem
> or trim down the window.

Hi Richard,

Currently the NAND does not shut down in a clean manner for a reboot operation. This is due to the asynchronous ubi_thread making flash erase calls. Unmount is already done properly in UBI and shuts down cleanly. Reboot is not done in a clean manner, as there is no reboot_notifier to handle the situation.

This is not hiding a real problem. It is just shutting down UBI properly rather than pulling the power from it in the middle of operations.

In addition to this, a reboot_notifier needs to be added at the MTD level to shut it down properly as well.

This is not trimming down a window. It is having the drivers shut down properly so they do not look like a power failure to the NAND device.

There is no solution to the power failure: it will corrupt pages in the middle of erasure. And you do handle this in UBI/UBIFS. But why corrupt other erase pages unnecessarily when all that needs to be done is to shut down the drivers properly? I don't know what you are agreeing with David about. It is not making a window smaller. It is changing the functionality so that the UBI and MTD drivers are shut down cleanly in reboot situations. Right now, they are not shut down at all in these situations.
* Re: suspect UBIFS async operations causing issues during reboot
From: Brian Norris @ 2014-11-26 8:17 UTC
To: Scott Branden
Cc: Ricard Wanderlof, Richard Weinberger, linux-mtd@lists.infradead.org

On Sun, Nov 09, 2014 at 09:10:03PM -0800, Scott Branden wrote:
> On 14-11-09 02:20 AM, Richard Weinberger wrote:
> >Well, I agree with David that anything we do in software will only hide the real problem
> >or trim down the window.
> Hi Richard,
>
> Currently the NAND does not shut down in a clean manner for a reboot
> operation. This is due to the asynchronous ubi_thread make flash
> erase calls. unmount is done properly in ubi already and cleanly
> shuts down. reboot is not done in a clean manner as there is no
> reboot_notifier to handle the situation.
>
> This is not hiding a real problem. It is just shutting down ubi
> properly rather than pulling the power from it in the middle of
> operations.
>
> In addition to this - a reboot_notifier needs to be added at the mtd
> level to shut it down properly as well.
>
> This is not trimming down a window. It is having the drivers shut
> down properly so they do not look like a power failure to the NAND
> device.
>
> There is no solution to the power failure - it will corrupt pages in
> the middle of erasure. And you do handle this in UBI/UBIFS. But
> why corrupt other erase pages unnecessarily when all that needs to
> be done is shut down the drivers properly. I don't know what you
> are agreeing with David with? It is not making a window smaller.
> It is changing the functionality so that the UBI and MTD drivers are
> shut down cleanly in reboot situations. Right now, they are not
> shut down at all in these situations.

I agree with Scott's statements. While it's fine to talk about how all layers (from bootloader to UBIFS) should be able to handle a power cut in the midst of an erase, that does *not* mean that we should intentionally deny the chance to shut down cleanly.

AFAICT, Scott's not trying to work around any unsound reset behaviors (in UBIFS or in his bootloader); he's just trying to shut things down gracefully, just as we would try to terminate processes, sync file systems, etc., rather than just cutting power on reboot.

Brian
* Re: suspect UBIFS async operations causing issues during reboot
From: Richard Weinberger @ 2014-11-26 8:30 UTC
To: Brian Norris, Scott Branden
Cc: Ricard Wanderlof, linux-mtd@lists.infradead.org

On 26.11.2014 at 09:17, Brian Norris wrote:
> I agree with Scott's statements. While it's fine to talk about how all
> layers (from bootloader to UBIFS) should be able to handle a power cut
> in the midst of an erase, that does *not* mean that we should
> intentionally deny the chance to shut down cleanly.
>
> AFAICT, Scott's not trying to work around any unsound reset behaviors
> (in UBIFS or in his bootloader); he's just trying to shut things down
> gracefully, just as we would try to terminate processes, sync file
> systems, etc., rather than just cutting power on reboot.

If there is a solution which makes Artem and David happy, I'm perfectly fine. :)

Thanks,
//richard
* Re: suspect UBIFS async operations causing issues during reboot
From: Brian Norris @ 2014-11-26 9:25 UTC
To: Richard Weinberger
Cc: Ricard Wanderlof, linux-mtd@lists.infradead.org, Scott Branden

On Wed, Nov 26, 2014 at 09:30:17AM +0100, Richard Weinberger wrote:
> If there is a solution which makes Artem and David happy, I'm perfectly fine. :)

Artem already commented [1]:

"And generally, I am not opposed to this solution in upstream too, if it works for everyone."

And he seemed agreeable to handling it in MTD (not UBI) on the thread from a few years ago too.

David hasn't been too vocal recently, so I wouldn't bet on getting a comment out of him. I'd be happy to be proven wrong.

Brian

[1] http://lists.infradead.org/pipermail/linux-mtd/2014-November/056500.html
* Re: suspect UBIFS async operations causing issues during reboot 2014-11-26 8:17 ` Brian Norris 2014-11-26 8:30 ` Richard Weinberger @ 2014-11-27 19:07 ` Scott Branden 1 sibling, 0 replies; 19+ messages in thread From: Scott Branden @ 2014-11-27 19:07 UTC (permalink / raw) To: Brian Norris Cc: Ricard Wanderlof, Richard Weinberger, linux-mtd@lists.infradead.org On 14-11-26 12:17 AM, Brian Norris wrote: > On Sun, Nov 09, 2014 at 09:10:03PM -0800, Scott Branden wrote: >> On 14-11-09 02:20 AM, Richard Weinberger wrote: >>> Well, I agree with David that anything we do in software will only hide the real problem >>> or trim down the window. >> Hi Richard, >> >> Currently the NAND does not shut down in a clean manner for a reboot >> operation. This is due to the asynchronous ubi_thread make flash >> erase calls. unmount is done properly in ubi already and cleanly >> shuts down. reboot is not done in a clean manner as there is no >> reboot_notifier to handle the situation. >> >> This is not hiding a real problem. It is just shutting down ubi >> properly rather than pulling the power from it in the middle of >> operations. >> >> In addition to this - a reboot_notifier needs to be added at the mtd >> level to shut it down properly as well. >> >> This is not trimming down a window. It is having the drivers shut >> down properly so they do not look like a power failure to the NAND >> device. >> >> There is no solution to the power failure - it will corrupt pages in >> the middle of erasure. And you do handle this in UBI/UBIFS. But >> why corrupt other erase pages unnecessarily when all that needs to >> be done is shut down the drivers properly. I don't know what you >> are agreeing with David with? It is not making a window smaller. >> It is changing the functionality so that the UBI and MTD drivers are >> shut down cleanly in reboot situations. Right now, they are not >> shut down at all in these situations. > > I agree with Scott's statements. 
While it's fine to talk about how all > layers (from bootloader to UBIFS) should be able to handle a power cut > in the midst of an erase, that does *not* mean that we should > intentionally deny the chance to shut down cleanly. > > AFAICT, Scott's not trying to work around any unsound reset behaviors > (in UBIFS or in his bootloader); he's just trying to shut things down > gracefully, just as we would try to terminate processes, sync file > systems, etc., rather than just cutting power on reboot. Yes - I just want to gracefully shut down the system. Brian's proposed untested patch is the most generic approach. We'll have to work on testing it and get back to you. Thanks, Scott > > Brian > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: suspect UBIFS async operations causing issues during reboot 2014-11-09 10:20 ` Richard Weinberger 2014-11-10 5:10 ` Scott Branden @ 2014-11-10 8:44 ` Ricard Wanderlof 2014-11-10 9:08 ` Richard Weinberger 1 sibling, 1 reply; 19+ messages in thread From: Ricard Wanderlof @ 2014-11-10 8:44 UTC (permalink / raw) To: Richard Weinberger; +Cc: linux-mtd@lists.infradead.org, Scott Branden On Sun, 9 Nov 2014, Richard Weinberger wrote: > Am 07.11.2014 um 18:31 schrieb Scott Branden: > > On 14-11-07 12:45 AM, Richard Weinberger wrote: > >> Am 06.11.2014 um 22:56 schrieb Scott Branden: > >>> It looks like the erase happening in the middle of reboot was uncovered in 2009 and never addressed properly? > >>> > >>> https://lkml.org/lkml/2009/6/9/16 > >>> https://lkml.org/lkml/2010/2/12/144 > >>> > >>> Was there a proper resolution to this issue? > >> > >> Did you read the threads you've posted? > >> > >> There two answers: > >> https://lkml.org/lkml/2010/2/12/143 > > Yes, there is no hardware solution to a reset happening in the middle of an erase operation to NAND. > > Well, I agree with David that anything we do in software will only hide the real problem > or trim down the window. There's something I don't understand here. It could be (and probably will prove to be) my lack of knowledge on the detailed workings of UBI. Back in jffs2 days, erased blocks were so indicated by writing a 'cleanmarker' pattern to the OOB area. Thus, when scanning the flash, if a block was encountered which appeared erased but lacked the cleanmarker, it was re-erased just in case the previous erase was interrupted and therefore did not leave the bits in a properly erased state. With ubifs, cleanmarkers are not used (partly because MLC flashes wouldn't support two writes to the OOB area: one for the cleanmarker and one for the ECC), but there _is_ a header at the start of each PEB. 
Thus the same situation really holds, if a (seemingly) erased PEB is encountered with no EC header, it could be considered the leftover of an unfinished erase operation. I don't know for a fact if (or how) UBI does this though. Of course, an interrupted erase operation could leave a block in a seemingly un-erased state, i.e. the data appears intact (but may not be). But in that case the block would already be superseded by another block (i.e. any potential data would have already been copied to another block with the header info invalidating the old one). So in this case the block would go on an erase list at some point because it is no longer valid. Since interrupted erase seems to be so much of a concern I've obviously missed something above. But I can't figure out what. The only thing that seems relevant among the links above is https://lkml.org/lkml/2010/2/12/144 which indicates that half-erased blocks might cause problems with certain boot loaders, but again, that's a problem with the bootloader, not UBI. /Ricard -- Ricard Wolf Wanderlöf ricardw(at)axis.com Axis Communications AB, Lund, Sweden www.axis.com Phone +46 46 272 2016 Fax +46 46 13 61 30 ^ permalink raw reply [flat|nested] 19+ messages in thread
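The scan-time rule Ricard describes (a PEB without a usable EC header is treated as the leftover of an unfinished erase and erased again) can be sketched in user-space C as follows. This is only a minimal illustration, not UBI's actual attach code: `classify_peb()` and the `peb_state` values are hypothetical names, the check looks only at the 4-byte "UBI#" magic at the start of the block, and the header CRC that real UBI also verifies is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scan result for one physical eraseblock (PEB). */
enum peb_state {
    PEB_HAS_EC_HEADER,   /* header magic found: block is in use        */
    PEB_EMPTY_REERASE,   /* reads as all 0xFF but has no header        */
    PEB_CORRUPT_REERASE  /* neither header nor all 0xFF: partial erase */
};

/* Magic at the start of an erase counter header ("UBI#"). */
static const uint8_t EC_MAGIC[4] = { 'U', 'B', 'I', '#' };

/*
 * Classify a PEB image. Both non-header outcomes end up on the erase
 * list; they are kept apart only to show that a block which merely
 * looks erased is not trusted either.
 */
enum peb_state classify_peb(const uint8_t *peb, size_t len)
{
    if (len >= sizeof(EC_MAGIC) &&
        memcmp(peb, EC_MAGIC, sizeof(EC_MAGIC)) == 0)
        return PEB_HAS_EC_HEADER;

    for (size_t i = 0; i < len; i++)
        if (peb[i] != 0xFF)
            return PEB_CORRUPT_REERASE;

    return PEB_EMPTY_REERASE;
}
```

In this model the missing header plays the role the missing cleanmarker played in jffs2: the absence of a positive "erase completed" mark is what triggers the re-erase, not the presence of visible damage.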
* Re: suspect UBIFS async operations causing issues during reboot 2014-11-10 8:44 ` Ricard Wanderlof @ 2014-11-10 9:08 ` Richard Weinberger 0 siblings, 0 replies; 19+ messages in thread From: Richard Weinberger @ 2014-11-10 9:08 UTC (permalink / raw) To: Ricard Wanderlof; +Cc: linux-mtd@lists.infradead.org, Scott Branden Am 10.11.2014 um 09:44 schrieb Ricard Wanderlof: > > On Sun, 9 Nov 2014, Richard Weinberger wrote: > >> Am 07.11.2014 um 18:31 schrieb Scott Branden: >>> On 14-11-07 12:45 AM, Richard Weinberger wrote: >>>> Am 06.11.2014 um 22:56 schrieb Scott Branden: >>>>> It looks like the erase happening in the middle of reboot was uncovered in 2009 and never addressed properly? >>>>> >>>>> https://lkml.org/lkml/2009/6/9/16 >>>>> https://lkml.org/lkml/2010/2/12/144 >>>>> >>>>> Was there a proper resolution to this issue? >>>> >>>> Did you read the threads you've posted? >>>> >>>> There two answers: >>>> https://lkml.org/lkml/2010/2/12/143 >>> Yes, there is no hardware solution to a reset happening in the middle of an erase operation to NAND. >> >> Well, I agree with David that anything we do in software will only hide the real problem >> or trim down the window. > > There's something I don't understand here. It could be (and probably will > prove to be) my lack of knowledge on the detailed workings of UBI. > > Back in jffs2 days, erased blocks were so indicated by writing a > 'cleanmarker' pattern to the OOB area. Thus, when scanning the flash, if a > block was encountered which appeared erased but lacked the cleanmarker, it > was re-erased just in case the previous erase was interrupted and > therefore did not leave the bits in a properly erased state. > > With ubifs, cleanmarkers are not used (partly because MLC flashes wouldn't > support two writes to the OOB area: one for the cleanmarker and one for > the ECC), but there _is_ a header at the start of each PEB. 
Thus the same > situation really holds, if a (seemingly) erased PEB is encountered with no > EC header, it could be considered the leftover of an unfinished erase > operation. I don't know for a fact if (or how) UBI does this though. > > Of course, an interrupted erase operation could leave a block in a > seemingly un-erased state, i.e. the data appears intact (but may not be). > But in that case the block would already be superseded by another block > (i.e. any potential data would have already been copied to another block > with the header info invalidating the old one). So in this case the block > would go on an erase list at some point because it is no longer valid. > > Since interrupted erase seems to be so much of a concern I've obviously > missed something above. But I can't figure out what. > > The only thing that seems relevant among the links above is > > https://lkml.org/lkml/2010/2/12/144 > > which indicates that half-erased blocks might cause problems with certain > boot loaders, but again, that's a problem with the bootloader, not UBI. Correct. UBI can deal with that, if some component in your "NAND-Chain" does not, it needs fixing. Changing UBI/MTD in a way to hide such issues is not a good solution IMHO. In the old thread the idea was rejected by both the UBI and the MTD maintainer. Thanks, //richard ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: suspect UBIFS async operations causing issues during reboot 2014-11-05 22:52 ` Scott Branden 2014-11-06 21:56 ` Scott Branden @ 2014-11-10 7:44 ` Tanya Brokhman 1 sibling, 0 replies; 19+ messages in thread From: Tanya Brokhman @ 2014-11-10 7:44 UTC (permalink / raw) To: linux-mtd On 11/6/2014 12:52 AM, Scott Branden wrote: > On 14-11-05 10:21 AM, Richard Weinberger wrote: >> Hi! >> >> Am 05.11.2014 um 18:56 schrieb Scott Branden: >>> Hi Richard, >>> >>> Thanks for the feedback. Comments inline. >>> >>> On 14-11-05 01:22 AM, Richard Weinberger wrote: >>>> On Wed, Nov 5, 2014 at 9:32 AM, Scott Branden >>>> <sbranden@broadcom.com> wrote: >>>>> We are doing reboot testing with UBIFS on the 3.10 kernel with a >>>>> new chipset >>>>> we are working on. >>>>> >>>>> Over 1000's of reboots we eventually find that the NAND has >>>>> uncorrectable >>>>> ECC errors reported on a random page when it is mounted. >>>>> >>>>> We have found the problem is that a NAND erase operation is in >>>>> progress when >>>>> the reboot occurs. Since the NAND is in the middle of the erase >>>>> operation >>>>> the page is mostly FF with some random bits not erased when the reboot >>>>> occurs. >>>>> >>>>> We suspect the problem is the asynchronous nature of the UBIFS >>>>> operations. >>>>> Perhaps the small write buffer that can take 3-5 seconds to be >>>>> written or >>>>> some other operation occuring in UBI/UBIFS? I don't think the >>>>> shutdown of >>>>> the filesystem is dealing with all the threads properly. >>>> >>>> And what about powercuts? >>> powercuts would exhibit the exact same behaviour as we are observing: >>> the erase is interrupted by loss of power so the NAND block being >>> erased would be in a partially erased >>> state. powercuts have little to do with the reboot sequence I am >>> describing. >>> >>>> UBI/UBIFS was designed to survive powercuts. >>> Yes, this does not cause UBIFS to fail to survive the powercut. It >>> does cause blocks to not be erased properly. 
>> >> Makes sense. >> >>> The block that didn't finish erasing is uncorrectable on next boot-up: >>> >>> [ 1.330000] UBI: attaching mtd7 to ubi0 >>> [ 2.000000] iproc_nand 18046000.nand: uncorrectable error at >>> 0x18700000 >>> >>> The issue is that these blocks shouldn't be corrupted in the first place >>> if UBI/UBIFS shuts down properly. >>> >>>> If your NAND shows strange issues even after a clean reboot >>>> something nasty is >>>> going on. Does your driver pass all UBI/MTD tests? >>>> >>> We are in the process of running the MTD tests. But this appears to >>> have nothing to do with a buggy driver or not. The NAND driver will >>> do what it is told to do. If it is told >>> to erase a block it will erase a block. It can't control if the >>> system reboots in the middle of this operation? >>> >>> This appears to be a UBI/UBIFS issue. UBI/UBIFS operations are still >>> going on after the filesystem is unmounted. The shutdown process >>> completes and a reboot happens. My guess is >>> these operations are due to the asynchronous threads of UBI/UBIFS not >>> being handled properly during the shutdown process? >>> >>> I have found other people have reported unexplained flash corruption. >>> We back ported this to the 3.10 kernel which solved most of the flash >>> corruption issues: >>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/super.c?id=807612db2f9940b9fa6deaef054eb16d51bd3e00 >>> >>> >>> The only remaining flash corruption issue is due to the described >>> issue of reboot happening in the middle of an erase cycle. >> >> You can verify your hypothesis easily. Add a printk() to >> ubi_detach_mtd_dev(). This function shuts down UBI and also the >> background thread which does >> all erase work. > Hi Richard, > > The printk never happens. > > I only find ubi_detach_mtd_dev can be called by ubi_exit. But ubi_exit > is only called if it is a module...
ubi_detach_mtd_dev() is also called for UBI_IOCDET IOCTL (look at cdev.c ctrl_cdev_ioctl()). It is triggered by ubidetach. We had similar issues that graceful shutdowns/reboots weren't handling UBI shutdown properly. We solved it by calling ubidetach in our reboot/powerdown scripts. You're right that the issue will still remain in power cuts, but at least for graceful shutdown it is handled properly. > > static void __exit ubi_exit(void) > { > int i; > > for (i = 0; i < UBI_MAX_DEVICES; i++) > if (ubi_devices[i]) { > mutex_lock(&ubi_devices_mutex); > ubi_detach_mtd_dev(ubi_devices[i]->ubi_num, 1); > mutex_unlock(&ubi_devices_mutex); > } > ubi_debugfs_exit(); > kmem_cache_destroy(ubi_wl_entry_slab); > misc_deregister(&ubi_ctrl_cdev); > class_remove_file(ubi_class, &ubi_version); > class_destroy(ubi_class); > } > module_exit(ubi_exit); > >> >> Thanks, >> //richard >> > > > ______________________________________________________ > Linux MTD discussion mailing list > http://lists.infradead.org/mailman/listinfo/linux-mtd/ Thanks, Tanya Brokhman -- Qualcomm Israel, on behalf of Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: suspect UBIFS async operations causing issues during reboot 2014-11-05 8:32 suspect UBIFS async operations causing issues during reboot Scott Branden 2014-11-05 9:22 ` Richard Weinberger @ 2014-11-12 11:20 ` Artem Bityutskiy 2014-11-15 3:30 ` Scott Branden 1 sibling, 1 reply; 19+ messages in thread From: Artem Bityutskiy @ 2014-11-12 11:20 UTC (permalink / raw) To: Scott Branden; +Cc: linux-mtd Hi Scott, sorry for late reply, but better late than never. On Wed, 2014-11-05 at 00:32 -0800, Scott Branden wrote: > Over 1000's of reboots we eventually find that the NAND has > uncorrectable ECC errors reported on a random page when it is mounted. How do you find the uncorrectable errors? Do you scan the entire NAND chip after you boot up? Or do you read all files stored in the UBIFS file-system, or you do not do anything special, just mount and notice ECC error messages in dmesg? Does UBIFS fail to mount? What is the time-window where power cut may lead to problems in your NAND. And how these problems are seen by the software? I mean, what happens to the data? Can it become "mostly OK", except of one or few pages with too many bit-flips? I understand that during erase all 0 bits "become 1s", but not instantaneously, so in case of an interrupt they may read as 1 or 0 randomly. But the bits which were 1s - nothing happens, they stay to be 1s? > We suspect the problem is the asynchronous nature of the UBIFS > operations. Perhaps the small write buffer that can take 3-5 seconds to > be written or some other operation occurring in UBI/UBIFS? I don't think > the shutdown of the filesystem is dealing with all the threads properly. Yes, writes are asynchronous.
There is the write-buffer of the NAND page size, and there is Linux write-back, which flushes dirty data in background (standard stuff for all file-systems). > <REBOOT happens here with NAND ERASE COMMAND in progress corrupting > 0x18700000 NAND Addresses!> Corrupted NAND only happens when erase > operation in progress when restarting system happens. I acknowledge that there may be problems with interrupted erase. We saw them in case of NOR, where erase is very slow and it is easy to interrupt it. We never saw this for NAND, but I may well imagine that this may be an issue in case of NAND. For NOR, we mitigated the issue by "invalidating" the PEB before erasing. Check the 'nor_erase_prepare()' function in 'drivers/mtd/ubi/io.c' and its commentaries. The first thing you may try is - add a similar quick hack to UBI and invalidate the first NAND page or the first 2 NAND pages (depends on whether you use sub-pages or not). You can just write all zeroes. The point is to corrupt data, so that the subsequent read results in a CRC check failure. See what happens. Some general notes. In general, if UBI or UBIFS decided to erase an LEB, the data in there are no longer needed. E.g., when GC of UBIFS moves all the valid data to another PEB, the older PEB is not needed, it is scheduled for erasure. The erasure happens asynchronously. If you have a power cut, and the PEB erase operation was interrupted, and you end up with a PEB which is "mostly fine", so next time you mount UBIFS it may start reading from it (e.g., if this was a journal PEB), and get errors. Now, my point is that this should not be a fundamental problem for UBIFS. This should be fixable. It may need good UBIFS knowledge to fix, and time, though. One way to deal with this is to emulate erase interruptions at UBI level. Similarly to how we implemented the power cut testing infrastructure in UBIFS. On the other hand, if you can invalidate the PEB before you start erasing, this should just solve the problem.
So I'd start with this, and see what happens. You may have more than one type of issue, so fixing the erase interrupt issue this way quickly may let you exclude this type of problem. And generally, I am not opposed to this solution in upstream too, if it works for everyone. ^ permalink raw reply [flat|nested] 19+ messages in thread
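The "invalidate before erase" idea Artem describes can be illustrated with a small user-space simulation. This is a sketch under stated assumptions, not the `nor_erase_prepare()` code: the `sim_*` names, the tiny 64-byte block and 16-byte page are hypothetical, a magic-string comparison stands in for the CRC check, and an interrupted erase is modelled as only part of the block having been set to 0xFF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_PAGE_SIZE 16  /* hypothetical, tiny for illustration */
#define SIM_PEB_SIZE  64

struct sim_peb {
    uint8_t data[SIM_PEB_SIZE];
};

/* Fill the block with a header magic followed by payload bytes. */
void sim_peb_fill(struct sim_peb *p)
{
    memset(p->data, 0xA5, SIM_PEB_SIZE);
    memcpy(p->data, "UBI#", 4);
}

/* Stand-in for the header CRC check performed when attaching. */
bool sim_header_valid(const struct sim_peb *p)
{
    return memcmp(p->data, "UBI#", 4) == 0;
}

/* Invalidate: overwrite the first page with zeroes before erasing. */
void sim_invalidate(struct sim_peb *p)
{
    memset(p->data, 0x00, SIM_PAGE_SIZE);
}

/* Model an interrupted erase that only reached bytes [from, to). */
void sim_erase_partial(struct sim_peb *p, size_t from, size_t to)
{
    for (size_t i = from; i < to && i < SIM_PEB_SIZE; i++)
        p->data[i] = 0xFF;
}
```

Without the invalidation step, a partial erase of bytes 32..47 leaves `sim_header_valid()` returning true for a block whose payload is already damaged, which is the "mostly fine" PEB from the message above. With `sim_invalidate()` called first, the header check fails after the same partial erase, so a scan would put the block back on the erase list instead of reading from it.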
* Re: suspect UBIFS async operations causing issues during reboot 2014-11-12 11:20 ` Artem Bityutskiy @ 2014-11-15 3:30 ` Scott Branden 0 siblings, 0 replies; 19+ messages in thread From: Scott Branden @ 2014-11-15 3:30 UTC (permalink / raw) To: dedekind1; +Cc: linux-mtd Hi Artem, Thanks for your response. We have completed our testing and solved the issue by adding a reboot notifier - one was added to chips/cfi_cmdset_0002.c and chips/cfi_cmdset_0001.c to solve the problem 5 years ago on NOR devices. See comments inline and proposed fix at bottom - I can then send out a patch for review. On 14-11-12 03:20 AM, Artem Bityutskiy wrote: > Hi Scott, > > sorry for late reply, but better late than never. > > On Wed, 2014-11-05 at 00:32 -0800, Scott Branden wrote: >> Over 1000's of reboots we eventually find that the NAND has >> uncorrectable ECC errors reported on a random page when it is mounted. > > How do you find the uncorrectable errors? Do you scan the entire NAND > chip after you boot up? Or do you read all files stored in the UBIFS > file-system, or you do not do anything special, just mount and notice > ECC error messages in dmesg? Does UBIFS fail to mount? We just mount and notice the ECC error messages. UBIFS does not fail to mount, it handles the situation. But there shouldn't be error messages generated in the first place due to a reboot. > > What is the time-window where power cut may lead to problems in your > NAND. And how these problems are seen by the software? I mean, what > happens to the data? Can it become "mostly OK", except of one or few > pages with too many bit-flips? I understand that during erase all 0 bits > "become 1s", but not instantaneously, so in case of an interrupt they > may read as 1 or 0 randomly. But the bits which were 1s - nothing > happens, they stay to be 1s? Yes, the bits are in the middle of erase so most are 1's and some are still 0. > >> We suspect the problem is the asynchronous nature of the UBIFS >> operations.
Perhaps the small write buffer that can take 3-5 seconds to >> be written or some other operation occurring in UBI/UBIFS? I don't think >> the shutdown of the filesystem is dealing with all the threads properly. > > Yes, writes are asynchronous. There is the write-buffer of the NAND page > size, and there is Linux write-back, which flushes dirty data in > background (standard stuff for all file-systems). > >> <REBOOT happens here with NAND ERASE COMMAND in progress corrupting >> 0x18700000 NAND Addresses!> Corrupted NAND only happens when erase >> operation in progress when restarting system happens. > > I acknowledge that there may be problems with interrupted erase. We saw > them in case of NOR, where erase is very slow and it is easy to > interrupt it. We never saw this for NAND, but I may well imagine that > this may be an issue in case of NAND. Yes, we hit the situation. > > For NOR, we mitigated the issue by "invalidating" the PEB before > erasing. Check the 'nor_erase_prepare()' function in > 'drivers/mtd/ubi/io.c' and its commentaries. > > The first thing you may try is - add a similar quick hack to UBI and > invalidate the first NAND page or the first 2 NAND pages (depends on > whether you use sub-pages or not). > > You can just write all zeroes. The point is to corrupt data, so that the > subsequent read results in a CRC check failure. > > See what happens. > > > Some general notes. > > In general, if UBI or UBIFS decided to erase an LEB, the data in there > are no longer needed. E.g., when GC of UBIFS moves all the valid data > to another PEB, the older PEB is not needed, it is scheduled for > erasure. The erasure happens asynchronously. If you have a power cut, > and the PEB erase operation was interrupted, and you end up with a PEB > which is "mostly fine", so next time you mount UBIFS it may start > reading from it (e.g., if this was a journal PEB), and get errors. > > Now, my point is that this should not be a fundamental problem for > UBIFS.
This should be fixable. It may need good UBIFS knowledge to fix, > and time, though. > > One way to deal with this is to emulate erase interruptions at UBI > level. Similarly to how we implemented the power cut testing infrastructure > in UBIFS. > > On the other hand, if you can invalidate the PEB before you start > erasing, this should just solve the problem. So I'd start with this, and > see what happens. You may have more than one type of issue, so fixing > the erase interrupt issue this way quickly may let you exclude this > type of problem. And generally, I am not opposed to this solution in > upstream too, if it works for everyone.

We add nand_shutdown to nand_base:

+/**
+ * nand_shutdown - [NAND Interface] finish the current nand operation and
+ *                 prevent further operations
+ * @mtd: MTD device structure
+ */
+int nand_shutdown(struct mtd_info *mtd)
+{
+	return nand_get_device(mtd, FL_SHUTDOWN);
+}
+EXPORT_SYMBOL_GPL(nand_shutdown);

We call the nand_shutdown routine from the reboot notifier we add in our iproc driver (to be upstreamed soon).

+static int iproc_nand_reboot_notifier(struct notifier_block *n,
+				      unsigned long state,
+				      void *cmd)
+{
+	struct mtd_info *mtd;
+
+	mtd = container_of(n, struct mtd_info, reboot_notifier);
+	nand_shutdown(mtd);
+	return NOTIFY_DONE;
+}

If the reboot notifier can always be added somewhere in mtd, could it be moved out of the driver and always called?

^ permalink raw reply [flat|nested] 19+ messages in thread
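The intended effect of the reboot notifier can be modelled outside the kernel with a short user-space simulation. Everything here is a hypothetical stand-in (the `sim_*` names, the three-state device, the hand-rolled callback list); it is not the kernel notifier API or the nand_base state machine, and since the model is single-threaded, "waiting for the current operation" is represented by completing the pending erase inside the shutdown call.

```c
#include <stdbool.h>
#include <stddef.h>

enum sim_state { SIM_READY, SIM_ERASING, SIM_SHUTDOWN };

struct sim_nand {
    enum sim_state state;
    int erases_done;
};

/* Start an erase; refused unless the device is idle and not shut down. */
bool sim_erase_start(struct sim_nand *n)
{
    if (n->state != SIM_READY)
        return false;
    n->state = SIM_ERASING;
    return true;
}

/* Complete an in-flight erase, if there is one. */
void sim_erase_finish(struct sim_nand *n)
{
    if (n->state == SIM_ERASING) {
        n->erases_done++;
        n->state = SIM_READY;
    }
}

/* Let the current operation finish, then latch the shutdown state. */
void sim_nand_shutdown(struct sim_nand *n)
{
    sim_erase_finish(n);
    n->state = SIM_SHUTDOWN;
}

/* Minimal stand-in for a reboot notifier chain. */
typedef void (*sim_reboot_cb)(void *ctx);

struct sim_notifier {
    sim_reboot_cb cb;
    void *ctx;
    struct sim_notifier *next;
};

void sim_register_reboot_notifier(struct sim_notifier **head,
                                  struct sim_notifier *nb)
{
    nb->next = *head;
    *head = nb;
}

/* Run every registered callback; a real system would restart afterwards. */
void sim_reboot(struct sim_notifier *head)
{
    for (; head != NULL; head = head->next)
        head->cb(head->ctx);
}

/* Callback that a NAND driver would register in this model. */
void sim_nand_reboot_cb(void *ctx)
{
    sim_nand_shutdown(ctx);
}
```

In this model, an erase started just before `sim_reboot()` is allowed to complete, and any erase requested afterwards is refused. That is the behaviour the log at the top of the thread lacks, where a new erase command is issued after "Restarting system." has already been printed.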