linux-mtd.lists.infradead.org archive mirror
* suspect UBIFS async operations causing issues during reboot
@ 2014-11-05  8:32 Scott Branden
  2014-11-05  9:22 ` Richard Weinberger
  2014-11-12 11:20 ` Artem Bityutskiy
  0 siblings, 2 replies; 19+ messages in thread
From: Scott Branden @ 2014-11-05  8:32 UTC (permalink / raw)
  To: linux-mtd

We are doing reboot testing with UBIFS on the 3.10 kernel with a new 
chipset we are working on.

Over thousands of reboots we eventually find that the NAND reports 
uncorrectable ECC errors on a random page when it is mounted.

We have found the problem is that a NAND erase operation is in progress 
when the reboot occurs. Since the NAND is in the middle of the erase 
operation the page is mostly FF with some random bits not erased when 
the reboot occurs.
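
The "mostly FF with some random bits not erased" signature can be checked 
mechanically when dumping a suspect page. What follows is a minimal userland 
sketch, not code from the iproc_nand driver: the function names, the 
three-way classification and the stray-bit threshold are all assumptions 
made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Count bits that read back as 0. A fully erased NAND page reads as all 1s. */
size_t count_zero_bits(const uint8_t *buf, size_t len)
{
    size_t zeros = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t inv = (uint8_t)~buf[i];   /* 1 bits now mark unerased cells */

        while (inv) {
            zeros += inv & 1u;
            inv >>= 1;
        }
    }
    return zeros;
}

enum page_state { PAGE_ERASED, PAGE_MOSTLY_ERASED, PAGE_PROGRAMMED };

/*
 * Hypothetical classification: a page with a few stray 0 bits is treated as
 * the remnant of an interrupted erase. max_stray_zeros is an assumed tuning
 * parameter, not a value taken from this thread.
 */
enum page_state classify_page(const uint8_t *buf, size_t len,
                              size_t max_stray_zeros)
{
    size_t zeros = count_zero_bits(buf, len);

    if (zeros == 0)
        return PAGE_ERASED;
    if (zeros <= max_stray_zeros)
        return PAGE_MOSTLY_ERASED;
    return PAGE_PROGRAMMED;
}
```

A page classified as PAGE_MOSTLY_ERASED at an address that was being erased 
when the reboot hit would be consistent with the diagnosis above.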

We suspect the problem is the asynchronous nature of the UBIFS 
operations.  Perhaps it is the small write buffer, which can take 3-5 
seconds to be written, or some other operation occurring in UBI/UBIFS?  I 
don't think the shutdown of the filesystem is dealing with all the threads 
properly.

The log below, with printks added to the iproc_nand driver, shows erase 
operations in progress when "Restarting system." happens.

Stopped Setup Virtual Console.
Stopping Apply Kernel Variables...
Stopped Apply Kernel Variables.
Starting Notify Audit System and Update UTMP about System Shutdown...
Stopping Runtime Directory...
Stopping Remount API VFS...
Stopped Remount API VFS.
Stopping Remount Root FS...
Stopped Remount Root FS.
Stopping Collect Read-Ahead Data...
Stopped Collect Read-Ahead Data.
Stopping Media Directory...[   18.370000] systemd[1]: Unit 
systemd-readahead-collect.service entered failed state.

Started Console System Reboot Logging.
Stopped Runtime Directory.
Stopped Media Directory.
[   18.490000] systemd[1]: Shutting down.
Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Unmounting file systems.
[   18.530000] iproc_nand_cmdfunc: cmd 0x60 addr 0x14a40000
[   18.540000] iproc_nand_waitfunc: native cmd 8 intfc status 0xc00000e0
[   18.550000] UBIFS: background thread "ubifs_bgt0_0" stops
Disabling swaps.
Detaching loop devices.
Detaching DM devices.
[   18.560000] iproc_nand_cmdfunc: cmd 0x60 addr 0x18680000
[   18.570000] iproc_nand_waitfunc: native cmd 8 intfc status 0xc00000e0
[   18.580000] Restarting system.
[   18.580000] iproc_nand_cmdfunc: cmd 0x60 addr 0x18700000

<REBOOT happens here with a NAND ERASE COMMAND in progress, corrupting 
NAND address 0x18700000!>  NAND corruption only happens when an erase 
operation is in progress at the moment the system restarts.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-05  8:32 suspect UBIFS async operations causing issues during reboot Scott Branden
@ 2014-11-05  9:22 ` Richard Weinberger
  2014-11-05 17:56   ` Scott Branden
  2014-11-12 11:20 ` Artem Bityutskiy
  1 sibling, 1 reply; 19+ messages in thread
From: Richard Weinberger @ 2014-11-05  9:22 UTC (permalink / raw)
  To: Scott Branden; +Cc: linux-mtd@lists.infradead.org

On Wed, Nov 5, 2014 at 9:32 AM, Scott Branden <sbranden@broadcom.com> wrote:
> We are doing reboot testing with UBIFS on the 3.10 kernel with a new chipset
> we are working on.
>
> Over 1000's of reboots we eventually find that the NAND has uncorrectable
> ECC errors reported on a random page when it is mounted.
>
> We have found the problem is that a NAND erase operation is in progress when
> the reboot occurs. Since the NAND is in the middle of the erase operation
> the page is mostly FF with some random bits not erased when the reboot
> occurs.
>
> We suspect the problem is the asynchronous nature of the UBIFS operations.
> Perhaps the small write buffer that can take 3-5 seconds to be written or
> some other operation occuring in UBI/UBIFS?  I don't think the shutdown of
> the filesystem is dealing with all the threads properly.

And what about powercuts?
UBI/UBIFS was designed to survive powercuts.
If your NAND shows strange issues even after a clean reboot something nasty is
going on. Does your driver pass all UBI/MTD tests?

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-05  9:22 ` Richard Weinberger
@ 2014-11-05 17:56   ` Scott Branden
  2014-11-05 18:21     ` Richard Weinberger
  0 siblings, 1 reply; 19+ messages in thread
From: Scott Branden @ 2014-11-05 17:56 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-mtd@lists.infradead.org

Hi Richard,

Thanks for the feedback.  Comments inline.

On 14-11-05 01:22 AM, Richard Weinberger wrote:
> On Wed, Nov 5, 2014 at 9:32 AM, Scott Branden <sbranden@broadcom.com> wrote:
>> We are doing reboot testing with UBIFS on the 3.10 kernel with a new chipset
>> we are working on.
>>
>> Over 1000's of reboots we eventually find that the NAND has uncorrectable
>> ECC errors reported on a random page when it is mounted.
>>
>> We have found the problem is that a NAND erase operation is in progress when
>> the reboot occurs. Since the NAND is in the middle of the erase operation
>> the page is mostly FF with some random bits not erased when the reboot
>> occurs.
>>
>> We suspect the problem is the asynchronous nature of the UBIFS operations.
>> Perhaps the small write buffer that can take 3-5 seconds to be written or
>> some other operation occuring in UBI/UBIFS?  I don't think the shutdown of
>> the filesystem is dealing with all the threads properly.
>
> And what about powercuts?
powercuts would exhibit the exact same behaviour as we are observing: 
the erase is interrupted by loss of power so the NAND block being erased 
would be in a partially erased state.  powercuts have little to do with 
the reboot sequence I am describing.

> UBI/UBIFS was designed to survive powercuts.
Yes, this does not cause UBIFS to fail to survive the powercut.  It does 
cause blocks to not be erased properly.

The block that didn't finish erasing is uncorrectable on the next boot-up:

[    1.330000] UBI: attaching mtd7 to ubi0
[    2.000000] iproc_nand 18046000.nand: uncorrectable error at 0x18700000

The issue is that these blocks shouldn't be corrupted in the first place if 
UBI/UBIFS shuts down properly.

> If your NAND shows strange issues even after a clean reboot something nasty is
> going on. Does your driver pass all UBI/MTD test?
>
We are in the process of running the MTD tests.  But this appears to 
have nothing to do with whether the driver is buggy.  The NAND driver will 
do what it is told to do.  If it is told to erase a block it will erase a 
block.  It can't control whether the system reboots in the middle of this 
operation.

This appears to be a UBI/UBIFS issue.  UBI/UBIFS operations are still 
going on after the filesystem is unmounted.  The shutdown process 
completes and a reboot happens.  My guess is these operations are due to 
the asynchronous threads of UBI/UBIFS not being handled properly during 
the shutdown process?

I have found that other people have reported unexplained flash corruption. 
We backported the following commit to the 3.10 kernel, which solved most of 
the flash corruption issues:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/super.c?id=807612db2f9940b9fa6deaef054eb16d51bd3e00

The only remaining flash corruption issue is the one described above: a 
reboot happening in the middle of an erase cycle.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-05 17:56   ` Scott Branden
@ 2014-11-05 18:21     ` Richard Weinberger
  2014-11-05 22:52       ` Scott Branden
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Weinberger @ 2014-11-05 18:21 UTC (permalink / raw)
  To: Scott Branden, Richard Weinberger; +Cc: linux-mtd@lists.infradead.org

Hi!

Am 05.11.2014 um 18:56 schrieb Scott Branden:
> Hi Richard,
> 
> Thanks for the feedback.  Comments inline.
> 
> On 14-11-05 01:22 AM, Richard Weinberger wrote:
>> On Wed, Nov 5, 2014 at 9:32 AM, Scott Branden <sbranden@broadcom.com> wrote:
>>> We are doing reboot testing with UBIFS on the 3.10 kernel with a new chipset
>>> we are working on.
>>>
>>> Over 1000's of reboots we eventually find that the NAND has uncorrectable
>>> ECC errors reported on a random page when it is mounted.
>>>
>>> We have found the problem is that a NAND erase operation is in progress when
>>> the reboot occurs. Since the NAND is in the middle of the erase operation
>>> the page is mostly FF with some random bits not erased when the reboot
>>> occurs.
>>>
>>> We suspect the problem is the asynchronous nature of the UBIFS operations.
>>> Perhaps the small write buffer that can take 3-5 seconds to be written or
>>> some other operation occuring in UBI/UBIFS?  I don't think the shutdown of
>>> the filesystem is dealing with all the threads properly.
>>
>> And what about powercuts?
> powercuts would exhibit the exact same behaviour as we are observing: the erase is interrupted by loss of power so the NAND block being erased would be in a partially erased
> state.  powercuts have little to do with the reboot sequence I am describing.
> 
>> UBI/UBIFS was designed to survive powercuts.
> Yes, this does not cause UBIFS to fail to survive the powercut.  It does cause blocks to not be erased properly.

Makes sense.

> The block that didn't finish to erase is uncorrectable on next boot-up:
> 
> [    1.330000] UBI: attaching mtd7 to ubi0
> [    2.000000] iproc_nand 18046000.nand: uncorrectable error at 0x18700000
> 
> This issue is this blocks shouldn't be corrupted in the first place if UBI/UBIFS shut downs properly.
> 
>> If your NAND shows strange issues even after a clean reboot something nasty is
>> going on. Does your driver pass all UBI/MTD test?
>>
> We are in the process of running the MTD tests.  But this appears to have nothing to do with a buggy driver or not.  The NAND driver will do what it is told to do.  If it is told
> to erase a block it will erase a block.  It can't control if the system reboots in the middle of this operation?
> 
> This appears to be a UBI/UBIFS issue.  UBI/UBIFS operations are still going on after the filesystem in unmounted.  The shutdown process completes and a reboot happens.  My guess is
> these operations are due to the asynchronous threads of UBI/UBIFS not being handled properly during the shutdown process?
> 
> I have found other people have reported unexplained flash corruption. We back ported this to the 3.10 kernel which solved most of the flash corruption issues:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/super.c?id=807612db2f9940b9fa6deaef054eb16d51bd3e00
> 
> This only remaining flash corruption issue is due to the described issue of reboot happening in the middle of an erase cycle.

You can verify your hypothesis easily. Add a printk() to ubi_detach_mtd_dev(). This function shuts down UBI and also the background thread which does
all erase work.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-05 18:21     ` Richard Weinberger
@ 2014-11-05 22:52       ` Scott Branden
  2014-11-06 21:56         ` Scott Branden
  2014-11-10  7:44         ` Tanya Brokhman
  0 siblings, 2 replies; 19+ messages in thread
From: Scott Branden @ 2014-11-05 22:52 UTC (permalink / raw)
  To: Richard Weinberger, Richard Weinberger; +Cc: linux-mtd@lists.infradead.org

On 14-11-05 10:21 AM, Richard Weinberger wrote:
> Hi!
>
> Am 05.11.2014 um 18:56 schrieb Scott Branden:
>> Hi Richard,
>>
>> Thanks for the feedback.  Comments inline.
>>
>> On 14-11-05 01:22 AM, Richard Weinberger wrote:
>>> On Wed, Nov 5, 2014 at 9:32 AM, Scott Branden <sbranden@broadcom.com> wrote:
>>>> We are doing reboot testing with UBIFS on the 3.10 kernel with a new chipset
>>>> we are working on.
>>>>
>>>> Over 1000's of reboots we eventually find that the NAND has uncorrectable
>>>> ECC errors reported on a random page when it is mounted.
>>>>
>>>> We have found the problem is that a NAND erase operation is in progress when
>>>> the reboot occurs. Since the NAND is in the middle of the erase operation
>>>> the page is mostly FF with some random bits not erased when the reboot
>>>> occurs.
>>>>
>>>> We suspect the problem is the asynchronous nature of the UBIFS operations.
>>>> Perhaps the small write buffer that can take 3-5 seconds to be written or
>>>> some other operation occuring in UBI/UBIFS?  I don't think the shutdown of
>>>> the filesystem is dealing with all the threads properly.
>>>
>>> And what about powercuts?
>> powercuts would exhibit the exact same behaviour as we are observing: the erase is interrupted by loss of power so the NAND block being erased would be in a partially erased
>> state.  powercuts have little to do with the reboot sequence I am describing.
>>
>>> UBI/UBIFS was designed to survive powercuts.
>> Yes, this does not cause UBIFS to fail to survive the powercut.  It does cause blocks to not be erased properly.
>
> Makes sense.
>
>> The block that didn't finish to erase is uncorrectable on next boot-up:
>>
>> [    1.330000] UBI: attaching mtd7 to ubi0
>> [    2.000000] iproc_nand 18046000.nand: uncorrectable error at 0x18700000
>>
>> This issue is this blocks shouldn't be corrupted in the first place if UBI/UBIFS shut downs properly.
>>
>>> If your NAND shows strange issues even after a clean reboot something nasty is
>>> going on. Does your driver pass all UBI/MTD test?
>>>
>> We are in the process of running the MTD tests.  But this appears to have nothing to do with a buggy driver or not.  The NAND driver will do what it is told to do.  If it is told
>> to erase a block it will erase a block.  It can't control if the system reboots in the middle of this operation?
>>
>> This appears to be a UBI/UBIFS issue.  UBI/UBIFS operations are still going on after the filesystem in unmounted.  The shutdown process completes and a reboot happens.  My guess is
>> these operations are due to the asynchronous threads of UBI/UBIFS not being handled properly during the shutdown process?
>>
>> I have found other people have reported unexplained flash corruption. We back ported this to the 3.10 kernel which solved most of the flash corruption issues:
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/super.c?id=807612db2f9940b9fa6deaef054eb16d51bd3e00
>>
>> This only remaining flash corruption issue is due to the described issue of reboot happening in the middle of an erase cycle.
>
> You can verify your hypothesis easily. Add a printk() to ubi_detach_mtd_dev(). This function shuts down UBI and also the background thread which does
> all erase work.
Hi Richard,

The printk never happens.

As far as I can find, ubi_detach_mtd_dev is only called by ubi_exit.  But 
ubi_exit is only called if UBI is built as a module...

static void __exit ubi_exit(void)
{
	int i;

	for (i = 0; i < UBI_MAX_DEVICES; i++)
		if (ubi_devices[i]) {
			mutex_lock(&ubi_devices_mutex);
			ubi_detach_mtd_dev(ubi_devices[i]->ubi_num, 1);
			mutex_unlock(&ubi_devices_mutex);
		}
	ubi_debugfs_exit();
	kmem_cache_destroy(ubi_wl_entry_slab);
	misc_deregister(&ubi_ctrl_cdev);
	class_remove_file(ubi_class, &ubi_version);
	class_destroy(ubi_class);
}
module_exit(ubi_exit);

>
> Thanks,
> //richard
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-05 22:52       ` Scott Branden
@ 2014-11-06 21:56         ` Scott Branden
  2014-11-07  8:45           ` Richard Weinberger
  2014-11-10  7:44         ` Tanya Brokhman
  1 sibling, 1 reply; 19+ messages in thread
From: Scott Branden @ 2014-11-06 21:56 UTC (permalink / raw)
  To: Richard Weinberger, Richard Weinberger; +Cc: linux-mtd@lists.infradead.org

It looks like the erase happening in the middle of reboot was uncovered 
in 2009 and never addressed properly?

https://lkml.org/lkml/2009/6/9/16
https://lkml.org/lkml/2010/2/12/144

Was there a proper resolution to this issue?



On 14-11-05 02:52 PM, Scott Branden wrote:
> On 14-11-05 10:21 AM, Richard Weinberger wrote:
>> Hi!
>>
>> Am 05.11.2014 um 18:56 schrieb Scott Branden:
>>> Hi Richard,
>>>
>>> Thanks for the feedback.  Comments inline.
>>>
>>> On 14-11-05 01:22 AM, Richard Weinberger wrote:
>>>> On Wed, Nov 5, 2014 at 9:32 AM, Scott Branden
>>>> <sbranden@broadcom.com> wrote:
>>>>> We are doing reboot testing with UBIFS on the 3.10 kernel with a
>>>>> new chipset
>>>>> we are working on.
>>>>>
>>>>> Over 1000's of reboots we eventually find that the NAND has
>>>>> uncorrectable
>>>>> ECC errors reported on a random page when it is mounted.
>>>>>
>>>>> We have found the problem is that a NAND erase operation is in
>>>>> progress when
>>>>> the reboot occurs. Since the NAND is in the middle of the erase
>>>>> operation
>>>>> the page is mostly FF with some random bits not erased when the reboot
>>>>> occurs.
>>>>>
>>>>> We suspect the problem is the asynchronous nature of the UBIFS
>>>>> operations.
>>>>> Perhaps the small write buffer that can take 3-5 seconds to be
>>>>> written or
>>>>> some other operation occuring in UBI/UBIFS?  I don't think the
>>>>> shutdown of
>>>>> the filesystem is dealing with all the threads properly.
>>>>
>>>> And what about powercuts?
>>> powercuts would exhibit the exact same behaviour as we are observing:
>>> the erase is interrupted by loss of power so the NAND block being
>>> erased would be in a partially erased
>>> state.  powercuts have little to do with the reboot sequence I am
>>> describing.
>>>
>>>> UBI/UBIFS was designed to survive powercuts.
>>> Yes, this does not cause UBIFS to fail to survive the powercut.  It
>>> does cause blocks to not be erased properly.
>>
>> Makes sense.
>>
>>> The block that didn't finish to erase is uncorrectable on next boot-up:
>>>
>>> [    1.330000] UBI: attaching mtd7 to ubi0
>>> [    2.000000] iproc_nand 18046000.nand: uncorrectable error at
>>> 0x18700000
>>>
>>> This issue is this blocks shouldn't be corrupted in the first place
>>> if UBI/UBIFS shut downs properly.
>>>
>>>> If your NAND shows strange issues even after a clean reboot
>>>> something nasty is
>>>> going on. Does your driver pass all UBI/MTD test?
>>>>
>>> We are in the process of running the MTD tests.  But this appears to
>>> have nothing to do with a buggy driver or not.  The NAND driver will
>>> do what it is told to do.  If it is told
>>> to erase a block it will erase a block.  It can't control if the
>>> system reboots in the middle of this operation?
>>>
>>> This appears to be a UBI/UBIFS issue.  UBI/UBIFS operations are still
>>> going on after the filesystem in unmounted.  The shutdown process
>>> completes and a reboot happens.  My guess is
>>> these operations are due to the asynchronous threads of UBI/UBIFS not
>>> being handled properly during the shutdown process?
>>>
>>> I have found other people have reported unexplained flash corruption.
>>> We back ported this to the 3.10 kernel which solved most of the flash
>>> corruption issues:
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/super.c?id=807612db2f9940b9fa6deaef054eb16d51bd3e00
>>>
>>>
>>> This only remaining flash corruption issue is due to the described
>>> issue of reboot happening in the middle of an erase cycle.
>>
>> You can verify your hypothesis easily. Add a printk() to
>> ubi_detach_mtd_dev(). This function shuts down UBI and also the
>> background thread which does
>> all erase work.
> Hi Richard,
>
> The printk never happens.
>
> I only find ubi_detach_mtd_dev can be called by ubi_exit.   But ubi_exit
> is only called if it is a module...
>
> static void __exit ubi_exit(void)
> {
>      int i;
>
>      for (i = 0; i < UBI_MAX_DEVICES; i++)
>          if (ubi_devices[i]) {
>              mutex_lock(&ubi_devices_mutex);
>              ubi_detach_mtd_dev(ubi_devices[i]->ubi_num, 1);
>              mutex_unlock(&ubi_devices_mutex);
>          }
>      ubi_debugfs_exit();
>      kmem_cache_destroy(ubi_wl_entry_slab);
>      misc_deregister(&ubi_ctrl_cdev);
>      class_remove_file(ubi_class, &ubi_version);
>      class_destroy(ubi_class);
> }
> module_exit(ubi_exit);
>
>>
>> Thanks,
>> //richard
>>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-06 21:56         ` Scott Branden
@ 2014-11-07  8:45           ` Richard Weinberger
  2014-11-07 17:31             ` Scott Branden
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Weinberger @ 2014-11-07  8:45 UTC (permalink / raw)
  To: Scott Branden; +Cc: linux-mtd@lists.infradead.org

Am 06.11.2014 um 22:56 schrieb Scott Branden:
> It looks like the erase happening in the middle of reboot was uncovered in 2009 and never addressed properly?
> 
> https://lkml.org/lkml/2009/6/9/16
> https://lkml.org/lkml/2010/2/12/144
> 
> Was there a proper resolution to this issue?

Did you read the threads you've posted?

There are two answers:
https://lkml.org/lkml/2010/2/12/143
https://lkml.org/lkml/2010/2/12/144

Thanks,
//richard

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-07  8:45           ` Richard Weinberger
@ 2014-11-07 17:31             ` Scott Branden
  2014-11-09 10:20               ` Richard Weinberger
  0 siblings, 1 reply; 19+ messages in thread
From: Scott Branden @ 2014-11-07 17:31 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-mtd@lists.infradead.org

On 14-11-07 12:45 AM, Richard Weinberger wrote:
> Am 06.11.2014 um 22:56 schrieb Scott Branden:
>> It looks like the erase happening in the middle of reboot was uncovered in 2009 and never addressed properly?
>>
>> https://lkml.org/lkml/2009/6/9/16
>> https://lkml.org/lkml/2010/2/12/144
>>
>> Was there a proper resolution to this issue?
>
> Did you read the threads you've posted?
>
> There two answers:
> https://lkml.org/lkml/2010/2/12/143
Yes, there is no hardware solution to a reset happening in the middle of 
an erase operation to NAND.

> https://lkml.org/lkml/2010/2/12/144
Yes, it appears nobody has ever implemented a reboot notifier in the 
past 5 years to solve this problem for NAND.

We are experimenting with reboot notifiers in both UBI and MTD, which gives 
a much cleaner shutdown than a notifier at the MTD level alone.  The UBI 
reboot notifier needs a higher priority than the flash device's notifier so 
that UBI is shut down first.  Overnight results are looking promising.
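
The ordering requirement can be illustrated with a small userland model of a 
priority-sorted notifier chain. This is only a sketch and not the patch being 
tested: struct notifier, notifier_register and the two callbacks are 
stand-ins invented here. In the kernel the corresponding pieces would be 
struct notifier_block, its priority field and register_reboot_notifier().

```c
#include <stddef.h>

enum layer { LAYER_UBI, LAYER_MTD };

/* Userland stand-in for the kernel's struct notifier_block. */
struct notifier {
    int priority;               /* higher priority is called earlier */
    void (*call)(void);
    struct notifier *next;
};

struct notifier *reboot_chain;  /* head of the priority-sorted list */
enum layer shutdown_log[4];     /* order in which layers were quiesced */
size_t shutdown_count;

/* Insert n ahead of the first entry with a strictly lower priority. */
void notifier_register(struct notifier *n)
{
    struct notifier **pos = &reboot_chain;

    while (*pos != NULL && (*pos)->priority >= n->priority)
        pos = &(*pos)->next;
    n->next = *pos;
    *pos = n;
}

/* Invoke every registered callback, highest priority first. */
void notifier_call_chain(void)
{
    for (struct notifier *n = reboot_chain; n != NULL; n = n->next)
        n->call();
}

void record(enum layer l)
{
    if (shutdown_count < sizeof(shutdown_log) / sizeof(shutdown_log[0]))
        shutdown_log[shutdown_count++] = l;
}

/* Hypothetical callbacks: a real UBI one would stop the background thread. */
void ubi_reboot(void) { record(LAYER_UBI); }
void mtd_reboot(void) { record(LAYER_MTD); }
```

With UBI registered at priority 1 and MTD at priority 0, the chain calls 
ubi_reboot before mtd_reboot regardless of registration order, so no new 
erase can be queued after the flash layer has been told to stop.
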
>
> Thanks,
> //richard
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-07 17:31             ` Scott Branden
@ 2014-11-09 10:20               ` Richard Weinberger
  2014-11-10  5:10                 ` Scott Branden
  2014-11-10  8:44                 ` Ricard Wanderlof
  0 siblings, 2 replies; 19+ messages in thread
From: Richard Weinberger @ 2014-11-09 10:20 UTC (permalink / raw)
  To: Scott Branden; +Cc: linux-mtd@lists.infradead.org

Am 07.11.2014 um 18:31 schrieb Scott Branden:
> On 14-11-07 12:45 AM, Richard Weinberger wrote:
>> Am 06.11.2014 um 22:56 schrieb Scott Branden:
>>> It looks like the erase happening in the middle of reboot was uncovered in 2009 and never addressed properly?
>>>
>>> https://lkml.org/lkml/2009/6/9/16
>>> https://lkml.org/lkml/2010/2/12/144
>>>
>>> Was there a proper resolution to this issue?
>>
>> Did you read the threads you've posted?
>>
>> There two answers:
>> https://lkml.org/lkml/2010/2/12/143
> Yes, there is no hardware solution to a reset happening in the middle of an erase operation to NAND.

Well, I agree with David that anything we do in software will only hide the real problem
or trim down the window.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-09 10:20               ` Richard Weinberger
@ 2014-11-10  5:10                 ` Scott Branden
  2014-11-26  8:17                   ` Brian Norris
  2014-11-10  8:44                 ` Ricard Wanderlof
  1 sibling, 1 reply; 19+ messages in thread
From: Scott Branden @ 2014-11-10  5:10 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-mtd@lists.infradead.org

On 14-11-09 02:20 AM, Richard Weinberger wrote:
> Am 07.11.2014 um 18:31 schrieb Scott Branden:
>> On 14-11-07 12:45 AM, Richard Weinberger wrote:
>>> Am 06.11.2014 um 22:56 schrieb Scott Branden:
>>>> It looks like the erase happening in the middle of reboot was uncovered in 2009 and never addressed properly?
>>>>
>>>> https://lkml.org/lkml/2009/6/9/16
>>>> https://lkml.org/lkml/2010/2/12/144
>>>>
>>>> Was there a proper resolution to this issue?
>>>
>>> Did you read the threads you've posted?
>>>
>>> There two answers:
>>> https://lkml.org/lkml/2010/2/12/143
>> Yes, there is no hardware solution to a reset happening in the middle of an erase operation to NAND.
>
> Well, I agree with David that anything we do in software will only hide the real problem
> or trim down the window.
Hi Richard,

Currently the NAND does not shut down in a clean manner for a reboot 
operation.  This is due to the asynchronous ubi_thread making flash erase 
calls.  Unmount is already handled properly in UBI and shuts down cleanly. 
Reboot is not handled in a clean manner, as there is no reboot_notifier to 
handle the situation.

This is not hiding a real problem.  It is just shutting down ubi 
properly rather than pulling the power from it in the middle of operations.

In addition to this - a reboot_notifier needs to be added at the mtd 
level to shut it down properly as well.

This is not trimming down a window.  It is having the drivers shut down 
properly so they do not look like a power failure to the NAND device.

There is no solution to the power failure case: it will corrupt pages in 
the middle of erasure.  And you do handle this in UBI/UBIFS.  But why 
corrupt other erase pages unnecessarily when all that needs to be done 
is to shut down the drivers properly?  I don't know what you are agreeing 
with David about.  It is not making a window smaller.  It is changing the 
functionality so that the UBI and MTD drivers are shut down cleanly in 
reboot situations.  Right now, they are not shut down at all in these 
situations.
>
> Thanks,
> //richard
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-05 22:52       ` Scott Branden
  2014-11-06 21:56         ` Scott Branden
@ 2014-11-10  7:44         ` Tanya Brokhman
  1 sibling, 0 replies; 19+ messages in thread
From: Tanya Brokhman @ 2014-11-10  7:44 UTC (permalink / raw)
  To: linux-mtd

On 11/6/2014 12:52 AM, Scott Branden wrote:
> On 14-11-05 10:21 AM, Richard Weinberger wrote:
>> Hi!
>>
>> Am 05.11.2014 um 18:56 schrieb Scott Branden:
>>> Hi Richard,
>>>
>>> Thanks for the feedback.  Comments inline.
>>>
>>> On 14-11-05 01:22 AM, Richard Weinberger wrote:
>>>> On Wed, Nov 5, 2014 at 9:32 AM, Scott Branden
>>>> <sbranden@broadcom.com> wrote:
>>>>> We are doing reboot testing with UBIFS on the 3.10 kernel with a
>>>>> new chipset
>>>>> we are working on.
>>>>>
>>>>> Over 1000's of reboots we eventually find that the NAND has
>>>>> uncorrectable
>>>>> ECC errors reported on a random page when it is mounted.
>>>>>
>>>>> We have found the problem is that a NAND erase operation is in
>>>>> progress when
>>>>> the reboot occurs. Since the NAND is in the middle of the erase
>>>>> operation
>>>>> the page is mostly FF with some random bits not erased when the reboot
>>>>> occurs.
>>>>>
>>>>> We suspect the problem is the asynchronous nature of the UBIFS
>>>>> operations.
>>>>> Perhaps the small write buffer that can take 3-5 seconds to be
>>>>> written or
>>>>> some other operation occuring in UBI/UBIFS?  I don't think the
>>>>> shutdown of
>>>>> the filesystem is dealing with all the threads properly.
>>>>
>>>> And what about powercuts?
>>> powercuts would exhibit the exact same behaviour as we are observing:
>>> the erase is interrupted by loss of power so the NAND block being
>>> erased would be in a partially erased
>>> state.  powercuts have little to do with the reboot sequence I am
>>> describing.
>>>
>>>> UBI/UBIFS was designed to survive powercuts.
>>> Yes, this does not cause UBIFS to fail to survive the powercut.  It
>>> does cause blocks to not be erased properly.
>>
>> Makes sense.
>>
>>> The block that didn't finish to erase is uncorrectable on next boot-up:
>>>
>>> [    1.330000] UBI: attaching mtd7 to ubi0
>>> [    2.000000] iproc_nand 18046000.nand: uncorrectable error at
>>> 0x18700000
>>>
>>> This issue is this blocks shouldn't be corrupted in the first place
>>> if UBI/UBIFS shut downs properly.
>>>
>>>> If your NAND shows strange issues even after a clean reboot
>>>> something nasty is
>>>> going on. Does your driver pass all UBI/MTD test?
>>>>
>>> We are in the process of running the MTD tests.  But this appears to
>>> have nothing to do with a buggy driver or not.  The NAND driver will
>>> do what it is told to do.  If it is told
>>> to erase a block it will erase a block.  It can't control if the
>>> system reboots in the middle of this operation?
>>>
>>> This appears to be a UBI/UBIFS issue.  UBI/UBIFS operations are still
>>> going on after the filesystem in unmounted.  The shutdown process
>>> completes and a reboot happens.  My guess is
>>> these operations are due to the asynchronous threads of UBI/UBIFS not
>>> being handled properly during the shutdown process?
>>>
>>> I have found other people have reported unexplained flash corruption.
>>> We back ported this to the 3.10 kernel which solved most of the flash
>>> corruption issues:
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/super.c?id=807612db2f9940b9fa6deaef054eb16d51bd3e00
>>>
>>>
>>> This only remaining flash corruption issue is due to the described
>>> issue of reboot happening in the middle of an erase cycle.
>>
>> You can verify your hypothesis easily. Add a printk() to
>> ubi_detach_mtd_dev(). This function shuts down UBI and also the
>> background thread which does
>> all erase work.
> Hi Richard,
>
> The printk never happens.
>
> I only find ubi_detach_mtd_dev can be called by ubi_exit.   But ubi_exit
> is only called if it is a module...

ubi_detach_mtd_dev() is also called for UBI_IOCDET IOCTL (look at cdev.c 
ctrl_cdev_ioctl()). It is triggered by ubidetach.
We had similar issues, where graceful shutdowns/reboots weren't handling 
UBI shutdown properly. We solved it by calling ubidetach in our 
reboot/powerdown scripts.
You're right that the issue will still remain in power cuts, but at 
least for graceful shutdown it is handled properly.
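
For systems where the shutdown path is a C program rather than a script, the 
same detach can be requested directly. A minimal sketch, assuming the kernel 
UAPI header <mtd/ubi-user.h> is installed and that the UBI control node is 
at the path passed in; ubi_detach is a hypothetical wrapper name, not a 
libubi function.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <mtd/ubi-user.h>   /* UBI_IOCDET */

/*
 * Ask UBI to detach device number ubi_num through the control node
 * (commonly /dev/ubi_ctrl). Returns 0 on success and -1 on any failure;
 * errno is left as set by open() or ioctl().
 */
int ubi_detach(const char *ctrl_path, int32_t ubi_num)
{
    int fd = open(ctrl_path, O_RDONLY);
    int ret;

    if (fd < 0)
        return -1;

    ret = ioctl(fd, UBI_IOCDET, &ubi_num);
    close(fd);
    return ret < 0 ? -1 : 0;
}
```

The call is expected to fail while a UBIFS volume on that device is still 
mounted, so it belongs after the unmount step and before the reboot.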

>
> static void __exit ubi_exit(void)
> {
>      int i;
>
>      for (i = 0; i < UBI_MAX_DEVICES; i++)
>          if (ubi_devices[i]) {
>              mutex_lock(&ubi_devices_mutex);
>              ubi_detach_mtd_dev(ubi_devices[i]->ubi_num, 1);
>              mutex_unlock(&ubi_devices_mutex);
>          }
>      ubi_debugfs_exit();
>      kmem_cache_destroy(ubi_wl_entry_slab);
>      misc_deregister(&ubi_ctrl_cdev);
>      class_remove_file(ubi_class, &ubi_version);
>      class_destroy(ubi_class);
> }
> module_exit(ubi_exit);
>
>>
>> Thanks,
>> //richard
>>
>


Thanks,
Tanya Brokhman
-- 
Qualcomm Israel, on behalf of Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-09 10:20               ` Richard Weinberger
  2014-11-10  5:10                 ` Scott Branden
@ 2014-11-10  8:44                 ` Ricard Wanderlof
  2014-11-10  9:08                   ` Richard Weinberger
  1 sibling, 1 reply; 19+ messages in thread
From: Ricard Wanderlof @ 2014-11-10  8:44 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-mtd@lists.infradead.org, Scott Branden


On Sun, 9 Nov 2014, Richard Weinberger wrote:

> Am 07.11.2014 um 18:31 schrieb Scott Branden:
> > On 14-11-07 12:45 AM, Richard Weinberger wrote:
> >> Am 06.11.2014 um 22:56 schrieb Scott Branden:
> >>> It looks like the erase happening in the middle of reboot was uncovered in 2009 and never addressed properly?
> >>>
> >>> https://lkml.org/lkml/2009/6/9/16
> >>> https://lkml.org/lkml/2010/2/12/144
> >>>
> >>> Was there a proper resolution to this issue?
> >>
> >> Did you read the threads you've posted?
> >>
> >> There are two answers:
> >> https://lkml.org/lkml/2010/2/12/143
> > Yes, there is no hardware solution to a reset happening in the middle of an erase operation to NAND.
> 
> Well, I agree with David that anything we do in software will only hide the real problem
> or trim down the window.

There's something I don't understand here. It could be (and probably will 
prove to be) my lack of knowledge on the detailed workings of UBI.

Back in jffs2 days, erased blocks were so indicated by writing a 
'cleanmarker' pattern to the OOB area. Thus, when scanning the flash, if a 
block was encountered which appeared erased but lacked the cleanmarker, it 
was re-erased just in case the previous erase was interrupted and 
therefore did not leave the bits in a properly erased state.

With ubifs, cleanmarkers are not used (partly because MLC flashes wouldn't 
support two writes to the OOB area: one for the cleanmarker and one for 
the ECC), but there _is_ a header at the start of each PEB. Thus the same 
situation really holds, if a (seemingly) erased PEB is encountered with no 
EC header, it could be considered the leftover of an unfinished erase 
operation. I don't know for a fact if (or how) UBI does this though.
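
A rough sketch of the scan-time decision described above, written as a standalone C model rather than UBI's actual attach code. The names classify_peb() and peb_needs_erase() are hypothetical; the only detail borrowed from UBI is the "UBI#" EC header magic, and a real scan would also verify the header CRC.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* UBI's EC header magic is 0x55424923, i.e. the ASCII bytes "UBI#". */
#define EC_HDR_MAGIC "UBI#"

enum peb_state {
	PEB_HAS_EC_HDR,		/* header magic present at offset 0 */
	PEB_LOOKS_ERASED,	/* every byte reads 0xFF, but no header */
	PEB_GARBAGE		/* neither, e.g. a half-finished erase */
};

/* Classify the first len bytes read from a PEB (model only, no CRC check). */
enum peb_state classify_peb(const uint8_t *peb, size_t len)
{
	size_t i;

	if (len >= 4 && memcmp(peb, EC_HDR_MAGIC, 4) == 0)
		return PEB_HAS_EC_HDR;

	for (i = 0; i < len; i++)
		if (peb[i] != 0xFF)
			return PEB_GARBAGE;

	return PEB_LOOKS_ERASED;
}

/*
 * A PEB without an EC header cannot be trusted to be fully erased, even
 * if it reads back as all 0xFF, so both header-less cases get re-erased.
 */
int peb_needs_erase(const uint8_t *peb, size_t len)
{
	return classify_peb(peb, len) != PEB_HAS_EC_HDR;
}
```

The point of the model is that "looks erased" and "is known to be erased" are different states; only the presence of a header written after the erase distinguishes them.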

Of course, an interrupted erase operation could leave a block in a
seemingly un-erased state, i.e. the data appears intact (but may not be).
But in that case the block would already be superseded by another block
(i.e. any potential data would have already been copied to another block
with the header info invalidating the old one). So in this case the block
would go on an erase list at some point because it is no longer valid.

Since interrupted erase seems to be so much of a concern I've obviously
missed something above. But I can't figure out what.

The only thing that seems relevant among the links above is

https://lkml.org/lkml/2010/2/12/144

which indicates that half-erased blocks might cause problems with certain 
boot loaders, but again, that's a problem with the bootloader, not UBI.

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30


* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-10  8:44                 ` Ricard Wanderlof
@ 2014-11-10  9:08                   ` Richard Weinberger
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Weinberger @ 2014-11-10  9:08 UTC (permalink / raw)
  To: Ricard Wanderlof; +Cc: linux-mtd@lists.infradead.org, Scott Branden

Am 10.11.2014 um 09:44 schrieb Ricard Wanderlof:
> 
> On Sun, 9 Nov 2014, Richard Weinberger wrote:
> 
>> Am 07.11.2014 um 18:31 schrieb Scott Branden:
>>> On 14-11-07 12:45 AM, Richard Weinberger wrote:
>>>> Am 06.11.2014 um 22:56 schrieb Scott Branden:
>>>>> It looks like the erase happening in the middle of reboot was uncovered in 2009 and never addressed properly?
>>>>>
>>>>> https://lkml.org/lkml/2009/6/9/16
>>>>> https://lkml.org/lkml/2010/2/12/144
>>>>>
>>>>> Was there a proper resolution to this issue?
>>>>
>>>> Did you read the threads you've posted?
>>>>
>>>> There are two answers:
>>>> https://lkml.org/lkml/2010/2/12/143
>>> Yes, there is no hardware solution to a reset happening in the middle of an erase operation to NAND.
>>
>> Well, I agree with David that anything we do in software will only hide the real problem
>> or trim down the window.
> 
> There's something I don't understand here. It could be (and probably will 
> prove to be) my lack of knowledge on the detailed workings of UBI.
> 
> Back in jffs2 days, erased blocks were so indicated by writing a 
> 'cleanmarker' pattern to the OOB area. Thus, when scanning the flash, if a 
> block was encountered which appeared erased but lacked the cleanmarker, it 
> was re-erased just in case the previous erase was interrupted and 
> therefore did not leave the bits in a properly erased state.
> 
> With ubifs, cleanmarkers are not used (partly because MLC flashes wouldn't 
> support two writes to the OOB area: one for the cleanmarker and one for 
> the ECC), but there _is_ a header at the start of each PEB. Thus the same 
> situation really holds, if a (seemingly) erased PEB is encountered with no 
> EC header, it could be considered the leftover of an unfinished erase 
> operation. I don't know for a fact if (or how) UBI does this though.
> 
> Of course, an interrupted erase operation could leave a block in a
> seemingly un-erased state, i.e. the data appears intact (but may not be).
> But in that case the block would already be superseded by another block
> (i.e. any potential data would have already been copied to another block
> with the header info invalidating the old one). So in this case the block
> would go on an erase list at some point because it is no longer valid.
>
> Since interrupted erase seems to be so much of a concern I've obviously
> missed something above. But I can't figure out what.
> 
> The only thing that seems relevant among the links above is
> 
> https://lkml.org/lkml/2010/2/12/144
> 
> which indicates that half-erased blocks might cause problems with certain 
> boot loaders, but again, that's a problem with the bootloader, not UBI.

Correct. UBI can deal with that, if some component in your "NAND-Chain" does not, it
needs fixing.
Changing UBI/MTD in a way to hide such issues is not a good solution IMHO.
In the old thread the idea was rejected by both the UBI and the MTD maintainer.

Thanks,
//richard


* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-05  8:32 suspect UBIFS async operations causing issues during reboot Scott Branden
  2014-11-05  9:22 ` Richard Weinberger
@ 2014-11-12 11:20 ` Artem Bityutskiy
  2014-11-15  3:30   ` Scott Branden
  1 sibling, 1 reply; 19+ messages in thread
From: Artem Bityutskiy @ 2014-11-12 11:20 UTC (permalink / raw)
  To: Scott Branden; +Cc: linux-mtd

Hi Scott,

Sorry for the late reply, but better late than never.

On Wed, 2014-11-05 at 00:32 -0800, Scott Branden wrote:
> Over 1000's of reboots we eventually find that the NAND has 
> uncorrectable ECC errors reported on a random page when it is mounted.

How do you find the uncorrectable errors? Do you scan the entire NAND
chip after you boot up? Or do you read all files stored in the UBIFS
file-system, or do you not do anything special, just mount and notice
ECC error messages in dmesg? Does UBIFS fail to mount?

What is the time window in which a power cut may lead to problems in your
NAND? And how are these problems seen by the software? I mean, what
happens to the data? Can it become "mostly OK", except for one or a few
pages with too many bit-flips? I understand that during erase all 0 bits
"become 1s", but not instantaneously, so in case of an interruption they
may read as 1 or 0 randomly. But the bits which were 1s: nothing
happens to them, they stay 1s?

> We suspect the problem is the asynchronous nature of the UBIFS 
> operations.  Perhaps the small write buffer that can take 3-5 seconds to 
> be written or some other operation occuring in UBI/UBIFS?  I don't think 
> the shutdown of the filesystem is dealing with all the threads properly.

Yes, writes are asynchronous. There is the write-buffer, which is the size
of a NAND page, and there is Linux write-back, which flushes dirty data in
the background (standard stuff for all file-systems).

> <REBOOT happens here with NAND ERASE COMMAND in progress corrupting 
> 0x18700000 NAND Addresses!>  Corrupted NAND only happens when erase 
> operation in progress when restarting system happens.

I acknowledge that there may be problems with interrupted erase. We saw
them in case of NOR, where erase is very slow and it is easy to
interrupt it. We never saw this for NAND, but I may well imagine that
this may be an issue in case of NAND.

For NOR, we mitigated the issue by "invalidating" the PEB before
erasing. Check the 'nor_erase_prepare()' function in
'drivers/mtd/ubi/io.c' and its comments.

The first thing you may try is - add a similar quick hack to UBI and
invalidate the first NAND page or the first 2 NAND pages (depends on
whether you use sub-pages or not).

You can just write all zeroes. The point is to corrupt data, so that the
subsequent read results in a CRC check failure.
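
To illustrate why writing zeroes is enough, here is a small standalone C model. It is a sketch under stated assumptions, not UBI code: the 16-byte header layout, the helper names (flash_program(), hdr_init(), hdr_valid(), hdr_invalidate()) and the use of the standard CRC-32 are all made up for the example. It relies on one property of flash: programming can only clear bits, so an all-zero pattern can be written over existing data without an erase.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HDR_LEN 16	/* model layout: 4B magic, 8B erase counter, 4B CRC */

/* Standard bitwise CRC-32 (a stand-in for whatever CRC the header uses). */
uint32_t crc32_calc(const uint8_t *p, size_t n)
{
	uint32_t crc = 0xFFFFFFFFu;

	while (n--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Flash programming model: bits can only go from 1 to 0. */
void flash_program(uint8_t *dst, const uint8_t *src, size_t n)
{
	for (size_t i = 0; i < n; i++)
		dst[i] &= src[i];
}

/* Build a valid model header in RAM (big-endian fields). */
void hdr_init(uint8_t *h, uint64_t ec)
{
	uint32_t crc;

	memcpy(h, "UBI#", 4);
	for (int i = 0; i < 8; i++)
		h[4 + i] = (uint8_t)(ec >> (56 - 8 * i));
	crc = crc32_calc(h, 12);
	for (int i = 0; i < 4; i++)
		h[12 + i] = (uint8_t)(crc >> (24 - 8 * i));
}

/* A header is valid only if the magic and the stored CRC both match. */
int hdr_valid(const uint8_t *h)
{
	uint32_t stored = 0;

	if (memcmp(h, "UBI#", 4) != 0)
		return 0;
	for (int i = 0; i < 4; i++)
		stored = (stored << 8) | h[12 + i];
	return stored == crc32_calc(h, 12);
}

/* "Invalidate before erase": program all zeroes over the header. */
void hdr_invalidate(uint8_t *h)
{
	const uint8_t zeroes[HDR_LEN] = { 0 };

	flash_program(h, zeroes, HDR_LEN);
}
```

Once the header has been zeroed, the next scan rejects it no matter how far the following erase got, which is the effect the quick hack is after.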

See what happens.


Some general notes.

In general, if UBI or UBIFS decided to erase an LEB, the data in there
is no longer needed. E.g., when GC of UBIFS moves all the valid data
to another PEB, the older PEB is not needed, so it is scheduled for
erasure. The erasure happens asynchronously. If you have a power cut
and the PEB erase operation is interrupted, you may end up with a PEB
which is "mostly fine", so next time you mount UBIFS it may start
reading from it (e.g., if this was a journal PEB), and get errors.

Now, my point is that this should not be a fundamental problem for
UBIFS. This should be fixable. It may need good UBIFS knowledge to fix,
and time, though.

One way to deal with this is to emulate erase interruptions at the UBI
level, similar to how we implemented the power cut testing infrastructure
in UBIFS.
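
As a toy illustration of what such an emulation could look like, the sketch below models an erase that loses power part-way through. It is a deliberately simplified assumption: real NAND leaves scattered bits un-erased rather than a clean prefix/suffix split, and emulated_erase() and count_unerased() are hypothetical names, not existing UBI functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Erase len bytes of a PEB image, but "cut power" after cut_at bytes.
 * Passing cut_at >= len models an erase that runs to completion.
 * Returns 1 if the erase completed, 0 if it was interrupted.
 */
int emulated_erase(uint8_t *peb, size_t len, size_t cut_at)
{
	size_t done = cut_at < len ? cut_at : len;

	memset(peb, 0xFF, done);	/* erased bytes read back as 0xFF */
	return done == len;
}

/* Count bytes that still hold old (non-0xFF) data after an erase attempt. */
size_t count_unerased(const uint8_t *peb, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++)
		if (peb[i] != 0xFF)
			n++;
	return n;
}
```

A test harness built on this idea would run the normal attach/scan path against the half-erased image and check that the PEB is either rejected or re-erased, never read as valid data.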

On the other hand, if you can invalidate the PEB before you start
erasing, this should just solve the problem. So I'd start with this, and
see what happens. You may have more than one type of issue, so fixing
the erase interrupt issue quickly this way may let you exclude this
type of problem. And generally, I am not opposed to this solution in
upstream too, if it works for everyone.


* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-12 11:20 ` Artem Bityutskiy
@ 2014-11-15  3:30   ` Scott Branden
  0 siblings, 0 replies; 19+ messages in thread
From: Scott Branden @ 2014-11-15  3:30 UTC (permalink / raw)
  To: dedekind1; +Cc: linux-mtd

Hi Artem,

Thanks for your response.  We have completed our testing and solved the
issue by adding a reboot notifier.  Reboot notifiers were added to
chips/cfi_cmdset_0002.c and chips/cfi_cmdset_0001.c to solve the problem
on NOR devices 5 years ago.

See comments inline and proposed fix at bottom - I can then send out an 
patch for review.

On 14-11-12 03:20 AM, Artem Bityutskiy wrote:
> Hi Scott,
>
> Sorry for the late reply, but better late than never.
>
> On Wed, 2014-11-05 at 00:32 -0800, Scott Branden wrote:
>> Over 1000's of reboots we eventually find that the NAND has
>> uncorrectable ECC errors reported on a random page when it is mounted.
>
> How do you find the uncorrectable errors? Do you scan the entire NAND
> chip after you boot up? Or do you read all files stored in the UBIFS
> file-system, or do you not do anything special, just mount and notice
> ECC error messages in dmesg? Does UBIFS fail to mount?
We just mount and notice the ECC error messages.  UBIFS does not fail to 
mount, it handles the situation.  But there shouldn't be error messages 
generated in the first place due to a reboot.
>
> What is the time window in which a power cut may lead to problems in your
> NAND? And how are these problems seen by the software? I mean, what
> happens to the data? Can it become "mostly OK", except for one or a few
> pages with too many bit-flips? I understand that during erase all 0 bits
> "become 1s", but not instantaneously, so in case of an interruption they
> may read as 1 or 0 randomly. But the bits which were 1s: nothing
> happens to them, they stay 1s?
Yes, the bits are in the middle of erase so most are 1's and some are 
still 0.
>
>> We suspect the problem is the asynchronous nature of the UBIFS
>> operations.  Perhaps the small write buffer that can take 3-5 seconds to
>> be written or some other operation occuring in UBI/UBIFS?  I don't think
>> the shutdown of the filesystem is dealing with all the threads properly.
>
> Yes, writes are asynchronous. There is the write-buffer, which is the size
> of a NAND page, and there is Linux write-back, which flushes dirty data in
> the background (standard stuff for all file-systems).
>
>> <REBOOT happens here with NAND ERASE COMMAND in progress corrupting
>> 0x18700000 NAND Addresses!>  Corrupted NAND only happens when erase
>> operation in progress when restarting system happens.
>
> I acknowledge that there may be problems with interrupted erase. We saw
> them in case of NOR, where erase is very slow and it is easy to
> interrupt it. We never saw this for NAND, but I may well imagine that
> this may be an issue in case of NAND.
Yes, we hit the situation.
>
> For NOR, we mitigated the issue by "invalidating" the PEB before
> erasing. Check the 'nor_erase_prepare()' function in
> 'drivers/mtd/ubi/io.c' and its comments.
>
> The first thing you may try is - add a similar quick hack to UBI and
> invalidate the first NAND page or the first 2 NAND pages (depends on
> whether you use sub-pages or not).
>
> You can just write all zeroes. The point is to corrupt data, so that the
> subsequent read results in a CRC check failure.
>
> See what happens.
>
>
> Some general notes.
>
> In general, if UBI or UBIFS decided to erase an LEB, the data in there
> is no longer needed. E.g., when GC of UBIFS moves all the valid data
> to another PEB, the older PEB is not needed, so it is scheduled for
> erasure. The erasure happens asynchronously. If you have a power cut
> and the PEB erase operation is interrupted, you may end up with a PEB
> which is "mostly fine", so next time you mount UBIFS it may start
> reading from it (e.g., if this was a journal PEB), and get errors.
>
> Now, my point is that this should not be a fundamental problem for
> UBIFS. This should be fixable. It may need good UBIFS knowledge to fix,
> and time, though.
>
> One way to deal with this is to emulate erase interruptions at the UBI
> level, similar to how we implemented the power cut testing infrastructure
> in UBIFS.
>
> On the other hand, if you can invalidate the PEB before you start
> erasing, this should just solve the problem. So I'd start with this, and
> see what happens. You may have more than one type of issue, so fixing
> the erase interrupt issue quickly this way may let you exclude this
> type of problem. And generally, I am not opposed to this solution in
> upstream too, if it works for everyone.
We added nand_shutdown() to nand_base.c:

+/**
+ * nand_shutdown - [NAND Interface] finish the current nand operation and
+ *                 prevent further operations
+ * @mtd: MTD device structure
+ */
+int nand_shutdown(struct mtd_info *mtd)
+{
+	return nand_get_device(mtd, FL_SHUTDOWN);
+}
+EXPORT_SYMBOL_GPL(nand_shutdown);

We call nand_shutdown routine from the reboot notifier we add in our 
iproc driver (to be upstreamed soon).

+static int iproc_nand_reboot_notifier(struct notifier_block *n,
+				      unsigned long state,
+				      void *cmd)
+{
+	struct mtd_info *mtd;
+
+	mtd = container_of(n, struct mtd_info, reboot_notifier);
+	nand_shutdown(mtd);
+	return NOTIFY_DONE;
+}
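
The effect of taking the device with FL_SHUTDOWN and never releasing it can be pictured with a small userspace model. This is only a sketch of the idea: the model_* names and the three-state enum are hypothetical, and where the model returns -EBUSY the kernel's nand_get_device() would put the caller to sleep until the chip becomes ready.

```c
#include <errno.h>

/* Simplified stand-ins for the chip states (FL_READY, FL_ERASING, ...). */
enum chip_state { ST_READY, ST_ERASING, ST_SHUTDOWN };

struct chip_model {
	enum chip_state state;
};

/*
 * Take the chip for new_state. Only a READY chip can be taken.
 * Model simplification: report -EBUSY instead of sleeping.
 */
int model_get_device(struct chip_model *c, enum chip_state new_state)
{
	if (c->state != ST_READY)
		return -EBUSY;
	c->state = new_state;
	return 0;
}

/* Called by the current holder when its operation has finished. */
void model_release_device(struct chip_model *c)
{
	c->state = ST_READY;
}

/*
 * Shutdown takes the chip like any other operation, so it can only
 * succeed once an in-flight erase has been released. It is never
 * released afterwards, so every later request is refused.
 */
int model_shutdown(struct chip_model *c)
{
	return model_get_device(c, ST_SHUTDOWN);
}
```

In this model an erase that is already running finishes first, the shutdown then takes the chip, and any erase the UBI background thread tries to start afterwards never reaches the hardware.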

If the reboot notifier could always be registered somewhere in the MTD
core, could it be moved out of the driver so that it is always called?


* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-10  5:10                 ` Scott Branden
@ 2014-11-26  8:17                   ` Brian Norris
  2014-11-26  8:30                     ` Richard Weinberger
  2014-11-27 19:07                     ` Scott Branden
  0 siblings, 2 replies; 19+ messages in thread
From: Brian Norris @ 2014-11-26  8:17 UTC (permalink / raw)
  To: Scott Branden
  Cc: Ricard Wanderlof, Richard Weinberger,
	linux-mtd@lists.infradead.org

On Sun, Nov 09, 2014 at 09:10:03PM -0800, Scott Branden wrote:
> On 14-11-09 02:20 AM, Richard Weinberger wrote:
> >Well, I agree with David that anything we do in software will only hide the real problem
> >or trim down the window.
> Hi Richard,
> 
> Currently the NAND does not shut down in a clean manner for a reboot
> operation.  This is due to the asynchronous ubi_thread making flash
> erase calls.  unmount is done properly in ubi already and cleanly
> shuts down.  reboot is not done in a clean manner as there is no
> reboot_notifier to handle the situation.
> 
> This is not hiding a real problem.  It is just shutting down ubi
> properly rather than pulling the power from it in the middle of
> operations.
> 
> In addition to this - a reboot_notifier needs to be added at the mtd
> level to shut it down properly as well.
> 
> This is not trimming down a window.  It is having the drivers shut
> down properly so they do not look like a power failure to the NAND
> device.
> 
> There is no solution to the power failure - it will corrupt pages in
> the middle of erasure.  And you do handle this in UBI/UBIFS.  But
> why corrupt other erase pages unnecessarily when all that needs to
> be done is shut down the drivers properly.  I don't know what you
> are agreeing with David about.  It is not making a window smaller.
> It is changing the functionality so that the UBI and MTD drivers are
> shut down cleanly in reboot situations.  Right now, they are not
> shut down at all in these situations.

I agree with Scott's statements. While it's fine to talk about how all
layers (from bootloader to UBIFS) should be able to handle a power cut
in the midst of an erase, that does *not* mean that we should
intentionally deny the chance to shut down cleanly.

AFAICT, Scott's not trying to work around any unsound reset behaviors
(in UBIFS or in his bootloader); he's just trying to shut things down
gracefully, just as we would try to terminate processes, sync file
systems, etc., rather than just cutting power on reboot.

Brian


* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-26  8:17                   ` Brian Norris
@ 2014-11-26  8:30                     ` Richard Weinberger
  2014-11-26  9:25                       ` Brian Norris
  2014-11-27 19:07                     ` Scott Branden
  1 sibling, 1 reply; 19+ messages in thread
From: Richard Weinberger @ 2014-11-26  8:30 UTC (permalink / raw)
  To: Brian Norris, Scott Branden
  Cc: Ricard Wanderlof, linux-mtd@lists.infradead.org

Am 26.11.2014 um 09:17 schrieb Brian Norris:
> On Sun, Nov 09, 2014 at 09:10:03PM -0800, Scott Branden wrote:
>> On 14-11-09 02:20 AM, Richard Weinberger wrote:
>>> Well, I agree with David that anything we do in software will only hide the real problem
>>> or trim down the window.
>> Hi Richard,
>>
>> Currently the NAND does not shut down in a clean manner for a reboot
>> operation.  This is due to the asynchronous ubi_thread making flash
>> erase calls.  unmount is done properly in ubi already and cleanly
>> shuts down.  reboot is not done in a clean manner as there is no
>> reboot_notifier to handle the situation.
>>
>> This is not hiding a real problem.  It is just shutting down ubi
>> properly rather than pulling the power from it in the middle of
>> operations.
>>
>> In addition to this - a reboot_notifier needs to be added at the mtd
>> level to shut it down properly as well.
>>
>> This is not trimming down a window.  It is having the drivers shut
>> down properly so they do not look like a power failure to the NAND
>> device.
>>
>> There is no solution to the power failure - it will corrupt pages in
>> the middle of erasure.  And you do handle this in UBI/UBIFS.  But
>> why corrupt other erase pages unnecessarily when all that needs to
>> be done is shut down the drivers properly.  I don't know what you
>> are agreeing with David about.  It is not making a window smaller.
>> It is changing the functionality so that the UBI and MTD drivers are
>> shut down cleanly in reboot situations.  Right now, they are not
>> shut down at all in these situations.
> 
> I agree with Scott's statements. While it's fine to talk about how all
> layers (from bootloader to UBIFS) should be able to handle a power cut
> in the midst of an erase, that does *not* mean that we should
> intentionally deny the chance to shut down cleanly.
> 
> AFAICT, Scott's not trying to work around any unsound reset behaviors
> (in UBIFS or in his bootloader); he's just trying to shut things down
> gracefully, just as we would try to terminate processes, sync file
> systems, etc., rather than just cutting power on reboot.

If there is a solution which makes Artem and David happy, I'm perfectly fine. :)

Thanks,
//richard


* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-26  8:30                     ` Richard Weinberger
@ 2014-11-26  9:25                       ` Brian Norris
  0 siblings, 0 replies; 19+ messages in thread
From: Brian Norris @ 2014-11-26  9:25 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Ricard Wanderlof, linux-mtd@lists.infradead.org, Scott Branden

On Wed, Nov 26, 2014 at 09:30:17AM +0100, Richard Weinberger wrote:
> If there is a solution which makes Artem and David happy, I'm perfectly fine. :)

Artem already commented [1]:

"And generally, I am not opposed to this solution in
upstream too, if it works for everyone."

And he seemed agreeable to handling it in MTD (not UBI) on the thread
from a few years ago too.

David hasn't been too vocal recently, so I wouldn't bet on getting a
comment out of him. I'd be happy to be proven wrong.

Brian

[1] http://lists.infradead.org/pipermail/linux-mtd/2014-November/056500.html


* Re: suspect UBIFS async operations causing issues during reboot
  2014-11-26  8:17                   ` Brian Norris
  2014-11-26  8:30                     ` Richard Weinberger
@ 2014-11-27 19:07                     ` Scott Branden
  1 sibling, 0 replies; 19+ messages in thread
From: Scott Branden @ 2014-11-27 19:07 UTC (permalink / raw)
  To: Brian Norris
  Cc: Ricard Wanderlof, Richard Weinberger,
	linux-mtd@lists.infradead.org

On 14-11-26 12:17 AM, Brian Norris wrote:
> On Sun, Nov 09, 2014 at 09:10:03PM -0800, Scott Branden wrote:
>> On 14-11-09 02:20 AM, Richard Weinberger wrote:
>>> Well, I agree with David that anything we do in software will only hide the real problem
>>> or trim down the window.
>> Hi Richard,
>>
>> Currently the NAND does not shut down in a clean manner for a reboot
>> operation.  This is due to the asynchronous ubi_thread making flash
>> erase calls.  unmount is done properly in ubi already and cleanly
>> shuts down.  reboot is not done in a clean manner as there is no
>> reboot_notifier to handle the situation.
>>
>> This is not hiding a real problem.  It is just shutting down ubi
>> properly rather than pulling the power from it in the middle of
>> operations.
>>
>> In addition to this - a reboot_notifier needs to be added at the mtd
>> level to shut it down properly as well.
>>
>> This is not trimming down a window.  It is having the drivers shut
>> down properly so they do not look like a power failure to the NAND
>> device.
>>
>> There is no solution to the power failure - it will corrupt pages in
>> the middle of erasure.  And you do handle this in UBI/UBIFS.  But
>> why corrupt other erase pages unnecessarily when all that needs to
>> be done is shut down the drivers properly.  I don't know what you
>> are agreeing with David about.  It is not making a window smaller.
>> It is changing the functionality so that the UBI and MTD drivers are
>> shut down cleanly in reboot situations.  Right now, they are not
>> shut down at all in these situations.
>
> I agree with Scott's statements. While it's fine to talk about how all
> layers (from bootloader to UBIFS) should be able to handle a power cut
> in the midst of an erase, that does *not* mean that we should
> intentionally deny the chance to shut down cleanly.
>
> AFAICT, Scott's not trying to work around any unsound reset behaviors
> (in UBIFS or in his bootloader); he's just trying to shut things down
> gracefully, just as we would try to terminate processes, sync file
> systems, etc., rather than just cutting power on reboot.

Yes - I just want to gracefully shut down the system.  Brian's proposed
untested patch is the most generic approach.  We'll have to work on
testing it and get back to you.

Thanks,
Scott

>
> Brian
>


Thread overview: 19+ messages
2014-11-05  8:32 suspect UBIFS async operations causing issues during reboot Scott Branden
2014-11-05  9:22 ` Richard Weinberger
2014-11-05 17:56   ` Scott Branden
2014-11-05 18:21     ` Richard Weinberger
2014-11-05 22:52       ` Scott Branden
2014-11-06 21:56         ` Scott Branden
2014-11-07  8:45           ` Richard Weinberger
2014-11-07 17:31             ` Scott Branden
2014-11-09 10:20               ` Richard Weinberger
2014-11-10  5:10                 ` Scott Branden
2014-11-26  8:17                   ` Brian Norris
2014-11-26  8:30                     ` Richard Weinberger
2014-11-26  9:25                       ` Brian Norris
2014-11-27 19:07                     ` Scott Branden
2014-11-10  8:44                 ` Ricard Wanderlof
2014-11-10  9:08                   ` Richard Weinberger
2014-11-10  7:44         ` Tanya Brokhman
2014-11-12 11:20 ` Artem Bityutskiy
2014-11-15  3:30   ` Scott Branden
