From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from a.ns.miles-group.at ([95.130.255.143] helo=radon.swed.at)
 by bombadil.infradead.org with esmtps (Exim 4.80.1 #2 (Red Hat Linux))
 id 1Xm5DG-0006Zh-8m
 for linux-mtd@lists.infradead.org; Wed, 05 Nov 2014 18:21:47 +0000
Message-ID: <545A6AA0.8050901@nod.at>
Date: Wed, 05 Nov 2014 19:21:20 +0100
From: Richard Weinberger <richard@nod.at>
MIME-Version: 1.0
To: Scott Branden <sbranden@broadcom.com>,
 Richard Weinberger <richard.weinberger@gmail.com>
Subject: Re: suspect UBIFS async operations causing issues during reboot
References: <5459E090.1010300@broadcom.com>
 <CAFLxGvwTLd_uWBrj9RsD6FPFCSGsC_VcOmi_j0VLgVCJ=YVQ9w@mail.gmail.com>
 <545A64CF.20101@broadcom.com>
In-Reply-To: <545A64CF.20101@broadcom.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Cc: "linux-mtd@lists.infradead.org" <linux-mtd@lists.infradead.org>
List-Id: Linux MTD discussion mailing list <linux-mtd.lists.infradead.org>
List-Unsubscribe: <http://lists.infradead.org/mailman/options/linux-mtd>,
 <mailto:linux-mtd-request@lists.infradead.org?subject=unsubscribe>
List-Archive: <http://lists.infradead.org/pipermail/linux-mtd/>
List-Post: <mailto:linux-mtd@lists.infradead.org>
List-Help: <mailto:linux-mtd-request@lists.infradead.org?subject=help>
List-Subscribe: <http://lists.infradead.org/mailman/listinfo/linux-mtd>,
 <mailto:linux-mtd-request@lists.infradead.org?subject=subscribe>

Hi!

On 05.11.2014 at 18:56, Scott Branden wrote:
> Hi Richard,
> 
> Thanks for the feedback.  Comments inline.
> 
> On 14-11-05 01:22 AM, Richard Weinberger wrote:
>> On Wed, Nov 5, 2014 at 9:32 AM, Scott Branden <sbranden@broadcom.com> wrote:
>>> We are doing reboot testing with UBIFS on the 3.10 kernel with a new chipset
>>> we are working on.
>>>
>>> Over 1000's of reboots we eventually find that the NAND has uncorrectable
>>> ECC errors reported on a random page when it is mounted.
>>>
>>> We have found the problem is that a NAND erase operation is in progress when
>>> the reboot occurs. Since the NAND is in the middle of the erase operation
>>> the page is mostly FF with some random bits not erased when the reboot
>>> occurs.
>>>
>>> We suspect the problem is the asynchronous nature of the UBIFS operations.
>>> Perhaps the small write buffer that can take 3-5 seconds to be written or
>>> some other operation occurring in UBI/UBIFS?  I don't think the shutdown of
>>> the filesystem is dealing with all the threads properly.
>>
>> And what about powercuts?
> powercuts would exhibit the exact same behaviour as we are observing: the erase is interrupted by loss of power so the NAND block being erased would be in a partially erased
> state.  powercuts have little to do with the reboot sequence I am describing.
> 
>> UBI/UBIFS was designed to survive powercuts.
> Yes, this does not cause UBIFS to fail to survive the powercut.  It does cause blocks to not be erased properly.

Makes sense.

> The block that did not finish erasing is uncorrectable on the next boot-up:
> 
> [    1.330000] UBI: attaching mtd7 to ubi0
> [    2.000000] iproc_nand 18046000.nand: uncorrectable error at 0x18700000
> 
> The issue is that these blocks shouldn't be corrupted in the first place if UBI/UBIFS shuts down properly.
> 
>> If your NAND shows strange issues even after a clean reboot something nasty is
>> going on. Does your driver pass all UBI/MTD tests?
>>
> We are in the process of running the MTD tests.  But this appears to have nothing to do with a buggy driver or not.  The NAND driver will do what it is told to do.  If it is told
> to erase a block it will erase a block.  It can't control whether the system reboots in the middle of this operation.
> 
> This appears to be a UBI/UBIFS issue.  UBI/UBIFS operations are still going on after the filesystem is unmounted.  The shutdown process completes and a reboot happens.  My guess is
> these operations are due to the asynchronous threads of UBI/UBIFS not being handled properly during the shutdown process?
> 
> I have found other people have reported unexplained flash corruption. We backported this to the 3.10 kernel, which solved most of the flash corruption issues:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/super.c?id=807612db2f9940b9fa6deaef054eb16d51bd3e00
> 
> The only remaining flash corruption issue is due to the described issue of a reboot happening in the middle of an erase cycle.

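You can verify your hypothesis easily. Add a printk() to ubi_detach_mtd_dev(). This function shuts down UBI and also the background thread which does
all erase work. If the message never shows up before the reboot, UBI was not detached and the background thread may still be erasing.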
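
Since the failure only shows up after many reboots, it may help to check the
captured console output automatically. Below is a minimal sketch in Python,
assuming the serial console of the whole test run is saved to a text file and
that the added printk() prints the marker text shown. The marker string and the
helper names are made up for illustration; only the "[ timestamp] message" line
format is taken from the log above. A timestamp that goes backwards is treated
as the start of a new boot.

```python
import re

# Matches kernel log lines such as "[    1.330000] UBI: attaching mtd7 to ubi0".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s(.*)$")

# Hypothetical text of the printk() added to ubi_detach_mtd_dev().
MARKER = "UBI-DEBUG: ubi_detach_mtd_dev called"


def split_boots(lines):
    """Split a console capture into boot cycles.

    A kernel timestamp lower than the previous one is taken as the
    start of a new boot. Lines without a timestamp are skipped.
    """
    boots, current, last_ts = [], [], None
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # not a timestamped kernel message
        ts, msg = float(m.group(1)), m.group(2)
        if last_ts is not None and ts < last_ts:
            boots.append(current)
            current = []
        current.append(msg)
        last_ts = ts
    if current:
        boots.append(current)
    return boots


def boots_missing_detach(lines, marker=MARKER):
    """Return the indices of boot cycles whose log never shows the marker."""
    return [i for i, msgs in enumerate(split_boots(lines))
            if not any(marker in m for m in msgs)]
```

Feeding it the lines of the capture, e.g. boots_missing_detach(open(path)),
gives the boot cycles that went down without the detach message. Note that the
last cycle in the file is reported too if the capture ends before its shutdown.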

Thanks,
//richard