Date: Wed, 30 Dec 2015 18:53:48 +0100
From: Boris Brezillon
To: "Franklin S Cooper Jr."
Subject: Re: Testing generic empty page bit flips recovery
Message-ID: <20151230185348.4f0ad161@bbrezillon>
In-Reply-To: <56841842.7060102@ti.com>
List-Id: Linux MTD discussion mailing list

On Wed, 30 Dec 2015 11:45:38 -0600
"Franklin S Cooper Jr." wrote:

> On 12/30/2015 10:59 AM, Boris Brezillon wrote:
> > On Wed, 30 Dec 2015 10:40:49 -0600
> > "Franklin S Cooper Jr." wrote:
> >
> >> On 12/30/2015 10:02 AM, Boris Brezillon wrote:
> >>> On Wed, 30 Dec 2015 09:33:52 -0600
> >>> "Franklin S Cooper Jr." wrote:
> >>>
> >>>> On 12/30/2015 08:40 AM, Boris Brezillon wrote:
> >>>>> Hi Franklin,
> >>>>>
> >>>>> On Wed, 30 Dec 2015 08:10:20 -0600
> >>>>> "Franklin S Cooper Jr." wrote:
> >>>>>
> >>>>>> I am trying to follow up on the discussion in this patch
> >>>>>> set (https://patchwork.ozlabs.org/patch/539059/), which
> >>>>>> suggested that Michael instead test the generic bitflip
> >>>>>> recovery implemented by Boris's "mtd: nand: properly
> >>>>>> handle bitflips in erased pages" patchset
> >>>>>> (http://lists.infradead.org/pipermail/linux-mtd/2015-September/061617.html).
> >>>>>> I would like to test Boris's patchset, but first I need to
> >>>>>> recreate the error that his patch fixes.
> >>>>>>
> >>>>>> The error that the patchset is attempting to fix isn't
> >>>>>> something I have ever encountered before. Currently I am
> >>>>>> trying to reproduce this issue on a TI K2E EVM that uses the
> >>>>>> davinci NAND driver. I flashed the NAND's file-system
> >>>>>> partition with a UBI filesystem, and the board is currently
> >>>>>> set to boot using the file-system on the NAND. After about
> >>>>>> 60 seconds I cut power to the board and boot it
> >>>>>> again. I would expect the board to eventually
> >>>>>> fail to mount the UBI filesystem, but so far the board has
> >>>>>> run for over 24 hours, been powered on and off over 1400 times,
> >>>>>> and it's still mounting the file-system perfectly fine.
> >>>>>>
> >>>>>> Any suggestions on a test case that I can use to force the
> >>>>>> empty page bit flips error?
> >>>>>>
> >>>>> The davinci driver seems to support raw accesses, so you can try to
> >>>>> apply this patch [1] against the mtd-utils tree (not sure it still
> >>>>> applies cleanly, but it should work with mtd-utils-1.5.1), and use the
> >>>>> nandflipbits tool:
> >>>>>
> >>>>> # flash_erase /dev/mtdX 1
> >>>>> # nandflipbits /dev/mtdX 1@
> >>>>> # nanddump -f /tmp/dump -s -l /dev/mtdX
> >>>>>
> >>>>> Without the patch, nanddump should complain about uncorrectable errors,
> >>>>> and if you hexdump /tmp/dump you should see the bitflip.
> >>>>> If nanddump does not complain after applying my patch, then it means it
> >>>>> fixes the "bitflips in erased pages" bug.
> >>>>>
> >>>>> Best Regards,
> >>>>>
> >>>>> Boris
> >>>>>
> >>>>> [1] http://lists.infradead.org/pipermail/linux-mtd/2014-November/056634.html
> >>>> Hi Boris,
> >>>>
> >>>> Thanks for the quick reply. I built mtd-utils with your
> >>>> patch and ran the suggested commands on a 4.1-based kernel
> >>>> without your kernel patchset, and I didn't see your expected
> >>>> output.
> >>>> The 4.1-based kernel hasn't had any changes to
> >>>> davinci_nand or the NAND subsystem that would address this
> >>>> bitflip error.
> >>>>
> >>>> I'm now going to run the same test on the
> >>>> latest mainline.
> >>>>
> >>>> Here is the output I received when I ran your suggested
> >>>> commands on the 4.1-based kernel:
> >>>> root@k2e-evm:~# ./flash_erase /dev/mtd4 4096 1
> >>>> Erasing 128 Kibyte @ 0 -- 100 % complete
> >>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 1@4096
> >>>> root@k2e-evm:~# ./nanddump -f /tmp/dump -s 4096 -l 2048 /dev/mtd4
> >>>> ECC failed: 0
> >>>> ECC corrected: 0
> >>>> Number of bad blocks: 0
> >>>> Number of bbt blocks: 4
> >>>> Block size 131072, page size 2048, OOB size 64
> >>>> root@k2e-evm:~# hexdump /tmp/dump
> >>>> 0000000 fffd ffff ffff ffff ffff ffff ffff ffff
> >>>> 0000010 ffff ffff ffff ffff ffff ffff ffff ffff
> >>>> *
> >>>> 0000800
> >>>>
> >>>> Any thoughts on why I'm not seeing the expected error?
> >>>>
> >>> Oh, actually this behavior is explained in the commit message:
> >>>
> >>> "Currently empty page bit flips are not corrected and report 0 errors."
> >>>
> >>> That explains why you're seeing the bitflip in the dump, but nothing
> >>> reported by the MTD layer.
> >>>
> >>> After applying my patch, the bitflip should simply disappear. You can
> >>> then try to generate more bitflips than the engine can actually fix
> >>> (nandflipbits /dev/mtd4 1@0:5@0:49@0:98@0:132@0) and check that MTD
> >>> reports an uncorrectable error.
> >> I verified that I am indeed using ecc4bit mode.
> >>
> >> I attempted to run the series of nandflipbits commands as you
> >> suggested, but I get an "Invalid bit description" error from the
> >> utility. For some reason I can only use the nandflipbits
> >> utility for bits 1-7. Anything higher and I get the "Invalid
> >> bit description" error.
> > Indeed. I developed that tool a long time ago and didn't remember that
> > the bit field encodes the bit offset within a byte.
> > This command should work:
> >
> > nandflipbits /dev/mtd4 1@0:5@0:7@30:3@46:5@47
> >
> >> On the latest master commit I ran nandflipbits for bits 1-7
> >> at address 0. However, I still didn't receive any error from
> >> nanddump, although I do see the flipped bits in the hexdump
> >> of /tmp/dump.
> > How many of them do you see?
> >
> >> I then applied your patchset on top of the latest mainline
> >> and ran nandflipbits for bits 1-7 at address 0.
> >> I get the output below, which seems to be correct.
> >>
> >> root@k2e-evm:~# ./nandflipbits /dev/mtd4 1@0
> >> root@k2e-evm:~# ./nandflipbits /dev/mtd4 2@0
> >> root@k2e-evm:~# ./nandflipbits /dev/mtd4 3@0
> >> root@k2e-evm:~# ./nandflipbits /dev/mtd4 4@0
> >> root@k2e-evm:~# ./nandflipbits /dev/mtd4 5@0
> >> root@k2e-evm:~# ./nandflipbits /dev/mtd4 6@0
> >> root@k2e-evm:~# ./nandflipbits /dev/mtd4 7@0
> >> root@k2e-evm:~# ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
> >>
> >> ECC failed: 1
> >> ECC corrected: 18
> >> Number of bad blocks: 0
> >> Number of bbt blocks: 4
> >> Block size 131072, page size 2048, OOB size 64
> >> Dumping data starting at 0x00000000 and ending at 0x00000800...
> >> ECC: 4 corrected bitflip(s) at offset 0x00000000
> >> root@k2e-evm:~# hexdump /tmp/dump
> >> 0000000 ffff ffff ffff ffff ffff ffff ffff ffff
> >> *
> >> 0000800
> > Hm, that's weird. You should get an ECC failure, since the ECC strength
> > is only 4 bits/512 bytes and 7 bits have been flipped.
> >
> >> One thing that confuses me is that if I repeatedly call nanddump,
> >> I continue to get the "ECC: 4 corrected bitflip(s)" message,
> >> and the "ECC corrected" count increases by 4 each time. If
> >> these bits are being corrected, which is apparent from
> >> the nanddump output, shouldn't subsequent calls
> >> indicate that no bitflips needed to be corrected, since they
> >> were corrected previously?
> > Nope, they're corrected on the fly and only in RAM, so each time you
> > read the page, the bitflips will have to be fixed again, until you
> > erase and rewrite the faulty block.
> >
>
> Hi Boris,
>
> Here is the entire output, which should answer your questions.
>
> In the log I am running the following commands:
> flash_erase /dev/mtd4 0 0
> ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
> hexdump /tmp/dump
> ./nandflipbits /dev/mtd4 1@0:5@0:7@30:3@46:5@47
> ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
> hexdump /tmp/dump
>
> Output on mainline kernel without bitflip correction patches:
> http://pastebin.com/MgBVxALR
>
> Output on mainline kernel with bitflip correction patches:
> http://pastebin.com/NdKv0NhV
>
> For some reason only 1 bit is being corrected when
> using the bitflip correction patches. Comparing my logs from
> before and now, the only difference I see is that the ECC
> failed count is increasing while the ECC corrected count isn't changing.
>

That's what I was expecting: your ECC engine can only fix 4 bits per
512 bytes, which is why the erased-page bitflip correction fails when
more than 4 bits are flipped in a given 512-byte block.

Now try to flip only 4 bits instead of 5:

./nandflipbits /dev/mtd4 1@0:5@0:7@30:3@46

-- 
Boris Brezillon, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com