From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from a.ns.miles-group.at ([95.130.255.143] helo=radon.swed.at)
	by bombadil.infradead.org with esmtps (Exim 4.80.1 #2 (Red Hat Linux))
	id 1aoRia-0004jb-WC
	for linux-mtd@lists.infradead.org; Fri, 08 Apr 2016 08:24:42 +0000
Subject: Re: [RFC] UBI torture test fails to detect some bad blocks.
To: Arnaud Mouiche, Artem Bityutskiy, David Woodhouse, Brian Norris,
	linux-mtd@lists.infradead.org, boris.brezillon@free-electrons.com,
	peterpansjtu@gmail.com
References: <1460100214-31298-1-git-send-email-arnaud.mouiche@invoxia.com>
From: Richard Weinberger
Message-ID: <57076AAB.7050008@nod.at>
Date: Fri, 8 Apr 2016 10:24:11 +0200
MIME-Version: 1.0
In-Reply-To: <1460100214-31298-1-git-send-email-arnaud.mouiche@invoxia.com>
Content-Type: text/plain; charset=iso-8859-15
Content-Transfer-Encoding: 8bit
List-Id: Linux MTD discussion mailing list

Hi!

On 08.04.2016 at 09:23, Arnaud Mouiche wrote:
> Hi all.
>
> Just some details about what I experienced recently with some bad blocks on
> a MX35LF1GE4AB spinand device (SLC, 1Gb, 4 bits ECC per 512 sub-page),
> where a UBI partition is attached to manage rootfs & co (as usual).
>
> I got my hands on some devices refusing to boot.
> The analysis of the Erase Counters shows that some of them were erased
> more than 100K times, while the majority have an EC below 20!

Ouch.

> Looking at the bad ones, they run the following scenario nearly in a loop:
> - linux reads some file inside the rootfs
> - a bitflip is detected
> - scrubbing is scheduled.
> - the scrubbing targets a PEB with a pretty high EC,
> - this high EC is also due to frequent bitflips in the target PEB in the past.
> - while the PEB data are moved, a bitflip is detected, scheduling a torture test.
> - the torture test *ALWAYS* passes (whereas bitflips are *VERY* frequent for
>   the same PEB when the read comes from a filesystem read).
>
> So, it seems obvious the PEBs in question are bad PEBs.
> The question is now why the torture test passes.
>
> Reproducing the pattern test by hand on this block shows the same result.
> But applying different patterns on different pages within the block shows that
> the content of some pages is affected by the content of the other pages.
> In particular, for this block, if the first page is full of FF and the rest
> of the block is full of 00, I can count more than 100 bitflips (!)

100 flips per ECC step? Shouldn't this lead to an uncorrectable ECC error?
I have no idea how many bits your ECC can fix...

Which bitflip threshold do you have? UBI sees bitflips only after a threshold is reached.
If it is too low, UBI scrubs too often, which seems to be the case here.
It is perfectly fine to have bitflips.

So, we need to dig a bit deeper first.

> What kind of pattern should be added to detect this kind of issue?

This is a very hard question and almost impossible to answer, as it is vendor specific.

> We can think of testing every page one by one, but given the relatively large
> number of pages in a block, it doesn't sound realistic.
> The easiest way could be to use a random pattern, and try it a relatively low
> number of times.
> Indeed, this simple random test is efficient at detecting every bad block of this device.
> If the random test passes once (because this is a random test), there are chances
> that the next bitflip detection will trigger a new torture test, and in the end,
> the block will finally be detected as bad.

Having an additional random pattern is not a bad idea.
This is definitely something we can consider adding to UBI.

But I'm not happy with your implementation.

	peb_rnd_buff = kmalloc(ubi->peb_size, GFP_KERNEL);

... is a big no-no. peb_size can be a few megabytes, and kmalloc() needs
physically contiguous memory, so an allocation of that size can easily fail.
What about repeating a few random bytes over and over?

> And the implementation is pretty obvious... ;-)

Thanks,
//richard
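
To illustrate the "repeat a few random bytes" idea, here is a minimal userspace
sketch, not kernel code and not a proposed patch. All names (SEED_LEN,
make_seed, fill_chunk, count_bitflips) are hypothetical, and rand() merely
stands in for whatever random source the kernel side would use. The point is
that only a short seed and one I/O-sized chunk need to live in memory; the
offset argument lets the pattern continue across chunks of the same PEB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SEED_LEN 16 /* hypothetical length of the short random pattern */

/* Fill seed[] with pseudo-random bytes (rand() is a userspace stand-in). */
static void make_seed(uint8_t *seed, size_t seed_len)
{
	for (size_t i = 0; i < seed_len; i++)
		seed[i] = (uint8_t)(rand() & 0xff);
}

/*
 * Fill one chunk with the repeating seed. 'offset' is the chunk's byte
 * position inside the PEB, so consecutive chunks continue the pattern.
 */
static void fill_chunk(uint8_t *buf, size_t len, size_t offset,
		       const uint8_t *seed, size_t seed_len)
{
	assert(seed_len > 0);
	for (size_t i = 0; i < len; i++)
		buf[i] = seed[(offset + i) % seed_len];
}

/* Count the bits that differ between read-back data and the expected pattern. */
static size_t count_bitflips(const uint8_t *buf, size_t len, size_t offset,
			     const uint8_t *seed, size_t seed_len)
{
	size_t flips = 0;

	assert(seed_len > 0);
	for (size_t i = 0; i < len; i++) {
		uint8_t diff = buf[i] ^ seed[(offset + i) % seed_len];

		while (diff) {
			flips += diff & 1u;
			diff >>= 1;
		}
	}
	return flips;
}
```

One caveat, stated as an assumption rather than a measured result: if the seed
length divides the page size, every page ends up with identical content, which
may be less effective against the page-to-page interference described above. A
seed length that does not divide the page size would at least give each page a
shifted copy of the pattern.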