[RFC] UBI torture test fails to detect some bad blocks.

linux-mtd.lists.infradead.org archive mirror
* [RFC] UBI torture test fails to detect some bad blocks.
@ 2016-04-08  7:23 Arnaud Mouiche
  2016-04-08  7:23 ` [RFC] UBI: harden torture_peb to miss less " Arnaud Mouiche
  2016-04-08  8:24 ` [RFC] UBI torture test fails to detect some " Richard Weinberger
  0 siblings, 2 replies; 4+ messages in thread
From: Arnaud Mouiche @ 2016-04-08  7:23 UTC (permalink / raw)
  To: Artem Bityutskiy, Richard Weinberger, David Woodhouse,
	Brian Norris, linux-mtd, boris.brezillon, peterpansjtu
  Cc: Arnaud Mouiche

Hi all.

Here are some details about what I recently experienced with bad blocks on
an MX35LF1GE4AB spinand device (SLC, 1Gb, 4-bit ECC per 512-byte sub-page),
where a UBI partition is attached to manage rootfs & co (as usual).

I got my hands on some devices that refuse to boot.
Analysis of the Erase Counters shows that some PEBs were erased
more than 100K times, while the majority have an EC below 20!

Looking at the bad ones, they run the following scenario nearly in a loop:
- Linux reads some file inside the rootfs
- a bitflip is detected
- scrubbing is scheduled
- the scrubbing targets a PEB with a pretty high EC
- this high EC is itself due to frequent bitflips in the target PEB in the past
- while the PEB data is moved, a bitflip is detected, which schedules a torture test
- the torture test *ALWAYS* passes (whereas bitflips are *VERY* frequent for
  the same PEB when the read comes from a filesystem read).

So, it seems obvious the PEBs in question are bad PEBs.
The question now is why the torture test passes.

Reproducing the pattern test by hand on this block gives the same result.
But applying different patterns to different pages within the block shows that
the content of some pages is affected by the content of the other pages.
In particular, for this block, if the first page is full of FF and the rest
of the block is full of 00, I can count more than 100 bitflips (!)
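
To make the cross-page pattern above concrete, here is a minimal userspace
sketch in C. It is not the UBI code: fill_first_page_ff() and count_bitflips()
are hypothetical helpers, and the block and page sizes are left to the caller
because the chip geometry is not stated in this thread. It builds a block image
whose first page is 0xFF and whose other pages are 0x00, then counts the bits
that differ between that image and whatever was read back without ECC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: first page all 0xFF, every other page all 0x00. */
void fill_first_page_ff(uint8_t *block, size_t block_size, size_t page_size)
{
	size_t first;

	assert(page_size > 0);
	first = page_size < block_size ? page_size : block_size;

	memset(block, 0x00, block_size);
	memset(block, 0xFF, first);
}

/* Hypothetical helper: number of bits that differ between two buffers. */
unsigned int count_bitflips(const uint8_t *expected, const uint8_t *actual,
			    size_t len)
{
	unsigned int flips = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		uint8_t diff = expected[i] ^ actual[i];

		/* Clear the lowest set bit until no differing bit remains. */
		while (diff) {
			diff &= (uint8_t)(diff - 1);
			flips++;
		}
	}
	return flips;
}
```

The count is only meaningful when the block is read back in raw mode, since a
read through the ECC engine either hides the flips or fails outright.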

What kind of pattern should be added to detect this kind of issue?
We could think of testing every page one by one, but given the relatively large
number of pages in a block, that doesn't sound realistic.
The easiest way could be to use a random pattern, and try it a relatively low
number of times.
Indeed, this simple random test is effective at detecting every bad block of this device.
If the random test passes once (because it is a random test), chances are
that the next bitflip detection will trigger a new torture test, and in the end
the block will finally be detected as bad.

And the implementation is pretty obvious...

Arnaud

PS:
Yes, I know, spinand is not supported yet, but since there is a pending
effort for refactoring bbt & stuff for spinand inclusion, my driver implementation
is pretty meaningless. If somebody wants a look, no problem...
But my future role will be better spent testing and supporting various spinand
devices, since I own some samples from various manufacturers.

Arnaud Mouiche (1):
  UBI: harden torture_peb to miss less bad blocks.

 drivers/mtd/ubi/io.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

-- 
1.9.1

* [RFC] UBI: harden torture_peb to miss less bad blocks.
  2016-04-08  7:23 [RFC] UBI torture test fails to detect some bad blocks Arnaud Mouiche
@ 2016-04-08  7:23 ` Arnaud Mouiche
  2016-04-08  8:24 ` [RFC] UBI torture test fails to detect some " Richard Weinberger
  1 sibling, 0 replies; 4+ messages in thread
From: Arnaud Mouiche @ 2016-04-08  7:23 UTC (permalink / raw)
  To: Artem Bityutskiy, Richard Weinberger, David Woodhouse,
	Brian Norris, linux-mtd, boris.brezillon, peterpansjtu
  Cc: Arnaud Mouiche

Using a uniform pattern spread all over the block under test detects
some issues but not all.

e.g. bits in a page affected by bits at the same offset in another page.

Testing every case is difficult, but at least we can use fuzzing
techniques to discover such issues.

Signed-off-by: Arnaud Mouiche <arnaud.mouiche@invoxia.com>
---
 drivers/mtd/ubi/io.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/drivers/mtd/ubi/io.c b/drivers/mtd/ubi/io.c
index ed0bcb3..91ab744 100644
--- a/drivers/mtd/ubi/io.c
+++ b/drivers/mtd/ubi/io.c
@@ -89,6 +89,7 @@
 #include <linux/crc32.h>
 #include <linux/err.h>
 #include <linux/slab.h>
+#include <linux/random.h>
 #include "ubi.h"
 
 static int self_check_not_bad(const struct ubi_device *ubi, int pnum);
@@ -412,6 +413,8 @@ static uint8_t patterns[] = {0xa5, 0x5a, 0x0};
 static int torture_peb(struct ubi_device *ubi, int pnum)
 {
 	int err, i, patt_count;
+	uint8_t *peb_rnd_buff = NULL;
+	const int rnd_check_count = 4;
 
 	ubi_msg(ubi, "run torture test for PEB %d", pnum);
 	patt_count = ARRAY_SIZE(patterns);
@@ -456,12 +459,41 @@ static int torture_peb(struct ubi_device *ubi, int pnum)
 			goto out;
 		}
 	}
+	/* try a random pattern */
+	peb_rnd_buff = kmalloc(ubi->peb_size, GFP_KERNEL);
+	if (!peb_rnd_buff) {
+		err = 0;
+		goto out;
+	}
+	for (i = 0; i < rnd_check_count; i++) {
+		get_random_bytes(peb_rnd_buff, ubi->peb_size);
+
+		err = do_sync_erase(ubi, pnum);
+		if (err)
+			goto out;
+
+		err = ubi_io_write(ubi, peb_rnd_buff, pnum, 0, ubi->peb_size);
+		if (err)
+			goto out;
+
+		err = ubi_io_read(ubi, ubi->peb_buf, pnum, 0, ubi->peb_size);
+		if (err)
+			goto out;
+
+		if (memcmp(peb_rnd_buff, ubi->peb_buf, ubi->peb_size)) {
+			ubi_err(ubi, "random pattern checking failed for PEB %d",
+				pnum);
+			err = -EIO;
+			goto out;
+		}
+	}
 
 	err = patt_count;
 	ubi_msg(ubi, "PEB %d passed torture test, do not mark it as bad", pnum);
 
 out:
 	mutex_unlock(&ubi->buf_mutex);
+	kfree(peb_rnd_buff);
 	if (err == UBI_IO_BITFLIPS || mtd_is_eccerr(err)) {
 		/*
 		 * If a bit-flip or data integrity error was detected, the test
-- 
1.9.1

* Re: [RFC] UBI torture test fails to detect some bad blocks.
  2016-04-08  7:23 [RFC] UBI torture test fails to detect some bad blocks Arnaud Mouiche
  2016-04-08  7:23 ` [RFC] UBI: harden torture_peb to miss less " Arnaud Mouiche
@ 2016-04-08  8:24 ` Richard Weinberger
  2016-04-08  9:02   ` arnaud.mouiche
  1 sibling, 1 reply; 4+ messages in thread
From: Richard Weinberger @ 2016-04-08  8:24 UTC (permalink / raw)
  To: Arnaud Mouiche, Artem Bityutskiy, David Woodhouse, Brian Norris,
	linux-mtd, boris.brezillon, peterpansjtu

Hi!

On 08.04.2016 at 09:23, Arnaud Mouiche wrote:
> Hi all.
>
> Here are some details about what I recently experienced with bad blocks on
> an MX35LF1GE4AB spinand device (SLC, 1Gb, 4-bit ECC per 512-byte sub-page),
> where a UBI partition is attached to manage rootfs & co (as usual).
>
> I got my hands on some devices that refuse to boot.
> Analysis of the Erase Counters shows that some PEBs were erased
> more than 100K times, while the majority have an EC below 20!

Ouch.

> Looking at the bad ones, they run the following scenario nearly in a loop:
> - Linux reads some file inside the rootfs
> - a bitflip is detected
> - scrubbing is scheduled
> - the scrubbing targets a PEB with a pretty high EC
> - this high EC is itself due to frequent bitflips in the target PEB in the past
> - while the PEB data is moved, a bitflip is detected, which schedules a torture test
> - the torture test *ALWAYS* passes (whereas bitflips are *VERY* frequent for
>   the same PEB when the read comes from a filesystem read).
>
> So, it seems obvious the PEBs in question are bad PEBs.
> The question now is why the torture test passes.
>
> Reproducing the pattern test by hand on this block gives the same result.
> But applying different patterns to different pages within the block shows that
> the content of some pages is affected by the content of the other pages.
> In particular, for this block, if the first page is full of FF and the rest
> of the block is full of 00, I can count more than 100 bitflips (!)

100 flips per ECC step? Shouldn't this lead to an uncorrectable ECC error?
I have no idea how many bits your ECC can fix.
Which bitflip threshold do you have? UBI sees bitflips only after a threshold
is reached. If it is too low, UBI scrubs too often, which seems to be the case here.
It is perfectly fine to have bitflips.

So, we need to dig a bit deeper first.

> What kind of pattern should be added to detect this kind of issue?

This is a very hard question and almost impossible to answer as it is vendor
specific.

> We could think of testing every page one by one, but given the relatively large
> number of pages in a block, that doesn't sound realistic.
> The easiest way could be to use a random pattern, and try it a relatively low
> number of times.
> Indeed, this simple random test is effective at detecting every bad block of this device.
> If the random test passes once (because it is a random test), chances are
> that the next bitflip detection will trigger a new torture test, and in the end
> the block will finally be detected as bad.

Having an additional random pattern is not a bad idea.
This is definitely something we can consider adding to UBI.
But I'm not happy with your implementation.

peb_rnd_buff = kmalloc(ubi->peb_size, GFP_KERNEL);
... is a big no-no. peb_size can be a few megabytes.

What about repeating a few random bytes over and over?

> And the implementation is pretty obvious...

;-)

Thanks,
//richard

* Re: [RFC] UBI torture test fails to detect some bad blocks.
  2016-04-08  8:24 ` [RFC] UBI torture test fails to detect some " Richard Weinberger
@ 2016-04-08  9:02   ` arnaud.mouiche
  0 siblings, 0 replies; 4+ messages in thread
From: arnaud.mouiche @ 2016-04-08  9:02 UTC (permalink / raw)
  To: Richard Weinberger, Artem Bityutskiy, David Woodhouse,
	Brian Norris, linux-mtd, boris.brezillon, peterpansjtu



On 08/04/2016 10:24, Richard Weinberger wrote:
> Hi!
>
> On 08.04.2016 at 09:23, Arnaud Mouiche wrote:
>> Hi all.
>>
>> Here are some details about what I recently experienced with bad blocks on
>> an MX35LF1GE4AB spinand device (SLC, 1Gb, 4-bit ECC per 512-byte sub-page),
>> where a UBI partition is attached to manage rootfs & co (as usual).
>>
>> I got my hands on some devices that refuse to boot.
>> Analysis of the Erase Counters shows that some PEBs were erased
>> more than 100K times, while the majority have an EC below 20!
> Ouch.
>
>> Looking at the bad ones, they run the following scenario nearly in a loop:
>> - Linux reads some file inside the rootfs
>> - a bitflip is detected
>> - scrubbing is scheduled
>> - the scrubbing targets a PEB with a pretty high EC
>> - this high EC is itself due to frequent bitflips in the target PEB in the past
>> - while the PEB data is moved, a bitflip is detected, which schedules a torture test
>> - the torture test *ALWAYS* passes (whereas bitflips are *VERY* frequent for
>>   the same PEB when the read comes from a filesystem read).
>>
>> So, it seems obvious the PEBs in question are bad PEBs.
>> The question now is why the torture test passes.
>>
>> Reproducing the pattern test by hand on this block gives the same result.
>> But applying different patterns to different pages within the block shows that
>> the content of some pages is affected by the content of the other pages.
>> In particular, for this block, if the first page is full of FF and the rest
>> of the block is full of 00, I can count more than 100 bitflips (!)
> 100 flips per ECC step? Shouldn't this lead to an uncorrectable ECC error?
Yes, the hardware ECC obviously says "I can't manage".
I just compared the expected pattern (FF) with the page content read
without ECC.
> I have no idea how many bits your ECC can fix.
4 bits per 512 bytes, which looks large enough for SLC. And since the
ECC hardware is embedded inside the spinand chip, we can expect the
manufacturer to have chosen its strength correctly.
> Which bitflip threshold do you have? UBI sees bitflips only after a threshold
Yes, I noticed that.
In the early NAND datasheet, the ECC status register just says "no
errors", "1-4 corrected bits" or "uncorrectable bits".
So the threshold was set to 1 at that time.
Then I changed the driver implementation so that, in the "1-4 corrected bits"
case, it reads back the page without ECC and counts the exact number of errors.
So now, the threshold is set to 3.
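
The counting described above can be sketched in userspace C as follows. This
is an illustration only, not the driver in question (which is not shown in
this thread): max_bitflips_per_step() and reaches_threshold() are hypothetical
names, and the comparison against the ECC-corrected copy of the page is an
assumption about how the exact count is obtained. The second helper reflects
the usual MTD convention that a read is reported as having bitflips once the
worst ECC step reaches the configured threshold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: worst bitflip count found in any single ECC step,
 * comparing the ECC-corrected page with the same page read without ECC.
 */
unsigned int max_bitflips_per_step(const uint8_t *corrected,
				   const uint8_t *raw,
				   size_t len, size_t step_size)
{
	unsigned int worst = 0;
	unsigned int flips = 0;
	size_t i;

	assert(step_size > 0);
	for (i = 0; i < len; i++) {
		uint8_t diff = corrected[i] ^ raw[i];

		if (i % step_size == 0)
			flips = 0;	/* a new ECC step starts here */

		/* Clear the lowest set bit until no differing bit remains. */
		while (diff) {
			diff &= (uint8_t)(diff - 1);
			flips++;
		}
		if (flips > worst)
			worst = flips;
	}
	return worst;
}

/* Hypothetical helper: nonzero once the worst step reaches the threshold. */
int reaches_threshold(unsigned int max_bitflips, unsigned int threshold)
{
	return max_bitflips >= threshold;
}
```

With a 512-byte step and a threshold of 3, a page whose worst step shows 1 or
2 flipped bits stays silent, while 3 or 4 flipped bits in one step would be
reported and lead to scrubbing.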

> is reached. If it is too low, UBI scrubs too often, which seems to be the case here.
> It is perfectly fine to have bitflips.
>
> So, we need to dig a bit deeper first.
>
>> What kind of pattern should be added to detect this kind of issue?
> This is a very hard question and almost impossible to answer as it is vendor
> specific.
>
>> We could think of testing every page one by one, but given the relatively large
>> number of pages in a block, that doesn't sound realistic.
>> The easiest way could be to use a random pattern, and try it a relatively low
>> number of times.
>> Indeed, this simple random test is effective at detecting every bad block of this device.
>> If the random test passes once (because it is a random test), chances are
>> that the next bitflip detection will trigger a new torture test, and in the end
>> the block will finally be detected as bad.
> Having an additional random pattern is not a bad idea.
> This is definitely something we can consider adding to UBI.
> But I'm not happy with your implementation.
>
> peb_rnd_buff = kmalloc(ubi->peb_size, GFP_KERNEL);
> ... is a big no-no. peb_size can be a few megabytes.
>
> What about repeating a few random bytes over and over?
You must not repeat the same page content; otherwise, pages don't affect
each other.
But it is true we can fill ubi->peb with a repeated random pattern of a
prime length (e.g. 353 bytes), so it is short enough for a small kmalloc.
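
A minimal userspace sketch of that idea, under stated assumptions:
fill_repeating() and page_phase() are hypothetical helpers, not UBI functions,
and the caller supplies the short pattern (in the kernel it could come from
get_random_bytes()). Because the pattern length is prime and does not divide
the page size, each page starts at a different offset inside the pattern, so
neighbouring pages do not carry identical content.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PATTERN_LEN 353	/* prime length suggested above */

/* Hypothetical helper: tile a short pattern over a large buffer. */
void fill_repeating(uint8_t *buf, size_t buf_len,
		    const uint8_t *pattern, size_t pattern_len)
{
	size_t i;

	assert(pattern_len > 0);
	for (i = 0; i < buf_len; i++)
		buf[i] = pattern[i % pattern_len];
}

/* Hypothetical helper: offset inside the pattern where a page starts. */
size_t page_phase(size_t page, size_t page_size, size_t pattern_len)
{
	assert(pattern_len > 0);
	return (page * page_size) % pattern_len;
}
```

Assuming a 2048-byte page (the page size is not stated in this thread), page 1
starts at offset 283 of the pattern and page 2 at offset 213, so only a
353-byte allocation is needed while every page of the block still differs from
its neighbours.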

Otherwise what we can do is to simply:
- fill ubi->peb_buf with our random pattern
- ubi_io_write(ubi, ubi->peb
- ubi_io_read(ubi, ubi->peb...
- and just trust the ubi_io_read result
The memcmp is actually a paranoid check isn't it ?

Regards,
Arnaud

>
>> And the implementation is pretty obvious...
> ;-)
>
> Thanks,
> //richard
