[PATCH] mtd: gpmi: Deal with bitflips in erased regions regions

public inbox for linux-mtd@lists.infradead.org
 help / color / mirror / Atom feed

* [PATCH] mtd: gpmi: Deal with bitflips in erased regions regions
@ 2013-12-17  8:01 Elie De Brauwer
  2013-12-17 10:17 ` Gupta, Pekon
  0 siblings, 1 reply; 6+ messages in thread
From: Elie De Brauwer @ 2013-12-17  8:01 UTC (permalink / raw)
  To: b32955, dwmw2, dedekind1, computersforpeace, shijie8
  Cc: Elie De Brauwer, linux-mtd

The BCH block typically used with a GPMI block on an i.MX28/i.MX6 is only
able to correct bitflips on data actually streamed through the block.
When erasing a block the data does not stream through the BCH block
and therefore no ECC data is written to the NAND chip. This causes
gpmi_ecc_read_page to return failure as soon as a single non-1-bit is
found in an erased page. Typically causing problems at higher levels
(ubifs corrupted empty space warnings). This problem was also observed
when using SLC NAND devices.

This patch configures the BCH block to mark a block as 'erased' if
no more than gf_len/2 bitflips are found. Next HW_BCH_STATUS0:ALLONES
is used to check if the data read were all ones, indicating a read of a
properly erased chunk was performed. If this was not the case a slow path
is entered where bitflips are counted and corrected in software,
allowing the upper layers to take proper actions.

Signed-off-by: Elie De Brauwer <eliedebrauwer@gmail.com>
Acked-by: Peter Korsgaard <peter@korsgaard.com>
---
 drivers/mtd/nand/gpmi-nand/bch-regs.h  |    2 ++
 drivers/mtd/nand/gpmi-nand/gpmi-lib.c  |   16 ++++++++++++
 drivers/mtd/nand/gpmi-nand/gpmi-nand.c |   44 +++++++++++++++++++++++++++++---
 drivers/mtd/nand/gpmi-nand/gpmi-nand.h |    1 +
 4 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/drivers/mtd/nand/gpmi-nand/bch-regs.h b/drivers/mtd/nand/gpmi-nand/bch-regs.h
index 588f537..a30502f 100644
--- a/drivers/mtd/nand/gpmi-nand/bch-regs.h
+++ b/drivers/mtd/nand/gpmi-nand/bch-regs.h
@@ -30,7 +30,9 @@
 #define BM_BCH_CTRL_COMPLETE_IRQ		(1 << 0)
 
 #define HW_BCH_STATUS0				0x00000010
+#define BM_BCH_STATUS0_ALLONES_MASK		(1 << 4)
 #define HW_BCH_MODE				0x00000020
+#define BM_BCH_MODE_ERASE_THRESHOLD_MASK	0xff
 #define HW_BCH_ENCODEPTR			0x00000030
 #define HW_BCH_DATAPTR				0x00000040
 #define HW_BCH_METAPTR				0x00000050
diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-lib.c b/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
index aaced29..15f5670 100644
--- a/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
+++ b/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
@@ -286,6 +286,13 @@ int bch_set_geometry(struct gpmi_nand_data *this)
 			| BF_BCH_FLASH0LAYOUT1_DATAN_SIZE(block_size, this),
 			r->bch_regs + HW_BCH_FLASH0LAYOUT1);
 
+	/*
+	 * Set the tolerance for bitflips when reading erased blocks
+	 * equal to half the gf_len.
+	 */
+	writel((gf_len / 2) & BM_BCH_MODE_ERASE_THRESHOLD_MASK,
+		r->bch_regs + HW_BCH_MODE);
+
 	/* Set *all* chip selects to use layout 0. */
 	writel(0, r->bch_regs + HW_BCH_LAYOUTSELECT);
 
@@ -1094,6 +1101,15 @@ int gpmi_is_ready(struct gpmi_nand_data *this, unsigned chip)
 	return reg & mask;
 }
 
+/* Returns 1 if the last transaction consisted only out of ones. */
+int gpmi_all_ones(struct gpmi_nand_data *this)
+{
+	struct resources *r = &this->resources;
+	uint32_t reg = readl(r->bch_regs + HW_BCH_STATUS0);
+
+	return reg & BM_BCH_STATUS0_ALLONES_MASK;
+}
+
 static inline void set_dma_type(struct gpmi_nand_data *this,
 					enum dma_ops_type type)
 {
diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
index dabbc14..81488a0 100644
--- a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
+++ b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
@@ -1012,6 +1012,30 @@ static void block_mark_swapping(struct gpmi_nand_data *this,
 	p[1] = (p[1] & mask) | (from_oob >> (8 - bit));
 }
 
+/*
+ * Count the number of 0 bits in a supposed to be
+ * erased region and correct them. Return the number
+ * of bitflips or zero when the region was correct.
+ */
+static unsigned int erased_sector_bitflips(unsigned char *data,
+					unsigned int chunk,
+					struct bch_geometry *geo)
+{
+	unsigned int flip_bits = 0;
+	int i;
+	int base = geo->ecc_chunk_size * chunk;
+
+	/* Count bitflips */
+	for (i = 0; i < geo->ecc_chunk_size; i++)
+		flip_bits += hweight8(~data[base + i]);
+
+	/* Correct bitflips by 0xFF'ing this chunk. */
+	if (flip_bits)
+		memset(&data[base], 0xFF, geo->ecc_chunk_size);
+
+	return flip_bits;
+}
+
 static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip,
 				uint8_t *buf, int oob_required, int page)
 {
@@ -1023,6 +1047,7 @@ static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip,
 	dma_addr_t    auxiliary_phys;
 	unsigned int  i;
 	unsigned char *status;
+	unsigned int  flips;
 	unsigned int  max_bitflips = 0;
 	int           ret;
 
@@ -1057,15 +1082,28 @@ static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip,
 	status = auxiliary_virt + nfc_geo->auxiliary_status_offset;
 
 	for (i = 0; i < nfc_geo->ecc_chunk_count; i++, status++) {
-		if ((*status == STATUS_GOOD) || (*status == STATUS_ERASED))
+		if (*status == STATUS_GOOD)
 			continue;
 
 		if (*status == STATUS_UNCORRECTABLE) {
 			mtd->ecc_stats.failed++;
 			continue;
 		}
-		mtd->ecc_stats.corrected += *status;
-		max_bitflips = max_t(unsigned int, max_bitflips, *status);
+
+		if (*status == STATUS_ERASED) {
+			if (gpmi_all_ones(this))
+				break;
+			else
+				/* Erased block with bitflips. */
+				flips = erased_sector_bitflips(payload_virt, i,
+							       nfc_geo);
+		} else {
+			/* BCH block corrected some errors for us. */
+			flips = *status;
+		}
+
+		mtd->ecc_stats.corrected += flips;
+		max_bitflips = max_t(unsigned int, max_bitflips, flips);
 	}
 
 	if (oob_required) {
diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-nand.h b/drivers/mtd/nand/gpmi-nand/gpmi-nand.h
index a7685e3..98db09e 100644
--- a/drivers/mtd/nand/gpmi-nand/gpmi-nand.h
+++ b/drivers/mtd/nand/gpmi-nand/gpmi-nand.h
@@ -268,6 +268,7 @@ extern void gpmi_clear_bch(struct gpmi_nand_data *);
 extern void gpmi_dump_info(struct gpmi_nand_data *);
 extern int bch_set_geometry(struct gpmi_nand_data *);
 extern int gpmi_is_ready(struct gpmi_nand_data *, unsigned chip);
+extern int gpmi_all_ones(struct gpmi_nand_data *);
 extern int gpmi_send_command(struct gpmi_nand_data *);
 extern void gpmi_begin(struct gpmi_nand_data *);
 extern void gpmi_end(struct gpmi_nand_data *);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* RE: [PATCH] mtd: gpmi: Deal with bitflips in erased regions regions
  2013-12-17  8:01 [PATCH] mtd: gpmi: Deal with bitflips in erased regions regions Elie De Brauwer
@ 2013-12-17 10:17 ` Gupta, Pekon
  2013-12-17 10:26   ` Huang Shijie
  2013-12-17 12:38   ` Elie De Brauwer
  0 siblings, 2 replies; 6+ messages in thread
From: Gupta, Pekon @ 2013-12-17 10:17 UTC (permalink / raw)
  To: Elie De Brauwer, b32955@freescale.com, dwmw2@infradead.org,
	dedekind1@gmail.com, computersforpeace@gmail.com,
	shijie8@gmail.com
  Cc: linux-mtd@lists.infradead.org

>From: Elie De Brauwer
>
>The BCH block typically used with a GPMI block on an i.MX28/i.MX6 is only
>able to correct bitflips on data actually streamed through the block.
>When erasing a block the data does not stream through the BCH block
>and therefore no ECC data is written to the NAND chip. This causes
>gpmi_ecc_read_page to return failure as soon as a single non-1-bit is
>found in an erased page. Typically causing problems at higher levels
>(ubifs corrupted empty space warnings). This problem was also observed
>when using SLC NAND devices.
>
>This patch configures the BCH block to mark a block as 'erased' if
>no more than gf_len/2 bitflips are found. Next HW_BCH_STATUS0:ALLONES
>is used to check if the data read were all ones, indicating a read of a
>properly erased chunk was performed. If this was not the case a slow path
>is entered where bitflips are counted and corrected in software,
>allowing the upper layers to take proper actions.
>
>Signed-off-by: Elie De Brauwer <eliedebrauwer@gmail.com>
>Acked-by: Peter Korsgaard <peter@korsgaard.com>
[...]

>--- a/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
>+++ b/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
>@@ -286,6 +286,13 @@ int bch_set_geometry(struct gpmi_nand_data *this)
> 			| BF_BCH_FLASH0LAYOUT1_DATAN_SIZE(block_size, this),
> 			r->bch_regs + HW_BCH_FLASH0LAYOUT1);
>
>+	/*
>+	 * Set the tolerance for bitflips when reading erased blocks
>+	 * equal to half the gf_len.
>+	 */
>+	writel((gf_len / 2) & BM_BCH_MODE_ERASE_THRESHOLD_MASK,
>+		r->bch_regs + HW_BCH_MODE);
>+
> 	/* Set *all* chip selects to use layout 0. */
> 	writel(0, r->bch_regs + HW_BCH_LAYOUTSELECT);
>
>@@ -1094,6 +1101,15 @@ int gpmi_is_ready(struct gpmi_nand_data *this, unsigned chip)
> 	return reg & mask;
> }
>
>+/* Returns 1 if the last transaction consisted only out of ones. */
>+int gpmi_all_ones(struct gpmi_nand_data *this)
>+{
>+	struct resources *r = &this->resources;
>+	uint32_t reg = readl(r->bch_regs + HW_BCH_STATUS0);
>+
>+	return reg & BM_BCH_STATUS0_ALLONES_MASK;
>+}
>+
>
Just query.. 
Is GPMI controller is able to detect number of 0s or 1s while reading
the raw data from NAND ?
I'm asking because a similar implementation was used in omap2.c
driver reference: drivers/mtd/nand/omap2.c: erased_sector_bitflips()
But, we ended up degrading the driver performance, so even I was
looking for alternative implementations..

 [...]

>
>+/*
>+ * Count the number of 0 bits in a supposed to be
>+ * erased region and correct them. Return the number
>+ * of bitflips or zero when the region was correct.
>+ */
>+static unsigned int erased_sector_bitflips(unsigned char *data,
>+					unsigned int chunk,
>+					struct bch_geometry *geo)
>+{
>+	unsigned int flip_bits = 0;
>+	int i;
>+	int base = geo->ecc_chunk_size * chunk;
>+
>+	/* Count bitflips */
>+	for (i = 0; i < geo->ecc_chunk_size; i++)
>+		flip_bits += hweight8(~data[base + i]);
>+
>+	/* Correct bitflips by 0xFF'ing this chunk. */
>+	if (flip_bits)
>+		memset(&data[base], 0xFF, geo->ecc_chunk_size);

But I don't quite understand here is.. 
Why do you want to mask off the bit-flips in NAND, and give corrected
data to upper layers ? Won't this lead to accumulation of bit-flips ?

Actually UBIFS checks for clean erased-pages before writing anything
on it [1]. And if UBIFS finds an unclean page, it re-erases it without
harming the system to avoid accumulation of bit-flips [2].
(Artem, please correct me if I'm wrong here)..

But with your above patch you will actually fool UBIFS to write on
unclean pages, which will increase the probability of bit-flips failures.
Example:  your erased page had 4-bit flips, but you still reported
data as clean. Now when UBIFS will write on this erased page, it
has the margin of tolerating only 4 more bit-flips.

Now, if you want to get rid of the debug messages and stack_dump
coming into your boot log because of this, then better suppress
debug message in following file
----------------------
diff --git a/drivers/mtd/ubi/io.c b/drivers/mtd/ubi/io.c
index bf79def..e589ff3 100644
--- a/drivers/mtd/ubi/io.c
+++ b/drivers/mtd/ubi/io.c
@@ -1430,7 +1430,9 @@ fail:
        print_hex_dump(KERN_DEBUG, "", DUMP_PREFIX_OFFSET, 32, 1, buf, len, 1);
        err = -EINVAL;
 error:
+#ifdef DEBUG
        dump_stack();
+#endif
        vfree(buf);
        return err;
 }
----------------------


[1] drivers/mtd/ubi/io.c
@@ ubi_io_write(...)
	err = ubi_self_check_all_ff( ... )

[2] drivers/mtd/ubi/wl.c
@@ wear_leveling_worker( ... )
       /UBI_IO_FF_BITFLIPS


with regards, pekon

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] mtd: gpmi: Deal with bitflips in erased regions regions
  2013-12-17 10:17 ` Gupta, Pekon
@ 2013-12-17 10:26   ` Huang Shijie
  2013-12-17 12:38   ` Elie De Brauwer
  1 sibling, 0 replies; 6+ messages in thread
From: Huang Shijie @ 2013-12-17 10:26 UTC (permalink / raw)
  To: Gupta, Pekon
  Cc: Elie De Brauwer, dedekind1@gmail.com, shijie8@gmail.com,
	linux-mtd@lists.infradead.org, computersforpeace@gmail.com,
	dwmw2@infradead.org

On Tue, Dec 17, 2013 at 10:17:38AM +0000, Gupta, Pekon wrote:
> >+/* Returns 1 if the last transaction consisted only out of ones. */
> >+int gpmi_all_ones(struct gpmi_nand_data *this)
> >+{
> >+	struct resources *r = &this->resources;
> >+	uint32_t reg = readl(r->bch_regs + HW_BCH_STATUS0);
> >+
> >+	return reg & BM_BCH_STATUS0_ALLONES_MASK;
> >+}
> >+
> >
> Just query.. 
> Is GPMI controller is able to detect number of 0s or 1s while reading
> the raw data from NAND ?

The gpmi can detect the all-one case, in other word, if the page is all
0xff, we can know this case.

Since the bit flips of erase page is very very rare, i think 
the performance is not effected by this patch.


> I'm asking because a similar implementation was used in omap2.c
> driver reference: drivers/mtd/nand/omap2.c: erased_sector_bitflips()
> But, we ended up degrading the driver performance, so even I was
> looking for alternative implementations..
> 
>  [...]
> 
> >
> >+/*
> >+ * Count the number of 0 bits in a supposed to be
> >+ * erased region and correct them. Return the number
> >+ * of bitflips or zero when the region was correct.
> >+ */
> >+static unsigned int erased_sector_bitflips(unsigned char *data,
> >+					unsigned int chunk,
> >+					struct bch_geometry *geo)
> >+{
> >+	unsigned int flip_bits = 0;
> >+	int i;
> >+	int base = geo->ecc_chunk_size * chunk;
> >+
> >+	/* Count bitflips */
> >+	for (i = 0; i < geo->ecc_chunk_size; i++)
> >+		flip_bits += hweight8(~data[base + i]);
> >+
> >+	/* Correct bitflips by 0xFF'ing this chunk. */
> >+	if (flip_bits)
> >+		memset(&data[base], 0xFF, geo->ecc_chunk_size);
of course, this function could be optimized more better.

> 
> But I don't quite understand here is.. 
> Why do you want to mask off the bit-flips in NAND, and give corrected
> data to upper layers ? Won't this lead to accumulation of bit-flips ?
> 
> Actually UBIFS checks for clean erased-pages before writing anything
> on it [1]. And if UBIFS finds an unclean page, it re-erases it without
> harming the system to avoid accumulation of bit-flips [2].
> (Artem, please correct me if I'm wrong here)..
> 
> But with your above patch you will actually fool UBIFS to write on
> unclean pages, which will increase the probability of bit-flips failures.
> Example:  your erased page had 4-bit flips, but you still reported
> data as clean. Now when UBIFS will write on this erased page, it
> has the margin of tolerating only 4 more bit-flips.
> 
if we use the 8 as the ECC strengh, we still can correct the bitflips,
am I right? I feel a little confused at this.

Btw: From Pekon's comment, i suddenly notice that we should fill the
     ERASE_THRESHOLD like this:

	int threshold = gf_len / 2;

	if (threshold > geo->ecc_strength)
		threshold = geo->ecc_strength;

	/*
	 * Set the tolerance for bitflips when reading erased blocks
	 * equal to half the gf_len.
	 */
	writel(threshold & BM_BCH_MODE_ERASE_THRESHOLD_MASK,
		r->bch_regs + HW_BCH_MODE);


thanks
Huang Shijie

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mtd: gpmi: Deal with bitflips in erased regions regions
  2013-12-17 10:17 ` Gupta, Pekon
  2013-12-17 10:26   ` Huang Shijie
@ 2013-12-17 12:38   ` Elie De Brauwer
  2013-12-17 13:21     ` Gupta, Pekon
  1 sibling, 1 reply; 6+ messages in thread
From: Elie De Brauwer @ 2013-12-17 12:38 UTC (permalink / raw)
  To: Gupta, Pekon
  Cc: dedekind1@gmail.com, shijie8@gmail.com, b32955@freescale.com,
	linux-mtd@lists.infradead.org, computersforpeace@gmail.com,
	dwmw2@infradead.org

On Tue, Dec 17, 2013 at 11:17 AM, Gupta, Pekon <pekon@ti.com> wrote:
>>From: Elie De Brauwer
>>
>>The BCH block typically used with a GPMI block on an i.MX28/i.MX6 is only
>>able to correct bitflips on data actually streamed through the block.
>>When erasing a block the data does not stream through the BCH block
>>and therefore no ECC data is written to the NAND chip. This causes
>>gpmi_ecc_read_page to return failure as soon as a single non-1-bit is
>>found in an erased page. Typically causing problems at higher levels
>>(ubifs corrupted empty space warnings). This problem was also observed
>>when using SLC NAND devices.
>>
>>This patch configures the BCH block to mark a block as 'erased' if
>>no more than gf_len/2 bitflips are found. Next HW_BCH_STATUS0:ALLONES
>>is used to check if the data read were all ones, indicating a read of a
>>properly erased chunk was performed. If this was not the case a slow path
>>is entered where bitflips are counted and corrected in software,
>>allowing the upper layers to take proper actions.
>>
>>Signed-off-by: Elie De Brauwer <eliedebrauwer@gmail.com>
>>Acked-by: Peter Korsgaard <peter@korsgaard.com>
> [...]
>
>>--- a/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
>>+++ b/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
>>@@ -286,6 +286,13 @@ int bch_set_geometry(struct gpmi_nand_data *this)
>>                       | BF_BCH_FLASH0LAYOUT1_DATAN_SIZE(block_size, this),
>>                       r->bch_regs + HW_BCH_FLASH0LAYOUT1);
>>
>>+      /*
>>+       * Set the tolerance for bitflips when reading erased blocks
>>+       * equal to half the gf_len.
>>+       */
>>+      writel((gf_len / 2) & BM_BCH_MODE_ERASE_THRESHOLD_MASK,
>>+              r->bch_regs + HW_BCH_MODE);
>>+
>>       /* Set *all* chip selects to use layout 0. */
>>       writel(0, r->bch_regs + HW_BCH_LAYOUTSELECT);
>>
>>@@ -1094,6 +1101,15 @@ int gpmi_is_ready(struct gpmi_nand_data *this, unsigned chip)
>>       return reg & mask;
>> }
>>
>>+/* Returns 1 if the last transaction consisted only out of ones. */
>>+int gpmi_all_ones(struct gpmi_nand_data *this)
>>+{
>>+      struct resources *r = &this->resources;
>>+      uint32_t reg = readl(r->bch_regs + HW_BCH_STATUS0);
>>+
>>+      return reg & BM_BCH_STATUS0_ALLONES_MASK;
>>+}
>>+
>>
> Just query..
> Is GPMI controller is able to detect number of 0s or 1s while reading
> the raw data from NAND ?
> I'm asking because a similar implementation was used in omap2.c
> driver reference: drivers/mtd/nand/omap2.c: erased_sector_bitflips()
> But, we ended up degrading the driver performance, so even I was
> looking for alternative implementations..
>
>  [...]

If you look at my original patch, the correction algorithm is more or
less stolen from omap2 driver. In my original version of this patch
there was indeed a performance penalty which occurred each time you
were reading an erased block. Fortunately Huang Shijie pointed my at
the 'allones' case where the BCH block is able to tell us whether or
net it read 'all ones' as in a properly erased block. We combine this
with the ERASE_THRESHOLD to create a fast path (erased block without
any bitflips) and a slow path (erased block with bitflips) and in this
slow path we correct this. Bringing the overhead of this patch in the
normal case to an additional register access which we only do when the
BCH block told us the block is likely an erased block.



>
>>
>>+/*
>>+ * Count the number of 0 bits in a supposed to be
>>+ * erased region and correct them. Return the number
>>+ * of bitflips or zero when the region was correct.
>>+ */
>>+static unsigned int erased_sector_bitflips(unsigned char *data,
>>+                                      unsigned int chunk,
>>+                                      struct bch_geometry *geo)
>>+{
>>+      unsigned int flip_bits = 0;
>>+      int i;
>>+      int base = geo->ecc_chunk_size * chunk;
>>+
>>+      /* Count bitflips */
>>+      for (i = 0; i < geo->ecc_chunk_size; i++)
>>+              flip_bits += hweight8(~data[base + i]);
>>+
>>+      /* Correct bitflips by 0xFF'ing this chunk. */
>>+      if (flip_bits)
>>+              memset(&data[base], 0xFF, geo->ecc_chunk_size);
>
> But I don't quite understand here is..
> Why do you want to mask off the bit-flips in NAND, and give corrected
> data to upper layers ? Won't this lead to accumulation of bit-flips ?

Actually the data is corrected the same way a properly functioning BCH
block would do if the FEC data would have been written. The trick is
that the in the function which calls this we return the number of
corrected bits (the same way as would happen on 'real' data with
proper ECC).  So we will not trick any upper layer into writing onto
possible damaged data. Whenever the upper layer reads an erased page
we will tell it 'we read what you requested but we corrected 'n'
bitflips' where the tolerance of the 'n' was an open point.  I'm no
expert on a decent value for this tolerance.

An earlier post ( ref:
http://article.gmane.org/gmane.linux.drivers.mtd/46266/match=corrupted+empty+space+again
) this corrupted empty space was tackled at ubifs level, returning
-EUCLEAN instead of -EBADMSG and the result of that was that the
bitflip was not corrected but migrated away to another PEB, hence in
this solution I wanted to tackle this at the mtd level.

I also tested this on a board with bitflips in empty space (see also
my initial posting) and my patch made the issue disappear.

>
> Actually UBIFS checks for clean erased-pages before writing anything
> on it [1]. And if UBIFS finds an unclean page, it re-erases it without
> harming the system to avoid accumulation of bit-flips [2].
> (Artem, please correct me if I'm wrong here)..
>
> But with your above patch you will actually fool UBIFS to write on
> unclean pages, which will increase the probability of bit-flips failures.
> Example:  your erased page had 4-bit flips, but you still reported
> data as clean. Now when UBIFS will write on this erased page, it
> has the margin of tolerating only 4 more bit-flips.
>
> Now, if you want to get rid of the debug messages and stack_dump
> coming into your boot log because of this, then better suppress
> debug message in following file
> ----------------------
> diff --git a/drivers/mtd/ubi/io.c b/drivers/mtd/ubi/io.c
> index bf79def..e589ff3 100644
> --- a/drivers/mtd/ubi/io.c
> +++ b/drivers/mtd/ubi/io.c
> @@ -1430,7 +1430,9 @@ fail:
>         print_hex_dump(KERN_DEBUG, "", DUMP_PREFIX_OFFSET, 32, 1, buf, len, 1);
>         err = -EINVAL;
>  error:
> +#ifdef DEBUG
>         dump_stack();
> +#endif
>         vfree(buf);
>         return err;
>  }
> ----------------------
>
>
> [1] drivers/mtd/ubi/io.c
> @@ ubi_io_write(...)
>         err = ubi_self_check_all_ff( ... )
>
> [2] drivers/mtd/ubi/wl.c
> @@ wear_leveling_worker( ... )
>        /UBI_IO_FF_BITFLIPS
>
>
> with regards, pekon



-- 
Elie De Brauwer

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [PATCH] mtd: gpmi: Deal with bitflips in erased regions regions
  2013-12-17 12:38   ` Elie De Brauwer
@ 2013-12-17 13:21     ` Gupta, Pekon
  2013-12-17 13:56       ` Elie De Brauwer
  0 siblings, 1 reply; 6+ messages in thread
From: Gupta, Pekon @ 2013-12-17 13:21 UTC (permalink / raw)
  To: Elie De Brauwer, shijie8@gmail.com
  Cc: b32955@freescale.com, computersforpeace@gmail.com,
	dwmw2@infradead.org, linux-mtd@lists.infradead.org,
	dedekind1@gmail.com

>From: Elie De Brauwer [mailto:eliedebrauwer@gmail.com]
>>On Tue, Dec 17, 2013 at 11:17 AM, Gupta, Pekon <pekon@ti.com> wrote:
>>>From: Elie De Brauwer
[...]

>> Just query..
>> Is GPMI controller is able to detect number of 0s or 1s while reading
>> the raw data from NAND ?
>> I'm asking because a similar implementation was used in omap2.c
>> driver reference: drivers/mtd/nand/omap2.c: erased_sector_bitflips()
>> But, we ended up degrading the driver performance, so even I was
>> looking for alternative implementations..
>>
>>  [...]
>
>If you look at my original patch, the correction algorithm is more or
>less stolen from omap2 driver. In my original version of this patch
>there was indeed a performance penalty which occurred each time you
>were reading an erased block. Fortunately Huang Shijie pointed my at
>the 'allones' case where the BCH block is able to tell us whether or
>net it read 'all ones' as in a properly erased block. We combine this
>with the ERASE_THRESHOLD to create a fast path (erased block without
>any bitflips) and a slow path (erased block with bitflips) and in this
>slow path we correct this. Bringing the overhead of this patch in the
>normal case to an additional register access which we only do when the
>BCH block told us the block is likely an erased block.
>
Yes, Shijie explained that..
GPMI (FSL controller) supports 'allones' in hardware. (good controller :-) )
Unfortunately GPMC (TI controller) does not have such mechanism,
hence omap2.c driver had to do all read-data comparisons in software,
leading to performance penalty, if bit-flips were encountered.


>>>
>>>+/*
>>>+ * Count the number of 0 bits in a supposed to be
>>>+ * erased region and correct them. Return the number
>>>+ * of bitflips or zero when the region was correct.
>>>+ */
>>>+static unsigned int erased_sector_bitflips(unsigned char *data,
>>>+                                      unsigned int chunk,
>>>+                                      struct bch_geometry *geo)
>>>+{
>>>+      unsigned int flip_bits = 0;
>>>+      int i;
>>>+      int base = geo->ecc_chunk_size * chunk;
>>>+
>>>+      /* Count bitflips */
>>>+      for (i = 0; i < geo->ecc_chunk_size; i++)
>>>+              flip_bits += hweight8(~data[base + i]);
>>>+
>>>+      /* Correct bitflips by 0xFF'ing this chunk. */
>>>+      if (flip_bits)
>>>+              memset(&data[base], 0xFF, geo->ecc_chunk_size);
>>
>> But I don't quite understand here is..
>> Why do you want to mask off the bit-flips in NAND, and give corrected
>> data to upper layers ? Won't this lead to accumulation of bit-flips ?
>
>Actually the data is corrected the same way a properly functioning BCH
>block would do if the FEC data would have been written. The trick is
>that the in the function which calls this we return the number of
>corrected bits (the same way as would happen on 'real' data with
>proper ECC).  So we will not trick any upper layer into writing onto
>possible damaged data. Whenever the upper layer reads an erased page
>we will tell it 'we read what you requested but we corrected 'n'
>bitflips' where the tolerance of the 'n' was an open point.  I'm no
>expert on a decent value for this tolerance.
>
>An earlier post ( ref:
>http://article.gmane.org/gmane.linux.drivers.mtd/46266/match=corrupted+empty+space+again
>) this corrupted empty space was tackled at ubifs level, returning
>-EUCLEAN instead of -EBADMSG and the result of that was that the
>bitflip was not corrected but migrated away to another PEB, hence in
>this solution I wanted to tackle this at the mtd level.
>
Understood.. Thanks for explaining..
I missed that you were incrementing "err_stats.corrected" on detecting
bit-flips in erased-pages. This caused nand_base.c to return -EUCLEAN.

But, another query is.. 
For erased-pages, does UBI differentiates between -EUCLEAN v/s -EBADMSG ?
I couldn't find such difference in mtd/ubi/*..

In UBI, I could trace that erased-pages are checked only on following paths:
(a) ubi_io_write_data()
	|- ubi_io_write()
		|- ubi_self_check_all_ff()

(b) ubi_io_write_ec_hdr()
	|- ubi_io_write()
		|- ubi_self_check_all_ff()

(c) ubi_io_write_vid_hdr()
	|- ubi_io_write()
		|- ubi_self_check_all_ff()

(d) ubi_wl_get_peb()
	|- ubi_self_check_all_ff()

And in each case if an erased-page page was returned with error
(whether -EUCLEAN or -EBADMSG) the PEB was re-added to
erase-queue for erasure.
So I'm not sure, what is causing UBI to scream on -EBADMSG and
not on -EUCLEAN (just for erased-pages). Do you have any pointers ?


with regards, pekon

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mtd: gpmi: Deal with bitflips in erased regions regions
  2013-12-17 13:21     ` Gupta, Pekon
@ 2013-12-17 13:56       ` Elie De Brauwer
  0 siblings, 0 replies; 6+ messages in thread
From: Elie De Brauwer @ 2013-12-17 13:56 UTC (permalink / raw)
  To: Gupta, Pekon
  Cc: dedekind1@gmail.com, dwmw2@infradead.org, b32955@freescale.com,
	linux-mtd@lists.infradead.org, computersforpeace@gmail.com,
	shijie8@gmail.com

On Tue, Dec 17, 2013 at 2:21 PM, Gupta, Pekon <pekon@ti.com> wrote:
>>From: Elie De Brauwer [mailto:eliedebrauwer@gmail.com]
>>>On Tue, Dec 17, 2013 at 11:17 AM, Gupta, Pekon <pekon@ti.com> wrote:
>>>>From: Elie De Brauwer
> [...]
>
>>> Just query..
>>> Is GPMI controller is able to detect number of 0s or 1s while reading
>>> the raw data from NAND ?
>>> I'm asking because a similar implementation was used in omap2.c
>>> driver reference: drivers/mtd/nand/omap2.c: erased_sector_bitflips()
>>> But, we ended up degrading the driver performance, so even I was
>>> looking for alternative implementations..
>>>
>>>  [...]
>>
>>If you look at my original patch, the correction algorithm is more or
>>less stolen from omap2 driver. In my original version of this patch
>>there was indeed a performance penalty which occurred each time you
>>were reading an erased block. Fortunately Huang Shijie pointed my at
>>the 'allones' case where the BCH block is able to tell us whether or
>>net it read 'all ones' as in a properly erased block. We combine this
>>with the ERASE_THRESHOLD to create a fast path (erased block without
>>any bitflips) and a slow path (erased block with bitflips) and in this
>>slow path we correct this. Bringing the overhead of this patch in the
>>normal case to an additional register access which we only do when the
>>BCH block told us the block is likely an erased block.
>>
> Yes, Shijie explained that..
> GPMI (FSL controller) supports 'allones' in hardware. (good controller :-) )
> Unfortunately GPMC (TI controller) does not have such mechanism,
> hence omap2.c driver had to do all read-data comparisons in software,
> leading to performance penalty, if bit-flips were encountered.
>
>
>>>>
>>>>+/*
>>>>+ * Count the number of 0 bits in a supposed to be
>>>>+ * erased region and correct them. Return the number
>>>>+ * of bitflips or zero when the region was correct.
>>>>+ */
>>>>+static unsigned int erased_sector_bitflips(unsigned char *data,
>>>>+                                      unsigned int chunk,
>>>>+                                      struct bch_geometry *geo)
>>>>+{
>>>>+      unsigned int flip_bits = 0;
>>>>+      int i;
>>>>+      int base = geo->ecc_chunk_size * chunk;
>>>>+
>>>>+      /* Count bitflips */
>>>>+      for (i = 0; i < geo->ecc_chunk_size; i++)
>>>>+              flip_bits += hweight8(~data[base + i]);
>>>>+
>>>>+      /* Correct bitflips by 0xFF'ing this chunk. */
>>>>+      if (flip_bits)
>>>>+              memset(&data[base], 0xFF, geo->ecc_chunk_size);
>>>
>>> But I don't quite understand here is..
>>> Why do you want to mask off the bit-flips in NAND, and give corrected
>>> data to upper layers ? Won't this lead to accumulation of bit-flips ?
>>
>>Actually the data is corrected the same way a properly functioning BCH
>>block would do if the FEC data would have been written. The trick is
>>that the in the function which calls this we return the number of
>>corrected bits (the same way as would happen on 'real' data with
>>proper ECC).  So we will not trick any upper layer into writing onto
>>possible damaged data. Whenever the upper layer reads an erased page
>>we will tell it 'we read what you requested but we corrected 'n'
>>bitflips' where the tolerance of the 'n' was an open point.  I'm no
>>expert on a decent value for this tolerance.
>>
>>An earlier post ( ref:
>>http://article.gmane.org/gmane.linux.drivers.mtd/46266/match=corrupted+empty+space+again
>>) this corrupted empty space was tackled at ubifs level, returning
>>-EUCLEAN instead of -EBADMSG and the result of that was that the
>>bitflip was not corrected but migrated away to another PEB, hence in
>>this solution I wanted to tackle this at the mtd level.
>>
> Understood.. Thanks for explaining..
> I missed that you were incrementing "err_stats.corrected" on detecting
> bit-flips in erased-pages. This caused nand_base.c to return -EUCLEAN.
>
> But, another query is..
> For erased-pages, does UBI differentiates between -EUCLEAN v/s -EBADMSG ?
> I couldn't find such difference in mtd/ubi/*..
>
> In UBI, I could trace that erased-pages are checked only on following paths:
> (a) ubi_io_write_data()
>         |- ubi_io_write()
>                 |- ubi_self_check_all_ff()
>
> (b) ubi_io_write_ec_hdr()
>         |- ubi_io_write()
>                 |- ubi_self_check_all_ff()
>
> (c) ubi_io_write_vid_hdr()
>         |- ubi_io_write()
>                 |- ubi_self_check_all_ff()
>
> (d) ubi_wl_get_peb()
>         |- ubi_self_check_all_ff()
>
> And in each case if an erased-page page was returned with error
> (whether -EUCLEAN or -EBADMSG) the PEB was re-added to
> erase-queue for erasure.
> So I'm not sure, what is causing UBI to scream on -EBADMSG and
> not on -EUCLEAN (just for erased-pages). Do you have any pointers ?
>

Well I traced what I made scream in my specific case. And the error
messages I got pointed to the right direction:

[    3.831323] UBI warning: ubi_io_read: error -74 (ECC error) while
reading 16384 bytes from PEB 443:245760, read only 16384 bytes, retry
[    3.845026] UBI warning: ubi_io_read: error -74 (ECC error) while
reading 16384 bytes from PEB 443:245760, read only 16384 bytes, retry
[    3.858710] UBI warning: ubi_io_read: error -74 (ECC error) while
reading 16384 bytes from PEB 443:245760, read only 16384 bytes, retry
[    3.872408] UBI error: ubi_io_read: error -74 (ECC error) while
reading 16384 bytes from PEB 443:245760, read 16384 bytes

And simply looking at ubi_io_read() shows that the code is
distinguishing between EUCLEAN and just deals with it. And EBADMSG
which is making it retry/dump stack and return EIO.

In a next stage I got UBIFS complaining about corrupted empty space
which looks something like:

[    4.011529] UBIFS error (pid 36): ubifs_recover_leb: corrupt empty
space LEB 27:237568, corruption starts at 9815
[    4.021897] UBIFS error (pid 36): ubifs_scanned_corruption:
corruption at LEB 27:247383

So  I actually followed the advice of earlier threads and dove deeper
(into the mtd layers) rather than upwards in the direction of
UBI/UBIFS.

On the other hand, just looking around points me to:
- wear_leveling_worker() which calls ubi_io_read_vid_hdr() (calling in
turn ubi_io_read(), see above ) and which may return a.o
UBI_IO_FF_BITFLIPS which the wear_levelling_worker then marks for
scrubbing.

my 2 cents
E.



-- 
Elie De Brauwer

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2013-12-17 13:56 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-17  8:01 [PATCH] mtd: gpmi: Deal with bitflips in erased regions regions Elie De Brauwer
2013-12-17 10:17 ` Gupta, Pekon
2013-12-17 10:26   ` Huang Shijie
2013-12-17 12:38   ` Elie De Brauwer
2013-12-17 13:21     ` Gupta, Pekon
2013-12-17 13:56       ` Elie De Brauwer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox