[PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes

linux-mtd.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes
@ 2014-01-04  2:48 Pekon Gupta
  2014-01-04  2:48 ` [PATCH v6 1/6] mtd: nand: omap: add field to indicate current ecc-scheme in 'struct omap_nand_info' Pekon Gupta
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: Pekon Gupta @ 2014-01-04  2:48 UTC (permalink / raw)
  To: Brian Norris
  Cc: Stefan Roese, linux-mtd, Pekon Gupta, balbi, Artem Bityutskiy

*changes v5 -> v6*
[PATCH 1/6] <no change>
[PATCH 2/6] introduced variable 'actual_eccbytes' to omit reserved byte-position
[PATCH 3/6] split from [PATCH v5 3/5] fix erased-page bit-flip correction
[PATCH 4/6] split from [PATCH v5 3/5] fix erased-page detection
[PATCH 5/6] <minor cleanup>
[PATCH 6/6] <no change>


*changes v4 -> v5*
This patch series was split version of earlier patch:
http://lists.infradead.org/pipermail/linux-mtd/2013-November/050241.html

chip->ecc.correct() is used for detecting and correcting bit-flips during read
operations. In OMAP NAND driver different ecc-schemes have different callbacks:
 - omap_correct_data()		for HAM1_HW ecc-schemes (Untouched)
 - nand_bch_correct_data()	for BCHx_HW_DETECTION_SW ecc-schemes (Untouched)
 - omap_elm_correct_data()	for BCHx_HW ecc-schemes

This patch-series fixes following issues in omap_elm_correct_data():
(1) Dependency on a specific reserved byte-position in OOB area
    to differentiates between erased-pages v/s programmed-pages.
    Problem: reserved byte-position cannot be accomodated in all ecc-schemes
    Problem: reserved byte-position can itself be subjected upto 8 bit-flips
             causing the 0xff to become 0x00, causing page to be
             mis-recognized as erased-page.

(2) Bit-flips in erased-pages are detected by comparing each byte of Data & OOB
    with 0xff in check_erased_page().
    Problem: This is causes performance penalty when erased-pages are checked.

(3) Current code is not scalable for future ECC schemes due to presence of 
    tweaks for BCH4_ECC and BCH8_ECC at multiple places.

(4) Currently, bit-flips are evaluated and fixed even when ELM reports them as
    un-correctable bit-flips, this should not happen as 'number-of-error' field
    in ELM_LOCATION_STATUS becomes invalid when un-correctable flag is set.

(5) Driver should return with error-code = '-EBADMSG' when
     uncorrectable bit-flip is detected
     bit-flip outside valid Data and OOB region is detected


Pekon Gupta (6):
  mtd: nand: omap: add field to indicate current ecc-scheme in 'struct
    omap_nand_info'
  mtd: nand: omap: ecc.correct: omap_elm_correct_data: rename ambiguous
    variable 'eccsize' and 'ecc_vector_size'
  mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page
    bit-flip correction for H/W ECC schemes
  mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page
    detection for BCHx_HW ECC schemes
  mtd: nand: omap: ecc.correct: omap_elm_correct_data: cleanup for
    future enhancements
  mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix
    programmed-page bit-flip correction logic

 drivers/mtd/nand/omap2.c | 251 +++++++++++++++++++++--------------------------
 1 file changed, 114 insertions(+), 137 deletions(-)

-- 
1.8.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v6 1/6] mtd: nand: omap: add field to indicate current ecc-scheme in 'struct omap_nand_info'
  2014-01-04  2:48 [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes Pekon Gupta
@ 2014-01-04  2:48 ` Pekon Gupta
  2014-01-04  2:48 ` [PATCH v6 2/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: rename ambiguous variable 'eccsize' and 'ecc_vector_size' Pekon Gupta
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Pekon Gupta @ 2014-01-04  2:48 UTC (permalink / raw)
  To: Brian Norris
  Cc: Stefan Roese, linux-mtd, Pekon Gupta, balbi, Artem Bityutskiy

Information of currently selected ECC scheme 'enum omap_ecc ecc_opt' should
available outside platform-data, so that single nand_chip->ecc callback can
support multiple ecc-scheme configurations.

Signed-off-by: Pekon Gupta <pekon@ti.com>
---
 drivers/mtd/nand/omap2.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/mtd/nand/omap2.c b/drivers/mtd/nand/omap2.c
index 768e7cf..8fb64d5 100644
--- a/drivers/mtd/nand/omap2.c
+++ b/drivers/mtd/nand/omap2.c
@@ -160,6 +160,7 @@ struct omap_nand_info {
 	int				gpmc_cs;
 	unsigned long			phys_base;
 	unsigned long			mem_size;
+	enum omap_ecc			ecc_opt;
 	struct completion		comp;
 	struct dma_chan			*dma;
 	int				gpmc_irq_fifo;
@@ -1657,6 +1658,7 @@ static int omap_nand_probe(struct platform_device *pdev)
 	info->gpmc_cs		= pdata->cs;
 	info->reg		= pdata->reg;
 	info->of_node		= pdata->of_node;
+	info->ecc_opt		= pdata->ecc_opt;
 	mtd			= &info->mtd;
 	mtd->priv		= &info->nand;
 	mtd->name		= dev_name(&pdev->dev);
@@ -1812,7 +1814,7 @@ static int omap_nand_probe(struct platform_device *pdev)
 	/* populate MTD interface based on ECC scheme */
 	nand_chip->ecc.layout	= &omap_oobinfo;
 	ecclayout		= &omap_oobinfo;
-	switch (pdata->ecc_opt) {
+	switch (info->ecc_opt) {
 	case OMAP_ECC_HAM1_CODE_HW:
 		pr_info("nand: using OMAP_ECC_HAM1_CODE_HW\n");
 		nand_chip->ecc.mode             = NAND_ECC_HW;
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 2/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: rename ambiguous variable 'eccsize' and 'ecc_vector_size'
  2014-01-04  2:48 [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes Pekon Gupta
  2014-01-04  2:48 ` [PATCH v6 1/6] mtd: nand: omap: add field to indicate current ecc-scheme in 'struct omap_nand_info' Pekon Gupta
@ 2014-01-04  2:48 ` Pekon Gupta
  2014-01-04  2:48 ` [PATCH v6 3/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page bit-flip correction for H/W ECC schemes Pekon Gupta
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Pekon Gupta @ 2014-01-04  2:48 UTC (permalink / raw)
  To: Brian Norris
  Cc: Stefan Roese, linux-mtd, Pekon Gupta, balbi, Artem Bityutskiy

renaming following variables as they cause confusion due to resemblence to
another similar field in 'struct nand_ecc_ctrl'  (nand_chip->ecc.size).
  ecc_vector_size = info->nand.ecc.bytes  ---> eccbytes = info->nand.ecc.bytes;
  eccsize = ecc_vector_size - 1;          ---> actual_eccbytes - 1;

Signed-off-by: Pekon Gupta <pekon@ti.com>
---
 drivers/mtd/nand/omap2.c | 39 +++++++++++++++++++++++----------------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/drivers/mtd/nand/omap2.c b/drivers/mtd/nand/omap2.c
index 8fb64d5..2c73389 100644
--- a/drivers/mtd/nand/omap2.c
+++ b/drivers/mtd/nand/omap2.c
@@ -1359,9 +1359,10 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 {
 	struct omap_nand_info *info = container_of(mtd, struct omap_nand_info,
 			mtd);
+	int eccbytes	= info->nand.ecc.bytes;
 	int eccsteps = info->nand.ecc.steps;
 	int i , j, stat = 0;
-	int eccsize, eccflag, ecc_vector_size;
+	int eccflag, actual_eccbytes;
 	struct elm_errorvec err_vec[ERROR_VECTOR_MAX];
 	u_char *ecc_vec = calc_ecc;
 	u_char *spare_ecc = read_ecc;
@@ -1369,6 +1370,20 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 	enum bch_ecc type;
 	bool is_error_reported = false;
 
+	switch (info->ecc_opt) {
+	case OMAP_ECC_BCH4_CODE_HW:
+		/* omit  7th ECC byte reserved for ROM code compatibility */
+		actual_eccbytes = eccbytes - 1;
+		break;
+	case OMAP_ECC_BCH8_CODE_HW:
+		/* omit 14th ECC byte reserved for ROM code compatibility */
+		actual_eccbytes = eccbytes - 1;
+		break;
+	default:
+		pr_err("invalid driver configuration\n");
+		return -EINVAL;
+	}
+
 	/* Initialize elm error vector to zero */
 	memset(err_vec, 0, sizeof(err_vec));
 
@@ -1380,14 +1395,6 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 		erased_ecc_vec = bch4_vector;
 	}
 
-	ecc_vector_size = info->nand.ecc.bytes;
-
-	/*
-	 * Remove extra byte padding for BCH8 RBL
-	 * compatibility and erased page handling
-	 */
-	eccsize = ecc_vector_size - 1;
-
 	for (i = 0; i < eccsteps ; i++) {
 		eccflag = 0;	/* initialize eccflag */
 
@@ -1395,8 +1402,7 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 		 * Check any error reported,
 		 * In case of error, non zero ecc reported.
 		 */
-
-		for (j = 0; (j < eccsize); j++) {
+		for (j = 0; j < actual_eccbytes; j++) {
 			if (calc_ecc[j] != 0) {
 				eccflag = 1; /* non zero ecc, error present */
 				break;
@@ -1421,7 +1427,7 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 			 * zeros are more than threshold erased page, either
 			 * case page reported as uncorrectable.
 			 */
-			if (hweight8(~read_ecc[eccsize]) >= threshold) {
+			if (hweight8(~read_ecc[actual_eccbytes]) >= threshold) {
 				/*
 				 * Update elm error vector as
 				 * data area is programmed
@@ -1433,7 +1439,8 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 				int bitflip_count;
 				u_char *buf = &data[info->nand.ecc.size * i];
 
-				if (memcmp(calc_ecc, erased_ecc_vec, eccsize)) {
+				if (memcmp(calc_ecc, erased_ecc_vec,
+							 actual_eccbytes)) {
 					bitflip_count = erased_sector_bitflips(
 							buf, read_ecc, info);
 
@@ -1446,8 +1453,8 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 		}
 
 		/* Update the ecc vector */
-		calc_ecc += ecc_vector_size;
-		read_ecc += ecc_vector_size;
+		calc_ecc += eccbytes;
+		read_ecc += eccbytes;
 	}
 
 	/* Check if any error reported */
@@ -1496,7 +1503,7 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 
 		/* Update page data with sector size */
 		data += info->nand.ecc.size;
-		spare_ecc += ecc_vector_size;
+		spare_ecc += eccbytes;
 	}
 
 	for (i = 0; i < eccsteps; i++)
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 3/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page bit-flip correction for H/W ECC schemes
  2014-01-04  2:48 [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes Pekon Gupta
  2014-01-04  2:48 ` [PATCH v6 1/6] mtd: nand: omap: add field to indicate current ecc-scheme in 'struct omap_nand_info' Pekon Gupta
  2014-01-04  2:48 ` [PATCH v6 2/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: rename ambiguous variable 'eccsize' and 'ecc_vector_size' Pekon Gupta
@ 2014-01-04  2:48 ` Pekon Gupta
  2014-01-14  4:05   ` Brian Norris
  2014-01-04  2:48 ` [PATCH v6 4/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page detection for BCHx_HW " Pekon Gupta
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: Pekon Gupta @ 2014-01-04  2:48 UTC (permalink / raw)
  To: Brian Norris
  Cc: Stefan Roese, linux-mtd, Pekon Gupta, balbi, Artem Bityutskiy

chip->ecc.correct() is used for detecting and correcting bit-flips during read
operations. In OMAP NAND driver different ecc-schemes have different callbacks:
 - omap_correct_data()		for HAM1_HW ecc-schemes (Untouched)
 - nand_bch_correct_data()	for BCHx_HW_DETECTION_SW ecc-schemes (Untouched)
 - omap_elm_correct_data()	for BCHx_HW ecc-schemes (updated)

This patch solves following problems in ECC correction for BCHx_HW ecc-schemes.
Problem: In current implementation, number of bit-flips in erased-pages is
         calculated comparing each byte of Data & OOB with 0xff in
         function check_erased_page().
         But, with growing density of NAND devices (especially for MLC NAND)
         where pagesize is of 4k or more bytes, occurence of bit-flips is
         very common. Thus comparing each byte of both main-area and OOB
         with 0xff decreases NAND performance due to CPU overhead.

Known attributes of an erased-page
 - Bit-flips in erased-page can't be corrected because there is no ECC
   present in OOB.
 - Irrespective of number of bit-flips an erased-page will only contain
   all(0xff), So its safe to return error-code -EUCLEAN with data_buf=0xff,
   instead of -EBADMSG.
 - And incrementing ecc_stat->corrected makes chip->read_page() return -EUCLEAN.

Solution: In NAND based file-systems like UBIFS, each page is checked for
         bit-flips before writing to it. If correctable bit-flips are found
         in erased-page then -EUCLEAN error-code is returned which causes
         UBIFS to re-erase the complete block. So, its unnecessary to get
         exact count of bit-flips in an erased-page.
         Hence this patch just compares calc_ecc[] with known
         ECC-syndromp-of-all(0xff), to confirm if erased-page has bit-flips.
         - if bit-flips are found, then
             read buffer is set to all(0xff) and
             correctable bit-flips = max(ecc-strength) is reported back to
             upper layer to take preventive action (like re-erasing block).
         - else
              read-data is returned normally.

Signed-off-by: Pekon Gupta <pekon@ti.com>
---
 drivers/mtd/nand/omap2.c | 87 ++++++++++++++++++------------------------------
 1 file changed, 32 insertions(+), 55 deletions(-)

diff --git a/drivers/mtd/nand/omap2.c b/drivers/mtd/nand/omap2.c
index 2c73389..5a6ee6b 100644
--- a/drivers/mtd/nand/omap2.c
+++ b/drivers/mtd/nand/omap2.c
@@ -1292,45 +1292,6 @@ static int omap3_calculate_ecc_bch(struct mtd_info *mtd, const u_char *dat,
 }
 
 /**
- * erased_sector_bitflips - count bit flips
- * @data:	data sector buffer
- * @oob:	oob buffer
- * @info:	omap_nand_info
- *
- * Check the bit flips in erased page falls below correctable level.
- * If falls below, report the page as erased with correctable bit
- * flip, else report as uncorrectable page.
- */
-static int erased_sector_bitflips(u_char *data, u_char *oob,
-		struct omap_nand_info *info)
-{
-	int flip_bits = 0, i;
-
-	for (i = 0; i < info->nand.ecc.size; i++) {
-		flip_bits += hweight8(~data[i]);
-		if (flip_bits > info->nand.ecc.strength)
-			return 0;
-	}
-
-	for (i = 0; i < info->nand.ecc.bytes - 1; i++) {
-		flip_bits += hweight8(~oob[i]);
-		if (flip_bits > info->nand.ecc.strength)
-			return 0;
-	}
-
-	/*
-	 * Bit flips falls in correctable level.
-	 * Fill data area with 0xFF
-	 */
-	if (flip_bits) {
-		memset(data, 0xFF, info->nand.ecc.size);
-		memset(oob, 0xFF, info->nand.ecc.bytes);
-	}
-
-	return flip_bits;
-}
-
-/**
  * omap_elm_correct_data - corrects page data area in case error reported
  * @mtd:	MTD device structure
  * @data:	page data
@@ -1359,14 +1320,16 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 {
 	struct omap_nand_info *info = container_of(mtd, struct omap_nand_info,
 			mtd);
+	enum omap_ecc ecc_opt	= info->ecc_opt;
 	int eccbytes	= info->nand.ecc.bytes;
-	int eccsteps = info->nand.ecc.steps;
+	int eccsize	= info->nand.ecc.size;
+	int eccstrength	= info->nand.ecc.strength;
+	int eccsteps	= info->nand.ecc.steps;
 	int i , j, stat = 0;
 	int eccflag, actual_eccbytes;
 	struct elm_errorvec err_vec[ERROR_VECTOR_MAX];
 	u_char *ecc_vec = calc_ecc;
 	u_char *spare_ecc = read_ecc;
-	u_char *erased_ecc_vec;
 	enum bch_ecc type;
 	bool is_error_reported = false;
 
@@ -1389,10 +1352,8 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 
 	if (info->nand.ecc.strength == BCH8_MAX_ERROR) {
 		type = BCH8_ECC;
-		erased_ecc_vec = bch8_vector;
 	} else {
 		type = BCH4_ECC;
-		erased_ecc_vec = bch4_vector;
 	}
 
 	for (i = 0; i < eccsteps ; i++) {
@@ -1436,19 +1397,35 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 				is_error_reported = true;
 			} else {
 				/* Error reported in erased page */
-				int bitflip_count;
-				u_char *buf = &data[info->nand.ecc.size * i];
-
-				if (memcmp(calc_ecc, erased_ecc_vec,
-							 actual_eccbytes)) {
-					bitflip_count = erased_sector_bitflips(
-							buf, read_ecc, info);
-
-					if (bitflip_count)
-						stat += bitflip_count;
-					else
-						return -EINVAL;
+				/* check bit-flips in erased-pages
+				 * - Instead of comparing each byte of main-area
+				 *   with 0xff, comparing just calc_ecc[] with
+				 *   known ECC-syndrome-of-all(0xff). This
+				 *   confirms that main-area + OOB are all(0xff)
+				 *   This will save CPU cycles.
+				 * - Bit-flips in erased-page can't be corrected
+				 *   because there is no ECC present in OOB
+				 * - Irrespective of number of bit-flips an
+				 *   erased-page will only contain all(0xff),
+				 *   So its safe to return error-code -EUCLEAN
+				 *   with data_buf=0xff, instead of -EBADMSG
+				 * - And incrementing ecc_stat->corrected makes
+				 *   chip->read_page() return -EUCLEAN */
+				switch (ecc_opt) {
+				case OMAP_ECC_BCH4_CODE_HW:
+					if (memcmp(calc_ecc, bch4_vector,
+							 actual_eccbytes))
+						stat += eccstrength;
+					break;
+				case OMAP_ECC_BCH8_CODE_HW:
+					if (memcmp(calc_ecc, bch8_vector,
+							 actual_eccbytes))
+						stat += eccstrength;
+					break;
+				default:
+					return -EINVAL;
 				}
+				memset(&data[i * eccsize], 0xff, eccsize);
 			}
 		}
 
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 4/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page detection for BCHx_HW ECC schemes
  2014-01-04  2:48 [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes Pekon Gupta
                   ` (2 preceding siblings ...)
  2014-01-04  2:48 ` [PATCH v6 3/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page bit-flip correction for H/W ECC schemes Pekon Gupta
@ 2014-01-04  2:48 ` Pekon Gupta
  2014-01-14  4:25   ` Brian Norris
  2014-01-04  2:48 ` [PATCH v6 5/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: cleanup for future enhancements Pekon Gupta
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: Pekon Gupta @ 2014-01-04  2:48 UTC (permalink / raw)
  To: Brian Norris
  Cc: Stefan Roese, linux-mtd, Pekon Gupta, balbi, Artem Bityutskiy

chip->ecc.correct() is used for detecting and correcting bit-flips during read
operations. In OMAP NAND driver different ecc-schemes have different callbacks:
 - omap_correct_data()		for HAM1_HW ecc-schemes (Untouched)
 - nand_bch_correct_data()	for BCHx_HW_DETECTION_SW ecc-schemes (Untouched)
 - omap_elm_correct_data()	for BCHx_HW ecc-schemes (updated)

This patch solves following problems in ECC correction for BCHx_HW ecc-schemes.
Problem: Current implementation depends on a specific byte-position (reserved
         as 0x00) in ecc-layout to differentiate between programmed-pages v/s
         erased-pages.
      1) All ecc-scheme layouts do not have such Reserved byte marker to
         differentiate between erased-page v/s programmed-page. Thus this is a
         customized solution.
      2) Reserved byte can itself be subjected to bit-flips causing erased-page
         to be misunderstood as programmed-page.

Solution: This patch removes dependency on single byte-position ini ecc-layout
         to differentiating between erased-page v/s programeed-page.
         This patch 'assumes' any page to be 'erased':
		(a) if        all(read_ecc)  == 0xff
		(b) else if   all(read_data) == 0xff

Reasons for (a)
      -  An abrupt termination of page programming (like power failure)
         may result in partial write, leaving page in corrupted state with
         un-stable bits. As OOB region is programmed after the data-region,
         so if read_ecc[] == 0xff, then a page should treadted as erased.

      -  Also, as ECC is not present, any bitflips in page cannot be detected.

Reasons for (b)
      - Due to architecture of NAND cell, bit-flips cannot change programmed
        value from '0' -> '1'. So if all(read_data) == 0xff then its confirmed
        that there is 'no bit-flips in data-region'. Hence, read_data[] == 0xff
        can be safely returned.

      - if page was programmed-page with 0xff and 'calc_ecc[] != 0x00' then
        it means that page has bit-flips in OOB-region.

      - if page was erased-page and 'read_ecc[] != ecc_of_all_0xff' then
        it  mean that there are bit-flips in OOB-region of page.

Signed-off-by: Pekon Gupta <pekon@ti.com>
---
 drivers/mtd/nand/omap2.c | 69 ++++++++++++++++++++++++------------------------
 1 file changed, 34 insertions(+), 35 deletions(-)

diff --git a/drivers/mtd/nand/omap2.c b/drivers/mtd/nand/omap2.c
index 5a6ee6b..589db4c 100644
--- a/drivers/mtd/nand/omap2.c
+++ b/drivers/mtd/nand/omap2.c
@@ -1296,24 +1296,10 @@ static int omap3_calculate_ecc_bch(struct mtd_info *mtd, const u_char *dat,
  * @mtd:	MTD device structure
  * @data:	page data
  * @read_ecc:	ecc read from nand flash
- * @calc_ecc:	ecc read from HW ECC registers
- *
- * Calculated ecc vector reported as zero in case of non-error pages.
- * In case of error/erased pages non-zero error vector is reported.
- * In case of non-zero ecc vector, check read_ecc at fixed offset
- * (x = 13/7 in case of BCH8/4 == 0) to find page programmed or not.
- * To handle bit flips in this data, count the number of 0's in
- * read_ecc[x] and check if it greater than 4. If it is less, it is
- * programmed page, else erased page.
- *
- * 1. If page is erased, check with standard ecc vector (ecc vector
- * for erased page to find any bit flip). If check fails, bit flip
- * is present in erased page. Count the bit flips in erased page and
- * if it falls under correctable level, report page with 0xFF and
- * update the correctable bit information.
- * 2. If error is reported on programmed page, update elm error
- * vector and correct the page with ELM error correction routine.
- *
+ * @calc_ecc:	ecc calculated after reading Data and OOB regions from flash
+ *		calc_ecc would be non-zero only in following cases:
+ *		- bit-flips in data or oob region
+ *		- erased page, where no ECC is written in OOB area
  */
 static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 				u_char *read_ecc, u_char *calc_ecc)
@@ -1325,6 +1311,8 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 	int eccsize	= info->nand.ecc.size;
 	int eccstrength	= info->nand.ecc.strength;
 	int eccsteps	= info->nand.ecc.steps;
+	bool page_is_erased;
+	u8 *buf;
 	int i , j, stat = 0;
 	int eccflag, actual_eccbytes;
 	struct elm_errorvec err_vec[ERROR_VECTOR_MAX];
@@ -1371,24 +1359,35 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 		}
 
 		if (eccflag == 1) {
-			/*
-			 * Set threshold to minimum of 4, half of ecc.strength/2
-			 * to allow max bit flip in byte to 4
-			 */
-			unsigned int threshold = min_t(unsigned int, 4,
-					info->nand.ecc.strength / 2);
+			/* (a) page can be 'assumed' erased if
+			 * all(read_ecc) == 0xff */
+			page_is_erased = true;
+			for (j = 0; j < (eccbytes - 1); j++) {
+				if (read_ecc[j] != 0xff) {
+					page_is_erased = false;
+					break;
+				}
+			}
 
-			/*
-			 * Check data area is programmed by counting
-			 * number of 0's at fixed offset in spare area.
-			 * Checking count of 0's against threshold.
-			 * In case programmed page expects at least threshold
-			 * zeros in byte.
-			 * If zeros are less than threshold for programmed page/
-			 * zeros are more than threshold erased page, either
-			 * case page reported as uncorrectable.
-			 */
-			if (hweight8(~read_ecc[actual_eccbytes]) >= threshold) {
+			/* (b) Due to architecture of NAND cell, bit-flip cannot
+			 * change cell-value from '0' -> '1'. So if page has
+			 * all(read_data) == 0xff, then its confirmed that
+			 * there are no bit-flips in its data-region. Hence,
+			 * read_data == 0xff can be safely returned. */
+			if (!page_is_erased) {
+				page_is_erased = true;
+				buf = &data[eccsize * i];
+				for (j = 0; j < (eccsize - 1); j++) {
+					if (buf[j] != 0xff) {
+						page_is_erased = false;
+						break;
+					}
+				}
+			}
+
+			/* erased-page needs to be handled separately, as ELM
+			 * engine cannot parse pages with all(ECC) == 0xff */
+			if (!page_is_erased) {
 				/*
 				 * Update elm error vector as
 				 * data area is programmed
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 5/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: cleanup for future enhancements
  2014-01-04  2:48 [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes Pekon Gupta
                   ` (3 preceding siblings ...)
  2014-01-04  2:48 ` [PATCH v6 4/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page detection for BCHx_HW " Pekon Gupta
@ 2014-01-04  2:48 ` Pekon Gupta
  2014-01-04  2:48 ` [PATCH v6 6/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix programmed-page bit-flip correction logic Pekon Gupta
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Pekon Gupta @ 2014-01-04  2:48 UTC (permalink / raw)
  To: Brian Norris
  Cc: Stefan Roese, linux-mtd, Pekon Gupta, balbi, Artem Bityutskiy

Current omap_elm_correct_data() code is not scalable for future ecc-schemes
due to presence of tweaks and hard-coded macros for BCH4_ECC and BCH8_ECC
ecc-schemes at multiple places.

This patch:
 - replaces 'ecc_opt' with '(info->nand.ecc.strength == BCH8_MAX_ERROR)
   used to differentiate between BCH8_HW and BCH4_SW
 - replaces macros (defining magic number for specific ecc-scheme) with
   generic variables
 - removes dependency on macros defined in elm.h (like BCHx_ECC_OOB_BYTES)

Signed-off-by: Pekon Gupta <pekon@ti.com>
---
 drivers/mtd/nand/omap2.c | 38 +++++++++++++-------------------------
 1 file changed, 13 insertions(+), 25 deletions(-)

diff --git a/drivers/mtd/nand/omap2.c b/drivers/mtd/nand/omap2.c
index 589db4c..4cb08f4 100644
--- a/drivers/mtd/nand/omap2.c
+++ b/drivers/mtd/nand/omap2.c
@@ -118,14 +118,9 @@
 
 #define OMAP24XX_DMA_GPMC		4
 
-#define BCH8_MAX_ERROR		8	/* upto 8 bit correctable */
-#define BCH4_MAX_ERROR		4	/* upto 4 bit correctable */
-
 #define SECTOR_BYTES		512
 /* 4 bit padding to make byte aligned, 56 = 52 + 4 */
 #define BCH4_BIT_PAD		4
-#define BCH8_ECC_MAX		((SECTOR_BYTES + BCH8_ECC_OOB_BYTES) * 8)
-#define BCH4_ECC_MAX		((SECTOR_BYTES + BCH4_ECC_OOB_BYTES) * 8)
 
 /* GPMC ecc engine settings for read */
 #define BCH_WRAPMODE_1		1	/* BCH wrap mode 1 */
@@ -1318,8 +1313,8 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 	struct elm_errorvec err_vec[ERROR_VECTOR_MAX];
 	u_char *ecc_vec = calc_ecc;
 	u_char *spare_ecc = read_ecc;
-	enum bch_ecc type;
 	bool is_error_reported = false;
+	u32 bit_pos, byte_pos, error_max, pos;
 
 	switch (info->ecc_opt) {
 	case OMAP_ECC_BCH4_CODE_HW:
@@ -1338,12 +1333,6 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 	/* Initialize elm error vector to zero */
 	memset(err_vec, 0, sizeof(err_vec));
 
-	if (info->nand.ecc.strength == BCH8_MAX_ERROR) {
-		type = BCH8_ECC;
-	} else {
-		type = BCH4_ECC;
-	}
-
 	for (i = 0; i < eccsteps ; i++) {
 		eccflag = 0;	/* initialize eccflag */
 
@@ -1443,20 +1432,19 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 	for (i = 0; i < eccsteps; i++) {
 		if (err_vec[i].error_reported) {
 			for (j = 0; j < err_vec[i].error_count; j++) {
-				u32 bit_pos, byte_pos, error_max, pos;
-
-				if (type == BCH8_ECC)
-					error_max = BCH8_ECC_MAX;
-				else
-					error_max = BCH4_ECC_MAX;
-
-				if (info->nand.ecc.strength == BCH8_MAX_ERROR)
-					pos = err_vec[i].error_loc[j];
-				else
-					/* Add 4 to take care 4 bit padding */
+				switch (ecc_opt) {
+				case OMAP_ECC_BCH4_CODE_HW:
+					/* Add 4 bits to take care of padding */
 					pos = err_vec[i].error_loc[j] +
 						BCH4_BIT_PAD;
-
+					break;
+				case OMAP_ECC_BCH8_CODE_HW:
+					pos = err_vec[i].error_loc[j];
+					break;
+				default:
+					return -EINVAL;
+				}
+				error_max = (eccsize + actual_eccbytes) * 8;
 				/* Calculate bit position of error */
 				bit_pos = pos % 8;
 
@@ -1478,7 +1466,7 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 		stat += err_vec[i].error_count;
 
 		/* Update page data with sector size */
-		data += info->nand.ecc.size;
+		data += eccsize;
 		spare_ecc += eccbytes;
 	}
 
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 6/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix programmed-page bit-flip correction logic
  2014-01-04  2:48 [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes Pekon Gupta
                   ` (4 preceding siblings ...)
  2014-01-04  2:48 ` [PATCH v6 5/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: cleanup for future enhancements Pekon Gupta
@ 2014-01-04  2:48 ` Pekon Gupta
  2014-01-06  7:42 ` [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes Stefan Roese
  2014-01-14  3:27 ` Brian Norris
  7 siblings, 0 replies; 14+ messages in thread
From: Pekon Gupta @ 2014-01-04  2:48 UTC (permalink / raw)
  To: Brian Norris
  Cc: Stefan Roese, linux-mtd, Pekon Gupta, balbi, Artem Bityutskiy

This patch updates following checks when bit-flips are detected by ELM:

 - Do not evaluate bit-flips when un-correctable bit-flips is reported by ELM,
   because as per [1] when ELM reports an un-correctable bit-flips,
   'number of error' field in its ELM_LOCATION_STATUS register is also invalid.

 - Return with error-code '-EBADMSG' on detection of un-correctable bit-flip.

 - Return with error-code '-EBADMSG' when bit-flips position is outside current
   Sector and OOB area.

[1] ELM IP spec Table-25 ELM_LOCATION_STATUS Register.
    ELM_LOCATION_STATUS[8] = ECC_CORRECTABLE: Error location process exit status
        0x0: ECC error location process failed.
             Number of errors and error locations are invalid.
        0x1: all errors were successfully located.
             Number of errors and error locations are valid.

Signed-off-by: Pekon Gupta <pekon@ti.com>
---
 drivers/mtd/nand/omap2.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/mtd/nand/omap2.c b/drivers/mtd/nand/omap2.c
index 4cb08f4..c690721 100644
--- a/drivers/mtd/nand/omap2.c
+++ b/drivers/mtd/nand/omap2.c
@@ -1309,6 +1309,7 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 	bool page_is_erased;
 	u8 *buf;
 	int i , j, stat = 0;
+	int err = 0;
 	int eccflag, actual_eccbytes;
 	struct elm_errorvec err_vec[ERROR_VECTOR_MAX];
 	u_char *ecc_vec = calc_ecc;
@@ -1430,7 +1431,10 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 	elm_decode_bch_error_page(info->elm_dev, ecc_vec, err_vec);
 
 	for (i = 0; i < eccsteps; i++) {
-		if (err_vec[i].error_reported) {
+		if (err_vec[i].error_uncorrectable) {
+			pr_err("nand: uncorrectable bit-flips found\n");
+			err = -EBADMSG;
+		} else if (err_vec[i].error_reported) {
 			for (j = 0; j < err_vec[i].error_count; j++) {
 				switch (ecc_opt) {
 				case OMAP_ECC_BCH4_CODE_HW:
@@ -1452,13 +1456,18 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 				byte_pos = (error_max - pos - 1) / 8;
 
 				if (pos < error_max) {
+					pr_debug("fixed bit-flip @ [%d][%d]\n",
+							 byte_pos, bit_pos);
 					if (byte_pos < 512)
 						data[byte_pos] ^= 1 << bit_pos;
 					else
 						spare_ecc[byte_pos - 512] ^=
 							1 << bit_pos;
+				} else {
+					pr_err("invalid bit-flip @ [%d][%d]\n",
+							 byte_pos, bit_pos);
+					err = -EBADMSG;
 				}
-				/* else, not interested to correct ecc */
 			}
 		}
 
@@ -1470,12 +1479,7 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
 		spare_ecc += eccbytes;
 	}
 
-	for (i = 0; i < eccsteps; i++)
-		/* Return error if uncorrectable error present */
-		if (err_vec[i].error_uncorrectable)
-			return -EINVAL;
-
-	return stat;
+	return (err) ? err : stat;
 }
 
 /**
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes
  2014-01-04  2:48 [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes Pekon Gupta
                   ` (5 preceding siblings ...)
  2014-01-04  2:48 ` [PATCH v6 6/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix programmed-page bit-flip correction logic Pekon Gupta
@ 2014-01-06  7:42 ` Stefan Roese
  2014-01-14  3:27 ` Brian Norris
  7 siblings, 0 replies; 14+ messages in thread
From: Stefan Roese @ 2014-01-06  7:42 UTC (permalink / raw)
  To: Pekon Gupta, Brian Norris; +Cc: linux-mtd, balbi, Artem Bityutskiy

On 04.01.2014 03:48, Pekon Gupta wrote:
> *changes v5 -> v6*
> [PATCH 1/6] <no change>
> [PATCH 2/6] introduced variable 'actual_eccbytes' to omit reserved byte-position
> [PATCH 3/6] split from [PATCH v5 3/5] fix erased-page bit-flip correction
> [PATCH 4/6] split from [PATCH v5 3/5] fix erased-page detection
> [PATCH 5/6] <minor cleanup>
> [PATCH 6/6] <no change>

Thanks, Pekon. This fixes some problems we have seen on an AM335x board
with ECC errors on UBI/UBIFS mounting. So for the whole patch series:

Tested-by: Stefan Roese <sr@denx.de>

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes
  2014-01-04  2:48 [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes Pekon Gupta
                   ` (6 preceding siblings ...)
  2014-01-06  7:42 ` [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes Stefan Roese
@ 2014-01-14  3:27 ` Brian Norris
  7 siblings, 0 replies; 14+ messages in thread
From: Brian Norris @ 2014-01-14  3:27 UTC (permalink / raw)
  To: Pekon Gupta; +Cc: Stefan Roese, linux-mtd, balbi, Artem Bityutskiy

Hi Pekon,

First of all, thanks a lot for the extra efforts to make your commits
easier to digest. I do have a few comments, now that I can understand
the pieces a little better.

On Sat, Jan 04, 2014 at 08:18:12AM +0530, Pekon Gupta wrote:
> This patch-series fixes following issues in omap_elm_correct_data():
> (1) Dependency on a specific reserved byte-position in OOB area
>     to differentiates between erased-pages v/s programmed-pages.
>     Problem: reserved byte-position cannot be accomodated in all ecc-schemes
>     Problem: reserved byte-position can itself be subjected upto 8 bit-flips
>              causing the 0xff to become 0x00, causing page to be
>              mis-recognized as erased-page.
> 
> (2) Bit-flips in erased-pages are detected by comparing each byte of Data & OOB
>     with 0xff in check_erased_page().
>     Problem: This is causes performance penalty when erased-pages are checked.

Is this performance penalty significant, though? Shouldn't we be
checking for an erased page only on uncorrectable errors? And aren't
those uncorrectable occasions rare? I ask because it seems like you're
trading some precision for performance, but I think the case you're
optimizing is an uncommon path anyway. But perhaps I'm wrong.

I'll comment with some specifics on the patch itslef.

> (3) Current code is not scalable for future ECC schemes due to presence of 
>     tweaks for BCH4_ECC and BCH8_ECC at multiple places.
> 
> (4) Currently, bit-flips are evaluated and fixed even when ELM reports them as
>     un-correctable bit-flips, this should not happen as 'number-of-error' field
>     in ELM_LOCATION_STATUS becomes invalid when un-correctable flag is set.
> 
> (5) Driver should return with error-code = '-EBADMSG' when
>      uncorrectable bit-flip is detected
>      bit-flip outside valid Data and OOB region is detected

Your other points look good.

Thanks,
Brian

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 3/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page bit-flip correction for H/W ECC schemes
  2014-01-04  2:48 ` [PATCH v6 3/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page bit-flip correction for H/W ECC schemes Pekon Gupta
@ 2014-01-14  4:05   ` Brian Norris
  2014-01-14 17:37     ` Brian Norris
  2014-01-15 22:03     ` Gupta, Pekon
  0 siblings, 2 replies; 14+ messages in thread
From: Brian Norris @ 2014-01-14  4:05 UTC (permalink / raw)
  To: Pekon Gupta; +Cc: Stefan Roese, linux-mtd, balbi, Artem Bityutskiy

Hi Pekon,

On Sat, Jan 04, 2014 at 08:18:15AM +0530, Pekon Gupta wrote:
> chip->ecc.correct() is used for detecting and correcting bit-flips during read
> operations. In OMAP NAND driver different ecc-schemes have different callbacks:
>  - omap_correct_data()		for HAM1_HW ecc-schemes (Untouched)
>  - nand_bch_correct_data()	for BCHx_HW_DETECTION_SW ecc-schemes (Untouched)
>  - omap_elm_correct_data()	for BCHx_HW ecc-schemes (updated)
> 
> This patch solves following problems in ECC correction for BCHx_HW ecc-schemes.
> Problem: In current implementation, number of bit-flips in erased-pages is
>          calculated comparing each byte of Data & OOB with 0xff in
>          function check_erased_page().
>          But, with growing density of NAND devices (especially for MLC NAND)
>          where pagesize is of 4k or more bytes, occurence of bit-flips is
>          very common. Thus comparing each byte of both main-area and OOB
>          with 0xff decreases NAND performance due to CPU overhead.

But the overhead from this check is only incurred if you are getting
uncorrectable errors, right? This shouldn't commonly occur on programmed
pages, so you're only focusing on high number of bitflips in erased
pages. But UBIFS reads erased pages only when trying to recover at
power-up, right? And I don't think there are really any common cases
where this should occur.

> Known attributes of an erased-page
>  - Bit-flips in erased-page can't be corrected because there is no ECC
>    present in OOB.
>  - Irrespective of number of bit-flips an erased-page will only contain
>    all(0xff), So its safe to return error-code -EUCLEAN with data_buf=0xff,
>    instead of -EBADMSG.

Are you saying that all bitflips in erased pages should yield -EUCLEAN?
I agree that they shouldn't return -EBADMSG (up to the strength
threshold), but I also think that we should still be able to report the
number of bitflips "corrected" in our erased page handling. That way,
pages with small numbers of bitflips can still be corrected.

Put another way: what if every page starts to experience at least one
bitflip? Do you want UBIFS to scrub the page every time? Rather, I think
you want to calculate the proper count so that MTD can mask the bitflips
if they are under the threshold. See my comment labeled [***] in the
patch context below.

>  - And incrementing ecc_stat->corrected makes chip->read_page() return -EUCLEAN.

It only returns -EUCLEAN if the 'corrected' exceeds the threshold. Per
my comment above, I think you want to retain this distinction if
possible.

> Solution: In NAND based file-systems like UBIFS, each page is checked for
>          bit-flips before writing to it.

No, UBIFS only checks pages like this when it thinks it doesn't have a
consistent state (Artem can correct me if I'm wrong). Particularly, it
does this at attach time, when it suspects power-cut issues. At other
times, UBIFS should be smart enough to know which pages have not been
written since erasure.

> If correctable bit-flips are found
>          in erased-page then -EUCLEAN error-code is returned which causes
>          UBIFS to re-erase the complete block. So, its unnecessary to get
>          exact count of bit-flips in an erased-page.

I dispute this point above.

>          Hence this patch just compares calc_ecc[] with known
>          ECC-syndromp-of-all(0xff), to confirm if erased-page has bit-flips.
>          - if bit-flips are found, then
>              read buffer is set to all(0xff) and
>              correctable bit-flips = max(ecc-strength) is reported back to
>              upper layer to take preventive action (like re-erasing block).
>          - else
>               read-data is returned normally.
> 
> Signed-off-by: Pekon Gupta <pekon@ti.com>
> ---
>  drivers/mtd/nand/omap2.c | 87 ++++++++++++++++++------------------------------
>  1 file changed, 32 insertions(+), 55 deletions(-)
> 
> diff --git a/drivers/mtd/nand/omap2.c b/drivers/mtd/nand/omap2.c
> index 2c73389..5a6ee6b 100644
> --- a/drivers/mtd/nand/omap2.c
> +++ b/drivers/mtd/nand/omap2.c
> @@ -1292,45 +1292,6 @@ static int omap3_calculate_ecc_bch(struct mtd_info *mtd, const u_char *dat,
>  }
>  
>  /**
> - * erased_sector_bitflips - count bit flips
> - * @data:	data sector buffer
> - * @oob:	oob buffer
> - * @info:	omap_nand_info
> - *
> - * Check the bit flips in erased page falls below correctable level.
> - * If falls below, report the page as erased with correctable bit
> - * flip, else report as uncorrectable page.
> - */
> -static int erased_sector_bitflips(u_char *data, u_char *oob,
> -		struct omap_nand_info *info)
> -{
> -	int flip_bits = 0, i;
> -
> -	for (i = 0; i < info->nand.ecc.size; i++) {
> -		flip_bits += hweight8(~data[i]);
> -		if (flip_bits > info->nand.ecc.strength)
> -			return 0;
> -	}
> -
> -	for (i = 0; i < info->nand.ecc.bytes - 1; i++) {
> -		flip_bits += hweight8(~oob[i]);
> -		if (flip_bits > info->nand.ecc.strength)
> -			return 0;
> -	}
> -
> -	/*
> -	 * Bit flips falls in correctable level.
> -	 * Fill data area with 0xFF
> -	 */
> -	if (flip_bits) {
> -		memset(data, 0xFF, info->nand.ecc.size);
> -		memset(oob, 0xFF, info->nand.ecc.bytes);
> -	}
> -
> -	return flip_bits;
> -}
> -
> -/**
>   * omap_elm_correct_data - corrects page data area in case error reported
>   * @mtd:	MTD device structure
>   * @data:	page data
> @@ -1359,14 +1320,16 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
>  {
>  	struct omap_nand_info *info = container_of(mtd, struct omap_nand_info,
>  			mtd);
> +	enum omap_ecc ecc_opt	= info->ecc_opt;
>  	int eccbytes	= info->nand.ecc.bytes;
> -	int eccsteps = info->nand.ecc.steps;
> +	int eccsize	= info->nand.ecc.size;
> +	int eccstrength	= info->nand.ecc.strength;
> +	int eccsteps	= info->nand.ecc.steps;

Rather than 4 variables which all reference info->nand.ecc, can't you do
this?

	struct nand_ecc_ctrl *ecc = &info->nand.ecc;

Then all the other references can still be pretty short. i.e.:

  eccbytes    ===>  ecc->bytes
  eccsize     ===>  ecc->size
  eccstrength ===>  ecc->strength
  eccsteps    ===>  ecc->steps

>  	int i , j, stat = 0;
>  	int eccflag, actual_eccbytes;
>  	struct elm_errorvec err_vec[ERROR_VECTOR_MAX];
>  	u_char *ecc_vec = calc_ecc;
>  	u_char *spare_ecc = read_ecc;
> -	u_char *erased_ecc_vec;
>  	enum bch_ecc type;
>  	bool is_error_reported = false;
>  
> @@ -1389,10 +1352,8 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
>  
>  	if (info->nand.ecc.strength == BCH8_MAX_ERROR) {
>  		type = BCH8_ECC;
> -		erased_ecc_vec = bch8_vector;
>  	} else {
>  		type = BCH4_ECC;
> -		erased_ecc_vec = bch4_vector;

Why are you dropping this vector assignment out of this if/else (which
you later convert to a switch/case)? After this patch series, you seem
to have effectively duplicated the same switch/case statement in two
places. I think the vector assignment should just stay here unless you
see some other good reason.

>  	}
>  
>  	for (i = 0; i < eccsteps ; i++) {
> @@ -1436,19 +1397,35 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
>  				is_error_reported = true;
>  			} else {
>  				/* Error reported in erased page */
> -				int bitflip_count;
> -				u_char *buf = &data[info->nand.ecc.size * i];
> -
> -				if (memcmp(calc_ecc, erased_ecc_vec,
> -							 actual_eccbytes)) {
> -					bitflip_count = erased_sector_bitflips(
> -							buf, read_ecc, info);
> -
> -					if (bitflip_count)
> -						stat += bitflip_count;
> -					else
> -						return -EINVAL;
> +				/* check bit-flips in erased-pages
> +				 * - Instead of comparing each byte of main-area
> +				 *   with 0xff, comparing just calc_ecc[] with
> +				 *   known ECC-syndrome-of-all(0xff). This
> +				 *   confirms that main-area + OOB are all(0xff)
> +				 *   This will save CPU cycles.
> +				 * - Bit-flips in erased-page can't be corrected
> +				 *   because there is no ECC present in OOB
> +				 * - Irrespective of number of bit-flips an
> +				 *   erased-page will only contain all(0xff),
> +				 *   So its safe to return error-code -EUCLEAN
> +				 *   with data_buf=0xff, instead of -EBADMSG
> +				 * - And incrementing ecc_stat->corrected makes
> +				 *   chip->read_page() return -EUCLEAN */

/*
 * Can you reformat your multiline comments to fit this style? (With
 * asterisks on their own lines.) You use this style a few times.
 */

> +				switch (ecc_opt) {
> +				case OMAP_ECC_BCH4_CODE_HW:
> +					if (memcmp(calc_ecc, bch4_vector,
> +							 actual_eccbytes))
> +						stat += eccstrength;
> +					break;
> +				case OMAP_ECC_BCH8_CODE_HW:
> +					if (memcmp(calc_ecc, bch8_vector,
> +							 actual_eccbytes))
> +						stat += eccstrength;
> +					break;
> +				default:
> +					return -EINVAL;
>  				}
> +				memset(&data[i * eccsize], 0xff, eccsize);
>  			}
>  		}
>  

Brian

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 4/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page detection for BCHx_HW ECC schemes
  2014-01-04  2:48 ` [PATCH v6 4/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page detection for BCHx_HW " Pekon Gupta
@ 2014-01-14  4:25   ` Brian Norris
  2014-01-15 23:17     ` Gupta, Pekon
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Norris @ 2014-01-14  4:25 UTC (permalink / raw)
  To: Pekon Gupta; +Cc: Stefan Roese, linux-mtd, balbi, Artem Bityutskiy

Hi Pekon,

On Sat, Jan 04, 2014 at 08:18:16AM +0530, Pekon Gupta wrote:
> chip->ecc.correct() is used for detecting and correcting bit-flips during read
> operations. In OMAP NAND driver different ecc-schemes have different callbacks:
>  - omap_correct_data()		for HAM1_HW ecc-schemes (Untouched)
>  - nand_bch_correct_data()	for BCHx_HW_DETECTION_SW ecc-schemes (Untouched)
>  - omap_elm_correct_data()	for BCHx_HW ecc-schemes (updated)
> 
> This patch solves following problems in ECC correction for BCHx_HW ecc-schemes.
> Problem: Current implementation depends on a specific byte-position (reserved
>          as 0x00) in ecc-layout to differentiate between programmed-pages v/s
>          erased-pages.
>       1) All ecc-scheme layouts do not have such Reserved byte marker to
>          differentiate between erased-page v/s programmed-page. Thus this is a
>          customized solution.
>       2) Reserved byte can itself be subjected to bit-flips causing erased-page
>          to be misunderstood as programmed-page.
> 
> Solution: This patch removes dependency on single byte-position ini ecc-layout
>          to differentiating between erased-page v/s programeed-page.
>          This patch 'assumes' any page to be 'erased':
> 		(a) if        all(read_ecc)  == 0xff
> 		(b) else if   all(read_data) == 0xff

Your assumptions break down if somebody intentionally programmed a page
with 0xff data. Then the ECC region will not be 0xff, but the data area
may be all 0xff. Isn't this a major problem? (You can try testing this
with nandwrite/nanddump using a page of 0xff.)

> 
> Reasons for (a)
>       -  An abrupt termination of page programming (like power failure)
>          may result in partial write, leaving page in corrupted state with
>          un-stable bits. As OOB region is programmed after the data-region,
>          so if read_ecc[] == 0xff, then a page should treadted as erased.
> 
>       -  Also, as ECC is not present, any bitflips in page cannot be detected.
> 
> Reasons for (b)
>       - Due to architecture of NAND cell, bit-flips cannot change programmed
>         value from '0' -> '1'. So if all(read_data) == 0xff then its confirmed
>         that there is 'no bit-flips in data-region'. Hence, read_data[] == 0xff
>         can be safely returned.
> 
>       - if page was programmed-page with 0xff and 'calc_ecc[] != 0x00' then
>         it means that page has bit-flips in OOB-region.
> 
>       - if page was erased-page and 'read_ecc[] != ecc_of_all_0xff' then
>         it  mean that there are bit-flips in OOB-region of page.

I think several assumptions and statements you are making are flawed,
and (unless I'm wrong) you probably need to rethink the approach in this
patch and the previous one.

Are you mainly trying to (1) improve performance and (2) remove reliance
on a single byte marker? I think the first point may be misguided and
not important in this case, but the second looks like you can solve it
well. What do you think of trying to tackle (2) without (1)? You would
be keeping much of the existing hweight()-related code for determining
bitflips across the whole data+OOB, but you could still have a smaller
ECC-only check to perform a faster first pass.

> Signed-off-by: Pekon Gupta <pekon@ti.com>
> ---
>  drivers/mtd/nand/omap2.c | 69 ++++++++++++++++++++++++------------------------
>  1 file changed, 34 insertions(+), 35 deletions(-)
> 
> diff --git a/drivers/mtd/nand/omap2.c b/drivers/mtd/nand/omap2.c
> index 5a6ee6b..589db4c 100644
> --- a/drivers/mtd/nand/omap2.c
> +++ b/drivers/mtd/nand/omap2.c
> @@ -1296,24 +1296,10 @@ static int omap3_calculate_ecc_bch(struct mtd_info *mtd, const u_char *dat,
>   * @mtd:	MTD device structure
>   * @data:	page data
>   * @read_ecc:	ecc read from nand flash
> - * @calc_ecc:	ecc read from HW ECC registers
> - *
> - * Calculated ecc vector reported as zero in case of non-error pages.
> - * In case of error/erased pages non-zero error vector is reported.
> - * In case of non-zero ecc vector, check read_ecc at fixed offset
> - * (x = 13/7 in case of BCH8/4 == 0) to find page programmed or not.
> - * To handle bit flips in this data, count the number of 0's in
> - * read_ecc[x] and check if it greater than 4. If it is less, it is
> - * programmed page, else erased page.
> - *
> - * 1. If page is erased, check with standard ecc vector (ecc vector
> - * for erased page to find any bit flip). If check fails, bit flip
> - * is present in erased page. Count the bit flips in erased page and
> - * if it falls under correctable level, report page with 0xFF and
> - * update the correctable bit information.
> - * 2. If error is reported on programmed page, update elm error
> - * vector and correct the page with ELM error correction routine.
> - *
> + * @calc_ecc:	ecc calculated after reading Data and OOB regions from flash
> + *		calc_ecc would be non-zero only in following cases:
> + *		- bit-flips in data or oob region
> + *		- erased page, where no ECC is written in OOB area
>   */
>  static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
>  				u_char *read_ecc, u_char *calc_ecc)
> @@ -1325,6 +1311,8 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
>  	int eccsize	= info->nand.ecc.size;
>  	int eccstrength	= info->nand.ecc.strength;
>  	int eccsteps	= info->nand.ecc.steps;
> +	bool page_is_erased;
> +	u8 *buf;
>  	int i , j, stat = 0;
>  	int eccflag, actual_eccbytes;
>  	struct elm_errorvec err_vec[ERROR_VECTOR_MAX];
> @@ -1371,24 +1359,35 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
>  		}
>  
>  		if (eccflag == 1) {
> -			/*
> -			 * Set threshold to minimum of 4, half of ecc.strength/2
> -			 * to allow max bit flip in byte to 4
> -			 */
> -			unsigned int threshold = min_t(unsigned int, 4,
> -					info->nand.ecc.strength / 2);
> +			/* (a) page can be 'assumed' erased if
> +			 * all(read_ecc) == 0xff */

Can you fix your comment style please? Note the multiline style used by
the comments you are deleting.

> +			page_is_erased = true;
> +			for (j = 0; j < (eccbytes - 1); j++) {
> +				if (read_ecc[j] != 0xff) {
> +					page_is_erased = false;
> +					break;
> +				}
> +			}
>  
> -			/*
> -			 * Check data area is programmed by counting
> -			 * number of 0's at fixed offset in spare area.
> -			 * Checking count of 0's against threshold.
> -			 * In case programmed page expects at least threshold
> -			 * zeros in byte.
> -			 * If zeros are less than threshold for programmed page/
> -			 * zeros are more than threshold erased page, either
> -			 * case page reported as uncorrectable.
> -			 */
> -			if (hweight8(~read_ecc[actual_eccbytes]) >= threshold) {
> +			/* (b) Due to architecture of NAND cell, bit-flip cannot
> +			 * change cell-value from '0' -> '1'. So if page has
> +			 * all(read_data) == 0xff, then its confirmed that
> +			 * there are no bit-flips in its data-region. Hence,
> +			 * read_data == 0xff can be safely returned. */

Ditto.

> +			if (!page_is_erased) {
> +				page_is_erased = true;
> +				buf = &data[eccsize * i];
> +				for (j = 0; j < (eccsize - 1); j++) {
> +					if (buf[j] != 0xff) {
> +						page_is_erased = false;
> +						break;
> +					}
> +				}
> +			}
> +
> +			/* erased-page needs to be handled separately, as ELM
> +			 * engine cannot parse pages with all(ECC) == 0xff */

Ditto.

> +			if (!page_is_erased) {
>  				/*
>  				 * Update elm error vector as
>  				 * data area is programmed

Brian

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 3/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page bit-flip correction for H/W ECC schemes
  2014-01-14  4:05   ` Brian Norris
@ 2014-01-14 17:37     ` Brian Norris
  2014-01-15 22:03     ` Gupta, Pekon
  1 sibling, 0 replies; 14+ messages in thread
From: Brian Norris @ 2014-01-14 17:37 UTC (permalink / raw)
  To: Pekon Gupta; +Cc: Stefan Roese, linux-mtd, balbi, Artem Bityutskiy

Hi Pekon,

On Mon, Jan 13, 2014 at 08:05:54PM -0800, Brian Norris wrote:
> On Sat, Jan 04, 2014 at 08:18:15AM +0530, Pekon Gupta wrote:
> >  - Irrespective of number of bit-flips an erased-page will only contain
> >    all(0xff), So its safe to return error-code -EUCLEAN with data_buf=0xff,
> >    instead of -EBADMSG.
> 
> Are you saying that all bitflips in erased pages should yield -EUCLEAN?
> I agree that they shouldn't return -EBADMSG (up to the strength
> threshold), but I also think that we should still be able to report the
> number of bitflips "corrected" in our erased page handling. That way,
> pages with small numbers of bitflips can still be corrected.
> 
> Put another way: what if every page starts to experience at least one
> bitflip? Do you want UBIFS to scrub the page every time? Rather, I think
> you want to calculate the proper count so that MTD can mask the bitflips
> if they are under the threshold. See my comment labeled [***] in the
> patch context below.

I totally forgot to put the [***] below when I got to it! See below (for
real this time!)

[...]
> > --- a/drivers/mtd/nand/omap2.c
> > +++ b/drivers/mtd/nand/omap2.c
> > @@ -1436,19 +1397,35 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
[...]
> > +				switch (ecc_opt) {
> > +				case OMAP_ECC_BCH4_CODE_HW:
> > +					if (memcmp(calc_ecc, bch4_vector,
> > +							 actual_eccbytes))
> > +						stat += eccstrength;

The [***] goes here!

[***] I don't think you can assume 'eccstrength' number of bitflips
here. This sidesteps the bitflip threshold accounting, and it means that
even a single bitflip will cause the entire block to be scrubbed, even
if it was just erased. The key point here is that eventually, we can't
assume that scrubbing will remove *all* bit errors in NAND flash. An
erased page is allowed to have a small number of bitflips or stuck bits.

> > +					break;
> > +				case OMAP_ECC_BCH8_CODE_HW:
> > +					if (memcmp(calc_ecc, bch8_vector,
> > +							 actual_eccbytes))
> > +						stat += eccstrength;

[***] Same here.

(BTW, this switch/case block can be reduced to a single case if you
select bch4_vector vs. bch8_vector in the switch/case from earlier in
this function.)

> > +					break;
> > +				default:
> > +					return -EINVAL;
> >  				}
> > +				memset(&data[i * eccsize], 0xff, eccsize);
> >  			}
> >  		}
> >  

Brian

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH v6 3/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page bit-flip correction for H/W ECC schemes
  2014-01-14  4:05   ` Brian Norris
  2014-01-14 17:37     ` Brian Norris
@ 2014-01-15 22:03     ` Gupta, Pekon
  1 sibling, 0 replies; 14+ messages in thread
From: Gupta, Pekon @ 2014-01-15 22:03 UTC (permalink / raw)
  To: Brian Norris; +Cc: Stefan Roese, linux-mtd, Balbi, Felipe, Artem Bityutskiy

Hi Brian,

Thanks for your feedbacks. Your review helps a lot.

>From: Brian Norris [mailto:computersforpeace@gmail.com]
>>On Sat, Jan 04, 2014 at 08:18:15AM +0530, Pekon Gupta wrote:
>> chip->ecc.correct() is used for detecting and correcting bit-flips during read
>> operations. In OMAP NAND driver different ecc-schemes have different callbacks:
>>  - omap_correct_data()		for HAM1_HW ecc-schemes (Untouched)
>>  - nand_bch_correct_data()	for BCHx_HW_DETECTION_SW ecc-schemes (Untouched)
>>  - omap_elm_correct_data()	for BCHx_HW ecc-schemes (updated)
>>
>> This patch solves following problems in ECC correction for BCHx_HW ecc-schemes.
>> Problem: In current implementation, number of bit-flips in erased-pages is
>>          calculated comparing each byte of Data & OOB with 0xff in
>>          function check_erased_page().
>>          But, with growing density of NAND devices (especially for MLC NAND)
>>          where pagesize is of 4k or more bytes, occurence of bit-flips is
>>          very common. Thus comparing each byte of both main-area and OOB
>>          with 0xff decreases NAND performance due to CPU overhead.
>
>But the overhead from this check is only incurred if you are getting
>uncorrectable errors, right? This shouldn't commonly occur on programmed
>pages, so you're only focusing on high number of bitflips in erased
>pages. But UBIFS reads erased pages only when trying to recover at
>power-up, right? And I don't think there are really any common cases
>where this should occur.
>
Yes, I misinterpreted a function call in UBI which was related only for
debug purpose.  UBI does 0xff checks mostly while
(1) scanning the device,
(2) erasing the block, and
(3) scrubbing
And that too on small chunk of bytes usually sizeof(ec-header) or sideof(vid-header).
But, in latest technology NAND flash (usually MLC), we are seeing lot of
bit-flips on erased-pages as well. And if NAND driver starts comparing each
byte of erased-page with 0xff, then it's not optimal solution (even if
it's executed only once during mount).
I'll take your advice, but still do you have some better idea than comparing all(0xff) ?


>> Known attributes of an erased-page
>>  - Bit-flips in erased-page can't be corrected because there is no ECC
>>    present in OOB.
>>  - Irrespective of number of bit-flips an erased-page will only contain
>>    all(0xff), So its safe to return error-code -EUCLEAN with data_buf=0xff,
>>    instead of -EBADMSG.
>
>Are you saying that all bitflips in erased pages should yield -EUCLEAN?
>I agree that they shouldn't return -EBADMSG (up to the strength
>threshold), but I also think that we should still be able to report the
>number of bitflips "corrected" in our erased page handling. That way,
>pages with small numbers of bitflips can still be corrected.
>
>Put another way: what if every page starts to experience at least one
>bitflip? Do you want UBIFS to scrub the page every time? Rather, I think
>you want to calculate the proper count so that MTD can mask the bitflips
>if they are under the threshold. See my comment labeled [***] in the
>patch context below.
>
I agree.. I'll update the patch return correct count of bit-flips on erased-page.


>>  - And incrementing ecc_stat->corrected makes chip->read_page() return -EUCLEAN.
>
>It only returns -EUCLEAN if the 'corrected' exceeds the threshold. Per
>my comment above, I think you want to retain this distinction if
>possible.
>
>> Solution: In NAND based file-systems like UBIFS, each page is checked for
>>          bit-flips before writing to it.
>
>No, UBIFS only checks pages like this when it thinks it doesn't have a
>consistent state (Artem can correct me if I'm wrong). Particularly, it
>does this at attach time, when it suspects power-cut issues. At other
>times, UBIFS should be smart enough to know which pages have not been
>written since erasure.
>
Yes, correct.
I have suggested another way of passing information between MTD/NAND
and UBI layers via error-codes. Your comments would be helpful.
http://lists.infradead.org/pipermail/linux-mtd/2014-January/051556.html


>> If correctable bit-flips are found
>>          in erased-page then -EUCLEAN error-code is returned which causes
>>          UBIFS to re-erase the complete block. So, its unnecessary to get
>>          exact count of bit-flips in an erased-page.
>
>I dispute this point above.
>
Accepted, I'll update the patch (as mentioned above).


>>          Hence this patch just compares calc_ecc[] with known
>>          ECC-syndromp-of-all(0xff), to confirm if erased-page has bit-flips.
>>          - if bit-flips are found, then
>>              read buffer is set to all(0xff) and
>>              correctable bit-flips = max(ecc-strength) is reported back to
>>              upper layer to take preventive action (like re-erasing block).
>>          - else
>>               read-data is returned normally.
>>
>> Signed-off-by: Pekon Gupta <pekon@ti.com>
>> ---

[...]

>> @@ -1359,14 +1320,16 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
>>  {
>>  	struct omap_nand_info *info = container_of(mtd, struct omap_nand_info,
>>  			mtd);
>> +	enum omap_ecc ecc_opt	= info->ecc_opt;
>>  	int eccbytes	= info->nand.ecc.bytes;
>> -	int eccsteps = info->nand.ecc.steps;
>> +	int eccsize	= info->nand.ecc.size;
>> +	int eccstrength	= info->nand.ecc.strength;
>> +	int eccsteps	= info->nand.ecc.steps;
>
>Rather than 4 variables which all reference info->nand.ecc, can't you do
>this?
>
>	struct nand_ecc_ctrl *ecc = &info->nand.ecc;
>
>Then all the other references can still be pretty short. i.e.:
>
>  eccbytes    ===>  ecc->bytes
>  eccsize     ===>  ecc->size
>  eccstrength ===>  ecc->strength
>  eccsteps    ===>  ecc->steps
>
Ok.. I thought this would be optimized by compiler ?
I din 't want to add any new variable on function's stack just for
initialization purpose, so kept it like this.


>>  	int i , j, stat = 0;
>>  	int eccflag, actual_eccbytes;
>>  	struct elm_errorvec err_vec[ERROR_VECTOR_MAX];
>>  	u_char *ecc_vec = calc_ecc;
>>  	u_char *spare_ecc = read_ecc;
>> -	u_char *erased_ecc_vec;
>>  	enum bch_ecc type;
>>  	bool is_error_reported = false;
>>
>> @@ -1389,10 +1352,8 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
>>
>>  	if (info->nand.ecc.strength == BCH8_MAX_ERROR) {
>>  		type = BCH8_ECC;
>> -		erased_ecc_vec = bch8_vector;
>>  	} else {
>>  		type = BCH4_ECC;
>> -		erased_ecc_vec = bch4_vector;
>
>Why are you dropping this vector assignment out of this if/else (which
>you later convert to a switch/case)? After this patch series, you seem
>to have effectively duplicated the same switch/case statement in two
>places. I think the vector assignment should just stay here unless you
>see some other good reason.
>
Thanks for pointing this out.
Yes, this could simplify the code further, but I'll make this vector assignment
based on ecc-opt (ecc-scheme), not on ecc.strength.


>> +				/* check bit-flips in erased-pages
>> +				 * - Instead of comparing each byte of main-area
>> +				 *   with 0xff, comparing just calc_ecc[] with
>> +				 *   known ECC-syndrome-of-all(0xff). This
>> +				 *   confirms that main-area + OOB are all(0xff)
>> +				 *   This will save CPU cycles.
>> +				 * - Bit-flips in erased-page can't be corrected
>> +				 *   because there is no ECC present in OOB
>> +				 * - Irrespective of number of bit-flips an
>> +				 *   erased-page will only contain all(0xff),
>> +				 *   So its safe to return error-code -EUCLEAN
>> +				 *   with data_buf=0xff, instead of -EBADMSG
>> +				 * - And incrementing ecc_stat->corrected makes
>> +				 *   chip->read_page() return -EUCLEAN */
>
>/*
> * Can you reformat your multiline comments to fit this style? (With
> * asterisks on their own lines.) You use this style a few times.
> */
>
Ok ..


with regards, pekon

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH v6 4/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page detection for BCHx_HW ECC schemes
  2014-01-14  4:25   ` Brian Norris
@ 2014-01-15 23:17     ` Gupta, Pekon
  0 siblings, 0 replies; 14+ messages in thread
From: Gupta, Pekon @ 2014-01-15 23:17 UTC (permalink / raw)
  To: Brian Norris; +Cc: Stefan Roese, linux-mtd, Balbi, Felipe, Artem Bityutskiy

HI Brian,


>From: Brian Norris [mailto:computersforpeace@gmail.com]
[...]
>>
>> This patch solves following problems in ECC correction for BCHx_HW ecc-schemes.
>> Problem: Current implementation depends on a specific byte-position (reserved
>>          as 0x00) in ecc-layout to differentiate between programmed-pages v/s
>>          erased-pages.
>>       1) All ecc-scheme layouts do not have such Reserved byte marker to
>>          differentiate between erased-page v/s programmed-page. Thus this is a
>>          customized solution.
>>       2) Reserved byte can itself be subjected to bit-flips causing erased-page
>>          to be misunderstood as programmed-page.
>>
>> Solution: This patch removes dependency on single byte-position ini ecc-layout
>>          to differentiating between erased-page v/s programeed-page.
>>          This patch 'assumes' any page to be 'erased':
>> 		(a) if        all(read_ecc)  == 0xff
>> 		(b) else if   all(read_data) == 0xff
>
>Your assumptions break down if somebody intentionally programmed a page
>with 0xff data. Then the ECC region will not be 0xff, but the data area
>may be all 0xff. Isn't this a major problem? (You can try testing this
>with nandwrite/nanddump using a page of 0xff.)
>
I thought of this much, but I found that UBI doesn't really differentiate between
 (a) page intentionally programmed with 0xff data.
 (b) page left blank.
This is why we have 'white-space-fixup' feature in UBIFS, which re-erased the
page which it thinks are blank. For UBI, any blocks which has 'vid-header'
is assumed to contain some user-data.
Example: When UBI image is flashed using 'nand write' command from u-boot.
u-boot writes 0xff as data along with its OOB into empty pages.
But when UBIFS scans all the blocks it considers the blocks without 'vid-header'
as erased, in spite the fact that the pages is programmed with 0xff data.
(However, Artem can confirm this more)..


>>
>> Reasons for (a)
>>       -  An abrupt termination of page programming (like power failure)
>>          may result in partial write, leaving page in corrupted state with
>>          un-stable bits. As OOB region is programmed after the data-region,
>>          so if read_ecc[] == 0xff, then a page should treadted as erased.
>>
>>       -  Also, as ECC is not present, any bitflips in page cannot be detected.
>>
>> Reasons for (b)
>>       - Due to architecture of NAND cell, bit-flips cannot change programmed
>>         value from '0' -> '1'. So if all(read_data) == 0xff then its confirmed
>>         that there is 'no bit-flips in data-region'. Hence, read_data[] == 0xff
>>         can be safely returned.
>>
I think this is a valid assumption, because converting all the programmed bits
back to '1' (s) requires erase operation, which internally is done at slightly
higher voltage (as per my understanding of flash cell & floating-gate architecture).


>>       - if page was programmed-page with 0xff and 'calc_ecc[] != 0x00' then
>>         it means that page has bit-flips in OOB-region.
>>
Sorry, yes may be I wrote this incorrectly .. (please ignore this)


>>       - if page was erased-page and 'read_ecc[] != ecc_of_all_0xff' then
>>         it  mean that there are bit-flips in OOB-region of page.
Sorry there is a typo here.. it should be calc_ecc[] != ecc_of_all_0xff.
	If (OOB == all(0xff) /* page is erased */
		If (calc_ecc[] != ecc_of_all_0xff)  /* erased-page has bit-flips */

So, I think above assumption is also correct ??
(And this has been already followed in existing OMAP driver, refer bch8_vector)


>
>I think several assumptions and statements you are making are flawed,
>and (unless I'm wrong) you probably need to rethink the approach in this
>patch and the previous one.
>
>Are you mainly trying to (1) improve performance and (2) remove reliance
>on a single byte marker? I think the first point may be misguided and
>not important in this case, but the second looks like you can solve it
>well. What do you think of trying to tackle (2) without (1)? You would
>be keeping much of the existing hweight()-related code for determining
>bitflips across the whole data+OOB, but you could still have a smaller
>ECC-only check to perform a faster first pass.
>
Yes, I want to solve (2) "remove reliance on a single byte marker".
That's on priority. 
But recently I have come across some MLC and new technology NAND
devices, which have very high bit-flip rates. 
(a) Almost every page has bit-flips. And some pages have stuck bits too.
   But all this is within correctable range so performance will certainly hit.
(b) Even the erased-pages are having lot of bit-flips in OOB area, so
  it's difficult to differentiate between erased-page v/s programmed-page.
(c) And due to the bit-flips in erased-pages the 'faster' ECC-only check will
   mostly fail, and so we fall back to comparing each byte with 0xFF. Thus
   performance falls further.

I'm unable to find a proper solution for avoiding (2) and still able to meet (b).
So, in this patch I just did two checks
- compare all OOB with 0xff to see if the page is programmed.
- compare all data with 0xff to see if data is present.
If either of the two returns as true, then I consider the page as erased.
because UBIFS does not distinguish between erased-page and
programmed-page having all(0xff) data.


[...]

>> +			/* (a) page can be 'assumed' erased if
>> +			 * all(read_ecc) == 0xff */
>
>Can you fix your comment style please? Note the multiline style used by
>the comments you are deleting.
>
Ok.. But these are wasting 2 lines :-)


with regards, pekon

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2014-01-15 23:17 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-01-04  2:48 [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes Pekon Gupta
2014-01-04  2:48 ` [PATCH v6 1/6] mtd: nand: omap: add field to indicate current ecc-scheme in 'struct omap_nand_info' Pekon Gupta
2014-01-04  2:48 ` [PATCH v6 2/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: rename ambiguous variable 'eccsize' and 'ecc_vector_size' Pekon Gupta
2014-01-04  2:48 ` [PATCH v6 3/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page bit-flip correction for H/W ECC schemes Pekon Gupta
2014-01-14  4:05   ` Brian Norris
2014-01-14 17:37     ` Brian Norris
2014-01-15 22:03     ` Gupta, Pekon
2014-01-04  2:48 ` [PATCH v6 4/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page detection for BCHx_HW " Pekon Gupta
2014-01-14  4:25   ` Brian Norris
2014-01-15 23:17     ` Gupta, Pekon
2014-01-04  2:48 ` [PATCH v6 5/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: cleanup for future enhancements Pekon Gupta
2014-01-04  2:48 ` [PATCH v6 6/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix programmed-page bit-flip correction logic Pekon Gupta
2014-01-06  7:42 ` [PATCH v6 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes Stefan Roese
2014-01-14  3:27 ` Brian Norris

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).