OneNAND: Rate of write errors

public inbox for linux-mtd@lists.infradead.org

* OneNAND: Rate of write errors
@ 2007-02-22  0:21 Julianne C.
  2007-02-22  9:28 ` Adrian Hunter
  0 siblings, 1 reply; 5+ messages in thread
From: Julianne C. @ 2007-02-22  0:21 UTC (permalink / raw)
  To: linux-mtd

We are still struggling to understand and manage the OneNAND part on a
LogicPD PXA270 board.  We are using the mtd development snapshot build
of 2-15-07 for the fs and device layers.  Our requirements lead us to
use JFFS2 as the file system.

When we write to a file system that has been freshly erased and mounted
using the command:
>mount -t jffs2 /dev/mtdblockx /mnt
and then run an operation such as tar or rsync to place files in the new
fs, we see about 5 to 8 "write errors" per MB, of the form:

onenand_write: verify failed -74
Write of 2663 bytes at 0x007a6e14 failed. returned -74, retlen 0
Not marking the space at 0x007a6e14 as dirty because the flash driver
returned retlen zero

In further testing, we have replaced the memcmp function in
onenand_verify with a procedure that manually goes through the list,
and issues a printk statement for each bad byte it detects.  Here is a
sample of the bad bytes we see:

Cmp failed [1596]  eb  00
Cmp failed [1594]  e6  9f
Cmp failed [1954]  7b  4d
Cmp failed [1654]  ae  00
Cmp failed [1972]  82  00
Cmp failed [462]  d3  00
Cmp failed [972]  a7  26
Cmp failed [1242]  d8  8d
Cmp failed [54]  6e  a0
Cmp failed [824]  3a  56
Cmp failed [1360]  78  67
Cmp failed [1584]  82  00
Cmp failed [1376]  00  5a
Cmp failed [64]  3f  00
Cmp failed [444]  90  e5
Cmp failed [310]  94  2d
Cmp failed [1764]  7a  04
Cmp failed [1030]  f8  14
Cmp failed [68]  1e  72
Cmp failed [1910]  de  01
Cmp failed [780]  37  00
Cmp failed [1536]  76  00
Cmp failed [1064]  2c  00
Cmp failed [644]  58  00
Cmp failed [1428]  25  00
Cmp failed [440]  89  00
Cmp failed [1852]  6d  00

where the first byte is the expected buffer value, while the second is
what is actually seen, and the value in the brackets is the index in
the 2048 byte array being tested.
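
As an illustration of the kind of byte-by-byte check described above, here
is a minimal userspace sketch. It is not the code that was actually dropped
into onenand_verify: the function name report_mismatches is hypothetical,
and printf stands in for printk so the sketch can be compiled outside the
kernel.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Compare two buffers byte by byte and print one line per mismatch in the
 * same layout as the log above: index, expected value, value read back.
 * Returns the number of mismatching bytes (0 means the buffers match).
 * Hypothetical helper: printf stands in for the kernel's printk.
 */
static size_t report_mismatches(const unsigned char *expected,
                                const unsigned char *actual, size_t len)
{
    size_t i, bad = 0;

    for (i = 0; i < len; i++) {
        if (expected[i] != actual[i]) {
            printf("Cmp failed [%zu]  %02x  %02x\n",
                   i, expected[i], actual[i]);
            bad++;
        }
    }
    return bad;
}
```

Unlike memcmp, which only says that the buffers differ, the count and the
per-byte lines show how many bytes are wrong and where they sit in the page.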

These values were accumulated over about 4 MB of writes to the fs.

Is it common to see this many errors for that number of page writes?
If not, are there adjustments that can be made to the device setup to
help reduce these errors?


* Re: OneNAND: Rate of write errors
  2007-02-22  0:21 OneNAND: Rate of write errors Julianne C.
@ 2007-02-22  9:28 ` Adrian Hunter
  0 siblings, 0 replies; 5+ messages in thread
From: Adrian Hunter @ 2007-02-22  9:28 UTC (permalink / raw)
  To: linux-mtd

Julianne C. wrote:
> We are still struggling to understand and manage the OneNAND part on a
> LogicPD PXA270 board.  We are using the mtd development snapshot build
> of 2-15-07 for the fs and device layers.  Our requirements lead us to
> use JFFS2 as the file system.
> 
> When we write to a file system that has been freshly erased and mounted
> using the command:
>> mount -t jffs2 /dev/mtdblockx /mnt
> and then run an operation such as tar or rsync to place files in the new
> fs, we see about 5 to 8 "write errors" per MB, of the form:
> 
> onenand_write: verify failed -74
> Write of 2663 bytes at 0x007a6e14 failed. returned -74, retlen 0
> Not marking the space at 0x007a6e14 as dirty because the flash driver
> returned retlen zero

Note that verify errors will not occur if you have ECC turned on because
you will get ECC errors instead, in which case I would say the block
is bad.  Possibly you have inadvertently removed a bad block marker.

The other possibility is that the data is not making it to OneNAND
correctly in the first place.  By default this is done by
onenand_write_bufferram.  You could add a comparison to be sure.
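
A rough sketch of such a comparison, written as a userspace simulation
rather than driver code: the array fake_dataram, the function
write_bufferram_checked and the fault_index test hook are all hypothetical
names, and a plain array stands in for the device's buffer RAM. On real
hardware the read-back would have to go through the driver's own bufferram
access routines.

```c
#include <stddef.h>
#include <string.h>

#define DATARAM_BYTES 2048

/* Stand-in for the OneNAND DataRAM; on real hardware this is device SRAM. */
static unsigned char fake_dataram[DATARAM_BYTES];

/* Test hook: when >= 0, the byte at this index is damaged after the copy. */
static int fault_index = -1;

/*
 * Copy data into the simulated buffer RAM, then read it back and compare
 * it with the source. Returns 0 when the copy arrived intact and -1 when
 * the read-back differs or the request does not fit in the buffer.
 */
static int write_bufferram_checked(const unsigned char *src, size_t offset,
                                   size_t count)
{
    if (offset > DATARAM_BYTES || count > DATARAM_BYTES - offset)
        return -1;

    memcpy(fake_dataram + offset, src, count);

    /* Simulated transfer fault, so the failure path can be exercised. */
    if (fault_index >= 0 && (size_t)fault_index < count)
        fake_dataram[offset + fault_index] ^= 0xff;

    return memcmp(fake_dataram + offset, src, count) == 0 ? 0 : -1;
}
```

If a check of this kind fails on the board, the data was already wrong
before any program command was issued to the NAND array.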

Also I guess jffs2 would be happier if the length was returned
when the verify fails, say like this:

diff --git a/drivers/mtd/onenand/onenand_base.c b/drivers/mtd/onenand/onenand_base.c
index 1a38414..8fc1570 100644
--- a/drivers/mtd/onenand/onenand_base.c
+++ b/drivers/mtd/onenand/onenand_base.c
@@ -1238,6 +1238,8 @@ static int onenand_write(struct mtd_info
 			break;
 		}
 
+		written += thislen;
+
 		/* Only check verify write turn on */
 		ret = onenand_verify(mtd, (u_char *) wbuf, to, thislen);
 		if (ret) {
@@ -1245,8 +1247,6 @@ static int onenand_write(struct mtd_info
 			break;
 		}
 
-		written += thislen;
-
 		if (written == len)
 			break;
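
To illustrate what the reordering changes, here is a simplified model of
the page loop, not the driver code itself. It assumes the driver reports
its running "written" count as retlen; the function model_write, its
parameters and the EBADMSG_ERR constant are hypothetical names used only
for this sketch.

```c
#include <stddef.h>

#define EBADMSG_ERR 74  /* matches the -74 seen in the log */

/*
 * Simplified model of a page-by-page write loop with verify.
 * Every page is programmed successfully; verify fails on page index
 * fail_page (pass a value >= pages for no failure).
 * count_before_verify = 1 models the patched order, 0 the original order.
 * Stores the byte count reported back in *retlen and returns 0 or
 * -EBADMSG_ERR.
 */
static int model_write(size_t pages, size_t page_len, size_t fail_page,
                       int count_before_verify, size_t *retlen)
{
    size_t written = 0, i;
    int ret = 0;

    for (i = 0; i < pages; i++) {
        if (count_before_verify)
            written += page_len;   /* patched: count once programmed */

        if (i == fail_page) {      /* simulated verify failure */
            ret = -EBADMSG_ERR;
            break;
        }

        if (!count_before_verify)
            written += page_len;   /* original: count only after verify */
    }

    *retlen = written;
    return ret;
}
```

In this model a single-page write whose verify fails returns retlen 0 with
the original order and retlen equal to the page length with the patched
order, so the caller learns that the space has been written to.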
 



> Is it common to see this many errors for that number of page writes?

Not in my experience with OneNAND.


* OneNAND: Rate of write errors
@ 2007-02-22 16:35 Julianne C.
  2007-02-23  8:04 ` Adrian Hunter
  2007-02-26  0:41 ` Kyungmin Park
  0 siblings, 2 replies; 5+ messages in thread
From: Julianne C. @ 2007-02-22 16:35 UTC (permalink / raw)
  To: linux-mtd

Further thought about the numerous write errors to the OneNAND part
got me thinking about the symptoms, i.e., when we see the -EBADMSG
error return, there is no corresponding fault reported in the ECC
status register.  Consequently, we concluded that the bufferram may be
getting corrupted before the data is ever committed to the NAND array.

Hence, we rewrote the code for the setup as follows in the
onenand_write procedure:

        do
        {
            this->write_bufferram (mtd,
                                   ONENAND_DATARAM,
                                   wbuf,
                                   0,
                                   mtd->writesize);

            ret = onenand_do_check_bufferram (mtd,
                                              ONENAND_DATARAM,
                                              wbuf,
                                              0,
                                              mtd->writesize);

            if (ret != 0) // then
            {
                retrys = retrys + 1;

                printk (KERN_WARNING
                        "onenandwrite: bad buffer ram, retrying (%d)\n",
                        retrys);
            } // end if
        } while (ret != 0 &&
                 retrys < max_retrys);

        if (retrys >= max_retrys) // then
        {
            ret = -EBADMSG;

            break;
        } // end if

We set max_retrys to three (we have seen cases that needed two attempts),
and with that the write succeeds every time.  There are no more errors
reported back to the JFFS2 system, and the file system cleanly mounts and
unmounts.

This does verify the suspicion that the buffer was corrupted before it
was committed.  Does anyone have any idea how or why the data in the
bufferram might be corrupted?

Julianne C.


* Re: OneNAND: Rate of write errors
  2007-02-22 16:35 Julianne C.
@ 2007-02-23  8:04 ` Adrian Hunter
  2007-02-26  0:41 ` Kyungmin Park
  1 sibling, 0 replies; 5+ messages in thread
From: Adrian Hunter @ 2007-02-23  8:04 UTC (permalink / raw)
  To: linux-mtd

Julianne C. wrote:
> This does verify the suspicion that the buffer was corrupted before it
> was committed.  Does anyone have any idea how or why the data in the
> bufferram might be corrupted?

This is outside my experience but as far as I know there is little that
OneNAND can control aside from latency and size of synchronous burst
reads/writes.  The rest is controlled by other hardware,
e.g. a memory controller of some sort.


* RE: OneNAND: Rate of write errors
  2007-02-22 16:35 Julianne C.
  2007-02-23  8:04 ` Adrian Hunter
@ 2007-02-26  0:41 ` Kyungmin Park
  1 sibling, 0 replies; 5+ messages in thread
From: Kyungmin Park @ 2007-02-26  0:41 UTC (permalink / raw)
  To: 'Julianne C.', 'linux-mtd'

Hi, 

> 
> Further thought about the numerous write errors to the 
> OneNAND part got me thinking about the symptoms, i.e., when 
> we see the -EBADMSG error return, there is no corresponding 
> fault reported in the ECC status register.  Consequently, we 
> concluded that the bufferram may be getting corrupted before 
> the data is ever committed to the NAND array.
> 
> Hence, we rewrote the code for the setup as follows in the 
> onenand_write procedure:
> 
>         do
>         {
>             this->write_bufferram (mtd,
>                                    ONENAND_DATARAM,
>                                    wbuf,
>                                    0,
>                                    mtd->writesize);

write_bufferram just copies data from the host to the internal bufferram (SRAM).

> 
>             ret = onenand_do_check_bufferram (mtd,
>                                               ONENAND_DATARAM,
>                                               wbuf,
>                                               0,
>                                               mtd->writesize);

So I think the check is just acting as a delay function.

> 
>             if (ret != 0) // then
>             {
>                 retrys = retrys + 1;
> 
>                 printk (KERN_WARNING
>                         "onenandwrite: bad buffer ram, retrying (%d)\n",
>                         retrys);
>             } // end if
>         } while (ret != 0 &&
>                  retrys < max_retrys);
> 
>         if (retrys >= max_retrys) // then
>         {
>             ret = -EBADMSG;
> 
>             break;
>         } // end if
> 
> We set max_retrys to three (we have seen cases that needed two attempts),
> and with that the write succeeds every time.  There are no more errors
> reported back to the JFFS2 system, and the file system cleanly mounts and
> unmounts.
> 
> This does verify the suspicion that the buffer was corrupted 
> before it was committed.  Does anyone have any idea how or 
> why the data in the bufferram might be corrupted?
> 

Then we can assume one of two cases.

Case 1: the internal bufferram is corrupted for some reason, such as
memory timings or a hardware failure.
In my experience this can happen if a OneNAND pin is connected wrongly,
but it's rare.
As you know, the internal bufferram is SRAM, which means its contents do
not change while power is connected.

Case 2: verify failure because the write time is too short.
If the write time is too short, the write fails without reporting an error.
The write verify function works like this:
1. write data
2. read written data to another buffer ram
3. compare the two buffers
This means that even though the write is reported as passed, the
verification can still fail.

So I would recommend that you
first check whether the internal bufferram changes, then check the written data.
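
The order of checks recommended above can be sketched as a small userspace
simulation that tells the two cases apart. This is only an illustration
under stated assumptions: struct fake_onenand, write_and_classify and the
two corrupt_* test hooks are hypothetical, and plain arrays stand in for
the buffer RAM and the NAND array.

```c
#include <string.h>

#define PAGE_BYTES 2048

enum check_result { CHECK_OK, CHECK_BUFFERRAM_BAD, CHECK_ARRAY_BAD };

/* Simulated device: buffer RAM plus one page of the NAND array. */
struct fake_onenand {
    unsigned char bufferram[PAGE_BYTES];
    unsigned char array_page[PAGE_BYTES];
    int corrupt_bufferram;  /* test hook: damage data on the way in */
    int corrupt_program;    /* test hook: damage data while programming */
};

static enum check_result write_and_classify(struct fake_onenand *dev,
                                            const unsigned char *src)
{
    /* Step 1: host -> buffer RAM, then check the buffer RAM itself. */
    memcpy(dev->bufferram, src, PAGE_BYTES);
    if (dev->corrupt_bufferram)
        dev->bufferram[0] ^= 0xff;
    if (memcmp(dev->bufferram, src, PAGE_BYTES) != 0)
        return CHECK_BUFFERRAM_BAD;   /* corresponds to Case 1 */

    /* Step 2: program the page, then read it back and compare with src. */
    memcpy(dev->array_page, dev->bufferram, PAGE_BYTES);
    if (dev->corrupt_program)
        dev->array_page[0] ^= 0xff;
    if (memcmp(dev->array_page, src, PAGE_BYTES) != 0)
        return CHECK_ARRAY_BAD;       /* corresponds to Case 2 */

    return CHECK_OK;
}
```

Checking the buffer RAM before the program step means a later verify
failure can be attributed to the write itself rather than to the transfer
from the host.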

Thank you,
Kyungmin Park

