* Fw: corrupt my NAND flash device
@ 2003-04-22 20:03 Alex Samoutin
2003-04-22 20:26 ` Jörn Engel
2003-04-23 20:45 ` Charles Manning
0 siblings, 2 replies; 31+ messages in thread
From: Alex Samoutin @ 2003-04-22 20:03 UTC (permalink / raw)
To: paul.wong; +Cc: linux-mtd
>Next step, I erased the file and copied another 5 MB file to it. The
>device said " no enough spare.
Yes. Not all of the free space is available immediately after erasing,
because the garbage collector has not completed its work yet.
>Then I used "mkyaffs" to format it, and it showed many bad blocks on the
>device. I checked OOB[5] (the bad block flag) and it is set to 0x00. Why?
>Why does YAFFS mark healthy blocks as bad after a file is erased? Does
>YAFFS not support big files? Any ideas?
That is a different problem. When you "erase" a file, it is not really erased.
The file is only marked for erasing, and the GC, working in the background,
performs the real erase.
So if you erase a file and immediately start writing a new one, you have two
processes trying to access the NAND chip at the same time (the writer and the
GC). The NAND driver has a lock mechanism to prevent problems.
Before each physical access to the chip you must grab the lock (using
nand_get_chip()).
The code looks excellent, but it doesn't work! (At least not in my 2.4.21-pre2
kernel.)
I don't know why. I had the same problem as you. When the lock doesn't work
you can get unpredictable results, including absolutely wrong data in the OOB.
I quickly fixed this problem using a mutex:
I placed down() before each nand_get_chip() call and up() after each
spin_unlock_bh().
It is not very elegant and probably not very good for kernel efficiency,
but it works.
Alexander
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-22 20:03 Fw: corrupt my NAND flash device Alex Samoutin
@ 2003-04-22 20:26 ` Jörn Engel
2003-04-22 20:59 ` Jörn Engel
2003-04-23 20:45 ` Charles Manning
1 sibling, 1 reply; 31+ messages in thread
From: Jörn Engel @ 2003-04-22 20:26 UTC (permalink / raw)
To: Alex Samoutin; +Cc: linux-mtd, paul.wong
Disclaimer: I know nothing about yaffs.
On Tue, 22 April 2003 13:03:43 -0700, Alex Samoutin wrote:
>
> >Next step, I erased the file and copied another 5 MB file to it. The
> >device said " no enough spare.
>
> Yes. You have no all free space immediately after erasing, because garbage
> collector
> didn't complete his work yet.
Doesn't make much sense, see below.
> >Then I use the "mkyaffs" to format it,
> >it is shown many bad block in the device. I checked the OOB[5] ( bad
> >block flag) it is set to 0x00. Why? Why the yaffs set the health
> >block to the bad block after erase file? Is the YAFFS not support big
> >file ? any ideal?
>
> It's different problem. When you "erase" file it not really erased.
> This file just marked for erasing and then GC working in background provede
> real erasing.
> So, if you erase file and start immediately write new file - you have two
> processes which try
> access the NAND chip at the same time (Writing and GC). NAND driver has lock
> mechanism to prevent problem.
> Before each physical access to chip you mast grab the lock (using
> nand_get_chip()).
Agreed. But this problem should be handled in the kernel. In other words,
the writing process has to trigger GC and continue writing if GC
managed to free more space. It would make sense to tell GC how much
space is needed, so it won't work over the whole device, locking the
NAND for a long time.
Anything else is very unintuitive to the user and a plain bug, at
least in my book.
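To make the idea concrete, here is a minimal user-space sketch of a write path
that asks GC for only the space it is short of. It is an illustration under
stated assumptions, not yaffs or jffs2 code: struct flash_state, gc_reclaim()
and write_blocks() are hypothetical names, and space is counted in whole
erase blocks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flash state, counted in erase blocks. */
struct flash_state {
    size_t free_blocks;   /* erased, ready to be written */
    size_t dirty_blocks;  /* hold only deleted data, reclaimable by GC */
};

/* Reclaim dirty blocks until at least `needed` blocks are free or nothing
 * is left to reclaim. Returns the number of blocks reclaimed. */
static size_t gc_reclaim(struct flash_state *fs, size_t needed)
{
    size_t reclaimed = 0;

    while (fs->free_blocks < needed && fs->dirty_blocks > 0) {
        fs->dirty_blocks--;   /* stands in for erasing one block */
        fs->free_blocks++;
        reclaimed++;
    }
    return reclaimed;
}

/* Write `blocks` blocks. If space is short, run GC for the shortfall
 * first. Returns 0 on success, -1 if the device is really full. */
static int write_blocks(struct flash_state *fs, size_t blocks)
{
    if (fs->free_blocks < blocks)
        gc_reclaim(fs, blocks);
    if (fs->free_blocks < blocks)
        return -1;
    fs->free_blocks -= blocks;
    return 0;
}
```

With 10 free and 40 dirty blocks, a 30-block write reclaims only 20 blocks
and leaves the other 20 dirty blocks for later, so the NAND is not held for
a pass over the whole device.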
> The code look exelent, but .. it doesn't work! (At least in my 2.4.21-pre2
> kernel.
> I don't know why. And I had the same problem as you. When lock doesn't work
> you can
> get unpredictable result including absolutely wrong data in OOB. I quickly
> fix this problem using mutex.
> I've placed down() before each nand_get_chip() call and up() after
> spin_unlock_bh().
> It is not very elegant and, probably, not very good for kernel efficiency,
> however it works.
A race car can drive at any speed; if it doesn't reach the finish
line, it won't win any races.
Correct code is always faster than incorrect code. ;)
Jörn
--
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-22 20:26 ` Jörn Engel
@ 2003-04-22 20:59 ` Jörn Engel
0 siblings, 0 replies; 31+ messages in thread
From: Jörn Engel @ 2003-04-22 20:59 UTC (permalink / raw)
To: Alex Samoutin; +Cc: paul.wong, linux-mtd
On Tue, 22 April 2003 22:26:27 +0200, Jörn Engel wrote:
>
> Agreed. But this problem should be handled in kernel. In other words,
> the writing process has to trigger GC and continue writing, if GC
> managed to free more space. It would make sense to tell GC, how much
> space is needed, so it won't work over the whole device, locking the
> NAND for a long time.
>
> Anything else is very unintuitive to the user and a plain bug, at
> least in my book.
BTW: See the thread "Does jffs2 garbage collection include erasing".
jffs2 does just this. And I didn't even know the code until a few
minutes ago. :)
Jörn
--
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-22 20:03 Fw: corrupt my NAND flash device Alex Samoutin
2003-04-22 20:26 ` Jörn Engel
@ 2003-04-23 20:45 ` Charles Manning
2003-04-24 18:25 ` Alex Samoutin
1 sibling, 1 reply; 31+ messages in thread
From: Charles Manning @ 2003-04-23 20:45 UTC (permalink / raw)
To: Alex Samoutin, linux-mtd
Alex
I am unaware of this problem. I would like to understand it better. Could you
give me more details?
-- Charles
On Wednesday 23 April 2003 08:03, you wrote:
> >Next step, I erased the file and copied another 5 MB file to it. The
> >device said " no enough spare.
>
> Yes. You have no all free space immediately after erasing, because garbage
> collector
> didn't complete his work yet.
>
> >Then I use the "mkyaffs" to format it,
> >it is shown many bad block in the device. I checked the OOB[5] ( bad
> >block flag) it is set to 0x00. Why? Why the yaffs set the health
> >block to the bad block after erase file? Is the YAFFS not support big
> >file ? any ideal?
>
> It's different problem. When you "erase" file it not really erased.
> This file just marked for erasing and then GC working in background provede
> real erasing.
In yaffs this does not really happen. Even though the real deleting is done
"in background", it is always done in the same thread, i.e. this is done by
parasitic code of the form:
    write_data()
    {
        do_background_gc();
        do_actual_write();
    }
YAFFS is locked by a semaphore so that only one thread is in any YAFFS
partition at a time.
> So, if you erase file and start immediately write new file - you have two
> processes which try
> access the NAND chip at the same time (Writing and GC). NAND driver has
> lock mechanism to prevent problem.
> Before each physical access to chip you mast grab the lock (using
> nand_get_chip()).
> The code look exelent, but .. it doesn't work! (At least in my 2.4.21-pre2
> kernel.
> I don't know why. And I had the same problem as you. When lock doesn't work
> you can
> get unpredictable result including absolutely wrong data in OOB. I quickly
> fix this problem using mutex.
> I've placed down() before each nand_get_chip() call and up() after
> spin_unlock_bh().
> It is not very elegant and, probably, not very good for kernel efficiency,
> however it works.
I guess there could be a lower-level problem if two or more partitions are in
use and locking is not working in the mtd code.
So could running mkyaffs while the partition is mounted (assuming locking is
not working).
Comments anyone?
-- Charles
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-23 20:45 ` Charles Manning
@ 2003-04-24 18:25 ` Alex Samoutin
2003-04-25 13:01 ` Jörn Engel
0 siblings, 1 reply; 31+ messages in thread
From: Alex Samoutin @ 2003-04-24 18:25 UTC (permalink / raw)
To: linux-mtd
> I am unaware of this problem. I would like to understand it better. Could
> you give me more details?
>
> > It's different problem. When you "erase" file it not really erased.
> > This file just marked for erasing and then GC working in background
> > provide real erasing.
>
> In yaffs this does not really happen. Even though the real deleting is
> done "in background", it is always done in the same thread. ie. this is
> done by parasitic code of the form:
>
> write_data()
> {
> do_background_gc();
> do_actual_write();
> }
Honestly speaking, I don't know much about YAFFS, but I think its GC should
work similarly to the JFFS2 GC thread.
I had a problem very similar to the one originally described, on my JFFS2.
There is nothing wrong with JFFS2 itself; it was a low-level problem. For
example, say we have an empty flash with 100 erase blocks. We write File1,
which occupies blocks 0-39. Then we delete this file and start writing File2.
The new file is written to blocks 40-79. Now we have two processes
working: process 1 (GC) is erasing blocks 0-39, and process 2 is writing the
new file to blocks 40-79. Both are working at the same time (at least in
the case of JFFS2). I saw it. From the blocks' point of view all is OK: one
set of blocks is being erased and an absolutely different set is being
written. But we have only one NAND chip. When we want to write something to
NAND we have to send the write command, then provide some command parameters
(like the address), and then send the data itself. If the locking mechanism
doesn't work properly, we can get a situation where, for example, one process
sends the write command and then the second process starts sending something
to the chip, which will be interpreted as the address for the write; this
address will of course be incorrect. As I mentioned, the NAND driver (nand.c)
has a lock mechanism to prevent this situation, but it doesn't work in my
case. I don't know why; the code looks OK. I added some additional locks and
that solved the problem.
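The interleaving described above can be modeled in a few lines of user-space
C. This is only a toy sketch, not code from nand.c: struct toy_nand,
send_cmd(), send_byte(), chip_trylock(), chip_unlock() and
interleaved_write() are hypothetical names, the command value is chosen for
illustration, and the "second process" is simulated by a single extra bus
cycle rather than a real thread.

```c
#include <assert.h>
#include <stdint.h>

enum { CMD_NONE = 0, CMD_WRITE = 0x80 };

/* Toy model of a chip with one command latch shared by all callers. */
struct toy_nand {
    int      pending_cmd;   /* command latched, waiting for an address */
    int      locked;        /* 1 while a caller owns the chip */
    uint32_t last_addr;     /* address the chip actually programmed */
};

/* Each function models one bus cycle. */
static void send_cmd(struct toy_nand *c, int cmd) { c->pending_cmd = cmd; }

static void send_byte(struct toy_nand *c, uint32_t value)
{
    /* After a write command the chip treats the next value as the
     * address, no matter which caller put it on the bus. */
    if (c->pending_cmd == CMD_WRITE) {
        c->last_addr = value;
        c->pending_cmd = CMD_NONE;
    }
}

/* Returns 1 if the caller got the chip, 0 if it must wait. */
static int chip_trylock(struct toy_nand *c)
{
    if (c->locked)
        return 0;
    c->locked = 1;
    return 1;
}

static void chip_unlock(struct toy_nand *c) { c->locked = 0; }

/* The writer latches the write command, then the other caller's bus cycle
 * arrives before the writer's address. With use_lock set the intruder is
 * turned away. Returns the address the chip ends up programming. */
static uint32_t interleaved_write(int use_lock, uint32_t addr, uint32_t stray)
{
    struct toy_nand chip = { CMD_NONE, 0, 0 };

    if (use_lock)
        chip_trylock(&chip);          /* writer takes the chip */
    send_cmd(&chip, CMD_WRITE);

    /* The second process wakes up here. */
    if (!use_lock || chip_trylock(&chip))
        send_byte(&chip, stray);

    send_byte(&chip, addr);           /* writer's real address */
    if (use_lock)
        chip_unlock(&chip);
    return chip.last_addr;
}
```

Calling interleaved_write(0, 40, 7) programs address 7 instead of 40, while
interleaved_write(1, 40, 7) programs 40, because the lock is held across the
whole command/address sequence and not just one bus cycle.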
Alexander Samoutin
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-24 18:25 ` Alex Samoutin
@ 2003-04-25 13:01 ` Jörn Engel
2003-04-25 22:23 ` Alex Samoutin
0 siblings, 1 reply; 31+ messages in thread
From: Jörn Engel @ 2003-04-25 13:01 UTC (permalink / raw)
To: Alex Samoutin; +Cc: linux-mtd
On Thu, 24 April 2003 11:25:43 -0700, Alex Samoutin wrote:
>
> It is nothing wrong with JFFS2 itself. It was a low level problem. For
> example we have empty flash with 100 erase blocks. We wrote File1 and
> occupied blocks 0-39. Then we deleted this file and start writing File2.
> This new file will be written to blocks 40-79. And now we have two processes
> working.. Process one (GC) is erasing block 0-39, process 2 is writing new
> file to blocks 40-79. Both of them are working at the same time (at least in
> case of JFFS2). I sow it. From blocks point of view all Ok. One set of
> blocks is erasing and absolutely different set of block is writing. But we
> have only one NAND chip. When we want to write something to NAND we have to
> send writing command, then provide some command parameters (like address)
> and then send data itself. And if locking mechanism doesn't work properly we
> can get situation when , for example, one process send command to write and
> then second process start sending something to chip which will be
> interpreted as address for writing but this address will be incorrect, of
> cause. As I mentioned NAND driver (nand.c) has lock mechanism to prevent
> this situation. But it doesn't work in my case. I don't know why - the code
> look Ok. A added some additional locks and it solved this problem.
Ah, sorry for misinterpreting your original post. I thought you had
found a problem in yaffs.
Do you have a patch for this? Even if it is ugly and slow, correct
code is better than broken code.
Jörn
--
But this is not to say that the main benefit of Linux and other GPL
software is lower-cost. Control is the main benefit--cost is secondary.
-- Bruce Perens
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-25 13:01 ` Jörn Engel
@ 2003-04-25 22:23 ` Alex Samoutin
2003-04-25 23:10 ` Thayne Harbaugh
2003-04-26 10:18 ` Jörn Engel
0 siblings, 2 replies; 31+ messages in thread
From: Alex Samoutin @ 2003-04-25 22:23 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-mtd
> > cause. As I mentioned NAND driver (nand.c) has lock mechanism to prevent
> > this situation. But it doesn't work in my case. I don't know why - the
> > code look Ok. A added some additional locks and it solved this problem.
>
> Ah, sorry for me misinterpreting you original post. I thought, you had
> found a problem in yaffs.
>
> Do you have a patch for this? Even if it is ugly and slow, correct
> code is better than broken one.
>
This is a diff of drivers/mtd/nand.c (I made too many changes to create
a patch)
151a152,154
> static struct semaphore sam;
> #define NAND_WRITE_RETRY 1
>
362c365
<
---
> int ret = 0;
370c373,381
<
---
> #ifdef NAND_WRITE_RETRY
> repeat:
> if (ret>3){
> printk ("%s: " "Failed write verify, page 0x%08x ", __FUNCTION__, page);
> return -EIO;
>
> }
> #endif
>
476a488,491
> #ifdef NAND_WRITE_RETRY
> ++ret;
> goto repeat;
> #endif
485c500,504
< DEBUG (MTD_DEBUG_LEVEL0, "%s: " "Failed write verify, page 0x%08x ", __FUNCTION__, page);
---
> DEBUG (MTD_DEBUG_LEVEL0, "%s: " "Failed write verify OOB, page 0x%08x ", __FUNCTION__, page);
> #ifdef NAND_WRITE_RETRY
> ++ret;
> goto repeat;
> #endif
507a527,530
> #ifdef NAND_WRITE_RETRY
> ++ret;
> goto repeat;
> #endif
557c580
<
---
> down(&sam);
701a725
> up(&sam);
737c761
<
---
> down(&sam);
763c787
<
---
> up(&sam);
807c831
<
---
> down(&sam);
851c875,876
<
---
> up(&sam);
>
880a906
> down(&sam);
944c970,971
<
---
> up(&sam);
>
990a1018
> down(&sam);
1063a1092
> up(&sam);
1098c1127,1128
<
---
>
> down(&sam);
1189a1220
> up(&sam);
1373a1405
> init_MUTEX(&sam);
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-25 22:23 ` Alex Samoutin
@ 2003-04-25 23:10 ` Thayne Harbaugh
2003-04-26 10:23 ` Jörn Engel
2003-04-30 16:54 ` Alex Samoutin
2003-04-26 10:18 ` Jörn Engel
1 sibling, 2 replies; 31+ messages in thread
From: Thayne Harbaugh @ 2003-04-25 23:10 UTC (permalink / raw)
To: Alex Samoutin; +Cc: linux-mtd, Jörn Engel
On Fri, 2003-04-25 at 16:23, Alex Samoutin wrote:
> > > cause. As I mentioned NAND driver (nand.c) has lock mechanism to prevent
> > > this situation. But it doesn't work in my case. I don't know why - the
> > > code look Ok. A added some additional locks and it solved this problem.
You added locks?
> >
> > Ah, sorry for me misinterpreting you original post. I thought, you had
> > found a problem in yaffs.
> >
> > Do you have a patch for this? Even if it is ugly and slow, correct
> > code is better than broken one.
> >
> This is diff file for drivers/mtd/nand.c (I made to many changes to create
> a patch)
>
> 151a152,154
> > static struct semaphore sam;
> > #define NAND_WRITE_RETRY 1
Hmmm - looks like you have more than just locking - you also have some
retry logic for failed commands.
I'm jumping in because I have some chips that appear to be broken - they
occasionally drop operations (erase/write). Eric Beiderman and I have
been through the code (specifically cfi_cmdset_0002.c) and can't find
anywhere that the software might be at fault. Until I connect a logic
analyzer I can't be certain it's the hardware either - although none of
the other models of flash chip have this problem and everything points
to bad hardware.
Right now Eric and I are debating adding retry logic. When a command is
"dropped" it seems to always succeed when sent a second time (don't have
any examples that failed on the second try). I'm interested because
your situation seems to be related. Is the problem that the chip
sometimes ignores a command? Does your retry fix things?
The big question is how common is this problem (needing retries)?
Should this be more formalized and added to the chip drivers or should it
be left up to individuals having to fix things for their special or less
common cases?
--
Thayne Harbaugh
Linux Networx
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-25 23:10 ` Thayne Harbaugh
@ 2003-04-26 10:23 ` Jörn Engel
2003-04-28 15:02 ` Thayne Harbaugh
2003-04-30 16:54 ` Alex Samoutin
1 sibling, 1 reply; 31+ messages in thread
From: Jörn Engel @ 2003-04-26 10:23 UTC (permalink / raw)
To: Thayne Harbaugh; +Cc: Alex Samoutin, linux-mtd
On Fri, 25 April 2003 17:10:43 -0600, Thayne Harbaugh wrote:
>
> The big question is how common is this problem (needing retries)?
> Should this be more formalized and added to he chip drivers or should it
> be left up to individuals having to fix things for their special or less
> common cases?
This looks more like an implementation problem. If you manage to add
retry code somewhere central so that all drivers benefit from it,
there should hardly be a performance hit, even for drivers that don't
need it. And once in a central place, it would be trivial to add a
config option for this.
But someone has to do it. :)
Jörn
--
You can't tell where a program is going to spend its time. Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you've proven that's where the bottleneck is.
-- Rob Pike
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-26 10:23 ` Jörn Engel
@ 2003-04-28 15:02 ` Thayne Harbaugh
2003-04-28 21:14 ` Charles Manning
0 siblings, 1 reply; 31+ messages in thread
From: Thayne Harbaugh @ 2003-04-28 15:02 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-mtd
On Sat, 2003-04-26 at 04:23, Jörn Engel wrote:
> On Fri, 25 April 2003 17:10:43 -0600, Thayne Harbaugh wrote:
> >
> > The big question is how common is this problem (needing retries)?
> > Should this be more formalized and added to he chip drivers or should it
> > be left up to individuals having to fix things for their special or less
> > common cases?
>
> This looks more like an implementation problem. If you manage to add
> retry code somewhere central
Any suggestion where somewhere central is? Right now I've been playing
with it down in the cmdset code - hardly a central place. I'll have to
look at the mtdchar.c or something. The nice thing about the cmdsets is
that they are the best place to retry an individual command - especially
individual writes.
> so that all drivers benefit from it,
Agreed - why do the work in each command set - duplicate code issues.
> there should hardly be a performance hit, even for drivers that don't
> need it.
The performance hit is very small: testing whether a counter is still below
a set maximum is negligible compared to the time of a command (erase/write).
> And once in a central place, it would be trivial to add a
> config option for this.
Agreed.
> But someone has to do it. :)
>
> Jörn
The whole thing just makes me sick. It's ugly putting in such a hack.
One little voice in my head keeps telling me that there's an error in
software and I just have to find and fix the bug. Another little voice
in my head keeps telling me that broken hardware is more common than
most people want to believe.
I haven't been very aggressive about adding the retry code because right
now I'm interested in more data points: Am I the only one that sees the
problem of a flash chip that occasionally drops commands or are others
seeing this same problem? Is this problem more common but people don't
see it because the flash filesystems think that a location is bad and
mark it as unusable?
I'm more than happy to do the work if others think it's a good thing and
if someone has a suggestion where/how it should be done.
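For discussion, the bounded retry loop mentioned above might look roughly
like this. It is a user-space sketch under stated assumptions, not driver
code: retry_op(), the flash_op callback type, MAX_TRIES and the flaky_op()
test double are all hypothetical names, and the limit of 3 is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRIES 3   /* assumed limit, not taken from any driver */

/* Hypothetical operation type: returns 0 on success, nonzero on failure. */
typedef int (*flash_op)(void *ctx);

/* Run op until it succeeds or MAX_TRIES attempts were made.
 * Returns 0 on success, -1 if every attempt failed. If tries_out is not
 * NULL it receives the number of attempts actually made. */
static int retry_op(flash_op op, void *ctx, int *tries_out)
{
    int tries = 0;
    int ret = -1;

    while (tries < MAX_TRIES) {
        tries++;
        if (op(ctx) == 0) {
            ret = 0;
            break;
        }
    }
    if (tries_out)
        *tries_out = tries;
    return ret;
}

/* Test double: fails as many times as *ctx says, then succeeds. */
static int flaky_op(void *ctx)
{
    int *failures_left = ctx;

    if (*failures_left > 0) {
        (*failures_left)--;
        return -1;
    }
    return 0;
}
```

A command that is dropped once succeeds on the second attempt, and one that
never succeeds gives up after MAX_TRIES. Returning the attempt count would
also give the data points asked for above, since every result greater than
one is a dropped command that a filesystem would otherwise see as a bad
location.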
--
Thayne Harbaugh
Linux Networx
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-28 15:02 ` Thayne Harbaugh
@ 2003-04-28 21:14 ` Charles Manning
2003-04-28 22:59 ` Thomas Gleixner
0 siblings, 1 reply; 31+ messages in thread
From: Charles Manning @ 2003-04-28 21:14 UTC (permalink / raw)
To: Thayne Harbaugh, Jörn Engel; +Cc: linux-mtd
I have seen some weird stuff before... comments further below:
>
> The whole thing just makes me sick. It's ugly putting in such a hack.
> One little voice in my head keeps telling me that there's an error in
> software and I just have to find and fix the bug. Another little voice
> in my head keeps telling me that broken hardware is more common than
> most people want to believe.
Yes, there are/ have been cases where the chips do not latch their commands
correctly. This can be made worse by marginal chip select timing etc.
I was sent some errata sheets by Samsung at some stage, but I did not secure
permission to forward these. In all cases, the identified problems have been
addressed in currently shipping product. To paraphrase the mentioned problems:
* Reading the status too soon after issuing the command: some parts need a
brief wait after latching the command before the busy flag is valid. Without
the wait, the busy state might be misinterpreted. 500ns would be ample.
* Ensuring the correct number of address cycles: I have observed cases where
a chip seems to work when the wrong number of address cycles was issued, but
gave erratic results.
* Issue a reset command before any read/write/erase command. This is a small
overhead and ensures that the command register is always in a consistent
state.
Also check the basics like power and signal integrity. Overshooting/ringing
clocks could very easily be latching spurious data and corrupting the
commands.
>
> I haven't been very aggressive about adding the retry code because right
> now I'm interested in more data points: Am I the only one that sees the
> problem of a flash chip that occasionally drops commands or are others
> seeing this same problem? Is this problem more common but people don't
> see it because the flash filesystems think that a location is bad and
> mark it as unusable?
I'd suggest exploring the above first.
YAFFS is very aggressive about the way it retires data blocks. If any reads
(including verification) have any corruption (even if the ECC fixes them),
then the block is retired. The reason for this strategy is that I have a
theory that blocks get bad with age/use. Rather than encountering
unrecoverable data errors, YAFFS retires blocks on the first sign of a
problem.
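As a rough illustration of that policy (not the actual YAFFS source), the
decision can be written as a single predicate. The enum and both function
names below are hypothetical; the lenient variant is included only for
comparison.

```c
#include <assert.h>
#include <stdbool.h>

/* Possible outcomes of reading (or verifying) one page. */
enum ecc_result {
    ECC_CLEAN,      /* no bit errors */
    ECC_CORRECTED,  /* errors found and fixed by ECC */
    ECC_FAILED      /* errors ECC could not fix */
};

/* The policy as described: any corruption retires the block. */
static bool should_retire_aggressive(enum ecc_result r)
{
    return r != ECC_CLEAN;
}

/* A more lenient alternative, for comparison only. */
static bool should_retire_lenient(enum ecc_result r)
{
    return r == ECC_FAILED;
}
```

The two differ only for ECC_CORRECTED, which is exactly the case where a
dropped or corrupted command on marginal hardware would cause a healthy
block to be retired.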
Someone has previously suggested that I provide a flag to disable on-NAND
retirement marking during development. I think it is time I added this.
I guess it would be a good thing to do retries in YAFFS to at least get more
information.
-- Charles
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-28 21:14 ` Charles Manning
@ 2003-04-28 22:59 ` Thomas Gleixner
2003-04-29 1:23 ` Charles Manning
0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2003-04-28 22:59 UTC (permalink / raw)
To: manningc2, Thayne Harbaugh, Jörn Engel; +Cc: linux-mtd
On Monday 28 April 2003 23:14, Charles Manning wrote:
> I have seen some wierd stuff before... comments further below:
> > The whole thing just makes me sick. It's ugly putting in such a hack.
> > One little voice in my head keeps telling me that there's an error in
> > software and I just have to find and fix the bug. Another little voice
> > in my head keeps telling me that broken hardware is more common than
> > most people want to believe.
>
> Yes, there are/ have been cases where the chips do not latch their commands
> correctly. This can be made worse by marginal chip select timing etc.
That's nothing that should be fixed by generic software drivers. Either the
chips are buggy or the signal timings are wrong, or even both. If we took
care of all broken hardware, we would experience a magic kernel source
size explosion in no time.
> * Reading the status too soon after issuing the command: some parts need a
> brief wait after latching the command before the busy flag is valid.
> Without the wait, the busy state might be misinterpreted. 500ns would be
> ample.
If this is an issue, I'm willing to add this to nand.c in the form of a delay
supplied by the hardware driver, which is 0 by default.
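A sketch of what such a delay would do, simulated in user space. Nothing
here is existing nand.c code: struct board_nand, its status_delay_ns field,
struct toy_chip and the helper functions are hypothetical, and time is a
simple counter instead of a real delay call.

```c
#include <assert.h>

/* Hypothetical per-board settings; the field name is an assumption. */
struct board_nand {
    unsigned status_delay_ns;   /* wait before first status read, 0 = none */
};

/* Toy chip: its busy flag only becomes valid some time after a command. */
struct toy_chip {
    unsigned now_ns;            /* simulated clock */
    unsigned cmd_time_ns;       /* when the last command was latched */
    unsigned busy_valid_ns;     /* time the busy flag needs to become valid */
    unsigned busy_until_ns;     /* when the operation really finishes */
};

static void latch_cmd(struct toy_chip *c, unsigned duration_ns)
{
    c->cmd_time_ns = c->now_ns;
    c->busy_until_ns = c->now_ns + duration_ns;
}

/* Returns 1 if the chip reports ready. Too early after a command the flag
 * is stale and still says ready although the chip is busy. */
static int read_ready(const struct toy_chip *c)
{
    if (c->now_ns - c->cmd_time_ns < c->busy_valid_ns)
        return 1;               /* stale flag */
    return c->now_ns >= c->busy_until_ns;
}

/* Issue a command and poll until ready; returns the simulated time at
 * which the caller believes the operation finished. */
static unsigned run_cmd(const struct board_nand *b, struct toy_chip *c,
                        unsigned duration_ns)
{
    latch_cmd(c, duration_ns);
    c->now_ns += b->status_delay_ns;     /* stands in for a delay call */
    while (!read_ready(c))
        c->now_ns += 100;                /* poll every 100 ns */
    return c->now_ns;
}
```

For a chip whose busy flag needs 400 ns to become valid, a board with the
default delay of 0 believes a 2000 ns operation is finished at time 0, while
a board that sets 500 ns waits until the operation is really done.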
> * Ensuring the correct number of address cycles: I have observed cases
> where a chip seems to work when the wrong number of address cycles was
> issued, but gave erratic results.
The address cycles in the generic nand.c command function are correct. I don't
know whether anybody uses a command function supplied by a hardware driver.
> * Issue a reset command before any read/write/erase command. This is a
> small overhead and ensures that the command register is always in a
> consistent state.
If that helps, I'm willing to add this too, as a conditional defaulting to
zero. I remember a big thread complaining about this overhead before it was
removed. I did this carefully and there is no "maybe a write is interrupted
by another thread" issue. Only erases can be interrupted, but they are
restarted later. And on interruption of an erase the reset command is issued.
Can anybody add a check whether the erase is interrupted immediately before
the write error occurs? If that's the case, then we have to check the
datasheet of the offending chip and maybe block erase interruption
conditionally, defaulting to not blocking it, as it works here and is proven
to do so elsewhere.
> Also check the basics like power and signal integrity. Overshooting/ringing
> clocks could very easily be latching spurious data and corrupting the
> commands.
I have seen this on some hardware, where address lines were used for CLE and
ALE, which is possible with compliance to all timing constraints. But it's
really not easy to match this under all circumstances (interrupts, dma, cache
refill ....).
> > I haven't been very aggressive about adding the retry code because right
> > now I'm interested in more data points: Am I the only one that sees the
> > problem of a flash chip that occasionally drops commands or are others
> > seeing this same problem? Is this problem more common but people don't
> > see it because the flash filesystems think that a location is bad and
> > mark it as unusable?
>
> I'd suggest exploring the above first.
I have been running NAND flash with YAFFS and JFFS2 partitions for more than
a year in a mostly permanent copy/remove/move cycle. I had no spurious
commands or anything like that. I never got blocks marked bad randomly. I
have different sized SmartMedia cards from various vendors and production
dates in use, so it is not just luck with a random good part.
I know about a bunch of implementations, where NAND has been proven reliable
in extensive tests.
I'm really _NOT_ willing to buy that adding some obscure retry mechanism
will solve all these problems forever. They may disappear for now and come
back in a different EMC or application environment.
--
Thomas
________________________________________________________________________
linutronix - competence in embedded & realtime linux
http://www.linutronix.de
mail: tglx@linutronix.de
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-28 22:59 ` Thomas Gleixner
@ 2003-04-29 1:23 ` Charles Manning
2003-04-29 8:03 ` Thomas Gleixner
0 siblings, 1 reply; 31+ messages in thread
From: Charles Manning @ 2003-04-29 1:23 UTC (permalink / raw)
To: tglx, Thayne Harbaugh, Jörn Engel; +Cc: linux-mtd
> > Yes, there are/ have been cases where the chips do not latch their
> > commands correctly. This can be made worse by marginal chip select timing
> > etc.
>
> That's nothing, what should be fixed by generic software drivers. Either
> the chips are buggy or the signal timings are wrong or even both. If we
> would take care of all broken hardware, we would experiencing magic kernel
> source size explosion within no time.
Agree. Getting chip selects etc right is not the job of nand.c. I was trying
to identify those problems that could kick up issues on a specific platform.
>
> > * Reading the status too soon after issuing the command: some parts need
> > a brief wait after latching the command before the busy flag is valid.
> > Without the wait, the busy state might be misinterpreted. 500ns would be
> > ample.
>
> If this is an issue, I'm willing to add this to nand.c in form of a
> hardware driver supplied delay, which is 0 by default.
Sounds like a good compromise.
>
> > * Ensuring the correct number of address cycles: I have observed cases
> > where a chip seems to work when the wrong number of address cycles was
> > issued, but gave erratic results.
>
> The address cycles in the generic nand.c command function are correct. I
> don't know, if anybody uses a hardware driver supplied command function.
I do not suspect that nand.c is broken here. I saw the problem on a non-Linux
platform.
>
> > * Issue a reset command before any read/write/erase command. This is a
> > small overhead and ensures that the command register is always in a
> > consistent state.
>
> If that helps, I'm willing to add this too, conditional, defaulting to
> zero. I remember a big thread complainig about this overhead, before it was
> removed. I did this carefully and there is no "maybe a write is interrupted
> by another thread issue". Only erases can be interrupted, but they are
> restarted later. And on interruption of erase the reset comand is issued.
There is an overhead which is variable depending on the operation being
performed. It seems likely to me that the only condition where this is likely
to improve things is when recovering from some hardware problem (eg. signal
integrity).
Why do you interrupt erases? It seems to me like a potentially unhealthy
thing to do on NAND, since NAND does not support erase suspend. NAND erases
quite quickly (say 2 ms), so do you gain anything real by doing this?
>
> Can anybody add a check, whether the erase is interrupted immidiately
> before the write error occures ? If that's the case, then we have to check
> the datasheet of the offending chip and maybe block erase interruption
> conditionally, defaulting to not, as it works here and is proven to do so
> elsewhere.
>
> > Also check the basics like power and signal integrity.
> > Overshooting/ringing clocks could very easily be latching spurious data
> > and corrupting the commands.
>
> I have seen this on some hardware, where address lines were used for CLE
> and ALE, which is possible with compliance to all timing constraints. But
> it's really not easy to match this under all circumstances (interrupts,
> dma, cache refill ....).
Yes, I agree. With cached systems, the bus traffic is quite variable making
it difficult to find all the corner cases.
>
> > > I haven't been very aggressive about adding the retry code because
> > > right now I'm interested in more data points: Am I the only one that
> > > sees the problem of a flash chip that occasionally drops commands or
> > > are others seeing this same problem? Is this problem more common but
> > > people don't see it because the flash filesystems think that a location
> > > is bad and mark it as unusable?
> >
> > I'd suggest exploring the above first.
>
> I have running NAND-FLASH with YAFFS and JFFS2 partitions for more than a
> year in a mostly permanent copy/remove/move cycle. I had no spurious
> commands or anything like that. I never got blocks marked bad randomly. I
> have different sized SmartMedia Cards from various vendors and production
> dates in use, so it is not a random good part luck.
>
> I know about a bunch of implementations, where NAND has been proven
> reliable in extensive tests.
>
> I'm really _NOT_ willing to buy, that adding of some obscure retry
> mechanism will solve all this problems for ever. They may dissapear for now
> and come back in a different EMC or application environement.
Agree. Many people are using YAFFS with no problems. Retry sounds like an
attempt to fix something else (hardware/timing issue). Rather fix the real
problem.
-- Charles
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-29 1:23 ` Charles Manning
@ 2003-04-29 8:03 ` Thomas Gleixner
2003-04-29 19:37 ` Charles Manning
0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2003-04-29 8:03 UTC (permalink / raw)
To: manningc2, Thayne Harbaugh, Jörn Engel; +Cc: linux-mtd
On Tuesday 29 April 2003 03:23, Charles Manning wrote:
> > If that helps, I'm willing to add this too, conditional, defaulting to
> > zero. I remember a big thread complainig about this overhead, before it
> > was removed. I did this carefully and there is no "maybe a write is
> > interrupted by another thread issue". Only erases can be interrupted, but
> > they are restarted later. And on interruption of erase the reset comand
> > is issued.
>
> There is an overhead which is variable depending on the operation being
> performed. It seems likely to me that the only condition where this is
> likely to improve things is when recovering from some hardware problem (eg.
> signal integrity).
>
> Why do you interrupt erases? It seems to me like potentially an unhealthy
> thing to do on NAND since NAND does not support erase suspend. NAND erases
> quite quickly (say 2mS) so do you gain anything real by doing this?
Toshiba Datasheet:
Reset
The Reset mode stops all operations. For example, in the case of a Program or
Erase operation the internally generated voltage is discharged to 0 volts and
the device enters Wait state. The response to an FFH Reset command input
during the various device operations is as follows:
Samsung Datasheet:
RESET
The device offers a reset feature, executed by writing FFh to the command
register. When the device is in Busy state during random read, program or
erase mode, the reset operation will abort these operations. The contents of
memory cells being altered are no longer valid, as the data will be partially
programmed or erased. The command register is cleared to wait for the next
command, and the Status Register is cleared to value C0h when WP is high.
Refer to table 3 for device status after reset operation. If the device is
already in reset state a new reset command will not be accepted by the
command register. The R/B pin transitions to low for tRST after the Reset
command is written. Reset command is not necessary for normal operation.
Refer to Figure 10 below.
I have tested this with both chip types. The erase is aborted and restarted by
the erase function.
It would be really interesting to know whether those problems are related to
an erase abort. Can anybody insert some debugging in nand_get_chip, where the
abort is done?
--
Thomas
________________________________________________________________________
linutronix - competence in embedded & realtime linux
http://www.linutronix.de
mail: tglx@linutronix.de
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-29 8:03 ` Thomas Gleixner
@ 2003-04-29 19:37 ` Charles Manning
2003-04-29 22:04 ` Thomas Gleixner
0 siblings, 1 reply; 31+ messages in thread
From: Charles Manning @ 2003-04-29 19:37 UTC (permalink / raw)
To: tglx, Thayne Harbaugh, Jörn Engel; +Cc: linux-mtd
Thomas
> > Why do you interrupt erases? It seems to me like potentially an unhealthy
> > thing to do on NAND since NAND does not support erase suspend. NAND
> > erases quite quickly (say 2mS) so do you gain anything real by doing
> > this?
<snip>
>
> I have tested this with both chiptypes. The erase is aborted and restarted
> by the erase function.
A couple of comments:
* "Both chip types" is misleading. There is not just a Toshiba and a Samsung
chip type. Each of these vendors provides chips with different internal
architectures. That is one of the reasons characteristics like the number of
partial page writes etc. change.
* "Aborted and restarted" is perhaps incorrect. Don't you really mean
"aborted and re-performed"? I do not believe these parts have a way of
remembering their internal erasure state to restart like NOR parts do.
My other question remains: do you really gain anything by adding the erase
interruption feature? From a Samsung datasheet:
* Block erase takes typically 2ms, max 3ms.
* If you do a reset while the part is erasing, the reset might take as long
as 500us.
You then have to restart the erase for it to take another 2ms (unless it gets
interrupted again).
This certainly adds some unpredictability to the behaviour. Most likely it
does all work, but is it worth it?
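To put rough numbers on the trade-off above, here is a minimal sketch. It
assumes the Samsung figures just quoted (2 ms typical erase, up to 500 us for
a reset during erase) and that an aborted erase is re-issued from scratch.
The helper names erase_total_us() and print_erase_totals() are hypothetical
and are not part of nand.c.

```c
#include <stdio.h>

/* Figures quoted above from a Samsung datasheet, in microseconds. */
#define ERASE_TYP_US 2000UL /* typical block erase time */
#define RESET_MAX_US  500UL /* worst-case reset during an erase */

/*
 * Hypothetical helper (not part of nand.c): time until a block is finally
 * erased when the erase is aborted `aborts` times, each abort arriving
 * `elapsed_us` into the erase, and the erase is then re-issued from scratch.
 */
unsigned long erase_total_us(unsigned long aborts, unsigned long elapsed_us)
{
    return aborts * (elapsed_us + RESET_MAX_US) + ERASE_TYP_US;
}

/* Print the totals for 0..max_aborts aborts. */
void print_erase_totals(unsigned long max_aborts, unsigned long elapsed_us)
{
    unsigned long n;

    for (n = 0; n <= max_aborts; n++)
        printf("%lu abort(s): %lu us\n", n, erase_total_us(n, elapsed_us));
}
```

With a single abort arriving 1 ms into the erase, erase_total_us(1, 1000)
gives 3500 us instead of 2000 us. The waiting read or write gets the chip
after the reset rather than after the remaining erase time, but the erase
itself finishes later, and every further abort pushes it out again.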
>
> It would really be interresting, if those problems are in related to an
> erase abort. Can anybody insert some debugging in nand_get_chip, where the
> abort is done?
Yes, it certainly would be interesting.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-29 19:37 ` Charles Manning
@ 2003-04-29 22:04 ` Thomas Gleixner
0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2003-04-29 22:04 UTC (permalink / raw)
To: manningc2, Thayne Harbaugh, Jörn Engel; +Cc: linux-mtd
On Tuesday 29 April 2003 21:37, Charles Manning wrote:
> > I have tested this with both chiptypes. The erase is aborted and
> > restarted by the erase function.
>
> A couple of comments:
> * "Both chip types" is misleading. There is not just a Toshiba and a
> Samsung chiptype. Each of these vendors provides chips with different
> internal architectures. That is one of the reason characteristics like the
> number of partial page writes etc change.
True. I meant a couple of different types.
> * "Aborted and restarted" is perhaps incorrect. Don't you really mean
> "aborted and re-performed"? I do not believe these parts have a way of
> remembering their internal eraseure state to restart line NOR parts do.
Yep. According to the datasheet the command is thrown away, and it is issued
again later. Sorry for the misleading wording.
> My other question remains: do you really gain anything by adding the erase
> interruption feature? From a Samsung datasheet:
> * Block erase takes typically 2ms, max 3ms.
> * If you do a reset while the part is erasing, the reset might take as long
> as 500us.
> You then have to restart the erase for it to take another 2ms (unless it
> gets interrupted again).
I have run tests on different chips. The erase time varies a lot. The peak was
45 ms, so that matters IMHO. The specs allow up to 200 ms.
> This certainly adds some unpredictability to the behaviour. Most likely it
> does all work, but is it worth it?
Depends. :)
> > It would really be interresting, if those problems are in related to an
> > erase abort. Can anybody insert some debugging in nand_get_chip, where
> > the abort is done?
>
> Yes, it certianly would be interesting.
I think I have to throw out some harsh remarks again to get an answer to this
question, as there have not been many "hurray, I'm willing to volunteer"
replies so far. :)
--
Thomas
________________________________________________________________________
linutronix - competence in embedded & realtime linux
http://www.linutronix.de
mail: tglx@linutronix.de
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-25 23:10 ` Thayne Harbaugh
2003-04-26 10:23 ` Jörn Engel
@ 2003-04-30 16:54 ` Alex Samoutin
2003-04-30 18:13 ` Thomas Gleixner
1 sibling, 1 reply; 31+ messages in thread
From: Alex Samoutin @ 2003-04-30 16:54 UTC (permalink / raw)
To: Thayne Harbaugh; +Cc: tglx, linux-mtd
Hi Thayne,
>I'm jumping in because I have some chips that appear to be broken - they
>occasionally drop operations (erase/write)
> When a command is
>"dropped" it seems to always succeed when sent a second time (don't have
>any examples that failed on the second try). I'm interested because
>your situation seems to be related. Is the problem that the chip
>sometimes ignores a command? Does your retry fix things?
Yes, I had exactly the same problem. Sometimes the first write command was
ignored, but the second was always successful. However, I have had no problems
with erase operations; only writes were sometimes ignored.
(For Thomas) It’s not bad H/W or incorrect timing. I’ve played with the timing
and the result was the same. Also, I have 5 boards with a Toshiba NAND chip: 2
of them work fine without retry, but the other 3 need retry for normal
operation.
Alex.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-30 16:54 ` Alex Samoutin
@ 2003-04-30 18:13 ` Thomas Gleixner
2003-07-02 17:43 ` Alex Samoutin
0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2003-04-30 18:13 UTC (permalink / raw)
To: Alex Samoutin, Thayne Harbaugh; +Cc: linux-mtd
On Wednesday 30 April 2003 18:54, Alex Samoutin wrote:
> Yes I had absolutely the same problem. Some times first write command was
> ignored, but second always successful. However I have no problem with erase
> operations, only write some times was ignored.
>
> (For Thomas) It’s not a bad H/W or incorrect timing. I’ve played with
> timing and result was the same. Also I have 5 boards with Toshiba NAND chip
> and 2 of them are working fine without retry, but other 3 need retry for
> normal operation.
What timing params did you play with ?
Are CLE/ALE connected to GPIO pins ?
Do you use a ready function, which reads the R/B hardware pin ?
Can you please check the following:
1. Add a delay into nand_wait and play with the time
--- nand.c 14 Apr 2003 07:00:39 -0000 1.43
+++ nand.c 30 Apr 2003 17:05:26 -0000
@@ -226,6 +226,8 @@
this->hwcontrol (NAND_CTL_CLRALE);
}
+ udelay (500);
+
/*
* program and erase have their own busy handlers
* status and sequential in needs no delay
2. Report if that helps or changes anything
3. Remove the erase abort in nand_get_chip
--- nand.c 14 Apr 2003 07:00:39 -0000 1.43
+++ nand.c 30 Apr 2003 17:08:23 -0000
@@ -287,16 +287,6 @@
return;
}
- if (this->state == FL_ERASING) {
- if (new_state != FL_ERASING) {
- this->state = new_state;
- spin_unlock_bh (&this->chip_lock);
- nand_select (); /* select in any case */
- this->cmdfunc(mtd, NAND_CMD_RESET, -1, -1);
- return;
- }
- }
-
set_current_state (TASK_UNINTERRUPTIBLE);
add_wait_queue (&this->wq, &wait);
spin_unlock_bh (&this->chip_lock);
4. Report if that helps or changes anything
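For readers following the second patch, the decision it changes can be
modelled roughly as below. This is only a sketch: the enum names and
get_chip_action() are hypothetical, and the idle-chip branch is an assumption
about the code just above the hunk shown. Setting abort_enabled to 0
corresponds to the driver with the hunk removed.

```c
/* States and actions are illustrative names, not the FL_* values from MTD. */
enum chip_state { CHIP_READY, CHIP_ERASING, CHIP_BUSY };
enum chip_action { ACT_TAKE, ACT_ABORT_ERASE, ACT_WAIT };

/*
 * Hypothetical model of the decision taken in nand_get_chip():
 *  - idle chip: the caller takes it (assumed, not visible in the hunk);
 *  - chip erasing and the caller wants something other than an erase:
 *    abort the erase with RESET if abort_enabled, otherwise wait;
 *  - anything else: sleep on the wait queue and retry later.
 */
enum chip_action get_chip_action(enum chip_state cur, enum chip_state wanted,
                                 int abort_enabled)
{
    if (cur == CHIP_READY)
        return ACT_TAKE;
    if (abort_enabled && cur == CHIP_ERASING && wanted != CHIP_ERASING)
        return ACT_ABORT_ERASE;
    return ACT_WAIT;
}
```

With the abort removed, a write that arrives during an erase always waits for
the erase to finish, which is the behaviour the test in step 3 exercises.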
Thanks
--
Thomas
________________________________________________________________________
linutronix - competence in embedded & realtime linux
http://www.linutronix.de
mail: tglx@linutronix.de
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-30 18:13 ` Thomas Gleixner
@ 2003-07-02 17:43 ` Alex Samoutin
2003-07-02 17:53 ` Jasmine Strong
2003-07-03 5:44 ` Stephan Linke
0 siblings, 2 replies; 31+ messages in thread
From: Alex Samoutin @ 2003-07-02 17:43 UTC (permalink / raw)
To: tglx, Thayne Harbaugh; +Cc: linux-mtd
Hi Thomas,
Sorry for the long delay in answering - I had no hardware to test with. Now I
have my CerfCube 405ep back and can play with it.
So I had two problems:
1. Write verify sometimes fails
2. A write operation during an erase sometimes causes data corruption
Hardware details:
- NAND chip: Toshiba TC58256AFT
- ALE/CLE and CE connected to GPIO
- R/B pin connected to GPIO, and I use a ready function which reads this pin
- I played with different timings - even the slowest setting gives me the same
result
The 1st problem was solved by applying the new MTD snapshot (Jun 26). It looks
like nand_deselect(); nand_select() fixes the problem.
However, after applying the new MTD release the 2nd problem still remained.
Then I commented out the erase abort in nand_get_chip (as you suggested) and
that fixed my second problem!
Could you remove this erase abort from the MTD source? I think it will not
affect efficiency much.
BTW, jedec_probe.c has no #include <linux/init.h>, and that causes a
compilation problem within my source tree.
Alex Samoutin
Intrinsyc Software.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-07-02 17:43 ` Alex Samoutin
@ 2003-07-02 17:53 ` Jasmine Strong
2003-07-02 20:10 ` Alex Samoutin
2003-07-04 1:43 ` David Woodhouse
2003-07-03 5:44 ` Stephan Linke
1 sibling, 2 replies; 31+ messages in thread
From: Jasmine Strong @ 2003-07-02 17:53 UTC (permalink / raw)
To: Alex Samoutin; +Cc: tglx, linux-mtd, Thayne Harbaugh
On Wednesday, Jul 2, 2003, at 18:43 Europe/London, Alex Samoutin wrote:
> Sorry for big delay with answer – I had no hardware to test. Now I got
> my
> CerfCube 405ep back and can play with it.
>
> - NAND chip Toshiba TC58256AFT
>
The timings for the 405 EBIU bus do not match the required timings for
the Toshiba chip's read cycle. There is no good solution to this
problem.
We ended up putting the !RE pin onto a GPIO, but this caused (huge)
problems with interrupts and so forth.
-Jas.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-07-02 17:53 ` Jasmine Strong
@ 2003-07-02 20:10 ` Alex Samoutin
2003-07-04 1:43 ` David Woodhouse
1 sibling, 0 replies; 31+ messages in thread
From: Alex Samoutin @ 2003-07-02 20:10 UTC (permalink / raw)
To: Jasmine Strong; +Cc: linux-mtd
>On Wednesday, Jul 2, 2003, at 18:43 Europe/London, Alex Samoutin wrote:
>> Sorry for big delay with answer – I had no hardware to test. Now I got
>> my
>> CerfCube 405ep back and can play with it.
>>
>> - NAND chip Toshiba TC58256AFT
>
>The timings for the 405 EBIU bus do not match the required timings for
>the Toshiba chip's read cycle.
Why not? For any CS line you can configure the timings as you want. Could you
explain what exactly doesn't match the Toshiba chip's read cycle?
>We ended up putting the !RE pin onto a GPIO, but this caused (huge)
>problems with interrupts and so forth.
Sorry, but I don't understand. You can just disable interrupts for this
line.
Alex Samoutin
Intrinsyc Software
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-07-02 17:53 ` Jasmine Strong
2003-07-02 20:10 ` Alex Samoutin
@ 2003-07-04 1:43 ` David Woodhouse
1 sibling, 0 replies; 31+ messages in thread
From: David Woodhouse @ 2003-07-04 1:43 UTC (permalink / raw)
To: Jasmine Strong; +Cc: Alex Samoutin, tglx, linux-mtd, Thayne Harbaugh
On Wed, 2003-07-02 at 18:53, Jasmine Strong wrote:
> The timings for the 405 EBIU bus do not match the required timings for
> the Toshiba chip's read cycle. There is no good solution to this
> problem.
>
> We ended up putting the !RE pin onto a GPIO, but this caused (huge)
> problems with interrupts and so forth.
Hmmm. Perhaps this is a situation in which using something like a
DiskOnChip might be useful. The DiskOnChip ASIC isolates the flash bus
from the host and gives you a sensible pipeline for data transfer;
dual-host-cycle read/write accesses when appropriate.
Otherwise yes, you need to invent the same kind of thing to meet the
timing constraints of the NAND chip.
--
dwmw2
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: Fw: corrupt my NAND flash device
2003-07-02 17:43 ` Alex Samoutin
2003-07-02 17:53 ` Jasmine Strong
@ 2003-07-03 5:44 ` Stephan Linke
2003-07-05 15:15 ` Thomas Gleixner
1 sibling, 1 reply; 31+ messages in thread
From: Stephan Linke @ 2003-07-03 5:44 UTC (permalink / raw)
To: Alex Samoutin; +Cc: Linux-Mtd
Hi Alex,
I just had a look at this mail, and one of the problems you mention reminds me
of my experiences a few months ago.
You say that write verify on your NAND flash sometimes fails. I had the same on
my board using YAFFS on NAND. I figured out that verify fails during partial
writes. The reason is that the compare routine doesn't deal with the special
situation where there is already some data in the page and someone writes only
a few 0s, leaving the rest at 0xFF, since the manual says to do so in partial
writes.
To deal with that, write verify would have to read the page first and, after
writing the new data, check whether the data read back is "reasonable" compared
to the original data and the new data. Unfortunately this test can't show you
whether the page content is what you want it to be, but it checks whether the
result is reasonable under the given preconditions...
This version of write verify is much more complicated and will take a lot more
time, but on the other hand it is the only correct way, and write verify is
mainly for "debugging" anyway.
I didn't keep the modifications in my code (I just switched off write verify
instead), so I can't send you a simple patch, but it's not that difficult to do
anyway. It's only that the patch is quite ugly.
Stephan
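One way the "reasonable" check described above might look is sketched below.
It is not the lost patch. It assumes that programming NAND can only clear
bits (1 to 0), so the page should afterwards hold the bitwise AND of its old
contents and the buffer just written. The function name
verify_partial_write() and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical partial-write-aware verify.
 *   before  - page contents read back before programming
 *   written - buffer passed to the program command (0xFF = leave untouched)
 *   after   - page contents read back after programming
 * Programming can only turn 1 bits into 0 bits, so every byte should now
 * equal (before & written). Returns 0 if the page is plausible, -1 if not.
 */
int verify_partial_write(const uint8_t *before, const uint8_t *written,
                         const uint8_t *after, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (after[i] != (uint8_t)(before[i] & written[i]))
            return -1;
    }
    return 0;
}
```

As noted above, this only shows that the result is consistent with the old
and new data. It costs an extra page read per write and cannot tell whether
the page now holds what the caller actually intended.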
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-07-03 5:44 ` Stephan Linke
@ 2003-07-05 15:15 ` Thomas Gleixner
2003-07-07 9:27 ` Stephan Linke
0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2003-07-05 15:15 UTC (permalink / raw)
To: Stephan Linke, Alex Samoutin; +Cc: Linux-Mtd
On Thursday 03 July 2003 07:44, Stephan Linke wrote:
> Hi Alex,
>
> I just had a look at this mail and one of the problems you mention just
> reminds me of me experiences a few month ago. You say that write verify on
> you NAND flash sometimes failes. I had the same on my board using YAFFS on
> a NAND. I figured out that verify failes during partial writes. The reason
> the compair routine doesn'T deal with that special situation where there is
> already some data in the page and someone writes only a few 0's leaving the
> rest at 0xFF since the manual says to do so in partial writes.
We have canceled partial page writes some time ago.
--
Thomas
________________________________________________________________________
linutronix - competence in embedded & realtime linux
http://www.linutronix.de
mail: tglx@linutronix.de
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: Fw: corrupt my NAND flash device
2003-07-05 15:15 ` Thomas Gleixner
@ 2003-07-07 9:27 ` Stephan Linke
2003-07-07 13:48 ` Thomas Gleixner
0 siblings, 1 reply; 31+ messages in thread
From: Stephan Linke @ 2003-07-07 9:27 UTC (permalink / raw)
To: tglx, Alex Samoutin; +Cc: Linux-Mtd
Hi Thomas,
From my experience, partial page writes are still used by YAFFS in the OOB
area. And I don't know how you are doing a journaling FS on NAND without
partial writes.
Stephan
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-07-07 9:27 ` Stephan Linke
@ 2003-07-07 13:48 ` Thomas Gleixner
2003-07-08 7:50 ` David Woodhouse
0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2003-07-07 13:48 UTC (permalink / raw)
To: Stephan Linke, Alex Samoutin; +Cc: Linux-Mtd
On Monday 07 July 2003 11:27, Stephan Linke wrote:
> Hi Thomas,
>
> From my experience partial page write is still used by YAFFS in OOB area.
I see, I meant the data area. There we don't use partial programming. For the
OOB area it works with JFFS2. I have to check for YAFFS.
> And I don't know how you are doing a journaling FS on a NAND without
> partial writes.
By writing full data pages all the time.
--
Thomas
________________________________________________________________________
linutronix - competence in embedded & realtime linux
http://www.linutronix.de
mail: tglx@linutronix.de
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-07-07 13:48 ` Thomas Gleixner
@ 2003-07-08 7:50 ` David Woodhouse
0 siblings, 0 replies; 31+ messages in thread
From: David Woodhouse @ 2003-07-08 7:50 UTC (permalink / raw)
To: tglx; +Cc: Alex Samoutin, Linux-Mtd, Stephan Linke
On Mon, 2003-07-07 at 14:48, Thomas Gleixner wrote:
> I see, I meant the data area. There we don't use partial programming. For the
> oob area it works with JFFS2.
Not always. I've encountered some Toshiba NAND chips which zero the
entire data page and the unwritten part of the OOB area if you try a
partial OOB write. In fact, even if you try a complete OOB write they
zero the data page.
Issuing a RESET before the write command fixes this.
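As a rough illustration of that workaround, the opcode order for an OOB write
on a small-page part might look like the sketch below. The opcode values are
the standard small-page NAND ones. The helper oob_write_cmds() is
hypothetical, leaves out the address and data cycles, and is not the actual
driver change.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard small-page NAND opcodes. */
enum {
    CMD_READOOB  = 0x50, /* set the pointer to the OOB area */
    CMD_SEQIN    = 0x80, /* serial data input */
    CMD_PAGEPROG = 0x10, /* start programming */
    CMD_RESET    = 0xFF
};

/*
 * Hypothetical helper: fill `seq` (room for at least 4 bytes) with the
 * opcodes issued for an OOB write, optionally preceded by RESET.
 * Address and data cycles between SEQIN and PAGEPROG are omitted.
 * Returns the number of opcodes stored.
 */
size_t oob_write_cmds(uint8_t *seq, int reset_first)
{
    size_t n = 0;

    if (reset_first)
        seq[n++] = CMD_RESET;
    seq[n++] = CMD_READOOB;
    seq[n++] = CMD_SEQIN;
    seq[n++] = CMD_PAGEPROG;
    return n;
}
```

The only difference between the two variants is the leading 0xFF, which
presumably puts the chip back into a known state before the pointer and
program commands are issued.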
--
dwmw2
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-25 22:23 ` Alex Samoutin
2003-04-25 23:10 ` Thayne Harbaugh
@ 2003-04-26 10:18 ` Jörn Engel
2003-04-28 8:57 ` Thomas Gleixner
1 sibling, 1 reply; 31+ messages in thread
From: Jörn Engel @ 2003-04-26 10:18 UTC (permalink / raw)
To: Alex Samoutin; +Cc: David Woodhouse, Thomas Gleixner, linux-mtd
On Fri, 25 April 2003 15:23:36 -0700, Alex Samoutin wrote:
>
> This is diff file for drivers/mtd/nand.c
Thanks!
> (I made to many changes to create a patch)
Does that mean you cannot "diff -u"? I looked briefly at checking your
changes in, but the non-uniform format means a lot more work.
David, Thomas, any objections against Alex' changes?
Jörn
--
You can't tell where a program is going to spend its time. Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you've proven that's where the bottleneck is.
-- Rob Pike
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Fw: corrupt my NAND flash device
2003-04-26 10:18 ` Jörn Engel
@ 2003-04-28 8:57 ` Thomas Gleixner
0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2003-04-28 8:57 UTC (permalink / raw)
To: Jörn Engel, Alex Samoutin; +Cc: linux-mtd, David Woodhouse
On Saturday 26 April 2003 12:18, Jörn Engel wrote:
> On Fri, 25 April 2003 15:23:36 -0700, Alex Samoutin wrote:
> > This is diff file for drivers/mtd/nand.c
>
> Thanks!
>
> > (I made to many changes to create a patch)
>
> Does that mean you cannot "diff -u"? I looked briefly at checking your
> changes in, but the non-uniform format means a lot more work.
>
> David, Thomas, any objections against Alex' changes?
Yes, as I can't see any necessity for an additional lock.
I'm not willing to buy that a global lock is the solution for something that
breaks a per-chip lock. If there is an issue with the chip locking, then that
issue has to be solved and nothing else.
Can anybody please explain how you manage to break the locking and the command
order?
--
Thomas
________________________________________________________________________
linutronix - competence in embedded & realtime linux
http://www.linutronix.de
mail: tglx@linutronix.de
^ permalink raw reply [flat|nested] 31+ messages in thread
* Fw: corrupt my NAND flash device
@ 2003-08-18 11:36 Eugeny Mints
0 siblings, 0 replies; 31+ messages in thread
From: Eugeny Mints @ 2003-08-18 11:36 UTC (permalink / raw)
To: samoutin; +Cc: David Woodhouse, linux-mtd
Alex, all,
>Alex Samoutin samoutin at hotbox.ru
>Wed Jul 2 11:43:11 BST 2003
>Hi Thomas,
>Sorry for big delay with answer - I had no hardware to test. Now I got my
>CerfCube 405ep back and can play with it.
>So I had two problems
> 1. Write verify sometimes fail
> 2. Write operation during erase sometimes cause data corruption
Could you please describe the test which detects the first problem? I see the
same thing, but it occurs very rarely and not reliably. I'd like to be able to
reproduce the bug predictably.
>Hardware details :
>- NAND chip Toshiba TC58256AFT
>- ALE/CLE and CE connected to GPIO
>- R/B pin connected to GPIO and I use ready function which reads it pin.
>- I played with different timings - even slowest setting gets me the same
>result
>1-st problem was solved by applying new MTD snapshot (Jun 26). It's look
>like nand_deselect(); nand_select() fixes the problem.
Interesting - this fix doesn't help me :( (I have a MIPS Au1100 and a Toshiba
TC58256AFTI.)
Moreover, I discovered that the fix proposed by David in the latest (August)
snapshots (reset before the write command) hangs my system during eraseall :(
>However after applying new MTD release the 2-nd problem still remained.
>Then I comment out erase abort in nand_get_chip (as you suggested) and it
>fixes my second problem!
>Could you remove this erase abort from MTD source? I think it will not
>affect much on efficiency.
My second problem is that the system hangs if the device has been filled once
and an attempt is made to re-use previously used sectors. Commenting out the
erase abort in nand_get_chip fixes my second problem too.
Regards,
Eugeny
^ permalink raw reply [flat|nested] 31+ messages in thread
* Fw: corrupt my NAND flash device
@ 2003-04-22 7:05 Paul Wong
0 siblings, 0 replies; 31+ messages in thread
From: Paul Wong @ 2003-04-22 7:05 UTC (permalink / raw)
To: linux-mtd
Hi all!
I installed the YAFFS file system on a NAND flash (Samsung 16MB) and tested
its reliability. I split the device into 3 partitions (4MB, 4MB and 8MB) and
copied a 5 MB file to the directory where the 8MB YAFFS partition is mounted.
Checking the disk space then showed 3 MB free.
Next, I erased the file and copied another 5 MB file to it. The device said
"no enough spare". Then I used "mkyaffs" to format it, and it showed many bad
blocks in the device. I checked OOB[5] (the bad block flag) and it is set to
0x00. Why? Why does YAFFS mark healthy blocks as bad after a file is erased?
Does YAFFS not support big files? Any ideas?
Thanks
Paul
^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2003-08-18 11:37 UTC | newest]
Thread overview: 31+ messages
2003-04-22 20:03 Fw: corrupt my NAND flash device Alex Samoutin
2003-04-22 20:26 ` Jörn Engel
2003-04-22 20:59 ` Jörn Engel
2003-04-23 20:45 ` Charles Manning
2003-04-24 18:25 ` Alex Samoutin
2003-04-25 13:01 ` Jörn Engel
2003-04-25 22:23 ` Alex Samoutin
2003-04-25 23:10 ` Thayne Harbaugh
2003-04-26 10:23 ` Jörn Engel
2003-04-28 15:02 ` Thayne Harbaugh
2003-04-28 21:14 ` Charles Manning
2003-04-28 22:59 ` Thomas Gleixner
2003-04-29 1:23 ` Charles Manning
2003-04-29 8:03 ` Thomas Gleixner
2003-04-29 19:37 ` Charles Manning
2003-04-29 22:04 ` Thomas Gleixner
2003-04-30 16:54 ` Alex Samoutin
2003-04-30 18:13 ` Thomas Gleixner
2003-07-02 17:43 ` Alex Samoutin
2003-07-02 17:53 ` Jasmine Strong
2003-07-02 20:10 ` Alex Samoutin
2003-07-04 1:43 ` David Woodhouse
2003-07-03 5:44 ` Stephan Linke
2003-07-05 15:15 ` Thomas Gleixner
2003-07-07 9:27 ` Stephan Linke
2003-07-07 13:48 ` Thomas Gleixner
2003-07-08 7:50 ` David Woodhouse
2003-04-26 10:18 ` Jörn Engel
2003-04-28 8:57 ` Thomas Gleixner
-- strict thread matches above, loose matches on Subject: below --
2003-08-18 11:36 Eugeny Mints
2003-04-22 7:05 Paul Wong