public inbox for linux-mtd@lists.infradead.org
 help / color / mirror / Atom feed
* cfi_cmdset_0001.c: Excessive erase suspends
@ 2008-04-17 20:38 Anders Grafström
  2008-04-18 15:02 ` Alexey Korolev
  0 siblings, 1 reply; 12+ messages in thread
From: Anders Grafström @ 2008-04-17 20:38 UTC (permalink / raw)
  To: Linux-MTD Mailing List

With recent kernels I've been seeing a lot of these
"Newly-erased block contained word 0xffff0000 at offset 0x00180000"
on a board using Intel 28F640J5 flash chips.

It looks like the errors are caused by large amounts of erase suspends.
Each erase gets suspended around 8500 times and in some extreme cases
a lot more. The erase ends without any error bits set but it turns out
that it has failed.

It seems like some flash chips have a limit on the number of times that
the erase can be suspended. I have not seen any information on the Intel
chips but a Spansion AppNote says 5,980 times for some of their devices
before running the risk of an erase fail.

So I'm guessing that erase suspends need to be paced somehow?

/Anders

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: cfi_cmdset_0001.c: Excessive erase suspends
  2008-04-17 20:38 cfi_cmdset_0001.c: Excessive erase suspends Anders Grafström
@ 2008-04-18 15:02 ` Alexey Korolev
  2008-04-18 16:35   ` Jamie Lokier
  0 siblings, 1 reply; 12+ messages in thread
From: Alexey Korolev @ 2008-04-18 15:02 UTC (permalink / raw)
  To: Anders Grafström; +Cc: Linux-MTD Mailing List

Hi Anders,

> With recent kernels I've been seeing a lot of these
> "Newly-erased block contained word 0xffff0000 at offset 0x00180000"
> on a board using Intel 28F640J5 flash chips.
> 
> It looks like the errors are caused by large amounts of erase suspends.
> Each erase gets suspended around 8500 times and in some extreme cases
> a lot more. The erase ends without any error bits set but it turns out
> that it has failed.
> 
> It seems like some flash chips have a limit on the number of times that
> the erase can be suspended. I have not seen any information on the Intel
> chips but a Spansion AppNote says 5,980 times for some of their devices
> before running the risk of an erase fail.
> 
We saw the similar  problem in our tests. As a possible solution I could
suggest to disable erase suspend on write. 

Regarding limit of suspend/resume cycles: it is rather unclear how many cycles would be ok how many cycles would be not. 
Special investigations are required here. 

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: cfi_cmdset_0001.c: Excessive erase suspends
  2008-04-18 15:02 ` Alexey Korolev
@ 2008-04-18 16:35   ` Jamie Lokier
  2008-04-18 17:54     ` Jared Hulbert
  0 siblings, 1 reply; 12+ messages in thread
From: Jamie Lokier @ 2008-04-18 16:35 UTC (permalink / raw)
  To: Alexey Korolev; +Cc: Linux-MTD Mailing List, Anders Grafström

Alexey Korolev wrote:
> > "Newly-erased block contained word 0xffff0000 at offset 0x00180000"
> > on a board using Intel 28F640J5 flash chips.
> > 
> > It looks like the errors are caused by large amounts of erase suspends.
> > Each erase gets suspended around 8500 times and in some extreme cases
> > a lot more. The erase ends without any error bits set but it turns out
> > that it has failed.
> > 
> > It seems like some flash chips have a limit on the number of times that
> > the erase can be suspended. I have not seen any information on the Intel
> > chips but a Spansion AppNote says 5,980 times for some of their devices
> > before running the risk of an erase fail.

That's very interesting, thanks.

> We saw the similar  problem in our tests. As a possible solution I could
> suggest to disable erase suspend on write. 

That's quite bad for write latency, though.  Adding a suspend cycle
counter, and disabling suspend on write when it reaches a certain
number sounds better.

> Regarding limit of suspend/resume cycles: it is rather unclear how
> many cycles would be ok how many cycles would be not.  Special
> investigations are required here.

That's interesting too.

  - Do other chip docs say how many cycles are acceptable?
    Is there a count we can assume is safe for all devices of this
    type - like say 100?

  - Does the time spent in erase suspend matter?  E.g. if it was
    suspended for 1 minute due to lots of pending writes, restarted,
    and then suspended _again_ for 1 minute, etc. does that reduce the
    number of safe suspend-resume cycles due to the unstable
    partially-erased physical state?

  - Is it worth reading a block after erasing it, to verify that it's
    wiped - and mark blocks which have experienced >threshold suspend
    cycles as needing verification and re-erase, rather than meaning
    it's a bad block? ( Verification could be done lazily, on each part
    of the block just before writing. )  But is this good physically,
    or does too many suspends put the block into an unreliable state
    even if it does pass verification, so that it's important to limit
    the suspends rather than allow many and verify afterwards?

-- Jamie

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: cfi_cmdset_0001.c: Excessive erase suspends
  2008-04-18 16:35   ` Jamie Lokier
@ 2008-04-18 17:54     ` Jared Hulbert
  2008-04-18 22:11       ` Anders Grafström
  0 siblings, 1 reply; 12+ messages in thread
From: Jared Hulbert @ 2008-04-18 17:54 UTC (permalink / raw)
  To: Alexey Korolev, Anders Grafström, Linux-MTD Mailing List

Anders,
Just to make sure we aren't overlooking an easier problem:

1) the original report is a JFFS2 error, not direct observation of the
mtd partition.

2) Did you check that the MTD is configured to treat this flash with
the correct bus width?  I've been burned trying to figure out why a
16bit configuration was missing half the data, turned out I had it
configured for 32bit MTD accesses.

3) I'd like to see that you can't use flash_eraseall and hexdump
/dev/mtdX to see this behavior.  Or maybe you could try to create a
simple test that would suspend an erase 10K times and verify the erase
looking for errors.

Before we go off implement solutions lets verify that we understand
the problem by implementing some kind a repeatable test condition to
trigger this.  However, if there is a problem with excessive
suspending...

>   - Do other chip docs say how many cycles are acceptable?
>     Is there a count we can assume is safe for all devices of this
>     type - like say 100?

A number like that would probably serve to punish many devices while
failing to be safe for "all devices".  I'm not a big fan of this
approach.

>   - Does the time spent in erase suspend matter?  E.g. if it was
>     suspended for 1 minute due to lots of pending writes, restarted,
>     and then suspended _again_ for 1 minute, etc. does that reduce the
>     number of safe suspend-resume cycles due to the unstable
>     partially-erased physical state?

I'm pretty sure that any difficultly with suspends is more about the
time between suspends rather than the time _in_ suspend.  Rather than
minimizing the time in suspend, maximize the time out of suspend.
Think of the erase being suspended as a process being context
switched.  If you context switch to quickly all you do is the
switching, no real work in the process.  You keep that up to long, you
_could_ have a problem.  A solution might be an small delay before a
write suspends the erase.

>   - Is it worth reading a block after erasing it, to verify that it's
>     wiped - and mark blocks which have experienced >threshold suspend
>     cycles as needing verification and re-erase, rather than meaning
>     it's a bad block? ( Verification could be done lazily, on each part
>     of the block just before writing. )  But is this good physically,
>     or does too many suspends put the block into an unreliable state
>     even if it does pass verification, so that it's important to limit
>     the suspends rather than allow many and verify afterwards?

Similar to what you are thinking, I believe that even _if_  you can
put a block in a funky state by excessive suspending the part would
still read cleanly.  I don't think this is worth much.

Excessive suspending is much more likely to trigger a failed erase,
with an error condition being reported by the chip.  The MTD will
return an error in that case so you don't really need to look for an
error.  It's kind of in your face.  I'd be very surprised if can you
can get a consistent behavior like what you describe just by
suspending.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: cfi_cmdset_0001.c: Excessive erase suspends
  2008-04-18 17:54     ` Jared Hulbert
@ 2008-04-18 22:11       ` Anders Grafström
  2008-04-19  2:47         ` Jared Hulbert
  2008-04-19  9:18         ` Joakim Tjernlund
  0 siblings, 2 replies; 12+ messages in thread
From: Anders Grafström @ 2008-04-18 22:11 UTC (permalink / raw)
  To: Jared Hulbert, Alexey Korolev, Linux-MTD Mailing List

Jared Hulbert wrote:
> 2) Did you check that the MTD is configured to treat this flash with
> the correct bus width?  I've been burned trying to figure out why a
> 16bit configuration was missing half the data, turned out I had it
> configured for 32bit MTD accesses.

I think I got it right.

> 3) I'd like to see that you can't use flash_eraseall and hexdump
> /dev/mtdX to see this behavior.  Or maybe you could try to create a
> simple test that would suspend an erase 10K times and verify the erase

See below.

> A solution might be an small delay before a
> write suspends the erase.

udelay(100) right before the suspend command seems to work for me.
It reduces the suspend count to about 1700 and no errors reported
by JFFS2.


I ran this routine in a standalone test program from u-boot:

void test_flash_erase_suspend(void)
{
	volatile unsigned int *flash = (unsigned int *)0x41040000;
	unsigned int i;
	unsigned int suspends = 0;
	unsigned int words = 10;

	/* Make sure there's something to erase */
	for (i = 0; i < 0x10000; i++) {
		if (flash[i] == 0xffffffff) {
			flash[i] = 0x00400040;
			flash[i] = 0;
			while ((flash[i] & 0x00800080) != 0x00800080);
		}
	}

	/* Clear status */
	flash[0] = 0x00500050;

	/* Erase */
	flash[0] = 0x00200020;
	flash[0] = 0x00d000d0;

	while (1) {
		/* Suspend */
		flash[0] = 0x00b000b0;
		suspends++;

		/* Short delay */
		for (i = 0; i < 1000; i++);

		/* Resume */
		flash[0] = 0x00d000d0;
		flash[0] = 0x00700070;

		/* Erase done ? */
		if ((flash[i] & 0x00800080) == 0x00800080) {
			printf("\nStatus %08x\n", flash[i]);
			break;
		}

		/* Short delay */
		for (i = 0; i < 3500; i++);
	}

	/* Read array */
	flash[0] = 0x00ff00ff;

	/* Show the first 10 failed words */
	for (i = 0; i < 0x10000; i++) {
		if (flash[i] != 0xffffffff) {
			printf("%08x\n", flash[i]);
			if (--words == 0)
				break;
		}
	}

	printf("%d suspends\n", suspends);
}

The delay loop doing 3500 loops seems to be the critical one.

Most of the time I get something like this, which I assume
means 'Operation abort'.

Status 00a000a0
aa695555
695a5555
5a555555
a56a5555
59555555
595a5555
6aa65555
69a65555
69595555
95965555
5647 suspends

But sometimes I get this:

Status 00a00080
aa690000
696a0000
5a950000
a66a0000
59590000
595a0000
6a960000
69a60000
69590000
95960000
5625 suspends

Here's a link to the document that I previously referred to:
http://www.spansion.com/application_notes/erase_susp_appnote_00_a1_e.pdf

/Anders

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: cfi_cmdset_0001.c: Excessive erase suspends
  2008-04-18 22:11       ` Anders Grafström
@ 2008-04-19  2:47         ` Jared Hulbert
  2008-04-19  9:18         ` Joakim Tjernlund
  1 sibling, 0 replies; 12+ messages in thread
From: Jared Hulbert @ 2008-04-19  2:47 UTC (permalink / raw)
  To: Anders Grafström; +Cc: Linux-MTD Mailing List, Alexey Korolev

> > A solution might be an small delay before a
> > write suspends the erase.
> >
>
>  udelay(100) right before the suspend command seems to work for me.
>  It reduces the suspend count to about 1700 and no errors reported
>  by JFFS2.

good.

Not sure how/if that sort of fix can be merged.

So you didn't see this kind of problem with older kernels?

>  I ran this routine in a standalone test program from u-boot:
>
>  void test_flash_erase_suspend(void)
>  {
>  }
>
>  The delay loop doing 3500 loops seems to be the critical one.

Interesting.  I'm going to talk with some people...

>  Most of the time I get something like this, which I assume
>  means 'Operation abort'.
>
>  Status 00a000a0

Yes, bit 5 is erase error.  You've got to expect data to be bad under
those circumstances.

>  Status 00a00080
>  aa690000

you have 2 parts in parallel, right?  So the chip on the high order
word is expected to be garbage, erase error and all.  But the lower
word chip confuses me, that shouldn't happen.

Alexey, what do you think?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: cfi_cmdset_0001.c: Excessive erase suspends
  2008-04-18 22:11       ` Anders Grafström
  2008-04-19  2:47         ` Jared Hulbert
@ 2008-04-19  9:18         ` Joakim Tjernlund
  2008-04-19 13:47           ` Anders Grafström
  1 sibling, 1 reply; 12+ messages in thread
From: Joakim Tjernlund @ 2008-04-19  9:18 UTC (permalink / raw)
  To: 'Anders Grafström', 'Jared Hulbert',
	'Alexey Korolev', 'Linux-MTD Mailing List'



> -----Original Message-----
> From: linux-mtd-bounces@lists.infradead.org [mailto:linux-mtd-bounces@lists.infradead.org] On Behalf
> Of Anders Grafström
> Sent: den 19 april 2008 00:11
> To: Jared Hulbert; Alexey Korolev; Linux-MTD Mailing List
> Subject: Re: cfi_cmdset_0001.c: Excessive erase suspends
> 
> Jared Hulbert wrote:
> > 2) Did you check that the MTD is configured to treat this flash with
> > the correct bus width?  I've been burned trying to figure out why a
> > 16bit configuration was missing half the data, turned out I had it
> > configured for 32bit MTD accesses.
> 
> I think I got it right.
> 
> > 3) I'd like to see that you can't use flash_eraseall and hexdump
> > /dev/mtdX to see this behavior.  Or maybe you could try to create a
> > simple test that would suspend an erase 10K times and verify the erase
> 
> See below.
> 
> > A solution might be an small delay before a
> > write suspends the erase.
> 
> udelay(100) right before the suspend command seems to work for me.
> It reduces the suspend count to about 1700 and no errors reported
> by JFFS2.
> 
> 
> I ran this routine in a standalone test program from u-boot:
> 
> void test_flash_erase_suspend(void)
> {
> 	volatile unsigned int *flash = (unsigned int *)0x41040000;
> 	unsigned int i;
> 	unsigned int suspends = 0;
> 	unsigned int words = 10;
> 
> 	/* Make sure there's something to erase */
> 	for (i = 0; i < 0x10000; i++) {
> 		if (flash[i] == 0xffffffff) {
> 			flash[i] = 0x00400040;
> 			flash[i] = 0;
> 			while ((flash[i] & 0x00800080) != 0x00800080);
> 		}
> 	}
> 
> 	/* Clear status */
> 	flash[0] = 0x00500050;
> 
> 	/* Erase */
> 	flash[0] = 0x00200020;
> 	flash[0] = 0x00d000d0;
> 
> 	while (1) {
> 		/* Suspend */
> 		flash[0] = 0x00b000b0;
> 		suspends++;
> 
> 		/* Short delay */
> 		for (i = 0; i < 1000; i++);
> 

Don't you need to check that the chip is suspended before you do any
new operation?

> 		/* Resume */
> 		flash[0] = 0x00d000d0;
> 		flash[0] = 0x00700070;
> 
> 		/* Erase done ? */
> 		if ((flash[i] & 0x00800080) == 0x00800080) {
> 			printf("\nStatus %08x\n", flash[i]);
> 			break;
> 		}
> 
> 		/* Short delay */
> 		for (i = 0; i < 3500; i++);
> 	}
> 
> 	/* Read array */
> 	flash[0] = 0x00ff00ff;
> 
> 	/* Show the first 10 failed words */
> 	for (i = 0; i < 0x10000; i++) {
> 		if (flash[i] != 0xffffffff) {
> 			printf("%08x\n", flash[i]);
> 			if (--words == 0)
> 				break;
> 		}
> 	}
> 
> 	printf("%d suspends\n", suspends);
> }

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: cfi_cmdset_0001.c: Excessive erase suspends
  2008-04-19  9:18         ` Joakim Tjernlund
@ 2008-04-19 13:47           ` Anders Grafström
  2008-04-19 17:01             ` Joakim Tjernlund
  2008-04-24 14:34             ` Alexey Korolev
  0 siblings, 2 replies; 12+ messages in thread
From: Anders Grafström @ 2008-04-19 13:47 UTC (permalink / raw)
  To: Joakim Tjernlund, 'Jared Hulbert',
	'Alexey Korolev', 'Linux-MTD Mailing List'

Joakim Tjernlund wrote:
> Don't you need to check that the chip is suspended before you do any
> new operation?

You're right.

Here's the improved version;

static void erase_suspend(unsigned int loops)
{
	volatile unsigned int *flash = (unsigned int *)0x41040000;
	unsigned int i;
	unsigned int suspends = 0;
	unsigned int words = 2;
	unsigned int timeout;

	/* Make sure there's something to erase */
	for (i = 0; i < 0x10000; i++) {
		if (flash[i] == 0xffffffff) {
			flash[i] = 0x00400040;
			flash[i] = 0;
			while ((flash[i] & 0x00800080) != 0x00800080);
		}
	}

	/* Clear status */
	flash[0] = 0x00500050;

	/* Erase */
	flash[0] = 0x00200020;
	flash[0] = 0x00d000d0;

	while (1) {
		/* Suspend */
		flash[0] = 0x00b000b0;
		flash[0] = 0x00700070;
		timeout = 100000;
		while ((flash[0] & 0x00800080) != 0x00800080 && --timeout);
		if (!timeout)
			printf("suspend timeout\n");
		suspends++;

		/* Short delay */
		for (i = 0; i < 1000; i++);

		/* Resume */
		flash[0] = 0x00d000d0;
		flash[0] = 0x00700070;

		/* Erase done ? */
		if ((flash[0] & 0x00800080) == 0x00800080) {
			printf("\nstatus %08x\n", flash[0]);
			break;
		}

		/* Short delay */
		for (i = 0; i < loops; i++);
	}

	/* Read array */
	flash[0] = 0x00ff00ff;

	/* Show the first 2 failed words */
	for (i = 0; i < 0x10000; i++) {
		if (flash[i] != 0xffffffff) {
			printf("%08x\n", flash[i]);
			if (--words == 0)
				break;
		}
	}

	printf("%d suspends\n", suspends);
}

void test_flash_erase_suspend(void)
{
	unsigned int loops;

	for (loops = 6000; loops > 2000; loops -= 500) {
		printf("\n%d loops", loops);
		erase_suspend(loops);
	}
}


It produced this output at one occasion:

6000 loops
status 00800080
ffff0000
ffff0000
4783 suspends

5500 loops
status 00800080
5508 suspends

5000 loops
status 008000a0
6394 suspends

4500 loops
status 00a000a0
fffffeff
fffffffe
5169 suspends

4000 loops
status 00a000a0
fffffeff
fffffeff
5378 suspends

3500 loops
status 00a000a0
aaaaaaaa
aaaaaaaa
5715 suspends

3000 loops
status 00a000a0
55555555
55555555
6142 suspends

2500 loops
status 00a000a0
00000000
00000000
6929 suspends

It locks up when I drop it down to 2000 or less iterations in the loop between resume and suspend.
I think I need to do more testing with properly timed delays.

Is there anything else that I have missed?

/Anders

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: cfi_cmdset_0001.c: Excessive erase suspends
  2008-04-19 13:47           ` Anders Grafström
@ 2008-04-19 17:01             ` Joakim Tjernlund
  2008-04-24 14:34             ` Alexey Korolev
  1 sibling, 0 replies; 12+ messages in thread
From: Joakim Tjernlund @ 2008-04-19 17:01 UTC (permalink / raw)
  To: 'Anders Grafström', 'Jared Hulbert',
	'Alexey Korolev', 'Linux-MTD Mailing List'



> -----Original Message-----
> From: Anders Grafström [mailto:grfstrm@users.sourceforge.net]
> Sent: den 19 april 2008 15:47
> To: Joakim Tjernlund; 'Jared Hulbert'; 'Alexey Korolev'; 'Linux-MTD Mailing List'
> Subject: Re: cfi_cmdset_0001.c: Excessive erase suspends
> 
> Joakim Tjernlund wrote:
> > Don't you need to check that the chip is suspended before you do any
> > new operation?
> 
> You're right.
> 
> Here's the improved version;
> 
> static void erase_suspend(unsigned int loops)
> {
> 	volatile unsigned int *flash = (unsigned int *)0x41040000;
> 	unsigned int i;
> 	unsigned int suspends = 0;
> 	unsigned int words = 2;
> 	unsigned int timeout;
> 
> 	/* Make sure there's something to erase */
> 	for (i = 0; i < 0x10000; i++) {
> 		if (flash[i] == 0xffffffff) {
> 			flash[i] = 0x00400040;
> 			flash[i] = 0;
> 			while ((flash[i] & 0x00800080) != 0x00800080);
> 		}
> 	}
> 
> 	/* Clear status */
> 	flash[0] = 0x00500050;
> 
> 	/* Erase */
> 	flash[0] = 0x00200020;
> 	flash[0] = 0x00d000d0;
> 
> 	while (1) {
> 		/* Suspend */
> 		flash[0] = 0x00b000b0;
> 		flash[0] = 0x00700070;
> 		timeout = 100000;
> 		while ((flash[0] & 0x00800080) != 0x00800080 && --timeout);
> 		if (!timeout)
> 			printf("suspend timeout\n");
> 		suspends++;
> 
> 		/* Short delay */
> 		for (i = 0; i < 1000; i++);
> 
> 		/* Resume */
> 		flash[0] = 0x00d000d0;
> 		flash[0] = 0x00700070;
> 
> 		/* Erase done ? */
> 		if ((flash[0] & 0x00800080) == 0x00800080) {
> 			printf("\nstatus %08x\n", flash[0]);
> 			break;
> 		}
> 
> 		/* Short delay */
> 		for (i = 0; i < loops; i++);
> 	}
> 
> 	/* Read array */
> 	flash[0] = 0x00ff00ff;
> 
> 	/* Show the first 2 failed words */
> 	for (i = 0; i < 0x10000; i++) {
> 		if (flash[i] != 0xffffffff) {
> 			printf("%08x\n", flash[i]);
> 			if (--words == 0)
> 				break;
> 		}
> 	}
> 
> 	printf("%d suspends\n", suspends);
> }
> 
> void test_flash_erase_suspend(void)
> {
> 	unsigned int loops;
> 
> 	for (loops = 6000; loops > 2000; loops -= 500) {
> 		printf("\n%d loops", loops);
> 		erase_suspend(loops);
> 	}
> }
> 
> 
> It produced this output at one occasion:
> 
> 6000 loops
> status 00800080
> ffff0000
> ffff0000
> 4783 suspends
> 
> 5500 loops
> status 00800080
> 5508 suspends
> 
> 5000 loops
> status 008000a0
> 6394 suspends
> 
> 4500 loops
> status 00a000a0
> fffffeff
> fffffffe
> 5169 suspends
> 
> 4000 loops
> status 00a000a0
> fffffeff
> fffffeff
> 5378 suspends
> 
> 3500 loops
> status 00a000a0
> aaaaaaaa
> aaaaaaaa
> 5715 suspends
> 
> 3000 loops
> status 00a000a0
> 55555555
> 55555555
> 6142 suspends
> 
> 2500 loops
> status 00a000a0
> 00000000
> 00000000
> 6929 suspends
> 
> It locks up when I drop it down to 2000 or less iterations in the loop between resume and suspend.
> I think I need to do more testing with properly timed delays.
> 
> Is there anything else that I have missed?

Perhaps, you should make "i" a volatile. That said I know that Intel
had erase suspend bugs in their early strata flash, that’s why there is
a define in cmdset_0001 that disables them.

Maybe you got buggy chips?
oh, just remembered, strata chips needs to have a stable address bus before
CS is pulled. We had add a small delay on CS in the memory controller, otherwise
they would fail, especially when cold. 

> 
> /Anders
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: cfi_cmdset_0001.c: Excessive erase suspends
  2008-04-19 13:47           ` Anders Grafström
  2008-04-19 17:01             ` Joakim Tjernlund
@ 2008-04-24 14:34             ` Alexey Korolev
  2008-04-24 21:02               ` Anders Grafström
  1 sibling, 1 reply; 12+ messages in thread
From: Alexey Korolev @ 2008-04-24 14:34 UTC (permalink / raw)
  To: Anders Grafström; +Cc: 'Linux-MTD Mailing List', Joakim Tjernlund

Hi Anders,

Could you please try this patch on your platform. 
It should solve the issue. 
As I said before we also have a case which reproduces the issue.
After applying it the issue is not seen on test items which usually
fails. (However to prove that it is Ok
we need to execute whole bunch. It will be completed on next day)

diff -aurp linux-2.6.24-a/drivers/mtd/chips/cfi_cmdset_0001.c linux-2.6.24-b/drivers/mtd/chips/cfi_cmdset_0001.c
--- linux-2.6.24-a/drivers/mtd/chips/cfi_cmdset_0001.c	2008-02-11 08:51:11.000000000 +0300
+++ linux-2.6.24-b/drivers/mtd/chips/cfi_cmdset_0001.c	2008-04-24 18:13:00.000000000 +0400
@@ -874,6 +874,10 @@ static void put_chip(struct map_info *ma
 		   do. */
 		map_write(map, CMD(0xd0), adr);
 		map_write(map, CMD(0x70), adr);
+		/* According to recomendations from h/w experts adding wait  
+		   of 100us just after resume solves the erase suspend issues   
+		   seen on some NOR parts. */
+		udelay(100);
 		chip->oldstate = FL_READY;
 		chip->state = FL_ERASING;
 		break;


Thanks,
Alexey

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: cfi_cmdset_0001.c: Excessive erase suspends
  2008-04-24 14:34             ` Alexey Korolev
@ 2008-04-24 21:02               ` Anders Grafström
  2008-04-25  9:59                 ` Alexey Korolev
  0 siblings, 1 reply; 12+ messages in thread
From: Anders Grafström @ 2008-04-24 21:02 UTC (permalink / raw)
  To: Alexey Korolev; +Cc: 'Linux-MTD Mailing List', Joakim Tjernlund

Alexey Korolev wrote:
> Could you please try this patch on your platform. 
> It should solve the issue. 
> As I said before we also have a case which reproduces the issue.

So I'm not the only one seeing it. That's 'good' to know.

> After applying it the issue is not seen on test items which usually
> fails. (However to prove that it is Ok
> we need to execute whole bunch. It will be completed on next day)

Works a lot better but it has failed one time so far with the patch applied.

To answer Jared's question earlier, I've seen the problem with 2.6.18 as well
but it was very rare, a couple of months in between sightings.
With 2.6.23 it got frequent and even worse with 2.6.25.
I'm overwriting a file, size about 6MB, to a file system that is filled to the brim.
It triggers around 25-30 erases and around 10 of those usually fail.

So the problem is probably not the number of suspends but more likely a latency issue?

This patch used to work around the issue but it's not enough any more:

diff --git a/fs/jffs2/erase.c b/fs/jffs2/erase.c
index 5e2719c..be8fc87 100644
--- a/fs/jffs2/erase.c
+++ b/fs/jffs2/erase.c
@@ -18,6 +18,9 @@
  #include <linux/pagemap.h>
  #include "nodelist.h"

+/* max. erase failures before we mark a block bad */
+#define MAX_ERASE_FAILURES 	2
+
  struct erase_priv_struct {
  	struct jffs2_eraseblock *jeb;
  	struct jffs2_sb_info *c;
@@ -190,8 +193,23 @@ static void jffs2_erase_failed(struct jffs2_sb_info *c, struct jffs2_eraseblock
  			mutex_unlock(&c->erase_free_sem);
  			return;
  		}
+	} else if (c->mtd->type != MTD_NANDFLASH) {
+		if( ++jeb->bad_count < MAX_ERASE_FAILURES) {
+			/* We'd like to give this block another try. */
+			printk(KERN_ERR "Retrying erase at 0x%08x\n", jeb->offset);
+			mutex_lock(&c->erase_free_sem);
+			spin_lock(&c->erase_completion_lock);
+			list_move(&jeb->list, &c->erase_pending_list);
+			c->erasing_size -= c->sector_size;
+			c->dirty_size += c->sector_size;
+			jeb->dirty_size = c->sector_size;
+			spin_unlock(&c->erase_completion_lock);
+			mutex_unlock(&c->erase_free_sem);
+			return;
+		}
  	}

+	printk(KERN_ERR "Failed erase at 0x%08x\n", jeb->offset);
  	mutex_lock(&c->erase_free_sem);
  	spin_lock(&c->erase_completion_lock);
  	c->erasing_size -= c->sector_size;
@@ -454,6 +472,7 @@ static void jffs2_mark_erased_block(struct jffs2_sb_info *c, struct jffs2_eraseb
  	}
  	/* Everything else got zeroed before the erase */
  	jeb->free_size = c->sector_size;
+	jeb->bad_count = 0;

  	mutex_lock(&c->erase_free_sem);
  	spin_lock(&c->erase_completion_lock);

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: cfi_cmdset_0001.c: Excessive erase suspends
  2008-04-24 21:02               ` Anders Grafström
@ 2008-04-25  9:59                 ` Alexey Korolev
  0 siblings, 0 replies; 12+ messages in thread
From: Alexey Korolev @ 2008-04-25  9:59 UTC (permalink / raw)
  To: Anders Grafström; +Cc: 'Linux-MTD Mailing List', Joakim Tjernlund

Hi

> > After applying it the issue is not seen on test items which usually
> > fails. (However to prove that it is Ok
> > we need to execute whole bunch. It will be completed on next day)
> 
> Works a lot better but it has failed one time so far with the patch applied.
>
Could you please specify how it fails. I'm not sure that we are talking
about the same issue. Also I need to see the data read just after broken
erase. Could you please read the same data several times just after
broken erase and compare it? Will data in erase block be the same? Is 
you flash configuration interleaved? 

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2008-04-25  9:59 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-04-17 20:38 cfi_cmdset_0001.c: Excessive erase suspends Anders Grafström
2008-04-18 15:02 ` Alexey Korolev
2008-04-18 16:35   ` Jamie Lokier
2008-04-18 17:54     ` Jared Hulbert
2008-04-18 22:11       ` Anders Grafström
2008-04-19  2:47         ` Jared Hulbert
2008-04-19  9:18         ` Joakim Tjernlund
2008-04-19 13:47           ` Anders Grafström
2008-04-19 17:01             ` Joakim Tjernlund
2008-04-24 14:34             ` Alexey Korolev
2008-04-24 21:02               ` Anders Grafström
2008-04-25  9:59                 ` Alexey Korolev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox