Kernel oops after sync command in jffs2

public inbox for linux-mtd@lists.infradead.org
 help / color / mirror / Atom feed

* Kernel oops after sync command in jffs2
@ 2002-05-16 16:04 Frederic Giasson
  2002-05-16 17:34 ` David Woodhouse
  0 siblings, 1 reply; 12+ messages in thread
From: Frederic Giasson @ 2002-05-16 16:04 UTC (permalink / raw)
  To: 'linux-mtd@lists.infradead.org'

Hi,

	I am having a kernel oops when I do a sync command right after a
file erase in jffs2.  It have setup a test which copies a bunch of big files
into my jffs2 drive, then erase them and does a sync before doing another
loop.  My test program never fails at the first loop, but only after a few
tens of loops.

	The test I am running fills jffs2 to 99% full.  I have a 4MB AMD
AM29DL323 CFI flash device, therefore I am using cfi_cmdset_002 chip driver.
Right before the kernel oops message pops, I get this error message from the
chip driver: "waiting for erase to complete timed out.  Erase at 0x320000
failed immediately: errno -5".  Peering down into the chip driver code, I
found that the following contition occurs: 		

		/* The erasing didn't stop?? */
		if ((status & dq5) == dq5) {
			/* dq5 is active so we can do a reset and stop the
erase */

			printk( "Erase internally timed out in flash
device!!!\n" ); // temp by FG

			cfi_write(map, CMD(0xF0), chip->start);
		}

	I know that the sync command triggers pending erases in jffs2.  This
is especially why I use it, because it proved that the "rm" / "cp" cycle of
my files is faster much more constant when I use sync between the "rm" and
the "cp" command.  Is sync command correctly supported by JFFS2?  Anyone
else experienced problems similar to mine?

Thanks in avance,

Frédéric Giasson 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel oops after sync command in jffs2
  2002-05-16 16:04 Kernel oops after sync command in jffs2 Frederic Giasson
@ 2002-05-16 17:34 ` David Woodhouse
  0 siblings, 0 replies; 12+ messages in thread
From: David Woodhouse @ 2002-05-16 17:34 UTC (permalink / raw)
  To: Frederic Giasson; +Cc: 'linux-mtd@lists.infradead.org'

Show oops? Decoded.

--
dwmw2

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Kernel oops after sync command in jffs2
@ 2002-05-17 18:04 Frederic Giasson
  2002-05-18  7:43 ` David Woodhouse
  0 siblings, 1 reply; 12+ messages in thread
From: Frederic Giasson @ 2002-05-17 18:04 UTC (permalink / raw)
  To: 'David Woodhouse'; +Cc: linux-mtd

Hi,

I think I found a possible source for the oops I reported previously this
week.

When the oops happen, the do_erase_oneblock() function if cfi_cmdset_0002.c
complains that DQ5 has been asserted to "1" while the flash was still
erasing, meaning that a program command has been attempted.  

So I examined the chip driver to see if there were a gap in the mutual
exclusion of the chip access.  I found that function cfi_amdstd_write() send
unlock bypass command to the flash without verifying if the flash is ready
first, if cfi->fast_prog is true.  I tryied to disable fast_prog in
cfi_probe.c, and the oops did not happen again.  It sound like it were the
problem, although I am not 100% sure yet.

I will send you the oops trace along with the callback trace if it ever
shows up again.

Frédéric Giasson

|-----Original Message-----
|From: David Woodhouse [mailto:dwmw2@infradead.org]
|Sent: Thursday, May 16, 2002 1:35 PM
|To: Frederic Giasson
|Cc: 'linux-mtd@lists.infradead.org'
|Subject: Re: Kernel oops after sync command in jffs2 
|
|
|
|Show oops? Decoded.
|
|--
|dwmw2
|
|

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel oops after sync command in jffs2
  2002-05-17 18:04 Frederic Giasson
@ 2002-05-18  7:43 ` David Woodhouse
  0 siblings, 0 replies; 12+ messages in thread
From: David Woodhouse @ 2002-05-18  7:43 UTC (permalink / raw)
  To: Frederic Giasson; +Cc: linux-mtd

fgiasson@mediatrix.com said:
>  When the oops happen, the do_erase_oneblock() function if
> cfi_cmdset_0002.c complains that DQ5 has been asserted to "1" while
> the flash was still erasing, meaning that a program command has been
> attempted.  

> So I examined the chip driver to see if there were a gap in the mutual
> exclusion of the chip access.  I found that function
> cfi_amdstd_write() send unlock bypass command to the flash without
> verifying if the flash is ready first, if cfi->fast_prog is true.  I
> tryied to disable fast_prog in cfi_probe.c, and the oops did not
> happen again.  It sound like it were the problem, although I am not
> 100% sure yet.

Thanks. I've disabled fast programming mode completely, until the code gets 
fixed. 

That should have just caused an I/O error though; not an oops.


--
dwmw2

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Kernel oops after sync command in jffs2
@ 2002-05-21 18:54 Frederic Giasson
  2002-05-21 21:07 ` David Woodhouse
  0 siblings, 1 reply; 12+ messages in thread
From: Frederic Giasson @ 2002-05-21 18:54 UTC (permalink / raw)
  To: 'David Woodhouse'; +Cc: 'linux-mtd@lists.infradead.org'

Here is the oops trace:

sync
mtxmap_write16() called when erasing!
waiting for erase to complete timed out.<4>Erase at 0x00030000 failed
immediately: errno -5
Oops: kernel access of bad area, sig: 11
NIP: C000FA20 XER: 20000000 LR: C000FA1C SP: C1B79E10 REGS: c1b79d60 TRAP:
0300    Not tainted
MSR: 00001032 EE: 0 PR: 0 FP: 0 ME: 1 IR/DR: 11
DAR: 00000000, DSISR: 20000000
TASK = c1b78000[155] 'sync' Last syscall: 36
last math c1b78000 last altivec 00000000
GPR00: 00001032 C1B79E10 C1B78000 00000001 00009032 00000001 00000000
00000016
GPR08: 003F0000 C00BCEE0 00400000 C1B79E28 0F000000 1001F398 00000000
00000000
GPR16: 00000000 00000000 00000000 00000000 00009032 01B79F40 00000000
00000003
GPR24: C0130000 C0130000 C1B79E28 00000001 00000000 00000000 00000001
C1B79E10
Call backtrace:
00000000 C00BCEFC C00BD26C C008E460 C008E6F8 C008F7F4 C003CEE0
C0038344 C0038398 C0005ADC 10000B38 0FED0780 00000000

Here is the call backtrace decoding:
C00BCEFC:	concat_erase_callback()
C00BD26C:	concat_erase()
C008E460:	jffs2_erase_block()
C008E6F8:	jffs2_erase_pending_blocks()
C008F7F4:	jffs2_write_super(0
C003CEE0:	sync_supers()
C0038344:	fsync_dev()
C0038398:	sys_sync()
C0005ADC:	ret_from_syscall1()

Something is even more interesting that the call backtrace itself in the
trace log above:
"mtxmap_write16() called when erasing!"

This is a trace which I put down at the map driver level which prints only
when the write16() function is called when the flash device is erasing (i.e.
in the meantime that the 6 commands of an erase cycle are sent to the flash
and the time DQ6 stops toggling).
This should not happen!  I have only 1 flash device on my board and
cfi_cmdset_0002.c chip driver does not support multiple banks, which allows
to read or write to another bank while the first is erasing.  

Moreover, the trace: "waiting for erase to complete timed out." comes from
the chip driver's do_erase_oneblock() function and pops because DQ5 has been
asserted, meaning that a write has been attempted to a sector in the same
bank that the sector that was erasing.  This is confirmed by the first
traced I highlighted.

I keep on actively searching for the cause of the problem.

Frédéric Giasson

|-----Original Message-----
|From: David Woodhouse [mailto:dwmw2@infradead.org]
|Sent: Thursday, May 16, 2002 1:35 PM
|To: Frederic Giasson
|Cc: 'linux-mtd@lists.infradead.org'
|Subject: Re: Kernel oops after sync command in jffs2 
|
|
|
|Show oops? Decoded.
|
|--
|dwmw2
|
|

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel oops after sync command in jffs2
  2002-05-21 18:54 Frederic Giasson
@ 2002-05-21 21:07 ` David Woodhouse
  0 siblings, 0 replies; 12+ messages in thread
From: David Woodhouse @ 2002-05-21 21:07 UTC (permalink / raw)
  To: Frederic Giasson; +Cc: 'linux-mtd@lists.infradead.org'

> mtxmap_write16() called when erasing!
> waiting for erase to complete timed out.
> Erase at 0x00030000 failed immediately: errno -5
> Oops: kernel access of bad area, sig: 11

> C00BCEFC:	concat_erase_callback()
> C00BD26C:	concat_erase()

Thanks. mtdconcat.c shouldn't have been calling the erase completion 
callback _and_ returning failure. Can you update it and try again?

T'would still be interesting to know _why_ the erase failed though. What 
does the 'called while erasing!' message mean?

--
dwmw2

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Kernel oops after sync command in jffs2
@ 2002-05-22  2:03 Frederic Giasson
  2002-05-22  6:27 ` David Woodhouse
  0 siblings, 1 reply; 12+ messages in thread
From: Frederic Giasson @ 2002-05-22  2:03 UTC (permalink / raw)
  To: 'David Woodhouse'; +Cc: 'linux-mtd@lists.infradead.org'

Your fix in mtdconcat.c solves the OOPS.  However, erase is still failing.  

The 'called while erasing!' message means that I put a trace in the map
driver write16() function to tell me if write16()is called when a global
variable is set.  This global variable indicates that do_erase_oneword has
sent the erase sector command and is currently polling for the erase
operation to complete.

I tried removing DQ5 check while do_erase_oneblock() is polling the flash to
know when the erase is completed.  DQ5 rises if one of the 2 following
conditions occurs: the flash device received a write command for a bank that
is still erasing or a timeout internal to the flash device itself has
expired while the flash was still erasing.  
Theorically, the software timeout in do_erase_oneblock (timeo) should expire
before the flash's internal timeout, which eliminates the second condition.
If the first condition *really* occurs, then the function which issued a
write command to the flash should fail, and I should be able to track it
down this way.  What I suspect is that DQ5 is somehow not *really* raising
and the polling algorithm in do_erase_oneblock() reports an error anyways.
If my test succeeds this time with DQ5 polling disabled, I will inform you.
Otherwise, I will keep on searching the cause of DQ5 raising.

Frédéric Giasson

|-----Original Message-----
|From: David Woodhouse [mailto:dwmw2@infradead.org]
|Sent: Tuesday, May 21, 2002 5:08 PM
|To: Frederic Giasson
|Cc: 'linux-mtd@lists.infradead.org'
|Subject: Re: Kernel oops after sync command in jffs2 
|
|
|> mtxmap_write16() called when erasing!
|> waiting for erase to complete timed out.
|> Erase at 0x00030000 failed immediately: errno -5
|> Oops: kernel access of bad area, sig: 11
|
|> C00BCEFC:	concat_erase_callback()
|> C00BD26C:	concat_erase()
|
|Thanks. mtdconcat.c shouldn't have been calling the erase completion 
|callback _and_ returning failure. Can you update it and try again?
|
|T'would still be interesting to know _why_ the erase failed 
|though. What 
|does the 'called while erasing!' message mean?
|
|--
|dwmw2
|
|

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel oops after sync command in jffs2
  2002-05-22  2:03 Frederic Giasson
@ 2002-05-22  6:27 ` David Woodhouse
  2002-05-22 10:51   ` Ales Makarov
  0 siblings, 1 reply; 12+ messages in thread
From: David Woodhouse @ 2002-05-22  6:27 UTC (permalink / raw)
  To: Frederic Giasson; +Cc: 'linux-mtd@lists.infradead.org'

fgiasson@mediatrix.com said:
> The 'called while erasing!' message means that I put a trace in the
> map driver write16() function to tell me if write16()is called when a
> global variable is set.  This global variable indicates that
> do_erase_oneword has sent the erase sector command and is currently
> polling for the erase operation to complete. 

OK. We already fixed one of these by disabling fast programming mode, after 
you pointed out that it was sending the unlock command without paying due 
attention to the state machine. If you're definitely using v1.55 of 
cfi_cmdset_0002.c, can you put a BUG() in the mtxmap_write16() call just 
after it prints the 'called while erasing!' message, and we'll see where 
it's being called _from_. 

--
dwmw2

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel oops after sync command in jffs2
  2002-05-22  6:27 ` David Woodhouse
@ 2002-05-22 10:51   ` Ales Makarov
  0 siblings, 0 replies; 12+ messages in thread
From: Ales Makarov @ 2002-05-22 10:51 UTC (permalink / raw)
  To: Linux MTD mailing list

I have also reported our oops messages few day ago. I traced down the
functions and I got from JFFS2 to MTD. The problem is somewhere in
cfi_amdstd_write(). The problem appears only if there are 2 consecutive
writes (immediately one after another).

This happens regulary when GC tries to reclaim some space. The kernel
dies when GC calls mtd_fake_writev()...

Now I tried the latest version of cfi_cmdset_0002.c (v1.55). The kernel
oops messages are gone, but now I get:

Last[2] is ffff, datum is 1985
Write of 50 bytes at 0x00052438 failed. returned 0, retlen 0
Not marking the space at 0x00052438 as dirty because the flash driver
returned retlen zero
jffs2_write_dirent in garbage_collect_dirent failed: -5

Ales

David Woodhouse wrote:
> 
> fgiasson@mediatrix.com said:
> > The 'called while erasing!' message means that I put a trace in the
> > map driver write16() function to tell me if write16()is called when a
> > global variable is set.  This global variable indicates that
> > do_erase_oneword has sent the erase sector command and is currently
> > polling for the erase operation to complete.
> 
> OK. We already fixed one of these by disabling fast programming mode, after
> you pointed out that it was sending the unlock command without paying due
> attention to the state machine. If you're definitely using v1.55 of
> cfi_cmdset_0002.c, can you put a BUG() in the mtxmap_write16() call just
> after it prints the 'called while erasing!' message, and we'll see where
> it's being called _from_.
> 
> --
> dwmw2
> 
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel oops after sync command in jffs2
@ 2002-05-22 14:52 Frederic Giasson
  2002-05-23 15:34 ` Ales Makarov
  0 siblings, 1 reply; 12+ messages in thread
From: Frederic Giasson @ 2002-05-22 14:52 UTC (permalink / raw)
  To: 'ales.makarov@wolfhill.cz'
  Cc: David Woodhouse (E-mail), 'linux-mtd@lists.infradead.org'

Hi,

The oops you was getting does not seem to come from the same problem that I
found.  In my case, do_erase_oneblock() was failing. This lead us to
discover a bug in mtdconcat.c, which was the source of the oops.

However, I had a look at do_erase_oneword() function.  I seen that to verify
if write operation has completed, we simply read back the written value
until it matches or until we read back 10000 times or more.  This works most
of the time, but in my opinion it is not the best way to do that.  

What I think we should do to determine if write operation has completed OK
is to poll DQ6 and DQ5, as we do for erase operation.  However, since
program operation takes *much* less time than erase operation ( 150 us max
accoring to my AMD spec ), we should not schedule.  We should simply poll
150 times at most, inserting a "udelay(1);" in each loop.  If DQ5 rises, we
must threat it the same way we do in do_erase_oneblock().

Also, speaking of do_erase_oneblock(), it is said in my AMD spec that when
DQ5 rises, even if that happen while DQ6 still toggle, it is necessary to
read DQ0-DQ7 twice once again and compare both values to see if it still
toggles.  If it does not toggle, it means that erase operation has completed
correctly.  You can verify that my yourself by having a look at figure 6, p.
29 in spec sheet "Am29DL322D/323D/324D, revD, May 8th, 2001". 

Frederic Giasson

Ales Makarov wrote:

I have also reported our oops messages few day ago. I traced down the
functions and I got from JFFS2 to MTD. The problem is somewhere in
cfi_amdstd_write(). The problem appears only if there are 2 consecutive
writes (immediately one after another).

This happens regulary when GC tries to reclaim some space. The kernel
dies when GC calls mtd_fake_writev()...

Now I tried the latest version of cfi_cmdset_0002.c (v1.55). The kernel
oops messages are gone, but now I get:

Last[2] is ffff, datum is 1985
Write of 50 bytes at 0x00052438 failed. returned 0, retlen 0
Not marking the space at 0x00052438 as dirty because the flash driver
returned retlen zero
jffs2_write_dirent in garbage_collect_dirent failed: -5

Ales

David Woodhouse wrote:
> 
> fgiasson@mediatrix.com said:
> > The 'called while erasing!' message means that I put a trace in the
> > map driver write16() function to tell me if write16()is called when a
> > global variable is set.  This global variable indicates that
> > do_erase_oneword has sent the erase sector command and is currently
> > polling for the erase operation to complete.
> 
> OK. We already fixed one of these by disabling fast programming mode,
after
> you pointed out that it was sending the unlock command without paying due
> attention to the state machine. If you're definitely using v1.55 of
> cfi_cmdset_0002.c, can you put a BUG() in the mtxmap_write16() call just
> after it prints the 'called while erasing!' message, and we'll see where
> it's being called _from_.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel oops after sync command in jffs2
  2002-05-22 14:52 Frederic Giasson
@ 2002-05-23 15:34 ` Ales Makarov
  0 siblings, 0 replies; 12+ messages in thread
From: Ales Makarov @ 2002-05-23 15:34 UTC (permalink / raw)
  To: Frederic Giasson, Linux MTD mailing list

Frederic Giasson wrote:
> 
> However, I had a look at do_erase_oneword() function.  I seen that to verify

I guess you mean do_write_oneword() function.

> if write operation has completed, we simply read back the written value
> until it matches or until we read back 10000 times or more.  This works most
> of the time, but in my opinion it is not the best way to do that.

You are right. This could be my problem...
 
> What I think we should do to determine if write operation has completed OK
> is to poll DQ6 and DQ5, as we do for erase operation.  However, since
> program operation takes *much* less time than erase operation ( 150 us max
> accoring to my AMD spec ), we should not schedule.  We should simply poll
> 150 times at most, inserting a "udelay(1);" in each loop.  If DQ5 rises, we
> must threat it the same way we do in do_erase_oneblock().

I agree. I would also prefer the DQ6 and DQ5 testing.

> 
> Also, speaking of do_erase_oneblock(), it is said in my AMD spec that when
> DQ5 rises, even if that happen while DQ6 still toggle, it is necessary to
> read DQ0-DQ7 twice once again and compare both values to see if it still
> toggles.  If it does not toggle, it means that erase operation has completed
> correctly.  You can verify that my yourself by having a look at figure 6, p.
> 29 in spec sheet "Am29DL322D/323D/324D, revD, May 8th, 2001".
> 
> Frederic Giasson
> 

I have read it also in the datasheet of the Fujitsu MBM29LV320 (p.32).
These additional reads should be done, because the toggle bit may have
stopped toggling just as DQ5 went high.

Ales

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Kernel oops after sync command in jffs2
@ 2002-05-23 15:36 Frederic Giasson
  0 siblings, 0 replies; 12+ messages in thread
From: Frederic Giasson @ 2002-05-23 15:36 UTC (permalink / raw)
  To: 'ales.makarov@wolfhill.cz', Linux MTD mailing list

Hi Ales,

I am currently testing the modifications I suggested in do_write_oneword and
do_erase_oneblock regarding polling.  If you want my sources to test it
and/or improve it yourself, just let me known.  Otherwise, when I will have
tested it throughfully, I will commit it into the CVS tree.

Regards,

Frédéric Giasson  

|-----Original Message-----
|From: Ales Makarov [mailto:ales.makarov@wolfhill.cz]
|Sent: Thursday, May 23, 2002 11:34 AM
|To: Frederic Giasson; Linux MTD mailing list
|Subject: Re: Kernel oops after sync command in jffs2
|
|
|Frederic Giasson wrote:
|> 
|> However, I had a look at do_erase_oneword() function.  I 
|seen that to verify
|
|I guess you mean do_write_oneword() function.
|
|> if write operation has completed, we simply read back the 
|written value
|> until it matches or until we read back 10000 times or more.  
|This works most
|> of the time, but in my opinion it is not the best way to do that.
|
|You are right. This could be my problem...
| 
|> What I think we should do to determine if write operation 
|has completed OK
|> is to poll DQ6 and DQ5, as we do for erase operation.  However, since
|> program operation takes *much* less time than erase 
|operation ( 150 us max
|> accoring to my AMD spec ), we should not schedule.  We 
|should simply poll
|> 150 times at most, inserting a "udelay(1);" in each loop.  
|If DQ5 rises, we
|> must threat it the same way we do in do_erase_oneblock().
|
|I agree. I would also prefer the DQ6 and DQ5 testing.
|
|> 
|> Also, speaking of do_erase_oneblock(), it is said in my AMD 
|spec that when
|> DQ5 rises, even if that happen while DQ6 still toggle, it is 
|necessary to
|> read DQ0-DQ7 twice once again and compare both values to see 
|if it still
|> toggles.  If it does not toggle, it means that erase 
|operation has completed
|> correctly.  You can verify that my yourself by having a look 
|at figure 6, p.
|> 29 in spec sheet "Am29DL322D/323D/324D, revD, May 8th, 2001".
|> 
|> Frederic Giasson
|> 
|
|I have read it also in the datasheet of the Fujitsu MBM29LV320 (p.32).
|These additional reads should be done, because the toggle bit may have
|stopped toggling just as DQ5 went high.
|
|Ales
|

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2002-05-23 15:36 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-05-16 16:04 Kernel oops after sync command in jffs2 Frederic Giasson
2002-05-16 17:34 ` David Woodhouse
  -- strict thread matches above, loose matches on Subject: below --
2002-05-17 18:04 Frederic Giasson
2002-05-18  7:43 ` David Woodhouse
2002-05-21 18:54 Frederic Giasson
2002-05-21 21:07 ` David Woodhouse
2002-05-22  2:03 Frederic Giasson
2002-05-22  6:27 ` David Woodhouse
2002-05-22 10:51   ` Ales Makarov
2002-05-22 14:52 Frederic Giasson
2002-05-23 15:34 ` Ales Makarov
2002-05-23 15:36 Frederic Giasson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox