failed write verify causes segmentation fault

public inbox for linux-mtd@lists.infradead.org

* failed write verify causes segmentation fault
@ 2005-03-16  1:09 Sergei Sharonov
  2005-03-16 12:08 ` Artem B. Bityuckiy
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Sergei Sharonov @ 2005-03-16  1:09 UTC (permalink / raw)
  To: linux-mtd

Hello,

I believe I ran into a problem that might not have shown up under normal
test conditions. I was testing NAND flash endurance at elevated
temperatures and after a few weeks of continuous
write/remount/read/verify/erase of a 10 MByte random test file
nand_verify_pages() failed and caused a segmentation fault.
Test setup: at91rm9200 based custom board, Linux 2.6.10 patched with
mtd-snapshot-20050123, JFFS2 filesystem (non-root) on a 2 Gbit Toshiba
NAND chip, write verify enabled, compression disabled.
It seems that when write verify (eventually) failed, the system could not
handle it gracefully. Has anybody seen this before?

BTW, I cannot get my regular corporate mailing system (Outlook/Exchange)
to work correctly with the rules this list imposes.
I tried gmane this morning and the post has not shown up.
I am open to any suggestions on other free mailing
services that are known to work for this list.


Relevant startup messages:
-------------------8<-----------------
NAND device: Manufacturer ID: 0x98, Chip ID: 0xda 
(Toshiba NAND 256MiB 3,3V 8-bit)
Scanning device for bad blocks
Creating 1 MTD partitions on "NAND 256MiB 3,3V 8-bit":
0x00000000-0x10000000 : "Storage"
mtd: Giving out device 3 to Storage
-------------------8<-----------------


Error output:
-------------------8<-----------------
Creating file </mnt/flash/flashtest_1499_932674.log>.. OK
Writing 10000000 bytes.. nand_verify_pages: Failed write verify, 
page 0x00013dfd <5>Write of 4164 bytes at 0x09efe1bc failed. retur4
jffs2_flash_writev(): Non-contiguous write to 09eff200
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = c1e00000
[00000000] *pgd=202a3011, *pte=00000000, *ppte=00000000
Internal error: Oops: 807 [#1]
Modules linked in:
CPU: 0
PC is at jffs2_flash_writev+0x188/0x5a0
LR is at 0x1
pc : [<c00d9dc4>]    lr : [<00000001>]    Not tainted
sp : c1271bc8  ip : 60000093  fp : c1271c40
r10: 00000000  r9 : 09ee0000  r8 : 00000000
r7 : 09eff200  r6 : c125f800  r5 : ffffffff  r4 : 00000000
r3 : 00000000  r2 : 00000000  r1 : 0a485cda  r0 : 0000003a
Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  Segment user
Control: C000717F  Table: 21E00000  DAC: 00000015
Process flashtest (pid: 1880, stack limit = 0xc1270190)
Stack: (0xc1271bc8 to 0xc1272000)
1bc0:                   04b6061b 0337050e 062b0676 0662060a c125f91c 09eff200
1be0: 00000000 06ca066e ffffffff 00000000 00000000 00000002 c1271c78 03e9066d
1c00: 05e10000 06da064d 05050415 05df06be 06d605ff 065f0535 09eff200 00000000
1c20: c18c34f4 c15d6d50 c125f800 c0a3be54 c18c34f4 c1271cb0 c1271c48 c00d0694
1c40: c00d9c50 09eff200 00000000 c1271c74 00001932 00000dbc c1a42000 c1271cb0
1c60: 00000002 00000000 c1dcae34 c15d6d50 00000000 00000000 c18c34f4 00000044
1c80: c1a41000 00000dbc c1271ce0 00000000 c18c34f4 c15d6d50 c125f800 00143000
1ca0: 00000000 c1271d10 c1271cb4 c00d14b0 c00d0530 00000dbc 09eff200 00000003
1cc0: 00000001 00000000 c1a41000 00000dbc 00000dbc c1a41000 00000e00 09eff200
1ce0: e1f14050 e1f14050 c18c351c c18c3518 00000000 c18c34f4 c15d6d78 c15d6d50
1d00: c024d820 c1271d60 c1271d14 c00ccae4 c00d11d8 00143000 00001000 c1271d2c
1d20: c125f800 00001000 00000000 00000000 000e3b49 000e3b49 00001000 00000000
1d40: c020b644 c024d820 00143000 00000000 00000000 c1271e00 c1271d64 c004ad34
1d60: c00cc94c 00001000 00153f24 00001000 c15d6d78 c01cf8dc c15d6e10 c19411e0
1d80: 00000001 c1271e6c 00000000 c1271f10 00000000 7b271da0 00000002 00000000
1da0: c024d800 c024d820 c0249720 c02430e0 c023dec0 c023dee0 c024d580 c024d5a0
1dc0: c024d5c0 c024d5e0 c024ce80 c024cea0 c024cec0 c024cee0 00000000 00000000
1de0: 00000000 c1271e34 c19411e0 00000000 00000000 c1271e68 c1271e08 c004b3c8
1e00: c004a9e4 00000000 00000000 c1271f74 00846680 00143000 00000000 c15d6d78
1e20: 00989680 c1271f74 c1271f10 c1271e6c 00000001 00989680 00000000 00000000
1e40: 00000000 c126fd40 c1271ecc c1271e6c 00000000 00000000 c1271f74 c1271f08
1e60: c1271e6c c004b4f4 c004af88 c1271e78 c010ac54 00000000 00000001 ffffffff
1e80: c19411e0 c1295804 00000000 c1271ef0 00000000 c00fcb6c 60000093 c126fd40
1ea0: 00000000 00000000 4005a361 c1270000 c1149be4 c1271ed8 c1271ec4 c002c1cc
1ec0: c002c134 c126fd40 c00432b8 c1271ecc c1271ecc c00f6968 c002c1b0 00000019
1ee0: 00000019 c15d6de0 c15d6e10 00000000 c15d6d78 c19411e0 00010f24 c1271f3c
1f00: c1271f0c c004b730 c004b480 000e3b44 00010f24 00989680 00000000 c19411e0
1f20: 00000000 00000000 00989680 c1271f74 c1271f70 c1271f40 c00636b0 c004b6e4
1f40: 00000001 c1271fb0 c1941204 fffffff7 c19411e0 c1271f74 00000000 00000000
1f60: 40058178 c1271fa4 c1271f74 c00637ac c00635dc 00000000 00000000 00000000
1f80: 002625a0 0026259f 00010f24 00000004 c001b924 c1270000 00000000 c1271fa8
1fa0: c001b7a0 c006376c 002625a0 c0033964 00000003 00010f24 00989680 4005a348
1fc0: 002625a0 0026259f 00010f24 00000003 000e3b42 befffd24 40058178 000005db
1fe0: 00010eec befffc1c 00008828 40046ce0 60000010 00000003 e1a05000 e1a0c001
Backtrace:
[<c00d9c40>] (jffs2_flash_writev+0x4/0x5a0) from [<c00d0694>]
(jffs2_write_dnode+0x174/0x684)
[<c00d0520>] (jffs2_write_dnode+0x0/0x684) from [<c00d14b0>]
(jffs2_write_inode_range+0x2e8/0x46c)
[<c00d11c8>] (jffs2_write_inode_range+0x0/0x46c) from [<c00ccae4>]
(jffs2_commit_write+0x1a8/0x304)
[<c00cc93c>] (jffs2_commit_write+0x0/0x304) from [<c004ad34>]
(generic_file_buffered_write+0x364/0x5a8)
[<c004a9d4>] (generic_file_buffered_write+0x4/0x5a8) from [<c004b3c8>]
(__generic_file_aio_write_nolock+0x450/0x478)
[<c004af78>] (__generic_file_aio_write_nolock+0x0/0x478) from [<c004b4f4>]
(__generic_file_write_nolock+0x84/0xac)
[<c004b470>] (__generic_file_write_nolock+0x0/0xac) from [<c004b730>]
(generic_file_write+0x5c/0xe8)
 r9 = 00010F24  r8 = C19411E0  r7 = C15D6D78  r6 = 00000000
 r5 = C15D6E10  r4 = C15D6DE0
[<c004b6d4>] (generic_file_write+0x0/0xe8) from [<c00636b0>] 
(vfs_write+0xe4/0x11c)
[<c00635cc>] (vfs_write+0x0/0x11c) from [<c00637ac>] 
(sys_write+0x50/0x74)
[<c006375c>] (sys_write+0x0/0x74) from [<c001b7a0>]
(ret_fast_syscall+0x0/0x2c)
 r9 = C1270000  r8 = C001B924  r7 = 00000004  r6 = 00010F24
 r5 = 0026259F  r4 = 002625A0
Code: 159f035c 10812003 1bfd5662 e3a03000 (e5833000)
 Segmentation fault
#
-------------------8<-----------------

After rebooting, the file system appears to be intact, except that the
last file was only partially written (1323008 bytes instead of 10000000)
and mounting reports a CRC error:

# mount /mnt/flash/
# jffs2_get_inode_nodes(): Data CRC failed on node at 0x09efe1bc: 
Read 0x440f7cb2, calculated 0x1db0b909
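
For reference, the addresses in the log above can be cross-checked with a
little arithmetic. This is only a sketch: it assumes 2048-byte pages (the
report does not state the page size), and the helper names are hypothetical.
Under that assumption page 0x00013dfd starts at 0x09efe800, which falls
inside the 4164-byte write at 0x09efe1bc, the same offset where the CRC
error is later reported.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 2048-byte data pages; the report does not state the page size. */
#define ASSUMED_PAGE_SIZE 2048u

/* Byte offset of the first byte of a NAND page. */
static uint32_t page_to_offset(uint32_t page)
{
    return page * ASSUMED_PAGE_SIZE;
}

/* True if the write [start, start + len) touches any byte of the page. */
static bool write_touches_page(uint32_t start, uint32_t len, uint32_t page)
{
    uint32_t page_start = page_to_offset(page);
    uint32_t page_end = page_start + ASSUMED_PAGE_SIZE;

    return start < page_end && start + len > page_start;
}
```

With the values from the log, write_touches_page(0x09efe1bc, 4164, 0x13dfd)
holds, so the failing page and the damaged node are consistent with each
other.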

Will take any help/advice offered ;-)
Thanks,

Sergei Sharonov


* Re: failed write verify causes segmentation fault
  2005-03-16  1:09 failed write verify causes segmentation fault Sergei Sharonov
@ 2005-03-16 12:08 ` Artem B. Bityuckiy
  2005-03-16 19:36   ` Thomas Gleixner
  2005-03-16 19:51   ` Thomas Gleixner
  2005-03-16 19:53 ` Thomas Gleixner
  2005-03-16 21:44 ` Thomas Gleixner
  2 siblings, 2 replies; 8+ messages in thread
From: Artem B. Bityuckiy @ 2005-03-16 12:08 UTC (permalink / raw)
  To: Sergei Sharonov; +Cc: linux-mtd

Hi

It is difficult for me to say what is going wrong without additional
information. Here are my thoughts.

> Writing 10000000 bytes.. nand_verify_pages: Failed write verify, 
> page 0x00013dfd <5>Write of 4164 bytes at 0x09efe1bc failed. retur4
> jffs2_flash_writev(): Non-contiguous write to 09eff200
> Unable to handle kernel NULL pointer dereference at virtual 
This is odd. If you take a look at the corresponding code in fs/jffs2/wbuf.c,
you'll find:

        if (to != PAD(c->wbuf_ofs + c->wbuf_len)) {
                /* We're not writing immediately after the writebuffer. Bad. */
                printk(KERN_CRIT "jffs2_flash_writev(): Non-contiguous write to %08lx\n", (unsigned long)to);
                if (c->wbuf_len)
                        printk(KERN_CRIT "wbuf was previously %08x-%08x\n",
                                          c->wbuf_ofs, c->wbuf_ofs+c->wbuf_len);
                BUG();
        }

I don't see any reason why the NULL dereferencing might happen after the
"jffs2_flash_writev(): Non-contiguous write to.." output. BUG()
should be called instead.

The fact that the "if (to != PAD(c->wbuf_ofs + c->wbuf_len))" check triggers
implies that something went wrong. It means that a write is being made to
the wrong place, not to the page currently represented by the write buffer.
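
To make the check concrete, here is a minimal user-space sketch of it. PAD()
rounds up to a 4-byte boundary as in JFFS2; the struct and the function name
are hypothetical stand-ins, and the write-buffer values used with it are
made up for illustration. Note that 0x09efe1bc + 4164 = 0x09eff200 in the
log, so the rejected write is aimed exactly where the failed write would
have ended.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round up to a 4-byte boundary, as JFFS2's PAD() does. */
#define PAD(x) (((x) + 3u) & ~3u)

/* Hypothetical stand-in for the two write-buffer fields used by the check. */
struct wbuf_state {
    uint32_t wbuf_ofs;  /* flash offset where the buffered data starts */
    uint32_t wbuf_len;  /* number of bytes currently buffered */
};

/* True if a write to 'to' would directly follow the buffered data. */
static bool write_is_contiguous(const struct wbuf_state *c, uint32_t to)
{
    return to == PAD(c->wbuf_ofs + c->wbuf_len);
}
```

A buffer starting at 0x09eff000 with 0x200 bytes in it would accept a write
to 0x09eff200; a buffer whose state no longer matches the flash contents
would not, which is the situation the printk complains about.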

> It seems that when write verify (eventually) failed, the system could not
> handle it gracefully. Has anybody seen this before?
This is only one possible reason. The other possible reason, which I
think is more likely in your case, is that JFFS2 tried to write to a
non-empty NAND page, i.e., a page that didn't contain all 0xFF.
In this case write_verify() might fail as well. I don't know why JFFS2
might do that; possibly there is a bug.

I'd suggest debugging JFFS2. You might try the following.

1. Before writing anything, check that the target NAND page is empty.
For this purpose you might insert the corresponding checking code at
wbuf.c:466 (the __jffs2_flush_wbuf() function). The line number should be
valid for the latest MTD snapshot ($Id: wbuf.c,v 1.89 2005/02/09 09:23:54
pavlov Exp $). All writes pass through this function in the case of NAND
flash.

To read a page you may insert something like this:
unsigned char testbuf[2048];	/* large for a kernel stack; debug only */
size_t retlen;

memset(testbuf, 0, sizeof(testbuf));
jffs2_flash_read(c, c->wbuf_ofs, c->wbuf_pagesize, &retlen, testbuf);

Then check that testbuf[] contains all 0xFF.
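
The all-0xFF check itself could look roughly like the following. This is a
plain C sketch with a hypothetical function name, written so it can be
tried in user space first; in the kernel the same loop would run over
testbuf[] after the read.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if every byte in buf[0..len) is 0xFF, i.e. the page looks erased. */
static bool buf_is_all_ff(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] != 0xFF)
            return false;
    }
    return true;
}
```

If it returns false just before a flush, the target page was not empty and
the offset is worth printing.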

2. Insert printk()s in different places. You could enable the level 1
JFFS2 debug output, but that will be too noisy.

Alternatively, you might introduce a variable like 'int
_shit_have_happened_' and export it.

Redefine D1() to something like:

#define D1(x) do { if (_shit_have_happened_) { x; } } while (0)

Set _shit_have_happened_ to 1 in nand_verify_pages() if an error happens,
or in __jffs2_flush_wbuf() if you find that you are writing to a non-empty
NAND page, and so on.
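
As a rough user-space illustration of that idea (not kernel code), the
sketch below gates debug output on a flag that is armed at the first
failure. The names debug_armed, debug_lines, trace() and do_write() are all
hypothetical, and fprintf() stands in for printk().

```c
#include <stdio.h>

static int debug_armed;  /* hypothetical flag: set once an error was seen */
static int debug_lines;  /* counts emitted lines so the effect is observable */

/* Runs x only once the flag is set; do/while (0) keeps it safe in if/else. */
#define D1(x) do { if (debug_armed) { x; } } while (0)

static void trace(const char *msg)
{
    debug_lines++;
    fprintf(stderr, "debug: %s\n", msg);
}

/* Stand-in for a write path; verify_failed simulates a failed verify. */
static void do_write(int verify_failed)
{
    D1(trace("before write"));
    if (verify_failed)
        debug_armed = 1;  /* arm tracing at the first failure */
    D1(trace("after write"));
}
```

Nothing is printed for writes before the first failure, so the log stays
quiet until the interesting part begins.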

This might help.

I wonder whether something like this happens at normal temperatures?
Might something other than the NAND be misbehaving?

-- 
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: failed write verify causes segmentation fault
  2005-03-16 12:08 ` Artem B. Bityuckiy
@ 2005-03-16 19:36   ` Thomas Gleixner
  2005-03-16 19:51   ` Thomas Gleixner
  1 sibling, 0 replies; 8+ messages in thread
From: Thomas Gleixner @ 2005-03-16 19:36 UTC (permalink / raw)
  To: dedekind; +Cc: linux-mtd

On Wed, 2005-03-16 at 15:08 +0300, Artem B. Bityuckiy wrote:
> I don't see any reason why the NULL dereferencing might happen after the
> "jffs2_flash_writev(): Non-contiguous write to.." output. BUG()
> should be called instead.

From include/asm-arm/bug.h:
#define BUG()           (*(int *)0 = 0)

This is a NULL pointer dereference.
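
The same effect can be seen in user space. The sketch below is an analogue,
not the kernel macro: FAKE_BUG() and signal_from_fake_bug() are hypothetical
names, the store goes through a pointer read from a volatile variable so the
compiler cannot drop it, and it runs in a child process so the parent can
report which signal killed it. On a typical Linux system that signal is
SIGSEGV, which matches the "Segmentation fault" at the end of the report.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* User-space analogue of the quoted BUG(): store through a null pointer. */
#define FAKE_BUG() do { int *volatile p = 0; *p = 0; } while (0)

/* Runs FAKE_BUG() in a child process. Returns the signal that killed the
 * child, 0 if it exited normally, or -1 if fork() or waitpid() failed. */
static int signal_from_fake_bug(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        FAKE_BUG();
        _exit(0);  /* not reached if the store faults */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```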

tglx


* Re: failed write verify causes segmentation fault
  2005-03-16 12:08 ` Artem B. Bityuckiy
  2005-03-16 19:36   ` Thomas Gleixner
@ 2005-03-16 19:51   ` Thomas Gleixner
  2005-03-17 18:20     ` Sergei Sharonov
  1 sibling, 1 reply; 8+ messages in thread
From: Thomas Gleixner @ 2005-03-16 19:51 UTC (permalink / raw)
  To: dedekind; +Cc: linux-mtd, Sergei Sharonov

On Wed, 2005-03-16 at 15:08 +0300, Artem B. Bityuckiy wrote:
> This is only one possible reason. The other possible reason, which I
> think is more likely in your case, is that JFFS2 tried to write to a
> non-empty NAND page, i.e., a page that didn't contain all 0xFF.
> In this case write_verify() might fail as well. I don't know why JFFS2
> might do that; possibly there is a bug.

The question is not why the write verify fails. In his test environment
a page failure is expected to happen at some point. The question is why
JFFS2 did not cope with the problem as it is supposed to.

tglx


* Re: failed write verify causes segmentation fault
  2005-03-16  1:09 failed write verify causes segmentation fault Sergei Sharonov
  2005-03-16 12:08 ` Artem B. Bityuckiy
@ 2005-03-16 19:53 ` Thomas Gleixner
  2005-03-16 21:44 ` Thomas Gleixner
  2 siblings, 0 replies; 8+ messages in thread
From: Thomas Gleixner @ 2005-03-16 19:53 UTC (permalink / raw)
  To: Sergei Sharonov; +Cc: linux-mtd

On Tue, 2005-03-15 at 19:09 -0600, Sergei Sharonov wrote:
> test conditions. I was testing NAND flash endurance at elevated 
> temperatures and after a few weeks of continuous 
> write/remount/read/verify/erase of a 10 MByte random test file 
> nand_verify_pages() failed and caused a segmentation fault.

The write/verify failure is nothing scary. It's the nature of NAND flash
that this can and will happen.

The segfault of your user space application/script is a consequence of a
BUG() in the kernel. This _is_ scary.

> Test setup: at91rm9200 based custom board, Linux 2.6.10 patched with
> mtd-snapshot-20050123, JFFS2 filesystem (non-root) on a 2 Gbit Toshiba
> NAND chip, write verify enabled, compression disabled.
> It seems that when write verify (eventually) failed, the system could not
> handle it gracefully. Has anybody seen this before?

This should be handled by JFFS2 and I know for sure this worked some
time ago. I have no idea yet which part of the code has been changed to
make this functionality go away. I will fake the write failure and check
whether I can reproduce your problem here. If it is reproducible then we
can fix it, if not we'll see. :)
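
One way to fake such a failure, shown here only as a toy user-space sketch
and not as the actual test that was run: a simulated flash array whose
write-and-verify routine flips one bit on a chosen page. The names
sim_flash, fail_page and sim_write_verify() and the tiny page size are
assumptions made for the example.

```c
#include <stdint.h>
#include <string.h>

#define SIM_PAGES     8
#define SIM_PAGE_SIZE 16  /* tiny pages keep the example small */

static uint8_t sim_flash[SIM_PAGES][SIM_PAGE_SIZE];
static int fail_page = -1;  /* page whose verify should fail, -1 = none */

/* Program one page, then read it back and compare. Returns 0 on success,
 * -1 if the read-back differs (here only when a fault is injected). */
static int sim_write_verify(int page, const uint8_t *data)
{
    memcpy(sim_flash[page], data, SIM_PAGE_SIZE);
    if (page == fail_page)
        sim_flash[page][0] ^= 0x01;  /* injected single-bit error */
    return memcmp(sim_flash[page], data, SIM_PAGE_SIZE) == 0 ? 0 : -1;
}
```

Setting fail_page before a test run makes the failure deterministic, so the
error path in the caller can be exercised without waiting for real wear.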

> Backtrace:
> [<c00d9c40>] (jffs2_flash_writev+0x4/0x5a0) from [<c00d0694>]
> (jffs2_write_dnode+0x174/0x684)
> [<c00d0520>] (jffs2_write_dnode+0x0/0x684) from [<c00d14b0>]
> (jffs2_write_inode_range+0x2e8/0x46c)
> [<c00d11c8>] (jffs2_write_inode_range+0x0/0x46c) from [<c00ccae4>]
> (jffs2_commit_write+0x1a8/0x304)
> [<c00cc93c>] (jffs2_commit_write+0x0/0x304) from [<c004ad34>]
> (generic_file_buffered_write+0x364/0x5a8)

Not very helpful for finding the source of the problem, but thanks for
providing a detailed and descriptive bug report.

> After rebooting, the file system appears to be intact, except that the
> last file was only partially written (1323008 bytes instead of 10000000)
> and mounting reports a CRC error:

That's an expected consequence of the problem. 

tglx


* Re: failed write verify causes segmentation fault
  2005-03-16  1:09 failed write verify causes segmentation fault Sergei Sharonov
  2005-03-16 12:08 ` Artem B. Bityuckiy
  2005-03-16 19:53 ` Thomas Gleixner
@ 2005-03-16 21:44 ` Thomas Gleixner
  2005-03-17 15:29   ` Sergei Sharonov
  2 siblings, 1 reply; 8+ messages in thread
From: Thomas Gleixner @ 2005-03-16 21:44 UTC (permalink / raw)
  To: Sergei Sharonov; +Cc: linux-mtd

On Tue, 2005-03-15 at 19:09 -0600, Sergei Sharonov wrote:

> mtd-snapshot-20050123

Bad luck. 

The problem causing your BUG() was fixed on 
Mon Jan 24 21:24:15 2005 UTC 
in MTD CVS. Please update to current CVS Head

tglx


* Re: failed write verify causes segmentation fault
  2005-03-16 21:44 ` Thomas Gleixner
@ 2005-03-17 15:29   ` Sergei Sharonov
  0 siblings, 0 replies; 8+ messages in thread
From: Sergei Sharonov @ 2005-03-17 15:29 UTC (permalink / raw)
  To: linux-mtd


Hi,

> > mtd-snapshot-20050123
> 
> Bad luck. 
> 
> The problem causing your BUG() was fixed on 
> Mon Jan 24 21:24:15 2005 UTC 
> in MTD CVS. Please update to current CVS Head
> 
> tglx
> 

Wow! Speaking of bad luck.. Thanks, appreciate your help. Best regards,

Sergei Sharonov

P.S. I am trying to post reply using gmane.org. Outlook/Exchange combination
is broken (Dah!) and does not thread. Will see if gmane will work
this time.


* Re: failed write verify causes segmentation fault
  2005-03-16 19:51   ` Thomas Gleixner
@ 2005-03-17 18:20     ` Sergei Sharonov
  0 siblings, 0 replies; 8+ messages in thread
From: Sergei Sharonov @ 2005-03-17 18:20 UTC (permalink / raw)
  To: linux-mtd

Hi,

> > This is only one possible reason. The other possible reason, which I
> > think is more likely in your case, is that JFFS2 tried to write to a
> > non-empty NAND page, i.e., a page that didn't contain all 0xFF.
> > In this case write_verify() might fail as well. I don't know why JFFS2
> > might do that; possibly there is a bug.
> 
> The question is not why the write verify fails. In his test environment
> a page failure is expected to happen at some point. The question is why
> JFFS2 did not cope with the problem as it is supposed to.

Exactly. Could not have said it better myself. How should jffs2 handle failed
writes? Replace the bad page transparently on the fly, return number of bytes
written < number of bytes requested, segfault ;-) ?
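
To make the first option concrete, here is a purely illustrative sketch of
"relocate on failure". It is not JFFS2's actual recovery path; the
page_writer type, write_with_relocation() and the writer_bad_page_2() test
double are hypothetical.

```c
#include <stddef.h>

/* Hypothetical page writer: returns 0 on success, nonzero on verify failure. */
typedef int (*page_writer)(int page, const unsigned char *data);

/* Writes one page of data starting at *next_page. On a verify failure it
 * skips that page and tries the following one, up to max_page. Returns the
 * page that took the data, or -1 if none did. */
static int write_with_relocation(page_writer w, int *next_page, int max_page,
                                 const unsigned char *data)
{
    while (*next_page <= max_page) {
        int page = (*next_page)++;

        if (w(page, data) == 0)
            return page;
        /* a real filesystem would also have to retire the failed block */
    }
    return -1;
}

/* Test double: pretends that page 2 always fails verification. */
static int writer_bad_page_2(int page, const unsigned char *data)
{
    (void)data;
    return page == 2 ? -1 : 0;
}
```

The caller never sees the failure unless the device runs out of pages, in
which case an error (or a short write) would have to be reported upward.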

Sergei

P.S. For you poor souls stuck behind corporate firewalls with Outlook/Exchange,
it looks like gmane.org is a good alternative. It does threading, posts plain
text and checks for lines >80 chars.
