From: Hong Tran Duc
Subject: Oops when read/write or mount/unmount continuously ~ 600.000 times
Date: Sun, 03 Aug 2008 19:49:50 +0700
Message-ID: <4895A96E.2040303@gmail.com>
To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-ide@vger.kernel.org

Hi all,

I'm using kernel 2.4.20 with full preemption enabled (patched, with the
CONFIG option set). My hardware is a PowerPC 750FX CPU, an 80 GB HDD, and
512 MB of RAM.

I get many Oopses when I mount/unmount or read/write on an ATA HDD
continuously, about 600,000 times (over several hours). The Oops usually
occurs when the CPU traps with SIGSEGV or SIGILL, sometimes in the page
management code, sometimes in the scheduler, block I/O, or filesystem code.

It happens most frequently in:

Block I/O: make_request, generic_make_request, submit_bh, bdfind, bmap,
__wait_on_buffer ..
Filesystem: journal_commit_transaction, kill_super, invalidate_inode,
invalidate_list ..

In almost every case the cause is a broken linked list in those functions.
Ex: linkedlist->next or linkedlist->prev = NULL, or set to an invalid address.

In the SIGILL case, the instruction pointer (NIP) is the same as the
return address register (LR).

The newest Oops I got was in __wait_on_buffer(). The main steps of
__wait_on_buffer() are:

1. get_bh -> atomic_inc(&bh->b_count);
2. add to the wait queue
3. loop: do some task manipulation, call *schedule()*
4. remove from the wait queue
5. put_bh -> atomic_dec(&bh->b_count); *<- Got the Oops here, SEGV because
   the address of bh->b_count, held in r25, is 0x02*

After analyzing the assembly code (uploaded to the pastebin below) at this
point, I found that:

* At step (1), the address of bh->b_count is stored in register r25.
* From step (2) to step (4), anything that touches r25 should restore it
  from the stack (r25 is a non-volatile register in the gcc ABI).
* At step (5), *r25 = 0x02 ??? I don't know why r25 changed. Maybe the
  stack was corrupted somewhere?*

This Oops is very difficult to reproduce (it takes about 2 hours of running
the stress test program). I tried increasing and reducing HZ by a factor of
10, but the frequency of the bug did not change. I also tried ext2 and ext3,
with the same result.

I'm really confused now. I don't know where the real problem is, what
affects the frequency of the Oops, or how to debug and track down this bug.

I'm posting my situation to this ML in the hope of getting some advice from
you.

Some of the Oopses are uploaded on pastebin here:
http://vnoss.net/p/5783
http://vnoss.net/p/5785

Source and assembly of __wait_on_buffer():
http://vnoss.net/p/5784

Thanks for your help,

-- 
nm.
GPG Key ID: 0xDD253B25
Fingerprint: 2B17 D64A 9561 A443 2ABC 1302 4641 D0B7 DD25 3B25
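
One way to narrow down the broken-list symptom described above is to validate a list before each manipulation, so that the corruption is trapped close to where it happens. What follows is only a minimal user-space sketch in C under stated assumptions: the names list_node, list_check, and list_add_tail_checked are hypothetical and are not kernel APIs, and the layout merely imitates a circular doubly linked list. The check catches NULL or mismatched next/prev links; a wild pointer such as 0x02 would still fault when dereferenced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a circular doubly linked list (not a kernel type). */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* An empty list is a head that points to itself in both directions. */
static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Link node in just before head, i.e. at the tail of the list. */
static void list_add_tail(struct list_node *node, struct list_node *head)
{
    struct list_node *last = head->prev;

    node->next = head;
    node->prev = last;
    last->next = node;
    head->prev = node;
}

/*
 * Walk the list forward and verify that every next link is non-NULL and
 * that the neighbour's prev link points back at the current node.
 * Returns the number of nodes (excluding head) when all links agree,
 * or -1 at the first NULL or mismatched link. max_nodes bounds the walk
 * so a damaged list that never returns to head cannot loop forever.
 */
static int list_check(const struct list_node *head, int max_nodes)
{
    const struct list_node *cur = head;
    int count = 0;

    do {
        const struct list_node *next = cur->next;

        if (next == NULL || next->prev != cur)
            return -1;
        cur = next;
        if (cur != head && ++count > max_nodes)
            return -1;
    } while (cur != head);

    return count;
}

/* Trap immediately if the list is already damaged before touching it. */
static void list_add_tail_checked(struct list_node *node,
                                  struct list_node *head, int max_nodes)
{
    assert(list_check(head, max_nodes) >= 0);
    list_add_tail(node, head);
}
```

Calling a check like this at the entry of each function that manipulates a list turns a delayed crash into an early failure, which can help show which code path ran last before the links went bad.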