* OOPS with raid1
@ 2003-05-08 15:22 Torsten Landschoff
2003-05-11 11:45 ` Torsten Landschoff
0 siblings, 1 reply; 2+ messages in thread
From: Torsten Landschoff @ 2003-05-08 15:22 UTC (permalink / raw)
To: linux-raid
[-- Attachment #1: Type: text/plain, Size: 7495 bytes --]
Hi *,
I wonder if nobody has noticed this before but there seems to be at
least one serious bug in the raid1 code. I only noticed because I
simulated a disk failure to check if the raid is working.
I don't have a log of the original failure on a production system but it
is reproducible for me quite simple:
pulsar:~# uname -a
Linux pulsar 2.4.20-686-smp #1 SMP Mon Jan 13 23:06:41 EST 2003 i686
(that's the kernel as shipped in Debian unstable's
kernel-image-2.4.20-686-smp)
pulsar:~# mdadm --create /dev/md0 -l 1 -n 2 /dev/ram0 /dev/ram1
md: bind<[dev 01:00],1>
md0: WARNING: [dev 01:01] appears to be on the same physical disk as [dev 01:00]. True
protection against single-disk failure might be compromised.
md: bind<[dev 01:01],2>
md: [dev 01:01]'s event counter: 00000000
md: [dev 01:00]'s event counter: 00000000
md: md0: raid array is not clean -- starting background reconstruction
md: RAID level 1 does not need chunksize! Continuing anyway.
md: raid1 personality registered as nr 3
md0: max total readahead window set to 124k
md0: 1 data-disks, max readahead per data-disk: 124k
raid1: device [dev 01:01] operational as mirror 1
raid1: device [dev 01:00] operational as mirror 0
raid1: raid set md0 not clean; reconstructing mirrors
raid1: raid set md0 active with 2 out of 2 mirrors
md: updating md0 RAID superblock on device
md: [dev 01:01] [events: 00000001]<6>(write) [dev 01:01]'s sb offset: 4032
md: [dev 01:00] [events: 00000001]<6>(write) [dev 01:00]'s sb offset: 4032
md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 100 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 100000 KB/sec) for reconstruction.
md: using 124k window, over a total of 4032 blocks.
mdadm: array /dev/md0 started.
pulsar:~# mdadm -f /dev/md0 /dev/ram1
raid1: Disk failure on [dev 01:01], disabling device.
(that's the kernel as shipped in Debian unstable's kernel-image-2.4.20-686-smp)
peration continuing on 1 devices
md: updating md0 RAID superblock on device
md: [dev 01:01] [events: 00000002]<6>(write) [dev 01:01]'s sb offset: 4032
md: recovery thread got woken up ...
md0: no spare disk to reconstruct array! -- continuing in degraded mode
CPU: 1
EIP: 0010:[<d8921383>] Not tainted
EFLAGS: 00010246
eax: c1a70380 ebx: 00000101 ecx: 00000400 edx: 00001000
esi: 00000000 edi: cc06f000 ebp: cec5ff60 esp: cee73f44
ds: 0018 es: 0018 ss: 0018
Process raid1d (pid: 1219, stackpage=cee73000)
Stack: cec5ff60 d56101a0 d1e7a660 d1e7a674 00000000 c1a70380 d8921653 cec5ff60
d892617f 00000002 d1e7a660 cee72000 d56103a0 00000001 00000064 00000000
d8a468bd d1e7a660 cee72000 cee72000 d56103a0 00000001 00000001 00000004
Call Trace: [<d8921653>] [<d892617f>] [<d8a468bd>] [<d8924259>] [<c0107164>]
Code: f3 a5 8b 44 24 14 8b 54 24 10 f0 0f ab 50 18 8b 44 24 14 e8
mdadm: set /dev/ram1 faulty in /dev/md0
Relevant excerpt from syslog:
May 8 17:12:28 pulsar kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
May 8 17:12:28 pulsar kernel: printing eip:
May 8 17:12:28 pulsar kernel: d8921383
May 8 17:12:28 pulsar kernel: *pde = 00000000
May 8 17:12:28 pulsar kernel: Oops: 0000
May 8 17:12:28 pulsar kernel: CPU: 1
May 8 17:12:28 pulsar kernel: EIP: 0010:[input:__insmod_input_O/lib/modules/2.4.20-686-smp/kernel/drivers/+-60541/96] Not tainted
May 8 17:12:28 pulsar kernel: EFLAGS: 00010246
May 8 17:12:28 pulsar kernel: md: recovery thread finished ...
May 8 17:12:28 pulsar kernel: eax: c1a70380 ebx: 00000101 ecx: 00000400 edx: 00001000
May 8 17:12:28 pulsar kernel: esi: 00000000 edi: cc06f000 ebp: cec5ff60 esp: cee73f44
May 8 17:12:28 pulsar kernel: ds: 0018 es: 0018 ss: 0018
May 8 17:12:28 pulsar kernel: Process raid1d (pid: 1219, stackpage=cee73000)
May 8 17:12:28 pulsar kernel: Stack: cec5ff60 d56101a0 d1e7a660 d1e7a674 00000000 c1a70380 d8921653 cec5ff60
May 8 17:12:28 pulsar kernel: d892617f 00000002 d1e7a660 cee72000 d56103a0 00000001 00000064 00000000
May 8 17:12:28 pulsar kernel: d8a468bd d1e7a660 cee72000 cee72000 d56103a0 00000001 00000001 00000004
May 8 17:12:28 pulsar kernel: Call Trace: [input:__insmod_input_O/lib/modules/2.4.20-686-smp/kernel/drivers/+-59821/96] [input:__insmod_input_O/lib/modules/2.4.20-686-smp/kernel/drivers/+-40577/96] [awe_wave:__insmod_awe_wave_S.data_L9840+1456189/224684141] [input:__insmod_input_O/lib/modules/2.4.20-686-smp/kernel/drivers/+-48551/96] [kernel_thread+40/56]
May 8 17:12:28 pulsar kernel:
May 8 17:12:28 pulsar kernel: Code: f3 a5 8b 44 24 14 8b 54 24 10 f0 0f ab 50 18 8b 44 24 14 e8
For reference: On the production server I got a similar entry:
May 5 15:54:13 linuxserv kernel: raid1: Disk failure on sdb1, disabling device.
May 5 15:54:13 linuxserv kernel: ^IOperation continuing on 1 devices
May 5 15:54:13 linuxserv kernel: md: updating md0 RAID superblock on device
May 5 15:54:13 linuxserv kernel: md: sdb1 [events: 0000001b]<6>(write) sdb1's sb offset: 497856
May 5 15:54:13 linuxserv kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
May 5 15:54:13 linuxserv kernel: printing eip:
May 5 15:54:13 linuxserv kernel: c01c0fb3
May 5 15:54:13 linuxserv kernel: *pde = 00000000
May 5 15:54:13 linuxserv kernel: ops: 0000
May 5 15:54:13 linuxserv kernel: nfsd lockd sunrpc xfs serial loop ide-disk ide-probe-mod ide-mod lvm-mod 3c59x
May 5 15:54:13 linuxserv kernel: CPU: 1
May 5 15:54:13 linuxserv kernel: EIP: 0010:[write_disk_sb+375/472] Not tainted
May 5 15:54:13 linuxserv kernel: EFLAGS: 00010246
May 5 15:54:13 linuxserv kernel: eax: c20d6920 ebx: 00000811 ecx: 00000400 edx: 00001000
May 5 15:54:13 linuxserv kernel: esi: 00000000 edi: c45e3000 ebp: dfff6ca0 esp: dfb15f44
May 5 15:54:13 linuxserv kernel: ds: 0018 es: 0018 ss: 0018
May 5 15:54:13 linuxserv kernel: Process raid1d (pid: 11, stackpage=dfb15000)
May 5 15:54:13 linuxserv kernel: Stack: dfff6ca0 dfff6d20 dfb18ea0 dfb18eb4 00000000 c20d6920 c01c1283 dfff6ca0
May 5 15:54:13 linuxserv kernel: c022fcff 0000001b dfb18ea0 dfb14000 dfd18620 00000001 00000064 00000000
May 5 15:54:13 linuxserv kernel: c01b909d dfb18ea0 dfb14000 dfb14000 dfd18620 00000001 00000002 00000025
May 5 15:54:13 linuxserv kernel: Call Trace: [md_update_sb+359/460] [raid1d+29/1112] [md_thread+345/444] [kernel_thread+40/56]
May 5 15:54:13 linuxserv kernel: Code: f3 a5 8b 44 24 14 8b 54 24 10 f0 0f ab 50 18 8b 44 24 14 e8
May 5 15:54:13 linuxserv kernel: <6>md: recovery thread got woken up ...
May 5 15:54:13 linuxserv kernel: md0: no spare disk to reconstruct array! -- continuing in degraded mode
I just wonder what happened to the call trace over here.
Now doing
pulsar:~# mdadm --manage -r /dev/md0 /dev/ram1
works as expected, but reconstructing the raid does not work. Probably
this is just a result of the null pointer dereference earlier on, but
anyway:
pulsar:~# mdadm --manage -a /dev/md0 /dev/ram1
seems to work, but:
pulsar:~# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 [dev 01:01][2] [dev 01:00][0]
4032 blocks [2/1] [U_]
[>....................] recovery = 0.0% (0/4032) finish=43.0min speed=0K/sec
unused devices: <none>
Greetings
Torsten
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: OOPS with raid1
2003-05-08 15:22 OOPS with raid1 Torsten Landschoff
@ 2003-05-11 11:45 ` Torsten Landschoff
0 siblings, 0 replies; 2+ messages in thread
From: Torsten Landschoff @ 2003-05-11 11:45 UTC (permalink / raw)
To: linux-raid
[-- Attachment #1: Type: text/plain, Size: 556 bytes --]
Hi,
On Thu, May 08, 2003 at 05:22:37PM +0200, Torsten Landschoff wrote:
> pulsar:~# uname -a
> Linux pulsar 2.4.20-686-smp #1 SMP Mon Jan 13 23:06:41 EST 2003 i686
>
> (that's the kernel as shipped in Debian unstable's
> kernel-image-2.4.20-686-smp)
Additional info: I tried to reproduce this problem in user-mode-linux
for easier debugging. Funny enough I did not get it to break there. Both
systems I used to test this where running an SMP kernel so I'll try to
reproduce this without SMP on a real system now.
Greetings
Torsten
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2003-05-11 11:45 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-05-08 15:22 OOPS with raid1 Torsten Landschoff
2003-05-11 11:45 ` Torsten Landschoff
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).