* linux-image-2.6.32-5-686: kernel BUG at ... build/source_i386_none/drivers/md/raid5.c:2764!
@ 2012-06-22 12:19 Jose Manuel dos Santos Calhariz
2012-06-24 8:21 ` NeilBrown
0 siblings, 1 reply; 9+ messages in thread
From: Jose Manuel dos Santos Calhariz @ 2012-06-22 12:19 UTC (permalink / raw)
To: linux-raid
The other day, during the periodic mdadm RAID check:
- the linux kernel gave a kernel BUG,
- tried to kick out a failed disk and
- stopped accepting I/O to the affected raid.
The affected programs were in state D. The only way to recover was to
do a reboot. After reboot the problematic disk was replaced.
I reported the bug to Debian, and all the information about it is there:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=675969
I was asked to report the BUG here in case someone knows what happened.
Here is a summary of the more relevant information:
This machine has 2 x RAID6 with 6 disks each, for a total of 12 disks.
I have 5 systems with a similar setup and only one failed, maybe
because of the failing disk. I will use one of the systems to try to
reproduce the bug before trying a new kernel.
The proprietary module is the openafs filesystem v1.6.1 backported
from Debian testing.
The kernel bug is:
build/source_i386_none/drivers/md/raid5.c:2764!
(...)
Jun 3 01:35:53 afs04 kernel: raid5:md2: read error corrected (8 sectors at 73343216 on cciss/c1d3p1)
Jun 3 01:35:53 afs04 kernel: raid5:md2: read error corrected (8 sectors at 73343224 on cciss/c1d3p1)
Jun 3 01:35:53 afs04 kernel: raid5:md2: read error corrected (8 sectors at 73343232 on cciss/c1d3p1)
Jun 3 01:35:53 afs04 kernel: raid5:md2: read error corrected (8 sectors at 73343240 on cciss/c1d3p1)
Jun 3 01:35:56 afs04 kernel: cciss: cmd f6000de0 has CHECK CONDITION sense key = 0x3
Jun 3 01:35:56 afs04 kernel: end_request: I/O error, dev cciss/c1d3, sector 73343280
Jun 3 01:35:56 afs04 kernel: raid5:md2: read error NOT corrected!! (sector 73343248 on cciss/c1d3p1).
Jun 3 01:35:56 afs04 kernel: raid5: Disk failure on cciss/c1d3p1, disabling device.
Jun 3 01:35:56 afs04 kernel: raid5: Operation continuing on 5 devices.
Jun 3 01:35:56 afs04 kernel: raid5:md2: read error NOT corrected!! (sector 73343256 on cciss/c1d3p1).
Jun 3 01:35:56 afs04 kernel: raid5:md2: read error NOT corrected!! (sector 73343264 on cciss/c1d3p1).
Jun 3 01:35:56 afs04 kernel: raid5:md2: read error NOT corrected!! (sector 73343272 on cciss/c1d3p1).
Jun 3 01:35:56 afs04 kernel: raid5:md2: read error NOT corrected!! (sector 73343280 on cciss/c1d3p1).
Jun 3 01:35:56 afs04 kernel: raid5:md2: read error NOT corrected!! (sector 73343288 on cciss/c1d3p1).
Jun 3 01:35:56 afs04 kernel: ------------[ cut here ]------------
Jun 3 01:35:56 afs04 kernel: kernel BUG at /tmp/buildd/linux-2.6-2.6.32/debian/build/source_i386_none/drivers/md/raid5.c:2764!
Jun 3 01:35:56 afs04 kernel: invalid opcode: 0000 [#1] SMP
Jun 3 01:35:56 afs04 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:1c.0/0000:02:01.0/cciss0/c0d0/block/cciss!c0d0/removable
Jun 3 01:35:56 afs04 kernel: Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs reiserfs ext4 jbd2 crc16 openafs(P) lp parport_pc parport joydev st sd_mod crc_t10dif ext2 loop tun xt_multiport xfs exportfs 8021q garp stp ip6table_filter ip6_tables iptable_filter ip_tables x_tables ide_generic ide_gd_mod ide_cd_mod ide_core snd_pcm snd_timer hpilo snd soundcore snd_page_alloc hpwdt e752x_edac shpchp rng_core i6300esb edac_core pci_hotplug pcspkr container processor evdev button psmouse serio_raw ext3 jbd mbcache dm_mod raid456 md_mod async_raid6_recov async_pq usbhid hid raid6_pq async_xor xor async_memcpy async_tx sg sr_mod cdrom ata_generic thermal uhci_hcd cciss tg3 floppy ata_piix ehci_hcd libata e1000 usbcore libphy scsi_mod nls_base thermal_sys [last unloaded: openafs]
Jun 3 01:35:56 afs04 kernel:
Jun 3 01:35:56 afs04 kernel: Pid: 743, comm: md2_raid6 Tainted: P (2.6.32-5-686 #1) ProLiant DL360 G4
Jun 3 01:35:56 afs04 kernel: EIP: 0060:[<f818c811>] EFLAGS: 00010297 CPU: 3
Jun 3 01:35:56 afs04 kernel: EIP is at handle_stripe+0x89d/0x173e [raid456]
Jun 3 01:35:56 afs04 kernel: EAX: 00000005 EBX: 00000002 ECX: 00000003 EDX: 00000001
Jun 3 01:35:56 afs04 kernel: ESI: f6394000 EDI: 00000003 EBP: f6394028 ESP: f58d5e6c
Jun 3 01:35:56 afs04 kernel: DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Jun 3 01:35:56 afs04 kernel: Process md2_raid6 (pid: 743, ti=f58d4000 task=f6569980 task.ti=f58d4000)
Jun 3 01:35:56 afs04 kernel: Stack:
Jun 3 01:35:56 afs04 kernel: e6fde3e6 c2988138 00000006 f61c8e00 00000006 0002d995 00020003 00000000
Jun 3 01:35:56 afs04 kernel: <0> c2988138 f4cbc86c f65699ac 000f0e67 00000000 f639431c 00000005 fffffffc
Jun 3 01:35:56 afs04 kernel: <0> f4cbc86c c1025461 00000000 00000000 00000002 00000005 00988100 c127a45c
Jun 3 01:35:56 afs04 kernel: Call Trace:
Jun 3 01:35:56 afs04 kernel: [<c1025461>] ? check_preempt_wakeup+0x196/0x202
Jun 3 01:35:56 afs04 kernel: [<f818d9fb>] ? raid5d+0x349/0x389 [raid456]
Jun 3 01:35:56 afs04 kernel: [<c103b623>] ? del_timer_sync+0xa/0x14
Jun 3 01:35:56 afs04 kernel: [<c103b6cb>] ? process_timeout+0x0/0x5
Jun 3 01:35:56 afs04 kernel: [<f816206e>] ? md_thread+0xe1/0xf8 [md_mod]
Jun 3 01:35:56 afs04 kernel: [<c104433a>] ? autoremove_wake_function+0x0/0x2d
Jun 3 01:35:56 afs04 kernel: [<f8161f8d>] ? md_thread+0x0/0xf8 [md_mod]
Jun 3 01:35:56 afs04 kernel: [<c1044108>] ? kthread+0x61/0x66
Jun 3 01:35:56 afs04 kernel: [<c10440a7>] ? kthread+0x0/0x66
Jun 3 01:35:56 afs04 kernel: [<c1003d47>] ? kernel_thread_helper+0x7/0x10
Jun 3 01:35:56 afs04 kernel: Code: e9 9b 01 00 00 83 7c 24 7c 02 74 04 0f 0b eb fe f6 46 28 10 c7 46 3c 00 00 00 00 0f 85 7f 01 00 00 8b 44 24 38 39 44 24 70 7d 04 <0f> 0b eb fe 83 7c 24 7c 02 75 20 6b 84 24 a8 00 00 00 78 ff 44
Jun 3 01:35:56 afs04 kernel: EIP: [<f818c811>] handle_stripe+0x89d/0x173e [raid456] SS:ESP 0068:f58d5e6c
Jun 3 01:35:56 afs04 kernel: ---[ end trace b6f4aa295d5e4948 ]---
Jun 3 01:35:56 afs04 mdadm[2376]: Fail event detected on md device /dev/md2, component device /dev/cciss/c1d3p1
Jun 3 02:59:50 afs04 kernel: md: md3: data-check done.
Jun 3 06:16:21 afs04 kernel: afs: Lost contact with volume location server 193.136.128.36 in cell ist.utl.pt
Jun 3 06:16:21 afs04 kernel: afs: Lost contact with volume location server 193.136.128.36 in cell ist.utl.pt
Jun 3 06:17:18 afs04 kernel: afs: Lost contact with file server 193.136.128.36 in cell ist.utl.pt (all multi-homed ip addresses down for the server)
Jun 3 06:17:18 afs04 kernel: afs: Lost contact with file server 193.136.128.36 in cell ist.utl.pt (all multi-homed ip addresses down for the server)
Jun 3 07:35:21 afs04 kernel: cciss: cmd f6000000 has CHECK CONDITION sense key = 0x3
Jun 3 07:35:21 afs04 kernel: end_request: I/O error, dev cciss/c1d3, sector 128
Jun 3 07:35:21 afs04 kernel: __ratelimit: 21 callbacks suppressed
Jun 3 07:35:21 afs04 kernel: Buffer I/O error on device cciss/c1d3, logical block 16
Jun 3 07:35:22 afs04 kernel: cciss: cmd f6000000 has CHECK CONDITION sense key = 0x3
Jun 3 07:35:22 afs04 kernel: end_request: I/O error, dev cciss/c1d3, sector 128
Jun 3 07:35:22 afs04 kernel: Buffer I/O error on device cciss/c1d3, logical block 16
Jun 3 07:35:23 afs04 kernel: cciss: cmd f6000000 has CHECK CONDITION sense key = 0x3
Jun 3 07:35:23 afs04 kernel: end_request: I/O error, dev cciss/c1d3, sector 128
Jun 3 07:35:23 afs04 kernel: Buffer I/O error on device cciss/c1d3, logical block 16
(...)
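For anyone reading the oops above: in the "Code:" line the kernel marks the byte at EIP with angle brackets, and on x86 the bytes 0f 0b are the ud2 instruction that BUG() compiles to, which matches the "invalid opcode" line. A minimal Python sketch for pulling those bytes out of a logged line follows. The names trapping_bytes and is_bug_trap are made up for illustration and are not part of any kernel tool, and the sample is a shortened excerpt of the line above.

```python
# Shortened excerpt of the "Code:" line from the oops above.
SAMPLE_CODE_LINE = (
    "Jun 3 01:35:56 afs04 kernel: Code: 8b 44 24 38 39 44 24 70 7d 04 "
    "<0f> 0b eb fe 83 7c 24 7c 02"
)


def trapping_bytes(code_line, count=2):
    """Return `count` opcode bytes starting at the one marked <xx>.

    Returns an empty list if the line has no "Code:" field or no marked byte.
    """
    _, sep, rest = code_line.partition("Code:")
    if not sep:
        return []
    tokens = rest.split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            # Strip the angle brackets from the marked byte, then append
            # the bytes that follow it.
            return [tok[1:-1]] + tokens[i + 1:i + count]
    return []


def is_bug_trap(code_line):
    """True if the faulting instruction is ud2 (0f 0b), as emitted by BUG() on x86."""
    return trapping_bytes(code_line) == ["0f", "0b"]
```

This only confirms that the trap is a deliberate BUG() assertion rather than a stray bad pointer; it says nothing about which condition in raid5.c failed.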
TIA
Jose Calhariz
--
--
Ambition: an overmastering desire to be vilified by enemies while living and made ridiculous by friends when dead.
--Ambrose Bierce
* Re: linux-image-2.6.32-5-686: kernel BUG at ... build/source_i386_none/drivers/md/raid5.c:2764!
2012-06-22 12:19 linux-image-2.6.32-5-686: kernel BUG at ... build/source_i386_none/drivers/md/raid5.c:2764! Jose Manuel dos Santos Calhariz
@ 2012-06-24 8:21 ` NeilBrown
2012-06-24 17:02 ` Jose Manuel dos Santos Calhariz
0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2012-06-24 8:21 UTC (permalink / raw)
To: jose.spam; +Cc: linux-raid
On Fri, 22 Jun 2012 13:19:53 +0100 Jose Manuel dos Santos Calhariz
<jose.spam@netvisao.pt> wrote:
>
> In another day during the periodic mdadm RAID check:
> - the linux kernel gave a kernel BUG,
> - tried to kick out a failed disk and
> - stopped accepting I/O to the affected raid.
>
> The affected programs were in state D. The only way to recover was to
> do a reboot. After reboot the problematic disk was replaced.
>
> I reported the bug to Debian and is there all the information about it:
>
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=675969
>
> I was asked to report the BUG here in case someone knows what happened.
>
> Here is a summary of the more relevant information:
>
> This machine have 2 x RAID6 with 6 disks each, for a total of 12 disks.
>
> I have 5 systems with a similar setup and only one failed, maybe
> because of the failing disk. I will use one of the systems to try to
> reproduce the bug, before triyng a new kernel.
>
>
> The proprietary module is the openafs filesystem v1.6.1 backported
> from Debian testing.
>
> The kernel bug is:
>
>
> build/source_i386_none/drivers/md/raid5.c:2764!
This bug was fixed in 2.6.32.49 and 3.2
http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=61d433c479a6ccfed6a7e73e6111ca8fa0348c63
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=9a3f530f39f4490eaa18b02719fb74ce5f4d2d86
NeilBrown
* Re: linux-image-2.6.32-5-686: kernel BUG at ... build/source_i386_none/drivers/md/raid5.c:2764!
2012-06-24 8:21 ` NeilBrown
@ 2012-06-24 17:02 ` Jose Manuel dos Santos Calhariz
2012-06-25 2:39 ` NeilBrown
0 siblings, 1 reply; 9+ messages in thread
From: Jose Manuel dos Santos Calhariz @ 2012-06-24 17:02 UTC (permalink / raw)
To: NeilBrown; +Cc: jose.spam, linux-raid
On Sun, Jun 24, 2012 at 06:21:46PM +1000, NeilBrown wrote:
> On Fri, 22 Jun 2012 13:19:53 +0100 Jose Manuel dos Santos Calhariz
> <jose.spam@netvisao.pt> wrote:
>
> >
> > In another day during the periodic mdadm RAID check:
> > - the linux kernel gave a kernel BUG,
> > - tried to kick out a failed disk and
> > - stopped accepting I/O to the affected raid.
> >
> > The affected programs were in state D. The only way to recover was to
> > do a reboot. After reboot the problematic disk was replaced.
> >
> > I reported the bug to Debian and is there all the information about it:
> >
> > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=675969
> >
> > I was asked to report the BUG here in case someone knows what happened.
> >
> > Here is a summary of the more relevant information:
> >
> > This machine have 2 x RAID6 with 6 disks each, for a total of 12 disks.
> >
> > I have 5 systems with a similar setup and only one failed, maybe
> > because of the failing disk. I will use one of the systems to try to
> > reproduce the bug, before triyng a new kernel.
> >
> >
> > The proprietary module is the openafs filesystem v1.6.1 backported
> > from Debian testing.
> >
> > The kernel bug is:
> >
> >
> > build/source_i386_none/drivers/md/raid5.c:2764!
>
> This bug was fixed in 2.6.32.49 and 3.2
>
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=61d433c479a6ccfed6a7e73e6111ca8fa0348c63
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=9a3f530f39f4490eaa18b02719fb74ce5f4d2d86
>
> NeilBrown
The failing kernel already had that fix. The machine was running
the Debian kernel 2.6.32-41squeeze2. According to the changelog,
this kernel has all the fixes up to 2.6.32.51 plus other fixes.
Jose Calhariz
--
--
Ambition: an overmastering desire to be vilified by enemies while living and made ridiculous by friends when dead.
--Ambrose Bierce
* Re: linux-image-2.6.32-5-686: kernel BUG at ... build/source_i386_none/drivers/md/raid5.c:2764!
2012-06-24 17:02 ` Jose Manuel dos Santos Calhariz
@ 2012-06-25 2:39 ` NeilBrown
2012-06-25 2:58 ` Christian Balzer
0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2012-06-25 2:39 UTC (permalink / raw)
To: jose.spam; +Cc: linux-raid
On Sun, 24 Jun 2012 18:02:34 +0100 Jose Manuel dos Santos Calhariz
<jose.spam@netvisao.pt> wrote:
> On Sun, Jun 24, 2012 at 06:21:46PM +1000, NeilBrown wrote:
> > On Fri, 22 Jun 2012 13:19:53 +0100 Jose Manuel dos Santos Calhariz
> > <jose.spam@netvisao.pt> wrote:
> >
> > >
> > > In another day during the periodic mdadm RAID check:
> > > - the linux kernel gave a kernel BUG,
> > > - tried to kick out a failed disk and
> > > - stopped accepting I/O to the affected raid.
> > >
> > > The affected programs were in state D. The only way to recover was to
> > > do a reboot. After reboot the problematic disk was replaced.
> > >
> > > I reported the bug to Debian and is there all the information about it:
> > >
> > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=675969
> > >
> > > I was asked to report the BUG here in case someone knows what happened.
> > >
> > > Here is a summary of the more relevant information:
> > >
> > > This machine have 2 x RAID6 with 6 disks each, for a total of 12 disks.
> > >
> > > I have 5 systems with a similar setup and only one failed, maybe
> > > because of the failing disk. I will use one of the systems to try to
> > > reproduce the bug, before triyng a new kernel.
> > >
> > >
> > > The proprietary module is the openafs filesystem v1.6.1 backported
> > > from Debian testing.
> > >
> > > The kernel bug is:
> > >
> > >
> > > build/source_i386_none/drivers/md/raid5.c:2764!
>
> >
> > This bug was fixed in 2.6.32.49 and 3.2
> >
> > http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=61d433c479a6ccfed6a7e73e6111ca8fa0348c63
> >
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=9a3f530f39f4490eaa18b02719fb74ce5f4d2d86
> >
> > NeilBrown
>
> The failing kernel had that fix all ready. The machine was running
> the kernel Debian 2.6.32-41squeeze2. Looking into the change log,
> this kernel have all the fixes until 2.6.32.51 plus other fixes.
>
> Jose Calhariz
>
The oops report said:
(2.6.32-5-686 #1)
Is "5" the same as "41squeeze2"? This is a genuine question - I have
little idea about Debian versioning, so maybe these are the same thing
somehow. But they look different.
NeilBrown
* Re: linux-image-2.6.32-5-686: kernel BUG at ... build/source_i386_none/drivers/md/raid5.c:2764!
2012-06-25 2:39 ` NeilBrown
@ 2012-06-25 2:58 ` Christian Balzer
2012-06-25 6:42 ` NeilBrown
0 siblings, 1 reply; 9+ messages in thread
From: Christian Balzer @ 2012-06-25 2:58 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On Mon, 25 Jun 2012 12:39:06 +1000 NeilBrown wrote:
> On Sun, 24 Jun 2012 18:02:34 +0100 Jose Manuel dos Santos Calhariz
> <jose.spam@netvisao.pt> wrote:
>
> > On Sun, Jun 24, 2012 at 06:21:46PM +1000, NeilBrown wrote:
> > > On Fri, 22 Jun 2012 13:19:53 +0100 Jose Manuel dos Santos Calhariz
> > > <jose.spam@netvisao.pt> wrote:
> > >
> > > >
> > > > In another day during the periodic mdadm RAID check:
> > > > - the linux kernel gave a kernel BUG,
> > > > - tried to kick out a failed disk and
> > > > - stopped accepting I/O to the affected raid.
> > > >
> > > > The affected programs were in state D. The only way to recover
> > > > was to do a reboot. After reboot the problematic disk was
> > > > replaced.
> > > >
> > > > I reported the bug to Debian and is there all the information
> > > > about it:
> > > >
> > > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=675969
> > > >
> > > > I was asked to report the BUG here in case someone knows what
> > > > happened.
> > > >
> > > > Here is a summary of the more relevant information:
> > > >
> > > > This machine have 2 x RAID6 with 6 disks each, for a total of 12
> > > > disks.
> > > >
> > > > I have 5 systems with a similar setup and only one failed, maybe
> > > > because of the failing disk. I will use one of the systems to try
> > > > to reproduce the bug, before triyng a new kernel.
> > > >
> > > >
> > > > The proprietary module is the openafs filesystem v1.6.1 backported
> > > > from Debian testing.
> > > >
> > > > The kernel bug is:
> > > >
> > > >
> > > > build/source_i386_none/drivers/md/raid5.c:2764!
> >
> > >
> > > This bug was fixed in 2.6.32.49 and 3.2
> > >
> > > http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=61d433c479a6ccfed6a7e73e6111ca8fa0348c63
> > >
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=9a3f530f39f4490eaa18b02719fb74ce5f4d2d86
> > >
> > > NeilBrown
> >
> > The failing kernel had that fix all ready. The machine was running
> > the kernel Debian 2.6.32-41squeeze2. Looking into the change log,
> > this kernel have all the fixes until 2.6.32.51 plus other fixes.
> >
> > Jose Calhariz
> >
>
> The oops report said:
>
> (2.6.32-5-686 #1)
>
> is "5" the same as "41squeeze2" ??? This is a genuine question - I have
> little idea about Debian versioning so maybe these are the same thing
> somehow. But they look different.
>
Yes, the "name" of the kernel and its actual detailed version are disjoint
like that in Debian. The current kernel of that vintage is:
---
Package: linux-image-2.6.32-5-amd64
Source: linux-2.6
Version: 2.6.32-44
---
Regards,
Christian
--
Christian Balzer Network/Systems Engineer
chibi@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
* Re: linux-image-2.6.32-5-686: kernel BUG at ... build/source_i386_none/drivers/md/raid5.c:2764!
2012-06-25 2:58 ` Christian Balzer
@ 2012-06-25 6:42 ` NeilBrown
2012-06-25 6:55 ` Christian Balzer
2012-06-25 10:59 ` Jose Manuel dos Santos Calhariz
0 siblings, 2 replies; 9+ messages in thread
From: NeilBrown @ 2012-06-25 6:42 UTC (permalink / raw)
To: Christian Balzer; +Cc: linux-raid, Jose Manuel dos Santos Calhariz
On Mon, 25 Jun 2012 11:58:33 +0900 Christian Balzer <chibi@gol.com> wrote:
> On Mon, 25 Jun 2012 12:39:06 +1000 NeilBrown wrote:
>
> > On Sun, 24 Jun 2012 18:02:34 +0100 Jose Manuel dos Santos Calhariz
> > <jose.spam@netvisao.pt> wrote:
> >
> > > On Sun, Jun 24, 2012 at 06:21:46PM +1000, NeilBrown wrote:
> > > > On Fri, 22 Jun 2012 13:19:53 +0100 Jose Manuel dos Santos Calhariz
> > > > <jose.spam@netvisao.pt> wrote:
> > > >
> > > > >
> > > > > In another day during the periodic mdadm RAID check:
> > > > > - the linux kernel gave a kernel BUG,
> > > > > - tried to kick out a failed disk and
> > > > > - stopped accepting I/O to the affected raid.
> > > > >
> > > > > The affected programs were in state D. The only way to recover
> > > > > was to do a reboot. After reboot the problematic disk was
> > > > > replaced.
> > > > >
> > > > > I reported the bug to Debian and is there all the information
> > > > > about it:
> > > > >
> > > > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=675969
> > > > >
> > > > > I was asked to report the BUG here in case someone knows what
> > > > > happened.
> > > > >
> > > > > Here is a summary of the more relevant information:
> > > > >
> > > > > This machine have 2 x RAID6 with 6 disks each, for a total of 12
> > > > > disks.
> > > > >
> > > > > I have 5 systems with a similar setup and only one failed, maybe
> > > > > because of the failing disk. I will use one of the systems to try
> > > > > to reproduce the bug, before triyng a new kernel.
> > > > >
> > > > >
> > > > > The proprietary module is the openafs filesystem v1.6.1 backported
> > > > > from Debian testing.
> > > > >
> > > > > The kernel bug is:
> > > > >
> > > > >
> > > > > build/source_i386_none/drivers/md/raid5.c:2764!
> > >
> > > >
> > > > This bug was fixed in 2.6.32.49 and 3.2
> > > >
> > > > http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=61d433c479a6ccfed6a7e73e6111ca8fa0348c63
> > > >
> > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=9a3f530f39f4490eaa18b02719fb74ce5f4d2d86
> > > >
> > > > NeilBrown
> > >
> > > The failing kernel had that fix all ready. The machine was running
> > > the kernel Debian 2.6.32-41squeeze2. Looking into the change log,
> > > this kernel have all the fixes until 2.6.32.51 plus other fixes.
> > >
> > > Jose Calhariz
> > >
> >
> > The oops report said:
> >
> > (2.6.32-5-686 #1)
> >
> > is "5" the same as "41squeeze2" ??? This is a genuine question - I have
> > little idea about Debian versioning so maybe these are the same thing
> > somehow. But they look different.
> >
> Yes, the "name' of the kernel and it's actual detail version are disjunct
> like that in Debian, the current kernel of that vintage is:
> ---
> Package: linux-image-2.6.32-5-amd64
> Source: linux-2.6
> Version: 2.6.32-44
> ---
Ok.
So the version number reported by "uname -a" doesn't change when you upgrade
a Debian kernel? That's rather sad.
It means that one has to take the reporter's word for which kernel was running
rather than looking in the oops message, where the kernel tells me
what version it was.
Given the report, it is entirely possible that an older kernel was running
while a newer kernel was installed.
Jose: how certain are you that the kernel that was running at the time was
exactly the kernel that was installed at the time, i.e. that you had not
performed a software update since the last reboot?
However even if you can confirm that a new kernel was running I doubt I could
find an answer. There isn't really much info to go on. So unless you can
reproduce the problem, I doubt I'll even start looking.
NeilBrown
* Re: linux-image-2.6.32-5-686: kernel BUG at ... build/source_i386_none/drivers/md/raid5.c:2764!
2012-06-25 6:42 ` NeilBrown
@ 2012-06-25 6:55 ` Christian Balzer
2012-06-25 10:59 ` Jose Manuel dos Santos Calhariz
1 sibling, 0 replies; 9+ messages in thread
From: Christian Balzer @ 2012-06-25 6:55 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, Jose Manuel dos Santos Calhariz
On Mon, 25 Jun 2012 16:42:30 +1000 NeilBrown wrote:
> On Mon, 25 Jun 2012 11:58:33 +0900 Christian Balzer <chibi@gol.com>
> wrote:
>
> > On Mon, 25 Jun 2012 12:39:06 +1000 NeilBrown wrote:
> >
> > > On Sun, 24 Jun 2012 18:02:34 +0100 Jose Manuel dos Santos Calhariz
> > > <jose.spam@netvisao.pt> wrote:
> > >
> > > > On Sun, Jun 24, 2012 at 06:21:46PM +1000, NeilBrown wrote:
> > > > > On Fri, 22 Jun 2012 13:19:53 +0100 Jose Manuel dos Santos
> > > > > Calhariz <jose.spam@netvisao.pt> wrote:
> > > > >
> > > > > >
> > > > > > In another day during the periodic mdadm RAID check:
> > > > > > - the linux kernel gave a kernel BUG,
> > > > > > - tried to kick out a failed disk and
> > > > > > - stopped accepting I/O to the affected raid.
> > > > > >
> > > > > > The affected programs were in state D. The only way to recover
> > > > > > was to do a reboot. After reboot the problematic disk was
> > > > > > replaced.
> > > > > >
> > > > > > I reported the bug to Debian and is there all the information
> > > > > > about it:
> > > > > >
> > > > > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=675969
> > > > > >
> > > > > > I was asked to report the BUG here in case someone knows what
> > > > > > happened.
> > > > > >
> > > > > > Here is a summary of the more relevant information:
> > > > > >
> > > > > > This machine have 2 x RAID6 with 6 disks each, for a total of
> > > > > > 12 disks.
> > > > > >
> > > > > > I have 5 systems with a similar setup and only one failed,
> > > > > > maybe because of the failing disk. I will use one of the
> > > > > > systems to try to reproduce the bug, before triyng a new
> > > > > > kernel.
> > > > > >
> > > > > >
> > > > > > The proprietary module is the openafs filesystem v1.6.1
> > > > > > backported from Debian testing.
> > > > > >
> > > > > > The kernel bug is:
> > > > > >
> > > > > >
> > > > > > build/source_i386_none/drivers/md/raid5.c:2764!
> > > >
> > > > >
> > > > > This bug was fixed in 2.6.32.49 and 3.2
> > > > >
> > > > > http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=61d433c479a6ccfed6a7e73e6111ca8fa0348c63
> > > > >
> > > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=9a3f530f39f4490eaa18b02719fb74ce5f4d2d86
> > > > >
> > > > > NeilBrown
> > > >
> > > > The failing kernel had that fix all ready. The machine was running
> > > > the kernel Debian 2.6.32-41squeeze2. Looking into the change log,
> > > > this kernel have all the fixes until 2.6.32.51 plus other fixes.
> > > >
> > > > Jose Calhariz
> > > >
> > >
> > > The oops report said:
> > >
> > > (2.6.32-5-686 #1)
> > >
> > > is "5" the same as "41squeeze2" ??? This is a genuine question - I
> > > have little idea about Debian versioning so maybe these are the same
> > > thing somehow. But they look different.
> > >
> > Yes, the "name' of the kernel and it's actual detail version are
> > disjunct like that in Debian, the current kernel of that vintage is:
> > ---
> > Package: linux-image-2.6.32-5-amd64
> > Source: linux-2.6
> > Version: 2.6.32-44
> > ---
>
> Ok.
> So the version number reported by "uname -a" doesn't change when you
> upgrade a Debian kernel? That's rather sad.
It kinda does: the -5 part is the version bit that will increase for
each significant release, but it doesn't quite reflect the more
detailed version info:
---
engtest01:~# uname -a
Linux engtest01 2.6.32-5-686 #1 SMP Mon Oct 3 04:15:24 UTC 2011 i686 GNU/Linux
engtest01:~# cat /proc/version
Linux version 2.6.32-5-686 (Debian 2.6.32-38) (ben@decadent.org.uk) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Mon Oct 3 04:15:24 UTC 2011
---
So while I have the 2.6.32-45 kernel installed on the machine above, it has
not been rebooted for 220 days and still runs the -38 incarnation
of the -5 kernel according to uname. And yes, that can be confusing.
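The difference between the two outputs above can be checked mechanically. A minimal sketch in Python, assuming the /proc/version layout shown above (the ABI name followed by the package version in a "(Debian ...)" group). The helper name parse_proc_version is hypothetical, and other distributions or newer Debian kernels may format this line differently.

```python
import re

# The /proc/version line quoted above, used here as sample input.
SAMPLE_PROC_VERSION = (
    "Linux version 2.6.32-5-686 (Debian 2.6.32-38) (ben@decadent.org.uk) "
    "(gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Mon Oct 3 04:15:24 UTC 2011"
)

# First group: the ABI name that uname also reports.
# Second group: the Debian package version of the kernel that is running.
_DEBIAN_VERSION_RE = re.compile(r"Linux version (\S+) \(Debian ([^)]+)\)")


def parse_proc_version(text):
    """Return (abi_name, package_version), or None if the layout differs."""
    match = _DEBIAN_VERSION_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Feeding it the contents of /proc/version on a live machine gives the package version of the kernel that is actually running, which can then be compared by hand with what the package manager reports as installed.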
Regards,
Christian
> I means that one has to take the reporters work for which kernel was
> running rather than looking in the oops message for where the kernels
> tells me what version it was.
>
> Given the report, it is entirely possible that an older kernel was
> running while a newer kernel was installed.
>
> Jose: how certain are you that the kernel that was running at the time
> was exactly the kernel that was installed at the time. i.e. you had not
> performed a software update since the last reboot?
>
> However even if you can confirm that a new kernel was running I doubt I
> could find an answer. There isn't really much info to go on. So unless
> you can reproduce the problem, I doubt I'll even start looking.
>
> NeilBrown
--
Christian Balzer Network/Systems Engineer
chibi@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
* Re: linux-image-2.6.32-5-686: kernel BUG at ... build/source_i386_none/drivers/md/raid5.c:2764!
2012-06-25 6:42 ` NeilBrown
2012-06-25 6:55 ` Christian Balzer
@ 2012-06-25 10:59 ` Jose Manuel dos Santos Calhariz
[not found] ` <CAGqmV7rBk9R-q-LRVw1tzxmXoQMLTQbQY8f9C0SOuaMOf7AfoQ@mail.gmail.com>
1 sibling, 1 reply; 9+ messages in thread
From: Jose Manuel dos Santos Calhariz @ 2012-06-25 10:59 UTC (permalink / raw)
To: NeilBrown; +Cc: Christian Balzer, linux-raid, Jose Manuel dos Santos Calhariz
On Mon, Jun 25, 2012 at 04:42:30PM +1000, NeilBrown wrote:
> On Mon, 25 Jun 2012 11:58:33 +0900 Christian Balzer <chibi@gol.com> wrote:
>
> > On Mon, 25 Jun 2012 12:39:06 +1000 NeilBrown wrote:
> >
> > > On Sun, 24 Jun 2012 18:02:34 +0100 Jose Manuel dos Santos Calhariz
> > > <jose.spam@netvisao.pt> wrote:
> > >
> > > > On Sun, Jun 24, 2012 at 06:21:46PM +1000, NeilBrown wrote:
> > > > > On Fri, 22 Jun 2012 13:19:53 +0100 Jose Manuel dos Santos Calhariz
> > > > > <jose.spam@netvisao.pt> wrote:
> > > > >
> > > > > >
> > > > > > In another day during the periodic mdadm RAID check:
> > > > > > - the linux kernel gave a kernel BUG,
> > > > > > - tried to kick out a failed disk and
> > > > > > - stopped accepting I/O to the affected raid.
> > > > > >
> > > > > > The affected programs were in state D. The only way to recover
> > > > > > was to do a reboot. After reboot the problematic disk was
> > > > > > replaced.
> > > > > >
> > > > > > I reported the bug to Debian and is there all the information
> > > > > > about it:
> > > > > >
> > > > > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=675969
> > > > > >
> > > > > > I was asked to report the BUG here in case someone knows what
> > > > > > happened.
> > > > > >
> > > > > > Here is a summary of the more relevant information:
> > > > > >
> > > > > > This machine have 2 x RAID6 with 6 disks each, for a total of 12
> > > > > > disks.
> > > > > >
> > > > > > I have 5 systems with a similar setup and only one failed, maybe
> > > > > > because of the failing disk. I will use one of the systems to try
> > > > > > to reproduce the bug, before triyng a new kernel.
> > > > > >
> > > > > >
> > > > > > The proprietary module is the openafs filesystem v1.6.1 backported
> > > > > > from Debian testing.
> > > > > >
> > > > > > The kernel bug is:
> > > > > >
> > > > > >
> > > > > > build/source_i386_none/drivers/md/raid5.c:2764!
> > > >
> > > > >
> > > > > This bug was fixed in 2.6.32.49 and 3.2
> > > > >
> > > > > http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=61d433c479a6ccfed6a7e73e6111ca8fa0348c63
> > > > >
> > > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=9a3f530f39f4490eaa18b02719fb74ce5f4d2d86
> > > > >
> > > > > NeilBrown
> > > >
> > > > The failing kernel had that fix all ready. The machine was running
> > > > the kernel Debian 2.6.32-41squeeze2. Looking into the change log,
> > > > this kernel have all the fixes until 2.6.32.51 plus other fixes.
> > > >
> > > > Jose Calhariz
> > > >
> > >
> > > The oops report said:
> > >
> > > (2.6.32-5-686 #1)
> > >
> > > is "5" the same as "41squeeze2" ??? This is a genuine question - I have
> > > little idea about Debian versioning so maybe these are the same thing
> > > somehow. But they look different.
> > >
> > Yes, the "name' of the kernel and it's actual detail version are disjunct
> > like that in Debian, the current kernel of that vintage is:
> > ---
> > Package: linux-image-2.6.32-5-amd64
> > Source: linux-2.6
> > Version: 2.6.32-44
> > ---
>
> Ok.
> So the version number reported by "uname -a" doesn't change when you upgrade
> a Debian kernel? That's rather sad.
> I means that one has to take the reporters work for which kernel was running
> rather than looking in the oops message for where the kernels tells me
> what version it was.
>
> Given the report, it is entirely possible that an older kernel was running
> while a newer kernel was installed.
>
> Jose: how certain are you that the kernel that was running at the time was
> exactly the kernel that was installed at the time. i.e. you had not
> performed a software update since the last reboot?
Whenever I reboot a server I run a script to collect information about
it: Kernel boot messages, kernel version, kernel modules, md raid
information, etc.
So I have the kernel boot messages for the precise boot that gave the
BUG. From that boot log:
[ 0.000000] Linux version 2.6.32-5-686 (Debian 2.6.32-41squeeze2) (dannf@debian.org) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Mon Mar 26 05:20:33 UTC 2012
The version of the running kernel is 2.6.32-41squeeze2. In the
changelog of the Debian package, for version 2.6.32-41:
* Add longterm release 2.6.32.54
The complete changelog, in case someone wants to look into it:
http://packages.debian.org/changelogs/pool/main/l/linux-2.6/linux-2.6_2.6.32-45/changelog
In the previous Debian version, 2.6.32-40, there is this entry in the
changelog:
* Add longterm release 2.6.32.49, including:
- SCSI: st: fix race in st_scsi_execute_end
- NFS/sunrpc: don't use a credential with extra groups.
- netlink: validate NLA_MSECS length
- hfs: add sanity check for file name length (CVE-2011-4330)
- md/raid5: abort any pending parity operations when array fails.
- mm: avoid null pointer access in vm_struct via /proc/vmallocinfo
- ipv6: udp: fix the wrong headroom check (CVE-2011-4326)
- USB: Fix Corruption issue in USB ftdi driver ftdi_sio.c
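The reasoning above (2.6.32-41 is based on 2.6.32.54, and the md/raid5 fix arrived with 2.6.32.49) amounts to a version comparison within one stable series. A small sketch in Python, with hypothetical helper names, assuming plain dotted numeric versions:

```python
def version_tuple(version):
    """Turn a dotted version such as '2.6.32.54' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def includes_fix(shipped, fixed_in):
    """True if stable release `shipped` is at or after `fixed_in`.

    Only meaningful inside one stable series (here 2.6.32.y), so a
    mismatch in the first three components is rejected.
    """
    shipped_t = version_tuple(shipped)
    fixed_t = version_tuple(fixed_in)
    if shipped_t[:3] != fixed_t[:3]:
        raise ValueError("versions belong to different stable series")
    # Tuple comparison is element by element, so (2, 6, 32) sorts before
    # (2, 6, 32, 49): the base release does not include the fix.
    return shipped_t >= fixed_t
```

With the numbers from the changelog, includes_fix("2.6.32.54", "2.6.32.49") is true. This says the upstream fix should be present in the package, not that the package is free of related bugs.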
The complete boot log is on:
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=15;filename=kernel-boot;att=1;bug=675969
>
> However even if you can confirm that a new kernel was running I doubt I could
> find an answer. There isn't really much info to go on. So unless you can
> reproduce the problem, I doubt I'll even start looking.
I have a lot of information about the system that gave the BUG, but no
way to sort out what is relevant and what is not. Is
there anything more you would like to know?
I understand if you can't help me. I have 5 similar servers that have been
running 2.6.32.x for 3 months, but I have seen only 1 BUG. I have one
server where I am trying to reproduce the BUG, to no avail.
- Doing a re-sync of the RAID when there is a "read error corrected"
doesn't trigger the BUG.
- Hot-unplugging a disk doesn't trigger the BUG.
My guess is that this bug is related to bad disks and the error messages
that the disks sometimes give to the kernel. But it is more difficult to
find disks that give these error messages in a reproducible way than
to find disks with bad sectors for the test server.
>
> NeilBrown
--
--
Ambition: an overmastering desire to be vilified by enemies while living and made ridiculous by friends when dead.
--Ambrose Bierce
end of thread, other threads:[~2012-06-27 14:12 UTC | newest]
Thread overview: 9+ messages
2012-06-22 12:19 linux-image-2.6.32-5-686: kernel BUG at ... build/source_i386_none/drivers/md/raid5.c:2764! Jose Manuel dos Santos Calhariz
2012-06-24 8:21 ` NeilBrown
2012-06-24 17:02 ` Jose Manuel dos Santos Calhariz
2012-06-25 2:39 ` NeilBrown
2012-06-25 2:58 ` Christian Balzer
2012-06-25 6:42 ` NeilBrown
2012-06-25 6:55 ` Christian Balzer
2012-06-25 10:59 ` Jose Manuel dos Santos Calhariz
[not found] ` <CAGqmV7rBk9R-q-LRVw1tzxmXoQMLTQbQY8f9C0SOuaMOf7AfoQ@mail.gmail.com>
2012-06-27 14:12 ` Jose Manuel dos Santos Calhariz